How to crawl a site after logging in with a username and password

I wrote a web crawler that scans a website for a keyword, but now I want it to log in to a specific website and filter the information there by keyword. How can I achieve this? Here is the code I have written so far.

    // DB.java
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class DB {
        public Connection conn = null;

        public DB() {
            try {
                Class.forName("com.mysql.jdbc.Driver");
                String url = "jdbc:mysql://localhost:3306/test";
                conn = DriverManager.getConnection(url, "root", "root");
                System.out.println("conn built");
            } catch (SQLException e) {
                e.printStackTrace();
            } catch (ClassNotFoundException e) {
                e.printStackTrace();
            }
        }

        public ResultSet runSql(String sql) throws SQLException {
            Statement sta = conn.createStatement();
            return sta.executeQuery(sql);
        }

        public boolean runSql2(String sql) throws SQLException {
            Statement sta = conn.createStatement();
            return sta.execute(sql);
        }

        @Override
        protected void finalize() throws Throwable {
            // && rather than ||, so a null connection is never dereferenced
            if (conn != null && !conn.isClosed()) {
                conn.close();
            }
        }
    }

    // Main.java
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.Map;
    import org.jsoup.Connection;
    import org.jsoup.Connection.Method;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class Main {
        public static DB db = new DB();

        public static void main(String[] args) throws SQLException, IOException {
            db.runSql2("TRUNCATE Record;");
            processPage("http://m.naukri.com/login");
        }

        public static void processPage(String URL) throws SQLException, IOException {
            // check if the given URL is already in database
            String sql = "select * from Record where URL = '" + URL + "'";
            ResultSet rs = db.runSql(sql);
            if (!rs.next()) {
                // store the URL to database to avoid parsing again
                sql = "INSERT INTO `test`.`Record` " + "(`URL`) VALUES " + "(?);";
                PreparedStatement stmt = db.conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
                stmt.setString(1, URL);
                stmt.execute();

                // get useful information
                Connection.Response res = Jsoup.connect("http://www.naukri.com/")
                        .data("username", "jeet.chatterjee.88@gmail.com", "password", "Letmein321")
                        .method(Method.POST)
                        .execute();

                // http://m.naukri.com/login
                Map<String, String> loginCookies = res.cookies();
                Document doc = Jsoup.connect("http://m.naukri.com/login")
                        .cookies(loginCookies)
                        .get();

                // note: contains("") is true for every page
                if (doc.text().contains("")) {
                    System.out.println(URL);
                }

                // get all links and recursively call the processPage method
                Elements questions = doc.select("a[href]");
                for (Element link : questions) {
                    if (link.attr("abs:href").contains("naukri.com"))
                        processPage(link.attr("abs:href"));
                }
            }
        }
    }

The table structure is:

    CREATE TABLE IF NOT EXISTS `Record` (
      `RecordID` INT(11) NOT NULL AUTO_INCREMENT,
      `URL` text NOT NULL,
      PRIMARY KEY (`RecordID`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1;

Now I want the crawler to use my username and password so that it can log in to the site dynamically and crawl information based on the keyword. Let's say my username is lucifer and my password is lucifer123.

+2
java web-crawler jsoup
Jan 23 '15 at 12:43
1 answer

Your approach assumes stateless access to a website. That usually works for web services, but websites are stateful: you authenticate once, and after that the site uses the session key stored in your cookie to authenticate you, so the cookie is required. You should send the same parameters that your browser sends. Try tracking what your browser sends to the site using Firebug, and reproduce it in your code.

- update -

    Jsoup.connect("url")
        .cookie("cookie-name", "cookie-value")
        .header("header-name", "header-value")
        .data("data-name", "data-value");

You can add multiple cookies, headers, and data values, and there is also a method for adding values from a Map.

To find out what needs to be set, add Firebug to your browser; browsers also all have a built-in developer console that can be opened with F12. Go to the URL you want to get data from and add everything the browser sends there to your jsoup request. I have added some images captured from your site.

I marked the important parts in red.

You can get the required cookies in your code by sending this information to the site and reading the cookies it returns. Once you have response.cookies, attach those cookies to every request you make ;)
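To see the log-in-then-reuse-cookies flow without touching a real site, here is a minimal sketch. Note that it swaps jsoup for the JDK's built-in HttpClient (Java 11+) and talks to a throwaway local server, so it runs offline. The /login and /jobs paths, the SESSIONID cookie, the form field names, and the 403 status for "not logged in" are all made up for illustration; a real site will use whatever your browser's developer console shows.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.CookieManager;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class SessionCookieDemo {
    // Placeholder credentials taken from the question, not real ones.
    static final String FORM = "username=lucifer&password=lucifer123";

    // Hypothetical stand-in for the real site: POST /login hands out a session
    // cookie, and GET /jobs answers 200 only when that cookie is sent back.
    static HttpServer startFakeSite() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/login", ex -> {
            String body = new String(ex.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
            boolean ok = FORM.equals(body);
            if (ok) {
                ex.getResponseHeaders().add("Set-Cookie", "SESSIONID=abc123; Path=/");
            }
            reply(ex, ok ? 200 : 403, ok ? "welcome" : "bad credentials");
        });
        server.createContext("/jobs", ex -> {
            String cookie = ex.getRequestHeaders().getFirst("Cookie");
            boolean loggedIn = cookie != null && cookie.contains("SESSIONID=abc123");
            reply(ex, loggedIn ? 200 : 403, loggedIn ? "java developer jobs" : "please log in");
        });
        server.start();
        return server;
    }

    static void reply(HttpExchange ex, int status, String text) throws IOException {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        ex.sendResponseHeaders(status, bytes.length);
        ex.getResponseBody().write(bytes);
        ex.close();
    }

    static int get(HttpClient client, String url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).statusCode();
    }

    // Returns the status codes of: /jobs before login, the login POST, /jobs after login.
    static int[] run() throws IOException, InterruptedException {
        HttpServer site = startFakeSite();
        try {
            String base = "http://127.0.0.1:" + site.getAddress().getPort();
            // The CookieManager stores Set-Cookie values and replays them on later requests.
            HttpClient client = HttpClient.newBuilder()
                    .version(HttpClient.Version.HTTP_1_1)
                    .cookieHandler(new CookieManager())
                    .build();

            int before = get(client, base + "/jobs");
            HttpRequest login = HttpRequest.newBuilder(URI.create(base + "/login"))
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .POST(HttpRequest.BodyPublishers.ofString(FORM))
                    .build();
            int loginStatus = client.send(login, HttpResponse.BodyHandlers.ofString()).statusCode();
            int after = get(client, base + "/jobs");
            return new int[] {before, loginStatus, after};
        } finally {
            site.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        int[] s = run();
        System.out.println("before=" + s[0] + " login=" + s[1] + " after=" + s[2]);
    }
}
```

If the cookie handling behaves as intended, the first fetch of /jobs is rejected and the one after the login POST succeeds, because the same client sends the session cookie back. With jsoup the idea is the same: take the cookies from the login response and pass them to every later request.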

PS: change your password ASAP.

+2
Jan 27 '15 at 15:23