How to retrieve page content in a web crawler


Hello! I am trying to implement this pseudocode for a spider algorithm that crawls the web. I need an idea for the next step of the pseudocode: "use SpiderLeg to retrieve the content". I have a method in a separate SpiderLeg class that gets all the URLs on a web page, but I am not sure how to use it in this class.

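import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Crawls the web and returns all URLs that the spider visits.
// Assumes unvisited is a List<String> field that already holds the URLs to crawl.
public List<String> crawl(String url, String keyword) throws IOException {
    List<String> visited = new ArrayList<>();
    // while the list of unvisited URLs is not empty
    while (!unvisited.isEmpty()) {
        // take the next URL off the front of the list so the loop makes progress
        String currentUrl = unvisited.remove(0);
        visited.add(currentUrl);
        // use SpiderLeg to fetch the content (this is the step still to be written)
        SpiderLeg leg = new SpiderLeg();
    }
    return visited;
}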
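
One way the missing step could look is sketched below. It is only an illustration under stated assumptions: `LinkFetcher` is a hypothetical stand-in for SpiderLeg (the real class is not shown in the question), `maxPages` is an added safety limit, and the tiny in-memory map replaces real network access so the example runs on its own. The idea is that each URL taken from the unvisited queue is handed to the leg, and the links it returns are queued unless they have been seen before.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class CrawlSketch {
    // Hypothetical stand-in for SpiderLeg: given a URL, return the links found on that page.
    interface LinkFetcher {
        List<String> fetchLinks(String url);
    }

    // Visits pages breadth-first from startUrl and returns them in visit order.
    static List<String> crawl(String startUrl, LinkFetcher leg, int maxPages) {
        Queue<String> unvisited = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> visited = new ArrayList<>();
        unvisited.add(startUrl);
        seen.add(startUrl);
        while (!unvisited.isEmpty() && visited.size() < maxPages) {
            String currentUrl = unvisited.remove();
            visited.add(currentUrl);
            // This is the "use SpiderLeg to retrieve the content" step.
            for (String link : leg.fetchLinks(currentUrl)) {
                if (seen.add(link)) { // add() returns false for URLs already queued
                    unvisited.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A tiny in-memory "web" so the sketch runs without network access.
        Map<String, List<String>> web = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("a", "c"),
            "c", List.of());
        List<String> visited = crawl("a", url -> web.getOrDefault(url, List.of()), 10);
        System.out.println(visited); // prints [a, b, c]
    }
}
```

With a real SpiderLeg, the lambda passed to `crawl` would presumably be replaced by a call into the leg's own link-collecting method.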

Hurrah! Let's try it... I tried it without using a queue data structure. It almost works, but the program does not stop when it searches for a word.

And when it does find the word, it shows only the link of one web page, not all the URLs where the word was found. I wonder if this can be done?

import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

private static final int MAX_PAGES_TO_SEARCH = 10;
private Set<String> pagesVisited = new HashSet<String>();
private List<String> pagesToVisit = new LinkedList<String>();

// Crawls from url until searchWord is found or MAX_PAGES_TO_SEARCH pages have been visited
public void crawl(String url, String searchWord)
{
    while(this.pagesVisited.size() < MAX_PAGES_TO_SEARCH)
    {
        String currentUrl;
        SpiderLeg leg = new SpiderLeg();
        if(this.pagesToVisit.isEmpty())
        {
            // nothing is queued: use the given URL
            currentUrl = url;
            this.pagesVisited.add(url);
        }
        else
        {
            currentUrl = this.nextUrl();
        }
        leg.getHyperlink(currentUrl);
        boolean success = leg.searchForWord(searchWord);
        if(success)
        {
            System.out.println(String.format("**Success** Word %s found at %s", searchWord, currentUrl));
            // the loop ends at the first match, so only one URL is ever reported
            break;
        }
        this.pagesToVisit.addAll(leg.getLinks());
    }
    System.out.println("\n**Done** Visited " + this.pagesVisited.size() + " web page(s)");
}
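
Both symptoms could plausibly be addressed along the lines of the sketch below, which is not the asker's SpiderLeg code. The assumptions: `Page` is a hypothetical class holding what the leg would return for one page (its text and its links), the `web` map replaces real fetching, and the word match is a case-insensitive substring test. The loop also stops when there is nothing left to visit, and a match is recorded in a list instead of triggering `break`, so every matching URL is reported.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SearchAllSketch {
    // Hypothetical stand-in for what SpiderLeg provides for one page: text and outgoing links.
    static class Page {
        final String text;
        final List<String> links;
        Page(String text, List<String> links) {
            this.text = text;
            this.links = links;
        }
    }

    static final int MAX_PAGES_TO_SEARCH = 10;

    // Returns every visited URL whose page text contains searchWord (case-insensitive).
    static List<String> crawl(String url, String searchWord, Map<String, Page> web) {
        Set<String> pagesVisited = new HashSet<>();
        List<String> pagesToVisit = new LinkedList<>();
        List<String> matches = new ArrayList<>();
        pagesToVisit.add(url);
        // Stop when the page budget is used up OR there is nothing left to visit.
        while (pagesVisited.size() < MAX_PAGES_TO_SEARCH && !pagesToVisit.isEmpty()) {
            String currentUrl = pagesToVisit.remove(0);
            if (!pagesVisited.add(currentUrl)) {
                continue; // already visited
            }
            Page page = web.get(currentUrl);
            if (page == null) {
                continue; // page is not available
            }
            if (page.text.toLowerCase().contains(searchWord.toLowerCase())) {
                matches.add(currentUrl); // record the hit and keep crawling (no break)
            }
            pagesToVisit.addAll(page.links);
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String, Page> web = Map.of(
            "home", new Page("welcome to the spider demo", List.of("docs", "blog")),
            "docs", new Page("spider documentation", List.of("home")),
            "blog", new Page("nothing relevant here", List.of("docs")));
        System.out.println(crawl("home", "spider", web)); // prints [home, docs]
    }
}
```

Searching the same map for a word that appears nowhere returns an empty list once all three pages have been visited, rather than looping forever.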

, URL- , URL-, -queue, URL-, HTML (SpiderLeg).

URL- , , , URL- URL- , . , , URL-, .


Source: https://habr.com/ru/post/1608913/

