Good afternoon.
I have a blocking problem with my web crawler project. The logic is simple: the first Runnable is created, loads an HTML document, and looks through all the links in it; it then creates a new Runnable for each link it found. Each newly created Runnable, in turn, creates new Runnable objects for each of its links and submits them for execution.
The problem is that the ExecutorService never stops.
CrawlerTest.java
public class CrawlerTest {

    public static void main(String[] args) throws InterruptedException {
        new CrawlerService().crawlInternetResource("https://jsoup.org/");
    }
}
CrawlerService.java
import java.io.IOException;
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class CrawlerService {

    private Set<String> uniqueUrls = Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>(10000));
    private ExecutorService executorService = Executors.newFixedThreadPool(8);
    private String baseDomainUrl;

    public void crawlInternetResource(String baseDomainUrl) throws InterruptedException {
        this.baseDomainUrl = baseDomainUrl;
        System.out.println("Start");
        executorService.execute(new Crawler(baseDomainUrl));
        executorService.awaitTermination(10, TimeUnit.MINUTES);
        System.out.println("End");
    }

    private class Crawler implements Runnable {

        private String urlToCrawl;

        public Crawler(String urlToCrawl) {
            this.urlToCrawl = urlToCrawl;
        }

        @Override
        public void run() {
            try {
                findAllLinks();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }

        private void findAllLinks() throws InterruptedException {
            // add() returns false if the URL was already seen, so each page is crawled once
            if (uniqueUrls.add(urlToCrawl)) {
                System.out.println(urlToCrawl);
                Document htmlDocument = loadHtmlDocument(urlToCrawl);
                if (htmlDocument == null) {
                    return; // page failed to load, nothing to parse
                }
                Elements foundLinks = htmlDocument.select("a[href]");
                for (Element link : foundLinks) {
                    String absLink = link.attr("abs:href");
                    if (absLink.contains(baseDomainUrl) && !absLink.contains("#")) {
                        executorService.execute(new Crawler(absLink));
                    }
                }
            }
        }
        private Document loadHtmlDocument(String internetResourceUrl) {
            Document document = null;
            try {
                document = Jsoup.connect(internetResourceUrl).ignoreHttpErrors(true).ignoreContentType(true)
                        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:48.0) Gecko/20100101 Firefox/48.0")
                        .timeout(10000).get();
            } catch (IOException e) {
                System.out.println("Page load error");
                e.printStackTrace();
            }
            return document;
        }
    }
}
This application takes about 20 seconds to scan jsoup.org for all unique links. But it still waits out the full 10 minutes of executorService.awaitTermination(10, TimeUnit.MINUTES); and only then do I see a dead main thread and a still-running executor.
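For reference, awaitTermination by itself never initiates a shutdown: unless shutdown() was called earlier, it simply blocks for the whole timeout and then returns false. A standalone demo (separate from my project) showing that contract:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AwaitTerminationDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.execute(() -> System.out.println("task done"));

        // No shutdown() yet: awaitTermination blocks for the whole timeout
        // and returns false - exactly like my 10-minute wait above.
        System.out.println(pool.awaitTermination(2, TimeUnit.SECONDS)); // false

        pool.shutdown(); // stop accepting new tasks
        System.out.println(pool.awaitTermination(2, TimeUnit.SECONDS)); // true
    }
}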
Questions:
How do I stop the ExecutorService correctly here?
I can't just call shutdown() up front, because every task recursively submits new tasks through executorService.execute.
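To make the question concrete, here is the kind of workaround I am considering: count in-flight tasks and only call shutdown() once the count drops to zero. A rough sketch using a Phaser (class and method names are mine, not tested against the real crawler):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;
import java.util.concurrent.TimeUnit;

public class PhaserShutdownSketch {
    private final ExecutorService executorService = Executors.newFixedThreadPool(8);
    // One registered party for the main thread; each submitted task adds one more.
    private final Phaser phaser = new Phaser(1);

    private void submit(Runnable crawler) {
        phaser.register(); // count the task as in-flight before it is queued
        executorService.execute(() -> {
            try {
                crawler.run(); // the task may call submit() recursively
            } finally {
                phaser.arriveAndDeregister(); // this task is done (its children registered themselves)
            }
        });
    }

    public void crawl(Runnable rootCrawler) throws InterruptedException {
        submit(rootCrawler);
        phaser.arriveAndAwaitAdvance(); // returns once every in-flight task has arrived
        executorService.shutdown();     // safe now: no task is left to submit more work
        executorService.awaitTermination(1, TimeUnit.MINUTES);
    }
}

Is something like this the idiomatic way to shut down a pool whose tasks submit more tasks, or is there a better mechanism?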