The fastest way to get multiple web pages in Java

I am trying to write a quick HTML scraper, and at this point I am just focusing on maximizing my throughput without doing any parsing. I cache the IP addresses of the URLs ahead of time:

    import java.net.InetAddress;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.net.UnknownHostException;
    import java.util.ArrayList;

    public class Data {
        private static final ArrayList<String> sites = new ArrayList<String>();
        public static final ArrayList<URL> URL_LIST = new ArrayList<URL>();
        public static final ArrayList<InetAddress> ADDRESSES = new ArrayList<InetAddress>();

        static {
            /* add all the URLs to the sites array list */

            // Resolve the DNS prior to testing the throughput
            for (int i = 0; i < sites.size(); i++) {
                try {
                    URL tmp = new URL(sites.get(i));
                    InetAddress address = InetAddress.getByName(tmp.getHost());
                    ADDRESSES.add(address);
                    URL_LIST.add(new URL("http", address.getHostAddress(), tmp.getPort(), tmp.getFile()));
                    System.out.println(tmp.getHost() + ": " + address.getHostAddress());
                } catch (MalformedURLException e) {
                } catch (UnknownHostException e) {
                }
            }
        }
    }

My next step is to test the speed with 100 URLs: fetch each one from the Internet, read the first 64 KB and move on to the next URL. I create a thread pool of FetchTaskConsumer s and run them on several threads (from 16 to 64 on an i7 quad core); this is what each consumer looks like:

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URLConnection;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.atomic.AtomicInteger;

    public class FetchTaskConsumer implements Runnable {
        // Shared counter of URLs left to fetch across all consumers
        // (declared here so the snippet compiles on its own).
        private static final AtomicInteger remaining = new AtomicInteger(Data.URL_LIST.size());

        private final CountDownLatch latch;
        private final int[] urlIndexes;

        public FetchTaskConsumer(int[] urlIndexes, CountDownLatch latch) {
            this.urlIndexes = urlIndexes;
            this.latch = latch;
        }

        @Override
        public void run() {
            URLConnection resource;
            InputStream is = null;
            for (int i = 0; i < urlIndexes.length; i++) {
                int numBytes = 0;
                try {
                    resource = Data.URL_LIST.get(urlIndexes[i]).openConnection();
                    resource.setRequestProperty("User-Agent", "Mozilla/5.0");
                    is = resource.getInputStream();
                    // Read at most the first 64 KB, one byte at a time
                    while (is.read() != -1 && numBytes < 65536) {
                        numBytes++;
                    }
                } catch (IOException e) {
                    System.out.println("Fetch Exception: " + e.getMessage());
                } finally {
                    System.out.println(numBytes + " bytes for url index " + urlIndexes[i]
                            + "; remaining: " + remaining.decrementAndGet());
                    if (is != null) {
                        try {
                            is.close();
                        } catch (IOException e1) { /* eat it */ }
                    }
                }
            }
            latch.countDown();
        }
    }
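
The driver that splits the URL indexes across the consumers and waits on the latch is not shown above; it is roughly the following sketch (the class name FetchDriver, the even index split and the hard-coded pool size are illustrative rather than my exact code):

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class FetchDriver {
        public static void main(String[] args) throws InterruptedException {
            final int numThreads = 64;                       // I varied this between 16 and 64
            final int total = Data.URL_LIST.size();
            final int chunk = (total + numThreads - 1) / numThreads;

            ExecutorService pool = Executors.newFixedThreadPool(numThreads);
            CountDownLatch latch = new CountDownLatch(numThreads);

            long start = System.currentTimeMillis();
            for (int t = 0; t < numThreads; t++) {
                // Give each consumer a contiguous slice of the URL indexes
                int from = Math.min(t * chunk, total);
                int to = Math.min(from + chunk, total);
                int[] indexes = new int[to - from];
                for (int i = 0; i < indexes.length; i++) {
                    indexes[i] = from + i;
                }
                pool.submit(new FetchTaskConsumer(indexes, latch));
            }

            latch.await();                                   // block until every consumer is done
            pool.shutdown();
            long elapsedMs = System.currentTimeMillis() - start;
            System.out.println(total + " URLs in " + elapsedMs + " ms");
        }
    }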

At best I can get through 100 URLs in about 30 seconds, but the literature suggests I should be able to get through 300 pages per second. Note that I have access to Gigabit Ethernet, although at the moment I am testing at home over my 20 Mbit connection ... either way, the connection is never fully utilized.

I tried using Socket connections directly, but I must be doing something wrong, because that is even slower! Any suggestions on how I can improve my throughput?
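
For reference, the raw-socket version I tried was roughly of this shape (the request line, headers, port and placeholder host are illustrative, not my exact code):

    import java.io.BufferedOutputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.Socket;

    public class RawSocketFetch {

        // Reads up to 64 KB (headers included) from one host over a plain socket.
        // Port 80, HTTP/1.0 and "Connection: close" are assumptions to keep the sketch simple.
        static int fetch(String host, String path) throws Exception {
            try (Socket socket = new Socket(host, 80)) {
                socket.setSoTimeout(2000);

                OutputStream out = new BufferedOutputStream(socket.getOutputStream());
                String request = "GET " + path + " HTTP/1.0\r\n"
                        + "Host: " + host + "\r\n"
                        + "User-Agent: Mozilla/5.0\r\n"
                        + "Connection: close\r\n\r\n";
                out.write(request.getBytes("US-ASCII"));
                out.flush();

                InputStream in = socket.getInputStream();
                int numBytes = 0;
                while (numBytes < 65536 && in.read() != -1) {   // same byte-at-a-time read as above
                    numBytes++;
                }
                return numBytes;
            }
        }

        public static void main(String[] args) throws Exception {
            System.out.println(fetch("www.example.com", "/") + " bytes");   // placeholder host
        }
    }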

PS
I have a list of about 1 million popular URLs, so I can add more URLs if 100 is not enough for testing.

Update:
The literature I am referring to is the set of papers around Najork's web crawler; Najork states:

Processed 891 million URLs in 17 days
That is ~606 downloads per second [on] 4 Compaq DS20E Alpha servers [with] 4 GB main memory[,] 650 GB of disk space [and] 100 Mbit/s Ethernet.
The Internet provider rate-limits the data transfer to 160 Mbit/s

So that is actually ~150 pages per second per machine, not 300. My computer is a Core i7 with 4 GB of RAM, and I am nowhere near that. I have not seen anything stating what they used in particular.

Update:
Ok, tally up ... the final results are in! It turns out that 100 URLs is too few for a test. I bumped it up to 1024 URLs and 64 threads, set a timeout of 2 seconds on each fetch, and was able to get up to 21 pages per second (as it turns out, my connection is about 10.5 Mbit/s, and 21 pages per second x 64 KB per page works out to roughly 10.5 Mbit/s). Here is what each worker looks like:

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URLConnection;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.atomic.AtomicInteger;

    public class FetchTask implements Runnable {
        // Shared counter of URLs left to fetch across all tasks
        // (declared here so the snippet compiles on its own).
        private static final AtomicInteger remaining = new AtomicInteger(Data.URL_LIST.size());

        private final int timeoutMS = 2000;
        private final CountDownLatch latch;
        private final int[] urlIndexes;

        public FetchTask(int[] urlIndexes, CountDownLatch latch) {
            this.urlIndexes = urlIndexes;
            this.latch = latch;
        }

        @Override
        public void run() {
            URLConnection resource;
            InputStream is = null;
            for (int i = 0; i < urlIndexes.length; i++) {
                int numBytes = 0;
                try {
                    resource = Data.URL_LIST.get(urlIndexes[i]).openConnection();
                    resource.setConnectTimeout(timeoutMS);
                    resource.setRequestProperty("User-Agent", "Mozilla/5.0");
                    is = resource.getInputStream();
                    // Read at most the first 64 KB, one byte at a time
                    while (is.read() != -1 && numBytes < 65536) {
                        numBytes++;
                    }
                } catch (IOException e) {
                    System.out.println("Fetch Exception: " + e.getMessage());
                } finally {
                    System.out.println(numBytes + "," + urlIndexes[i] + "," + remaining.decrementAndGet());
                    if (is != null) {
                        try {
                            is.close();
                        } catch (IOException e1) { /* eat it */ }
                    }
                }
            }
            latch.countDown();
        }
    }
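
For the record, the arithmetic behind that parenthetical:

21 pages/s x 64 KB/page = 1,344 KB/s

1,344 KB/s x 8 bits/byte = 10,752 kbit/s

10,752 / 1024 ≈ 10.5 Mbit/s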
2 answers

Are you sure about your sums?

300 URLs per second, reading 64 kilobytes from each URL

This requires: 300 x 64 = 19,200 kilobytes/s

Converting to bits: 19,200 kilobytes/s = (8 x 19,200) kilobits/s

So we have: 8 x 19,200 = 153,600 kilobits/s

Then to Mbit/s: 153,600 / 1024 = 150 megabits/s

... and yet you only have a 20 Mbps channel.

However, I suspect many of the URLs you fetch are smaller than 64 KB, so the end-to-end run appears faster than your link could actually sustain. You are not slow, you are fast!


Focusing on your achieved throughput this time around. I tried running your code myself and got about 3 pages per second against major sites. If I fetched static pages from my own web server, however, I maxed out my system.

On the Internet today, a major site can easily take more than a second to generate a page. Looking at the packets being sent to me right now, a page arrives in multiple TCP/IP packets. From here in the UK, it takes 3 seconds to download www.yahoo.co.jp, 2 seconds to download amazon.com, but facebook.com takes less than 0.1 seconds. The difference is that the front page of facebook.com is static, while the other two are dynamic. For humans, the critical factor is the time to the first byte, i.e. the moment the browser can start doing something, not the time to the 65,536th byte. Nobody optimizes for the latter :-)
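
If you want to measure that gap yourself, a small sketch along these lines (the URL is just a placeholder, and the 64 KB cap mirrors the code in the question) prints the time to the first byte versus the time to the 65,536th byte:

    import java.io.InputStream;
    import java.net.URL;
    import java.net.URLConnection;

    public class FirstByteTimer {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://www.example.com/");   // placeholder URL
            long start = System.currentTimeMillis();

            URLConnection connection = url.openConnection();
            connection.setRequestProperty("User-Agent", "Mozilla/5.0");
            InputStream in = connection.getInputStream();

            int first = in.read();                          // time to first byte
            long firstByteMs = System.currentTimeMillis() - start;

            int numBytes = (first == -1) ? 0 : 1;
            while (numBytes < 65536 && in.read() != -1) {   // time to the 65,536th byte (or EOF)
                numBytes++;
            }
            long lastByteMs = System.currentTimeMillis() - start;
            in.close();

            System.out.println("first byte after " + firstByteMs + " ms, "
                    + numBytes + " bytes after " + lastByteMs + " ms");
        }
    }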

So what does this mean for you? Since you are focusing on popular pages, I think you are also focusing on dynamic pages, which simply are not sent as fast as static pages. And since the sites I looked at send multiple packets per page, if you are fetching many pages at the same time the packets can collide with each other on the network.

A packet collision happens when two websites send you a packet of data at the same moment. At some point the data from both websites has to be funnelled onto the single wire into your computer. When two packets arrive on top of each other, the router combining them drops both and instructs the two senders to resend after a (different) short delay. Effectively this slows both sites down.

So:

1) Web pages are not quick to generate these days.
2) Ethernet struggles to cope with multiple simultaneous downloads.
3) Static websites (which used to be far more common) are much faster and use far fewer packets than dynamic websites.

All this means that maxing out your connection is very hard.

You can try the same test I did: put 1000 files of 64 KB each on a web server of your own and see how quickly your code can download them. For me, your code worked fine.
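
A quick way to generate those test files is something like the following (the output directory and file names are arbitrary; point the directory at whatever your local web server serves):

    import java.io.File;
    import java.io.FileOutputStream;
    import java.util.Arrays;

    public class MakeTestFiles {
        public static void main(String[] args) throws Exception {
            File dir = new File("testpages");               // point this at your web server's document root
            dir.mkdirs();

            byte[] page = new byte[65536];                  // one 64 KB page of filler text
            Arrays.fill(page, (byte) 'a');

            for (int i = 0; i < 1000; i++) {
                FileOutputStream out = new FileOutputStream(new File(dir, "page" + i + ".html"));
                out.write(page);
                out.close();
            }
            System.out.println("Wrote 1000 x 64 KB files to " + dir.getAbsolutePath());
        }
    }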


Source: https://habr.com/ru/post/886056/

