I am trying to write a quick HTML scraper, and at this point I am just focusing on maximizing my throughput without doing any parsing. I cache the IP addresses of the URLs up front:
public class Data {
    private static final ArrayList<String> sites = new ArrayList<String>();
    public static final ArrayList<URL> URL_LIST = new ArrayList<URL>();
    public static final ArrayList<InetAddress> ADDRESSES = new ArrayList<InetAddress>();

    static {
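        // (The rest of the initializer was cut off above; this is a minimal sketch
        // of what it does, not my code verbatim. It assumes `sites` already holds
        // plain host names loaded elsewhere.)
        for (String site : sites) {
            try {
                URL url = new URL("http://" + site + "/");
                URL_LIST.add(url);
                // Resolve and cache the address up front so DNS lookups
                // don't count against the fetch benchmark.
                ADDRESSES.add(InetAddress.getByName(url.getHost()));
            } catch (Exception e) {
                System.out.println("Skipping " + site + ": " + e.getMessage());
            }
        }
    }
}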
My next step was to benchmark with 100 URLs: fetch each one from the Internet, read the first 64 KB, and move on to the next URL. I create a thread pool of FetchTaskConsumers and tried running various numbers of threads (from 16 to 64 on a quad-core i7); this is what each consumer looks like:
public class FetchTaskConsumer implements Runnable {
    private final CountDownLatch latch;
    private final int[] urlIndexes;

    public FetchTaskConsumer(int[] urlIndexes, CountDownLatch latch) {
        this.urlIndexes = urlIndexes;
        this.latch = latch;
    }

    @Override
    public void run() {
        URLConnection resource;
        InputStream is = null;
        for (int i = 0; i < urlIndexes.length; i++) {
            int numBytes = 0;
            try {
                resource = Data.URL_LIST.get(urlIndexes[i]).openConnection();
                resource.setRequestProperty("User-Agent", "Mozilla/5.0");
                is = resource.getInputStream();
                // Read at most the first 64 KB, one byte at a time.
                while (is.read() != -1 && numBytes < 65536) {
                    numBytes++;
                }
            } catch (IOException e) {
                System.out.println("Fetch Exception: " + e.getMessage());
            } finally {
                // `remaining` is a shared AtomicInteger (declared elsewhere) counting URLs left.
                System.out.println(numBytes + " bytes for url index " + urlIndexes[i]
                        + "; remaining: " + remaining.decrementAndGet());
                if (is != null) {
                    try {
                        is.close();
                    } catch (IOException e1) {
                    }
                }
            }
        }
        latch.countDown();
    }
}
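For completeness, the driver that kicks these consumers off looks roughly like this (a sketch rather than my exact code; the class name, the index chunking, and the pool size of 64 are placeholders for what I have been experimenting with):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FetchBenchmark {
    public static void main(String[] args) throws InterruptedException {
        int nThreads = 64;                                // I have been varying this between 16 and 64
        int total = Data.URL_LIST.size();
        int chunk = (total + nThreads - 1) / nThreads;    // ceil(total / nThreads)
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        CountDownLatch latch = new CountDownLatch(nThreads);
        long start = System.currentTimeMillis();
        for (int t = 0; t < nThreads; t++) {
            int from = Math.min(t * chunk, total);
            int to = Math.min(from + chunk, total);
            int[] indexes = new int[to - from];
            for (int i = 0; i < indexes.length; i++) {
                indexes[i] = from + i;
            }
            pool.execute(new FetchTaskConsumer(indexes, latch));
        }
        latch.await();                                    // each consumer calls countDown() when its slice is done
        pool.shutdown();
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(total + " URLs in " + elapsed + " ms");
    }
}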
At best, I can get through 100 URLs in about 30 seconds, but the literature suggests I should be able to get through 300 pages per second. Note that I have access to Gigabit Ethernet, although I am currently testing at home on my 20 Mbit connection ... in any case, the connection is never fully utilized.
I tried using Socket connections directly, but I must be doing something wrong, because that is even slower! Any suggestions on how I can improve my throughput?
PS
I have a list of about 1 million popular URLs, so I can add more if 100 is not enough for benchmarking.
Update:
The literature I am referring to is the papers on the Najork web crawler; Najork states:
Processed 891 million URLs in 17 days
That is ~606 downloads per second [on] 4 Compaq DS20E Alpha servers [with] 4 GB main memory[,] 650 GB of disk space [and] 100 Mbit/s Ethernet.
The ISP rate-limits bandwidth to 160 Mbit/s.
So that is actually about 150 pages per second per machine, not 300 (the quick check below spells out the arithmetic). My machine is a Core i7 with 4 GB of RAM, and I'm nowhere near that; I have not seen anything indicating what else they used in particular.
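Quick sanity check of those figures (891 million URLs over 17 days, spread across 4 machines):

public class NajorkMath {
    public static void main(String[] args) {
        long urls = 891000000L;
        long seconds = 17L * 24 * 60 * 60;            // 17 days
        double perSecond = (double) urls / seconds;    // ~606 downloads/s in total
        double perMachine = perSecond / 4;             // ~151 downloads/s per machine
        System.out.printf("%.0f/s total, %.0f/s per machine%n", perSecond, perMachine);
    }
}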
Update:
OK, tallying up ... the final results are in! It turns out that 100 URLs was too small a sample for the test. I bumped it up to 1024 URLs and 64 threads, set a timeout of 2 seconds on each fetch, and was able to get up to 21 pages per second (given that my connection is about 10.5 Mbps, 21 pages per second * 64 KB per page works out to about 10.5 Mbps). Here is what the fetcher looks like:
public class FetchTask implements Runnable {
    private final int timeoutMS = 2000;
    private final CountDownLatch latch;
    private final int[] urlIndexes;

    public FetchTask(int[] urlIndexes, CountDownLatch latch) {
        this.urlIndexes = urlIndexes;
        this.latch = latch;
    }

    @Override
    public void run() {
        URLConnection resource;
        InputStream is = null;
        for (int i = 0; i < urlIndexes.length; i++) {
            int numBytes = 0;
            try {
                resource = Data.URL_LIST.get(urlIndexes[i]).openConnection();
                resource.setConnectTimeout(timeoutMS);
                resource.setRequestProperty("User-Agent", "Mozilla/5.0");
                is = resource.getInputStream();
                // Read at most the first 64 KB, one byte at a time.
                while (is.read() != -1 && numBytes < 65536) {
                    numBytes++;
                }
            } catch (IOException e) {
                System.out.println("Fetch Exception: " + e.getMessage());
            } finally {
                // CSV-style progress line: bytes read, URL index, URLs remaining
                // (`remaining` is a shared AtomicInteger declared elsewhere).
                System.out.println(numBytes + "," + urlIndexes[i] + "," + remaining.decrementAndGet());
                if (is != null) {
                    try {
                        is.close();
                    } catch (IOException e1) {
                    }
                }
            }
        }
        latch.countDown();
    }
}
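And the back-of-envelope check that the line really is saturated (assuming exactly 64 KB read per page, and quoting "Mbps" the same 1024-based way I did above):

public class BandwidthMath {
    public static void main(String[] args) {
        double pagesPerSecond = 21.0;
        double bytesPerPage = 64 * 1024;
        // bits per second divided by 2^20
        double mbps = pagesPerSecond * bytesPerPage * 8 / (1024 * 1024);
        System.out.println(mbps + " Mbit/s");   // ~10.5 -- the connection is maxed out
    }
}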