Concurrency of many GET requests

Is there an efficient way to parallelize a large number of GET requests in Java? I have a file with 200,000 lines, each of which requires a GET request to the Wikimedia API, after which I have to write part of the response to a shared file. I pasted the main part of my code below.

    while ((line = br.readLine()) != null) {
        count++;
        if ((count % 1000) == 0) {
            System.out.println(count + " tags parsed");
            fbw.flush();
            bw.flush();
        }
        String target = line;
        if (target.startsWith("\"") && target.endsWith("\"")) {
            target = target.replaceAll("\"", "");
        }
        String url = "http://en.wikipedia.org/w/api.php?action=query&prop=revisions"
                + "&format=xml&rvprop=timestamp&rvlimit=1&rvdir=newer&titles="
                + URLEncoder.encode(target, "UTF-8");
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setRequestMethod("GET"); // optional; GET is the default
        //con.setRequestProperty("User-Agent", USER_AGENT);
        int responseCode = con.getResponseCode();
        BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
        String inputLine;
        StringBuilder response = new StringBuilder();
        while ((inputLine = in.readLine()) != null) {
            response.append(inputLine);
        }
        Document doc = loadXMLFromString(response.toString());
        NodeList x = doc.getElementsByTagName("revisions");
        if (x.getLength() == 1) {
            String time = x.item(0).getFirstChild().getAttributes().item(0)
                    .getTextContent().substring(0, 10).replaceAll("-", "");
            bw.write(line + "\t" + time + "\n");
        } else if (x.getLength() == 2) {
            String time = x.item(1).getFirstChild().getAttributes().item(0)
                    .getTextContent().substring(0, 10).replaceAll("-", "");
            bw.write(line + "\t" + time + "\n");
        } else {
            fbw.write(line + "\t" + "NULL" + "\n");
        }
    }

I googled, and there seem to be two options: creating threads by hand, or using something called an Executor. Could someone give a little guidance on which would be more appropriate for this task?

+4
3 answers

If you really need to do this with GET requests, I recommend using a ThreadPoolExecutor with a small pool (2 or 3 threads) to avoid overloading the Wikipedia servers. This will also save you a lot of coding.

Also consider using the Apache HttpClient library (with persistent connections!).
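To make this concrete, here is a minimal sketch of the thread-pool approach using `ExecutorService`. The `fetch` method is a hypothetical stand-in for the real GET request in the question (opening the connection and extracting the revision timestamp); the pool size, class name, and return format are illustrative assumptions, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelGet {

    // Stand-in stub: in the real program this would perform the
    // HttpURLConnection GET and return "title \t timestamp" (or NULL).
    static String fetch(String title) {
        return title + "\t20010101";
    }

    public static List<String> fetchAll(List<String> titles, int poolSize)
            throws InterruptedException, ExecutionException {
        // Small fixed pool (2-3 threads) keeps load on the server modest.
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        List<Future<String>> futures = new ArrayList<>();
        for (String t : titles) {
            futures.add(pool.submit(() -> fetch(t)));   // Callable<String>
        }
        List<String> results = new ArrayList<>();
        for (Future<String> f : futures) {
            results.add(f.get()); // blocks until that request completes
        }
        pool.shutdown();
        return results;
    }
}
```

Because `Future.get()` is consumed in submission order, results come back in the same order as the input lines, which keeps the output file deterministic.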


But it would be much better to use a database dump instead. Depending on what you are doing, you may be able to pick one of the smaller dumps. This page discusses the various options.

Note: Wikipedia prefers that people download database dumps rather than pounding on their web servers.

+5

What you need:

  • A producer thread that reads each line from the file and adds it to a queue.
  • A thread pool in which each thread takes a URL from the queue, executes the GET request, and puts the response on an output queue.
  • A single consumer thread that drains the output queue and writes the results to the file.
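The pipeline above can be sketched with `BlockingQueue` hand-offs between the stages. This is an illustrative skeleton, not the answerer's code: `lookup` is a hypothetical stub for the real GET request, and the poison-pill shutdown and class name are my own assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class PipelineSketch {

    private static final String POISON = "__EOF__"; // sentinel telling a worker to stop

    // Stand-in stub for the real GET request + XML parsing.
    static String lookup(String title) {
        return title + "\tNULL";
    }

    public static List<String> run(List<String> lines, int workers) throws InterruptedException {
        BlockingQueue<String> input = new LinkedBlockingQueue<>();
        BlockingQueue<String> output = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // Worker threads: take a title, do the GET, queue the result.
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    String t;
                    while (!(t = input.take()).equals(POISON)) {
                        output.put(lookup(t));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Producer: feed every line, then one poison pill per worker.
        for (String line : lines) input.put(line);
        for (int i = 0; i < workers; i++) input.put(POISON);

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);

        // Consumer: in the real program a single thread would write these
        // to the shared file; here we just collect them.
        List<String> results = new ArrayList<>();
        output.drainTo(results);
        return results;
    }
}
```

The key point of this design is that only one thread touches the output file, so no synchronization on the writer is needed.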
0

As stated above, you should throttle the number of concurrent GET requests based on the server's capacity. If you want to stay on the JVM but are open to Groovy, here is a very short example of concurrent GET requests.

Initially, there is a list of URLs that you want to fetch. After execution, the task list contains all the results, available through the get() method for further processing; here they are simply printed as an example.

    import groovyx.net.http.AsyncHTTPBuilder

    def urls = [
        'http://www.someurl.com',
        'http://www.anotherurl.com'
    ]

    AsyncHTTPBuilder http = new AsyncHTTPBuilder(poolSize: urls.size())

    def tasks = []
    urls.each {
        tasks.add(http.get(uri: it) { resp, html -> return html })
    }
    tasks.each { println it.get() }

Please note that for a production environment you must take care of timeouts, error handling, and so on.
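For the Java code in the question, timeouts can be added directly on the `HttpURLConnection` before firing the request. A minimal sketch; the timeout values, class name, and User-Agent string are illustrative assumptions (Wikimedia's guidelines ask for a descriptive User-Agent, but the exact string is up to you).

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class TimeoutConfig {

    // Returns a configured connection; no bytes are sent yet — the request
    // only fires when getResponseCode()/getInputStream() is called.
    public static HttpURLConnection open(String url) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setConnectTimeout(5_000);   // give up connecting after 5 s
        con.setReadTimeout(10_000);     // give up on a stalled read after 10 s
        // Hypothetical UA string — replace with your own contact info.
        con.setRequestProperty("User-Agent", "revision-fetcher/0.1 (contact@example.com)");
        return con;
    }
}
```

Without these settings, a single hung connection can stall the worker thread indefinitely, which matters when you are draining 200,000 requests through a pool of 2-3 threads.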

0

Source: https://habr.com/ru/post/1494429/

