How can I implement multithreading in Java to process 2 million text files?

Question


I need to process 2 million text files and generate triples from them.

Suppose I have a text file xyz.txt (one of the 2 million input files). It is processed as shown below:

start(xyz.txt)---->module1(xyz.tpd)------>module2(xyz.adv)-------->module3(xyz.tpl)

Please suggest logic or a concept that would let me speed up and optimize this process on 4 GB x64 Windows systems.

module1 (working): it parses the txt file by running a .bat file that calls the parser. This runs as a separate system thread, and after 15 seconds it starts parsing the next txt file, and so on.

module2 (working): it takes a .tpd file as input and creates a .adv file.

module3 (working): it takes the .adv file as input and generates a .tpl file (triples).

Should I start the threads at the txt-file stage or at some other point? I am afraid of getting bogged down in context switching.

Does anyone have a better approach that I could try?

+4

java performance multithreading

Roshan Jun 04 '13 at 7:07


5 answers

Most importantly, you need to write the program, profile it, and see where the bottleneck is. It is more than likely that disk I/O will be the bottleneck, in which case no amount of multithreading will solve your problem.

In that case, using two (three? four?) separate hard drives may give a bigger speedup than the best multithreaded solution.

In addition, the general rule is that you should optimize your application only once you have working code and you really know what to optimize. Profile, profile, profile.

Thinking about future multithreaded optimizations while you write the code is in order, though: the architecture must be flexible enough to allow them later.
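As a rough first measurement before reaching for a real profiler, each stage can be timed on a sample of files. The sketch below is only an illustration: the class name `StageTimerSketch`, the helper `timeMillis`, and the empty stage bodies are hypothetical placeholders, not part of the asker's code.

```java
public class StageTimerSketch {

    // Runs one stage and returns its wall-clock duration in milliseconds.
    static long timeMillis(Runnable stage) {
        long start = System.nanoTime();
        stage.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Hypothetical placeholders; replace the bodies with calls into the real modules.
        Runnable module1 = () -> { /* parse xyz.txt into xyz.tpd */ };
        Runnable module2 = () -> { /* convert xyz.tpd into xyz.adv */ };
        Runnable module3 = () -> { /* convert xyz.adv into xyz.tpl */ };

        System.out.println("module1: " + timeMillis(module1) + " ms");
        System.out.println("module2: " + timeMillis(module2) + " ms");
        System.out.println("module3: " + timeMillis(module3) + " ms");
    }
}
```

Wall-clock timing like this only shows which stage dominates; it does not say whether that stage is waiting on the disk or on the CPU, which is what a profiler or the operating system's disk and CPU monitors would reveal.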

+3

Dariusz Jun 04 '13 at 7:11


Not much has been said here about your hardware environment, but a basic solution would be to use a fixed-size ExecutorService, where the size is, as a first approximation, the number of your execution units (processors):

    private static final int NR_CPUS = Runtime.getRuntime().availableProcessors();
    // Then:
    final ExecutorService executor = Executors.newFixedThreadPool(NR_CPUS);

Then, for each file, you can create a Runnable to process it and submit it to the thread pool using its .execute() method.

Note that .execute() is asynchronous; if the submitted runnable cannot be started right now, it will be queued.
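Put together, the approach might look like the sketch below. It assumes Java 8 or later for the lambda syntax; the class name `FixedPoolSketch` and the methods `processFile` and `processAll` are hypothetical stand-ins for the real module1, module2 and module3 calls.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class FixedPoolSketch {
    private static final int NR_CPUS = Runtime.getRuntime().availableProcessors();

    // Hypothetical stand-in for running module1 -> module2 -> module3 on one file.
    static void processFile(Path file) {
        // real parsing and conversion would go here
    }

    // Submits one task per file, waits for the pool to drain,
    // and returns the number of tasks that completed.
    static int processAll(List<Path> files) {
        ExecutorService executor = Executors.newFixedThreadPool(NR_CPUS);
        AtomicInteger done = new AtomicInteger();
        for (Path file : files) {
            executor.execute(() -> {
                processFile(file);
                done.incrementAndGet();
            });
        }
        executor.shutdown(); // accept no new tasks, but finish the queued ones
        try {
            if (!executor.awaitTermination(1, TimeUnit.HOURS)) {
                executor.shutdownNow(); // give up after the (arbitrary) timeout
            }
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return done.get();
    }

    public static void main(String[] args) {
        List<Path> files = Arrays.asList(Paths.get("a.txt"), Paths.get("b.txt"));
        System.out.println(processAll(files) + " files processed");
    }
}
```

One thing to keep in mind: the pool returned by `Executors.newFixedThreadPool` uses an unbounded queue, so submitting all 2 million tasks up front keeps 2 million queued objects in memory. Submitting the files in batches is one way to limit that.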

+1

fge Jun 04 '13 at 7:20


This sounds like a typical batch application of the kind needed for data integration. I do not want to throw hyperlinks at you without fully understanding your needs, but you probably need a solution that works in a single VM at first and that, over time, you would like to extend to several VMs or machines. And we are probably not dealing with petabytes of data to begin with. Try Spring Batch: not only will it solve the problem in this context, you will also learn how to structure your thinking (think vocabulary!) about problems like this.

0

Nitin tripathi Jun 04 '13 at 9:10


As a starting point, I would create one I/O thread and a pool of CPU threads. The I/O thread reads the text files and offers them to a BlockingQueue, and the CPU threads take the files from the BlockingQueue and process them. Then profile the application to find out how many CPU threads you need to keep up with the I/O thread. (You can also determine this dynamically: for example, start with one CPU thread and start another when the BlockingQueue grows past a threshold, maybe something like 20 files.)

You may find that you need only one CPU thread to keep up with the I/O thread, in which case your program is I/O-bound. To speed it up you would then need to, for example, place the text files next to each other on disk (so that you can use sequential reads for every file except the first) or put them on separate disks. One idea is to zip the files together and read them using ZipInputStream; this reduces the number of disk accesses when reading the files, and also reduces the amount of data you need to read.
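A minimal sketch of this reader-plus-workers layout follows. Several things in it are assumptions made for illustration: the class name `ReaderWorkerSketch`, the `process` method, the queue capacity of 20, and the use of a sentinel value to stop the workers. The file contents are passed in as strings so that the sketch runs without any files on disk; in the real program the reader thread would read each file itself. It also uses the blocking `put` rather than `offer`, so the reader simply waits when the queue is full.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class ReaderWorkerSketch {
    // Sentinel compared by identity; tells a worker that no more work will arrive.
    private static final String POISON = new String("<<end>>");

    // Hypothetical CPU-bound step standing in for the real modules.
    static int process(String content) {
        return content.length();
    }

    // Feeds the contents through a bounded queue to `workers` threads
    // and returns the sum of process(...) over all items.
    static int run(List<String> contents, int workers) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(20);
        AtomicInteger total = new AtomicInteger();

        // The single "I/O" thread: here it only hands over in-memory strings.
        Thread reader = new Thread(() -> {
            try {
                for (String c : contents) {
                    queue.put(c); // blocks while the queue is full
                }
                for (int i = 0; i < workers; i++) {
                    queue.put(POISON); // one sentinel per worker
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        List<Thread> pool = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            pool.add(new Thread(() -> {
                try {
                    for (String c = queue.take(); c != POISON; c = queue.take()) {
                        total.addAndGet(process(c));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
        }

        reader.start();
        for (Thread t : pool) {
            t.start();
        }
        try {
            reader.join();
            for (Thread t : pool) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return total.get();
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("first file", "second file"), 2));
    }
}
```

Because the queue is bounded, a reader that outpaces the workers ends up blocked in `put`, and workers that outpace the reader end up blocked in `take`. Watching which side waits is a cheap hint about whether the job is I/O-bound or CPU-bound.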

0

Zim-Zam O'Pootertoot Jun 04 '13 at 13:39


Sumit desai · Accepted Answer · 2013-06-04T07:08:41+0000

Use ThreadPoolExecutor. Configure its settings, such as the number of active threads, according to your environment and system.
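As one possible configuration, the sketch below builds a ThreadPoolExecutor with a fixed number of threads, a bounded work queue, and `CallerRunsPolicy`, so that the submitting thread runs a task itself whenever the queue is full. The class name `TunedPoolSketch`, the methods `newPool` and `runTasks`, and the queue capacity of 100 are illustrative assumptions, not recommended values.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class TunedPoolSketch {

    // Builds a pool with `threads` workers and a bounded queue.
    static ThreadPoolExecutor newPool(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads,              // core size == maximum size
                0L, TimeUnit.MILLISECONDS,     // keep-alive is unused when the sizes match
                new ArrayBlockingQueue<Runnable>(queueCapacity),
                new ThreadPoolExecutor.CallerRunsPolicy()); // throttle the submitter
    }

    // Submits n trivial tasks and returns how many of them actually ran.
    static int runTasks(int n) {
        ThreadPoolExecutor pool =
                newPool(Runtime.getRuntime().availableProcessors(), 100);
        AtomicInteger ran = new AtomicInteger();
        for (int i = 0; i < n; i++) {
            // Hypothetical task body; real code would process one file here.
            pool.execute(ran::incrementAndGet);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return ran.get();
    }

    public static void main(String[] args) {
        System.out.println("tasks run: " + runTasks(1000));
    }
}
```

The bounded queue keeps memory use flat no matter how many files are submitted, which matters on a 4 GB machine with 2 million inputs.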
