How to implement multiple threads in Java to load data from a single table?

How can I implement multiple threads, with multiple connections or a shared one, so that one large data table can be loaded quickly?

In fact, in my application I load a table with 12 lacs (1 lac = 100,000) records, which takes at least 4 hours on a normal connection and even longer on a slow one.

So I need to run multiple threads in Java to load the table data through multiple (or shared) connection objects, but I don't know how to do it.

How do I position a record pointer in each thread, and then how do I append the records from all the threads into one large file?

Thanks in advance

+1
java multithreading
Nov 30 '11 at 12:10
4 answers

First of all, it is not recommended to fetch and load such a huge amount of data to the client. If you need the data for display, you don't need more records than fit on your screen. You can paginate and fetch one page at a time. And if you fetch it all and process it in your client's memory, you will probably run out of memory.

If you have to do this at all, regardless of that advice, you can create several threads with separate connections to the database, where each thread fetches a portion of the data (from 1 to many pages). If you have, say, 100K records and 100 threads, each thread can pull 1K records. Having 100 threads with 100 open database connections is also not recommended; this is just an example. Limit the number of threads to some optimal value, and also limit the number of records each thread fetches. You can limit the number of records retrieved from the database based on rownum.
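
A minimal sketch of that idea, assuming an Oracle-style rownum, a hypothetical table my_table with id and data columns, and placeholder connection details:

    import java.sql.*;
    import java.util.*;
    import java.util.concurrent.*;

    public class ParallelTableLoader {
        private static final String URL = "jdbc:oracle:thin:@//host:1521/db"; // placeholder
        private static final int THREADS = 10;            // keep this modest
        private static final int ROWS_PER_THREAD = 1000;

        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(THREADS);
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int i = 0; i < THREADS; i++) {
                final int start = i * ROWS_PER_THREAD + 1; // rownum is 1-based
                final int end = start + ROWS_PER_THREAD - 1;
                futures.add(pool.submit(() -> fetchRange(start, end)));
            }
            List<String> allRows = new ArrayList<>();      // merge every thread's slice
            for (Future<List<String>> f : futures) {
                allRows.addAll(f.get());
            }
            pool.shutdown();
            System.out.println("Loaded " + allRows.size() + " rows");
        }

        // Each task opens its own connection and pulls only its slice of rownums.
        private static List<String> fetchRange(int start, int end) throws SQLException {
            String sql = "SELECT id, data FROM (SELECT t.*, rownum rn FROM my_table t) "
                       + "WHERE rn BETWEEN ? AND ?";
            List<String> rows = new ArrayList<>();
            try (Connection con = DriverManager.getConnection(URL, "user", "pass");
                 PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setInt(1, start);
                ps.setInt(2, end);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        rows.add(rs.getLong("id") + "," + rs.getString("data"));
                    }
                }
            }
            return rows;
        }
    }

Note that without an order by inside the subquery, rownum slices are not guaranteed to be stable across separate queries; the longer answer below discusses this ordering caveat.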

+3
Dec 02

As Vikas noted, if you are downloading gigabytes of data to the client side, you are really doing something wrong; as he said, you don't need more records than fit on your screen. If, however, you just need to do this from time to time to duplicate the database or for backup purposes, just use the export functionality of your DBMS and download the exported file with DAP (or your favorite download accelerator).

+2
Dec 02

There seem to be several ways to do a "multithreaded read of a full table":

Zeroth way: if your problem is only "I run out of RAM reading this entire table into memory", you can process one row at a time (or a batch of rows), then move on to the next batch, and so on, thus avoiding loading the entire table into memory (but it is still a single thread, and possibly slow).
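
A sketch of that row-at-a-time approach, assuming PostgreSQL (where streaming requires auto-commit off plus a fetch size) and the same hypothetical my_table:

    import java.sql.*;

    public class StreamingRead {
        public static void main(String[] args) throws SQLException {
            String url = "jdbc:postgresql://host/db"; // placeholder
            try (Connection con = DriverManager.getConnection(url, "user", "pass")) {
                con.setAutoCommit(false);             // PostgreSQL streams only inside a transaction
                try (Statement st = con.createStatement()) {
                    st.setFetchSize(500);             // pull 500 rows per round trip, not the whole table
                    try (ResultSet rs = st.executeQuery("SELECT id, data FROM my_table")) {
                        while (rs.next()) {
                            process(rs.getLong("id"), rs.getString("data")); // one row at a time
                        }
                    }
                }
            }
        }

        private static void process(long id, String data) { /* write to file, transform, etc. */ }
    }

Other drivers behave differently (MySQL, for instance, has its own streaming quirks), so check what your JDBC driver needs to actually stream.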

First way: have one thread that queries the entire table, placing individual rows on a queue that feeds multiple worker threads. [NB: setting the fetch size for your JDBC connection may be useful here if you want this first thread to go as fast as possible.] Disadvantage: only one thread is querying the source database at a time, which may not "max out" your database. Pro: you are not re-executing queries, so the sort order cannot change halfway through (for example, if your query is select * from table_name, the return order is somewhat random, but if everything comes back from the same resultset/query, you will not get duplicates). You will not have any accidental duplicates or anything like that. There is a tutorial that does it this way.
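
A sketch of that single-reader/multiple-workers pattern, using a BlockingQueue and one "poison pill" per worker to shut down; the table, columns, and connection string are placeholders:

    import java.sql.*;
    import java.util.concurrent.*;

    public class QueuedTableReader {
        private static final String POISON = "__EOF__"; // sentinel that stops a worker
        private static final int WORKERS = 4;

        public static void main(String[] args) throws Exception {
            BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000); // bounded, so the reader can't outrun the workers forever
            ExecutorService pool = Executors.newFixedThreadPool(WORKERS);
            for (int i = 0; i < WORKERS; i++) {
                pool.submit(() -> {
                    try {
                        for (String row; !(row = queue.take()).equals(POISON); ) {
                            process(row);               // per-row work happens here
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // Single reader: one query, one result set, so no duplicates are possible
            try (Connection con = DriverManager.getConnection("jdbc:...", "user", "pass");
                 Statement st = con.createStatement()) {
                st.setFetchSize(500);                   // keep the reader streaming
                try (ResultSet rs = st.executeQuery("SELECT id, data FROM my_table")) {
                    while (rs.next()) {
                        queue.put(rs.getLong(1) + "," + rs.getString(2));
                    }
                }
            }
            for (int i = 0; i < WORKERS; i++) queue.put(POISON); // one pill per worker
            pool.shutdown();
        }

        private static void process(String row) { /* write to file, transform, etc. */ }
    }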

Second way: pagination. Basically each thread knows which chunk it should select (XXX in this example), so it knows "I should query the table as select * from table_name order by something limit 10 offset XXX". Then each thread basically processes (in this case) 10 at a time [XXX being a shared variable among the threads, incremented by the thread that claims the next chunk].

The problem is the "order by something": it means that for each query the database may have to order the whole table, which may or may not be possible, and can be expensive, especially near the end of the table. If that column is indexed, this should not be a problem. The danger here is that if there are "gaps" in the data, you will make some useless queries, but they will probably still be fast. If you have an id column and it is mostly contiguous, you could, for example, chunk based on the id.

If you have another column you can chunk on, for example a date column with a known count per date, and it is indexed, then you can avoid the "order by" by chunking on the date instead, for example select * from table_name where date < XXX and date > YYY (with no limit clause); you could also use limit clauses within a thread to work through a specific, unique date range, and since each range is smaller, ordering it is less painful.
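
A sketch of the shared-counter pagination variant described above, assuming LIMIT/OFFSET syntax (PostgreSQL/MySQL style) and an indexed id column to order by; table and connection details are placeholders:

    import java.sql.*;
    import java.util.concurrent.*;
    import java.util.concurrent.atomic.AtomicInteger;

    public class PagedWorkers {
        private static final int PAGE = 10;
        private static final AtomicInteger nextOffset = new AtomicInteger(0); // the shared "XXX"

        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (int i = 0; i < 4; i++) {
                pool.submit(PagedWorkers::drain);
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }

        private static void drain() {
            String sql = "SELECT id, data FROM my_table ORDER BY id LIMIT ? OFFSET ?";
            try (Connection con = DriverManager.getConnection("jdbc:...", "user", "pass");
                 PreparedStatement ps = con.prepareStatement(sql)) {
                while (true) {
                    int offset = nextOffset.getAndAdd(PAGE); // claim the next page atomically
                    ps.setInt(1, PAGE);
                    ps.setInt(2, offset);
                    int rows = 0;
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            rows++;                          // process the row here
                        }
                    }
                    if (rows == 0) return;                   // ran past the end of the table
                }
            } catch (SQLException e) {
                throw new RuntimeException(e);
            }
        }
    }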

Third way: you execute a query to "reserve" rows from the table, for example update table_name set lock_column = my_thread_unique_key where lock_column is null limit 10, followed by select * from table_name where lock_column = my_thread_unique_key. Drawback: are you sure your database executes this as a single atomic operation? If not, two such updates could collide or something like that, causing duplicates or partial batches. Be careful: perhaps synchronize your process around the select and update queries, or lock the table and/or rows appropriately, to avoid a possible collision (PostgreSQL, for example, requires the SERIALIZABLE isolation level).
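
A sketch of that reserve-then-select pattern, assuming MySQL-style update ... limit syntax (it is not portable SQL) and a nullable lock_column added for this purpose:

    import java.sql.*;
    import java.util.UUID;

    public class RowReserver {
        // MySQL-style syntax: UPDATE ... LIMIT is not portable SQL
        private static final String RESERVE =
            "UPDATE my_table SET lock_column = ? WHERE lock_column IS NULL LIMIT 10";
        private static final String FETCH =
            "SELECT id, data FROM my_table WHERE lock_column = ?";

        // Returns the number of rows this call reserved and processed.
        public static int reserveAndProcess(Connection con) throws SQLException {
            String token = UUID.randomUUID().toString(); // this thread's unique key
            con.setAutoCommit(false);                    // reserve + read in one transaction
            try (PreparedStatement up = con.prepareStatement(RESERVE)) {
                up.setString(1, token);
                if (up.executeUpdate() == 0) {           // nothing left to claim
                    con.rollback();
                    return 0;
                }
            }
            int n = 0;
            try (PreparedStatement sel = con.prepareStatement(FETCH)) {
                sel.setString(1, token);
                try (ResultSet rs = sel.executeQuery()) {
                    while (rs.next()) {
                        n++;                             // process the row here
                    }
                }
            }
            con.commit();
            return n;
        }
    }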

Fourth way (related to the third), mostly useful if you have large gaps and want to avoid "useless" queries: create a new table that "numbers" your initial table with an increasing index [basically a temporary table]. Then you can split that table into chunks of contiguous ids and use them to reference the rows in the first. Or, if you already have a column in the table (or can add one) to use purely for batching purposes, you can assign a batch number to the rows, for example update table_name set batch_number = rownum % 20000; then each row has a batch number assigned to it, threads can be assigned a batch (or "every nth batch", or whatever). Or similarly update table_name set row_counter_column = rownum (Oracle examples, but you get the drift). Then you have a contiguous set of numbers to chunk by.
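
A sketch of the batch-number variant, assuming Oracle (where the modulo operator is MOD rather than %) and a batch_number column added just for batching:

    import java.sql.*;

    public class BatchNumbering {
        // One-time setup: spread the rows across 100 batches (Oracle-style rownum)
        static void assignBatches(Connection con) throws SQLException {
            try (Statement st = con.createStatement()) {
                st.executeUpdate("UPDATE my_table SET batch_number = MOD(rownum, 100)");
            }
        }

        // Each worker owns one batch number and pulls its pre-assigned rows
        static void processBatch(Connection con, int batch) throws SQLException {
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT id, data FROM my_table WHERE batch_number = ?")) {
                ps.setInt(1, batch);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // process the row here
                    }
                }
            }
        }
    }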

Fifth way (not sure I really recommend this, but): assign each row a "random" float at insertion time. Then, if you know the approximate size of the table, you can carve off slices of it, e.g. 100 batches of the form where x >= 0.01 and x < 0.02, or the like. (The idea is inspired by how Wikipedia gets a "random" page: it assigns an arbitrary float value to each row at insertion.)

What you really want to avoid is the sort order somehow changing halfway through. For example, if you don't specify a sort order and just execute a query like select * from table_name limit 10 offset XXX from several threads, it is possible that the database will [because there is no specified ordering] change the order in which it returns rows halfway through [for example, if new data is added], which means you can skip rows or get duplicates.

"Using Hibernate's ScrollableResults to slowly read 90 million records" also contains some related ideas (for Hibernate users, for example).

Another option: if you know that some column (for example "id") is mostly contiguous, you can just iterate over it in chunks (get the max, then iterate through the chunks). Or some other column that is similarly "chunkable".
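
A sketch of that max-then-iterate approach; a gap in the id sequence just means a range returns fewer rows, and each range could equally be handed to a separate thread. Table, column, and connection details are placeholders:

    import java.sql.*;

    public class IdRangeChunks {
        public static void main(String[] args) throws SQLException {
            long chunk = 10_000;
            try (Connection con = DriverManager.getConnection("jdbc:...", "user", "pass")) {
                long max;
                try (Statement st = con.createStatement();
                     ResultSet rs = st.executeQuery("SELECT MAX(id) FROM my_table")) {
                    rs.next();
                    max = rs.getLong(1);
                }
                String sql = "SELECT id, data FROM my_table WHERE id >= ? AND id < ?";
                try (PreparedStatement ps = con.prepareStatement(sql)) {
                    for (long lo = 0; lo <= max; lo += chunk) {
                        ps.setLong(1, lo);
                        ps.setLong(2, lo + chunk);
                        try (ResultSet rs = ps.executeQuery()) {
                            while (rs.next()) {
                                // process the row; a gap just means fewer rows in this range
                            }
                        }
                    }
                }
            }
        }
    }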

+1
Apr 07 '15 at 22:13

I just felt compelled to answer this old post.

Note that this is a typical scenario for Big Data: not only would you collect the data in several threads, you would also further process it in several threads. Such approaches do not always require all the data to be accumulated in memory; it can be processed in groups and/or sliding windows, and then you only need to either accumulate a result or pass the data on further (to other permanent storage).

In order to process the data in parallel, a partitioning or sharding scheme is usually applied to the source data. If the data is raw text, this may be a random split somewhere in the middle. For databases, the partitioning scheme is nothing more than an additional condition applied to your query in order to enable paging. It could be something like:

  • Driver program: split my data into parts and start 4 workers.
  • 4 x (Worker program): give me part n of 4 of the data.

This might mean (pseudo) SQL like:

 SELECT ... FROM (... Subquery ...) WHERE date = SYSDATE - days(:partition) 

In the end, it's all pretty arbitrary, nothing super advanced.
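
A sketch of that driver/worker split; here the partition's date condition is computed on the Java side instead of with the SYSDATE arithmetic above, and the table, columns, and connection string are placeholders:

    import java.sql.*;
    import java.time.LocalDate;
    import java.util.concurrent.*;

    public class PartitionDriver {
        public static void main(String[] args) throws Exception {
            int partitions = 4;
            ExecutorService pool = Executors.newFixedThreadPool(partitions);
            for (int p = 1; p <= partitions; p++) {
                final int partition = p;               // "give me part p of 4"
                pool.submit(() -> runWorker(partition));
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }

        // Each worker applies its own partition condition to the query
        static void runWorker(int partition) {
            String sql = "SELECT id, data FROM my_table WHERE created = ?";
            try (Connection con = DriverManager.getConnection("jdbc:...", "user", "pass");
                 PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setDate(1, Date.valueOf(LocalDate.now().minusDays(partition)));
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // accumulate the result or pass the data on
                    }
                }
            } catch (SQLException e) {
                throw new RuntimeException(e);
            }
        }
    }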

0
May 09 '15 at 20:02


