There seem to be several ways to do a "multithreaded read of a full table."
Zeroth way: if your problem is only "I run out of RAM reading this entire table into memory," you can try processing one row at a time (or one batch of rows), then the next batch, etc., thus avoiding loading the entire table into memory (but it's still a single thread, so possibly slow).
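A minimal JDBC sketch of that row-at-a-time approach, assuming a hypothetical table_name and a Postgres-style driver (which needs autocommit off for the fetch size to actually stream rows instead of buffering the whole result set):

```java
import java.sql.*;

public class StreamingReader {
    public static void main(String[] args) throws SQLException {
        // Connection URL, credentials and table name are placeholders for illustration.
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "pass")) {
            conn.setAutoCommit(false);      // postgres only streams with autocommit off
            try (Statement st = conn.createStatement()) {
                st.setFetchSize(1000);      // fetch ~1000 rows at a time instead of the whole table
                try (ResultSet rs = st.executeQuery("select * from table_name")) {
                    while (rs.next()) {
                        process(rs);        // handle one row, then the driver fetches the next batch
                    }
                }
            }
        }
    }

    static void process(ResultSet rs) throws SQLException {
        // ... do something with the current row ...
    }
}
```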
The first way: have one thread query the entire table, placing individual rows into a queue that feeds multiple worker threads [NB: setting the fetch size for your JDBC connection may be useful here if you want this reader thread to go as fast as possible]. Disadvantage: only one thread is querying the source database at a time, which may not "max out" your database. Pro: you don't run repeated queries, so the sort order cannot change halfway through (for example, if your query is select * from table_name, the return order is somewhat random, but as long as everything comes back from the same resultset/query, you won't get duplicates). You won't have any accidental duplicates or anything like that. The tutorial here does it this way.
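A rough producer/consumer sketch of this, assuming hypothetical id and payload columns in table_name; a single reader streams the table into a BlockingQueue, and a poison-pill object tells the workers when to stop:

```java
import java.sql.*;
import java.util.*;
import java.util.concurrent.*;

public class SingleReaderManyWorkers {
    // Poison pill: a sentinel that tells each worker there are no more rows.
    private static final Map<String, Object> DONE = new HashMap<>();

    public static void main(String[] args) throws Exception {
        int workers = 4;
        BlockingQueue<Map<String, Object>> queue = new ArrayBlockingQueue<>(10_000);
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // Worker threads: pull rows off the queue until they see the poison pill.
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    for (Map<String, Object> row = queue.take(); row != DONE; row = queue.take()) {
                        process(row);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Single reader thread (here: the main thread) streams the whole table into the queue.
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "pass")) {
            conn.setAutoCommit(false);
            try (Statement st = conn.createStatement()) {
                st.setFetchSize(1000);
                try (ResultSet rs = st.executeQuery("select id, payload from table_name")) {
                    while (rs.next()) {
                        Map<String, Object> row = new HashMap<>();
                        row.put("id", rs.getObject("id"));
                        row.put("payload", rs.getObject("payload"));
                        queue.put(row);   // copy the row out; a ResultSet must not be shared across threads
                    }
                }
            }
        }

        for (int i = 0; i < workers; i++) queue.put(DONE);   // one pill per worker
        pool.shutdown();
    }

    static void process(Map<String, Object> row) {
        // ... per-row work happens here, in parallel ...
    }
}
```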
The second way: pagination. Basically each thread somehow knows which chunk it should select (XXX in this example), so it knows "I have to query the table as select * from table_name order by something offset XXX limit 10". Each thread then processes (in this case) 10 rows at a time [XXX is a variable shared among the threads, incremented by whichever thread claims the next chunk].
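A sketch of that pagination approach, assuming an indexed id column to order by and limit/offset syntax; the shared XXX from the text becomes an AtomicLong here (connection details and table name are placeholders):

```java
import java.sql.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

public class PaginatedReaders {
    static final int PAGE_SIZE = 10;
    // The shared "XXX": each thread atomically bumps it to claim the next page.
    static final AtomicLong nextOffset = new AtomicLong(0);

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            pool.submit(PaginatedReaders::readPages);
        }
        pool.shutdown();
    }

    static void readPages() {
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "pass");
             PreparedStatement ps = conn.prepareStatement(
                 "select * from table_name order by id limit ? offset ?")) {
            while (true) {
                long offset = nextOffset.getAndAdd(PAGE_SIZE);   // claim the next page atomically
                ps.setInt(1, PAGE_SIZE);
                ps.setLong(2, offset);
                int rows = 0;
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        rows++;
                        // ... process the row ...
                    }
                }
                if (rows == 0) break;   // past the end of the table
            }
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }
}
```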
The problem is the "order by something", which means that for each query the database may have to sort the whole table, which may or may not be feasible and can be expensive, especially toward the end of the table (i.e. at large offsets). If that column is indexed, this shouldn't be a problem. The danger here is that if there are "gaps" in the data, you will run some useless queries, but they will probably still be fast. If you have an identifier column that is mostly contiguous, you can, for example, "chunk" based on that identifier.
If you have another column you can chunk by, for example a date column with a roughly known number of rows per date, and it is indexed, then you can avoid the "order by" entirely by chunking on date instead, for example select * from table_name where date < XXX and date > YYY (with no limit clause at all). Alternatively, you could use per-thread limit clauses to page through a single date range, with ordering and an offset within that range; since it's a much smaller range, that's less painful.
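For instance, a per-thread date-range query could look roughly like this, assuming a hypothetical indexed created_date column and some coordinator that hands each thread its own [from, to) range:

```java
import java.sql.*;
import java.time.LocalDate;

public class DateChunkReader {
    // Each worker thread gets its own [from, to) date range, so no global "order by" is needed.
    static void readRange(Connection conn, LocalDate from, LocalDate to) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "select * from table_name where created_date >= ? and created_date < ?")) {
            ps.setDate(1, Date.valueOf(from));
            ps.setDate(2, Date.valueOf(to));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // ... process the row ...
                }
            }
        }
    }
}
```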
The third way: you run a query to "reserve" rows from the table, for example update table_name set lock_column = my_thread_unique_key where lock_column is null limit 10, followed by select * from table_name where lock_column = my_thread_unique_key. Drawback: are you sure your database executes that as a single atomic operation? If not, it's possible for two threads' queries to collide or the like, causing duplicates or partial batches. Be careful. Perhaps synchronize your own process around the update and select queries, or lock the table and/or rows accordingly, something like that, to avoid possible collisions (for example, postgres requires a special SERIALIZABLE setting).
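A sketch of that reserve-then-read pattern, assuming a hypothetical nullable lock_column and a database whose dialect allows update ... limit (MySQL-style); the claim and the read happen inside one transaction:

```java
import java.sql.*;
import java.util.UUID;

public class RowReserver {
    // Reserve a batch of unclaimed rows for this thread, then fetch and process them.
    static void claimAndProcessBatch(Connection conn) throws SQLException {
        String myKey = UUID.randomUUID().toString();   // the "my_thread_unique_key" from the text
        conn.setAutoCommit(false);
        try {
            try (PreparedStatement upd = conn.prepareStatement(
                    "update table_name set lock_column = ? where lock_column is null limit 10")) {
                upd.setString(1, myKey);
                upd.executeUpdate();
            }
            try (PreparedStatement sel = conn.prepareStatement(
                    "select * from table_name where lock_column = ?")) {
                sel.setString(1, myKey);
                try (ResultSet rs = sel.executeQuery()) {
                    while (rs.next()) {
                        // ... process the rows this thread reserved ...
                    }
                }
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}
```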
The fourth way (related to the third) is mostly useful if you have large gaps and want to avoid those "useless" queries: create a new table that "numbers" your original table with an incrementing index [basically a temporary table]. Then you can split that table into chunks of contiguous identifiers and use them to reference the rows in the original. Or, if you already have a column in the table (or can add one) used only for batching purposes, you can assign a batch number to each row, for example update table_name set batch_number = rownum % 20000, so each row has a batch number assigned to it and threads can each be assigned a batch (or assigned "every Nth batch" or whatever). Or similarly update table_name set row_counter_column = rownum (Oracle examples, but you get the drift). Either way you end up with a contiguous set of numbers to chunk by.
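As a rough illustration of the batch-number variant (Oracle-flavored rownum, hypothetical batch_number column):

```java
import java.sql.*;

public class BatchNumbered {
    // One-time setup: give every row a batch number (Oracle-style rownum, written as mod()
    // since % is not Oracle syntax). Column name batch_number is just an example.
    static void assignBatches(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.executeUpdate("update table_name set batch_number = mod(rownum, 20000)");
        }
    }

    // Each thread then reads only the batches assigned to it; no order by or offset needed.
    static void readBatch(Connection conn, int batchNumber) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "select * from table_name where batch_number = ?")) {
            ps.setInt(1, batchNumber);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // ... process the row ...
                }
            }
        }
    }
}
```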
Fifth way (not sure I really recommend this, but): assign each row a "random" float at insertion time. Then, if you know the approximate size of the table, you can split it into, say, 100 parts and query 100 batches like "where x >= 0.01 and x < 0.02" or the like. (An idea inspired by how wikipedia can return a "random" page: it assigns an arbitrary float value to each row at insertion time.)
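If you did go this route, each thread's slice query might look something like this, assuming a hypothetical random_key column populated with a uniform value in [0, 1) at insert time:

```java
import java.sql.*;

public class RandomFloatPartitions {
    // Read one of N equal-width slices of the random_key column.
    static void readSlice(Connection conn, int slice, int totalSlices) throws SQLException {
        double lo = (double) slice / totalSlices;
        double hi = (double) (slice + 1) / totalSlices;
        try (PreparedStatement ps = conn.prepareStatement(
                "select * from table_name where random_key >= ? and random_key < ?")) {
            ps.setDouble(1, lo);
            ps.setDouble(2, hi);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // ... process the row ...
                }
            }
        }
    }
}
```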
What you really want to avoid is the sort order somehow changing partway through. For example, if you don't specify a sort order and just execute a query like select * from table_name offset XXX limit 10 from several threads, it is possible that the database will [because there is no order by clause] change the order in which it returns rows partway through [for example, if new data is added], which means you can skip rows or see the same row twice.
The question "Using Hibernate's ScrollableResults to slowly read 90 million records" also contains some related ideas (e.g. for Hibernate users).
Another option: if you know that some column (for example "id") is mostly contiguous, you can just iterate over it in "chunks" (get the max, then iterate through the chunks). Or use some other column that is similarly "chunkable".
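A sketch of that id-chunking idea, assuming a numeric, mostly contiguous id column; in practice each range below would be handed to a separate worker thread:

```java
import java.sql.*;

public class IdRangeChunks {
    // Walk the table in fixed-size id ranges instead of using order by + offset.
    static void readAllInChunks(Connection conn, long chunkSize) throws SQLException {
        long maxId;
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("select max(id) from table_name")) {
            rs.next();
            maxId = rs.getLong(1);
        }
        try (PreparedStatement ps = conn.prepareStatement(
                "select * from table_name where id >= ? and id < ?")) {
            for (long start = 0; start <= maxId; start += chunkSize) {
                ps.setLong(1, start);
                ps.setLong(2, start + chunkSize);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // ... process the row; gaps in id just mean smaller result sets ...
                    }
                }
            }
        }
    }
}
```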