I have a table to which rows are only ever added (never updated or deleted), inside transactions (I will explain why this matters), and I need to extract the new, previously unprocessed rows of this table every minute using cron.
How do I do this? Any programming language is fine (I use Perl, but that doesn't matter).
Below I list the ways I have thought of solving this problem, and I ask you to show me the correct one (there should be one...).
The first method that came to mind was to save (in a file) the largest auto_incrementing id among the fetched rows, so that the next minute I can fetch with: WHERE id > $last_id. But this may skip rows. Since new rows are inserted inside transactions, it is possible that the transaction that inserts the row with id = 5 commits before the transaction that inserts the row with id = 4. The cron script could then retrieve row 5 but not row 4, and when row 4 is committed a second later, it will never be fetched (since 4 is not > 5, the saved $last_id).
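To make the race concrete, here is a minimal sketch of this watermark approach (Python with an in-memory SQLite table just for illustration; the table and column names are made up), including the out-of-order commit that loses row 4:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO items (id, payload) VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c")])
conn.commit()

def fetch_new(conn, last_id):
    """One cron iteration: fetch rows above the saved watermark, advance it."""
    rows = conn.execute(
        "SELECT id, payload FROM items WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    new_last_id = rows[-1][0] if rows else last_id
    return rows, new_last_id

rows, last_id = fetch_new(conn, 0)        # first run: rows 1..3, watermark -> 3

# Simulate the race: the transaction holding id 5 commits before the one holding id 4.
conn.execute("INSERT INTO items VALUES (5, 'e')")
conn.commit()
rows, last_id = fetch_new(conn, last_id)  # sees row 5, watermark -> 5

conn.execute("INSERT INTO items VALUES (4, 'd')")  # the late commit
conn.commit()
rows, last_id = fetch_new(conn, last_id)  # 4 is not > 5, so row 4 is silently skipped
```

After the last call, `rows` is empty and row 4 is lost forever.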
Then I thought I could make the cron job fetch all rows whose date field falls within the last two minutes, check which of those rows were already fetched in the previous cron run (for this I would need to save somewhere which ids were fetched), compare, and process only the new ones. Unfortunately, this is clumsy, and it also does not solve the problem of a transaction that, for some strange reason, takes two and a half minutes to commit: its row's date would then be too old for the next cron iteration to pick up.
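Here is roughly what I mean, as a sketch (Python/SQLite again; the `created_at` column and the two-minute window are illustrative assumptions):

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, created_at TEXT)")

def fetch_window(conn, seen_ids, now, window=timedelta(minutes=2)):
    """Fetch rows dated within the window; drop ids already seen last run."""
    cutoff = (now - window).isoformat()
    rows = conn.execute(
        "SELECT id FROM items WHERE created_at >= ?", (cutoff,)
    ).fetchall()
    new_ids = [r[0] for r in rows if r[0] not in seen_ids]
    # return the new work, plus the ids to remember for the next run
    return new_ids, {r[0] for r in rows}

now = datetime(2024, 1, 1, 12, 0, 0)
conn.execute("INSERT INTO items VALUES (1, ?)",
             ((now - timedelta(seconds=30)).isoformat(),))
conn.commit()

new_ids, seen = fetch_window(conn, set(), now)  # run 1: row 1 is new
new_ids2, _ = fetch_window(conn, seen, now + timedelta(minutes=1))  # run 2: deduplicated
```

Note the failure mode: a row committed with a `created_at` older than the window (the slow two-and-a-half-minute transaction) would never appear in any run's result.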
Then I thought about installing a message queue (MQ) such as RabbitMQ or whatever. The same process that performs the insert transaction would notify RabbitMQ about the new row, and RabbitMQ would notify a constantly running process that handles new rows. So instead of fetching a batch of rows once a minute, that process would receive new rows one by one as they are written. This sounds good, but it has too many points of failure: RabbitMQ might be down for a second (for example, during a restart), in which case the insert transaction would commit without the receiving process ever being notified of the new row. Thus a new row would be missed. Not good.
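A toy illustration of the failure window I am worried about (an in-memory queue standing in for RabbitMQ here; this is not real RabbitMQ client code):

```python
import queue

broker = queue.Queue()   # stands in for RabbitMQ in this sketch
committed_rows = []      # stands in for the database table

def insert_and_notify(row_id, broker_up=True):
    committed_rows.append(row_id)  # the transaction commits first...
    if broker_up:
        broker.put(row_id)         # ...then we publish the notification
    # if the broker is down at exactly this moment, the row is committed
    # but never announced, and the consumer never learns about it

insert_and_notify(1)
insert_and_notify(2, broker_up=False)  # broker restarting: notification lost

delivered = []
while not broker.empty():
    delivered.append(broker.get())
# row 2 is in the database but was never delivered to the consumer
```

The database and the broker are two systems without a shared transaction, so a crash between the commit and the publish loses the notification.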
I just thought of one more solution: the receiving processes (there are 30 of them, all performing the same task on exactly the same data, so each row is processed 30 times, once by each receiving process) could write to another table that they have processed row X, and then, when the time comes, they could query for all rows in the main table that do not yet exist in that has_processed table, using an OUTER JOIN query. But I believe (correct me if I am mistaken) that such a query would consume a lot of CPU and disk on the database server, since it would have to compare the entire lists of ids of the two tables to find the new records (and the table is huge and keeps growing). It would be fast if there were only one receiving process: then I could add an indexed field named "has_read" to the main table, which would make finding new rows extremely fast and easy on the database server.
What is the right way to do this? What do you suggest? The question sounds simple, but the solution seems hard (for me) to find.
Thanks.