Scalability of processing large amounts of database data in PHP, many times a day

I will soon be working on a project that presents a problem for me.

At regular intervals throughout the day, it will need to process tens of thousands of records, possibly more than a million. Processing will involve several (potentially complex) formulas, generating several random factors, writing some new data into a separate table, and updating the source records with some results. This needs to happen for all records, ideally every three hours. Each new user on the site will add 50 to 500 records that must be processed this way, so the number will not stay static.

The code has not been written yet, as I am still in the design stage, mainly because of this problem. I know I will need to use cron jobs, but I am concerned that processing records at this scale may cause the site to freeze, run slowly, or simply anger my hosting company every three hours.

I would like to know if anyone has experience or advice on this topic. I have never worked at this scale before, and for all I know it could be trivial for the server and not a big problem at all. As long as ALL records are processed before the next three-hour cycle starts, I don't care whether they are processed at the same time (although, ideally, all records belonging to a specific user should be processed in one batch). So I wondered: should I process batches every 5 minutes, 15 minutes, an hour, whatever works? And what is the best way to approach this (and to make it scalable so it holds up as users are added)?

+4
source
6 answers

Here is how I would approach this problem (but it will cost you money and may not be the solution you want):

  • You should use a VPS (here is a quick list of some cheap VPSes). But you should do some more research to find the best VPS for your needs, if you want to reach your goal without angering your hosting company (which I'm sure you otherwise would).
  • You should not use a cron job, but a message queue such as beanstalkd to queue your tasks and do the processing offline instead. With a message queue you can also throttle your processing if necessary; see the sketch below.
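For illustration only (this sketch is mine, not part of the original answer): a producer that enqueues one job per user, assuming the pheanstalk PHP client for beanstalkd (v4-style API; adjust the calls to your client version) and a made-up tube name:

    <?php
    // producer.php - enqueue one job per user whose records need processing
    use Pheanstalk\Pheanstalk;

    $queue = Pheanstalk::create('127.0.0.1');
    $queue->useTube('user-batches');                 // tube name is made up
    $queue->put(json_encode(['user_id' => 42]));     // payload is illustrative

and a long-running worker, started outside the web server, which can throttle itself between jobs:

    <?php
    // worker.php - consume jobs one at a time
    use Pheanstalk\Pheanstalk;

    $queue = Pheanstalk::create('127.0.0.1');
    $queue->watch('user-batches');
    while (true) {
        $job     = $queue->reserve();                // blocks until a job arrives
        $payload = json_decode($job->getData(), true);
        // ... run the formulas for this user's 50-500 records here ...
        $queue->delete($job);                        // acknowledge completion
        sleep(1);                                    // crude throttle, if desired
    }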

Not a necessity, but this is how I would solve it:

  • If performance is really a key issue, I would have (at least) two VPS instances: one to handle the HTTP requests from users visiting your site, and one to do the offline processing. That way your users/visitors will not notice any of the heavy offline processing you do.
  • I probably also would not use PHP for the offline work, because of its blocking nature. I would use something like node.js for this kind of processing, because nothing blocks in node.js, which will be much faster.
  • I probably also would not store the data in a relational database, but use redis as a datastore instead. node_redis is an incredibly fast client for node.js.
+3
source

The problem with performing many updates on MySQL tables that the website uses is that each data update invalidates your query cache. That means it will significantly slow down your site, even after the update is complete.

The solution we have used before is to have two MySQL databases (on different servers too, in our case). Only one of them is actively used by the web server; the other is just a standby used for these kinds of updates. The two servers replicate their data to each other.

The solution (a sketch of the application-side switch follows the list):

  • Replication is stopped.
  • The website is told to use database1.
  • The big updates you mentioned are performed on database2.
  • Many commonly used queries are run once on database2 to warm up the query cache.
  • The website is told to use database2.
  • Replication is started again. Database2 is now mostly being read from (both by the website and by replication), so there is not much delay on the site.
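For illustration only (my sketch, not part of the original answer): the application-side switch can be as simple as a one-line flag file that the switchover script rewrites. All names, hosts, and paths below are made up.

    <?php
    // config.php - hypothetical "which database is live?" lookup
    function active_db_dsn(): string {
        // The switchover script writes "db1" or "db2" into this flag file.
        $active = trim(@file_get_contents('/etc/myapp/active_db') ?: 'db1');
        $hosts  = ['db1' => '10.0.0.1', 'db2' => '10.0.0.2']; // assumed hosts
        $host   = $hosts[$active] ?? $hosts['db1'];
        return "mysql:host={$host};dbname=myapp";
    }

    $pdo = new PDO(active_db_dsn(), 'app_user', 'app_password');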
+1
source

It can be done using many servers, where each server can process X records per hour; the more records you have in the future, the more servers you will need. Otherwise you may end up with a million records still being processed while the last 2-3 (or even a 4th) runs have not finished yet... One way to split the work is sketched below.
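A hypothetical way to split the work (my illustration, not the answerer's; the table and column names are assumptions) is to give each server a modulo slice of the record ids:

    <?php
    // worker.php - this machine processes only its slice of the records
    $numServers  = 4; // total processing servers (assumption)
    $serverIndex = 2; // this machine's 0-based index (assumption)

    $pdo  = new PDO('mysql:host=localhost;dbname=myapp', 'app_user', 'app_password');
    $stmt = $pdo->prepare('SELECT id, val FROM records WHERE MOD(id, :n) = :i');
    $stmt->bindValue(':n', $numServers, PDO::PARAM_INT);
    $stmt->bindValue(':i', $serverIndex, PDO::PARAM_INT);
    $stmt->execute();

    while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
        // ... apply the formulas and write the results back ...
    }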

0
source

You might want to consider which database to use. Maybe a relational database is not suitable for this?

The only way to find out, though, is to actually run some tests that mimic what you are going to do, even something as crude as the sketch below.
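A toy example of such a test (mine, not the answerer's): time a stand-in for the per-record work to get a rough feel for throughput before committing to a design.

    <?php
    // benchmark.php - crude timing of 10,000 fake record computations
    $start = microtime(true);
    for ($i = 1; $i <= 10000; $i++) {
        // stand-in for the "several (potentially complex) formulas"
        $result = sqrt($i) * mt_rand(1, 100) / 3.0;
    }
    printf("10000 records took %.3f seconds\n", microtime(true) - $start);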

0
source

In this situation, I would consider using Gearman (which also has a PHP extension, but can be used from many languages); a minimal sketch follows.
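A minimal sketch of that idea, assuming PHP's gearman extension and a gearmand server on localhost (the function name and payload are made up):

    <?php
    // worker.php - register a job handler and wait for work
    $worker = new GearmanWorker();
    $worker->addServer('127.0.0.1', 4730);
    $worker->addFunction('process_user', function (GearmanJob $job) {
        $userId = (int) $job->workload();
        // ... process all of this user's records in one batch ...
        return 'done';
    });
    while ($worker->work()) {}

and the client side, fired once per user, e.g. from a cron script:

    <?php
    // client.php - fire-and-forget one background job per user
    $client = new GearmanClient();
    $client->addServer('127.0.0.1', 4730);
    $userId = 42; // illustrative
    $client->doBackground('process_user', (string) $userId);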

0
source

Do it all on the server side using a stored procedure that selects a subset of the data and then processes the data internally.

Here is an example that uses a cursor to select data ranges:

    drop procedure if exists batch_update;

    delimiter #

    create procedure batch_update
    (
        in p_from_id int unsigned, -- range of data to select for each batch
        in p_to_id   int unsigned
    )
    begin
        declare v_id   int unsigned;
        declare v_val  double(10,4);
        declare v_done tinyint default 0;
        declare v_cur cursor for
            select id, val from foo where id between p_from_id and p_to_id;
        declare continue handler for not found set v_done = 1;

        start transaction;

        open v_cur;
        repeat
            fetch v_cur into v_id, v_val;
            -- do work...
            if v_val < 0 then
                update foo set ...
            else
                insert into foo ...
            end if;
        until v_done end repeat;
        close v_cur;

        commit;
    end #

    delimiter ;

    call batch_update(1, 10000);
    call batch_update(10001, 20000);
    call batch_update(20001, 30000);

Even if you can't use cursors at all, that's fine; the main point of my suggestion is moving the logic from your application layer back into the data layer. I suggest you prototype the stored procedure in your database and then run some tests. If the procedure completes in a few seconds, I don't see you having many problems, especially if you use innodb tables with transactions. The batched calls above could also be driven from a small cron script, as sketched below.
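A hypothetical PHP driver (my addition, not the answerer's; the connection details and the use of MAX(id) are assumptions):

    <?php
    // run_batches.php - cron driver for the batch_update procedure above
    $pdo   = new PDO('mysql:host=localhost;dbname=myapp', 'app_user', 'app_password');
    $maxId = (int) $pdo->query('SELECT MAX(id) FROM foo')->fetchColumn();

    $batchSize = 10000; // same range size as the manual calls above
    for ($from = 1; $from <= $maxId; $from += $batchSize) {
        $stmt = $pdo->prepare('CALL batch_update(?, ?)');
        $stmt->execute([$from, $from + $batchSize - 1]);
    }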

Here is another example that might be of interest, although it works on a much larger dataset, over 50 million rows:

Optimal MySQL settings for queries that provide large amounts of data?

Hope this helps :)

0
source

Source: https://habr.com/ru/post/1335647/

