This can certainly be done in a variety of ways. I'll address each of the options you listed, along with some additional commentary.
1) If Nginx can do it, let it. I do this with Apache as well as with JBoss and Tomcat, then use syslog-ng to collect the logs centrally and process them from there. For this route I'd suggest a delimited message format, such as tab-separated, since it makes parsing and reading easier. I don't know about logging PHP variables, but it can certainly log header and cookie information. If you intend to use the Nginx logging at all, I'd recommend this route if possible - why log twice?
2) There is no "inability to query the data at a later date"; more on that below.
3) This is an option, but whether it is useful depends on how long you want to keep the data and how much cleanup you are willing to write. More on that below.
4) MongoDB could certainly work, but you will have to write the queries yourself, and they are not simple SQL commands.
Now, on to storing the data in Redis. I currently log with syslog-ng as noted above and use a program destination to parse the data and stuff it into Redis. In my case I have several grouping criteria, such as by vhost and by cluster, so my structures may be a bit different. The first question you need to ask is: "what do I want to get out of this data?" Some of it will be counters, such as traffic rates. Some of it will be aggregates, and still more will be things like "order my pages by popularity".
I'll demonstrate some of the techniques for getting this into Redis easily (and thus back out).
First, let's consider traffic statistics over time. Start by deciding on the granularity: do you want per-minute statistics, or will per-hour statistics suffice? Here is one way to track the traffic of a given URL:
Store the data in a sorted set under the key "traffic-by-url:URL:YYYY-MM-DD". Within this sorted set you use zincrby and supply the member "HH:MM". For example, in Python, where "r" is your Redis connection:
r.zincrby("traffic-by-url:/foo.html:2011-05-18", "01:04", 1)
This example increments the counter for the URL "/foo.html" on May 18 at 1:04 AM.
To get the data for a given day, call zrange on the key ("traffic-by-url:URL:YYYY-MM-DD") to get the sorted set from least popular to most popular. To get the top 10, for example, you would use zrevrange and give it the range; zrevrange returns the reverse sort, with the most-hit member at the top. Several more sorted set commands are available that let you do nice queries such as pagination, getting a range of results by minimum score, and so on.
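As a rough sketch of that lookup with redis-py (the URL and date are just the placeholders from above):

    import redis

    r = redis.Redis()

    # Key for one URL on one day, following the scheme above (placeholder values).
    key = "traffic-by-url:/foo.html:2011-05-18"

    # Whole day, least to most popular minute, with hit counts.
    all_minutes = r.zrange(key, 0, -1, withscores=True)

    # Top 10 busiest minutes, most hits first.
    top_ten = r.zrevrange(key, 0, 9, withscores=True)

    for minute, hits in top_ten:
        print(minute.decode(), int(hits))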
You can simply alter or extend your key name to handle different time windows. By combining this with zunionstore you can automatically roll up to less granular time periods. For example, you could take the union of all the keys in a week or a month and store the result in a new key such as "traffic-by-url:monthly:URL:YYYY-MM". By doing the above over all URLs on a given day you can get daily totals. Of course, you could also keep a daily total-traffic key and increment that. It mostly depends on when you want the data to be fed in - offline via log file import or as part of the user experience.
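As a rough sketch of that roll-up (the weekly key name and the date range are made-up examples of the pattern):

    import redis

    r = redis.Redis()

    # Merge one week of daily keys for /foo.html into a single weekly key.
    # zunionstore sums the scores by default, so the per-minute counters
    # from each day are added together. Key names are illustrative only.
    daily_keys = ["traffic-by-url:/foo.html:2011-05-%02d" % day
                  for day in range(16, 23)]
    r.zunionstore("traffic-by-url:weekly:/foo.html:2011-05-16", daily_keys)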
I would recommend not doing too much of this during the actual user session, as it extends the time it takes for your users to experience the page (and adds server load). Ultimately this is a call you will make based on traffic levels and resources.
As you can imagine, the above storage scheme can be applied to any counter-based statistic you want or can derive. For example, change the URL to a user ID and you have per-user tracking.
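Roughly, that swap looks like this (the key name is just an illustration):

    # Same counter pattern keyed by user ID instead of URL.
    # redis-py 3.x argument order is (name, amount, member); older
    # clients, like the earlier example, use (name, member, amount).
    r.zincrby("traffic-by-user:42:2011-05-18", 1, "01:04")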
You can also store raw logs in Redis. I do this for some logs, storing them as JSON strings (I keep them as key-value pairs). Then I have a second process that pulls them out and does things with the data.
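One simple way to sketch the same idea is a Redis list used as a queue (the key name "log-queue" is made up here, and I store mine a bit differently as noted):

    import json
    import redis

    r = redis.Redis()

    # Producer: push each raw log entry as a JSON string.
    entry = {"url": "/foo.html", "status": 200, "ip": "203.0.113.7"}
    r.rpush("log-queue", json.dumps(entry))

    # The "second process": block until an entry arrives, decode it,
    # and do whatever post-processing you need with the data.
    _key, raw = r.blpop("log-queue")
    record = json.loads(raw)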
For storing raw hits you could also use sorted sets, using the epoch time as the score, and easily grab a time window with the zrange/zrevrange (or by-score) commands. Or store them under a key based on the user ID. Plain sets would work for this, as would sorted sets.
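For example, a sketch of the per-user variant (the key name and member format are just illustrations):

    import time
    import redis

    r = redis.Redis()

    # The hit itself is the member, the epoch timestamp is the score.
    # (The zadd mapping form is redis-py 3.x; older clients take score/member pairs.)
    key = "hits-by-user:42"
    r.zadd(key, {"/foo.html 2011-05-18T01:04:00": time.time()})

    # Grab a time window by score: everything in the last hour, newest first.
    now = time.time()
    recent = r.zrevrangebyscore(key, now, now - 3600)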
Another option I have not discussed, but which may be useful for some of your data, is storing it as a hash. This can be useful, for example, for storing detailed information about a given session.
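A small sketch of what that might look like (the key and fields are made up):

    import redis

    r = redis.Redis()

    session_key = "session:abc123"

    # One hash per session; each field is one piece of session detail.
    # (hset with mapping= needs redis-py 3.5+; older clients use hmset.)
    r.hset(session_key, mapping={
        "user_id": "42",
        "ip": "203.0.113.7",
        "user_agent": "Mozilla/5.0",
        "started": "2011-05-18 01:04",
    })

    ip = r.hget(session_key, "ip")      # read a single field
    details = r.hgetall(session_key)    # or the whole session at once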
If you really need the data in a database, try using Redis' Pub/Sub feature and have a subscriber that parses the messages into a delimited format and dumps them to a file. Then have an import process that uses the COPY command (or the equivalent for your database) to import in bulk. Your database will thank you.
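A minimal sketch of such a subscriber, assuming the hits are published as tab-delimited lines on a channel I'll call "log-feed":

    import redis

    r = redis.Redis()
    pubsub = r.pubsub()
    pubsub.subscribe("log-feed")

    # Append every published line to a file that the bulk COPY import
    # will pick up later.
    with open("hits.tsv", "a") as out:
        for message in pubsub.listen():
            if message["type"] != "message":
                continue  # skip the subscribe confirmation
            out.write(message["data"].decode() + "\n")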
A final bit of advice (I have probably taken enough of your mental time already) is to make judicious and liberal use of expire. With Redis 2.2 or newer you can set an expiration even on counter keys. The big advantage here is automatic data cleanup. Imagine you follow a scheme like the one outlined above. By using the expiration commands you can automatically purge old data. Perhaps you want hourly statistics for 3 months, then only daily statistics; daily statistics for 6 months, then only monthly statistics. Simply expire your hourly keys after three months (86400 * 90) and your daily keys after six (86400 * 180), and you won't need to do the cleanup yourself.
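In code, that retention policy is just a matter of setting the TTL when you touch the key (the "hourly:"/"daily:" key names here are illustrative):

    import redis

    r = redis.Redis()

    hourly_key = "traffic-by-url:hourly:/foo.html:2011-05-18"
    daily_key = "traffic-by-url:daily:/foo.html:2011-05-18"

    # Set (or refresh) the TTL right after incrementing the counters.
    r.expire(hourly_key, 86400 * 90)    # keep hourly stats ~3 months
    r.expire(daily_key, 86400 * 180)    # keep daily stats ~6 months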
For geotagging, I do offline processing of the IPs. Imagine a sorted set with the key structure "traffic-by-ip:YYYY-MM-DD", using the IP address as the member; with the zincrby command noted above you get per-IP traffic data. Now, in your report, you can fetch the sorted set and do lookups on the IPs. To save work when running reports, you can set up a hash in Redis that maps each IP address to the location you want, for example "geo:country" as the key and the IP as the hash field with the country code as the stored value.
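Sketched out (the IP and country are placeholders, and the GeoIP lookup itself is outside the snippet):

    import redis

    r = redis.Redis()

    ip = "203.0.113.7"

    # Per-IP traffic counter, same pattern as the per-URL one.
    # (redis-py 3.x argument order: name, amount, member.)
    r.zincrby("traffic-by-ip:2011-05-18", 1, ip)

    # Offline process: resolve the IP once and cache it in the hash.
    r.hset("geo:country", ip, "US")

    # Report time: look the IP up in the hash instead of re-resolving it.
    country = r.hget("geo:country", ip)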
A big caveat I would add is that if your traffic level is very high, you may want to run two instances of Redis (or more, depending on traffic). The first would be the write instance; it would not have the bgsave option enabled. If your traffic is quite high, you would otherwise always be doing a bgsave. That is what I recommend the second instance for: it is a slave of the first and it does the saves to disk. You can also run your queries against the slave to distribute load.
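In application terms that split is just two connections (the hostnames here are made up):

    import redis

    # Writes go to the master (bgsave disabled there); reads and reports
    # go to the slave, which is the instance that persists to disk.
    write_r = redis.Redis(host="redis-master.example")
    read_r = redis.Redis(host="redis-slave.example")

    write_r.zincrby("traffic-by-url:/foo.html:2011-05-18", 1, "01:04")
    top_ten = read_r.zrevrange("traffic-by-url:/foo.html:2011-05-18", 0, 9)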
I hope this gives you some ideas and things to try out. Play around with the different options to see what works best for your specific needs. I track a lot of statistics on a high-traffic website (as well as MTA log statistics) in Redis and it performs beautifully - combined with Django and the Google Visualization API, I get very nice-looking graphs.