This can certainly be done in a variety of ways. I'll address each of the options you listed, along with some additional commentary.
1) If Nginx can do it, let it. I do this with Apache as well as with JBoss and Tomcat, then use syslog-ng to collect the logs centrally and process them from there. For this route I'd suggest a delimited message format, such as tab-separated, since it makes parsing and reading easier. I don't know about logging PHP variables, but it can certainly log header and cookie information. If you intend to use the Nginx logging at all, I'd recommend this route if possible - why log twice?
2) There is no "inability to query the data at a later date"; more on that below.
3) This is an option, but whether it is useful depends on how long you want to keep the data and how much cleanup you are willing to write. More on that below.
4) MongoDB could certainly work, but you will have to write the queries yourself, and they are not simple SQL commands.
Now, on to storing the data in Redis. I currently log with syslog-ng as noted above and use a program destination to parse the data and stuff it into Redis. In my case I have several grouping criteria, such as by vhost and by cluster, so my structures may be a bit different. The first question you need to ask is: "what do I want to get out of this data?" Some of it will be counters, such as traffic rates. Some of it will be aggregates, and still more will be things like "order my pages by popularity".
I'll demonstrate some of the techniques for getting this into Redis easily (and thus back out).
First, let's consider traffic statistics over time. Start by deciding on the granularity: do you want per-minute statistics, or will per-hour statistics suffice? Here is one way to track the traffic of a given URL:
Store the data in a sorted set under the key "traffic-by-url:URL:YYYY-MM-DD". Within this sorted set you use zincrby and supply the member "HH:MM". For example, in Python, where "r" is your Redis connection:
r.zincrby("traffic-by-url:/foo.html:2011-05-18", "01:04", 1)
This example increments the counter for the URL "/foo.html" on May 18 at 1:04 AM.
To get the data for a given day, call zrange on the key ("traffic-by-url:URL:YYYY-MM-DD") to get the sorted set from least popular to most popular. To get the top 10, for example, you would use zrevrange and give it the range; zrevrange returns the reverse sort, with the most-hit member at the top. Several more sorted set commands are available that let you do nice queries such as pagination, getting a range of results by minimum score, and so on.
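As a rough sketch of that lookup with redis-py (the URL and date are just the placeholders from above):

    import redis

    r = redis.Redis()

    # Key for one URL on one day, following the scheme above (placeholder values).
    key = "traffic-by-url:/foo.html:2011-05-18"

    # Whole day, least to most popular minute, with hit counts.
    all_minutes = r.zrange(key, 0, -1, withscores=True)

    # Top 10 busiest minutes, most hits first.
    top_ten = r.zrevrange(key, 0, 9, withscores=True)

    for minute, hits in top_ten:
        print(minute.decode(), int(hits))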
You can simply alter or extend your key name to handle different time windows. By combining this with zunionstore you can automatically roll up to less granular time periods. For example, you could take the union of all the keys in a week or a month and store the result in a new key such as "traffic-by-url:monthly:URL:YYYY-MM". By doing the above over all URLs on a given day you can get daily totals. Of course, you could also keep a daily total-traffic key and increment that. It mostly depends on when you want the data to be fed in - offline via log file import or as part of the user experience.
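As a rough sketch of that roll-up (the weekly key name and the date range are made-up examples of the pattern):

    import redis

    r = redis.Redis()

    # Merge one week of daily keys for /foo.html into a single weekly key.
    # zunionstore sums the scores by default, so the per-minute counters
    # from each day are added together. Key names are illustrative only.
    daily_keys = ["traffic-by-url:/foo.html:2011-05-%02d" % day
                  for day in range(16, 23)]
    r.zunionstore("traffic-by-url:weekly:/foo.html:2011-05-16", daily_keys)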
I would recommend not doing too much of this during the actual user session, as it extends the time it takes for your users to experience the page (and adds server load). Ultimately this is a call you will make based on traffic levels and resources.
As you can imagine, the above storage scheme can be applied to any counter-based statistic you want or can derive. For example, change the URL to a user ID and you have per-user tracking.
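Roughly, that swap looks like this (the key name is just an illustration):

    # Same counter pattern keyed by user ID instead of URL.
    # redis-py 3.x argument order is (name, amount, member); older
    # clients, like the earlier example, use (name, member, amount).
    r.zincrby("traffic-by-user:42:2011-05-18", 1, "01:04")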
You can also store raw logs in Redis. I do this for some logs, storing them as JSON strings (I keep them as key-value pairs). Then I have a second process that pulls them out and does things with the data.
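One simple way to sketch the same idea is a Redis list used as a queue (the key name "log-queue" is made up here, and I store mine a bit differently as noted):

    import json
    import redis

    r = redis.Redis()

    # Producer: push each raw log entry as a JSON string.
    entry = {"url": "/foo.html", "status": 200, "ip": "203.0.113.7"}
    r.rpush("log-queue", json.dumps(entry))

    # The "second process": block until an entry arrives, decode it,
    # and do whatever post-processing you need with the data.
    _key, raw = r.blpop("log-queue")
    record = json.loads(raw)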
For storing raw hits you could also use sorted sets, using the epoch time as the score, and easily grab a time window with the zrange/zrevrange (or by-score) commands. Or store them under a key based on the user ID. Plain sets would work for this, as would sorted sets.
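For example, a sketch of the per-user variant (the key name and member format are just illustrations):

    import time
    import redis

    r = redis.Redis()

    # The hit itself is the member, the epoch timestamp is the score.
    # (The zadd mapping form is redis-py 3.x; older clients take score/member pairs.)
    key = "hits-by-user:42"
    r.zadd(key, {"/foo.html 2011-05-18T01:04:00": time.time()})

    # Grab a time window by score: everything in the last hour, newest first.
    now = time.time()
    recent = r.zrevrangebyscore(key, now, now - 3600)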
Another option I have not discussed, but which may be useful for some of your data, is storing it as a hash. This can be useful, for example, for storing detailed information about a given session.
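A small sketch of what that might look like (the key and fields are made up):

    import redis

    r = redis.Redis()

    session_key = "session:abc123"

    # One hash per session; each field is one piece of session detail.
    # (hset with mapping= needs redis-py 3.5+; older clients use hmset.)
    r.hset(session_key, mapping={
        "user_id": "42",
        "ip": "203.0.113.7",
        "user_agent": "Mozilla/5.0",
        "started": "2011-05-18 01:04",
    })

    ip = r.hget(session_key, "ip")      # read a single field
    details = r.hgetall(session_key)    # or the whole session at once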
If you really need the data in a database, try using Redis' Pub/Sub feature and have a subscriber that parses the messages into a delimited format and dumps them to a file. Then have an import process that uses the COPY command (or the equivalent for your database) to import in bulk. Your database will thank you.
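A minimal sketch of such a subscriber, assuming the hits are published as tab-delimited lines on a channel I'll call "log-feed":

    import redis

    r = redis.Redis()
    pubsub = r.pubsub()
    pubsub.subscribe("log-feed")

    # Append every published line to a file that the bulk COPY import
    # will pick up later.
    with open("hits.tsv", "a") as out:
        for message in pubsub.listen():
            if message["type"] != "message":
                continue  # skip the subscribe confirmation
            out.write(message["data"].decode() + "\n")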
A final bit of advice (I have probably taken enough of your mental time already) is to make judicious and liberal use of expire. With Redis 2.2 or newer you can set an expiration even on counter keys. The big advantage here is automatic data cleanup. Imagine you follow a scheme like the one outlined above. By using the expiration commands you can automatically purge old data. Perhaps you want hourly statistics for 3 months, then only daily statistics; daily statistics for 6 months, then only monthly statistics. Simply expire your hourly keys after three months (86400 * 90) and your daily keys after six (86400 * 180), and you won't need to do the cleanup yourself.
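In code, that retention policy is just a matter of setting the TTL when you touch the key (the "hourly:"/"daily:" key names here are illustrative):

    import redis

    r = redis.Redis()

    hourly_key = "traffic-by-url:hourly:/foo.html:2011-05-18"
    daily_key = "traffic-by-url:daily:/foo.html:2011-05-18"

    # Set (or refresh) the TTL right after incrementing the counters.
    r.expire(hourly_key, 86400 * 90)    # keep hourly stats ~3 months
    r.expire(daily_key, 86400 * 180)    # keep daily stats ~6 months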
For geotagging, I do offline processing of the IPs. Imagine a sorted set with the key structure "traffic-by-ip:YYYY-MM-DD", using the IP address as the member; with the zincrby command noted above you get per-IP traffic data. Now, in your report, you can fetch the sorted set and do lookups on the IPs. To save work when running reports, you can set up a hash in Redis that maps each IP address to the location you want, for example "geo:country" as the key and the IP as the hash field with the country code as the stored value.
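Sketched out (the IP and country are placeholders, and the GeoIP lookup itself is outside the snippet):

    import redis

    r = redis.Redis()

    ip = "203.0.113.7"

    # Per-IP traffic counter, same pattern as the per-URL one.
    # (redis-py 3.x argument order: name, amount, member.)
    r.zincrby("traffic-by-ip:2011-05-18", 1, ip)

    # Offline process: resolve the IP once and cache it in the hash.
    r.hset("geo:country", ip, "US")

    # Report time: look the IP up in the hash instead of re-resolving it.
    country = r.hget("geo:country", ip)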
A big caveat I would add is that if your traffic level is very high, you may want to run two instances of Redis (or more, depending on traffic). The first would be the write instance; it would not have the bgsave option enabled. If your traffic is quite high, you would otherwise always be doing a bgsave. That is what I recommend the second instance for: it is a slave of the first and it does the saves to disk. You can also run your queries against the slave to distribute load.
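In application terms that split is just two connections (the hostnames here are made up):

    import redis

    # Writes go to the master (bgsave disabled there); reads and reports
    # go to the slave, which is the instance that persists to disk.
    write_r = redis.Redis(host="redis-master.example")
    read_r = redis.Redis(host="redis-slave.example")

    write_r.zincrby("traffic-by-url:/foo.html:2011-05-18", 1, "01:04")
    top_ten = read_r.zrevrange("traffic-by-url:/foo.html:2011-05-18", 0, 9)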
I hope this gives you some ideas and things to try out. Play around with the different options to see what works best for your specific needs. I track a lot of statistics on a high-traffic website (as well as MTA log statistics) in Redis and it performs beautifully - combined with Django and the Google Visualization API, I get very nice-looking graphs.