"(MySQL) will not be sufficient for my needs (in terms of scalability, speed, etc.)"
Well... Facebook uses MySQL to a large extent. When it comes to NoSQL, I think it's not so much about the technology as it is about the data structures and algorithms.
What you are facing is a high write throughput situation. One approach to high write throughput that works well for your problem is sharding: no matter how big the machine or how efficient the software, there is a limit to the number of writes a single machine can handle. Sharding distributes the data across multiple servers, so writes are spread out as well. For example, users A-M write to server 1 and users N-Z write to server 2.
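To make the idea concrete, here is a deliberately naive sketch of that range-based routing in Python with pymongo. The host names and the analytics/events names are made up, and in a real MongoDB cluster the mongos router does this routing for you:

```python
from pymongo import MongoClient

# Hypothetical shard hosts; in a real cluster you would talk to mongos instead.
shard_1 = MongoClient("mongodb://shard1.example.com:27017")["analytics"]
shard_2 = MongoClient("mongodb://shard2.example.com:27017")["analytics"]

def pick_shard(username: str):
    """Range-based routing: users A-M go to shard 1, N-Z to shard 2."""
    return shard_1 if username[:1].upper() <= "M" else shard_2

def record_visit(username: str, page: str):
    # Each server only sees a slice of the traffic, so total write
    # throughput grows with the number of shards.
    pick_shard(username)["events"].insert_one({"user": username, "page": page})
```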
The penalty is complexity: you have to balance the load between shards, aggregations across all shards become more complicated, you have to maintain several independent databases, and so on.
This is where the technology matters: sharding with MongoDB is fairly painless because it supports auto-sharding, which does most of the unpleasant work for you. I don't think you will need it at 500 inserts per second, but it's good to know it's there.
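For what it's worth, turning it on is only a couple of admin commands. A minimal sketch, assuming you are connected to a mongos router and that the database and collection are called analytics.events (hypothetical names):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://mongos.example.com:27017")

# Allow the "analytics" database to be distributed across shards.
client.admin.command("enableSharding", "analytics")

# Shard the events collection on customerId; the balancer then splits and
# migrates chunks between the shards automatically.
client.admin.command("shardCollection", "analytics.events",
                     key={"customerId": 1})
```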
For schema design, it is important to think about the shard key, which determines which shard is responsible for a document. That may depend on your traffic patterns. Suppose you have a customer who runs a fairground: once a year their site is swamped with traffic, but for the other 360 days it is one of the lowest-traffic sites around. Now, if you shard by CustomerId, that particular customer can cause problems. On the other hand, if you shard by VisitorId, you will have to hit every shard for a simple count().
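To illustrate the trade-off (client, collection, and field names are hypothetical), note that the very same query is either targeted or scatter-gather depending on the shard key:

```python
from pymongo import MongoClient

events = MongoClient("mongodb://mongos.example.com:27017")["analytics"]["events"]

# Sharded on customerId: the filter contains the shard key, so mongos can
# route the query to the one shard owning that customer's chunks -- but the
# fairground customer's yearly spike then also hammers a single shard.
# Sharded on visitorId: writes spread evenly across shards, but this filter
# no longer contains the shard key, so mongos has to scatter the count to
# every shard and merge the partial results.
events.count_documents({"customerId": 42})
```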
The analytics part depends a lot on the queries you want to support. Real slice & dice is quite tricky, I would say, especially if you want to support near real-time analytics. A much simpler approach is to limit the options the user has and offer only a small set of operations. Those results can also be cached, so you don't have to redo all the aggregations every time.
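One common way to do that caching is to pre-aggregate counters at write time, so the limited set of dashboard operations never has to crunch the raw log. A sketch with made-up collection and field names:

```python
from datetime import datetime, timezone
from pymongo import MongoClient

db = MongoClient()["analytics"]

def record_pageview(customer_id, page):
    now = datetime.now(timezone.utc)
    # Keep the raw event for later ad-hoc analysis.
    db.events.insert_one({"customerId": customer_id, "page": page, "ts": now})
    # Bump a pre-aggregated daily counter; the dashboard reads these
    # documents directly instead of re-running the aggregation every time.
    db.daily_stats.update_one(
        {"customerId": customer_id, "day": now.strftime("%Y-%m-%d")},
        {"$inc": {"views": 1}},
        upsert=True,
    )
```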
In general, analytics can be tricky because many features require relations. For example, for cohort analysis you only want to consider log entries that were created by a specific group of users. A $in query will do the trick for smaller cohorts, but if we are talking about tens of thousands of users, it won't cut it. You could pick a random subset of the users instead, because that should be statistically sufficient, but of course that depends on your specific requirements.
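As a rough sketch of both ideas ($in for small cohorts, random sampling for huge ones), with hypothetical collections, fields, and cohort criterion:

```python
import random
from pymongo import MongoClient

db = MongoClient()["analytics"]

# The cohort: users who signed up in a given month (hypothetical criterion).
cohort_ids = [u["_id"] for u in
              db.users.find({"signupMonth": "2011-05"}, {"_id": 1})]

# Fine for small cohorts: a single $in filter over the log entries.
entries = db.events.find({"userId": {"$in": cohort_ids}})

# For tens of thousands of users, a random subset of the cohort is often
# statistically good enough -- whether it is depends on your requirements.
sample = random.sample(cohort_ids, min(1000, len(cohort_ids)))
sampled_entries = db.events.find({"userId": {"$in": sample}})
```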
Map/Reduce is useful for analyzing larger amounts of data: the processing happens on the server, and Map/Reduce also benefits from sharding because jobs can be processed by each shard individually. However, depending on a gazillion factors, those jobs will take some time.
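A minimal Map/Reduce sketch counting events per page (collection and field names are made up; recent MongoDB versions prefer the aggregation pipeline for this kind of job):

```python
from bson.code import Code
from pymongo import MongoClient

db = MongoClient()["analytics"]

map_fn = Code("function () { emit(this.page, 1); }")
reduce_fn = Code("function (key, values) { return Array.sum(values); }")

result = db.command({
    "mapReduce": "events",
    "map": map_fn,
    "reduce": reduce_fn,
    "out": {"inline": 1},  # return results directly instead of writing a collection
})

# On a sharded collection each shard maps/reduces its own chunks and the
# partial results are merged, which is why sharding helps here.
for doc in result["results"]:
    print(doc["_id"], doc["value"])
```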
I believe the Boxed Ice blog has some information on this; they definitely have experience handling lots of analytics data with MongoDB.