Web server log analyzers (such as Urchin) often report a number of "sessions". A session is a series of page visits/clicks made by one person within a limited, contiguous time span. Analyzers attempt to identify these segments using IP addresses, often supplemented by information such as the user agent and OS, together with a session timeout threshold, for example 15 or 30 minutes.
For certain websites and applications, users can be registered and/or tracked with a cookie, which means the server can know exactly when a session starts. I'm not asking about that, but about deriving sessions heuristically ("session reconstruction") when the web server does not track them.
I could write some code (e.g. in Python) to attempt this reconstruction based on the criteria above, but I'd rather not reinvent the wheel. I'm looking at log files of about 400 thousand lines, so I need to be careful to use a scalable algorithm.
My goal is to extract from the log file a list of unique IP addresses and, for each IP address, the number of sessions deduced from that log. Absolute accuracy and precision are not needed ... pretty good estimates are fine.
Based on this description:
a new request is put into an existing session if two conditions hold:
- the IP address and user agent match those of the requests already in the session, and
- the request is made less than fifteen minutes after the last request inserted.
it would be straightforward, at least in theory, to write a Python program that builds a dictionary (keyed by IP) of dictionaries (keyed by user agent), whose value is a pair: (number of sessions, timestamp of the last request of the last session).
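A minimal sketch of that dict-of-dicts approach (the function name, the tuple input format, and the exact 15-minute threshold are my assumptions from the description above, not a definitive implementation):

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=15)

def count_sessions(requests):
    """Count sessions per (ip, user_agent) pair.

    `requests` is an iterable of (timestamp, ip, user_agent) tuples,
    assumed to arrive in chronological order (as in a log file).
    Returns {ip: {user_agent: session_count}}.
    """
    # state[ip][ua] = (session_count, timestamp_of_last_request)
    state = {}
    for ts, ip, ua in requests:
        per_ip = state.setdefault(ip, {})
        if ua in per_ip:
            count, last_ts = per_ip[ua]
            if ts - last_ts <= SESSION_TIMEOUT:
                per_ip[ua] = (count, ts)      # same session: update last-seen time
            else:
                per_ip[ua] = (count + 1, ts)  # timed out: start a new session
        else:
            per_ip[ua] = (1, ts)              # first request from this pair
    # Drop the timestamps, keep only the counts.
    return {ip: {ua: c for ua, (c, _) in uas.items()}
            for ip, uas in state.items()}
```

This is a single pass over the log and keeps only one (count, timestamp) pair per IP/user-agent combination in memory, so it should scale comfortably to a few hundred thousand lines.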
Any comments, thoughts, or suggestions are welcome.
FYI, here are the log header and a sample line (anonymized):
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status
2010-09-21 23:59:59 215.51.1.119 GET /graphics/foo.gif - 80 - 128.123.114.141 Mozilla/5.0+(Windows;+U;+Windows+NT+5.1;+en-US;+rv:1.9.2)+Gecko/20100115+Firefox/3.6+(.NET+CLR+3.5.30729) http:
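For what it's worth, a line in that format could be split on whitespace like this (field names are taken verbatim from the #Fields header above; this is a sketch, not a full log parser, and it relies on spaces in the user agent being encoded as "+" as in the sample):

```python
from datetime import datetime

# Field names copied from the "#Fields:" header line of the log.
FIELDS = ("date time s-ip cs-method cs-uri-stem cs-uri-query s-port "
          "cs-username c-ip cs(User-Agent) cs(Referer) sc-status "
          "sc-substatus sc-win32-status").split()

def parse_line(line):
    """Split one log line into a dict keyed by field name."""
    record = dict(zip(FIELDS, line.split()))
    # Combine date and time into one datetime for session-gap computations.
    record["timestamp"] = datetime.strptime(
        record["date"] + " " + record["time"], "%Y-%m-%d %H:%M:%S")
    return record
```

From each parsed record, the tuple needed for session counting would then be (record["timestamp"], record["c-ip"], record["cs(User-Agent)"]).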