Free implementation for counting user sessions from a web server log?

Web server log analyzers (such as Urchin) often display the number of “sessions”. A session is defined as a series of page visits or clicks made by one person within a limited, continuous span of time. Analyzers attempt to identify these spans using IP addresses, often supplemented by information such as the user agent and OS, together with a session timeout threshold such as 15 or 30 minutes.

For certain websites and applications, the user can be logged in and/or tracked with a cookie, which means the server can know exactly when a session starts. I'm not talking about that, but about inferring sessions heuristically ("session reconstruction") when the web server does not track them.

I could write some code, e.g. in Python, to try to reconstruct sessions based on the criteria mentioned above, but I would rather not reinvent the wheel. I am looking at log files of about 400K lines, so I would have to be careful to use a scalable algorithm.

My goal is to extract a list of unique IP addresses from the log file and, for each IP address, the number of sessions inferred from that log. Absolute accuracy and precision are not needed; reasonably good estimates are fine.

Based on this description:

a new request is put in an existing session if two conditions are valid:

  • the IP address and the user-agent are the same as those of the requests already inserted in the session,
  • the request is done less than fifteen minutes after the last request inserted.

it would be straightforward in principle to write a Python program that builds a dictionary (keyed by IP address) of dictionaries (keyed by user agent), whose values are pairs: (number of sessions, time of the latest request in the latest session).
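The structure described above can be sketched on a few in-memory requests. This is only an illustration under assumptions: the function name `count_sessions`, the 15-minute default (taken from the quoted rule), and the sample data are all hypothetical, and the requests are assumed to arrive in chronological order.

```python
from datetime import datetime, timedelta

# Assumed threshold, taken from the quoted rule; adjust as needed.
SESSION_TIMEOUT = timedelta(minutes=15)

def count_sessions(requests, timeout=SESSION_TIMEOUT):
    """requests: iterable of (ip, user_agent, timestamp) in chronological order.
    Returns {ip: {user_agent: [session_count, last_request_time]}}."""
    sessions = {}
    for ip, agent, when in requests:
        by_agent = sessions.setdefault(ip, {})
        entry = by_agent.get(agent)
        if entry is None:
            # First request from this IP/agent pair opens its first session.
            by_agent[agent] = [1, when]
        else:
            # A gap of at least `timeout` since the previous request opens a new session.
            if when - entry[1] >= timeout:
                entry[0] += 1
            entry[1] = when
    return sessions

# Hypothetical sample data.
t0 = datetime(2010, 9, 21, 23, 0, 0)
sample = [
    ("128.123.114.141", "Firefox", t0),
    ("128.123.114.141", "Firefox", t0 + timedelta(minutes=5)),   # same session
    ("128.123.114.141", "Firefox", t0 + timedelta(minutes=40)),  # 35-minute gap: new session
    ("128.123.114.141", "MSIE", t0 + timedelta(minutes=41)),     # different agent: new session
]
result = count_sessions(sample)
totals = {ip: sum(e[0] for e in agents.values()) for ip, agents in result.items()}
print(totals)  # {'128.123.114.141': 3}
```

Memory use grows with the number of distinct (IP, user agent) pairs rather than with the number of log lines, and each line costs one or two dictionary lookups, so a single pass over a few hundred thousand lines should be manageable.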


FYI, here is a sample line from the log file:

#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status 
2010-09-21 23:59:59 215.51.1.119 GET /graphics/foo.gif - 80 - 128.123.114.141 Mozilla/5.0+(Windows;+U;+Windows+NT+5.1;+en-US;+rv:1.9.2)+Gecko/20100115+Firefox/3.6+(.NET+CLR+3.5.30729) http://www.mysite.org/blarg.htm 200 0 0
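Since the log carries a `#Fields:` directive, the column positions of `c-ip` and `cs(User-Agent)` can be looked up by name instead of being hard-coded. A minimal sketch, assuming the fields are space-delimited and contain no embedded spaces (as in the sample line, where spaces in the user agent appear as `+`); the helper name `field_indexes` is hypothetical.

```python
def field_indexes(header_line):
    """Map each field name in a '#Fields:' directive to its column position."""
    names = header_line[len("#Fields:"):].split()
    return {name: pos for pos, name in enumerate(names)}

header = ("#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port "
          "cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus "
          "sc-win32-status")
record = ("2010-09-21 23:59:59 215.51.1.119 GET /graphics/foo.gif - 80 - "
          "128.123.114.141 Mozilla/5.0+(Windows;+U;+Windows+NT+5.1;+en-US;+rv:1.9.2)"
          "+Gecko/20100115+Firefox/3.6+(.NET+CLR+3.5.30729) "
          "http://www.mysite.org/blarg.htm 200 0 0")

idx = field_indexes(header)
fields = record.split()
client_ip = fields[idx["c-ip"]]
user_agent = fields[idx["cs(User-Agent)"]]
print(client_ip)  # 128.123.114.141
```

Looking the positions up this way keeps the parser working if the server is later configured to log a different set of fields.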

Here is my implementation in Python.

#!/usr/bin/env python

"""Reconstruct sessions: Take a space-delimited web server access log
including IP addresses, timestamps, and User Agent,
and output a list of the IPs, and the number of inferred sessions for each."""

## Input looks like:
# Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status
# 2010-09-21 23:59:59 172.21.1.119 GET /graphics/foo.gif - 80 - 128.123.114.141 Mozilla/5.0+(Windows;+U;+Windows+NT+5.1;+en-US;+rv:1.9.2)+Gecko/20100115+Firefox/3.6+(.NET+CLR+3.5.30729) http://www.site.org//baz.htm 200 0 0

import datetime
import operator

infileName = "ex100922.log"
outfileName = "visitor-ips.csv"

ipDict = {}

def inputRecords():
    infile = open(infileName, "r")

    recordsRead = 0
    progressThreshold = 100
    # The rule quoted below uses fifteen minutes; a 30-minute timeout is used here.
    sessionTimeout = datetime.timedelta(minutes=30)

    for line in infile:
        # Skip directive lines (e.g. "#Fields: ...") and blank lines.
        if line.startswith('#') or not line.strip():
            continue
        else:
            recordsRead += 1

            fields = line.split()
            # print "line of %d records: %s\n" % (len(fields), line)
            if (recordsRead >= progressThreshold):
                print("Read %d records" % recordsRead)
                progressThreshold *= 2

            # http://www.dblab.ntua.gr/persdl2007/papers/72.pdf
            #   "a new request is put in an existing session if two conditions are valid:
            #    * the IP address and the user-agent are the same of the requests already
            #      inserted in the session,
            #    * the request is done less than fifteen minutes after the last request inserted."

            theDate, theTime = fields[0], fields[1]
            newRequestTime = datetime.datetime.strptime(theDate + " " + theTime, "%Y-%m-%d %H:%M:%S")

            ipAddr, userAgent = fields[8], fields[9]

            if ipAddr not in ipDict:
                ipDict[ipAddr] = {userAgent: [1, newRequestTime]}
            else:
                if userAgent not in ipDict[ipAddr]:
                    ipDict[ipAddr][userAgent] = [1, newRequestTime]
                else:
                    ipdipaua = ipDict[ipAddr][userAgent]
                    if newRequestTime - ipdipaua[1] >= sessionTimeout:
                        ipdipaua[0] += 1
                    ipdipaua[1] = newRequestTime
    infile.close()
    return recordsRead

def outputSessions():
    outfile = open(outfileName, "w")
    outfile.write("#Fields: IPAddr Sessions\n")
    recordsWritten = len(ipDict)

    # ipDict[ip] is { userAgent1: [numSessions, lastTimeStamp], ... }
    for ip, val in ipDict.items():
        # Sum the session counts across all user agents seen for this IP.
        totalSessions = sum(v2[0] for v2 in val.values())
        outfile.write("%s\t%d\n" % (ip, totalSessions))

    outfile.close()
    return recordsWritten

recordsRead = inputRecords()

recordsWritten = outputSessions()

print("Finished session reconstruction: read %d records, wrote %d\n" % (recordsRead, recordsWritten))

Update: the run took 39 seconds to read 342K records and write 21K. Apparently about 3/4 of that time was spent in strptime()!
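Given how much of the run time goes to strptime(), one option worth trying is to build the datetime directly from slices of the fixed-width date and time fields. A sketch only, assuming every record uses exactly the `YYYY-MM-DD` and `HH:MM:SS` layout of the sample line; the helper name `parse_timestamp` is hypothetical, and any speedup should be measured on the real log rather than assumed.

```python
import datetime

def parse_timestamp(date_str, time_str):
    """Parse 'YYYY-MM-DD' and 'HH:MM:SS' by slicing instead of calling strptime()."""
    return datetime.datetime(
        int(date_str[0:4]), int(date_str[5:7]), int(date_str[8:10]),
        int(time_str[0:2]), int(time_str[3:5]), int(time_str[6:8]))

fast = parse_timestamp("2010-09-21", "23:59:59")
slow = datetime.datetime.strptime("2010-09-21 23:59:59", "%Y-%m-%d %H:%M:%S")
print(fast == slow)  # True
```

In the script above this would replace the strptime() call with `parse_timestamp(theDate, theTime)`. Unlike strptime(), it does not validate separators, so a malformed line may raise a less descriptive ValueError or, if the digits happen to line up, parse without complaint.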


Source: https://habr.com/ru/post/1766092/

