Algorithm / Heuristic for grouping chat message history using "conversation" / implicit sessions from timestamps?

Problem: I have a series of chat messages - between two users - with timestamps. I could imagine, say, an entire day of chat messages right away. Throughout the day, however, there were several discrete conversations / sessions ... and it would be more beneficial for the user to see them divided, rather than all days, as one continuous stream.

Is there an algorithm or heuristic that can β€œinfer” implicit / conversation / breaks from timestamps? In addition to the arbitrary "if the interval is more than x minutes, this is a separate session." And if this is the only case, how is this interval determined? In any case, I would like to avoid this.

For example, there are fifty messages that are sent between 2:00 and 3:00, followed by a break, and then twenty messages sent between 4:00 and 5:00. There will be a gap between them, but how to determine the gap?

I am sure that there is already literature on this subject, but I just do not know what to look for.

I've been playing around with things like border detection algorithms and gradient based approaches for a while.

(see comments for clarification)

+6
source share
1 answer

EDIT (best idea):

You can view each message as having two types:

  • Continuation of the previous conversation
  • New conversation

You can model these two types of messages as independent Poisson processes , where the time difference between adjacent messages is an exponential distribution .

Then you can empirically determine the exponential parameters for these two types of messages manually (it will not be too difficult to do if you take into account some initial data). You now have a model for these two events.

Finally, when a new message arrives, you can calculate the likelihood that the message will be of type 1 or type 2. If type 2, then you have a new conversation.

Clarification:

The likelihood that the message is a new conversation, given that the delay is some time T

 P(new conversation | delay=T) = P(new conversation AND delay=T)/P(delay=T) 

Using Bayes rule:

 = P(delay=T | new conversation)*P(new conversation)/P(delay=T) 

The same calculation is performed for P(old conversation | delay=T) .

P(delay=T | new conversation) comes from the model. P(new conversation) easily calculated from the data used to create your model. P(delay=T) you do not need to calculate at all, since all you want to do is to compare the two probabilities.


The difference between timestamps between adjacent messages depends on the type of conversation and the participation of people. Thus, you will need an algorithm that takes into account local characteristics, in contrast to the global threshold parameter.

My suggestion would be as follows:

  • Get the time difference between the last 10 contiguous messages.
  • Calculate the average (or median)
  • If the delay to the next message is more than 30 times the average, this is a new conversation.

Of course, I came up with these numbers in place. They should be customized to fit your purpose.

+3
source

Source: https://habr.com/ru/post/921240/


All Articles