EDIT (best idea):
You can view each message as having two types:
- Continuation of the previous conversation
- New conversation
You can model these two types of messages as independent Poisson processes , where the time difference between adjacent messages is an exponential distribution .
Then you can empirically determine the exponential parameters for these two types of messages manually (it will not be too difficult to do if you take into account some initial data). You now have a model for these two events.
Finally, when a new message arrives, you can calculate the likelihood that the message will be of type 1 or type 2. If type 2, then you have a new conversation.
Clarification:
The likelihood that the message is a new conversation, given that the delay is some time T
P(new conversation | delay=T) = P(new conversation AND delay=T)/P(delay=T)
Using Bayes rule:
= P(delay=T | new conversation)*P(new conversation)/P(delay=T)
The same calculation is performed for P(old conversation | delay=T)
.
P(delay=T | new conversation)
comes from the model. P(new conversation)
easily calculated from the data used to create your model. P(delay=T)
you do not need to calculate at all, since all you want to do is to compare the two probabilities.
The difference between timestamps between adjacent messages depends on the type of conversation and the participation of people. Thus, you will need an algorithm that takes into account local characteristics, in contrast to the global threshold parameter.
My suggestion would be as follows:
- Get the time difference between the last 10 contiguous messages.
- Calculate the average (or median)
- If the delay to the next message is more than 30 times the average, this is a new conversation.
Of course, I came up with these numbers in place. They should be customized to fit your purpose.
source share