Determining the chances of an event before it has occurred

A user visits my site at time t , and they may or may not click on a specific link, which I care about if they record the fact that they clicked the link, as well as the duration with t that they clicked, call it d .

I need an algorithm that allows me to create a class like this:

class ClickProbabilityEstimate { public void reportImpression(long id); public void reportClick(long id); public double estimateClickProbability(long id); } 

Each impression gets a unique id , and this is used when you give a click to indicate which impression belongs to the click.

I need an algorithm that will return the probability, based on how much time has passed since the message about the impression, that a click will be received on the impression, based on how many previous clicks are required. Obviously, this probability can be expected to decrease over time if the click is still missing.

If necessary, we can set an upper limit beyond which we will consider the probability of a click equal to 0 (for example, if it were an hour from the moment the impression appeared, we can be sure that there will be no click).

The algorithm should be both space and time efficient, and I hope to make as few assumptions as possible, while being elegant. Ease of implementation will also be enjoyable. Any ideas?

+1
source share
3 answers

Assuming you save past impressions and clicks, it's easy: let them say you have an impression and time has passed d ' . You can divide your data into three groups:

  • Impressions with a click of less than d '
  • Impressions that got click after more than d '
  • Impressions that did not receive a click

Obviously, the current impression is not in group (1), so eliminate this. You want the probability to be in group (2), which then

 P = N2 / (N2 + N3) 

where N2 is the number of impressions in group 2 and similarly for N3 .

As for the actual implementation, first of all I would like to keep an ordered list of times d for past impressions that got clicks, as well as the number of impressions that never got clicked and just do a binary search of d ' in this list. The position you find will give you N1 , and then N2 is the list length minus N1 .

If you do not need perfect granularity, you can save the elapsed time as a histogram, that is, a list containing in each element of list[n] number of impressions that received a click after at least n but less than n+1 minutes. (Or seconds or any other time interval). In this case, you probably want to save the total number of clicks as a separate variable so that you can easily calculate N2 .

(By the way, I just did this, I don’t know if there are standard algorithms for this kind of thing that could be better)

+2
source

I would suggest a hypothesis about the arrival process (clicks per minute) and try to fit the distribution to this arrival process using your existing data. I bet that the result is the negative bin you get when you have a Poisson arrival process with an unsteady average if the average has a gamma distribution. The converse (minutes per click) gives you the distribution of the process between the prizes. I do not know if there is a distribution for this, but you can create an empirical one.

Hope this helps.

0
source

Source: https://habr.com/ru/post/1345171/


All Articles