Calculation of the probability of spam

Question

I am building a website in Python/Django and want to predict whether a user submission is valid or spam.

Users have an acceptance rate for their submissions, as on this website.

Users can moderate other users' submissions, and these moderations are later meta-moderated by an administrator.

Given this:

Registered user A, with a 60% acceptance rate, submits something.
User B moderates the post as a valid submission. However, user B is wrong 70% of the time.
User C moderates the post as spam. User C is usually right: if user C says that something is spam or not spam, this is correct 80% of the time.

How can I predict the probability that the post is spam?

Edit: I wrote a Python script simulating this scenario:

#!/usr/bin/env python3

import random

def submit(p):
    """Return 'ham' with (p*100)% probability, otherwise 'spam'."""
    return 'ham' if random.random() < p else 'spam'

def moderate(p, ham_or_spam):
    """Moderate ham as ham and spam as spam with (p*100)% probability."""
    if ham_or_spam == 'spam':
        return 'spam' if random.random() < p else 'ham'
    if ham_or_spam == 'ham':
        return 'ham' if random.random() < p else 'spam'

NUMBER_OF_SUBMISSIONS = 100000
USER_A_HAM_RATIO = 0.6  # Will submit 60% ham
USER_B_PRECISION = 0.3  # Will moderate a submission correctly 30% of the time
USER_C_PRECISION = 0.8  # Will moderate a submission correctly 80% of the time

user_a_submissions = [submit(USER_A_HAM_RATIO)
                      for _ in range(NUMBER_OF_SUBMISSIONS)]

print("User A has made %d submissions. %d of them are 'ham'."
      % (len(user_a_submissions), user_a_submissions.count('ham')))

user_b_moderations = [moderate(USER_B_PRECISION, ham_or_spam)
                      for ham_or_spam in user_a_submissions]

user_b_moderations_which_are_correct = \
    [i for i, j in zip(user_a_submissions, user_b_moderations) if i == j]

print("User B has correctly moderated %d submissions."
      % len(user_b_moderations_which_are_correct))

user_c_moderations = [moderate(USER_C_PRECISION, ham_or_spam)
                      for ham_or_spam in user_a_submissions]

user_c_moderations_which_are_correct = \
    [i for i, j in zip(user_a_submissions, user_c_moderations) if i == j]

print("User C has correctly moderated %d submissions."
      % len(user_c_moderations_which_are_correct))

# Count the cases where B says 'spam' and C says 'ham', split by the true label.
i = 0  # B says 'spam' and C says 'ham'
j = 0  # ... and the submission really is spam
k = 0  # ... and the submission really is ham
for a, b, c in zip(user_a_submissions, user_b_moderations, user_c_moderations):
    if b == 'spam' and c == 'ham':
        i += 1
        if a == 'spam':
            j += 1
        elif a == 'ham':
            k += 1

print("'spam' was identified as 'spam' by user B and 'ham' by user C %d times." % j)
print("'ham' was identified as 'spam' by user B and 'ham' by user C %d times." % k)
print("If user B says it spam and user C says it ham, it will be spam "
      "%.2f percent of the time, and ham %.2f percent of the time."
      % (float(j) / i * 100, float(k) / i * 100))

Running the script gives me this result:

User A has made 100000 submissions. 60094 of them are 'ham'.
User B has correctly moderated 29864 submissions.
User C has correctly moderated 79990 submissions.
'spam' was identified as 'spam' by user B and 'ham' by user C 2346 times.
'ham' was identified as 'spam' by user B and 'ham' by user C 33634 times.
If user B says it spam and user C says it ham, it will be spam 6.52 percent of the time, and ham 93.48 percent of the time.

Is this a reasonable probability? Would this be the correct way to simulate the scenario?
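For comparison with the simulated percentages, the same conditional probability can be computed in closed form. The sketch below is an illustration, not part of the original script: `posterior_spam` is a hypothetical helper, and it assumes that the moderators err independently of each other given the true label, which is also how the simulation draws their verdicts.

```python
def posterior_spam(prior_spam, verdicts):
    """Return P(spam | verdicts).

    verdicts is a list of (accuracy, said_spam) pairs, one per moderator.
    Assumption: moderators err independently given the true label.
    """
    likelihood_spam = 1.0  # P(verdicts | submission is spam)
    likelihood_ham = 1.0   # P(verdicts | submission is ham)
    for accuracy, said_spam in verdicts:
        if said_spam:
            likelihood_spam *= accuracy          # correct call on spam
            likelihood_ham *= (1.0 - accuracy)   # wrong call on ham
        else:
            likelihood_spam *= (1.0 - accuracy)  # wrong call on spam
            likelihood_ham *= accuracy           # correct call on ham
    joint_spam = likelihood_spam * prior_spam
    joint_ham = likelihood_ham * (1.0 - prior_spam)
    return joint_spam / (joint_spam + joint_ham)

# The case the script counts: A submits 40% spam,
# B (30% accurate) says 'spam', C (80% accurate) says 'ham'.
exact = posterior_spam(0.4, [(0.3, True), (0.8, False)])
print("%.2f percent spam" % (exact * 100))
```

With the script's parameters this gives about 6.67 percent spam, close to the 6.52 percent observed in the run above.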

+3

probability bayesian

Hobhouse 07 Jun '10 15:02

3

+2

ConcernedOfTunbridgeWells 07 Jun '10 15:15

-1

Peter Rowell 07 Jun '10 16:17

Alex Martelli · Accepted Answer · 2010-06-07T15:37:18+0000

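Bayes' theorem tells you:

The theorem's usual events A and B are renamed X and Y here, since you are already using A, B and C for the users, so it reads: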

P(X|Y) = P(Y|X) P(X) / P(Y)
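When P(Y) is not known directly, it can be expanded with the law of total probability over X and its complement. The complement is written as \neg X below, a symbol that is not used in the original answer:

```latex
\begin{aligned}
P(X \mid Y) &= \frac{P(Y \mid X)\,P(X)}{P(Y)} \\
            &= \frac{P(Y \mid X)\,P(X)}{P(Y \mid X)\,P(X) + P(Y \mid \neg X)\,P(\neg X)}
\end{aligned}
```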

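So, for example, X can be "this post _by A_ is spam" and Y the moderators' verdicts on it (say, Y is "B approved it, C rejected it").

Or X can simply be "this post is spam", with Y being "A has posted it, B approved it, C rejected it".

Either way you need three quantities: P(X), the prior probability of X before any verdict is seen; P(Y), the overall probability of the observed combination involving A, B and C; and P(Y | X), the probability of that combination when X holds.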

, , . : A 0,4 (, , ); B - 0,3, , -, , "" ( ); C - 0,8, , , , "" ( ).

, ! , C 80% , , - 40%, A, C ( , -), 80%, " ". , 20%, C 1/4 ( 1/16 -), .

Guessing for B, 30% , "", 20%, , B 1/4 5/16 -.

So: P(X)=0.2; P(Y)=0.3*0.2=0.06 (B approves and C rejects); P(Y|X)=0.4*0.25*0.75=0.075 (given spam, B approves and C rejects).

So P(X|Y)=0.075*0.2/0.06=0.25 under these guessed figures.

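Edit (I should have read the question more carefully!):

It seems I misread the 30%. B is wrong 7 times out of 10, so 70% of the time B's verdict, accept or reject, is the opposite of the truth. C being "right 80% of the time" means that C's verdict, spam or not spam, is correct 80% of the time and wrong the other 20%.

This changes the numbers (for both B and C). Note that B is a moderator who is wrong 70% of the time! ;-)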

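So: B approves a post by A with probability 0.6 * 0.3 (A's ham, judged correctly) + 0.4 * 0.7 (A's spam, judged wrongly) = 0.18 + 0.28 = 0.46; C rejects it with probability 0.8 * 0.4 + 0.2 * 0.6 = 0.32 + 0.12 = 0.44. So now we have:

P(X)=0.4 (not the 0.2 used earlier, since X concerns a post by A and A's spam rate is 0.4); P(Y)=0.46*0.56=0.2576 (B's and C's verdicts on A's post); P(Y|X)=0.7*0.8=0.56 (B's probability of approving spam times C's probability of rejecting spam).

So P(X|Y)=0.56*0.4/0.2576=0.87 (rounded). IOW: the prior probability that a post by A is spam is 0.4, but once B has approved it and C has rejected it, the probability that it is spam rises to about 87%.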
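The arithmetic of the edit can be cross-checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and instead of multiplying marginal probabilities it expands P(Y) with the law of total probability, assuming B and C err independently of each other given the true label (an assumption that the question's simulation also makes).

```python
# Figures from the question: A posts spam 40% of the time,
# B is right 30% of the time, C is right 80% of the time.
p_spam = 0.4
b_accuracy = 0.3
c_accuracy = 0.8

# Marginal verdict probabilities for a post by A.
p_b_approves = (1 - p_spam) * b_accuracy + p_spam * (1 - b_accuracy)  # 0.18 + 0.28
p_c_rejects = p_spam * c_accuracy + (1 - p_spam) * (1 - c_accuracy)   # 0.32 + 0.12

# Likelihood of Y = "B approved, C rejected" under each true label
# (assumed: B and C are independent given the label).
p_y_given_spam = (1 - b_accuracy) * c_accuracy  # 0.7 * 0.8
p_y_given_ham = b_accuracy * (1 - c_accuracy)   # 0.3 * 0.2

# P(Y) by total probability, then Bayes' theorem.
p_y = p_y_given_spam * p_spam + p_y_given_ham * (1 - p_spam)
posterior = p_y_given_spam * p_spam / p_y

print("P(B approves) = %.2f" % p_b_approves)
print("P(C rejects)  = %.2f" % p_c_rejects)
print("P(Y)          = %.4f" % p_y)
print("P(spam | Y)   = %.4f" % posterior)
```

Under that assumption P(Y) comes out as 0.26 and the posterior as roughly 0.86, which is in line with the figure of about 87% above.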
