Algorithm for determining the quality of an article

I am working on a project that requires me to parse news articles and determine the best among them. I found that to determine an article's quality I will need three main parameters: the length of the article, its Facebook shares / retweets, and the time since the article was published.

The problem I am facing right now is: how do I combine all three parameters into one mathematical function and come up with a score for each article? The score assigned to each article will let me rank the articles and show them to users.

Also, let me know if there are any other parameters I should consider when determining quality.

2 answers

I'm not sure exactly what the nature of your project is, but this task is certainly very difficult. How do you account for the fact that the most shared / liked articles are often the most polarizing ones? The number of likes / shares also clearly depends on how popular the news site is. I suspect that any automated text analysis would not be accurate enough and could easily be gamed. Your best bet is to look for indicative proxies, for example:

  • Website credibility, as measured by ranking in Google search results
  • Website popularity, as measured by traffic
  • The number of Facebook likes / shares, as you mentioned
  • The number of other places online that link to the article

Since a dataset labeling articles by quality will be difficult to find, you probably will not be able to perform any statistical analysis. Instead, you will just have to make up a formula and weigh the parameters using your best judgment. To support this a bit, keep a few sample articles at hand and see what scores different formulas give them.
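As a concrete sketch of this "formula plus hand-tuned weights" idea, here is a hypothetical example in Python; the feature names, the normalization of every signal to [0, 1], and the weight values are all assumptions for illustration, not part of the answer.

```python
# Hypothetical weighted score over the hand-picked proxy signals above.
# Each signal is assumed to be pre-normalized to the range [0, 1];
# the weights are guesses to be tuned against articles you judge by hand.

def article_score(search_rank, traffic, shares, inbound_links,
                  weights=(0.3, 0.2, 0.3, 0.2)):
    """Return the weighted sum of the four proxy signals."""
    features = (search_rank, traffic, shares, inbound_links)
    return sum(w * f for w, f in zip(weights, features))

# Compare two made-up articles under the same weights.
print(round(article_score(0.8, 0.5, 0.9, 0.4), 2))  # 0.69
print(round(article_score(0.2, 0.9, 0.1, 0.3), 2))  # 0.33
```

Checking the scores of a handful of articles you have already ranked by eye is a quick way to sanity-check a chosen weight vector.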


What you want is actually quite simple. There are two types of data that interest you: increasing data and decreasing data. Increasing data is considered "better" the larger it gets. Decreasing data is considered "better" the closer it is to zero.

It turns out that all four quantities are plain non-negative integers:

Increasing data

  • shares: an integer s \in N_0 (any integer from zero upward)
  • retweets: an integer r \in N_0

Decreasing data

For the decreasing data, you use the absolute difference as the indicator:

  • Let t_0 be the timestamp (unix or so) of the article.
  • Let T be the current timestamp.
  • Let l_0 denote the length of the article considered "best."
  • Let L denote the actual length of the article.

Then:

  • time: |t_0 - T|, the closer to zero the better
  • length: |l_0 - L|, the closer to zero the better

Since each absolute value is a natural number:

|l_0 - L| + |t_0 - T| is close to zero exactly when both |t_0 - T| and |l_0 - L| are close to zero.

Likewise, the sum of the increasing numbers, s + r, grows when either of them grows.

So the closer the sum |l_0 - L| + |t_0 - T| is to zero, the more likely the article is both of the "right" length and fresh.
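A tiny numeric sketch of the decreasing part; the timestamps and the "best" length l_0 here are made-up values:

```python
# Made-up example: an article published an hour ago,
# 150 words away from an assumed ideal length of 800 words.
t0, T = 1_700_000_000, 1_700_003_600   # publication vs. current unix time
l0, L = 800, 950                       # ideal vs. actual length in words

penalty = abs(l0 - L) + abs(t0 - T)
print(penalty)  # 150 + 3600 = 3750; smaller is better
```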

The score

A ratio of an increasing number to a decreasing number itself increases. Think about it: the smaller the denominator, the greater the ratio; the larger the numerator, the greater the ratio.

This means: the score

(s + r) / (|l_0 - L| + |t_0 - T|)

is "better" the larger it is.

This is not necessarily an integer.
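A minimal implementation of this score, with one addition not in the answer: a guard for the denominator reaching exactly zero (an article of exactly the ideal length published right now):

```python
def score(s, r, t0, T, l0, L):
    """(s + r) / (|l0 - L| + |t0 - T|) -- larger is better."""
    denom = abs(l0 - L) + abs(t0 - T)
    # The formula above is undefined at denom == 0; treating that
    # case as "best possible" is an assumption added here.
    return float('inf') if denom == 0 else (s + r) / denom

print(score(s=10, r=5, t0=0, T=3, l0=100, L=102))  # 15 / 5 = 3.0
```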

Addendum

You can moderate the growth of shares and retweets, so the score grows a little more "naturally", by applying ln:

ln(s+r) / (|l_0 - L| + |t_0 - T|)

You can also use exp to smooth the denominator (and keep it from ever reaching zero):

ln(s+r) / exp(|l_0 - L| + |t_0 - T|)
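The two refinements combined, as a sketch. Two caveats of mine, not the answer's: ln(s + r) requires s + r ≥ 1, and exp of a large raw penalty underflows to zero, so in practice the penalty should be rescaled (e.g. time in days, length in hundreds of words):

```python
import math

def soft_score(s, r, t0, T, l0, L):
    """ln(s + r) / exp(penalty), equivalently ln(s + r) * exp(-penalty)."""
    # Assumes s + r >= 1; rescale the penalty units in practice,
    # or exp(-penalty) underflows to 0.0 for large raw differences.
    penalty = abs(l0 - L) + abs(t0 - T)
    return math.log(s + r) * math.exp(-penalty)

print(soft_score(s=10, r=5, t0=0, T=0, l0=100, L=100))  # ln(15), about 2.708
```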


Source: https://habr.com/ru/post/1489616/

