How to implement an algorithm for related articles using this form of collaborative filtering

As the title implies, I have a problem implementing a related-articles algorithm. Let me start by listing the tables in the database:

[articles] id_article, id_category, name, content, publish_date, is_deleted
[categories] id_category, id_parent, name
[tags_to_articles] id_tag, id_article
[tags] id_tag, name
[articles_to_authors] id_article, id_author
[authors] id_author, name, is_deleted
[related_articles] id_article_left, id_article_right, related_score

Algorithm

Every table except related_articles already has data in it. Now I want to fill related_articles with scores between articles (very important: the table works as a directed graph, so the score of article A with article B may differ from the score of B with A, see the list below). The score is calculated as follows:

  • if the two articles have the same category, a number (x) is added to the score
  • for each author they have in common, a number (y) is added to the score
  • for each tag they have in common, a number (z) is added to the score
  • when calculating the score of article A with article B, the difference between NOW() and the publish_date of article B produces a number (t) that is subtracted from the score
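The scoring rules above can be sketched as a small function. This is only an illustration: the weight values are made up (the question names them x, y and z without giving values), and the assumption that t grows linearly with article B's age in days is mine, not the question's.

```python
from datetime import datetime

# Hypothetical weights: the question only names them x, y, z and t.
X_CATEGORY = 10.0   # x: same category
Y_AUTHOR = 5.0      # y: per shared author
Z_TAG = 2.0         # z: per shared tag
T_PER_DAY = 0.5     # assumed: t = 0.5 * age of article B in days


def related_score(a, b, now):
    """Score of article a with article b (directed: only b's age is penalised)."""
    score = 0.0
    if a["id_category"] == b["id_category"]:
        score += X_CATEGORY
    score += Y_AUTHOR * len(a["authors"] & b["authors"])
    score += Z_TAG * len(a["tags"] & b["tags"])
    age_days = (now - b["publish_date"]).days
    return score - T_PER_DAY * age_days


now = datetime(2024, 1, 31)
article_a = {"id_category": 1, "authors": {1, 2}, "tags": {10, 11, 12},
             "publish_date": datetime(2024, 1, 1)}
article_b = {"id_category": 1, "authors": {2}, "tags": {11, 12, 13},
             "publish_date": datetime(2024, 1, 21)}

print(related_score(article_a, article_b, now))  # 14.0
print(related_score(article_b, article_a, now))  # 4.0
```

The two results differ only because of the date term, which is what makes the table a directed graph.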

My first (inefficient) approach

I tried writing a query like this:

    SELECT a.id_article, b.id_article, a.id_category, a.publish_date,
           b.id_category, b.publish_date, c.id_tag, e.id_author
    FROM `articles` a, articles b,
         tags_to_articles c, tags_to_articles d,
         articles_to_authors e, articles_to_authors f
    WHERE a.id_article <> b.id_article
      AND (
            (a.id_article = c.id_article AND c.id_tag = d.id_tag AND d.id_article = b.id_article)
         OR (a.id_article = e.id_article AND e.id_author = f.id_author AND f.id_article = b.id_article)
         OR (a.id_category = b.id_category)
      )

In theory, this lists every pair of articles that is worth scoring. In practice, it takes far too much time and resources.

Is there another way? I am open to changing the algorithm or the tables if that leads to a workable solution. Note that the scores are calculated in a cron job; I do not expect this to run on every page request.

3 answers

I seriously doubt that you can do something like this in a single statement and get acceptable performance. Break it into pieces. Use temporary tables. Use set-based operations.

    -- First, list all article pairs that share a category.
    SELECT a1.id_article AS left_article,
           a2.id_article AS right_article,
           1 AS score
    INTO #tempscore
    FROM #articles a1
    INNER JOIN #articles a2
            ON a1.id_category = a2.id_category
           AND a1.id_article <> a2.id_article;

    -- Now add every pair that shares an author.
    INSERT INTO #tempscore (left_article, right_article, score)
    SELECT ata1.id_article, ata2.id_article, 2
    FROM #articles_to_authors ata1
    INNER JOIN #articles_to_authors ata2 ON ata1.id_author = ata2.id_author;

    -- Now add every pair that shares a tag.
    INSERT INTO #tempscore (left_article, right_article, score)
    SELECT tta1.id_article, tta2.id_article, 4
    FROM #tags_to_articles tta1
    INNER JOIN #tags_to_articles tta2 ON tta1.id_tag = tta2.id_tag;

    -- We haven't looked at dates yet, but consolidate what we know.
    SELECT left_article, right_article, SUM(score) AS total_score
    INTO #cscore
    FROM #tempscore
    GROUP BY left_article, right_article;

    -- Remove the self-pairs produced by the author and tag joins.
    DELETE FROM #cscore WHERE left_article = right_article;

    -- Now deal with dates. The difference is negative here because art2 is
    -- older than art1, so adding it later lowers the score.
    SELECT DateDiff(Day, art1.publish_date, art2.publish_date) AS datescore,
           art1.id_article AS left_article,
           art2.id_article AS right_article
    INTO #datescore
    FROM #cscore
    INNER JOIN #articles art1 ON #cscore.left_article = art1.id_article
    INNER JOIN #articles art2 ON #cscore.right_article = art2.id_article
    WHERE art1.publish_date > art2.publish_date;

    -- And finally, put it all together.
    INSERT INTO #related_articles (id_article_left, id_article_right, related_score)
    SELECT s1.left_article, s1.right_article, s1.total_score + IsNull(s2.datescore, 0)
    FROM #cscore s1
    LEFT JOIN #datescore s2
           ON s1.left_article = s2.left_article
          AND s1.right_article = s2.right_article;

In my testing the results seemed to come out correctly, but I don't have real sample data to go on, so I can't be sure. If nothing else, this should give you a place to start.
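For readers without SQL Server at hand, here is a rough Python sketch of the same staged idea using the standard sqlite3 module and an in-memory database. The sample rows are invented, the weights 1, 2 and 4 are the ones used in this answer, and the date step is left out to keep it short, so treat it as an illustration rather than a port.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles (id_article INTEGER PRIMARY KEY, id_category INTEGER);
CREATE TABLE articles_to_authors (id_article INTEGER, id_author INTEGER);
CREATE TABLE tags_to_articles (id_tag INTEGER, id_article INTEGER);

-- Invented sample data: articles 1 and 2 share a category, an author and two tags.
INSERT INTO articles VALUES (1, 1), (2, 1), (3, 2);
INSERT INTO articles_to_authors VALUES (1, 100), (2, 100), (3, 200);
INSERT INTO tags_to_articles VALUES (10, 1), (11, 1), (10, 2), (11, 2), (11, 3);

CREATE TEMP TABLE tempscore (left_article INTEGER, right_article INTEGER, score INTEGER);

-- Stage 1: pairs that share a category.
INSERT INTO tempscore
SELECT a1.id_article, a2.id_article, 1
FROM articles a1 JOIN articles a2
  ON a1.id_category = a2.id_category AND a1.id_article <> a2.id_article;

-- Stage 2: one row per shared author.
INSERT INTO tempscore
SELECT x.id_article, y.id_article, 2
FROM articles_to_authors x JOIN articles_to_authors y
  ON x.id_author = y.id_author AND x.id_article <> y.id_article;

-- Stage 3: one row per shared tag.
INSERT INTO tempscore
SELECT x.id_article, y.id_article, 4
FROM tags_to_articles x JOIN tags_to_articles y
  ON x.id_tag = y.id_tag AND x.id_article <> y.id_article;
""")

# Consolidate: sum the partial scores per directed pair.
rows = conn.execute("""
    SELECT left_article, right_article, SUM(score)
    FROM tempscore
    GROUP BY left_article, right_article
    ORDER BY left_article, right_article
""").fetchall()

for row in rows:
    print(row)
```

Each stage is a plain self-join on one small table, which is the point of splitting the work: no stage ever joins all six table aliases at once.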


Your approach has the right concept: you will need the Cartesian product of the articles table with itself. This is the best solution I could come up with, but it will need some testing:

    -- x, y, z and CHECKPOINT_ID are placeholders to be replaced with real values.
    -- Note: the joins yield one row per author/tag combination of a pair,
    -- so the rows may need to be summed per pair (GROUP BY) before use.
    INSERT INTO related_articles
    SELECT a_left.id_article,
           a_right.id_article,
           IF(a_left.id_category = a_right.id_category, x, 0)
           + IF(IFNULL(atu1.id_author, 0) AND IFNULL(atu2.id_author, 0),
                IF(atu1.id_author = atu2.id_author, y, 0), 0)
           + IF(IFNULL(tta1.id_tag, 0) AND IFNULL(tta2.id_tag, 0),
                IF(tta1.id_tag = tta2.id_tag, z, 0), 0)
           - (UNIX_TIMESTAMP() - UNIX_TIMESTAMP(a_right.publish_date)) AS score
    FROM articles a_left
    JOIN articles a_right
      ON a_left.id_article <> a_right.id_article
     AND a_left.id_article > CHECKPOINT_ID
    LEFT OUTER JOIN articles_to_authors atu1 ON atu1.id_article = a_left.id_article
    LEFT OUTER JOIN articles_to_authors atu2 ON atu2.id_article = a_right.id_article
    LEFT OUTER JOIN tags_to_articles tta1 ON tta1.id_article = a_left.id_article
    LEFT OUTER JOIN tags_to_articles tta2 ON tta2.id_article = a_right.id_article

You may need two additional LEFT JOINs to handle deleted authors. The key here is the CHECKPOINT_ID parameter, which lets you run this procedure incrementally and process only new articles. An alternative (although I see no reason for it) would be to add a condition such as

 ... ON a_left.id_article<>a_right.id_article AND NOT EXISTS(SELECT id_article_left FROM related_articles WHERE id_article_left = a_left.id_article AND id_article_right = a_right.id_article) ... 
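The checkpoint idea can be tried out with a small Python sketch on an in-memory SQLite database. The helper name score_new_articles is hypothetical, and the score is reduced to the category term (weight 1) so that the incremental behaviour stays visible; this is a sketch under those assumptions, not the full query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles (id_article INTEGER PRIMARY KEY, id_category INTEGER);
CREATE TABLE related_articles (
    id_article_left INTEGER,
    id_article_right INTEGER,
    related_score INTEGER,
    PRIMARY KEY (id_article_left, id_article_right)
);
INSERT INTO articles VALUES (1, 1), (2, 1);
""")


def score_new_articles(conn, checkpoint_id):
    """Score only left-hand articles newer than checkpoint_id; return the new checkpoint."""
    conn.execute("""
        INSERT INTO related_articles
        SELECT a_left.id_article, a_right.id_article,
               CASE WHEN a_left.id_category = a_right.id_category THEN 1 ELSE 0 END
        FROM articles a_left
        JOIN articles a_right
          ON a_left.id_article <> a_right.id_article
         AND a_left.id_article > ?
    """, (checkpoint_id,))
    return conn.execute("SELECT COALESCE(MAX(id_article), 0) FROM articles").fetchone()[0]


checkpoint = score_new_articles(conn, 0)            # first run: every article
conn.execute("INSERT INTO articles VALUES (3, 1)")
checkpoint = score_new_articles(conn, checkpoint)   # second run: only article 3 on the left

total = conn.execute("SELECT COUNT(*) FROM related_articles").fetchone()[0]
print(checkpoint, total)  # 3 4
```

One thing the sketch makes visible: after the second run the pairs (1, 3) and (2, 3) are still missing, because the checkpoint filters only the left-hand side. Covering them would need a second pass with the condition on the right-hand article.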

I used this method on SQL Server.

I give each article as many relevant tags as possible, then find related articles by matching tags: the more tags two articles share, the more related they are.

    ALTER PROCEDURE [dbo].[GetRelatedArticles]
        @ArticleLang int,
        @ArticleURI varchar(100),
        @Count int = 10
    AS
    SET NOCOUNT ON

    -- Collect the @Count articles that share the most tags with @ArticleURI.
    DECLARE @URI dbo.URICountType;

    INSERT INTO @URI ([URI], [Count])
    SELECT TOP (@Count)
           ArticleTag.ArticleURI,
           COUNT(ArticleTag.ArticleURI) AS ArticleCount
    FROM ArticleTag WITH (NOLOCK)
    INNER JOIN ArticleTag AS ArticleTags WITH (NOLOCK)
            ON ArticleTags.ArticleURI = @ArticleURI
           AND ArticleTag.ArticleURI <> @ArticleURI
           AND ArticleTag.ArticleTag = ArticleTags.ArticleTag
    GROUP BY ArticleTag.ArticleURI
    ORDER BY ArticleCount DESC  -- without an ORDER BY, TOP would pick arbitrary rows

    -- For each candidate, return the version whose language id is numerically
    -- closest to @ArticleLang, ordered by shared-tag count and recency.
    SELECT Article.ArticleURI, Article.ArticleLang
    FROM Article WITH (NOLOCK)
    INNER JOIN (
        SELECT MIN(ABS(ArticleLang - @ArticleLang)) AS ArticleLangDifference,
               ArticleURI
        FROM Article WITH (NOLOCK)
        WHERE ArticleURI IN (SELECT URI FROM @URI)
        GROUP BY ArticleURI
    ) AS ArticleGroup
            ON Article.ArticleURI = ArticleGroup.ArticleURI
           AND ABS(Article.ArticleLang - @ArticleLang) = ArticleGroup.ArticleLangDifference
    INNER JOIN @URI AS URI ON Article.ArticleURI = URI.URI
    ORDER BY URI.[Count] DESC, Article.ArticleLastUpdate DESC
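The core of this approach, ranking by the number of shared tags, can be sketched in a few lines of Python. The article URIs and tags below are invented sample data, the function name related_by_tags is hypothetical, and the language-matching step of the procedure is deliberately left out.

```python
from collections import Counter

# Invented sample data: article URI -> set of tags.
article_tags = {
    "sql-joins": {"sql", "joins", "performance"},
    "sql-indexes": {"sql", "performance", "indexes"},
    "python-intro": {"python"},
    "sql-basics": {"sql"},
}


def related_by_tags(uri, article_tags, count=10):
    """Return up to `count` (uri, shared_tag_count) pairs, most shared tags first."""
    target = article_tags[uri]
    shared = Counter()
    for other, tags in article_tags.items():
        if other == uri:
            continue
        overlap = len(target & tags)
        if overlap:  # articles with no tag in common are not related at all
            shared[other] = overlap
    return shared.most_common(count)


print(related_by_tags("sql-joins", article_tags))
# [('sql-indexes', 2), ('sql-basics', 1)]
```

Unlike the scores in the question, this measure is symmetric: A and B always share the same number of tags in both directions.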

Source: https://habr.com/ru/post/902735/

