How can I improve my recommendation results? I am using Spark ALS (implicit)

First, I have some history of app usage by users.

For instance:
user1, app1, 3 (launch count)
user2, app2, 2 (launch count)
user3, app1, 1 (launch count)

I have basically two requirements:

  • Recommend apps to each user.
  • Recommend similar apps for each app.

Therefore, I use ALS (implicit) from Spark MLlib to implement it. At first, I just used the raw data to train the model, and the result was terrible. I think this may be caused by the range of the launch counts, which vary from 1 to the thousands. So I process the raw data into a score that I think reflects the real usage and regularizes better.
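For context, the training itself is the standard implicit-feedback call (shown in full in the code at the end of this question). A minimal sketch, where the hyperparameter values are only illustrative and need tuning:

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Illustrative hyperparameter values; these need tuning.
val rank = 10; val iterations = 20; val lambda = 0.01; val alpha = 40.0
// ratings: RDD[Rating] built from the scores described below
val model = ALS.trainImplicit(ratings, rank, iterations, lambda, alpha)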

score = lt / uMlt + lt / aMlt

score is the processed value used to train the model.
lt is the launch count from the raw data.
uMlt is the user's mean launch count in the raw data: (total launches by this user) / (number of apps this user has ever launched).
aMlt is the app's mean launch count in the raw data: (total launches of this app) / (number of users who have ever launched this app).
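A minimal sketch of this preprocessing in Spark (the variable names are mine; raw is assumed to be an RDD[(Int, Int, Double)] of (userId, appId, launchCount) triples, one row per (user, app) pair; aggregate first if that is not the case):

import org.apache.spark.mllib.recommendation.Rating

// Mean launch count per user: total launches / number of apps launched.
val uMlt = raw.map { case (u, _, lt) => (u, (lt, 1L)) }
  .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
  .mapValues { case (sum, n) => sum / n }

// Mean launch count per app: total launches / number of users who launched it.
val aMlt = raw.map { case (_, a, lt) => (a, (lt, 1L)) }
  .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
  .mapValues { case (sum, n) => sum / n }

// score = lt / uMlt + lt / aMlt
val ratings = raw.map { case (u, a, lt) => (u, (a, lt)) }
  .join(uMlt)
  .map { case (u, ((a, lt), um)) => (a, (u, lt, um)) }
  .join(aMlt)
  .map { case (a, ((u, lt, um), am)) => Rating(u, a, lt / um + lt / am) }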
The following is a sample of the data after processing:

Rating(95788,20992,0.14167073369026184)
Rating(98696,20992,5.92363166809082)
Rating(160020,11264,2.261538505554199)
Rating(67904,11264,2.261538505554199)
Rating(268430,11264,0.13846154510974884)
Rating(201369,11264,1.7999999523162842)
Rating(180857,11264,2.2720916271209717)
Rating(217692,11264,1.3692307472229004)
Rating(186274,28672,2.4250855445861816)
Rating(120820,28672,0.4422124922275543)
Rating(221146,28672,1.0074234008789062)

After I did this and merged apps that have different package names but are actually the same app, the result looks better, but it is still not good enough.
I find that the user and product feature values are very small, and most of them are negative.

Here is a 3-row sample of the product features, 10 dimensions per row:

((CompactBuffer(com.youlin.xyzs.shoumeng, com.youlin.xyzs.juhe.shoumeng)),(-4.798973236574966E-7,-7.641608021913271E-7,6.040852440492017E-7,2.82689171626771E-7,…,1.815822798789668E-7,5.000047167413868E-7,2.0220664964654134E-7,6.386763402588258E-7,-4.289261710255232E-7))
((CompactBuffer(com.…)),(…,2.0895185298286378E-4,2.968782791867852E-4,1.9461760530248284E-4))
((CompactBuffer(…)),(…,2.296348611707799E-5,3.8075613701948896E-5,1.2197584510431625E-5))

Here is a 3-row sample of the user features, 10 dimensions per row:

(96768,(-0.0010857731103897095,-0.001926362863741815,0.0013726564357057214,6.345533765852451E-4,-9.048808133229613E-4,-4.1544197301846E-5,0.0014421759406107…))
(97280,(-0.0022841691970825195,-0.0017134940717369318,0.001027365098707378,9.437055559828877E-4,-0.0011165080359205604,0.0017137592658400536,9.713359759189188E-4,…))
(97792,(-0.0017802991205826402,-0.003464450128376484,0.002837196458131075,0.0015725698322057724,-0.0018932095263153315,9.185600210912526E-4,0.0018971719546243548,…))

So you can imagine how small the result is when I take the dot product of a user feature vector and a product feature vector to compute an entry of the user-item matrix.
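To make this concrete, here is a tiny sketch (plain Scala, not Spark API) of how a predicted preference is formed, and why the magnitudes above make it vanish:

// Predicted preference = dot product of user and product feature vectors.
def predict(userVec: Array[Double], productVec: Array[Double]): Double =
  userVec.zip(productVec).map { case (u, p) => u * p }.sum

// With product features around 1e-7 and user features around 1e-3, as in the
// samples above, each term is about 1e-10, so a 10-dimensional dot product is
// on the order of 1e-9, which is effectively zero for ranking.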

My questions are:

  • Is there any other way to improve the recommendation results?
  • Are my feature values reasonable, or is something wrong?
  • Is my way of processing the raw launch counts (converting them into scores) correct?

Here is my code. This is definitely a programming question, but it probably cannot be solved with just a few lines of code.

val model = ALS.trainImplicit(ratings, rank, iterations, lambda, alpha)

print("recommendForAllUser")
val userTopKRdd = recommendForAllUser(model, topN)
  .join(userData.map(x => (x._2._1, x._1)))
  .map { case (uid, (appArray, mac)) =>
    (mac, appArray.map { case (appId, rating) =>
      val packageName = appIdPriorityPackageNameDict.value.getOrElse(appId, Constants.PLACEHOLDER)
      (packageName, rating)
    })
  }

HbaseWriter.writeRddToHbase(userTopKRdd, "user_top100_recommendation",
  (x: (String, Array[(String, Double)])) => {
    val mac = x._1
    val products = x._2.map { case (packageName, rating) => packageName + "=" + rating }.mkString(",")
    val putMap = Map("apps" -> products)
    (new ImmutableBytesWritable(), Utils.getHbasePutByMap(mac, putMap))
  })

print("recommendSimilarApp")

println("productFeatures ******")
model.productFeatures.take(1000).map { case (appId, features) =>
  val packageNameList = appIdPackageNameListDict.value.get(appId)
  val packageNameListStr = if (packageNameList.isDefined) {
    packageNameList.get.mkString("(", ",", ")")
  } else {
    "Unknown List"
  }
  (packageNameListStr, features.mkString("(", ",", ")"))
}.foreach(println)

println("userFeatures ******")
model.userFeatures.take(1000).map { case (userId, features) =>
  (userId, features.mkString("(", ",", ")"))
}.foreach(println)

val similarAppRdd = recommendSimilarApp(model, topN).flatMap { case (appId, similarAppArray) =>
  val groupedAppList = appIdPackageNameListDict.value.get(appId)
  if (groupedAppList.isDefined) {
    val similarPackageList = similarAppArray.map { case (destAppId, rating) =>
      (appIdPriorityPackageNameDict.value.getOrElse(destAppId, Constants.PLACEHOLDER), rating)
    }
    groupedAppList.get.map(packageName => (packageName, similarPackageList))
  } else {
    Seq.empty
  }
}

HbaseWriter.writeRddToHbase(similarAppRdd, "similar_app_top100_recommendation",
  (x: (String, Array[(String, Double)])) => {
    val packageName = x._1
    val products = x._2.map { case (pkg, rating) => pkg + "=" + rating }.mkString(",")
    val putMap = Map("apps" -> products)
    (new ImmutableBytesWritable(), Utils.getHbasePutByMap(packageName, putMap))
  })
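recommendForAllUser and recommendSimilarApp are my own helpers and are not shown here. For context, a minimal version of the first one could be a thin wrapper over the built-in batch recommender (assuming Spark >= 1.4, where MatrixFactorizationModel.recommendProductsForUsers exists; my real implementation may differ):

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.rdd.RDD

// Top-N products per user, as (userId, Array[(productId, score)]).
def recommendForAllUser(model: MatrixFactorizationModel, topN: Int): RDD[(Int, Array[(Int, Double)])] =
  model.recommendProductsForUsers(topN)
    .mapValues(_.map(r => (r.product, r.rating)))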

UPDATE:
I learned something new about my data after reading the paper ("Collaborative Filtering for Implicit Feedback Datasets"). My data set is far sparser than the IPTV data set described in the paper.

Paper: 300,000 (users), 17,000 (products), 32,000,000 (observations)
Mine:  300,000 (users), 31,000 (products), 700,000 (observations)

Thus, the user-item matrix of the paper's data set has density 0.00627 = 32,000,000 / (300,000 × 17,000), while mine is about 0.000075 = 700,000 / (300,000 × 31,000). That means my user-item matrix is roughly 80 times sparser than the paper's.
Should this lead to poor results? And is there any way to improve it?

1 answer

There are two things you should try:

  • Standardize your data so that it has zero mean and unit variance per user vector. This is a common preprocessing step in machine learning. It helps reduce the effect of the outliers that are causing the near-zero values you are seeing. (See the sketch after this list.)
  • Remove all users who have only one app. The only thing you learn from those users is a slightly better "average" value for app ratings; they will not help you learn any meaningful relationships, which is what you actually want.
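A minimal sketch of both steps in Spark, assuming ratings: RDD[Rating] (variable names are illustrative):

import org.apache.spark.mllib.recommendation.Rating

// Keep only users who have rated at least two apps.
val counts = ratings.map(r => (r.user, 1L)).reduceByKey(_ + _)
val multiApp = ratings.map(r => (r.user, r))
  .join(counts.filter { case (_, n) => n >= 2 })
  .map { case (_, (r, _)) => r }

// Standardize each remaining user's ratings to zero mean and unit variance.
val stats = multiApp.map(r => (r.user, (r.rating, r.rating * r.rating, 1L)))
  .reduceByKey { case ((s1, q1, n1), (s2, q2, n2)) => (s1 + s2, q1 + q2, n1 + n2) }
  .mapValues { case (sum, sumSq, n) =>
    val mean = sum / n
    val std = math.sqrt(math.max(sumSq / n - mean * mean, 1e-12)) // guard against zero variance
    (mean, std)
  }
val standardized = multiApp.map(r => (r.user, r)).join(stats).map {
  case (_, (r, (mean, std))) => Rating(r.user, r.product, (r.rating - mean) / std)
}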

By removing a user from the model, you lose the ability to get recommendations for that user directly from the model by user ID. However, such users only have one app rating anyway. So you can instead run a KNN search over the product feature matrix to find the apps most similar to that one app, and serve those as the user's recommendations.
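Something like this, using cosine similarity over the learned product features (a brute-force scan, which is fine at ~31,000 products; model is the trained MatrixFactorizationModel):

def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val na = math.sqrt(a.map(x => x * x).sum)
  val nb = math.sqrt(b.map(x => x * x).sum)
  if (na == 0.0 || nb == 0.0) 0.0 else dot / (na * nb)
}

// Apps most similar to a given app, by cosine similarity of the ALS
// product feature vectors.
def similarApps(appId: Int, k: Int): Array[(Int, Double)] = {
  val target = model.productFeatures.lookup(appId).head
  model.productFeatures
    .filter { case (otherId, _) => otherId != appId }
    .map { case (otherId, features) => (otherId, cosine(target, features)) }
    .top(k)(Ordering.by(_._2))
}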
