First of all, I have some history of users' application usage.
For instance:
user1, app1, 3 (launch time)
user2, app2, 2 (launch time)
user3, app1, 1 (launch time)
I basically have two requirements:
- Recommend applications to each user.
- Recommend similar applications for each application.
Therefore, I use ALS (implicit) from Spark MLlib to implement it. At first I just used the source data to train the model, and the result was terrible. I think this may be caused by the range of the launch times, which varies from 1 to a thousand. So I process the raw data into a score; I think this score reflects the real situation better and gives a good normalization.
score = lt / uMlt + lt / aMlt
score is the processed value that is used to train the model.
lt is the launch time in the source data.
uMlt is the user's mean launch time in the source data: uMlt = (sum of all launch times of this user) / (number of applications this user has ever run).
aMlt is the application's mean launch time in the source data: aMlt = (sum of all launch times of this application) / (number of users who have ever run this application).
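To make this concrete, here is a rough sketch of how I compute the scores with Spark (the names rawData and buildScores are only for illustration; the real job parses the source logs first):

import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

// rawData holds one record per (user, app) pair: (userId, appId, totalLaunchTime)
def buildScores(rawData: RDD[(Int, Int, Double)]): RDD[Rating] = {
  // uMlt = (sum of all launch times of the user) / (number of apps the user has run)
  val uMlt = rawData
    .map { case (uid, _, lt) => (uid, (lt, 1L)) }
    .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
    .mapValues { case (sum, cnt) => sum / cnt }

  // aMlt = (sum of all launch times of the app) / (number of users who have run it)
  val aMlt = rawData
    .map { case (_, aid, lt) => (aid, (lt, 1L)) }
    .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
    .mapValues { case (sum, cnt) => sum / cnt }

  // score = lt / uMlt + lt / aMlt
  rawData
    .map { case (uid, aid, lt) => (uid, (aid, lt)) }
    .join(uMlt)
    .map { case (uid, ((aid, lt), um)) => (aid, (uid, lt, um)) }
    .join(aMlt)
    .map { case (aid, ((uid, lt, um), am)) => Rating(uid, aid, lt / um + lt / am) }
}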
The following is an example of data after processing.
Rating(95788,20992,0.14167073369026184)
Rating(98696,20992,5.92363166809082)
Rating(160020,11264,2.261538505554199)
Rating(67904,11264,2.261538505554199)
Rating(268430,11264,0.13846154510974884)
Rating(201369,11264,1.7999999523162842)
Rating(180857,11264,2.2720916271209717)
Rating(217692,11264,1.3692307472229004)
Rating(186274,28672,2.4250855445861816)
Rating(120820,28672,0.4422124922275543)
Rating(221146,28672,1.0074234008789062)
After I did this and merged the applications that have different package names, the result looks better, but still not good enough.
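The merging itself is basically a lookup from every package-name variant to one canonical app id before the launch times are aggregated. A rough, illustrative sketch (the dictionary contents and the name packageNameToAppId are made up; my real job builds and broadcasts this dictionary elsewhere):

// Illustrative only: map every package-name variant of the same app to one canonical app id
val packageNameToAppId: Map[String, Int] = Map(
  "com.youlin.xyzs.shoumeng"      -> 1,
  "com.youlin.xyzs.juhe.shoumeng" -> 1   // different package name, same app
)

// records: RDD of (userId, packageName, launchTime) parsed from the source logs
val mergedRecords = records
  .flatMap { case (uid, pkg, lt) =>
    packageNameToAppId.get(pkg).map(appId => ((uid, appId), lt))
  }
  .reduceByKey(_ + _)                              // sum launch times of the merged package names
  .map { case ((uid, appId), lt) => (uid, appId, lt) }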
I found that the user features and product features are very small, and most of them are negative.
Here is a 3-line example of product features, 10 dimensions for each row:
((CompactBuffer(com.youlin.xyzs.shoumeng, com.youlin.xyzs.juhe.shoumeng)),(-4.798973236574966E-7,-7.641608021913271E-7,6.040852440492017E-7,2.82689171626771E-7,…,1.815822798789668E-7,5.000047167413868E-7,2.0220664964654134E-7,6.386763402588258E-7,-4.289261710255232E-7))
((CompactBuffer(com.…)),(…,2.0895185298286378E-4,2.968782791867852E-4,1.9461760530248284E-4))
((CompactBuffer(…)),(…,2.296348611707799E-5,3.8075613701948896E-5,1.2197584510431625E-5))
Here is a 3-line example of user features, 10 dimensions for each row:
(96768,(-0.0010857731103897095,-0.001926362863741815,0.0013726564357057214,6.345533765852451E-4,-9.048808133229613E-4,-4.1544197301846E-5,…))
(97280,(-0.0022841691970825195,-0.0017134940717369318,0.001027365098707378,9.437055559828877E-4,-0.0011165080359205604,0.0017137592658400536,9.713359759189188E-4,…))
(97792,(-0.0017802991205826402,-0.003464450128376484,0.002837196458131075,0.0015725698322057724,-0.0018932095263153315,9.185600210912526E-4,0.0018971719546243548,…))
So you can imagine how small the values get when I take the dot product of a user feature vector and a product feature vector to compute an entry of the user-item matrix.
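Just to show the magnitude: the predicted preference for a (user, app) pair is the dot product of the two 10-dimensional feature vectors, so with values around 1e-3 and 1e-7 every prediction comes out around 1e-9 (the vectors below are made-up placeholders with the same order of magnitude as my real features):

// Made-up vectors with roughly the magnitudes I see in userFeatures / productFeatures
val userVec: Array[Double]    = Array.fill(10)(1e-3)
val productVec: Array[Double] = Array.fill(10)(1e-7)

// predicted preference = dot product of the two feature vectors
val prediction = userVec.zip(productVec).map { case (u, p) => u * p }.sum
println(prediction)  // ~1e-9, so every reconstructed entry of the user-item matrix is tiny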
My questions are:
- Is there any other way to improve the recommendation results?
- Are my features correct, or is something wrong with them?
- Is my way of processing the raw launch times (converting them to scores) correct?
Here I put the code. This really is a programming question, but perhaps it cannot be solved with just a few lines of code.
// Train the implicit-feedback ALS model on the processed ratings
val model = ALS.trainImplicit(ratings, rank, iterations, lambda, alpha)

// Recommend top-N apps for every user and write the result to HBase
print("recommendForAllUser")
val userTopKRdd = recommendForAllUser(model, topN)
  .join(userData.map(x => (x._2._1, x._1)))
  .map { case (uid, (appArray, mac)) =>
    (mac, appArray.map { case (appId, rating) =>
      val packageName = appIdPriorityPackageNameDict.value.getOrElse(appId, Constants.PLACEHOLDER)
      (packageName, rating)
    })
  }
HbaseWriter.writeRddToHbase(userTopKRdd, "user_top100_recommendation",
  (x: (String, Array[(String, Double)])) => {
    val mac = x._1
    val products = x._2.map { case (packageName, rating) => packageName + "=" + rating }.mkString(",")
    val putMap = Map("apps" -> products)
    (new ImmutableBytesWritable(), Utils.getHbasePutByMap(mac, putMap))
  })

print("recommendSimilarApp")

// Dump a sample of the learned product feature vectors for inspection
println("productFeatures ******")
model.productFeatures.take(1000).map { case (appId, features) =>
  val packageNameList = appIdPackageNameListDict.value.get(appId)
  val packageNameListStr = if (packageNameList.isDefined) {
    packageNameList.mkString("(", ",", ")")
  } else {
    "Unknown List"
  }
  (packageNameListStr, features.mkString("(", ",", ")"))
}.foreach(println)

// Dump a sample of the learned user feature vectors for inspection
println("userFeatures ******")
model.userFeatures.take(1000).map { case (userId, features) =>
  (userId, features.mkString("(", ",", ")"))
}.foreach(println)

// Recommend top-N similar apps for every app and write the result to HBase
val similarAppRdd = recommendSimilarApp(model, topN).flatMap { case (appId, similarAppArray) =>
  val groupedAppList = appIdPackageNameListDict.value.get(appId)
  if (groupedAppList.isDefined) {
    val similarPackageList = similarAppArray.map { case (destAppId, rating) =>
      (appIdPriorityPackageNameDict.value.getOrElse(destAppId, Constants.PLACEHOLDER), rating)
    }
    groupedAppList.get.map(packageName => (packageName, similarPackageList))
  } else {
    None
  }
}
HbaseWriter.writeRddToHbase(similarAppRdd, "similar_app_top100_recommendation",
  (x: (String, Array[(String, Double)])) => {
    val packageName = x._1
    val products = x._2.map { case (packageName, rating) => packageName + "=" + rating }.mkString(",")
    val putMap = Map("apps" -> products)
    (new ImmutableBytesWritable(), Utils.getHbasePutByMap(packageName, putMap))
  })
UPDATE:
I learned something new about my data after reading the paper "Collaborative Filtering for Implicit Feedback Datasets". My data is much sparser than the IPTV dataset described in the paper.
Paper: 300,000 (users) 17,000 (products) 32,000,000 (data)
Mine: 300,000 (users) 31,000 (products) 700,000 (data)
Thus, the user-item matrix of the paper's dataset has a fill ratio of 0.00627 (= 32,000,000 / 300,000 / 17,000), while mine is about 0.000075 (= 700,000 / 300,000 / 31,000). I think this means my user-item matrix is roughly 80 times sparser than the one in the paper.
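For reference, here is the small sanity check I can run on the ratings RDD itself (assuming ratings is the same RDD I pass to ALS.trainImplicit):

// Fill ratio (density) of the user-item matrix = observations / (users * items)
val numUsers    = ratings.map(_.user).distinct().count()
val numProducts = ratings.map(_.product).distinct().count()
val numRatings  = ratings.count()
val density     = numRatings.toDouble / (numUsers * numProducts)
println(s"density = $density")  // paper: ~0.00627, mine: ~0.000075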
Could this be the cause of the poor results? And is there any way to improve it?