Mahout: reading user input file

I have been experimenting with Mahout and found that FileDataModel accepts data in the format

userId,itemId,pref (long,long,double).

I have some data in the format

  String,long,double

What is the best / easiest way to work with this dataset in Mahout?

+6
3 answers

One way to do this is to extend FileDataModel. You need to override the readUserIDFromString(String value) method and use some kind of converter to turn the string into a long. You can use one of the IDMigrator implementations, as Sean suggests.

For example, if you have an initialized MemoryIDMigrator, you can do this:

    @Override
    protected long readUserIDFromString(String stringID) {
        long result = memoryIDMigrator.toLongID(stringID);
        memoryIDMigrator.storeMapping(result, stringID);
        return result;
    }

That way you can also use memoryIDMigrator for the reverse mapping, from the long ID back to the original string. If you do not need the reverse mapping, you can simply compute the long ID the same way their implementation does (see AbstractIDMigrator).
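To make the idea concrete outside of Mahout, here is a minimal Python sketch of what an in-memory ID migrator does: derive a long from the string, remember the pair, and look the string up again later. The class name `MemoryIdMapper` and its methods are hypothetical stand-ins, not Mahout's API, and the hashing scheme (first 8 bytes of an MD5 digest) is an assumption made for illustration.

```python
import hashlib


class MemoryIdMapper:
    """Hypothetical stand-in for an in-memory ID migrator (not Mahout's class)."""

    def __init__(self):
        # Reverse mapping: numeric ID -> original string ID.
        self._long_to_string = {}

    def to_long_id(self, string_id):
        # Assumption: take the first 8 bytes of an MD5 digest as a signed 64-bit value.
        digest = hashlib.md5(string_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], byteorder="big", signed=True)

    def store_mapping(self, long_id, string_id):
        self._long_to_string[long_id] = string_id

    def to_string_id(self, long_id):
        # Returns None if the ID was never stored.
        return self._long_to_string.get(long_id)


mapper = MemoryIdMapper()
user_long = mapper.to_long_id("alice")
mapper.store_mapping(user_long, "alice")
```

Because the ID is computed from the string itself, the same string always yields the same long, so only the reverse direction needs to be stored.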

+3

userId and itemId can both be strings. A CustomFileDataModel can convert each string to a numeric ID and keep the (String, ID) map in memory; after computing the recommendations, you can look up the original string from the ID.
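The bookkeeping behind such a model can be sketched in a few lines of Python. This is only an illustration of the two-way map, not the CustomFileDataModel code itself; the helper `StringIdMap`, the sample rows, and the list `recommended_for` are all made up for the example.

```python
class StringIdMap:
    """Hypothetical helper that assigns sequential numeric IDs to strings."""

    def __init__(self):
        self.str_to_id = {}
        self.id_to_str = []  # position in the list is the numeric ID

    def encode(self, s):
        if s not in self.str_to_id:
            self.str_to_id[s] = len(self.id_to_str)
            self.id_to_str.append(s)
        return self.str_to_id[s]

    def decode(self, i):
        return self.id_to_str[i]


users = StringIdMap()
rows = ["alice,101,4.5", "bob,102,3.0", "alice,102,5.0"]  # made-up sample data

encoded = []
for row in rows:
    user, item, pref = row.split(",")
    encoded.append((users.encode(user), int(item), float(pref)))

# Pretend a recommender handed back numeric user IDs; translate them back.
recommended_for = [1, 0]
names = [users.decode(i) for i in recommended_for]
```

If item IDs are strings as well, a second `StringIdMap` instance for items keeps the two ID spaces separate.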

+3

Assuming your input fits into memory, loop over it and keep track of the ID assigned to each string in a dictionary. If it does not fit into memory, sort the input and then group it by the string to achieve the same idea.

In Python:

    import sys

    next_id = 0
    str_to_id = {}  # maps each string ID to its assigned numeric ID
    for line in sys.stdin:
        fields = line.strip().split(',')
        this_id = str_to_id.get(fields[0])
        if this_id is None:
            # First time this string is seen: assign the next free ID.
            next_id += 1
            this_id = next_id
            str_to_id[fields[0]] = this_id
        fields[0] = str(this_id)
        print(','.join(fields))
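For the case where the input does not fit into memory, the sort-then-group idea could look roughly like the sketch below. It assumes the lines have already been sorted by the first field (for example by an external sort tool), so all rows for one string arrive together and no dictionary is needed. The function name `assign_ids_sorted` and the sample lines are hypothetical, and the in-memory `sorted()` call is there only to keep the example self-contained.

```python
from itertools import groupby


def first_field(line):
    # The string ID is everything before the first comma.
    return line.split(",", 1)[0]


def assign_ids_sorted(sorted_lines):
    """Yield rewritten lines; the input must already be sorted by the first field."""
    next_id = 0
    for _key, group in groupby(sorted_lines, key=first_field):
        next_id += 1  # one new ID per distinct string
        for line in group:
            rest = line.rstrip("\n").split(",", 1)[1]
            yield "{},{}".format(next_id, rest)


# Stand-in for externally sorted input.
lines = sorted(["bob,102,3.0", "alice,101,4.5", "alice,102,5.0"])
result = list(assign_ids_sorted(lines))
```

Only the current group's key is held at any time, so memory use does not grow with the number of distinct strings. The trade-off is that the string-to-ID mapping is not kept, so it would have to be written out separately if it is needed later.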
+1

Source: https://habr.com/ru/post/895993/

