I am completely new to word2vec, so please bear with me. I have a set of text files, each containing between 1,000 and 3,000 tweets. I have chosen a general keyword ("kw1"), and I want to find semantically related terms for "kw1" using word2vec. For example, if the keyword is "apple", I would expect to see related terms such as "ipad", "os", "mac"... based on the input file. So this set of related terms for "kw1" would be different for each input file, since word2vec is trained on each file separately (for example, with 5 input files, run word2vec 5 times, once per file).
My goal is to find, for each input file, the set of terms related to the common keyword ("kw1"), which will then be used for some other purposes.
My questions / doubts:
- Does it make sense to use word2vec for such a task? Is it technically correct to use it, given the small size of the input files?
I downloaded the code from code.google.com (https://code.google.com/p/word2vec/) and gave it a dry run like this:
time ./word2vec -train $file -output vectors.bin -cbow 1 -size 200 -window 10 -negative 25 -hs 1 -sample 1e-3 -threads 12 -binary 1 -iter 50
./distance vectors.bin
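To get one model per input file, I just repeat this in a loop; a minimal sketch, assuming the tweet files are named tweets1.txt ... tweets5.txt (those names and the vectors-*.bin output paths are placeholders of mine, the flags are the same as above):

for file in tweets1.txt tweets2.txt tweets3.txt tweets4.txt tweets5.txt; do
    # one vector file per input file
    time ./word2vec -train "$file" -output "vectors-${file%.txt}.bin" -cbow 1 -size 200 -window 10 -negative 25 -hs 1 -sample 1e-3 -threads 12 -binary 1 -iter 50
done
# then query each model separately, typing kw1 at the prompt:
./distance vectors-tweets1.bin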
From my results, I saw that I get a lot of noisy terms (stop words) when I use the 'distance' tool to get terms related to "kw1". So I removed stop words and other noisy terms, such as user mentions. But I haven't seen anywhere that word2vec requires cleaned input?
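For reference, this is the kind of cleaning I did; a minimal sketch, assuming a stopwords.txt file with one stop word per line (the file names and the patterns for mentions/URLs are my own guesses at what counts as noise):

# lowercase, strip @mentions and URLs, split into one token per line,
# drop stop words, then re-join the tokens with spaces
tr 'A-Z' 'a-z' < raw_tweets.txt \
  | sed -E 's/@[a-z0-9_]+//g; s|https?://[^ ]+||g' \
  | tr -s ' ' '\n' \
  | grep -vxFf stopwords.txt \
  | tr '\n' ' ' > clean_tweets.txt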
Are there any better ways to improve the results of the 'distance' tool? Any recommendations on what values to use for parameters such as '-window' and '-iter'? I could not find any useful articles explaining how to tune them. (Please point me to one if there is any.)