In a paper entitled "Machine Learning at the Limit," Canny et al. report significant word2vec processing speed improvements.
I work with the BIDMach library used in that paper, and I cannot find any resource explaining how Word2Vec is implemented in it or how it is supposed to be used in this framework.
There are several scripts in the repo.
I tried to run them (after building the tparse2.exe file they reference), without success.
I tried modifying them to get them running, but got nothing back except errors.
I emailed the author and opened an issue on the GitHub repository, but received no response. The only other person I have found with the same problems says he did get it running, but at speeds far slower than those reported in the paper, despite newer GPU hardware.
I have searched everywhere, trying to find anyone who has used this library to achieve these speeds, with no luck. There are several links floating around to this library as the fastest implementation, citing the numbers from the paper.
When I search for a comparable library (gensim) plus the import code needed to run it, I find thousands of results and tutorials, but a similar search for BIDMach code turns up only the BIDMach repository itself.
This BIDMach implementation certainly has a reputation for being the fastest, but can anyone tell me how to use it?
All I want to do is run a simple training job so I can compare it with several other implementations on my own hardware.
Every other implementation of word2vec I can find either works out of the box with the original shell-script test file, contains actual instructions, or provides its own shell scripts for testing.
UPDATE: The author of the library has added additional shell scripts to run the scripts above, but what exactly they produce and how they work is still a complete mystery, and I cannot figure out how to get a single word2vec training run to work on data other than theirs.
EDIT (for the bounty)
I will award the bounty to anyone who can explain how to use my own corpus (text8 would be great), train a model, and then save the output vectors and the vocabulary to files that Omer Levy's Hyperwords can read.
This is exactly what the original C implementation does with the arguments -binary 1 -output vectors.bin -save-vocab vocab.txt.
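For reference, a complete invocation of the original C tool on text8 looks something like the line below. The three flags above are the ones that matter for the output files; the remaining hyperparameters are the defaults from the demo script that ships with the C code, not values taken from the paper:

    ./word2vec -train text8 -output vectors.bin -save-vocab vocab.txt -binary 1 -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -iter 15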
This is what the Intel implementation does, as well as the various CUDA implementations, etc., so it is an easy way to produce something that can be compared directly across versions...
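In case it helps anyone attempting this, here is a minimal Python sketch of the two formats involved: a reader for the binary vectors.bin layout the C tool writes with -binary 1, and a writer for the plain-text layout it writes with -binary 0, which is the format evaluation tools can generally consume. The function names are my own and nothing here is BIDMach-specific; it is only meant to pin down what the saved output has to look like:

    import struct

    def read_word2vec_bin(path):
        # Reads the binary format produced by the C word2vec with -binary 1:
        # an ASCII header "vocab_size dim\n", then per word the token bytes
        # terminated by a space, followed by dim float32 values.
        vectors = {}
        with open(path, "rb") as f:
            vocab_size, dim = (int(x) for x in f.readline().split())
            for _ in range(vocab_size):
                chars = []
                while True:
                    c = f.read(1)
                    if not c or c == b" ":
                        break
                    if c != b"\n":  # newline left over from the previous entry
                        chars.append(c)
                word = b"".join(chars).decode("utf-8", errors="replace")
                vectors[word] = struct.unpack("%df" % dim, f.read(4 * dim))
        return vectors

    def write_word2vec_text(vectors, path):
        # Writes the plain-text format: a "vocab_size dim" header line, then
        # one "word v1 v2 ... vN" line per word.
        with open(path, "w", encoding="utf-8") as f:
            dim = len(next(iter(vectors.values())))
            f.write("%d %d\n" % (len(vectors), dim))
            for word, vec in vectors.items():
                f.write(word + " " + " ".join("%.6f" % x for x in vec) + "\n")

Whatever format BIDMach actually saves, converting it into this text layout would make it directly comparable with the C, Intel, and CUDA outputs.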
UPDATE (the bounty expired without an answer): John Canny updated several scripts in the repo and added an fmt.txt file, which makes it possible to run the test scripts that are packaged in the repo.
However, my attempt to run this with the text8 corpus gives roughly 0% accuracy on the Hyperwords checks.
Running the training on the standard billion-word benchmark (which is what the repo scripts now do) also gives accuracy well below the typical accuracies reported for the Hyperwords tests.
So either the library never produced accurate vectors on these tests, or I am still missing something in my setup.
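For what it's worth, a quick nearest-neighbour smoke test helps separate a broken training run from a broken evaluation pipeline. This is a minimal sketch using my read_word2vec_bin helper from above (again, my own code, not part of any of these libraries); on a healthy text8 model the neighbours of a common word like "king" should be visibly related, and near-random neighbours would point at the training step rather than at Hyperwords:

    import numpy as np

    def nearest(vectors, query, k=10):
        # Cosine-similarity nearest neighbours of `query` over the whole vocab.
        words = list(vectors)
        mat = np.array([vectors[w] for w in words], dtype=np.float32)
        mat /= np.linalg.norm(mat, axis=1, keepdims=True) + 1e-8
        sims = mat @ mat[words.index(query)]
        best = np.argsort(-sims)[1 : k + 1]  # drop the query itself at rank 0
        return [(words[i], float(sims[i])) for i in best]

    print(nearest(read_word2vec_bin("vectors.bin"), "king"))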
The issue remains open on GitHub.