I plan to write a program in Ruby to analyze data collected by an online application. There are hundreds of thousands of responses, and each respondent answers 200 questions. Every question is multiple choice, so each one has a fixed number of possible answers.
The goal is to use part of the demographic data supplied by each respondent to train the system, so that it can then guess that same demographic attribute (e.g. age) for a respondent who completes the questionnaire but does not provide demographic data.
To do this, I plan to represent each respondent's answers as a vector (in the mathematical sense, not in the sense of the data structure). This means that each vector will be large (over 200 elements), and the overall data set will be huge. I plan to store the data in a MySQL database.
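To make the representation concrete, here is a minimal sketch of one possible encoding. It assumes each answer is stored as the zero-based index of the chosen option and that unanswered questions are left as nil; the names `QUESTION_COUNT` and `answer_vector` are hypothetical and used only for illustration.

```ruby
# Assumption: 200 questions, each answered with a zero-based option index.
QUESTION_COUNT = 200

# Builds one vector per respondent.
# answers: Hash mapping question number (1..question_count) to the chosen
# option index. Questions missing from the hash become nil.
def answer_vector(answers, question_count = QUESTION_COUNT)
  (1..question_count).map { |question| answers[question] }
end

# Example respondent who answered only questions 1, 2 and 200.
respondent = { 1 => 2, 2 => 0, 200 => 3 }
vector = answer_vector(respondent)
```

Each respondent then maps to a single fixed-length array, whatever storage layout is used underneath.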
So, two questions:
How should I store this in the database? One row per answer to each question, one row per respondent, or something else?
I plan to use something like a k-nearest neighbor algorithm, or a simple machine learning algorithm such as a naive Bayes classifier, to learn how to classify new responses. Should I manipulate the data purely through SQL, or should I load it into memory and keep it in some huge array?
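For reference, the in-memory variant could look roughly like the sketch below. It is a brute-force k-nearest neighbor classifier under some assumptions of mine: answers are categorical option indexes, Hamming distance (the number of questions answered differently) is the similarity measure, and the label is chosen by majority vote. The names `hamming_distance` and `knn_classify` and the sample data are hypothetical.

```ruby
# Hamming distance: how many positions differ between two answer vectors.
def hamming_distance(a, b)
  a.zip(b).count { |x, y| x != y }
end

# training: Array of [vector, label] pairs held in memory.
# Returns the most common label among the k training vectors closest to query.
def knn_classify(training, query, k = 3)
  nearest = training.min_by(k) { |vector, _label| hamming_distance(vector, query) }
  votes = nearest.each_with_object(Hash.new(0)) do |(_vector, label), counts|
    counts[label] += 1
  end
  votes.max_by { |_label, count| count }.first
end

# Tiny made-up data set: 4 questions per respondent, age band as the label.
training = [
  [[0, 1, 2, 0], "18-29"],
  [[0, 1, 2, 1], "18-29"],
  [[3, 2, 0, 1], "50+"],
  [[3, 2, 1, 1], "50+"],
  [[0, 1, 1, 0], "18-29"]
]

guess_a = knn_classify(training, [0, 1, 2, 2])
guess_b = knn_classify(training, [3, 2, 0, 0])
```

This version compares the query against every training vector, so the cost of each classification grows linearly with the number of stored respondents.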