I want to use the pyspark.mllib.tree.RandomForest module to get a proximity matrix for my observations.
So far, my data has been small enough to load directly into memory, so I used sklearn.ensemble.RandomForestClassifier to get the proximity matrix as follows: suppose X is a matrix containing the features and Y is a vector containing the labels. I trained a random forest to distinguish between objects labeled "0" and "1". With the trained random forest, I wanted to get a measure of the proximity between each pair of observations in my data set, defined as the number of decision trees in which the two observations end up in the same terminal node (= leaf). So, for 100 decision trees, the proximity between two observations can range from 0 (never fall in the same leaf) to 100 (fall in the same leaf in every tree). My Python implementation:
import numpy
from sklearn import ensemble

print X.shape, Y.shape
>> (8562, 4281) (8562,)

n_trees = 100
rand_tree = ensemble.RandomForestClassifier(n_estimators=n_trees)
rand_tree.fit(X, Y)

# apply() returns, for every observation, the index of the leaf it
# lands in -- one column per tree
apply_mat = rand_tree.apply(X)
obs_num = len(apply_mat)

# diagonal: every observation shares all its leaves with itself
sim_mat = numpy.eye(obs_num) * len(apply_mat[0])
for i in xrange(obs_num):
    for j in xrange(i, obs_num):
        vec_i = apply_mat[i]
        vec_j = apply_mat[j]
        # count the trees where both observations fall in the same leaf
        sim_val = len(vec_i[vec_i == vec_j])
        sim_mat[i][j] = sim_val
        sim_mat[j][i] = sim_val

sim_mat_norm = sim_mat / len(apply_mat[0])
print sim_mat_norm.shape
>> (8562, 8562)
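As an aside, the double loop above can be collapsed into a single broadcast comparison in NumPy. A minimal sketch, using a toy, made-up apply_mat (3 observations, 3 trees) in place of the real rand_tree.apply(X) output:

```python
import numpy as np

# Toy stand-in for rand_tree.apply(X):
# rows = observations, columns = trees, values = leaf indices
apply_mat = np.array([[1, 3, 2],
                      [1, 3, 5],
                      [4, 3, 2]])
n_trees = apply_mat.shape[1]

# Compare every pair of rows tree-by-tree in one broadcast step,
# then count the trees in which the two observations share a leaf
sim_mat = (apply_mat[:, None, :] == apply_mat[None, :, :]).sum(axis=2)
sim_mat_norm = sim_mat / float(n_trees)
```

This builds an (n, n, n_trees) boolean array, so it trades memory for speed and only helps while everything still fits in RAM.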
Now, however, my data no longer fits in memory, so I want to do the same thing with Spark. Is this possible, and if so, how?
(For reference, this is how I train the random forest in Spark, following the documentation: https://spark.apache.org/docs/1.2.0/mllib-ensembles.html#classification):
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
(trainingData, testData) = data.randomSplit([0.7, 0.3])

model = RandomForest.trainClassifier(trainingData, numClasses=2,
                                     categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)
But I don't see anything in the API that would give me, for each observation, the leaf it ends up in for every tree of the trained model, the way sklearn's apply() does.
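For what it's worth, if the per-tree leaf indices were available, the pairwise proximity step itself would map naturally onto RDD.cartesian followed by a map. A plain-Python sketch of just that step, with made-up observation IDs and leaf vectors, and itertools.product standing in for cartesian:

```python
from itertools import product

# Made-up stand-in for an RDD of (observation_id, per-tree leaf indices)
leaves = [(0, (1, 3, 2)),
          (1, (1, 3, 5)),
          (2, (4, 3, 2))]
n_trees = 3

def proximity(pair):
    # Count the trees in which both observations fall in the same leaf
    (i, vec_i), (j, vec_j) = pair
    matches = sum(a == b for a, b in zip(vec_i, vec_j))
    return ((i, j), matches / float(n_trees))

# In Spark this step would be:
#   leaves_rdd.cartesian(leaves_rdd).map(proximity)
prox = dict(map(proximity, product(leaves, leaves)))
```

The open part of the question is how to obtain the `leaves` pairs from a trained pyspark.mllib RandomForestModel in the first place (and whether a full cartesian product is affordable for 8562 observations).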
Thanks!