Introduction:
I am trying to set up an SVM evaluation in TensorFlow with tensorflow.contrib.learn.python.learn.estimators.svm on sparse data. There is an example of use with sparse data in the GitHub repository at tensorflow/contrib/learn/python/learn/estimators/svm_test.py#L167 (I am not allowed to post more links, so here is the relative path).
The SVM estimator expects the parameters example_id_column and feature_columns, where the feature columns should be instances of the class FeatureColumn, for example tf.contrib.layers.sparse_column_with_hash_bucket. See the GitHub repository tensorflow/contrib/learn/python/learn/estimators/svm.py#L85 and the tensorflow.org documentation for python/contrib.layers#Feature_columns.
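For reference, the sparse-data pattern in that test looks roughly like the following (a simplified sketch from memory, not a verbatim copy of svm_test.py; the column names and values here are made up):

import tensorflow as tf
from tensorflow.contrib.learn.python.learn.estimators import svm

def sparse_example_input_fn():
    # Three examples with one real-valued and one sparse string feature.
    return {
        'example_id': tf.constant(['1', '2', '3']),
        'price': tf.constant([[0.6], [0.8], [0.3]]),
        'country': tf.SparseTensor(values=['IT', 'US', 'GB'],
                                   indices=[[0, 0], [1, 0], [2, 0]],
                                   dense_shape=[3, 1])
    }, tf.constant([[1], [0], [1]])

price = tf.contrib.layers.real_valued_column('price')
country = tf.contrib.layers.sparse_column_with_hash_bucket(
    'country', hash_bucket_size=5)
classifier = svm.SVM(feature_columns=[price, country],
                     example_id_column='example_id')
classifier.fit(input_fn=sparse_example_input_fn, steps=10)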
Question:
- How do I configure my input pipeline to format sparse data so that I can use one of the tf.contrib.layers feature_columns as input for the SVM estimator?
- What would a dense input function look like with many features? (See the sketch below.)
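For the dense case, I imagine something roughly like the following (a sketch only, with hypothetical column names f0 .. f122; this is exactly the part I am unsure about):

import tensorflow as tf

NUM_FEATURES = 123

# Hypothetical dense setup: one real_valued_column per feature.
dense_feature_columns = [tf.contrib.layers.real_valued_column('f%d' % i)
                         for i in range(NUM_FEATURES)]

def dense_input_fn():
    batch = 4  # toy batch size, just to illustrate the shapes
    features = {'f%d' % i: tf.random_uniform([batch, 1])
                for i in range(NUM_FEATURES)}
    features['example_id'] = tf.as_string(tf.range(batch))
    labels = tf.constant([[1], [0], [1], [0]])
    return features, labels

Perhaps a single real_valued_column('features', dimension=123) over one [batch, 123] tensor would avoid the 123 separate columns, but I am not sure which form the SVM estimator expects.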
Background:
The data I use is the a1a dataset from the LIBSVM website. The dataset has 123 features (which would correspond to 123 feature_columns if the data were dense). I wrote a custom op to read the data, in the style of tf.decode_csv() but for the LIBSVM format, where each line has the form <label> <index>:<value> ... and lists only the nonzero features (e.g. -1 3:1 11:1 14:1). The op returns the labels as a dense tensor and the features as a sparse tensor. My input pipeline:
NUM_FEATURES = 123
batch_size = 200
decode_libsvm_module = tf.load_op_library('./libsvm.so')

def input_pipeline(filename_queue, batch_size):
    with tf.name_scope('input'):
        reader = tf.TextLineReader(name="TextLineReader_")
        _, libsvm_row = reader.read(filename_queue, name="libsvm_row_")
        min_after_dequeue = 1000
        capacity = min_after_dequeue + 3 * batch_size
        batch = tf.train.shuffle_batch([libsvm_row], batch_size=batch_size,
                                       capacity=capacity,
                                       min_after_dequeue=min_after_dequeue,
                                       name="text_line_batch_")
        # The custom op parses a batch of LIBSVM lines into a dense label
        # tensor and the components of a sparse feature tensor.
        labels, sp_indices, sp_values, sp_shape = \
            decode_libsvm_module.decode_libsvm(records=batch,
                                               num_features=NUM_FEATURES,
                                               OUT_TYPE=tf.int64,
                                               name="Libsvm_decoded_")
        return tf.SparseTensor(sp_indices, sp_values, sp_shape), labels
To sanity-check the pipeline, I looked at an example batch with batch_size = 5.
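(A quick check along these lines prints such a batch; it assumes the a1a file has already been downloaded next to the script:)

with tf.Session() as sess:
    queue = tf.train.string_input_producer(['a1a'])
    sparse_features, labels = input_pipeline(queue, batch_size=5)
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    # Prints the SparseTensorValue of feature indices plus the labels.
    print(sess.run([sparse_features, labels]))
    coord.request_stop()
    coord.join(threads)

My input function: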
def input_fn(dataset_name):
    maybe_download()
    filename_queue_train = tf.train.string_input_producer([dataset_name],
                                                          name="queue_t_")
    features, labels = input_pipeline(filename_queue_train, batch_size)
    return {
        # NOTE: tf.range(1, 123) yields 122 ids, one per feature index,
        # not one per example in the batch of size batch_size.
        'example_id': tf.as_string(tf.range(1, 123, 1, dtype=tf.int64)),
        'features': features
    }, labels
This is what I have tried so far:
with tf.Session().as_default() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)

    feature_column = tf.contrib.layers.sparse_column_with_hash_bucket(
        'features', hash_bucket_size=1000, dtype=tf.int64)
    svm_classifier = svm.SVM(feature_columns=[feature_column],
                             example_id_column='example_id',
                             l1_regularization=0.0,
                             l2_regularization=1.0)

    svm_classifier.fit(input_fn=lambda: input_fn(TRAIN),
                       steps=30)
    # Evaluating on the training file for now.
    accuracy = svm_classifier.evaluate(
        input_fn=lambda: input_fn(TRAIN),
        steps=1)['accuracy']
    print(accuracy)

    coord.request_stop()
    coord.join(threads)
    sess.close()