Getting free-text features into TensorFlow Canned Estimators using the Dataset API via feature_columns

I am trying to build a model that gives reddit_score = f('subreddit','comment')

This is basically a worked example that I can then adapt for a work project.

My code is here.

My problem is that canned estimators like DNNLinearCombinedRegressor expect feature_columns, which are instances of the FeatureColumn class.

I have my vocab file, and I know that if I just limited the comment to its first word, I could do something like

tf.feature_column.categorical_column_with_vocabulary_file(
        key='comment',
        vocabulary_file='{}/vocab.csv'.format(INPUT_DIR)
        )

But if I take, say, the first 10 words of the comment, I'm not sure how to go from a string like "this is a pre padded 10 word comment xyzpadxyz xyzpadxyz" to a feature_column that I can embed and feed into the deep part of a wide-and-deep model.

It seems like it should be something really obvious or simple, but I can't find any existing examples with this particular setup (a canned wide-and-deep estimator, the Dataset API, and a mix of features such as subreddit plus a raw-text feature like comment).

I even thought about doing the vocab integer lookup inside the input function, such that the comment feature I pass in would look like [23,45,67,12,1,345,7,99,999,999], and then maybe getting it in via numeric_column with a shape and doing something with it from there. But that feels a bit hacky.
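
If I did go that route, I guess the idiomatic column would be categorical_column_with_identity rather than numeric_column. A rough sketch of what I mean (the comment_ids key and VOCAB_SIZE are hypothetical):

# Hypothetical sketch: 'comment_ids' would already hold the integer word IDs
# produced by a vocab lookup inside the input function.
comment_ids = tf.feature_column.categorical_column_with_identity(
    key='comment_ids',
    num_buckets=VOCAB_SIZE)  # VOCAB_SIZE is a placeholder for the vocab length
comment_embeds = tf.feature_column.embedding_column(comment_ids, dimension=10)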


Answer based on the approach from @Lak's post, but adapted for the Dataset API:

# Create an input function reading a file using the Dataset API
# Then provide the results to the Estimator API
def read_dataset(prefix, mode, batch_size):

    def _input_fn():

        def decode_csv(value_column):

            columns = tf.decode_csv(value_column, field_delim='|', record_defaults=DEFAULTS)
            features = dict(zip(CSV_COLUMNS, columns))

            # Split the raw comment into words; densify the resulting sparse
            # tensor, filling the gaps with PADWORD
            features['comment_words'] = tf.string_split([features['comment']])
            features['comment_words'] = tf.sparse_tensor_to_dense(features['comment_words'], default_value=PADWORD)
            # Pad out by MAX_DOCUMENT_LENGTH and slice back down, so that every
            # comment ends up exactly MAX_DOCUMENT_LENGTH words long
            comment_padding = tf.constant([[0, 0], [0, MAX_DOCUMENT_LENGTH]])
            features['comment_words'] = tf.pad(features['comment_words'], comment_padding)
            features['comment_words'] = tf.slice(features['comment_words'], [0, 0], [-1, MAX_DOCUMENT_LENGTH])

            label = features.pop(LABEL_COLUMN)

            return features, label

        # Use prefix to create file path
        file_path = '{}/{}*{}*'.format(INPUT_DIR, prefix, PATTERN)

        # Create list of files that match pattern
        file_list = tf.gfile.Glob(file_path)

        # Create dataset from file list
        dataset = (tf.data.TextLineDataset(file_list)  # Read text file
                    .map(decode_csv))  # Transform each elem by applying decode_csv fn

        tf.logging.info("...dataset.output_types={}".format(dataset.output_types))
        tf.logging.info("...dataset.output_shapes={}".format(dataset.output_shapes))

        if mode == tf.estimator.ModeKeys.TRAIN:

            num_epochs = None # indefinitely
            dataset = dataset.shuffle(buffer_size = 10 * batch_size)

        else:

            num_epochs = 1 # end-of-input after this

        dataset = dataset.repeat(num_epochs).batch(batch_size)

        return dataset.make_one_shot_iterator().get_next()

    return _input_fn
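
The training and evaluation input functions are then just calls to read_dataset(); roughly like this (the file prefixes and batch size here are illustrative, not from the original code):

# Illustrative wiring; 'train'/'eval' prefixes and batch size are assumptions.
train_input_fn = read_dataset('train', tf.estimator.ModeKeys.TRAIN, batch_size=512)
eval_input_fn = read_dataset('eval', tf.estimator.ModeKeys.EVAL, batch_size=512)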

Then, with the padded comment_words feature created in decode_csv(), the feature columns can be defined:

# Define feature columns
def get_wide_deep():

    EMBEDDING_SIZE = 10

    # Define column types
    subreddit = tf.feature_column.categorical_column_with_vocabulary_list('subreddit', ['news', 'ireland', 'pics'])

    comment_embeds = tf.feature_column.embedding_column(
        categorical_column = tf.feature_column.categorical_column_with_vocabulary_file(
            key='comment_words',
            vocabulary_file='{}/vocab.csv-00000-of-00001'.format(INPUT_DIR),
            vocabulary_size=100
            ),
        dimension = EMBEDDING_SIZE
        )

    # Sparse columns are wide, have a linear relationship with the output
    wide = [ subreddit ]

    # Continuous columns are deep, have a complex relationship with the output
    deep = [ comment_embeds ]

    return wide, deep
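
These columns then plug into the canned estimator in the usual way; a minimal sketch (the model_dir and hidden-unit sizes are illustrative):

wide, deep = get_wide_deep()
estimator = tf.estimator.DNNLinearCombinedRegressor(
    model_dir='outdir',              # illustrative output directory
    linear_feature_columns=wide,     # sparse columns feed the linear part
    dnn_feature_columns=deep,        # embedding columns feed the deep part
    dnn_hidden_units=[64, 16])       # illustrative layer sizes
estimator.train(input_fn=read_dataset('train', tf.estimator.ModeKeys.TRAIN, 512))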

Source: https://habr.com/ru/post/1689940/

