transformations

class transformations.EmbeddingVectorizer(embedding_sequence_length=None)

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Converts text into padded sequences. The output of this transformation is consistent with the required input format for Keras embedding layers.

For example, ‘the fat man’ might be transformed into [2, 0, 27, 1, 1, 1] if embedding_sequence_length is 6.

There are a few sentinel values used by this layer:

  • 0 is used for the UNK token (tokens which were not seen during training)
  • 1 is used for the padding token (to fill out sequences that are shorter than embedding_sequence_length)
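
A minimal usage sketch, assuming the class follows the standard scikit-learn fit/transform contract and accepts a pandas Series of raw strings (both are assumptions; the exact expected input type is not documented here):

    import pandas as pd

    from transformations import EmbeddingVectorizer

    # Hypothetical training corpus; real token indices will differ.
    observations = pd.Series(['the fat man', 'the fat man ran home'])

    vectorizer = EmbeddingVectorizer(embedding_sequence_length=6)
    vectorizer.fit(observations)

    # Each string becomes a fixed-length sequence of token indices,
    # e.g. 'the fat man' -> [2, 0, 27, 1, 1, 1] (0 = UNK, 1 = padding),
    # ready to feed into a Keras Embedding layer.
    padded = vectorizer.transform(observations)
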
__init__(embedding_sequence_length=None)

Initializes the transformer. If embedding_sequence_length is None, a suitable length can be derived from the training data (see generate_embedding_sequence_length()).

fit(X, y=None)
generate_embedding_sequence_length(observation_series)
static pad(input_sequence, length, pad_char)

Pad the given iterable so that it is the correct length.

Parameters:
  • input_sequence – Any iterable object
  • length (int) – The desired length of the output.
  • pad_char (str or int) – The character or int to be added to short sequences
Returns:

A sequence of exactly length elements

Return type:

list
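
A sketch of the documented behavior (that over-long inputs are truncated is an assumption; the docstring only describes padding short sequences):

    def pad(input_sequence, length, pad_char):
        # Truncate anything beyond the target length (assumed behavior),
        # then append pad_char until the sequence reaches that length.
        sequence = list(input_sequence)[:length]
        sequence.extend([pad_char] * (length - len(sequence)))
        return sequence

    # e.g. pad([2, 0, 27], 6, 1) -> [2, 0, 27, 1, 1, 1]
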

process_string(input_string)

Turn a string into a padded sequence of token indices, consistent with Keras’s Embedding layer:

  • Simple preprocess & tokenize
  • Convert tokens to indices
  • Pad sequence to be the correct length
Parameters:
  • input_string (str) – A string, to be converted into a padded sequence of token indices
Returns:

A padded, fixed-length array of token indices

Return type:

[int]
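
The three steps might look roughly like this (token_index is a hypothetical name for the fitted token-to-index mapping; the library's real attribute names and preprocessing are not shown in this doc):

    def process_string(self, input_string):
        # 1. Simple preprocess & tokenize (exact preprocessing is an assumption).
        tokens = input_string.lower().split()
        # 2. Convert tokens to indices; tokens not seen during training map to UNK (0).
        indices = [self.token_index.get(token, 0) for token in tokens]
        # 3. Pad (with the padding index, 1) out to the fixed sequence length.
        return self.pad(indices, self.embedding_sequence_length, 1)
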
transform(X)