transformations
class transformations.EmbeddingVectorizer(embedding_sequence_length=None)
Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Converts text into padded sequences. The output of this transformation is consistent with the required format for Keras embedding layers.
For example, ‘the fat man’ might be transformed into [2, 0, 27, 1, 1, 1] if embedding_sequence_length is 6 (see the usage sketch after the list below).
There are a few sentinel values used by this layer:
- 0 is used for the UNK token (tokens which were not seen during training)
- 1 is used for the padding token (to fill out sequences that are shorter than embedding_sequence_length)
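A minimal usage sketch. The corpus, the pandas Series input type, and the specific indices shown are assumptions for illustration, not guaranteed by the library:

import pandas as pd
from transformations import EmbeddingVectorizer

# Hypothetical two-document corpus; a pandas Series input is an assumption
observations = pd.Series(['the fat man', 'the thin man ran'])

vectorizer = EmbeddingVectorizer(embedding_sequence_length=6)
vectorizer.fit(observations)

# Each text becomes a fixed-length sequence of token indices,
# e.g. [2, 0, 27, 1, 1, 1]: unseen tokens map to 0, padding to 1
padded = vectorizer.transform(observations)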
__init__(embedding_sequence_length=None)
x.__init__(…) initializes x; see help(type(x)) for signature
fit(X, y=None)
generate_embedding_sequence_length(observation_series)
static pad(input_sequence, length, pad_char)
Pad the given iterable so that it is the correct length.
Parameters:
- input_sequence – Any iterable object
- length (int) – The desired length of the output
- pad_char (str or int) – The character or int to be added to short sequences
Returns: A sequence of length length
Return type: []
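For illustration, a sketch of the documented behaviour. Note that truncating overly long inputs is an assumption; the docstring only promises an output of the correct length:

def pad(input_sequence, length, pad_char):
    # Illustrative sketch, not the library's source
    padded = list(input_sequence)[:length]      # truncating long inputs is an assumption
    padded += [pad_char] * (length - len(padded))  # fill short sequences with pad_char
    return padded

pad([2, 0, 27], 6, 1)  # -> [2, 0, 27, 1, 1, 1]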
process_string(input_string)
Turn a string into a padded sequence, consistent with Keras’s Embedding layer:
- Simple preprocess & tokenize
- Convert tokens to indices
- Pad sequence to be the correct length
Parameters: input_string (str) – A string to be converted into a padded sequence of token indices
Returns: A padded, fixed-length array of token indices
Return type: [int]
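The three steps might look roughly like the following. The lowercase/whitespace tokenizer, the token_index dictionary, and the helper name are assumptions for illustration, not the library's internals:

from transformations import EmbeddingVectorizer

def process_string_sketch(input_string, token_index, length, unk=0, pad_char=1):
    # Hypothetical helper mirroring the three steps above
    tokens = input_string.lower().split()                        # simple preprocess & tokenize
    indices = [token_index.get(token, unk) for token in tokens]  # convert tokens to indices
    return EmbeddingVectorizer.pad(indices, length, pad_char)    # pad to the correct length

process_string_sketch('the fat man', {'the': 2, 'man': 27}, 6)  # -> [2, 0, 27, 1, 1, 1]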
transform(X)