The Cpt class

class cpt.cpt.Cpt

Compact Prediction Tree class.

Attributes
split_length : int, default 0 (all elements are considered)

The split length is used to delimit the length of training sequences.

noise_ratio : float, default 0 (no noise)

Frequency threshold below which an element is considered noise.

MBR : int, default 0 (at least one update)

Minimum number of similar sequences needed to compute predictions.

alphabet : Alphabet

The alphabet used to encode values for Cpt. It should not be used directly.

Methods

compute_noisy_items

Compute noisy elements.

find_similar_sequences

Find similar sequences.

fit

Train the model with a list of sequences.

predict

Predict the next element of each sequence of the parameter sequences.

predict_k

Predict the next elements of each sequence of the parameter sequences, sorted by descending confidence.

retrieve_sequence

Retrieve sequence from the training data.

fit(sequences)

Train the model with a list of sequences.

The model can be retrained to add new sequences: where seq1 and seq2 are lists of sequences, calling model.fit(seq1) and then model.fit(seq2) is equivalent to calling model.fit(seq1 + seq2).

Parameters
sequences : list

A list of sequences of any hashable type.

Returns
None

Examples

>>> model = Cpt()
>>> model.fit([['hello', 'world'], ['hello', 'cpt']])

predict(sequences, multithreading=True)

Predict the next element of each sequence of the parameter sequences.

Parameters
sequences : list

A list of sequences of any hashable type.

multithreading : bool, default True

Whether to use multithreading for predictions.

Returns
predictions : list of length len(sequences)

The predicted elements.

Raises
ValueError

Raised if noise_ratio is not between 0 and 1, or if MBR is negative.

Examples

>>> from cpt.cpt import Cpt
>>> model = Cpt()
>>> model.fit([['hello', 'world'],
...            ['hello', 'this', 'is', 'me'],
...            ['hello', 'me']])
>>> model.predict([['hello'], ['hello', 'this']])
['me', 'is']

predict_k(sequences, k, multithreading=True)

Predict the next elements of each sequence of the parameter sequences, sorted by descending confidence.

Parameters
sequences : list

A list of sequences of any hashable type.

k : int

Number of predictions to make per sequence, ordered by descending confidence.

multithreading : bool, default True

Whether to use multithreading for predictions.

Returns
predictions : List[List[Any]] of dimension len(sequences) * k

The predicted elements.

Raises
ValueError

Raised if noise_ratio is not between 0 and 1, or if MBR is negative.

Examples

>>> from cpt.cpt import Cpt
>>> model = Cpt()
>>> model.fit([['hello', 'world'],
...            ['hello', 'this', 'is', 'me'],
...            ['hello', 'me']])
>>> model.predict_k([['hello']], 2)
[['me', 'this']]

compute_noisy_items(noise_ratio)

Compute noisy elements.

An element is considered noise if the fraction of sequences in which it appears at least once is below noise_ratio.

Parameters
noise_ratio : float

Frequency threshold below which an element is considered noise.

Returns
noisy_items : list

The noisy items.

Raises
ValueError

Raised if noise_ratio is not between 0 and 1.
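
The noise definition above can be illustrated without the library. The following is a minimal pure-Python sketch, not the library's implementation: the function name noisy_items_sketch is hypothetical, it works on raw sequences rather than on the encoded alphabet, and the strict "below" comparison is taken from the description, so the library's behaviour exactly at the threshold is an assumption.

```python
from typing import Hashable, List, Sequence


def noisy_items_sketch(sequences: Sequence[Sequence[Hashable]],
                       noise_ratio: float) -> List[Hashable]:
    """Hypothetical helper: elements whose sequence frequency is below noise_ratio."""
    if not 0 <= noise_ratio <= 1:
        raise ValueError("noise_ratio should be between 0 and 1")
    if not sequences:
        return []

    # For each element, count the sequences that contain it at least once.
    # set(sequence) ensures repeats inside one sequence are counted once.
    counts = {}
    for sequence in sequences:
        for element in set(sequence):
            counts[element] = counts.get(element, 0) + 1

    total = len(sequences)
    return [element for element, count in counts.items()
            if count / total < noise_ratio]


training = [['a', 'b'], ['a', 'c'], ['a', 'b', 'b']]
# 'a' appears in 3 of 3 sequences, 'b' in 2 of 3, 'c' in 1 of 3.
print(noisy_items_sketch(training, 0.5))
```

With noise_ratio left at 0, no frequency can fall below the threshold, which matches the "no noise" default described in the attributes.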

find_similar_sequences(sequence)

Find similar sequences.

A sequence X is similar to a sequence S if every element of S appears in X.

Parameters
sequence : list

The sequence for which to find similar sequences.

Returns
similar_sequences : list

The list of similar_sequences.
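
The similarity definition above amounts to a subset test. Here is a minimal sketch under that reading; the function name find_similar_sketch is hypothetical, it scans a plain list of training sequences, and it returns the sequences themselves, whereas the form the library returns them in is not specified here.

```python
from typing import Hashable, List, Sequence


def find_similar_sketch(training: Sequence[Sequence[Hashable]],
                        sequence: Sequence[Hashable]) -> List[Sequence[Hashable]]:
    """Hypothetical helper: training sequences containing every element of `sequence`."""
    target = set(sequence)
    # X is similar to S when every element of S appears in X,
    # regardless of order or of how often it appears.
    return [candidate for candidate in training if target.issubset(candidate)]


training = [['hello', 'world'],
            ['hello', 'this', 'is', 'me'],
            ['hello', 'me']]
print(find_similar_sketch(training, ['hello', 'me']))
```

A linear scan like this is only meant to make the definition concrete; it says nothing about how the library performs the lookup.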

retrieve_sequence(index)

Retrieve sequence from the training data.

Parameters
index : int

Index of the sequence to retrieve.

Returns
sequence : list

Examples

>>> from cpt.cpt import Cpt
>>> model = Cpt()
>>> model.fit([['sample', 'data'], ['should', 'not', 'be', 'retrieved']])
>>> model.retrieve_sequence(0)
['sample', 'data']