CPT’s documentation¶
A sequence prediction algorithm¶
Introduction¶
This project is an open-source Cython implementation of the Compact Prediction Tree algorithm using multithreading.
CPT is a sequence prediction algorithm. It is a highly explainable model that is good at predicting, over a finite alphabet, the next value of a sequence. However, given a sequence, CPT cannot predict an element that is already present in this sequence (cf. the “Tuning” section).
This implementation is based on the following research papers:
http://www.philippe-fournier-viger.com/ADMA2013_Compact_Prediction_trees.pdf
http://www.philippe-fournier-viger.com/spmf/PAKDD2015_Compact_Prediction_tree+.pdf
Examples¶
Hello World example¶
You can test the model with the following code:
from cpt.cpt import Cpt

model = Cpt()

model.fit([['hello', 'world'],
           ['hello', 'this', 'is', 'me'],
           ['hello', 'me']])

model.predict([['hello'], ['hello', 'this']])
# Output: ['me', 'is']
Sklearn Example¶
This code is also compatible with sklearn tools such as GridSearchCV:
from sklearn.base import BaseEstimator
from cpt.cpt import Cpt
from sklearn.model_selection import GridSearchCV


class SKCpt(Cpt, BaseEstimator):

    def __init__(self, split_length=0, noise_ratio=0, MBR=0):
        super().__init__(split_length, noise_ratio, MBR)

    def score(self, X):
        # Choose your own scoring function here
        predictions = self.predict(list(map(lambda x: x[self.split_length:-1], X)))
        score = sum([predictions[i] == X[i][-1] for i in range(len(X))]) / len(X) * 100
        return score


data = [['hello', 'world'], ['hello', 'cpt'], ['hello', 'cpt']]

tuned_params = {'MBR': [0, 5], 'split_length': [0, 1, 5]}

gs = GridSearchCV(SKCpt(), tuned_params)

gs.fit(data)

gs.cv_results_
The Cpt class¶
- class cpt.cpt.Cpt¶

  Compact Prediction Tree class.

  - Attributes
    - split_length : int, default 0 (all elements are considered)
      The split length is used to delimit the length of training sequences.
    - noise_ratio : float, default 0 (no noise)
      The threshold of frequency to consider elements as noise.
    - MBR : int, default 0 (at least one update)
      Minimum number of similar sequences needed to compute predictions.
    - alphabet : Alphabet
      The alphabet is used to encode values for Cpt. alphabet should not be used directly.
Methods
Compute noisy elements.
Find similar sequences.
Train the model with a list of sequence.
Predict the next element of each sequence of the parameter
sequences
.Predict the next elements of each sequence of the parameter
sequences
, sorted by descending confidence.Retrieve sequence from the training data.
- fit(sequences)¶

  Train the model with a list of sequences.

  The model can be retrained to add new sequences: model.fit(seq1); model.fit(seq2) is equivalent to model.fit(seq1 + seq2), with seq1 and seq2 lists of sequences.

  - Parameters
    - sequences : list
      A list of sequences of any hashable type.

  - Returns
    - None

  Examples

  >>> model.fit([['hello', 'world'], ['hello', 'cpt']])
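  As a hedged sketch of the incremental behaviour described above, the two models below should end up trained on the same data (variable names are illustrative):

  from cpt.cpt import Cpt

  seq1 = [['hello', 'world']]
  seq2 = [['hello', 'cpt']]

  # Incremental training: fit can be called several times...
  model_a = Cpt()
  model_a.fit(seq1)
  model_a.fit(seq2)

  # ...which is equivalent to fitting the concatenated list once.
  model_b = Cpt()
  model_b.fit(seq1 + seq2)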
- predict(sequences, multithreading=True)¶

  Predict the next element of each sequence of the parameter sequences.

  - Parameters
    - sequences : list
      A list of sequences of any hashable type.
    - multithreading : bool, default True
      True if multithreading should be used for predictions.

  - Returns
    - predictions : list of length len(sequences)
      The predicted elements.

  - Raises
    - ValueError
      noise_ratio should be between 0 and 1. MBR should be non-negative.

  Examples

  >>> model = Cpt()
  >>> model.fit([['hello', 'world'], ['hello', 'this', 'is', 'me'], ['hello', 'me']])
  >>> model.predict([['hello'], ['hello', 'this']])
  ['me', 'is']
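  As a hedged note, the same call can be made single-threaded by passing multithreading=False; with the training data above, the result should be unchanged:

  >>> model.predict([['hello'], ['hello', 'this']], multithreading=False)
  ['me', 'is']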
- predict_k(sequences, k, multithreading=True)¶

  Predict the next elements of each sequence of the parameter sequences, sorted by descending confidence.

  - Parameters
    - sequences : list
      A list of sequences of any hashable type.
    - k : int
      Number of predictions to make per sequence, ordered by descending confidence.
    - multithreading : bool, default True
      True if multithreading should be used for predictions.

  - Returns
    - predictions : List[List[Any]] of dimension len(sequences) * k
      The predicted elements.

  - Raises
    - ValueError
      noise_ratio should be between 0 and 1. MBR should be non-negative.

  Examples

  >>> model = Cpt()
  >>> model.fit([['hello', 'world'], ['hello', 'this', 'is', 'me'], ['hello', 'me']])
  >>> model.predict_k([['hello']], 2)
  [['me', 'this']]
- compute_noisy_items(noise_ratio)¶

  Compute noisy elements.

  An element is considered as noise if the frequency of sequences in which it appears at least once is below noise_ratio.

  - Parameters
    - noise_ratio : float
      The threshold of frequency to consider elements as noise.

  - Returns
    - noisy_items : list
      The noisy items.

  - Raises
    - ValueError
      noise_ratio should be between 0 and 1.
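  A hedged usage sketch, reusing the training data from the predict example; the exact contents and order of the returned list are not guaranteed here:

  >>> model = Cpt()
  >>> model.fit([['hello', 'world'], ['hello', 'this', 'is', 'me'], ['hello', 'me']])
  >>> noisy = model.compute_noisy_items(0.5)  # elements appearing in fewer than half of the sequences
  >>> # 'hello' appears in every sequence, so it should not be among the noisy items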
- find_similar_sequences(sequence)¶

  Find similar sequences.

  A sequence X is similar to a sequence S if every element of S is in X.

  - Parameters
    - sequence : list

  - Returns
    - similar_sequences : list
      The list of similar sequences.
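  A hedged usage sketch; assuming the returned list contains the matching training sequences, the query below should match the two training sequences that contain both 'hello' and 'me':

  >>> model = Cpt()
  >>> model.fit([['hello', 'world'], ['hello', 'this', 'is', 'me'], ['hello', 'me']])
  >>> similar = model.find_similar_sequences(['hello', 'me'])  # sequences containing both 'hello' and 'me'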
- retrieve_sequence(index)¶

  Retrieve a sequence from the training data.

  - Parameters
    - index : int
      Index of the sequence to retrieve.

  - Returns
    - sequence : list

  Examples

  >>> model = Cpt()
  >>> model.fit([['sample', 'data'], ['should', 'not', 'be', 'retrieved']])
  >>> model.retrieve_sequence(0)
  ['sample', 'data']
Tuning¶
CPT has three meta-parameters that need to be tuned:
MBR¶
MBR indicates the minimum number of similar sequences that need to be found before predicting a value.
The higher this parameter, the longer predictions take, but requiring more similar sequences can result in higher accuracy.
split_length¶
split_length is the number of elements per sequence to be stored in the model (choosing 0 keeps all elements).
split_length needs to be tuned carefully: as the model cannot predict an element already present in the sequence, sequences that are too long might result in lower accuracy (see the sketch below).
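A minimal hedged sketch of this limitation (the data is illustrative): 'b' follows 'a' in both training sequences, but because 'b' is already in the query it cannot be predicted again.

from cpt.cpt import Cpt

model = Cpt()
model.fit([['a', 'b', 'c'], ['a', 'b', 'c']])

# 'b' is already in the query, so it is never predicted here;
# the expected output is ['c'].
model.predict([['a', 'b']])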
noise_ratio¶
The noise_ratio determines which elements are defined as noise and should not be taken into account.
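As a closing hedged sketch, all three meta-parameters can be set when constructing the model. The values below are arbitrary and would need to be tuned for your own data, for instance with the GridSearchCV wrapper shown earlier; the argument order matches the SKCpt wrapper above.

from cpt.cpt import Cpt

# Arbitrary illustrative values, to be tuned for your own data.
model = Cpt(10,    # split_length: limit training sequences to 10 elements (0 keeps everything)
            0.05,  # noise_ratio: elements appearing in fewer than 5% of sequences are treated as noise
            5)     # MBR: require at least 5 similar sequences before predicting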