In order to detect data drift, we train a random forest classifier (the drift model) to discriminate the current dataset from the reference dataset.
If this classifier achieves an accuracy significantly greater than 0.5, it implies that the reference and current data can be distinguished and that you are observing data drift.
You may consider retraining your model in that situation.
The train and test data frames used to train the drift model are computed as follows, ref_df and cur_df being pandas data frames of the selected "reference" and "current" samples:
size = min(len(ref_df), len(cur_df))
ref_df = ref_df.sample(size, random_state=42)
cur_df = cur_df.sample(size, random_state=43)
full_df = pd.concat([ref_df, cur_df]).reset_index(drop=True)
train_df = full_df.sample(frac=0.7, random_state=42)
test_df = full_df.drop(train_df.index)
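Putting the pieces together, here is a minimal end-to-end sketch of the drift model on synthetic data; the single feature `x`, the sample sizes, and the classifier settings are illustrative assumptions, not the product's exact code:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Reference sample, and a current sample whose feature has shifted.
ref_df = pd.DataFrame({"x": rng.normal(0.0, 1.0, 500), "origin": 0})
cur_df = pd.DataFrame({"x": rng.normal(1.0, 1.0, 500), "origin": 1})

# Same balancing and 70/30 split as described above.
size = min(len(ref_df), len(cur_df))
ref_df = ref_df.sample(size, random_state=42)
cur_df = cur_df.sample(size, random_state=43)
full_df = pd.concat([ref_df, cur_df]).reset_index(drop=True)
train_df = full_df.sample(frac=0.7, random_state=42)
test_df = full_df.drop(train_df.index)

# The drift model predicts each row's origin (0 = reference, 1 = current).
drift_model = RandomForestClassifier(random_state=0)
drift_model.fit(train_df[["x"]], train_df["origin"])
accuracy = drift_model.score(test_df[["x"]], test_df["origin"])
```

With a one-standard-deviation shift, the drift model separates the two samples with accuracy well above 0.5, which signals drift.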
| Hypothesis tested | No drift (accuracy <= 0.5) |
|---|---|
| Significance level | {{ 1 - uiState.driftState.driftParamsOfResult.confidenceLevel | nicePrecision:3 }} |
| p-value | {{ uiState.driftState.driftResult.driftModelResult.driftModelAccuracy.pvalue | nicePrecision:5 }} |
| Conclusion | Drift detected / Inconclusive |
The hypothesis tested is that there is no drift, in which case the expected drift model accuracy is 0.5 (the datasets are indistinguishable). The observed accuracy may deviate from this expectation, and the binomial test evaluates whether this deviation is statistically significant, modelling the number of correct predictions as a random variable drawn from a binomial distribution.
The p-value is the probability of observing this accuracy (or a larger one) under the hypothesis of no drift. If this probability is lower than the significance level (e.g. 5%), the situation of no drift is unlikely: the hypothesis of no drift is rejected, triggering a drift detection.
The significance level indicates the rate of falsely detected drifts we are ready to accept from the test.
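The binomial test described above can be sketched with SciPy; the test-set size and number of correct predictions below are illustrative values:

```python
from scipy.stats import binomtest

n_test = 300       # size of the drift model's test set (illustrative)
n_correct = 180    # correct origin predictions, i.e. observed accuracy 0.60
significance = 0.05

# One-sided test of H0: accuracy = 0.5 (no drift) against accuracy > 0.5.
result = binomtest(n_correct, n_test, p=0.5, alternative="greater")
drift_detected = result.pvalue < significance
```

Here the deviation from 0.5 is large enough that the no-drift hypothesis is rejected at the 5% level.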
| Feature | Type | Distribution | KS test | Chi-square test | Population stability index |
|---|---|---|---|---|---|
| {{ columnSetting.name }} | | | {{univariateDrift.ksTestPvalue | nicePrecision: 2 | ifEmpty: '-' }} | {{univariateDrift.chiSquareTestPvalue | nicePrecision: 2 | ifEmpty: '-' }} | {{univariateDrift.populationStabilityIndex | nicePrecision: 2 | ifEmpty: '-' }} |
| {{ columnSetting.name }} | Unsupported feature type / Rejected / Rejected at feature processing step | {{ columnSetting.errorMessage }} | - | - | - |
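The univariate metrics in this table can be sketched as follows; the quantile binning and the epsilon smoothing in the PSI are assumptions for illustration, not the product's exact computation:

```python
import numpy as np
from scipy.stats import ks_2samp

def population_stability_index(ref, cur, n_bins=10, eps=1e-6):
    # Bin both samples on the quantiles of the reference distribution,
    # then compare the per-bin proportions of the two samples.
    edges = np.quantile(ref, np.linspace(0, 1, n_bins + 1)[1:-1])
    ref_pct = np.bincount(np.digitize(ref, edges), minlength=n_bins) / len(ref) + eps
    cur_pct = np.bincount(np.digitize(cur, edges), minlength=n_bins) / len(cur) + eps
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(1)
ref = rng.normal(0.0, 1.0, 2000)
cur = rng.normal(1.0, 1.0, 2000)   # shifted by one standard deviation

ks_pvalue = ks_2samp(ref, cur).pvalue       # small p-value -> distributions differ
psi = population_stability_index(ref, cur)  # PSI above ~0.2 is a common drift alert
```

The KS test applies to numerical features and the chi-square test to categorical ones; the PSI gives a single drift magnitude per feature.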
Click on "Compute" to generate the feature drift importance chart
The scatter plot shows feature importance for the original model versus feature importance for the (data-classifying) drift model.
This graph should be examined alongside the drift score.
For a highly drifted dataset (drift score ~1), if the features most responsible for the data drift are of low importance in the original model (bottom right quadrant), you can expect the behavior of the model to remain the same.
Features in the top right quadrant of the scatter plot are both highly drifted (i.e. they are powerful in distinguishing the reference data from the new observations) and of high importance for the original model. In this situation, you can expect the performance of the model to degrade, as the model no longer applies to your new observations.
Computation of text drift between different evaluations is not supported
| Feature | Euclidean distance | Cosine similarity | Classifier Gini |
|---|---|---|---|
| {{ embeddingDrift._key }} | {{embeddingDrift.euclidianDistance | nicePrecision: 2 | ifEmpty: '-' }} | {{embeddingDrift.cosineSimilarity | nicePrecision: 2 | ifEmpty: '-' }} | {{embeddingDrift.classifierGini | nicePrecision: 2 | ifEmpty: '-' }} |
In order to detect data drift in text columns, we convert each text into an embedding vector, a numerical representation designed to capture semantic relationships and contextual information about the objects it represents.
We then compute statistical and geometrical metrics in this embedding space to detect a shift in the embedding distribution, which in turn reflects a shift in the text distribution.
To quantify this shift, we use the Euclidean distance and the cosine similarity metrics. Additionally, we train a binary classifier to differentiate the embeddings, with a resulting Gini metric ranging from 0 (no drift) to 1 (drift).
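A minimal sketch of the two geometric metrics, computed here between the mean embedding of each sample (the aggregation by mean is an assumption for illustration; the product's exact computation is not shown):

```python
import numpy as np

def embedding_drift_metrics(ref_emb, cur_emb):
    """Euclidean distance and cosine similarity between the mean embedding
    of the reference texts and the mean embedding of the current texts."""
    ref_mean = ref_emb.mean(axis=0)
    cur_mean = cur_emb.mean(axis=0)
    euclidean = float(np.linalg.norm(ref_mean - cur_mean))
    cosine = float(np.dot(ref_mean, cur_mean)
                   / (np.linalg.norm(ref_mean) * np.linalg.norm(cur_mean)))
    return euclidean, cosine

rng = np.random.default_rng(2)
ref_emb = rng.normal(0.0, 1.0, (200, 16))   # toy embedding matrix
dist, sim = embedding_drift_metrics(ref_emb, ref_emb)  # identical samples
```

For identical samples the distance is 0 and the similarity is 1; drifted texts pull the distance up and the similarity down. The classifier Gini reported in the table is commonly defined from the drift classifier's AUC as 2 * AUC - 1.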