er_evaluation.error_analysis#

Error Analysis#

The error_analysis module provides tools to analyze errors, given a set of ground truth clusters. These ground truth clusters may correspond to a benchmark dataset which is complete (all of the entities within it are fully resolved and have no missing elements), or to a probability sample of ground truth clusters.

The key assumptions used for this module are:

A predicted clustering is available as a membership vector (named prediction throughout).
A set of ground truth clusters is available as a membership vector (named sample throughout).

Furthermore, two types of errors can be defined and analyzed:

Cluster-level errors are errors associated with each ground truth cluster.
Record-level errors are errors associated with each sampled record.

Analyze cluster-level errors#

Toy Example

Consider the following set of ground truth clusters and predicted clusters of records \(1,2,\dots, 8\):

                    ┌───────┐  ┌─────┐  ┌───┐
                    │ 1   2 │  │  4  │  │ 6 │  ┌───┐
     True clusters: │       │  │     │  │   │  │ 8 │
                    │   3   │  │  5  │  │ 7 │  └───┘
                    └───────┘  └─────┘  └───┘   c4
                        c1        c2      c3

                    ┌───────┐  ┌─────┐  ┌──────────┐
                    │ 1   2 │  │  4  │  │ 6        │
Predicted clusters: ├───────┴──┴─────┤  │        8 │
                    │   3         5  │  │ 7        │
                    └────────────────┘  └──────────┘

Assume that the ground truth clusters c1, c2, and c4 are available in a benchmark dataset sample. Then, we have:

>>> import pandas as pd
>>> prediction = pd.Series(index=[1,2,3,4,5,6,7,8], data=[1,1,2,3,2,4,4,4])
>>> sample = pd.Series(index=[1,2,3,4,5,8], data=["c1", "c1", "c1", "c2", "c2", "c4"])

The following error metrics, namely the splitting entropy, expected number of extraneous elements, and expected number of missing elements, are used to quantify errors associated with each ground truth cluster. Refer to the API documentation for full definitions:

>>> from er_evaluation.error_analysis import (splitting_entropy, expected_extra, expected_missing)

>>> expected_extra(prediction, sample)
reference
c1    0.333333
c2    0.500000
c4    2.000000
Name: expected_extra, dtype: float64

>>> expected_missing(prediction, sample)
reference
c1    1.333333
c2    1.000000
c4    0.000000
Name: expected_missing, dtype: float64

>>> splitting_entropy(prediction, sample)
reference
c1    1.889882
c2    2.000000
c4    1.000000
Name: splitting_entropy_1, dtype: float64
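To make the numbers above easier to check by hand, the following is a minimal sketch that recomputes them for the toy example in plain Python, without the library. It is not the library's implementation. The helper names (`members`, `toy_expected_extra`, `toy_expected_missing`, `toy_splitting_entropy`) are hypothetical, and the splitting entropy is assumed here to be the exponential of the Shannon entropy of the proportions in which a true cluster is split across predicted clusters. That assumption reproduces the values shown above for this example, but the API documentation remains the authority on the definitions.

```python
import math
from collections import Counter

# Toy example: record -> predicted cluster ID, and record -> true cluster ID.
prediction = {1: 1, 2: 1, 3: 2, 4: 3, 5: 2, 6: 4, 7: 4, 8: 4}
sample = {1: "c1", 2: "c1", 3: "c1", 4: "c2", 5: "c2", 8: "c4"}


def members(membership):
    """Invert a membership mapping into cluster ID -> set of records."""
    clusters = {}
    for record, cluster in membership.items():
        clusters.setdefault(cluster, set()).add(record)
    return clusters


pred_clusters = members(prediction)
true_clusters = members(sample)


def toy_expected_extra(true_id):
    # Average, over the records of the true cluster, of the number of records
    # in that record's predicted cluster that lie outside the true cluster.
    true = true_clusters[true_id]
    return sum(len(pred_clusters[prediction[r]] - true) for r in true) / len(true)


def toy_expected_missing(true_id):
    # Average, over the records of the true cluster, of the number of records
    # of the true cluster that are absent from that record's predicted cluster.
    true = true_clusters[true_id]
    return sum(len(true - pred_clusters[prediction[r]]) for r in true) / len(true)


def toy_splitting_entropy(true_id):
    # Assumed definition: exp of the Shannon entropy of the proportions in
    # which the true cluster is split across predicted clusters.
    true = true_clusters[true_id]
    counts = Counter(prediction[r] for r in true)
    entropy = -sum((n / len(true)) * math.log(n / len(true)) for n in counts.values())
    return math.exp(entropy)


for cid in sorted(true_clusters):
    print(
        cid,
        round(toy_expected_extra(cid), 6),
        round(toy_expected_missing(cid), 6),
        round(toy_splitting_entropy(cid), 6),
    )
```

For c1, records 1 and 2 sit in a predicted cluster with no extraneous record, while record 3 shares its predicted cluster with record 5, giving an average of 1/3 extraneous elements.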

Analyze record-level errors#

We define errors at the record level through a record error table, which provides the following quantities for each sampled record:

pred_cluster_size: The size of the predicted cluster which contains the record.
ref_cluster_size: The size of the true cluster which contains the record.
extra: The number of elements in the predicted cluster which are not in the true cluster.
missing: The number of elements in the true cluster which are not in the predicted cluster.

These four quantities, together with the record index, the predicted cluster ID, and the true cluster ID, are stored in what is called the record error table. The record error table can be computed using the record_error_table() function, given a sample of ground truth clusters and a prediction.
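As an illustration of what such a table contains, the sketch below builds the rows by hand for the toy example from the previous section, using plain Python. It does not call record_error_table(). The function name `record_error_rows` and the keys `record`, `pred_cluster`, and `ref_cluster` are hypothetical choices for this sketch, and the column names of the library's actual output may differ.

```python
# Toy example: record -> predicted cluster ID, and record -> true cluster ID.
prediction = {1: 1, 2: 1, 3: 2, 4: 3, 5: 2, 6: 4, 7: 4, 8: 4}
sample = {1: "c1", 2: "c1", 3: "c1", 4: "c2", 5: "c2", 8: "c4"}


def record_error_rows(prediction, sample):
    """Build one row per sampled record with the four quantities listed above."""
    pred_members, true_members = {}, {}
    for record, cluster in prediction.items():
        pred_members.setdefault(cluster, set()).add(record)
    for record, cluster in sample.items():
        true_members.setdefault(cluster, set()).add(record)

    rows = []
    for record, true_id in sample.items():
        pred_id = prediction[record]
        pred, true = pred_members[pred_id], true_members[true_id]
        rows.append(
            {
                "record": record,              # hypothetical key name
                "pred_cluster": pred_id,       # hypothetical key name
                "ref_cluster": true_id,        # hypothetical key name
                "pred_cluster_size": len(pred),
                "ref_cluster_size": len(true),
                "extra": len(pred - true),     # in predicted, not in true
                "missing": len(true - pred),   # in true, not in predicted
            }
        )
    return rows


rows = record_error_rows(prediction, sample)
for row in rows:
    print(row)
```

Only sampled records get a row: records 6 and 7 influence the `extra` count of record 8 through their shared predicted cluster, but they have no row of their own because their true cluster is not in the sample.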

From the record error table, the cluster error metrics can be computed. The functions expected_size_difference_from_table(), expected_extra_from_table(), expected_missing_from_table(), expected_relative_extra_from_table(), expected_relative_missing_from_table(), and error_indicator_from_table() compute cluster-level errors from the record error table rather than from the prediction and the sample.
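The idea of going from record-level rows to cluster-level metrics can be sketched as a group-and-average step. The example below is a simplified illustration in plain Python, not the library's code. The table is hard-coded for the toy example and reduced to the columns needed, and the key `ref_cluster` and the helper `mean_by_cluster` are hypothetical names.

```python
from collections import defaultdict

# Record error table for the toy example, reduced to the columns used here.
# The "ref_cluster" key is a hypothetical name for the true cluster ID.
table = [
    {"record": 1, "ref_cluster": "c1", "extra": 0, "missing": 1},
    {"record": 2, "ref_cluster": "c1", "extra": 0, "missing": 1},
    {"record": 3, "ref_cluster": "c1", "extra": 1, "missing": 2},
    {"record": 4, "ref_cluster": "c2", "extra": 0, "missing": 1},
    {"record": 5, "ref_cluster": "c2", "extra": 1, "missing": 1},
    {"record": 8, "ref_cluster": "c4", "extra": 2, "missing": 0},
]


def mean_by_cluster(table, column):
    """Average a record-level column over the records of each true cluster."""
    groups = defaultdict(list)
    for row in table:
        groups[row["ref_cluster"]].append(row[column])
    return {cid: sum(values) / len(values) for cid, values in groups.items()}


expected_extra_by_cluster = mean_by_cluster(table, "extra")
expected_missing_by_cluster = mean_by_cluster(table, "missing")

print(expected_extra_by_cluster)
print(expected_missing_by_cluster)
```

Because the averages depend only on the rows of the table, editing a row (for instance, to reflect doubt about one record's true cluster) and recomputing is enough to see how a cluster-level metric responds.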

The key advantage of working with the record error table is that it allows sensitivity analyses to be performed. Since all cluster error metrics and representative performance estimators can be computed directly from the record error table, uncertainty regarding error rates can be propagated from the record error table into cluster error metrics and into performance estimates.

Functions#

`count_extra`(prediction, sample)	Count the number of extraneous elements to sampled clusters.
`count_missing`(prediction, sample)	Count the number of missing elements to sampled clusters.
`error_indicator`(prediction, sample)	Error indicator metric.
`error_metrics`(prediction, sample)	Compute canonical set of error metrics from a prediction and a sample.
`expected_extra`(prediction, sample)	Expected number of extraneous elements to records in sampled clusters.
`expected_missing`(prediction, sample)	Expected number of missing elements to records in sampled clusters.
`expected_relative_extra`(prediction, sample)	Expected relative number of extraneous elements to records in sampled clusters.
`expected_relative_missing`(prediction, sample)	Expected relative number of missing elements to records in sampled clusters.
`expected_size_difference`(prediction, sample)	Expected size difference between predicted and sampled clusters.
`splitting_entropy`(prediction, sample[, alpha])	Splitting entropy of true clusters.
`cluster_sizes_from_table`(error_table)	Compute cluster sizes from record error table.
`error_indicator_from_table`(error_table)	Compute error indicator from record error table.
`error_metrics_from_table`(error_table)	Compute canonical set of error metrics from record error table.
`expected_extra_from_table`(error_table)	Compute expected extra elements from record error table.
`expected_missing_from_table`(error_table)	Compute expected missing elements from record error table.
`expected_relative_extra_from_table`(error_table)	Compute expected relative extra elements from record error table.
`expected_relative_missing_from_table`(error_table)	Compute expected relative missing elements from record error table.
`expected_size_difference_from_table`(error_table)	Compute expected size difference from record error table.
`fit_dt_regressor`(X, y[, numerical_features, ...])	Fits a decision tree regressor model with optional preprocessing for numerical and categorical features.
`pred_cluster_sizes_from_table`(error_table)	Compute predicted cluster sizes from record error table.
`record_error_table`(prediction, sample)	Compute record error table.