Estimating Performance#

After preparing your data and monitoring summary statistics, the next step in evaluating your entity resolution (ER) system is estimating performance metrics.

Note

Estimating these metrics is not as straightforward as one might assume. Naive computation of performance metrics using benchmark datasets can lead to biased and over-optimistic results. This is due to the non-linear scaling of entity resolution: while it might be easy to disambiguate a small benchmark dataset, the complexity of the problem grows quadratically with the dataset size. Therefore, to obtain reliable and representative performance estimates, it is crucial to consider the entire population of interest and account for the sampling process and its biases.
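To see why the problem scales non-linearly, note that disambiguation implicitly involves comparing pairs of records, and the number of pairs grows quadratically with the number of records. A minimal illustration:

def n_pairs(n: int) -> int:
    # n records yield n * (n - 1) / 2 candidate record pairs.
    return n * (n - 1) // 2

print(n_pairs(1_000))      # 499500 pairs for a small benchmark
print(n_pairs(1_000_000))  # 499999500000 pairs (~5e11) for a full population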

In this guide, we will discuss how to estimate performance metrics using the ER-Evaluation package, emphasizing the importance of applying sampling weights alongside predictions and benchmark data.

To estimate performance metrics, you will need:

  • Predicted clusters: Predicted entity clusters for a set of records or entity mentions, usually the main output of an ER system.

  • Reference/benchmark data: A trusted benchmark dataset or reference disambiguation serving as “ground truth” data, typically containing 200 to 400 disambiguated entities.

  • Sampling weights: Weights applied to account for the sampling process and biases, such as “uniform” weights or “cluster_size” weights for sampling with probability proportional to cluster size.

Note

Sampling weights correct for biases in performance metric estimation when working with sampled data. They can be derived from the sampling design and selection probabilities, or estimated using propensity scoring techniques. For instance, under sampling with probability proportional to cluster size (obtained by sampling records uniformly at random and taking their clusters), weights are calculated as the inverse of each cluster’s size. Incorporating sampling weights ensures accurate performance metric estimation, accounting for non-representative reference datasets or unequal selection probabilities.
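For illustration, here is a minimal sketch of how “cluster_size” weights correspond to inverse cluster sizes, using a toy reference disambiguation represented as a membership vector (a pandas Series mapping record IDs to cluster IDs). In practice you can simply pass weights="cluster_size" to the estimators; computing the weights by hand as below is shown only to make the definition concrete.

import pandas as pd

# Toy reference disambiguation as a membership vector: record ID -> cluster ID.
toy_reference = pd.Series(
    {"rec1": "A", "rec2": "A", "rec3": "B", "rec4": "C", "rec5": "C", "rec6": "C"}
)

# Under probability-proportional-to-size sampling (sampling records at random),
# each sampled cluster's weight is the inverse of its size.
cluster_sizes = toy_reference.groupby(toy_reference).size()
weights = 1 / cluster_sizes
print(weights)
# A    0.500000
# B    1.000000
# C    0.333333
# dtype: float64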

Available Performance Metric Estimators#

The ER-Evaluation package provides functions to estimate various performance metrics and summary statistics, such as B-cubed precision and recall, cluster precision and recall, F-scores, matching rate, homonymy rate, and name variation rate. Refer to the module documentation for a comprehensive list of available functions.

Estimating Metrics#

To estimate performance metrics, use the functions provided by the module. These functions accept a predicted disambiguation, a set of ground truth clusters, and a set of cluster sampling weights as input. They return an estimate of the performance metric along with an estimate of the standard deviation.

Here’s an example using the er_evaluation.pairwise_precision_estimator() function, continuing with PatentsView’s disambiguation of a subset of inventor names on granted patents as our running example. The “reference” benchmark data is a sample of inventor clusters, where the sampling probabilities were chosen to be proportional to cluster size.

import pandas as pd
import er_evaluation as ee

predictions, reference = ee.load_pv_disambiguations()
prediction = predictions[pd.Timestamp('2017-08-08')]

ee.pairwise_precision_estimator(prediction, reference, weights="cluster_size")
(0.5682905929738009, 0.10762619520264015)

Note

Our performance metric estimators use information about the whole disambiguation, including elements that are not part of the benchmark dataset. As such, the entire predicted disambiguation should be provided as a first argument to estimators. Providing only a subset of the predicted disambiguation, such as the subset that intersects the benchmark data, will lead to erroneous results.
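To make this concrete, here is a sketch contrasting the two calls; the subsetting line is purely illustrative (it assumes, as above, that disambiguations are pandas Series indexed by record or mention ID):

# Correct: pass the full predicted disambiguation, including records that do
# not appear in the benchmark sample.
ee.pairwise_precision_estimator(prediction, reference, weights="cluster_size")

# Biased (do not do this): restricting the prediction to records that also
# appear in the benchmark discards information the estimator needs.
prediction_subset = prediction[prediction.index.isin(reference.index)]
ee.pairwise_precision_estimator(prediction_subset, reference, weights="cluster_size")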

Estimating Multiple Metrics#

You can use the er_evaluation.estimates_table() function to estimate multiple metrics at once for multiple disambiguations:

ee.estimates_table(predictions, samples_weights={"Sample 1": {"sample":reference, "weights":"cluster_size"}})
     prediction sample_weights           estimator     value       std
0    2017-08-08       Sample 1  pairwise_precision  0.568291  0.107626
1    2017-08-08       Sample 1     pairwise_recall  0.961092  0.009080
2    2017-08-08       Sample 1          pairwise_f  0.720135  0.083511
3    2017-08-08       Sample 1   cluster_precision  1.590564  0.127525
4    2017-08-08       Sample 1      cluster_recall  0.816566  0.026620
..          ...            ...                 ...       ...       ...
115  2022-06-30       Sample 1   cluster_precision  1.956251  0.164533
116  2022-06-30       Sample 1      cluster_recall  0.774960  0.029681
117  2022-06-30       Sample 1           cluster_f  1.111095  0.052049
118  2022-06-30       Sample 1   b_cubed_precision  0.895336  0.019149
119  2022-06-30       Sample 1      b_cubed_recall  0.985940  0.004212

120 rows × 5 columns

Note

The cluster_precision estimates are higher than 1.0 in this example, which is not a possible value for the true metric. This is because the predictions used in this example are only a subset of the entire disambiguation. As previously noted, providing only a subset causes some of the estimates to be significantly biased. In a real application, the entire disambiguation (here, membership vectors with millions of rows) should be provided rather than the toy subset.
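Since the table above is a regular pandas DataFrame, it can be reshaped for downstream analysis. For example, a possible pivot to track selected metrics across prediction dates (the column selection below is illustrative):

table = ee.estimates_table(
    predictions,
    samples_weights={"Sample 1": {"sample": reference, "weights": "cluster_size"}},
)

# One row per prediction date, one column per estimated metric.
by_date = table.pivot(index="prediction", columns="estimator", values="value")
print(by_date[["pairwise_precision", "pairwise_recall", "pairwise_f"]])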

Plots#

Helper visualization functions are available in the module:

ee.plot_estimates(predictions, {"sample":reference, "weights":"cluster_size"})

The plot shows the performance metric estimates as well as error bars corresponding to +/- 1 estimated standard deviation.
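If you want to save or further customize the figure, a minimal sketch follows; it assumes plot_estimates() returns a plotly-style figure object, which is an assumption here rather than something stated above.

fig = ee.plot_estimates(predictions, {"sample": reference, "weights": "cluster_size"})

# Assumption: `fig` behaves like a plotly Figure, so it can be displayed or exported.
fig.show()
fig.write_html("estimates.html")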