Monitoring of Summary Statistics#

Once you have represented predicted disambiguations as membership vectors, you are ready for the first step of evaluation: monitoring summary statistics.

Summary statistics give you insight into the behavior of your entity resolution system, allowing you to monitor how it evolves over time and to compare different disambiguation results.

In this guide, we consider a subset of PatentsView’s disambiguation of inventor names on granted patents. The history of PatentsView’s disambiguation over time, indexed by pandas Timestamps and with values corresponding to the disambiguation of inventor mentions (represented as membership vectors), is available as a toy dataset in the package:

import pandas as pd
import er_evaluation as ee

predictions, _ = ee.load_pv_disambiguations()
predictions[pd.Timestamp('2017-08-08')]  # First available disambiguation
mention_id
US5828387-4     4661703-2
US8031420-4     6219192-3
US10692631-0         None
US7976910-2     4742508-1
US5073693-0     5073693-1
                  ...    
US4793455-2     4383408-2
US4673655-2     4673655-3
US9740948-0          None
US10178129-2         None
US6762742-3     6762742-4
Name: disamb_inventor_id_20170808, Length: 133541, dtype: object
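
If you are constructing membership vectors for your own disambiguations, note that a membership vector is simply a pandas Series mapping each mention identifier (the index) to its predicted cluster identifier, with None (or NaN) where no cluster assignment is available. A minimal, hypothetical example (the mention and cluster identifiers below are made up for illustration):

my_prediction = pd.Series(
    {
        "mention-1": "cluster-a",  # hypothetical mention and cluster identifiers
        "mention-2": "cluster-a",
        "mention-3": None,         # no cluster assignment available for this mention
    },
    name="my_disambiguation",
)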

Simple Statistics#

The most basic summary statistics we can compute are the following:

  • Average cluster size

  • Number of clusters

  • Number of distinct cluster sizes (labeled “H0”)

  • Matching rate: Proportion of elements that are matched to at least one other element.

Additionally, Hill numbers of the cluster size distribution can be computed: the exponentiated Shannon entropy is labeled “H1”, and the inverse Simpson index is labeled “H2”.

This default set of summary statistics can be computed for a single disambiguation as follows:

ee.summary_statistics(predictions[pd.Timestamp('2017-08-08')])
{'number_of_clusters': 11264,
 'average_cluster_size': 7.523881392045454,
 'matching_rate': 0.939586307803042,
 'H0': 171,
 'H1': 9.595842527724987,
 'H2': 4.081892065967765}
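
For intuition, these statistics correspond roughly to the following hand computation from a membership vector. This is a hypothetical sketch of the definitions above, not the package’s implementation (in particular, the handling of None values may differ slightly); use ee.summary_statistics() in practice.

import numpy as np

def sketch_summary(membership):
    matched = membership.dropna()              # ignore mentions without a cluster assignment
    cluster_sizes = matched.value_counts()     # size of each predicted cluster
    p = cluster_sizes.value_counts(normalize=True)  # proportion of clusters of each size

    return {
        "number_of_clusters": len(cluster_sizes),
        "average_cluster_size": cluster_sizes.mean(),
        # Proportion of mentions sharing a cluster with at least one other mention.
        "matching_rate": (matched.map(cluster_sizes) > 1).mean(),
        "H0": len(p),                           # number of distinct cluster sizes
        "H1": np.exp(-(p * np.log(p)).sum()),   # exponentiated Shannon entropy
        "H2": 1 / (p ** 2).sum(),               # inverse Simpson index
    }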

Name-Based Statistics#

In addition to the above, we can use element names to describe when elements with the same name are not clustered together, and when elements with different names are clustered together. The statistics used to quantify this are:

  • Homonymy rate: The proportion of clusters that share a name with another cluster.

  • Name variation rate: The proportion of clusters that contain more than one distinct name.

These can be computed as follows, using inventor last names in our example:

pv_data = ee.load_pv_data()
names = pv_data.set_index("mention_id")["raw_inventor_name_last"]

ee.summary_statistics(predictions[pd.Timestamp('2017-08-08')], names=names)
{'number_of_clusters': 11264,
 'average_cluster_size': 7.523881392045454,
 'matching_rate': 0.939586307803042,
 'H0': 171,
 'H1': 9.595842527724987,
 'H2': 4.081892065967765,
 'homonymy_rate': 0.9921875,
 'name_variation_rate': 0.0035511363636363635}
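
Again for intuition, the two name-based rates can be sketched directly with pandas. This is a hypothetical illustration of the definitions above, not the package’s implementation (edge cases such as missing names may be handled differently):

def sketch_name_rates(membership, names):
    matched = membership.dropna()
    df = pd.DataFrame({"cluster": matched, "name": names.reindex(matched.index)})

    # Distinct names appearing within each predicted cluster.
    names_per_cluster = df.groupby("cluster")["name"].agg(lambda s: set(s.dropna()))

    # Name variation rate: clusters containing more than one distinct name.
    name_variation_rate = (names_per_cluster.apply(len) > 1).mean()

    # Homonymy rate: clusters sharing at least one name with another cluster.
    clusters_per_name = df.groupby("name")["cluster"].nunique()
    shares_a_name = names_per_cluster.apply(
        lambda cluster_names: any(clusters_per_name.get(n, 0) > 1 for n in cluster_names)
    )
    homonymy_rate = shares_a_name.mean()

    return {"homonymy_rate": homonymy_rate, "name_variation_rate": name_variation_rate}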

Plotting#

When dealing with a dictionary of disambiguations, you can use the er_evaluation.plot_summaries() function from the er_evaluation.plots module to plot the summary statistics as a time series:

ee.plot_summaries(predictions, names=names)
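
Assuming plot_summaries() returns a Plotly figure like the other plotting helpers shown below, you can capture the figure to customize or export it, for example:

fig = ee.plot_summaries(predictions, names=names)
fig.update_layout(title="PatentsView inventor disambiguation summaries")  # hypothetical title
fig.write_html("summary_statistics.html")  # save an interactive copy of the figure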

You can also plot the cluster size distribution for a given disambiguation:

fig = ee.plot_cluster_sizes_distribution(predictions[pd.Timestamp('2017-08-08')])
fig.update_xaxes(range=[0, 50])

And you can plot the Hill numbers curve of the cluster size distribution:

ee.plot_entropy_curve(predictions[pd.Timestamp('2017-08-08')])

At q=0, the Hill number represents the number of distinct cluster sizes. At q=1, it is the exponentiated Shannon entropy, and at q=2, it is the inverse of the Simpson diversity index.
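
More generally, the Hill number of order q is (sum_k p_k**q) ** (1 / (1 - q)), where p_k is taken here to be the proportion of clusters having size k, and the q = 0 and q = 1 cases are obtained as limits. A hypothetical sketch (not the package’s implementation, and using numpy as imported above):

def sketch_hill_number(membership, q):
    cluster_sizes = membership.dropna().value_counts()          # size of each predicted cluster
    p = cluster_sizes.value_counts(normalize=True).to_numpy()   # proportion of clusters per size

    if q == 0:
        return len(p)                           # richness: number of distinct cluster sizes
    if q == 1:
        return np.exp(-np.sum(p * np.log(p)))   # limit at q=1: exponentiated Shannon entropy
    return np.sum(p ** q) ** (1 / (1 - q))      # e.g., q=2 gives the inverse Simpson index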

Comparison Statistics#

To compare disambiguations with one another, we recommend using well-known metrics such as precision, recall, and F-scores (including pairwise, cluster, and b-cubed variants) from the er_evaluation.metrics module.

For example, here is how to compute the b-cubed precision for two disambiguations:

ee.b_cubed_precision(predictions[pd.Timestamp('2017-08-08')], predictions[pd.Timestamp('2022-06-30')])
0.7738213703566124

To compute metrics for all pairs of disambiguations taken from two dictionaries, you can use er_evaluation.metrics_table(). In this case, we compare two disambiguations labeled “A” and “B”.

from er_evaluation.metrics import metrics_table

metrics_table(
    predictions={"A": predictions[pd.Timestamp('2017-08-08')]},
    references={"B": predictions[pd.Timestamp('2022-06-30')]},
)
  prediction reference              metric     value
0          A         B  Pairwise Precision  0.324046
1          A         B     Pairwise Recall  0.880232
2          A         B         Pairwise F1  0.473704
3          A         B   B-Cubed Precision  0.773821
4          A         B      B-Cubed Recall  0.878674
5          A         B          B-Cubed F1  0.822921
6          A         B   Cluster Precision  0.420632
7          A         B      Cluster Recall  0.492464
8          A         B          Cluster F1  0.453723
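
Because the predictions argument accepts a dictionary, the same function can compare several disambiguations against a single reference at once. For instance, assuming the history dictionary can be passed directly, you could compare every disambiguation against the most recent one:

metrics_table(
    predictions=predictions,  # the full history of disambiguations
    references={"latest": predictions[pd.Timestamp('2022-06-30')]},
)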

Plots#

You can plot a heatmap representation of metrics for all pairs of disambiguations using the er_evaluation.plot_comparison() function. You can change the metrics using the optional metrics argument.

ee.plot_comparison(predictions)
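
For example, assuming the metrics argument accepts a mapping from display names to metric functions, you could restrict the comparison to a single metric:

ee.plot_comparison(predictions, metrics={"B-Cubed Precision": ee.b_cubed_precision})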

The heatmap helps identify changes in the disambiguation results. For instance, you can notice significant changes to the disambiguation algorithm in December 2017 and December 2020.