load_pv_disambiguations

er_evaluation.datasets.load_pv_disambiguations()

Load the reference disambiguation and the predicted disambiguations for the PatentsView dataset.

See er_evaluation.datasets.load_pv_data() for more information on the PatentsView dataset.

The reference disambiguation corresponds to Binette’s 2022 inventors benchmark. It does not cover the entire PatentsView dataset: it is a sample of 400 inventor clusters, drawn with sampling probabilities proportional to cluster size.
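As a rough illustration of what sampling "with probabilities proportional to cluster size" means, the sketch below picks a mention uniformly at random and returns the cluster it belongs to. This is not necessarily how the benchmark itself was drawn; the toy `membership` mapping and the helper `sample_clusters_by_size` are hypothetical names introduced only for this example.

```python
import random
from collections import Counter

# Toy membership mapping (hypothetical data): mention ID -> cluster ID.
# Cluster "A" has 4 mentions, "B" has 2, and "C" has 1.
membership = {
    "m1": "A", "m2": "A", "m3": "A", "m4": "A",
    "m5": "B", "m6": "B",
    "m7": "C",
}


def sample_clusters_by_size(membership, n, seed=0):
    """Draw n clusters, with replacement, proportionally to cluster size."""
    rng = random.Random(seed)
    mentions = sorted(membership)
    # Choosing a mention uniformly at random and returning its cluster
    # selects each cluster with probability (cluster size) / (number of mentions).
    return [membership[rng.choice(mentions)] for _ in range(n)]


draws = sample_clusters_by_size(membership, n=7000)
frequencies = Counter(draws)
print(frequencies.most_common())
```

Larger clusters are over-represented in such a sample, which is presumably why the example further down passes weights="cluster_size" to the estimator.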

Predicted disambiguations correspond to inventor disambiguations released by PatentsView between 2017 and 2022. The data has been restricted to inventor mentions for which the last name and first two letters of the first name match those found in Binette’s 2022 inventors benchmark.
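The name-based restriction described above can be sketched as follows. This is only an illustration on made-up records: the field names, the `name_key` helper, and the case-insensitive comparison are assumptions, not the preprocessing actually used to build the dataset.

```python
def name_key(first_name, last_name):
    """Blocking key: full last name plus the first two letters of the first name."""
    # Assumption: comparison ignores case and surrounding whitespace.
    return (last_name.strip().lower(), first_name.strip().lower()[:2])


# Hypothetical (first name, last name) pairs found in the benchmark.
benchmark_names = [("John", "Smith"), ("Maria", "Garcia")]

# Hypothetical inventor mentions from a predicted disambiguation.
mentions = [
    {"mention_id": "p1-0", "first_name": "Jonathan", "last_name": "Smith"},
    {"mention_id": "p2-1", "first_name": "Jane", "last_name": "Smith"},
    {"mention_id": "p3-0", "first_name": "Mario", "last_name": "Garcia"},
    {"mention_id": "p4-2", "first_name": "Maria", "last_name": "Lopez"},
]

allowed = {name_key(first, last) for first, last in benchmark_names}

# Keep only the mentions whose key also occurs in the benchmark.
restricted = [
    m for m in mentions
    if name_key(m["first_name"], m["last_name"]) in allowed
]
restricted_ids = [m["mention_id"] for m in restricted]
print(restricted_ids)
```

Here "Jonathan Smith" and "Mario Garcia" are kept because they share a last name and a two-letter first-name prefix with a benchmark name, while "Jane Smith" and "Maria Lopez" are dropped.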

Returns: tuple (predictions, reference), where reference is the ground-truth disambiguation and predictions is a dictionary of predicted disambiguations keyed by release date (a pandas Timestamp).

Examples

Estimate pairwise precision for PatentsView’s 2021/12/30 disambiguation:

>>> import pandas as pd
>>> from er_evaluation.datasets import load_pv_disambiguations
>>> from er_evaluation.estimators import pairwise_precision_estimator
>>> predictions, reference = load_pv_disambiguations()
>>> # Select the disambiguation released on 2021-12-30.
>>> prediction = predictions[pd.Timestamp('2021-12-30 00:00:00')]
>>> pairwise_precision_estimator(prediction, reference, weights="cluster_size")
(0.9131787709880134, 0.018619907220335144)
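For intuition about the quantity being estimated, pairwise precision is the fraction of mention pairs linked by the prediction that are also linked in the reference. The sketch below computes it directly on toy, fully labelled data. It is not the estimator used above, which works from a sample of reference clusters; `linked_pairs` and `pairwise_precision` are hypothetical helpers written for this example.

```python
from itertools import combinations


def linked_pairs(membership):
    """Return the set of unordered mention pairs placed in the same cluster."""
    clusters = {}
    for mention, cluster in membership.items():
        clusters.setdefault(cluster, []).append(mention)
    pairs = set()
    for members in clusters.values():
        pairs.update(frozenset(pair) for pair in combinations(sorted(members), 2))
    return pairs


def pairwise_precision(prediction, reference):
    """Share of predicted links that are also links in the reference."""
    predicted = linked_pairs(prediction)
    if not predicted:
        # No predicted links: precision is treated as undefined in this sketch.
        return float("nan")
    return len(predicted & linked_pairs(reference)) / len(predicted)


# Toy data (hypothetical): mention ID -> cluster ID.
prediction = {"m1": 1, "m2": 1, "m3": 1, "m4": 2}
reference = {"m1": "a", "m2": "a", "m3": "b", "m4": "b"}

precision = pairwise_precision(prediction, reference)
print(precision)
```

The prediction links three pairs (m1-m2, m1-m3, m2-m3), of which only m1-m2 is also linked in the reference, so the precision on this toy data is one third.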

References

Binette, Olivier, Sarvo Madhavan, Jack Butler, Beth Anne Card, Emily Melluso and Christina Jones. 2023. PatentsView-Evaluation: Evaluation Datasets and Tools to Advance Research on Inventor Name Disambiguation. arXiv e-prints: arXiv:2301.03591. Available online at https://arxiv.org/abs/2301.03591
Binette, Olivier, Sokhna A York, Emma Hickerson, Youngsoo Baek, Sarvo Madhavan and Christina Jones. 2022. Estimating the Performance of Entity Resolution Algorithms: Lessons Learned Through PatentsView.org. arXiv e-prints: arXiv:2210.01230. Available online at https://arxiv.org/abs/2210.01230