load_pv_data
- er_evaluation.load_pv_data()
Load PatentsView dataset.
This is based on a subset of the “g_inventor_not_disambiguated.tsv” file from PatentsView’s bulk data downloads. The dataset has been subsetted to only contain inventor mentions for blocks which intersect Binette’s 2022 inventors benchmark [1]. Following PatentsView’s disambiguation methodology, a block is defined by an inventor mention’s full last name and the first two letters of its first name. Therefore, this dataset contains all inventor mentions for which the last name and first two letters of the first name match those found in Binette’s 2022 inventors benchmark.
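The blocking rule described above can be illustrated with a minimal sketch. The helper `block_key`, the toy names, and the record fields below are hypothetical and do not come from er_evaluation or PatentsView. Lowercasing and whitespace stripping are assumptions made for the example, not a statement of how PatentsView normalizes names.

```python
def block_key(first_name: str, last_name: str) -> tuple:
    """Return (full last name, first two letters of the first name).

    Lowercasing and stripping are assumptions of this sketch, not
    PatentsView's documented normalization.
    """
    return (last_name.strip().lower(), first_name.strip().lower()[:2])


# Hypothetical benchmark inventors, given as (first name, last name).
benchmark = [("Maria", "Lopez"), ("John", "Smith")]

# Hypothetical inventor mentions; the field names are made up for illustration.
mentions = [
    {"mention_id": "m1", "first": "Marco", "last": "Lopez"},
    {"mention_id": "m2", "first": "Jo", "last": "Smith"},
    {"mention_id": "m3", "first": "Jane", "last": "Smith"},
    {"mention_id": "m4", "first": "Maria", "last": "Lopes"},
]

# Blocks that intersect the benchmark.
benchmark_blocks = {block_key(first, last) for first, last in benchmark}

# Keep every mention whose block appears in the benchmark.
kept = [
    m["mention_id"]
    for m in mentions
    if block_key(m["first"], m["last"]) in benchmark_blocks
]
print(kept)
```

In this toy data, "Marco Lopez" is kept because it shares the block of "Maria Lopez" (same last name, first name starting with "ma"), while "Jane Smith" and "Maria Lopes" fall into blocks that no benchmark inventor occupies.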
A number of features have been added, such as inventor mention name, location, patent title, abstract, filing date, assignees, attorneys, CPC codes, and co-inventors list. The code used to produce this dataset is located in “er_evaluation/datasets/raw_data/patentsview/reproduce.ipynb”.
Refer to er_evaluation.datasets.load_pv_disambiguations() to access Binette’s 2022 inventors benchmark and PatentsView’s predicted disambiguations.
Returns:
pandas DataFrame
References
Binette, Olivier, Sarvo Madhavan, Jack Butler, Beth Anne Card, Emily Melluso, and Christina Jones. 2023. PatentsView-Evaluation: Evaluation Datasets and Tools to Advance Research on Inventor Name Disambiguation. arXiv e-prints: arXiv:2301.03591. Available online at https://arxiv.org/abs/2301.03591
Binette, Olivier, Sokhna A York, Emma Hickerson, Youngsoo Baek, Sarvo Madhavan, and Christina Jones. 2022. Estimating the Performance of Entity Resolution Algorithms: Lessons Learned Through PatentsView.org. arXiv e-prints: arXiv:2210.01230.