{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Estimating Performance\n", "\n", "After {doc}`preparing your data <01-dataprep>` and {doc}`monitoring summary statistics <02-summary_statistics>`, the next step in evaluating your entity resolution (ER) system is estimating performance metrics." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "```{note}\n", "Estimating these metrics is not as straightforward as one might assume. Naive computation of performance metrics using benchmark datasets can lead to biased and over-optimistic results. This is due to the non-linear scaling of entity resolution: while it might be easy to disambiguate a small benchmark dataset, the complexity of the problem grows quadratically with the dataset size. Therefore, to obtain reliable and representative performance estimators, it is crucial to consider the entire population of interest and account for the sampling processes and biases.\n", "```" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "In this guide, we will discuss how to estimate performance metrics using the ER-Evaluation package, emphasizing the importance of applying sampling weights alongside predictions and benchmark data.\n", "\n", "To estimate performance metrics, you will need:\n", "\n", "- **Predicted clusters:** Predicted entity clusters for a set of records or entity mentions, usually the main output of an ER system.\n", "- **Reference/benchmark data:** A trusted benchmark dataset or reference disambiguation for \"ground truth\" data, typically containing 200 to 400 disambiguated entities.\n", "- **Sampling weights:** Weights applied to account for the sampling process and biases, such as \"uniform\" weights or \"cluster_size\" weights for sampling with probability proportional to cluster size." 
] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "```{note}\n", "Sampling weights correct for biases in performance metric estimation when working with sampled data. They can be derived from the sampling design and selection probabilities, or estimated using propensity scoring techniques. For instance, under sampling with probability proportional to cluster size (for example, when clusters are selected by sampling records at random), each sampled cluster is weighted by the inverse of its size. Incorporating sampling weights helps keep performance estimates accurate when the reference dataset is not representative or when selection probabilities are unequal.\n", "```" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Available Performance Metric Estimators\n", "\n", "The ER-Evaluation package provides functions to estimate various performance metrics and summary statistics, such as B-cubed precision and recall, cluster precision and recall, F-scores, matching rate, homonymy rate, and name variation rate. Refer to the {py:mod}`er_evaluation.estimators` module documentation for a comprehensive list of available functions.\n", "\n", "## Estimating Metrics\n", "\n", "To estimate performance metrics, use the functions provided by the {py:mod}`er_evaluation.estimators` module. These functions accept a predicted disambiguation, a set of ground truth clusters, and a set of cluster sampling weights as input. They return a tuple containing an estimate of the performance metric and an estimate of its standard deviation.\n", "\n", "Here's an example using the {py:func}`er_evaluation.pairwise_precision_estimator` function, continuing with PatentsView's disambiguation of a subset of inventor names on granted patents as our running example. The \"reference\" benchmark data is a sample of inventor clusters, where the sampling probabilities were chosen to be proportional to cluster size." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.5682905929738009, 0.10762619520264015)" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import er_evaluation as ee\n", "\n", "predictions, reference = ee.load_pv_disambiguations()\n", "prediction = predictions[pd.Timestamp('2017-08-08')]\n", "\n", "ee.pairwise_precision_estimator(prediction, reference, weights=\"cluster_size\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "```{note}\n", "Our performance metric estimators use information about the whole disambiguation, including elements that are not part of the benchmark dataset. As such, the entire predicted disambiguation should be provided as a first argument to estimators. Providing only a subset of the predicted disambiguation, such as the subset that intersects the benchmark data, will lead to erroneous results.\n", "```" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Estimating Multiple Metrics\n", "\n", "You can use the {py:func}`er_evaluation.estimates_table` function to estimate multiple metrics at once for multiple disambiguations:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n", " <tr style=\"text-align: right;\"><th></th><th>prediction</th><th>sample_weights</th><th>estimator</th><th>value</th><th>std</th></tr>\n", " </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>2017-08-08</td><td>Sample 1</td><td>pairwise_precision</td><td>0.568291</td><td>0.107626</td></tr>\n",
" <tr><th>1</th><td>2017-08-08</td><td>Sample 1</td><td>pairwise_recall</td><td>0.961092</td><td>0.009080</td></tr>\n",
" <tr><th>2</th><td>2017-08-08</td><td>Sample 1</td><td>pairwise_f</td><td>0.720135</td><td>0.083511</td></tr>\n",
" <tr><th>3</th><td>2017-08-08</td><td>Sample 1</td><td>cluster_precision</td><td>1.590564</td><td>0.127525</td></tr>\n",
" <tr><th>4</th><td>2017-08-08</td><td>Sample 1</td><td>cluster_recall</td><td>0.816566</td><td>0.026620</td></tr>\n",
" <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
" <tr><th>115</th><td>2022-06-30</td><td>Sample 1</td><td>cluster_precision</td><td>1.956251</td><td>0.164533</td></tr>\n",
" <tr><th>116</th><td>2022-06-30</td><td>Sample 1</td><td>cluster_recall</td><td>0.774960</td><td>0.029681</td></tr>\n",
" <tr><th>117</th><td>2022-06-30</td><td>Sample 1</td><td>cluster_f</td><td>1.111095</td><td>0.052049</td></tr>\n",
" <tr><th>118</th><td>2022-06-30</td><td>Sample 1</td><td>b_cubed_precision</td><td>0.895336</td><td>0.019149</td></tr>\n",
" <tr><th>119</th><td>2022-06-30</td><td>Sample 1</td><td>b_cubed_recall</td><td>0.985940</td><td>0.004212</td></tr>\n",
" </tbody>\n", "</table>\n", "<p>120 rows × 5 columns</p>\n", "</div>
" ], "text/plain": [ " prediction sample_weights estimator value std\n", "0 2017-08-08 Sample 1 pairwise_precision 0.568291 0.107626\n", "1 2017-08-08 Sample 1 pairwise_recall 0.961092 0.009080\n", "2 2017-08-08 Sample 1 pairwise_f 0.720135 0.083511\n", "3 2017-08-08 Sample 1 cluster_precision 1.590564 0.127525\n", "4 2017-08-08 Sample 1 cluster_recall 0.816566 0.026620\n", ".. ... ... ... ... ...\n", "115 2022-06-30 Sample 1 cluster_precision 1.956251 0.164533\n", "116 2022-06-30 Sample 1 cluster_recall 0.774960 0.029681\n", "117 2022-06-30 Sample 1 cluster_f 1.111095 0.052049\n", "118 2022-06-30 Sample 1 b_cubed_precision 0.895336 0.019149\n", "119 2022-06-30 Sample 1 b_cubed_recall 0.985940 0.004212\n", "\n", "[120 rows x 5 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ee.estimates_table(predictions, samples_weights={\"Sample 1\": {\"sample\":reference, \"weights\":\"cluster_size\"}})" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "```{note}\n", "Note that the cluster_precision estimates are higher than 1.0 in this example, which is not a realistic estimate. This is due to the fact that the predictions used in this example are only a subset of the entire disambiguation. As previously noted, this causes some of the estimates to be significantly biased. 
In a real application, the full disambiguations (here, membership vectors with millions of rows) should be provided rather than the toy subsets.\n", "```" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Plots\n", "\n", "Helper visualization functions are available in the {py:mod}`er_evaluation.plots` module:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "config": { "plotlyServerURL": "https://plot.ly" }, "data": [ { "error_y": { "array": [ 0.2152523904052803, 0.2284648845356206, 0.05396008154266888, 0.06762140841603512, 0.0641799010705886, 0.07095188803819781, 0.06855633851104145, 0.08001541453302838, 0.07513925473462715, 0.07646197630520299, 0.0794556989462446, 0.08091626415776206, 0.03812943865295822, 0.03723981444067029, 0.03477773119450892 ] }, "hovertemplate": "estimator=Pairwise precision
<br>prediction=%{x}<br>
value=%{y}", "legendgroup": "Pairwise precision", "line": { "color": "#636efa", "dash": "solid", "shape": "spline" }, "marker": { "symbol": "circle" }, "mode": "lines+markers", "name": "Pairwise precision", "orientation": "v", "showlegend": true, "type": "scatter", "x": [ "2017-08-08T00:00:00", "2017-10-03T00:00:00", "2017-12-26T00:00:00", "2018-05-28T00:00:00", "2018-11-27T00:00:00", "2019-03-12T00:00:00", "2019-08-20T00:00:00", "2019-10-08T00:00:00", "2019-12-31T00:00:00", "2020-03-31T00:00:00", "2020-06-30T00:00:00", "2020-09-29T00:00:00", "2020-12-29T00:00:00", "2021-12-30T00:00:00", "2022-06-30T00:00:00" ], "xaxis": "x", "y": [ 0.5682905929738009, 0.5614532661533171, 0.9177507115333738, 0.878487421180332, 0.876664044828819, 0.871523346373805, 0.8563773235431499, 0.8513799572091044, 0.8653184769196983, 0.8675474079481227, 0.8589447378261852, 0.8637717722052076, 0.9087541920717165, 0.9131787709880134, 0.8833020990513186 ], "yaxis": "y" }, { "error_y": { "array": [ 0.01815964133564645, 0.01895614311202958, 0.02690691861377593, 0.023424393355402976, 0.018081110982682582, 0.017499659502393926, 0.016901175387470806, 0.016721816409361122, 0.015236804944936547, 0.016349374562466146, 0.02184091145328047, 0.014231509859813599, 0.014585344985919273, 0.017568488829289367, 0.014474559485576568 ] }, "hovertemplate": "estimator=Pairwise recall
<br>prediction=%{x}<br>
value=%{y}", "legendgroup": "Pairwise recall", "line": { "color": "#EF553B", "dash": "solid", "shape": "spline" }, "marker": { "symbol": "diamond" }, "mode": "lines+markers", "name": "Pairwise recall", "orientation": "v", "showlegend": true, "type": "scatter", "x": [ "2017-08-08T00:00:00", "2017-10-03T00:00:00", "2017-12-26T00:00:00", "2018-05-28T00:00:00", "2018-11-27T00:00:00", "2019-03-12T00:00:00", "2019-08-20T00:00:00", "2019-10-08T00:00:00", "2019-12-31T00:00:00", "2020-03-31T00:00:00", "2020-06-30T00:00:00", "2020-09-29T00:00:00", "2020-12-29T00:00:00", "2021-12-30T00:00:00", "2022-06-30T00:00:00" ], "xaxis": "x", "y": [ 0.9610920825379785, 0.9541269644954581, 0.9166311090252167, 0.942536009220433, 0.9570321976537539, 0.9601746830047234, 0.962040886261, 0.9627747512149712, 0.9653411752888007, 0.9622775797659693, 0.9514561547643773, 0.8966542126263876, 0.9761365985333893, 0.9622075140463208, 0.9770477584915793 ], "yaxis": "y" } ], "layout": { "legend": { "title": { "text": "estimator" }, "tracegroupgap": 0 }, "margin": { "t": 60 }, "template": { "data": { "bar": [ { "error_x": { "color": "#2a3f5f" }, "error_y": { "color": "#2a3f5f" }, "marker": { "line": { "color": "#E5ECF6", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "bar" } ], "barpolar": [ { "marker": { "line": { "color": "#E5ECF6", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "barpolar" } ], "carpet": [ { "aaxis": { "endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f" }, "baxis": { "endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f" }, "type": "carpet" } ], "choropleth": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "choropleth" } ], "contour": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, 
"#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "contour" } ], "contourcarpet": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "contourcarpet" } ], "heatmap": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "heatmap" } ], "heatmapgl": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "heatmapgl" } ], "histogram": [ { "marker": { "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "histogram" } ], "histogram2d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2d" } ], "histogram2dcontour": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, 
"#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2dcontour" } ], "mesh3d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "mesh3d" } ], "parcoords": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "parcoords" } ], "pie": [ { "automargin": true, "type": "pie" } ], "scatter": [ { "fillpattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 }, "type": "scatter" } ], "scatter3d": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatter3d" } ], "scattercarpet": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattercarpet" } ], "scattergeo": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattergeo" } ], "scattergl": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattergl" } ], "scattermapbox": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattermapbox" } ], "scatterpolar": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterpolar" } ], "scatterpolargl": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterpolargl" } ], "scatterternary": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterternary" } ], "surface": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "surface" } ], "table": [ { "cells": { "fill": { "color": "#EBF0F8" }, 
"line": { "color": "white" } }, "header": { "fill": { "color": "#C8D4E3" }, "line": { "color": "white" } }, "type": "table" } ] }, "layout": { "annotationdefaults": { "arrowcolor": "#2a3f5f", "arrowhead": 0, "arrowwidth": 1 }, "autotypenumbers": "strict", "coloraxis": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "colorscale": { "diverging": [ [ 0, "#8e0152" ], [ 0.1, "#c51b7d" ], [ 0.2, "#de77ae" ], [ 0.3, "#f1b6da" ], [ 0.4, "#fde0ef" ], [ 0.5, "#f7f7f7" ], [ 0.6, "#e6f5d0" ], [ 0.7, "#b8e186" ], [ 0.8, "#7fbc41" ], [ 0.9, "#4d9221" ], [ 1, "#276419" ] ], "sequential": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "sequentialminus": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ] }, "colorway": [ "#636efa", "#EF553B", "#00cc96", "#ab63fa", "#FFA15A", "#19d3f3", "#FF6692", "#B6E880", "#FF97FF", "#FECB52" ], "font": { "color": "#2a3f5f" }, "geo": { "bgcolor": "white", "lakecolor": "white", "landcolor": "#E5ECF6", "showlakes": true, "showland": true, "subunitcolor": "white" }, "hoverlabel": { "align": "left" }, "hovermode": "closest", "mapbox": { "style": "light" }, "paper_bgcolor": "white", "plot_bgcolor": "#E5ECF6", "polar": { "angularaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "bgcolor": "#E5ECF6", "radialaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" } }, "scene": { "xaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", 
"showbackground": true, "ticks": "", "zerolinecolor": "white" }, "yaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" }, "zaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" } }, "shapedefaults": { "line": { "color": "#2a3f5f" } }, "ternary": { "aaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "baxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "bgcolor": "#E5ECF6", "caxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" } }, "title": { "x": 0.05 }, "xaxis": { "automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": { "standoff": 15 }, "zerolinecolor": "white", "zerolinewidth": 2 }, "yaxis": { "automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": { "standoff": 15 }, "zerolinecolor": "white", "zerolinewidth": 2 } } }, "title": { "text": "Performance estimates" }, "xaxis": { "anchor": "y", "domain": [ 0, 1 ], "title": { "text": "prediction" } }, "yaxis": { "anchor": "x", "domain": [ 0, 1 ], "title": { "text": "value" } } } } }, "metadata": {}, "output_type": "display_data" } ], "source": [ "ee.plot_estimates(predictions, {\"sample\":reference, \"weights\":\"cluster_size\"})" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The plot shows the performance metric estimates as well as error bars corresponding to +/- 1 estimated standard deviation." 
] } ], "metadata": { "kernelspec": { "display_name": "er-evaluation", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }