Error Analysis#

Error analysis is the process of analyzing errors associated with a given set of predictions. You can investigate different types of errors, their frequencies, and their relationships with explanatory features of interest.

To characterize errors, we define two types of error metrics:

  1. Cluster-level error metrics: Quantify errors associated with each cluster.

  2. Record-level error metrics: Quantify errors associated with each record.

We recommend using cluster-level error metrics, as they are easier to interpret and relate to cluster-level features. However, some advanced analyses require using record-level error metrics.

Cluster-level Error Metrics#

Cluster-level error metrics quantify the errors associated with each ground truth cluster. We provide several such metrics in the er_evaluation.error_analysis module, including:

  • Error Indicator: This metric indicates whether there is an error associated with each true cluster. Here, an error means that no predicted cluster matches the true cluster. In other words, an error indicator value of 1 means that the true cluster is not in the disambiguation, while a value of 0 means that the true cluster was correctly recovered, i.e., it is part of the disambiguation.

  • Expected Extra Elements: This metric represents the expected number of extraneous elements for each true cluster. In other words, it calculates the average number of erroneous links to a random record in a true cluster.

  • Expected Relative Extra Elements: This metric represents the expected relative number of extraneous elements for each true cluster. It calculates the average relative number of erroneous links to a random record in a true cluster.

  • Expected Missing Elements: This metric represents the expected number of missing elements for each true cluster. It calculates the average number of elements that are missing from the predicted clusters compared to the true clusters.

  • Expected Relative Missing Elements: This metric represents the expected relative number of missing elements for each true cluster. It calculates the average relative number of elements that are missing from the predicted clusters compared to the true clusters.

You can find more information about these metrics, including formal mathematical definitions, in the er_evaluation.error_analysis module.
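
To make the definitions above concrete, here is a minimal pandas sketch of the “Expected Extra Elements” computation from two membership vectors (Series mapping record IDs to cluster IDs). This is an illustrative reimplementation of the definition, not the library's own code; in practice, use the functions provided in the er_evaluation.error_analysis module.

import pandas as pd

def expected_extra_sketch(prediction, reference):
    # Conceptual sketch only: average number of extraneous records linked to a
    # random record of each true cluster.
    # Align the two membership vectors on their common record IDs.
    df = pd.concat({"pred": prediction, "ref": reference}, axis=1, join="inner")
    # Size of each record's predicted cluster.
    pred_sizes = df["pred"].map(df["pred"].value_counts())
    # Overlap between each record's predicted cluster and its true cluster.
    overlap = df.groupby(["pred", "ref"])["ref"].transform("size")
    # Extraneous records linked to each record, averaged within each true cluster.
    return (pred_sizes - overlap).groupby(df["ref"]).mean().rename("expected_extra")

A true cluster that is only split across predicted clusters (but never merged with foreign records) gets a value of 0; extra elements appear only when a predicted cluster mixes records from different true clusters.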

Example#

Here is an example based on PatentsView’s disambiguation of patent inventor names. The er_evaluation.error_indicator() metric indicates whether or not the predicted disambiguation makes an error for each true (“reference”) cluster.

import pandas as pd
import er_evaluation as ee

predictions, reference = ee.load_pv_disambiguations()
prediction = predictions[pd.Timestamp('2017-08-08')]

ee.error_indicator(prediction, reference)
reference
9unk95ybl10788b3dxzyz0qlt    0
fl:a._ln:eversole-1          0
fl:ab_ln:patil-16            1
fl:ak_ln:ohno-16             0
fl:ak_ln:sawada-11           1
                            ..
on89lkbvct0i0fbi2jdngxwc1    0
t88yown1o8l8x6i2wo45xtn3z    0
uzgor2vfmuk5bytnhr71rgwni    0
ytt5secbbneclm84c5o8yy75u    0
zpj3f8n9vln5it7gx0y1v4bkr    0
Name: error_indicator, Length: 370, dtype: int64
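
Since the result is a plain pandas Series, you can summarize it directly. For instance, the following snippet (our addition, not part of the original example) reports the share of true clusters affected by an error and lists a few of the flagged clusters:

errors = ee.error_indicator(prediction, reference)
# Share of true clusters with no matching predicted cluster.
print(f"Error rate: {errors.mean():.2%}")
# A few of the true clusters flagged as errors.
print(errors[errors == 1].index.tolist()[:5])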

Error Analysis with Decision Trees#

To identify combinations of features leading to performance disparities, we recommend doing error analysis using decision trees. First, define features associated with each cluster and choose an error metric to target. You can use any error metric from the er_evaluation.error_analysis module. We recommend using thresholded 0-1 features for interpretability (a sketch of this appears after the feature-engineering code below).

Here is an example with PatentsView data. First we define cluster-level features to consider.

import numpy as np
from statistics import mode
import er_evaluation as ee

pv_data = ee.load_pv_data()
pv_data.set_index("mention_id", inplace=True)

def flatten_mode(x):
    # Most frequent value across a group of arrays (each array's values counted once via np.unique).
    return mode(np.concatenate(x.apply(lambda x: np.unique(x)).values))

features_df = (
    # Attach the size of each record's block (number of mentions sharing its blocking key).
    pv_data.merge(pv_data["block"].value_counts().rename("block_size"), left_on="block", right_index=True)
    # Number of co-inventors listed on each mention.
    .assign(num_coauthors=pv_data["coinventor_sequence"].apply(len))
    # Filing year parsed from the "YYYY-MM-DD" filing date (aggregated below as first and last year).
    .assign(
        year_first=pv_data["filing_date"].apply(lambda x: float(str(x).split("-")[0]) if isinstance(x, str) else np.nan)
    )
    .assign(
        year_last=pv_data["filing_date"].apply(lambda x: float(str(x).split("-")[0]) if isinstance(x, str) else np.nan)
    )
    # Map each mention to its true cluster and aggregate features at the cluster level.
    .merge(reference.rename("reference"), left_on="mention_id", right_index=True)
    .groupby("reference")
    .agg(
        {
            "raw_inventor_name_first": mode,
            "raw_inventor_name_last": mode,
            "patent_id": "count",
            "raw_country": mode,
            "patent_type": mode,
            "num_coauthors": "mean",
            "block_size": "mean",
            "cpc_section": flatten_mode,
            "year_first": min,
            "year_last": max,
        }
    )
    .rename(
        columns={
            "raw_inventor_name_first": "name_first",
            "raw_inventor_name_last": "name_last",
            "patent_id": "prolificness",
            "raw_country": "country",
            "num_coauthors": "avg_coauthors",
        }
    )
)

numerical_features = [
    "prolificness",
    "avg_coauthors",
    "block_size",
    "year_first",
    "year_last",
]
categorical_features = ["country", "patent_type", "cpc_section"]

pred = predictions[pd.Timestamp("2021-12-30")]
y = ee.error_indicator(pred, reference)
# Weight each true cluster by the inverse of its size, normalized to sum to len(y).
weights = 1 / ee.cluster_sizes(reference.dropna())
weights = len(y) * weights / weights.sum()
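
As mentioned above, thresholded 0-1 features can make the resulting tree easier to read. The snippet below sketches this idea; the cut-off values are arbitrary choices for illustration, not values recommended by the package.

# Hypothetical 0-1 indicators derived from the numerical features
# (the thresholds below are illustrative, not prescribed values).
thresholded_df = features_df.assign(
    prolific=(features_df["prolificness"] >= 10).astype(int),
    large_block=(features_df["block_size"] >= 100).astype(int),
    active_after_2010=(features_df["year_last"] >= 2010).astype(int),
)

These indicators can then be passed to the decision tree in place of, or alongside, the raw numerical features.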

Afterwards, the function below fits and displays a decision tree that models the chosen error metric as a function of the provided features.

ee.make_dt_regressor_plot(
    y,
    weights,
    features_df,
    numerical_features,
    categorical_features,
    max_depth=3,
    type="sunburst",
)

You can see other visualization options on the visualizations page.

Fairness Analysis#

The er_evaluation.plot_performance_disparities function helps you identify the subgroups of the data with the largest performance disparities compared to overall performance. You can use any performance metric estimator provided by the package (see the er_evaluation.estimators module).

Here’s an example using “cpc_section” (patent classification code section) as a feature to define subgroups:

# First listed CPC section for each mention ('None' when the field is missing).
protected_feature = pv_data['cpc_section'].apply(lambda x: x[0] if isinstance(x, np.ndarray) and len(x) > 0 else 'None')
# One CPC section per true cluster, taken from the cluster's first record.
protected_feature = pd.concat([reference, protected_feature], join="inner", axis=1).groupby("unique_id").agg("first")["cpc_section"]

ee.plot_performance_disparities(
    prediction=predictions[pd.Timestamp('2021-12-30')],
    reference=reference,
    weights="cluster_size",
    protected_feature=protected_feature,
)
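
Before interpreting the resulting plot, it can help to check how many true clusters fall into each subgroup, since disparities estimated from very small subgroups are noisy. This sanity check is our suggestion and is not part of the original example:

# Number of true clusters per CPC section subgroup.
print(protected_feature.value_counts())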