Visualization Examples#

This page shows examples of the main visualization functions provided by ER-Evaluation. For more information on the use and meaning of each, please refer to the User Guide and API Documentation.

To get started, we first collect all relevant data from our toy subset of PatentsView data:

  • predictions: A dictionary of predicted disambiguations indexed by time.

  • reference: A benchmark dataset containing 401 disambiguated inventors (our “ground truth” sample).

  • names: The names on each record, which are used to compute the homonymy rate and name variation rate statistics.

import pandas as pd
import numpy as np
from er_evaluation.datasets import load_pv_disambiguations, load_pv_data

predictions, reference = load_pv_disambiguations()

pv_data = load_pv_data()
pv_data.set_index("mention_id", inplace=True)
# Full inventor name on each record, used for the homonymy and name variation rates.
names = pv_data["raw_inventor_name_first"] + " " + pv_data["raw_inventor_name_last"]

Summary Statistics#

Visualization of relevant disambiguation summaries and their evolution over time. Refer to the Summary Statistics page for more information.

from er_evaluation.plots import plot_summaries

plot_summaries(predictions, names)
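
Behind these panels are plain summary statistics that you can also compute yourself. The sketch below is a minimal illustration for a single disambiguation using cluster_sizes (also used later on this page); here the matching rate is taken as the share of records in clusters of size greater than one, which may differ slightly from the package's own definition.

import pandas as pd
from er_evaluation.summary import cluster_sizes

pred = predictions[pd.Timestamp("2021-12-30")]
sizes = cluster_sizes(pred)  # number of records in each predicted cluster

print("Number of clusters:", len(sizes))
print("Average cluster size:", sizes.mean())
print("Matching rate (share of records in clusters of size > 1):", sizes[sizes > 1].sum() / sizes.sum())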

Performance Estimates#

Visualize estimates for key metrics over time with uncertainty quantification (+/- 1 standard deviation). Note that estimates for the F-score, B-cubed metrics, and cluster metrics can also be computed and visualized in the same way. Refer to the Performance Estimation page for more information.

What is the difference between performance estimates and performance metrics? Performance metrics naively computed on benchmark datasets do not account for sampling biases or the limited sample size, which typically leads to over-optimistic and unreliable results. In contrast, our statistical estimators account for how the benchmark was sampled and for its smaller size, and they provide uncertainty quantification.

from er_evaluation.plots import plot_estimates

plot_estimates(predictions, {"sample": reference, "weights": "cluster_size"})
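
To make this distinction concrete, the sketch below contrasts a naive plug-in metric with a design-based estimate of pairwise precision. It assumes that pairwise_precision (in er_evaluation.metrics) and pairwise_precision_design_estimate (in er_evaluation.estimators) accept a prediction, the benchmark sample, and sampling weights, and that the estimator returns an estimate together with its standard deviation; check the Performance Estimation page and API Documentation for the exact signatures.

import pandas as pd
from er_evaluation.metrics import pairwise_precision
from er_evaluation.estimators import pairwise_precision_design_estimate

pred = predictions[pd.Timestamp("2021-12-30")]

# Naive plug-in metric: computed directly on the benchmark, ignoring how the
# 401 inventors were sampled. This is typically over-optimistic.
naive = pairwise_precision(pred, reference)

# Design-based estimate: accounts for cluster-size sampling weights and
# returns an estimate along with its standard deviation.
estimate, std = pairwise_precision_design_estimate(pred, reference, weights="cluster_size")

print(f"Naive: {naive:.3f}, estimate: {estimate:.3f} +/- {std:.3f}")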

Disambiguation Similarity#

Pairwise precision and pairwise F-score are computed between pairs of predicted disambiguations (on their inner join) to characterize changes in disambiguations over time. For a given timestamp on the y axis and a timestamp on the x axis:

  • pairwise precision: the proportion of predicted links at time x that were also present at time y (a minimal sketch of this definition is given below).
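
As a minimal sketch of this definition (not the package's optimized implementation), pairwise precision between two disambiguations can be computed by comparing their sets of co-clustered record pairs on the inner join:

from itertools import combinations
import pandas as pd

def links(membership):
    # Set of unordered record pairs placed in the same cluster.
    # Quadratic in cluster size, which is fine for this toy dataset.
    return {
        pair
        for _, cluster in membership.groupby(membership)
        for pair in combinations(sorted(cluster.index), 2)
    }

def pairwise_precision_between(pred_x, pred_y):
    # Restrict both disambiguations to their common records (inner join), then
    # count the proportion of links at time x that are also present at time y.
    common = pred_x.index.intersection(pred_y.index)
    links_x, links_y = links(pred_x.loc[common]), links(pred_y.loc[common])
    return len(links_x & links_y) / len(links_x) if links_x else float("nan")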

from er_evaluation.plots import plot_comparison

plot_comparison(predictions, color_continuous_scale="Blues")

The heatmap helps identify changes in the disambiguation results. For instance, you can notice significant changes to the disambiguation algorithm in December 2017 and December 2020.

Cluster Error Metrics#

A key element of our evaluation framework is the definition of record-level and cluster-level error metrics. See the Error Analysis page for more information.

These metrics are used to estimate performance metrics and to perform error analysis. The function below displays a scatter plot of any pair of such metrics, with point size corresponding to sampling weights. See the er_evaluation.error_analysis module for more information.

from er_evaluation.plots import plot_cluster_errors

plot_cluster_errors(predictions[pd.Timestamp('2021-12-30')], reference, weights="cluster_size")
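
For reference, the underlying cluster-level error metrics can also be obtained directly from the er_evaluation.error_analysis module. The sketch below assumes functions named expected_extra and expected_missing with a (prediction, sample) signature; the actual names and signatures may differ, so check the Error Analysis page.

import pandas as pd
from er_evaluation.error_analysis import expected_extra, expected_missing

pred = predictions[pd.Timestamp("2021-12-30")]

# Expected number of extraneous and of missing records for each sampled cluster
# (hypothetical function names; see the Error Analysis page for the exact API).
cluster_errors = pd.concat(
    {"expected_extra": expected_extra(pred, reference), "expected_missing": expected_missing(pred, reference)},
    axis=1,
)
print(cluster_errors.head())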

Fairness Analysis#

The er_evaluation.plot_performance_disparities() function is used to identify subgroups in the data with the largest performance disparity compared to full-data performance. You can use any performance metric estimator provided by the package (see the er_evaluation.estimators module).

from er_evaluation.plots import plot_performance_disparities

# Protected attribute: first CPC section listed on each record ('None' if missing).
protected_feature = pv_data['cpc_section'].apply(lambda x: x[0] if isinstance(x, np.ndarray) and len(x) > 0 else 'None')
# Reduce to one protected attribute value per benchmark cluster.
protected_feature = pd.concat([reference, protected_feature], join="inner", axis=1).groupby("unique_id").agg("first")["cpc_section"]

plot_performance_disparities(
    prediction=predictions[pd.Timestamp('2021-12-30')],
    reference=reference,
    weights="cluster_size",
    protected_feature=protected_feature,
)

Error Analysis with Decision Trees#

In order to identify combinations of features leading to performance disparities, we recommend doing error analysis with decision trees.

For this, we first need to define features associated with each cluster and choose an error metric to target. Here, we use an error indicator representing whether or not a given inventor in our reference dataset is associated with a prediction error. Any other error metric from the er_evaluation.error_analysis module can be used.

from statistics import mode
from er_evaluation.error_analysis import error_indicator
from er_evaluation.summary import cluster_sizes

def flatten_mode(x):
    # Most common value across the flattened array-valued entries.
    return mode(np.concatenate(x.apply(lambda x: np.unique(x)).values))

pv_data = load_pv_data()

# Build one row of features per benchmark cluster by aggregating record-level data.
features_df = (
    pv_data.merge(pv_data["block"].value_counts().rename("block_size"), left_on="block", right_index=True)
    .assign(num_coauthors=pv_data["coinventor_sequence"].apply(len))
    .assign(
        year_first=pv_data["filing_date"].apply(lambda x: float(str(x).split("-")[0]) if isinstance(x, str) else np.nan)
    )
    .assign(
        year_last=pv_data["filing_date"].apply(lambda x: float(str(x).split("-")[0]) if isinstance(x, str) else np.nan)
    )
    .merge(reference.rename("reference"), left_on="mention_id", right_index=True)
    .groupby("reference")
    .agg(
        {
            "raw_inventor_name_first": mode,
            "raw_inventor_name_last": mode,
            "patent_id": "count",
            "raw_country": mode,
            "patent_type": mode,
            "num_coauthors": "mean",
            "block_size": "mean",
            "cpc_section": flatten_mode,
            "year_first": min,
            "year_last": max,
        }
    )
    .rename(
        columns={
            "raw_inventor_name_first": "name_first",
            "raw_inventor_name_last": "name_last",
            "patent_id": "prolificness",
            "raw_country": "country",
            "num_coauthors": "avg_coauthors",
        }
    )
)

numerical_features = [
    "prolificness",
    "avg_coauthors",
    "block_size",
    "year_first",
    "year_last",
]
categorical_features = ["country", "patent_type", "cpc_section"]
# Target: binary error indicator for the December 2021 disambiguation.
pred = predictions[pd.Timestamp("2021-12-30 00:00:00")]
y = error_indicator(pred, reference)

# Weight clusters inversely to their size, normalized to sum to len(y).
weights = 1 / cluster_sizes(reference.dropna())
weights = len(y) * weights / weights.sum()

Afterwards, the function below fits and displays a decision tree modeling the chosen error metric as a function of provided features, with node size corresponding to the (weighted) number of samples in each node.

from er_evaluation.plots import make_dt_regressor_plot

make_dt_regressor_plot(
    y,
    weights,
    features_df,
    numerical_features,
    categorical_features,
    max_depth=3,
    type="tree",
)

In addition to the standard tree representation, you can use a sunburst chart or a treemap to visualize the tree with more focus on leaf nodes, where the arc angle or block size corresponds to the (weighted) number of samples in each node.

make_dt_regressor_plot(
    y,
    weights,
    features_df,
    numerical_features,
    categorical_features,
    max_depth=3,
    type="sunburst",
)

make_dt_regressor_plot(
    y,
    weights,
    features_df,
    numerical_features,
    categorical_features,
    max_depth=3,
    type="treemap",
)

Additional Visualization Functions#

Some visualization functions are not shown here, including: