sample_clusters

er_evaluation.utils.sample_clusters(membership, weights='uniform', sample_prop=0.2, size=None, replace=True, random_state=1)

Sample clusters from a membership vector.

Parameters:

membership (Series) – Membership vector.
weights (str or Series, optional) – Probability weights to use. Should be one of “uniform”, “cluster_size”, or a pandas Series indexed by cluster identifiers and with values corresponding to probability weights. Defaults to “uniform”.
sample_prop (float, optional) – Proportion of clusters to sample. Defaults to 0.2.
replace (bool, optional) – Whether or not to sample with replacement. Defaults to True.
random_state (int, optional) – Random seed. Defaults to 1.
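
The weights parameter accepts a pandas Series indexed by cluster identifiers. A minimal sketch of how such a Series might be built from a membership vector is shown below. The toy data, the square-root weighting, and the name custom_weights are illustrative assumptions; whether the library requires the weights to be normalized is not stated here, so the normalization step is included only as a precaution.

```python
import pandas as pd

# Toy membership vector (hypothetical data): index = record id, value = cluster id.
membership = pd.Series(
    {"r1": "a", "r2": "a", "r3": "a", "r4": "a", "r5": "b", "r6": "c"}
)

# Number of records in each cluster, indexed by cluster identifier.
cluster_sizes = membership.groupby(membership).size()

# Illustrative choice: weights proportional to the square root of cluster size,
# a middle ground between "uniform" and "cluster_size".
custom_weights = cluster_sizes ** 0.5

# Normalize so the weights sum to 1 (assumed harmless if not required).
custom_weights = custom_weights / custom_weights.sum()
```

A Series built this way could then be passed as the weights argument in place of “uniform” or “cluster_size”.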

Returns:

Membership vector with elements corresponding to sampled clusters.

Return type:

Series
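
As a rough mental model of the parameters and return value described above, cluster sampling can be sketched with pandas and NumPy alone. This is not the library's implementation: sample_clusters_sketch is a hypothetical helper, and the rounding rule for the number of sampled clusters, the collapsing of duplicate draws under replace=True, and the omission of the size parameter are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def sample_clusters_sketch(membership, weights="uniform", sample_prop=0.2,
                           replace=True, random_state=1):
    """Hypothetical sketch: return the elements of randomly drawn clusters."""
    # Cluster sizes, indexed by cluster identifier (sorted by identifier).
    cluster_sizes = membership.groupby(membership).size()
    clusters = cluster_sizes.index.to_numpy()

    # Translate the weights argument into one probability per cluster.
    if isinstance(weights, pd.Series):
        probs = weights.reindex(cluster_sizes.index).to_numpy(dtype=float)
    elif weights == "uniform":
        probs = np.ones(len(clusters))
    elif weights == "cluster_size":
        probs = cluster_sizes.to_numpy(dtype=float)
    else:
        raise ValueError("weights must be 'uniform', 'cluster_size', or a Series")
    probs = probs / probs.sum()

    # Assumed rule: draw round(sample_prop * number of clusters), at least one.
    n_draws = max(1, int(round(sample_prop * len(clusters))))

    rng = np.random.default_rng(random_state)
    drawn = rng.choice(len(clusters), size=n_draws, replace=replace, p=probs)

    # Keep every element whose cluster was drawn. Duplicate draws (possible
    # when replace=True) collapse to a single copy of the cluster here.
    return membership[membership.isin(clusters[drawn])]


# Toy membership vector (hypothetical data): 10 clusters of 3 records each.
toy = pd.Series({f"r{i}": f"c{i // 3}" for i in range(30)})
toy_sample = sample_clusters_sketch(toy, sample_prop=0.2, replace=False)
```

With replace=False and ten clusters, the sketch returns all records of exactly two clusters. The result is itself a membership vector restricted to the sampled clusters.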

Examples

Load a toy dataset:

>>> from er_evaluation.datasets import load_rldata10000_disambiguations
>>> predictions, reference = load_rldata10000_disambiguations()

Sample a set of ground truth clusters uniformly at random:

>>> from er_evaluation.utils import sample_clusters
>>> sample = sample_clusters(reference, weights="uniform", sample_prop=0.2)

Compute pairwise_precision on the sample:

>>> from er_evaluation.metrics import pairwise_precision
>>> pairwise_precision(predictions['name_by'], sample)
0.96

Compare to the true precision on the full data:

>>> pairwise_precision(predictions['name_by'], reference)
0.7028571428571428

The metric computed naively on the sample is overly optimistic (0.96 versus a true precision of about 0.70). Instead, use an estimator to accurately estimate pairwise precision from the sample. The estimator returns a point estimate together with an estimate of its standard deviation:

>>> from er_evaluation.estimators import pairwise_precision_estimator
>>> pairwise_precision_estimator(predictions['name_by'], sample, weights="uniform")
(0.7633453805063894, 0.04223296142335369)
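
One way to see why the naive sample metric is optimistic is a toy example in which a predicted cluster wrongly merges records from two reference clusters. The sketch below assumes the metric is computed only on records present in both vectors. The helpers linked_pairs and naive_pairwise_precision are hypothetical, written for illustration rather than taken from the library.

```python
from itertools import combinations

import pandas as pd


def linked_pairs(membership):
    """Set of unordered record pairs that share a cluster."""
    pairs = set()
    for _, records in membership.groupby(membership).groups.items():
        pairs.update(combinations(sorted(records), 2))
    return pairs


def naive_pairwise_precision(prediction, reference):
    """Share of predicted pairs that are also reference pairs, computed on the
    records present in both vectors (assumed behaviour, for illustration)."""
    common = prediction.index.intersection(reference.index)
    predicted = linked_pairs(prediction.loc[common])
    true = linked_pairs(reference.loc[common])
    if not predicted:
        return float("nan")
    return len(predicted & true) / len(predicted)


# Toy data (hypothetical): predicted cluster p1 wrongly absorbs record 3.
toy_reference = pd.Series({1: "A", 2: "A", 3: "B", 4: "B", 5: "C", 6: "C"})
toy_prediction = pd.Series({1: "p1", 2: "p1", 3: "p1", 4: "p2", 5: "p3", 6: "p3"})

# Full data: predicted pairs are (1,2), (1,3), (2,3), (5,6); two are correct.
full_precision = naive_pairwise_precision(toy_prediction, toy_reference)  # 0.5

# Sampling only reference cluster A drops record 3, and with it the two
# erroneous pairs (1,3) and (2,3).
sample_a = toy_reference[toy_reference == "A"]
sample_precision = naive_pairwise_precision(toy_prediction, sample_a)  # 1.0
```

Restricting to sampled clusters keeps every correct pair inside those clusters but discards erroneous pairs that reach into unsampled clusters, which inflates the naive metric.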