expected_size_difference

er_evaluation.error_analysis.expected_size_difference(prediction, sample)

Expected size difference between predicted and sampled clusters.

Expected Size Difference: For a given sampled cluster \(c\) with records \(r \in c\), let \(\hat c(r)\) be the predicted cluster containing \(r\). Then the expected size difference for \(c\) is

\[E_{\text{size}}(c) = \frac{1}{\lvert c \rvert}\sum_{r\in c} \lvert \hat c(r) \rvert - \lvert c \rvert.\]

Parameters:

prediction (Series) – Membership vector representing a predicted disambiguation.
sample (Series) – Membership vector representing a set of true clusters.

Returns:

Pandas Series indexed by true cluster identifiers (the unique values in sample), whose values are the expected size difference of each cluster.

Return type:

Series

Examples

>>> prediction = pd.Series(index=[1,2,3,4,5,6,7,8], data=[1,1,2,3,2,4,4,4])
>>> sample = pd.Series(index=[1,2,3,4,5,6,7], data=["c1", "c1", "c1", "c2", "c2", "c3", "c3"])
>>> expected_size_difference(prediction, sample)
reference
c1   -1.0
c2   -0.5
c3    1.0
Name: expected_size_diff, dtype: float64
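
To see how these values follow from the definition of \(E_{\text{size}}(c)\), the computation can be reproduced with plain pandas. The sketch below is an illustration only and is not the library's implementation; the helper name `expected_size_difference_sketch` is hypothetical, and it assumes every sampled record also appears in the prediction.

```python
import pandas as pd


def expected_size_difference_sketch(prediction, sample):
    """Plain-pandas sketch of E_size(c); hypothetical helper, not the library's code."""
    # Size of the predicted cluster containing each record r.
    pred_cluster_size = prediction.map(prediction.value_counts())
    # Mean predicted-cluster size over the records of each sampled cluster c.
    mean_pred_size = pred_cluster_size.loc[sample.index].groupby(sample).mean()
    # Size of each sampled cluster c.
    true_size = sample.groupby(sample).size()
    return mean_pred_size - true_size


# Same inputs as the documented example.
prediction = pd.Series(index=[1, 2, 3, 4, 5, 6, 7, 8], data=[1, 1, 2, 3, 2, 4, 4, 4])
sample = pd.Series(index=[1, 2, 3, 4, 5, 6, 7], data=["c1", "c1", "c1", "c2", "c2", "c3", "c3"])

result = expected_size_difference_sketch(prediction, sample)
print(result.to_dict())  # {'c1': -1.0, 'c2': -0.5, 'c3': 1.0}
```

For c2, for instance, record 4 sits in a predicted cluster of size 1 and record 5 in one of size 2, so the mean is 1.5 and the difference from the true size 2 is -0.5. Record 8 is not sampled, but it still counts toward the size of predicted cluster 4, which is why c3 has a value of 1.0.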

Notes

The sample is restricted to the records that are present in the prediction.
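
A minimal sketch of what such a restriction looks like in plain pandas, assuming records are matched by index label. The variable names are illustrative and this is not taken from the library's source.

```python
import pandas as pd

prediction = pd.Series(index=[1, 2, 3], data=[1, 1, 2])
# Record 4 is sampled but has no predicted cluster.
sample = pd.Series(index=[1, 2, 3, 4], data=["c1", "c1", "c2", "c2"])

# Keep only the sampled records whose index label also appears in the prediction.
restricted = sample[sample.index.isin(prediction.index)]
print(restricted.to_dict())  # {1: 'c1', 2: 'c1', 3: 'c2'}
```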