expected_size_difference
- er_evaluation.error_analysis.expected_size_difference(prediction, sample)
Expected size difference between predicted and sampled clusters.
- Expected Size Difference:
For a given sampled cluster \(c\) with records \(r \in c\), let \(\hat c(r)\) be the predicted cluster containing \(r\). Then the expected size difference for \(c\) is
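\[E_{\text{size}}(c) = \left(\frac{1}{\lvert c \rvert}\sum_{r\in c} \lvert \hat c(r) \rvert\right) - \lvert c \rvert.\]
That is, the average size of the predicted clusters containing the records of \(c\), minus the size of \(c\) itself. A positive value means the records of \(c\) tend to sit in predicted clusters that are larger than \(c\); a negative value means they tend to sit in smaller ones.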
- Parameters:
prediction (Series) – Membership vector representing a predicted disambiguation.
sample (Series) – Membership vector representing a set of true clusters.
- Returns:
pandas Series indexed by true cluster identifier (the unique values in sample), whose values are the expected size difference of each cluster.
- Return type:
Series
Examples
>>> prediction = pd.Series(index=[1,2,3,4,5,6,7,8], data=[1,1,2,3,2,4,4,4])
>>> sample = pd.Series(index=[1,2,3,4,5,6,7], data=["c1", "c1", "c1", "c2", "c2", "c3", "c3"])
>>> expected_size_difference(prediction, sample)
reference
c1   -1.0
c2   -0.5
c3    1.0
Name: expected_size_diff, dtype: float64
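To see where these values come from, take cluster c2 in the example above. Its records 4 and 5 belong to predicted clusters 3 and 2, which contain 1 and 2 records of the prediction respectively, so

\[E_{\text{size}}(c_2) = \frac{1 + 2}{2} - 2 = -0.5.\]

Likewise, both records of c3 belong to predicted cluster 4, which also contains record 8 and therefore has size 3, giving

\[E_{\text{size}}(c_3) = \frac{3 + 3}{2} - 2 = 1.0.\]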
Notes
The sample is restricted to the set of records which are present in the prediction.
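The definition and the restriction described in this note can be sketched directly in pandas. This is a minimal illustration and not the library's implementation: the function name expected_size_difference_sketch is hypothetical, both Series are assumed to have unique indices, and predicted cluster sizes are assumed to be counted over the whole prediction, which is consistent with the example output above.

```python
import pandas as pd


def expected_size_difference_sketch(prediction: pd.Series, sample: pd.Series) -> pd.Series:
    """Illustrative re-derivation of the expected size difference (hypothetical helper)."""
    # Restrict the sample to records that are present in the prediction.
    sample = sample[sample.index.isin(prediction.index)]

    # Size of the predicted cluster containing each record,
    # counted over the full prediction (assumption, see lead-in).
    pred_sizes = prediction.map(prediction.value_counts())

    # Predicted-cluster size for each sampled record.
    sizes_for_sample = pred_sizes.loc[sample.index]

    # Average predicted-cluster size within each true cluster.
    mean_pred_size = sizes_for_sample.groupby(sample).mean()

    # Size of each true cluster after the restriction.
    true_size = sample.groupby(sample).size()

    return mean_pred_size - true_size


prediction = pd.Series(index=[1, 2, 3, 4, 5, 6, 7, 8], data=[1, 1, 2, 3, 2, 4, 4, 4])
sample = pd.Series(index=[1, 2, 3, 4, 5, 6, 7], data=["c1", "c1", "c1", "c2", "c2", "c3", "c3"])
result = expected_size_difference_sketch(prediction, sample)
print(result)
```

On the data from the Examples section, this sketch yields -1.0 for c1, -0.5 for c2 and 1.0 for c3, matching the documented output. Unlike the library function, it does not set the index or Series name.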