expected_extra
- er_evaluation.error_analysis.expected_extra(prediction, sample)
Expected number of extraneous elements to records in sampled clusters.
Given a predicted disambiguation prediction and a sample of true clusters sample, both represented as membership vectors, this function returns the expected number of extraneous elements for each true cluster. This is a pandas Series indexed by true cluster identifier and with values corresponding to the expected number of extraneous elements.
- Expected number of extraneous elements
For a given sampled cluster \(c\) with records \(r \in c\), let \(A_r\) be the set of records which are erroneously linked to \(r\) in the predicted clustering. That is, if \(\hat c(r)\) is the predicted cluster containing \(r\), then \(A_r = \hat c(r) \backslash c\). Then the expected number of extraneous elements for \(c\) is
\[E_{\text{extra}}(c) = \frac{1}{\lvert c \rvert}\sum_{r\in c} \lvert A_r \rvert.\]
This is the expected number of erroneous links to a random record \(r \in c\), chosen uniformly at random.
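To make the definition concrete, the formula can be evaluated directly with plain pandas. The following is a minimal sketch and not the library's implementation: the helper name expected_extra_sketch is hypothetical, and it assumes both membership vectors are pandas Series indexed by record identifier.

```python
import pandas as pd


def expected_extra_sketch(prediction: pd.Series, sample: pd.Series) -> pd.Series:
    """Hypothetical helper: evaluate E_extra(c) directly from the definition."""
    # Keep only sampled records that also appear in the prediction.
    sample = sample[sample.index.isin(prediction.index)]

    result = {}
    # .groups maps each true cluster identifier to the labels of its records.
    for cluster_id, members in sample.groupby(sample).groups.items():
        true_records = set(members)
        extra_counts = []
        for r in true_records:
            # Predicted cluster containing r, i.e. \hat c(r).
            predicted_label = prediction.loc[r]
            predicted_cluster = set(prediction[prediction == predicted_label].index)
            # A_r: records linked to r in the prediction but outside c.
            extra_counts.append(len(predicted_cluster - true_records))
        # Average |A_r| over the records of c.
        result[cluster_id] = sum(extra_counts) / len(extra_counts)

    return pd.Series(result, name="expected_extra")
```

This loop is quadratic in the worst case and is meant only to mirror the formula term by term. Applied to the inputs in the Examples section, it gives 1/3 for c1, 0.5 for c2 and 2.0 for c4.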
- Parameters:
prediction (Series) – Membership vector representing a predicted disambiguation.
sample (Series) – Membership vector representing a set of true clusters.
- Returns:
Pandas Series indexed by true cluster identifiers (unique values in sample) and with values corresponding to the expected number of extraneous elements.
- Return type:
Series
Examples
>>> prediction = pd.Series(index=[1,2,3,4,5,6,7,8], data=[1,1,2,3,2,4,4,4])
>>> sample = pd.Series(index=[1,2,3,4,5,8], data=["c1", "c1", "c1", "c2", "c2", "c4"])
>>> expected_extra(prediction, sample)
reference
c1    0.333333
c2    0.500000
c4    2.000000
Name: expected_extra, dtype: float64
Notes
The sample is restricted to the set of records which are present in the prediction.
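How the library performs this restriction internally is not shown here. As an illustration only, the same filtering can be written in plain pandas as below, assuming both Series are indexed by record identifier. The variable names and data are made up for the example.

```python
import pandas as pd

# Record 99 is in the sample but has no predicted cluster.
prediction = pd.Series(index=[1, 2, 3], data=[1, 1, 2])
sample = pd.Series(index=[1, 2, 99], data=["c1", "c1", "c2"])

# Keep only sampled records whose identifier appears in the prediction index.
restricted = sample[sample.index.isin(prediction.index)]
```

In this illustration, record 99 is dropped, so the true cluster c2 has no records left after filtering.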