expected_relative_missing

er_evaluation.expected_relative_missing(prediction, sample)

Expected relative number of missing elements for records in sampled clusters.

Given a predicted disambiguation (prediction) and a sample of true clusters (sample), both represented as membership vectors, this function returns the expected relative number of missing elements for each true cluster. The result is a pandas Series indexed by true cluster identifier, with values equal to the expected relative number of missing elements.

Expected Relative Number of Missing Elements: For a given sampled cluster \(c\) with records \(r \in c\), let \(B_r\) be the set of records of \(c\) that are missing from the predicted cluster containing \(r\). That is, if \(\hat c(r)\) is the predicted cluster containing \(r\), then \(B_r = c \backslash \hat c(r)\). Then the expected relative number of missing elements for \(c\) is

\[E_{\text{rel_miss}}(c) = \frac{1}{\lvert c \rvert}\sum_{r\in c} \lvert B_r \rvert / \lvert c \rvert.\]
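
The definition above can be sketched directly with pandas. This is a minimal illustration written from the formula, not the library's implementation: the helper name expected_relative_missing_sketch is hypothetical, and it assumes every record in sample also appears in prediction.

```python
import pandas as pd


def expected_relative_missing_sketch(prediction: pd.Series, sample: pd.Series) -> pd.Series:
    """Hypothetical helper: evaluate E_rel_miss(c) for each sampled cluster c.

    Assumes every record in `sample` is also present in `prediction`.
    """
    result = {}
    # Group the sampled records by their true cluster identifier.
    for cluster_id, group in sample.groupby(sample):
        c = set(group.index)
        total_missing = 0
        for r in c:
            # Predicted cluster containing r, i.e. \hat c(r).
            predicted_cluster = set(prediction[prediction == prediction.loc[r]].index)
            # |B_r| with B_r = c \ \hat c(r).
            total_missing += len(c - predicted_cluster)
        # (1 / |c|) * sum_r |B_r| / |c|
        result[cluster_id] = total_missing / len(c) / len(c)
    return pd.Series(result, name="expected_relative_missing")
```

The nested loop is quadratic in cluster size and is meant for readability only; a vectorized version would be preferable on large samples.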

Parameters:

prediction (Series) – Membership vector representing a predicted disambiguation.
sample (Series) – Membership vector representing a set of true clusters.

Returns:

Pandas Series indexed by true cluster identifiers (unique values in sample) and with values corresponding to the expected relative number of missing elements.

Return type:

Series

Examples

>>> prediction = pd.Series(index=[1,2,3,4,5,6,7,8], data=[1,1,2,3,2,4,4,4])
>>> sample = pd.Series(index=[1,2,3,4,5,8], data=["c1", "c1", "c1", "c2", "c2", "c4"])
>>> expected_relative_missing(prediction, sample)
reference
c1    0.444444
c2    0.500000
c4    0.000000
Name: expected_relative_missing, dtype: float64
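
As a check by hand, the value for c1 can be worked out from the definition. The true cluster is \(c = \{1, 2, 3\}\). Records 1 and 2 are both predicted into \(\{1, 2\}\), so \(B_1 = B_2 = \{3\}\); record 3 is predicted into \(\{3, 5\}\), so \(B_3 = \{1, 2\}\). Therefore

\[E_{\text{rel_miss}}(c_1) = \frac{1}{3} \cdot \frac{1 + 1 + 2}{3} = \frac{4}{9} \approx 0.444.\]

The same reasoning gives \(\frac{1}{2} \cdot \frac{1 + 1}{2} = 0.5\) for c2 and \(0\) for c4, whose single record 8 has no other member to be missing.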

Notes

The sample is restricted to the set of records which are present in the prediction.
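
One way to picture this restriction is the filter below. It is a sketch under assumptions, not the library's code: the data are made up, with record 9 deliberately absent from the prediction.

```python
import pandas as pd

prediction = pd.Series(index=[1, 2, 3], data=[1, 1, 2])
# Hypothetical sample: record 9 does not appear in the prediction.
sample = pd.Series(index=[1, 2, 9], data=["c1", "c1", "c2"])

# Keep only sampled records that also appear in the prediction.
restricted = sample[sample.index.isin(prediction.index)]
```

In this sketch, record 9 is dropped and only records 1 and 2 (both in c1) remain, so cluster c2 no longer has any records to evaluate.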