splitting_entropy

er_evaluation.error_analysis.splitting_entropy(prediction, sample, alpha=1)

Splitting entropy of true clusters.

This function returns the splitting entropy, defined below, of each entity represented in sample, the set of sampled true clusters.

Splitting Entropy:

Let \(\widehat{\mathcal{C}}\) be a clustering of records \(\mathcal{R}\) into predicted entities. For a given entity represented by a true cluster \(c\), the splitting entropy is defined as the exponentiated Shannon entropy of the normalized intersection sizes \(\{\lvert \hat c \cap c \rvert \mid \hat c \in \widehat{\mathcal{C}},\, \lvert \hat c \cap c \rvert > 0 \}\), that is, of the proportions in which the records of \(c\) are split across predicted clusters. Using the convention that \(0 \cdot \log (0) = 0\), we have

\[E_{\text{split}}(c) = \exp\left \{-\sum_{\hat c \in \widehat{\mathcal{C}}} \frac{\lvert\hat c \cap c \rvert}{\sum_{\hat c' \in \widehat{\mathcal{C}}} \lvert \hat c' \cap c \rvert } \log \left(\frac{\lvert\hat c \cap c \rvert}{\sum_{\hat c' \in \widehat{\mathcal{C}}} \lvert \hat c' \cap c \rvert }\right) \right \}.\]
Parameters:
  • prediction (Series) – Membership vector representing a predicted disambiguation.

  • sample (Series) – Membership vector representing a set of true clusters.

Returns:

Pandas Series indexed by true cluster identifiers (unique values in sample), whose values are the splitting entropy of each cluster.

Return type:

Series

Examples

>>> import pandas as pd
>>> from er_evaluation.error_analysis import splitting_entropy
>>> prediction = pd.Series(index=[1,2,3,4,5,6,7,8], data=[1,1,2,3,2,4,4,4])
>>> sample = pd.Series(index=[1,2,3,4,5,8], data=["c1", "c1", "c1", "c2", "c2", "c4"])
>>> splitting_entropy(prediction, sample)
reference
c1    1.889882
c2    2.000000
c4    1.000000
Name: splitting_entropy_1, dtype: float64
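
The values above can be checked against the definition directly. The following is a minimal sketch using only pandas and NumPy; it is not the library's implementation. The name splitting_entropy_sketch is hypothetical, only the default case alpha=1 is covered, and every sampled record is assumed to appear in the prediction.

```python
import numpy as np
import pandas as pd


def splitting_entropy_sketch(prediction: pd.Series, sample: pd.Series) -> pd.Series:
    """Illustrative re-derivation of the splitting entropy (hypothetical helper).

    Assumes every record in `sample` also appears in `prediction`.
    """
    # Predicted cluster label of each sampled record.
    predicted = prediction.loc[sample.index]

    def exp_entropy(labels: pd.Series) -> float:
        # Proportions of the true cluster's records falling in each
        # predicted cluster; only non-empty intersections are counted.
        p = labels.value_counts(normalize=True).to_numpy()
        return float(np.exp(-np.sum(p * np.log(p))))

    # Group the predicted labels by true cluster identifier.
    return predicted.groupby(sample).apply(exp_entropy)


prediction = pd.Series(index=[1, 2, 3, 4, 5, 6, 7, 8], data=[1, 1, 2, 3, 2, 4, 4, 4])
sample = pd.Series(index=[1, 2, 3, 4, 5, 8], data=["c1", "c1", "c1", "c2", "c2", "c4"])
result = splitting_entropy_sketch(prediction, sample)
print(result)
```

For c1, the three records fall into predicted clusters 1, 1 and 2, giving proportions 2/3 and 1/3 and a value of exp(0.6365...), which is approximately 1.889882. A cluster kept whole (c4) scores 1, and a cluster split evenly in two (c2) scores 2.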

Notes

The sample is restricted to the set of records that are present in the prediction; sampled records without a predicted cluster are ignored.
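
As an illustration of this restriction, the sketch below filters a sample down to the records that also appear in the prediction. It uses plain pandas indexing on hypothetical data and is only one way the restriction could be expressed; how the library performs it internally is not shown here.

```python
import pandas as pd

prediction = pd.Series(index=[1, 2, 3], data=[1, 1, 2])
# Record 99 is sampled but has no predicted cluster (hypothetical data).
sample = pd.Series(index=[1, 2, 3, 99], data=["c1", "c1", "c2", "c2"])

# Keep only the sampled records whose index also appears in the prediction.
restricted = sample[sample.index.isin(prediction.index)]
print(restricted)
```

Here record 99 is dropped, so c2 is evaluated on record 3 alone.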