splitting_entropy
- er_evaluation.splitting_entropy(prediction, sample, alpha=1)
Splitting entropy of true clusters.
This function returns the splitting entropy, defined below, of each entity represented in the set of sampled true clusters, sample.
- Splitting Entropy:
Let \(\widehat{\mathcal{C}}\) be a clustering of records \(\mathcal{R}\) into predicted entities. For a given entity represented by a true cluster \(c\), the splitting entropy is defined as the exponentiated Shannon entropy of the cluster sizes \(\{\lvert \hat c \cap c \rvert \mid \hat c \in \widehat{\mathcal{C}},\, \lvert \hat c \cap c \rvert > 0 \}\), normalized to sum to one. That is, using the convention that \(0 \cdot \log (0) = 0\), we have
\[E_{\text{split}}(c) = \exp\left \{-\sum_{\hat c \in \widehat{\mathcal{C}}} \frac{\lvert\hat c \cap c \rvert}{\sum_{\hat c' \in \widehat{\mathcal{C}}} \lvert \hat c' \cap c \rvert } \log \left(\frac{\lvert\hat c \cap c \rvert}{\sum_{\hat c' \in \widehat{\mathcal{C}}} \lvert \hat c' \cap c \rvert }\right) \right \}.\]
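The formula can be checked by hand with a short sketch. The code below is not the library's implementation: `splitting_entropy_sketch` is a hypothetical helper, and plain dicts mapping record identifiers to cluster identifiers stand in for the pandas membership vectors. It assumes that sampled records absent from the prediction are simply skipped.

```python
import math
from collections import Counter, defaultdict


def splitting_entropy_sketch(prediction, sample):
    """Exponentiated Shannon entropy of how each true cluster is split.

    Both arguments are plain dicts mapping record id -> cluster id
    (a stand-in for pandas membership vectors; hypothetical helper).
    """
    # Count the overlap size for every (true cluster, predicted cluster) pair,
    # skipping sampled records that are absent from the prediction.
    overlaps = defaultdict(Counter)
    for record, true_cluster in sample.items():
        if record in prediction:
            overlaps[true_cluster][prediction[record]] += 1

    result = {}
    for true_cluster, counts in overlaps.items():
        total = sum(counts.values())
        # Only nonempty intersections are stored, so log(0) never occurs.
        entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
        result[true_cluster] = math.exp(entropy)
    return result


prediction = {1: 1, 2: 1, 3: 2, 4: 3, 5: 2, 6: 4, 7: 4, 8: 4}
sample = {1: "c1", 2: "c1", 3: "c1", 4: "c2", 5: "c2", 8: "c4"}
print(splitting_entropy_sketch(prediction, sample))
```

A true cluster that falls entirely inside one predicted cluster gets the value 1, and a cluster split evenly into \(k\) predicted clusters gets the value \(k\), so the quantity can be read as an effective number of pieces.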
- Parameters:
prediction (Series) – Membership vector representing a predicted disambiguation.
sample (Series) – Membership vector representing a set of true clusters.
- Returns:
Pandas Series indexed by true cluster identifiers (the unique values in sample), with values giving the splitting entropy of each cluster.
- Return type:
Series
Examples
>>> prediction = pd.Series(index=[1,2,3,4,5,6,7,8], data=[1,1,2,3,2,4,4,4])
>>> sample = pd.Series(index=[1,2,3,4,5,8], data=["c1", "c1", "c1", "c2", "c2", "c4"])
>>> splitting_entropy(prediction, sample)
reference
c1    1.889882
c2    2.000000
c4    1.000000
Name: splitting_entropy_1, dtype: float64
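As a check on the example output, the value for c1 can be worked out from the definition. Records 1, 2 and 3 belong to c1; the prediction places records 1 and 2 in cluster 1 and record 3 in cluster 2, so the nonempty intersections have sizes 2 and 1:
\[E_{\text{split}}(c_1) = \exp\left\{-\tfrac{2}{3}\log\tfrac{2}{3} - \tfrac{1}{3}\log\tfrac{1}{3}\right\} = \frac{3}{2^{2/3}} \approx 1.889882.\]
Similarly, c2 is split into two pieces of size 1, giving \(\exp\{\log 2\} = 2\), and c4 consists of a single record in a single predicted cluster, giving \(\exp\{0\} = 1\).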
Notes
The sample is restricted to the set of records which are present in the prediction.
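The restriction described in this note can be illustrated with a minimal sketch. `restrict_sample` is a hypothetical helper and dicts stand in for the pandas Series; how the library performs this step internally is not shown here.

```python
def restrict_sample(prediction, sample):
    """Keep only the sampled records that also appear in the prediction."""
    return {
        record: cluster
        for record, cluster in sample.items()
        if record in prediction
    }


prediction = {1: 1, 2: 1, 3: 2}
sample = {1: "c1", 2: "c1", 9: "c1"}  # record 9 is missing from the prediction
restricted = restrict_sample(prediction, sample)
print(restricted)
```

Under this reading, record 9 does not contribute to the splitting entropy of c1, which would be computed from records 1 and 2 only.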