Data Preparation

ER-Evaluation expects clusterings and disambiguations to be represented as membership vectors, as verified by the er_evaluation.ismembership() function.

A membership vector is a pandas Series whose index holds element IDs and whose values are the assigned cluster IDs. Here is an example, representing the ground truth disambiguation for the RLdata500 dataset. It maps record IDs (labels from 0 to 499, printed in the first column) to cluster identifiers (printed in the second column). String identifiers and other hashable types are also allowed.

import er_evaluation as ee

_, reference = ee.load_rldata500_disambiguations()
reference
0       34
1       51
2      115
3      189
4       72
      ... 
495    413
496    378
497    399
498    315
499    238
Name: identity.RLdata500, Length: 500, dtype: object

Note

Membership vector indices should be unique and non-NA. Values can be NA to represent non-clustered elements. NA values are typically discarded before any computation.
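The conditions in the note can be checked directly with pandas. The following is only a rough sketch: `looks_like_membership` is a hypothetical helper written for illustration, and it is not claimed to match what er_evaluation.ismembership() does internally.

```python
import pandas as pd


def looks_like_membership(obj):
    """Hypothetical helper: rough check of the conditions in the note above."""
    if not isinstance(obj, pd.Series):
        return False
    # Element IDs (the index) must be unique and must not contain NA.
    return bool(obj.index.is_unique and not obj.index.isna().any())


# An NA value is allowed: element "c" is simply not assigned to any cluster.
valid = pd.Series(["c1", "c1", None], index=["a", "b", "c"])

# A repeated element ID makes the vector invalid.
duplicated = pd.Series(["c1", "c2"], index=["a", "a"])

print(looks_like_membership(valid))
print(looks_like_membership(duplicated))

# Non-clustered elements can be dropped before any computation.
clustered = valid.dropna()
```

Dropping NA values with `dropna()` keeps the remaining element IDs unchanged, so the result is still indexed by the original IDs.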

Data Transformations

You can convert between different clustering representations using functions from the er_evaluation package. For instance, if your dataset contains pairs of records that belong to the same entity, you can convert it to a membership vector using the er_evaluation.pairs_to_membership() function. This requires specifying a full index, because records that appear in no pair would otherwise be missing from the result.

In the example below, records 1 and 2 belong to a first cluster, records 3 and 5 belong to a second cluster, record 4 forms its own cluster, and records 6, 7, and 8 belong to the same cluster. The input data look as follows:

from numpy import array
pairs = array([[1, 2], [3, 5], [6, 7], [6, 8], [7, 8]])
pairs
array([[1, 2],
       [3, 5],
       [6, 7],
       [6, 8],
       [7, 8]])

You can transform this to a membership vector by specifying a full index:

indices = array([1,2,3,4,5,6,7,8])

ee.pairs_to_membership(pairs, indices)
1    0
2    0
3    1
4    2
5    1
6    3
7    3
8    3
dtype: int64
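To see why the full index matters, here is a minimal union-find sketch of the same idea in plain Python. It is an illustration only, not the library's implementation: `pairs_to_labels` is a hypothetical name, it returns a dictionary rather than a pandas Series, and the library may number clusters differently in general.

```python
def pairs_to_labels(pairs, indices):
    """Hypothetical sketch: group elements linked by pairs.

    Returns a dict mapping each element of `indices` to an integer label.
    """
    # Every element starts as the root of its own singleton cluster.
    parent = {i: i for i in indices}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of the two elements of every pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Number clusters in order of first appearance in `indices`.
    labels = {}
    membership = {}
    for i in indices:
        root = find(i)
        membership[i] = labels.setdefault(root, len(labels))
    return membership


pairs = [(1, 2), (3, 5), (6, 7), (6, 8), (7, 8)]
indices = [1, 2, 3, 4, 5, 6, 7, 8]
print(pairs_to_labels(pairs, indices))
# {1: 0, 2: 0, 3: 1, 4: 2, 5: 1, 6: 3, 7: 3, 8: 3}
```

Record 4 never appears in a pair, so it only receives a label because it is listed in `indices`.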

Similarly, you can convert a clusters dictionary to a membership vector using the er_evaluation.clusters_to_membership() function:

clusters = {1: array([1, 2]), 2: array([3, 5]), 3: array([4]), 4: array([6, 7, 8])}
ee.clusters_to_membership(clusters)
1    1
2    1
3    2
5    2
4    3
6    4
7    4
8    4
dtype: int64

Other functions are available in the module to reverse transformations and to deal with igraph Graph objects.
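As an illustration of what a reverse transformation involves, a membership vector can be turned back into a clusters dictionary with a plain pandas groupby. This is a sketch using pandas only, under the assumption that a clusters dictionary maps cluster IDs to arrays of element IDs as in the example above; it does not use or describe the library's own reverse functions.

```python
import pandas as pd

# Membership vector from the clusters_to_membership() example above.
membership = pd.Series([1, 1, 2, 2, 3, 4, 4, 4], index=[1, 2, 3, 5, 4, 6, 7, 8])

# Group element IDs (the index) by cluster ID (the values).
clusters = {
    cluster_id: group.index.to_numpy()
    for cluster_id, group in membership.groupby(membership)
}

for cluster_id, elements in clusters.items():
    print(cluster_id, list(elements))
```

Note that groupby drops NA values by default, so non-clustered elements do not appear in the resulting dictionary.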

Performance Considerations

When dealing with large membership vectors (millions of rows), performance can be affected by the choice of data types. To compress a membership vector to an equivalent integer representation, you can use the er_evaluation.compress_memberships() function. Compressing membership vectors rather than using string-valued identifiers can significantly speed up subsequent operations. Here’s an example:

import pandas as pd
membership = pd.Series(["c1", "c1", "c1", "c2", "c2", "c3"], index=[0,1,2,3,4,5])

ee.compress_memberships(membership)
[0    0.0
 1    0.0
 2    0.0
 3    1.0
 4    1.0
 5    2.0
 Name: 0, dtype: float64]

You can pass multiple membership vectors to the function to compress them together while preserving index compatibility.
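The idea behind an integer representation can be sketched with pandas alone, using `pd.factorize` to replace each distinct cluster identifier with an integer code. This is an assumption-laden illustration rather than the library's implementation: unlike the output shown above, it handles a single vector and yields integer codes instead of a list of float-valued Series.

```python
import pandas as pd

membership = pd.Series(["c1", "c1", "c1", "c2", "c2", "c3"], index=[0, 1, 2, 3, 4, 5])

# factorize returns one integer code per value plus the distinct values.
# NA values, if any, would receive the code -1.
codes, uniques = pd.factorize(membership)

# Rebuild a membership vector on the original element IDs.
compressed = pd.Series(codes, index=membership.index)

print(compressed.tolist())
# [0, 0, 0, 1, 1, 2]
```

The original identifiers remain recoverable from `uniques`, since code `k` corresponds to `uniques[k]`.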