Advanced Topics#

Performance Considerations#

When dealing with large membership vectors, we recommend using an integer type for indices and values. You can compress any set of membership vectors (preserving index compatibility) using the er_evaluation.compress_memberships() function.
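
For example, here is a minimal sketch of compressing a prediction and a reference vector together. It assumes that membership vectors are pandas Series mapping record IDs to cluster identifiers, and that compress_memberships() accepts one or more such Series and returns compressed copies in the same order; check the API reference for the exact signature.

```python
# A minimal sketch, assuming membership vectors are pandas Series
# (record ID -> cluster ID) and that compress_memberships() takes one or
# more Series and returns compressed copies in the same order.
import pandas as pd
import er_evaluation as ee

# Toy prediction and reference membership vectors.
prediction = pd.Series(["c1", "c1", "c2", "c2"], index=["r1", "r2", "r3", "r4"])
reference = pd.Series(["k1", "k2", "k2", "k2"], index=["r1", "r2", "r3", "r4"])

# Compress both vectors together so their indices remain compatible.
prediction_c, reference_c = ee.compress_memberships(prediction, reference)
```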

To speed up the computation of performance estimators and error metrics, you can first compute what is called a record error table from a given predicted and reference membership vector using the er_evaluation.record_error_table() function. Most performance estimators and error metrics can then be obtained faster by computing them directly from this table. Functions that accept the record error table rather than membership vectors as an argument carry the _from_table suffix. You can find these functions in the er_evaluation.estimators.from_table module as well as in the er_evaluation.error_analysis module.
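
The sketch below illustrates this workflow. It assumes that record_error_table() takes the prediction followed by the reference; the expected_extra_from_table name shown in the comment is hypothetical, so consult the modules above for the exact *_from_table functions available.

```python
# A minimal sketch of the record-error-table workflow. The argument order of
# record_error_table() and the *_from_table function named in the comment are
# assumptions; see er_evaluation.estimators.from_table and
# er_evaluation.error_analysis for the exact names.
import pandas as pd
import er_evaluation as ee
from er_evaluation import error_analysis

prediction = pd.Series(["c1", "c1", "c2", "c2"], index=["r1", "r2", "r3", "r4"])
reference = pd.Series(["k1", "k2", "k2", "k2"], index=["r1", "r2", "r3", "r4"])

# Compute the record error table once...
error_table = ee.record_error_table(prediction, reference)

# ...then reuse it for several metrics or estimators instead of recomputing
# them from the membership vectors each time, e.g. (hypothetical name):
# expected_extra = error_analysis.expected_extra_from_table(error_table)
```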

The definition and use of the record error table are explained below.

Record Error Table#

All performance estimators can be obtained as a function (ratio estimator) of cluster-level error metrics. In turn, all cluster-level error metrics can be computed from what is called the record error table. Given a predicted membership vector and a reference membership vector, the record error table is indexed by reference record IDs and has the following columns (a worked example follows the list):

  1. pred_cluster_size: The size of the predicted cluster (associated with the record ID on that row).

  2. ref_cluster_size: The size of the true cluster.

  3. extra: The number of elements in the predicted cluster which are not in the true cluster.

  4. missing: The number of elements in the true cluster which are not in the predicted cluster.
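
To make these definitions concrete, here is a small hand computation of one row of the table using plain pandas. It re-derives the four columns directly from the definitions above (it is not the library's implementation), and the record IDs and cluster labels are made up.

```python
# Illustrative re-derivation of the four columns for a single record,
# following the definitions above. Not the library's implementation.
import pandas as pd

prediction = pd.Series(["c1", "c1", "c2"], index=["a", "b", "c"])
reference = pd.Series(["k1", "k2", "k2"], index=["a", "b", "c"])

record = "b"
pred_cluster = set(prediction[prediction == prediction.loc[record]].index)  # {"a", "b"}
ref_cluster = set(reference[reference == reference.loc[record]].index)      # {"b", "c"}

row = {
    "pred_cluster_size": len(pred_cluster),      # 2
    "ref_cluster_size": len(ref_cluster),        # 2
    "extra": len(pred_cluster - ref_cluster),    # 1 (record "a")
    "missing": len(ref_cluster - pred_cluster),  # 1 (record "c")
}
```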

The record error table can be used for computational speedups, for modeling the fundamental metrics listed above as a function of explanatory features, and for sensitivity analyses (where noise is introduced at the level of the record error table, before propagating it down to cluster error metrics and performance estimates).
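
As a rough illustration of the sensitivity-analysis idea, the sketch below perturbs the missing column of a toy record error table and recomputes a simple table-level summary. The noise model and the summary statistic are illustrative choices for this sketch, not a prescription from the library.

```python
# A hedged sketch of a simple sensitivity analysis at the level of the record
# error table: perturb the "missing" counts and observe how a table-level
# summary moves. The toy table values, noise model, and summary are made up
# for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy record error table with the four columns defined above.
error_table = pd.DataFrame(
    {
        "pred_cluster_size": [2, 2, 1, 3],
        "ref_cluster_size": [1, 2, 2, 3],
        "extra": [1, 0, 0, 1],
        "missing": [0, 0, 1, 1],
    },
    index=["r1", "r2", "r3", "r4"],
)

def avg_missing_rate(table: pd.DataFrame) -> float:
    # Average share of each record's true cluster that the prediction misses.
    return float((table["missing"] / table["ref_cluster_size"]).mean())

baseline = avg_missing_rate(error_table)

# Add small non-negative noise to the "missing" column, capped at the true
# cluster size, then recompute the summary to gauge its sensitivity.
noisy = error_table.copy()
noisy["missing"] = np.minimum(
    noisy["missing"] + rng.poisson(0.2, size=len(noisy)),
    noisy["ref_cluster_size"],
)
perturbed = avg_missing_rate(noisy)
```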