infineac.compare_results.calculate_similarity

infineac.compare_results.calculate_similarity(df: DataFrame)

Calculates the intersection and union of all the given categories and topics and, based on this, the similarity within the categories and topics.

The DataFrame must contain at least two topic columns and at least two category columns. These columns are identified by the prefix of their column names.

Parameters:

df (pl.DataFrame) – The input DataFrame containing topic and category columns.

Returns:

The DataFrame with calculated similarity measures.

Return type:

pl.DataFrame

Raises:

- ValueError – If no topic columns are found.
- ValueError – If no category columns are found.
- ValueError – If the numbers of topic columns and category columns are not equal.
- ValueError – If only one topic column is found.
- ValueError – If only one category column is found.

Notes

The similarity calculated is the Jaccard similarity or index: length of the intersection divided by the length of the union [1]:

\[J(A, B) = \frac{|A \cap B|}{|A \cup B|}\]
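
As a small worked example, with illustrative sets that are not taken from the library, let A = {a, b, c} and B = {b, c, d}. The intersection has two elements and the union has four, so:

\[J(A, B) = \frac{|\{b, c\}|}{|\{a, b, c, d\}|} = \frac{2}{4} = 0.5\]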

The Jaccard similarity is normally defined for a single pair of sets. Because there can be more than two category or topic columns, it is calculated here in two ways:

- pairwise: The Jaccard similarity is calculated pairwise and then the mean is taken.
- combined: The Jaccard similarity is calculated for all categories or topics (union and intersection of all categories or topics).
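
The two variants can be sketched in plain Python as follows. This is a minimal illustration of the idea and not the library's implementation: the helper names `jaccard`, `pairwise_mean_jaccard` and `combined_jaccard`, the example topic sets, and the choice to return 0.0 for an empty union are all assumptions made for this sketch.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard index of two sets: |A ∩ B| / |A ∪ B|."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are given similarity 0.0
    return len(a & b) / len(union)


def pairwise_mean_jaccard(sets: list[set]) -> float:
    """Mean of the Jaccard index over every unordered pair of sets."""
    if len(sets) < 2:
        raise ValueError("At least two sets are required.")
    pairs = list(combinations(sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def combined_jaccard(sets: list[set]) -> float:
    """Size of the intersection of all sets divided by the size of their union."""
    if len(sets) < 2:
        raise ValueError("At least two sets are required.")
    union = set().union(*sets)
    if not union:
        return 0.0  # assumption: same convention as in jaccard()
    intersection = set.intersection(*sets)
    return len(intersection) / len(union)


# Hypothetical topics assigned to one item by three different methods.
topics = [
    {"inflation", "rates", "energy"},
    {"inflation", "rates"},
    {"inflation", "supply"},
]

pairwise = pairwise_mean_jaccard(topics)  # mean of 2/3, 1/4 and 1/3
combined = combined_jaccard(topics)       # 1 shared topic out of 4 distinct
print(pairwise, combined)
```

In this example the combined value (0.25) is lower than the pairwise mean (about 0.42): a single element missing from one set removes it from the overall intersection, whereas it only lowers the pairs that involve that set.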

References