infineac.topic_extractor

Extracts topics from a list of documents using BERTopic.

Examples

>>> from pathlib import Path
>>> import spacy_stanza
>>> import infineac.file_loader as file_loader
>>> import infineac.process_event as process_event
>>> import infineac.process_text as process_text
>>> import infineac.topic_extractor as topic_extractor
>>> nlp = spacy_stanza.load_pipeline("en", processors="tokenize, lemma")
>>> nlp.add_pipe('sentencizer')
>>> PATH_DIR = "data/transcripts/"
>>> files = list(Path(PATH_DIR).rglob("*.xml"))
>>> events = file_loader.load_files_from_xml(files)
>>> keywords = {"russia": 1, "ukraine": 1}
>>> corpus = process_event.events_to_corpus(
...     events=events,
...     keywords=keywords,
...     nlp_model=nlp)
>>> docs = process_text.remove_sentences_under_threshold(
...     corpus["processed_text"].tolist())
>>> topics, probabilities = topic_extractor.bert_advanced(docs)

Functions

bert_advanced(docs, representation_model[, ...])

Extracts topics from a list of documents using BERTopic.

categorize_topics(keywords_topics)

Categorizes a list of keywords (keywords_topics) according to the infineac.constants.TOPICS() dictionary, which maps keywords to categories.

get_groups_from_hierarchy(hierarchical_topics)

Returns the top n children/groups from the hierarchical topics.

get_topics_per_company(df)

Returns the topics and categories per company.

map_topics_to_categories(topics, mapping)

Maps a list of topics to the corresponding categories.

plot_category_distribution(df[, aggregate, ...])

Plots the category distribution for the given DataFrame df and the given aggregate.
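
The keyword-to-category idea behind categorize_topics and map_topics_to_categories can be illustrated with a minimal sketch that does not depend on infineac. The names EXAMPLE_TOPICS, categorize_keywords and map_to_categories are hypothetical and are not part of the library. The real infineac.constants.TOPICS dictionary and the real functions' signatures and return types may differ.

```python
# Hypothetical category -> keywords dictionary (an assumption for illustration;
# the real infineac.constants.TOPICS may be structured differently).
EXAMPLE_TOPICS = {
    "energy": ["gas", "oil", "energy"],
    "supply chain": ["supply", "logistics", "shipping"],
}


def categorize_keywords(keywords, topics=EXAMPLE_TOPICS, default="misc"):
    """Return the category whose keyword list overlaps most with `keywords`."""
    def overlap(category):
        return len(set(keywords) & set(topics[category]))

    best = max(topics, key=overlap)
    # Fall back to `default` when no category shares any keyword.
    return best if overlap(best) > 0 else default


def map_to_categories(topic_ids, mapping, default="misc"):
    """Map each topic id to its category, using `default` for unknown ids."""
    return [mapping.get(topic_id, default) for topic_id in topic_ids]


# Top keywords per topic id, as a topic model might report them (made-up data).
keywords_per_topic = {
    0: ["gas", "price", "oil"],
    1: ["shipping", "delay"],
    2: ["weather"],
}
mapping = {tid: categorize_keywords(kws) for tid, kws in keywords_per_topic.items()}
categories = map_to_categories([0, 1, 1, 2, 5], mapping)
```

In this sketch a topic whose keywords match no category, or a topic id that is missing from the mapping, falls into a catch-all "misc" category. Whether infineac handles such cases the same way is not stated here.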