infineac.topic_extractor

Extracts topics from a list of documents using BERTopic.

Examples

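>>> from pathlib import Path
>>> import spacy_stanza
>>> import infineac.file_loader as file_loader
>>> import infineac.process_event as process_event
>>> import infineac.process_text as process_text
>>> import infineac.topic_extractor as topic_extractor
>>> # Stanza pipeline wrapped for spaCy, with a rule-based sentence splitter
>>> nlp = spacy_stanza.load_pipeline("en", processors="tokenize, lemma")
>>> nlp.add_pipe('sentencizer')
>>> # Load every XML transcript found below PATH_DIR
>>> PATH_DIR = "data/transcripts/"
>>> files = list(Path(PATH_DIR).rglob("*.xml"))
>>> events = file_loader.load_files_from_xml(files)
>>> # Build the corpus from the passages that match the keywords
>>> keywords = {"russia": 1, "ukraine": 1}
>>> corpus = process_event.events_to_corpus(
...     events=events,
...     keywords=keywords,
...     nlp_model=nlp)
>>> docs = process_text.remove_sentences_under_threshold(
...     corpus["processed_text"].tolist())
>>> topics, probabilities = topic_extractor.bert_advanced(docs)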

Functions

`bert_advanced`(docs, representation_model[, ...])	Extracts topics from a list of documents using BERTopic.
`categorize_topics`(keywords_topics)	Categorizes a list of keywords (keywords_topics) according to the `infineac.constants.TOPICS()` dictionary, which maps keywords to categories.
`get_groups_from_hierarchy`(hierarchical_topics)	Returns the top n children/groups from the hierarchical topics.
`get_topics_per_company`(df)	Returns the topics and categories per company.
`map_topics_to_categories`(topics, mapping)	Maps a list of topics to the corresponding categories.
`plot_category_distribution`(df[, aggregate, ...])	Plots the category distribution for the given DataFrame df and the given aggregate.
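
The summaries above describe mapping topic keywords to categories through a keyword dictionary. The following is only a rough, self-contained sketch of that general idea, not infineac's implementation: the names `EXAMPLE_MAPPING`, `categorize_keywords` and `categorize_all`, the majority-vote rule, and the `"misc"` fallback are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical keyword-to-category mapping (illustrative only; the real
# dictionary is the one referenced as infineac.constants.TOPICS).
EXAMPLE_MAPPING = {
    "sanction": "sanctions",
    "embargo": "sanctions",
    "gas": "energy",
    "oil": "energy",
    "inflation": "macro",
}


def categorize_keywords(keywords, mapping, default="misc"):
    """Return the most frequent category among the keywords of one topic.

    Keywords missing from the mapping are ignored; if none match,
    the default category is returned.
    """
    counts = Counter(mapping[k] for k in keywords if k in mapping)
    if not counts:
        return default
    return counts.most_common(1)[0][0]


def categorize_all(keywords_topics, mapping, default="misc"):
    """Categorize each topic's keyword list in turn."""
    return [categorize_keywords(kws, mapping, default) for kws in keywords_topics]


categories = categorize_all(
    [["gas", "oil", "sanction"], ["embargo"], ["weather"]],
    EXAMPLE_MAPPING,
)
print(categories)
```

A majority vote is just one possible rule; a library could equally return every matching category or weight keywords by their topic scores.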