infineac.topic_extractor
Extracts topics from a list of documents using BERTopic.
Examples
>>> from pathlib import Path
>>> import spacy_stanza
>>> import infineac.file_loader as file_loader
>>> import infineac.process_event as process_event
>>> import infineac.process_text as process_text
>>> import infineac.topic_extractor as topic_extractor
>>> nlp = spacy_stanza.load_pipeline("en", processors="tokenize, lemma")
>>> nlp.add_pipe('sentencizer')
>>> PATH_DIR = "data/transcripts/"
>>> files = list(Path(PATH_DIR).rglob("*.xml"))
>>> events = file_loader.load_files_from_xml(files)
>>> keywords = {"russia": 1, "ukraine": 1}
>>> corpus = process_event.events_to_corpus(
...     events=events,
...     keywords=keywords,
...     nlp_model=nlp)
>>> docs = process_text.remove_sentences_under_threshold(
...     corpus["processed_text"].tolist())
>>> topics, probabilities = topic_extractor.bert_advanced(docs)
Functions
Extracts topics from a list of documents using BERTopic.
Categorizes a list of keywords (keywords_topics) according to the
Returns the top n children/groups from the hierarchical topics.
Returns the topics and categories per company.
Maps a list of topics to the corresponding categories.
Plots the category distribution for the given DataFrame df and the given aggregate.
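
As a rough illustration of the topic-to-category mapping and the category distribution described above, the following sketch uses plain Python only. The names `map_topics_to_categories` and `category_distribution`, the `topic_to_category` dictionary, and the `"unknown"` fallback are hypothetical and not part of infineac; the module's own functions may take different arguments and return different types.

```python
from collections import Counter


def map_topics_to_categories(topics, topic_to_category, default="unknown"):
    """Map each topic ID to a category label (hypothetical helper)."""
    return [topic_to_category.get(topic, default) for topic in topics]


def category_distribution(categories):
    """Return the relative frequency of each category (hypothetical helper)."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


# Illustrative data only: topic IDs as a topic model might assign them,
# with -1 standing for outlier documents (the convention BERTopic uses).
topics = [0, 1, 0, 2, -1]
topic_to_category = {0: "energy", 1: "sanctions", 2: "energy"}

categories = map_topics_to_categories(topics, topic_to_category)
distribution = category_distribution(categories)
```

Topics without an entry in the mapping fall back to the default label instead of raising an error, which keeps outlier documents visible in the resulting distribution.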