infineac.process_text
Contains functions for processing text data. They extract passages that contain specific keywords from text data such as earnings calls.
Examples
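>>> from pathlib import Path
>>> import spacy_stanza
>>> import infineac.file_loader as file_loader
>>> import infineac.process_text as process_text
>>> # Build the NLP pipeline and add a sentence splitter
>>> nlp = spacy_stanza.load_pipeline("en", processors="tokenize, lemma")
>>> nlp.add_pipe("sentencizer")
>>> # Load all XML transcripts below PATH_DIR
>>> PATH_DIR = "data/transcripts/"
>>> files = list(Path(PATH_DIR).rglob("*.xml"))
>>> events = file_loader.load_files_from_xml(files)
>>> # Extract the passages that mention the keywords
>>> keywords = {"russia": 1, "ukraine": 1}
>>> process_text.extract_passages_from_events(
...     events=events,
...     keywords=keywords,
...     nlp_model=nlp)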
Notes
This module is mainly used by the infineac.process_event module to process
the text data of events, e.g. earnings calls.
Functions
Joins sentences that are adjacent based on their sentence_ids.
Checks if a word contains a stopword.
Extracts sentences with specific keywords and a modifier_word preceding it.
Extracts sentences with specific keywords and a modifier_word preceding it.
Extracts sentences with specific keywords within a text as well as the context surrounding each sentence.
Loops through paragraphs and extracts the sentences that contain a keyword.
Checks whether a string contains the words "election" and "presidential election" and returns a corresponding string.
Checks whether a string contains the words "russia" and "sanction" and returns a corresponding string.
Searches for the strategy_keywords in a list of strings or a DataFrame and returns a boolean list or DataFrame with the results (for each strategy).
Checks if a string contains one of the keywords and does not contain a modifier_word preceding the keyword.
Checks if a string contains one of the keywords and contains a modifier_word preceding the keyword.
Converts a list of strings to a string with a separator.
Processes a corpus (list of documents/texts) with spaCy and an NLP model.
Processes a text with spaCy and an NLP model.
Processes a spaCy document.
Removes sentences from a corpus that contain threshold words or fewer.
Samples the strategies from the DataFrame.
Checks if a word starts with an additional_word.
Converts the dictionary of strategy_keywords to a list of keywords.
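
To make two of the descriptions above more concrete, the following is a minimal sketch of joining adjacent sentences by their sentence_ids and of checking for a keyword that is not preceded by a modifier_word. It is not the infineac implementation: the function names join_adjacent_sentences and keyword_without_modifier, their signatures, the regex tokenisation, and the substring keyword match are all assumptions made for illustration.

```python
import re
from typing import List, Sequence


def join_adjacent_sentences(
    sentence_ids: Sequence[int], sentences: Sequence[str]
) -> List[str]:
    """Hypothetical helper: merge sentences with consecutive ids into passages."""
    passages: List[str] = []
    previous_id = None
    for sentence_id, sentence in sorted(zip(sentence_ids, sentences)):
        if previous_id is not None and sentence_id == previous_id + 1:
            # Consecutive id: extend the passage that is currently open.
            passages[-1] = passages[-1] + " " + sentence
        else:
            # Gap in the ids (or first sentence): start a new passage.
            passages.append(sentence)
        previous_id = sentence_id
    return passages


def keyword_without_modifier(
    text: str, keywords: Sequence[str], modifier_words: Sequence[str]
) -> bool:
    """Hypothetical helper: True if a keyword occurs without a modifier word
    directly in front of it."""
    # Assumption: simple lowercase word tokens instead of a spaCy pipeline.
    tokens = re.findall(r"\w+", text.lower())
    modifiers = {modifier.lower() for modifier in modifier_words}
    for position, token in enumerate(tokens):
        # Assumption: substring match, so "russian" also matches "russia".
        if any(keyword.lower() in token for keyword in keywords):
            if position == 0 or tokens[position - 1] not in modifiers:
                return True
    return False


print(join_adjacent_sentences([0, 1, 3], ["A.", "B.", "C."]))
# ['A. B.', 'C.']
print(keyword_without_modifier("Sales excluding Russia fell.", ["russia"], ["excluding"]))
# False
print(keyword_without_modifier("Sales in Russia fell.", ["russia"], ["excluding"]))
# True
```

Under these assumptions, sentences 0 and 1 are merged because their ids are consecutive, while sentence 3 starts a new passage. The modifier check only looks at the single token directly before the keyword; the library may use a different window or a lemmatised comparison.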