infineac.process_text

Contains functions for processing text data. They extract passages that contain specific keywords from text data such as earnings calls.

Examples

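>>> from pathlib import Path
>>> import spacy_stanza
>>> import infineac.file_loader as file_loader
>>> import infineac.process_text as process_text
>>> # Build the NLP pipeline and add sentence splitting
>>> nlp = spacy_stanza.load_pipeline("en", processors="tokenize, lemma")
>>> nlp.add_pipe("sentencizer")
>>> # Collect and load all XML transcripts below the data directory
>>> PATH_DIR = "data/transcripts/"
>>> files = list(Path(PATH_DIR).rglob("*.xml"))
>>> events = file_loader.load_files_from_xml(files)
>>> # Extract the passages that mention the keywords
>>> keywords = {"russia": 1, "ukraine": 1}
>>> process_text.extract_passages_from_events(
...     events=events,
...     keywords=keywords,
...     nlp_model=nlp)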

Notes

This module is mainly used by the infineac.process_event module to process the text data of events, e.g. earnings calls.

Functions

`combine_adjacent_sentences`(sentence_ids, ...)	Joins sentences that are adjacent based on their sentence_ids.
`contains_stopword`(word, stopwords[, only_start])	Checks if a word contains a stopword.
`extract_keyword_sentences_preceding_mod`(...)	Extracts sentences with specific keywords and a modifier_word preceding it.
`extract_keyword_sentences_preceding_mod_nlp`(...)	Extracts sentences with specific keywords and a modifier_word preceding it.
`extract_keyword_sentences_window`(text, ...)	Extracts sentences with specific keywords within a text as well as the context surrounding this sentence.
`extract_passages_from_paragraphs`(paragraphs, ...)	Loops through paragraphs and extracts the sentences that contain a keyword.
`get_elections`(string)	Checks whether a string contains the words "election" and "presidential election" and returns a string accordingly.
`get_russia_and_sanction`(string)	Checks whether a string contains the words "russia" and "sanction" and returns a string accordingly.
`get_strategies`([lst, strategy_keywords, ...])	Searches for the strategy_keywords in a list of strings or DataFrame and returns a boolean list or DataFrame with the results (for each strategy).
`keyword_threshold_search_exclude_mod`(string)	Checks if a string contains one of the keywords and does not contain a modifier_word preceding the keyword.
`keyword_threshold_search_include_mod`(string)	Checks if a string contains one of the keywords and contains a modifier_word preceding the keyword.
`list_to_string`(list[, separator])	Converts a list of strings to a string with a separator.
`process_corpus`(corpus, nlp_model[, ...])	Processes a corpus (list of documents/texts) with spaCy and an NLP model.
`process_text`(text, nlp_model[, lemmatize, ...])	Processes a text with spaCy and an NLP model.
`process_text_nlp`(text_nlp[, lemmatize, ...])	Processes a spaCy document.
`remove_sentences_under_threshold`(corpus[, ...])	Removes sentences that contain threshold words or fewer from a corpus.
`sample_strategies`(dataframe[, k])	Samples the strategies from the dataframe.
`starts_with_additional_word`(word, ...)	Checks if a word starts with an additional_word.
`strategy_keywords_tolist`([strategy_keywords])	Converts the dictionary of strategy_keywords to a list of keywords.
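
To make the idea behind the window-based extraction and the joining of adjacent sentences more concrete, here is a minimal, self-contained sketch in plain Python. It is not the code of this module: the helper names `split_sentences`, `find_keyword_window_ids` and `join_adjacent_sentences`, the `window` tuple and the regex-based sentence splitter are all assumptions made for illustration, and the real functions work on spaCy documents and may take different parameters.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (hypothetical stand-in for an NLP pipeline)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_keyword_window_ids(sentences, keywords, window=(0, 1)):
    """Return sorted ids of keyword sentences plus their context window.

    window=(before, after) is an assumed convention: the number of
    sentences to keep before and after each keyword sentence.
    """
    before, after = window
    ids = set()
    for i, sentence in enumerate(sentences):
        lowered = sentence.lower()
        if any(keyword in lowered for keyword in keywords):
            start = max(0, i - before)
            stop = min(len(sentences), i + after + 1)
            ids.update(range(start, stop))
    return sorted(ids)


def join_adjacent_sentences(sentences, ids):
    """Join sentences with consecutive ids into one passage each."""
    passages = []
    current = []
    previous = None
    for i in ids:
        # A gap in the ids closes the current passage.
        if previous is not None and i != previous + 1:
            passages.append(" ".join(current))
            current = []
        current.append(sentences[i])
        previous = i
    if current:
        passages.append(" ".join(current))
    return passages


text = (
    "Revenue grew this quarter. We exited Russia in March. "
    "That cost us two percent of sales. Margins were stable. "
    "Ukraine operations remain paused."
)
sentences = split_sentences(text)
ids = find_keyword_window_ids(sentences, ["russia", "ukraine"], window=(0, 1))
passages = join_adjacent_sentences(sentences, ids)
print(passages)
```

With this toy input, the sentence mentioning Russia is kept together with the sentence that follows it, while the sentence mentioning Ukraine forms a separate passage because it is not adjacent to the first two.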