infineac.process_text

Contains methods to process text data. The methods are used to extract passages from text data, e.g. earnings calls, that contain specific keywords.

Examples

>>> from pathlib import Path
>>> import spacy_stanza
>>> import infineac.file_loader as file_loader
>>> import infineac.process_text as process_text
>>> # Build the NLP pipeline and add sentence segmentation
>>> nlp = spacy_stanza.load_pipeline("en", processors="tokenize, lemma")
>>> nlp.add_pipe("sentencizer")
>>> # Load all XML transcripts below the data directory
>>> PATH_DIR = "data/transcripts/"
>>> files = list(Path(PATH_DIR).rglob("*.xml"))
>>> events = file_loader.load_files_from_xml(files)
>>> keywords = {"russia": 1, "ukraine": 1}
>>> process_text.extract_passages_from_events(
...     events=events,
...     keywords=keywords,
...     nlp_model=nlp)

Notes

This module is mainly used by the infineac.process_event module to process the text data of events, e.g. earnings calls.

Functions

combine_adjacent_sentences(sentence_ids, ...)

Joins sentences that are adjacent based on their sentence_ids.
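
As an illustration of the idea, the following minimal sketch merges sentences whose ids are consecutive into single passages. The helper name merge_adjacent and its two-argument signature are assumptions made for this example; they are not the signature of combine_adjacent_sentences.

```python
def merge_adjacent(sentence_ids, sentences):
    """Join sentences with consecutive ids into single passages.

    Hypothetical helper: ids and sentences are parallel lists.
    """
    passages = []
    previous_id = None
    # Sort by id so that adjacency can be checked in a single pass
    for sid, sentence in sorted(zip(sentence_ids, sentences)):
        if previous_id is not None and sid == previous_id + 1:
            # Consecutive id: extend the current passage
            passages[-1] = passages[-1] + " " + sentence
        else:
            # Gap in the ids: start a new passage
            passages.append(sentence)
        previous_id = sid
    return passages


merged = merge_adjacent([0, 1, 3], ["A.", "B.", "C."])
```

Here sentences 0 and 1 are joined into one passage, while sentence 3 stays separate because id 2 is missing.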

contains_stopword(word, stopwords[, only_start])

Checks if a word contains a stopword.

extract_keyword_sentences_preceding_mod(...)

Extracts sentences that contain specific keywords preceded by a modifier_word.

extract_keyword_sentences_preceding_mod_nlp(...)

Extracts sentences that contain specific keywords preceded by a modifier_word.

extract_keyword_sentences_window(text, ...)

Extracts sentences with specific keywords within a text as well as the context surrounding this sentence.
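
The window idea can be sketched without spaCy as follows. This is a simplified stand-in, not the module's implementation: the function name keyword_sentences_with_context, the window parameter, and the regex-based sentence splitting are all assumptions made for illustration.

```python
import re


def keyword_sentences_with_context(text, keywords, window=1):
    """Return keyword sentences plus `window` sentences on each side.

    Hypothetical helper; sentences are split naively on ., ! and ?.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    keep = set()
    for i, sentence in enumerate(sentences):
        lowered = sentence.lower()
        if any(keyword in lowered for keyword in keywords):
            # Mark the matching sentence and its neighbours
            start = max(0, i - window)
            end = min(len(sentences), i + window + 1)
            keep.update(range(start, end))
    # Preserve the original sentence order
    return [sentences[i] for i in sorted(keep)]


text = "We grew revenue. Russia hurt margins. Costs rose. Outlook is stable."
passage = keyword_sentences_with_context(text, ["russia"], window=1)
```

With window=1 the result contains the sentence mentioning the keyword together with the one before and the one after it.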

extract_passages_from_paragraphs(paragraphs, ...)

Loops through paragraphs and extracts the sentences that contain a keyword.

get_elections(string)

Checks whether a string contains the words "election" and "presidential election" and returns a corresponding string.

get_russia_and_sanction(string)

Checks whether a string contains the words "russia" and "sanction" and returns a corresponding string.

get_strategies([lst, strategy_keywords, ...])

Searches for the strategy_keywords in a list of strings or DataFrame and returns a boolean list or DataFrame with the results (for each strategy).
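
A rough sketch of a per-strategy keyword search over a list of strings is shown below. The function name find_strategies, the dictionary layout (strategy name mapped to a list of keywords), and the example strategy names are assumptions for illustration only; the DataFrame case is not covered.

```python
def find_strategies(texts, strategy_keywords):
    """Return, for each strategy, one boolean per text.

    Hypothetical helper: strategy_keywords maps a strategy name
    to a list of lowercase keywords.
    """
    return {
        strategy: [
            any(keyword in text.lower() for keyword in keywords)
            for text in texts
        ]
        for strategy, keywords in strategy_keywords.items()
    }


# Example strategy names and keywords are made up for this sketch
example_keywords = {"exit": ["exit", "withdraw"], "stay": ["continue"]}
texts = ["We will withdraw from Russia.", "We continue to operate."]
flags = find_strategies(texts, example_keywords)
```

Each strategy yields a boolean list that is aligned with the input texts.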

keyword_threshold_search_exclude_mod(string)

Checks if a string contains one of the keywords and does not contain a modifier_word preceding the keyword.

keyword_threshold_search_include_mod(string)

Checks if a string contains one of the keywords and contains a modifier_word preceding the keyword.
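
The two modifier checks can be illustrated with the sketch below. The function names, the explicit keywords and modifier_words arguments, and the regex tokenisation are assumptions made for this example; the sketch also ignores any occurrence thresholds and treats "exclude" as "at least one keyword occurrence is not directly preceded by a modifier word".

```python
import re


def _tokens(text):
    # Lowercase word tokens; punctuation is dropped
    return re.findall(r"[a-z']+", text.lower())


def keyword_with_modifier(text, keywords, modifier_words):
    """True if some keyword is directly preceded by a modifier word."""
    tokens = _tokens(text)
    return any(
        token in keywords and i > 0 and tokens[i - 1] in modifier_words
        for i, token in enumerate(tokens)
    )


def keyword_without_modifier(text, keywords, modifier_words):
    """True if some keyword is not directly preceded by a modifier word."""
    tokens = _tokens(text)
    return any(
        token in keywords and (i == 0 or tokens[i - 1] not in modifier_words)
        for i, token in enumerate(tokens)
    )


keywords = {"russia"}
modifiers = {"excluding"}
with_mod = keyword_with_modifier("Results excluding Russia were strong.", keywords, modifiers)
without_mod = keyword_without_modifier("Russia was weak.", keywords, modifiers)
```

A phrase such as "excluding Russia" satisfies the first check only, whereas a plain mention of the keyword satisfies the second check only.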

list_to_string(list[, separator])

Converts a list of strings to a string with a separator.

process_corpus(corpus, nlp_model[, ...])

Processes a corpus (list of documents/texts) with spaCy and an NLP model.

process_text(text, nlp_model[, lemmatize, ...])

Processes a text with spaCy and an NLP model.

process_text_nlp(text_nlp[, lemmatize, ...])

Processes a spaCy document.

remove_sentences_under_threshold(corpus[, ...])

Removes sentences from a corpus that contain threshold words or fewer.
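
A minimal sketch of such a filter is given below, assuming the corpus is a list of documents and each document is a list of sentence strings. That layout, the function name drop_short_sentences, and whitespace-based word counting are assumptions for illustration and may differ from the module's actual data structures.

```python
def drop_short_sentences(corpus, threshold=1):
    """Keep only sentences with more than `threshold` words.

    Hypothetical helper: corpus is a list of documents,
    each document a list of sentence strings.
    """
    return [
        [sentence for sentence in document if len(sentence.split()) > threshold]
        for document in corpus
    ]


corpus = [["Thanks.", "Russia hurt our margins."], ["Yes.", "Okay then."]]
filtered = drop_short_sentences(corpus, threshold=1)
```

One-word sentences are dropped, and the document structure of the corpus is preserved.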

sample_strategies(dataframe[, k])

Samples the strategies from the dataframe.

starts_with_additional_word(word, ...)

Checks if a word starts with an additional_word.

strategy_keywords_tolist([strategy_keywords])

Converts the dictionary of strategy_keywords to a list of keywords.