infineac.process_text.process_corpus
- infineac.process_text.process_corpus(corpus: list[str], nlp_model, lemmatize: bool = True, lowercase: bool = True, remove_stopwords: bool = True, remove_punctuation: bool = True, remove_numeric: bool = True, remove_currency: bool = True, remove_space: bool = True, remove_additional_words_part: list[str] = [], remove_specific_stopwords: list[list[str]] = []) → list[list[str]]
Processes a corpus (list of documents/texts) with spaCy and an NLP model.
According to the parameters, each document is lemmatized and lowercased, and stopwords, additional words, punctuation, numeric, currency and space tokens, as well as names, strategies and additional_words_whole, are removed from the corpus.
- Parameters:
corpus (list[str]) – List of texts to be processed.
nlp_model (spacy.lang) – The spaCy NLP model.
lemmatize (bool, default: True) – If document should be lemmatized.
lowercase (bool, default: True) – If document should be lowercased.
remove_stopwords (bool, default: True) – If stopwords should be removed from document.
remove_punctuation (bool, default: True) – If punctuation should be removed from document.
remove_numeric (bool, default: True) – If numerics should be removed from document.
remove_currency (bool, default: True) – If currency symbols should be removed from document.
remove_space (bool, default: True) – If spaces should be removed from document.
remove_additional_words_part (list[str], default: []) – List of additional words to be removed from the document. These words can be part of another word.
remove_specific_stopwords (list[list[str]], default: []) – List of lists of stopwords to be removed from the document. Each list of stopwords corresponds to a document in the corpus.
- Returns:
The processed corpus as a list of lists (texts) of tokens.
- Return type:
list[list[str]]
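The interplay of the filtering parameters can be illustrated with a small, spaCy-free sketch. This is not the library's implementation: `sketch_process_corpus` and `BASIC_STOPWORDS` are hypothetical names, tokenization uses a simple regular expression instead of an NLP model, lemmatization is omitted, and it is an assumption here that a word in `remove_additional_words_part` causes every token containing it to be dropped. The sketch mainly shows how `remove_specific_stopwords` lines up one list per document with `corpus`.

```python
import re

# Hypothetical stand-in stopword list; the real function relies on the spaCy model.
BASIC_STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "by"}


def sketch_process_corpus(
    corpus,
    lowercase=True,
    remove_stopwords=True,
    remove_numeric=True,
    remove_additional_words_part=None,
    remove_specific_stopwords=None,
):
    """Simplified illustration of the filtering flags (no spaCy, no lemmatization)."""
    parts = [w.lower() for w in (remove_additional_words_part or [])]
    specific = remove_specific_stopwords or [[] for _ in corpus]
    if len(specific) != len(corpus):
        raise ValueError("remove_specific_stopwords needs one list per document")

    processed = []
    for text, doc_stopwords in zip(corpus, specific):
        doc_stop = {w.lower() for w in doc_stopwords}
        tokens = []
        # Matching only words and numbers implicitly drops punctuation,
        # currency symbols and whitespace.
        for token in re.findall(r"[A-Za-z]+|\d+(?:\.\d+)?", text):
            key = token.lower()
            if remove_numeric and token[0].isdigit():
                continue
            if remove_stopwords and key in BASIC_STOPWORDS:
                continue
            if key in doc_stop:  # stopwords specific to this document
                continue
            if any(part in key for part in parts):  # assumed substring match
                continue
            tokens.append(key if lowercase else token)
        processed.append(tokens)
    return processed


corpus = [
    "The price of oil rose by 5% in Russia.",
    "Sanctions hit the Russian banks, and exports fell.",
]
result = sketch_process_corpus(
    corpus,
    remove_additional_words_part=["russia"],
    remove_specific_stopwords=[["oil"], ["banks"]],
)
print(result)
# [['price', 'rose'], ['sanctions', 'hit', 'exports', 'fell']]
```

In this sketch "oil" is removed only from the first document and "banks" only from the second, while "russia" removes both "Russia" and "Russian" because it is matched as part of a word. The return value has the documented shape: one list of tokens per input text.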