infineac.process_text.process_corpus#

infineac.process_text.process_corpus(corpus: list[str], nlp_model, lemmatize: bool = True, lowercase: bool = True, remove_stopwords: bool = True, remove_punctuation: bool = True, remove_numeric: bool = True, remove_currency: bool = True, remove_space: bool = True, remove_additional_words_part: list[str] = [], remove_specific_stopwords: list[list[str]] = []) → list[list[str]][source]#

Processes a corpus (list of documents/texts) with spaCy and an NLP model.

Depending on the parameters, each document is lemmatized and lowercased, and stopword, punctuation, numeric, currency and space tokens are removed, along with any additional word parts and document-specific stopwords.

Parameters:
  • corpus (list[str]) – List of texts to be processed.

  • nlp_model (spacy.lang) – The spaCy NLP model.

  • lemmatize (bool, default: True) – Whether the documents should be lemmatized.

  • lowercase (bool, default: True) – Whether the documents should be lowercased.

  • remove_stopwords (bool, default: True) – Whether stopwords should be removed from the documents.

  • remove_punctuation (bool, default: True) – Whether punctuation should be removed from the documents.

  • remove_numeric (bool, default: True) – Whether numeric tokens should be removed from the documents.

  • remove_currency (bool, default: True) – Whether currency symbols should be removed from the documents.

  • remove_space (bool, default: True) – Whether whitespace tokens should be removed from the documents.

  • remove_additional_words_part (list[str], default: []) – List of additional words to be removed from the documents. These words may also occur as part of another word.

  • remove_specific_stopwords (list[list[str]], default: []) – List of lists of stopwords to be removed from the document. Each list of stopwords corresponds to a document in the corpus.

Returns:

The processed corpus as a list of lists (texts) of tokens.

Return type:

list[list[str]]
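The token-level filtering described above can be sketched without the actual library. The snippet below is a hypothetical, simplified re-implementation of the per-document logic, using a small stand-in class instead of spaCy tokens (real code would iterate over `nlp_model(text)`); the function name `process_document` and all defaults mirror the parameters documented here but are illustrative assumptions, not the package's source.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Minimal stand-in for a spaCy token (illustrative only)."""
    text: str
    lemma_: str
    is_stop: bool = False
    is_punct: bool = False
    is_space: bool = False
    like_num: bool = False
    is_currency: bool = False


def process_document(tokens, lemmatize=True, lowercase=True,
                     remove_stopwords=True, remove_punctuation=True,
                     remove_numeric=True, remove_currency=True,
                     remove_space=True, remove_additional_words_part=(),
                     specific_stopwords=()):
    """Filter and normalize one document's tokens (sketch of the behavior)."""
    out = []
    for tok in tokens:
        # Drop whole token classes according to the flags.
        if remove_stopwords and tok.is_stop:
            continue
        if remove_punctuation and tok.is_punct:
            continue
        if remove_numeric and tok.like_num:
            continue
        if remove_currency and tok.is_currency:
            continue
        if remove_space and tok.is_space:
            continue
        word = tok.lemma_ if lemmatize else tok.text
        if lowercase:
            word = word.lower()
        # Additional words are matched as substrings ("part of another word").
        if any(part in word for part in remove_additional_words_part):
            continue
        # Document-specific stopwords are matched exactly.
        if word in specific_stopwords:
            continue
        out.append(word)
    return out
```

Applied to a corpus, each document would get its own `specific_stopwords` list from `remove_specific_stopwords`, producing one token list per input text, i.e. the documented `list[list[str]]` return value.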