infineac.process_text.process_text

infineac.process_text.process_text(text: str, nlp_model, lemmatize: bool = True, lowercase: bool = True, remove_stopwords: bool = True, remove_punctuation: bool = True, remove_numeric: bool = True, remove_currency: bool = True, remove_space: bool = True, remove_additional_words_part: list[str] = [], remove_additional_words_whole: list[str] = []) → list[str]

Processes a text using a spaCy NLP model.

Depending on the parameters, the text is lemmatized and lowercased, and stopwords, punctuation, numeric, currency, and space tokens are removed, along with any words listed in remove_additional_words_part and remove_additional_words_whole.
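The flag-driven filtering described above can be sketched roughly as follows. This is an illustration only, not infineac's implementation: FakeToken and filter_tokens are hypothetical names, and the mapping from each remove_* flag to a token attribute (is_stop, is_punct, like_num, is_currency, is_space, which mirror spaCy's token attributes) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy token (not part of infineac or spaCy)."""
    text: str
    lemma_: str
    is_stop: bool = False
    is_punct: bool = False
    like_num: bool = False
    is_currency: bool = False
    is_space: bool = False


def filter_tokens(tokens, lemmatize=True, lowercase=True, remove_stopwords=True,
                  remove_punctuation=True, remove_numeric=True,
                  remove_currency=True, remove_space=True):
    """Drop tokens matched by an enabled remove_* flag, then normalize the rest."""
    kept = []
    for tok in tokens:
        # Assumed mapping from each flag to the token attribute it checks.
        if remove_stopwords and tok.is_stop:
            continue
        if remove_punctuation and tok.is_punct:
            continue
        if remove_numeric and tok.like_num:
            continue
        if remove_currency and tok.is_currency:
            continue
        if remove_space and tok.is_space:
            continue
        word = tok.lemma_ if lemmatize else tok.text
        kept.append(word.lower() if lowercase else word)
    return kept


tokens = [
    FakeToken("The", "the", is_stop=True),
    FakeToken("Prices", "price"),
    FakeToken("rose", "rise"),
    FakeToken("$", "$", is_currency=True),
    FakeToken("5", "5", like_num=True),
    FakeToken(".", ".", is_punct=True),
]
result = filter_tokens(tokens)
```

With every flag left at its default, only the lemmatized, lowercased content words remain in result; switching a flag off keeps the corresponding tokens in the output.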

Parameters:
  • text (str) – The text document to be processed.

  • nlp_model (spacy.lang) – The spaCy NLP model.

  • lemmatize (bool, default: True) – If document should be lemmatized.

  • lowercase (bool, default: True) – If document should be lowercased.

  • remove_stopwords (bool, default: True) – If stopwords should be removed from document.

  • remove_punctuation (bool, default: True) – If punctuation should be removed from document.

  • remove_numeric (bool, default: True) – If numerics should be removed from document.

  • remove_currency (bool, default: True) – If currency symbols should be removed from document.

  • remove_space (bool, default: True) – If spaces should be removed from document.

  • remove_additional_words_part (list[str], default: []) – List of additional words to be removed from the document. These words may occur as part of another word.

  • remove_additional_words_whole (list[str], default: []) – List of additional words to be removed from the document. These words must be a whole, individual word.
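The difference between the two additional-word lists can be illustrated with a small sketch. It is not infineac's code: drop_additional_words is a hypothetical helper, and it assumes that a "part" match drops the entire token containing the word (the library might instead strip only the matching substring).

```python
def drop_additional_words(tokens, words_part=(), words_whole=()):
    """Remove tokens by whole-word match or by substring match (assumed semantics)."""
    kept = []
    for tok in tokens:
        # Whole-word list: the token must equal the word exactly.
        if tok in words_whole:
            continue
        # Part list: the word may appear anywhere inside the token.
        if any(part in tok for part in words_part):
            continue
        kept.append(tok)
    return kept


tokens = ["bank", "banking", "loan", "loans"]

# Substring matching removes both "bank" and "banking".
part_result = drop_additional_words(tokens, words_part=["bank"])

# Whole-word matching removes only the exact token "bank".
whole_result = drop_additional_words(tokens, words_whole=["bank"])
```

Under these assumptions, part_result keeps only the loan-related tokens, while whole_result still contains "banking".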

Returns:

The processed document as a list of tokens.

Return type:

list[str]