infineac.process_text.process_text_nlp

infineac.process_text.process_text_nlp(text_nlp: str, lemmatize: bool = True, lowercase: bool = True, remove_stopwords: bool = True, remove_punctuation: bool = True, remove_numeric: bool = True, remove_currency: bool = True, remove_space: bool = True, remove_additional_words_part: list[str] = [], remove_additional_words_whole: list[str] = []) → list[str]

Processes a spaCy document.

Depending on the parameters, the document is lemmatized and lowercased, and stopwords, punctuation, numeric, currency and space tokens are removed, along with any words listed in remove_additional_words_part and remove_additional_words_whole.

Parameters:
  • text_nlp (str) – The spaCy document to be processed.

  • lemmatize (bool, default: True) – Whether the document should be lemmatized.

  • lowercase (bool, default: True) – Whether the document should be lowercased.

  • remove_stopwords (bool, default: True) – Whether stopwords should be removed from the document.

  • remove_punctuation (bool, default: True) – Whether punctuation should be removed from the document.

  • remove_numeric (bool, default: False) – Whether numeric tokens should be removed from the document.

  • remove_currency (bool, default: True) – Whether currency symbols should be removed from the document.

  • remove_space (bool, default: True) – Whether whitespace tokens should be removed from the document.

  • remove_additional_words_part (list[str], default: []) – List of additional words to be removed from the document. These words can be part of another word.

  • remove_additional_words_whole (list[str], default: []) – List of additional words to be removed from the document. These words must be a whole, individual word.

Returns:

The processed document as a list of tokens.

Return type:

list[str]
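
Example:

The filtering described above can be illustrated with a small, self-contained sketch. It is not the infineac implementation: FakeToken and process_tokens_sketch are hypothetical names, and FakeToken only mirrors the spaCy Token attributes the sketch relies on (lemma_, is_stop, is_punct, like_num, is_currency, is_space), so the example runs without a spaCy model. The sketch also assumes that a token is dropped entirely when it contains an entry of remove_additional_words_part, and that the comparison is made against the lowercased token text; the library may behave differently.

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in exposing the spaCy Token attributes used below."""
    text: str
    lemma_: str
    is_stop: bool = False
    is_punct: bool = False
    like_num: bool = False
    is_currency: bool = False
    is_space: bool = False


def process_tokens_sketch(
    tokens,
    lemmatize=True,
    lowercase=True,
    remove_stopwords=True,
    remove_punctuation=True,
    remove_numeric=True,
    remove_currency=True,
    remove_space=True,
    remove_additional_words_part=(),
    remove_additional_words_whole=(),
):
    """Illustrative re-creation of the documented behaviour (an assumption)."""
    result = []
    for tok in tokens:
        # Drop tokens by category, depending on the flags.
        if remove_stopwords and tok.is_stop:
            continue
        if remove_punctuation and tok.is_punct:
            continue
        if remove_numeric and tok.like_num:
            continue
        if remove_currency and tok.is_currency:
            continue
        if remove_space and tok.is_space:
            continue
        lowered = tok.text.lower()
        # "part": assumed to drop the token if it contains any listed word.
        if any(part in lowered for part in remove_additional_words_part):
            continue
        # "whole": drop the token only on an exact match.
        if lowered in remove_additional_words_whole:
            continue
        word = tok.lemma_ if lemmatize else tok.text
        result.append(word.lower() if lowercase else word)
    return result


# Hand-built tokens for "The Russian prices rose 5 %."
SAMPLE = [
    FakeToken("The", "the", is_stop=True),
    FakeToken("Russian", "russian"),
    FakeToken("prices", "price"),
    FakeToken("rose", "rise"),
    FakeToken("5", "5", like_num=True),
    FakeToken("%", "%", is_punct=True),
    FakeToken(".", ".", is_punct=True),
]

print(process_tokens_sketch(SAMPLE))
print(process_tokens_sketch(SAMPLE, remove_additional_words_part=["russia"]))
print(process_tokens_sketch(SAMPLE, remove_additional_words_whole=["russia"]))
```

Under these assumptions, passing "russia" in remove_additional_words_part also removes "Russian", because the word only needs to occur inside the token, whereas passing it in remove_additional_words_whole leaves "Russian" in place, because only an exact match counts.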