infineac.process_text.extract_passages_from_paragraphs
- infineac.process_text.extract_passages_from_paragraphs(paragraphs: list[str], keywords: list[str] | dict[str, int], nlp_model, modifier_words: list[str] = ['disregarding', 'except', 'excluding', 'ignoring', 'leaving out', 'not including', 'omitting'], context_window_sentence: tuple[int, int] | int = 0, join_adjacent_sentences: bool = True, subsequent_paragraphs: int = 0, return_type: str = 'list', keyword_n_paragraphs_above: int = -1) → str | list[list[str]]
Loops through paragraphs and extracts the sentences that contain a keyword.
If a keyword occurs in a paragraph, the sentence containing it is extracted together with its surrounding context (context_window_sentence). Additionally, the number of following paragraphs given by subsequent_paragraphs is extracted.
- Parameters:
paragraphs (list[str]) – List of paragraphs to loop through.
keywords (list[str] | dict[str, int]) – List of keywords to search for in the paragraphs. If keywords is a dictionary, the keys are the keywords.
nlp_model (spacy.lang) – NLP model.
modifier_words (list[str], default: MODIFIER_WORDS) – List of modifier_words, which must not precede the keyword.
context_window_sentence (tuple[int, int] | int, default: 0) – The context window of the sentences to be extracted. Either an integer or a tuple of length 2. The first element of the tuple indicates the number of sentences to be extracted before the sentence the keyword was found in, the second element indicates the number of sentences after it. If only an integer is provided, the same number of sentences is extracted before and after the keyword sentence. If one of the elements is -1, all sentences before or after the keyword sentence are extracted. Passing -1 for both therefore extracts all sentences before and after the keyword sentence, i.e. the entire paragraph.
join_adjacent_sentences (bool, default: True) – Whether to join adjacent sentences or keep them as individual sentences. If context_window_sentence > 0, this parameter is automatically set to True.
subsequent_paragraphs (int, default: 0) – Number of subsequent paragraphs to extract after the one containing a keyword.
return_type (str, default: "list") – The return type of the method. Either “str” or “list”.
keyword_n_paragraphs_above (int, default: -1) – Number of paragraphs above the current paragraph where the keyword is found.
- Returns:
The extracted passages, either as a single concatenated string or as a list of paragraphs, where each paragraph is a list of passages (str). The form is determined by return_type.
- Return type:
str | list[list[str]]
- Raises:
ValueError – If return_type is not “str” or “list”.
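The context_window_sentence rules described above can be illustrated with a small standalone sketch. This is not the infineac implementation: the helper names normalize_window and window_bounds are hypothetical and exist only for this example. They show one way the integer, tuple, and -1 cases might map to a range of sentence indices around the sentence that contains the keyword.

```python
def normalize_window(context_window_sentence):
    """Turn an int or a (before, after) tuple into a (before, after) tuple.

    Hypothetical helper, written for illustration only.
    """
    if isinstance(context_window_sentence, int):
        # A single integer applies to both sides of the keyword sentence.
        return (context_window_sentence, context_window_sentence)
    before, after = context_window_sentence
    return (before, after)


def window_bounds(hit_index, n_sentences, context_window_sentence):
    """Return the half-open index range [start, end) of sentences to keep.

    hit_index is the position of the sentence containing the keyword.
    A value of -1 on either side means "everything on that side".
    """
    before, after = normalize_window(context_window_sentence)
    start = 0 if before == -1 else max(0, hit_index - before)
    end = n_sentences if after == -1 else min(n_sentences, hit_index + after + 1)
    return start, end


# Toy paragraph, already split into sentences; the keyword is in sentence 2.
sentences = ["One.", "Two.", "Three has the keyword.", "Four.", "Five."]
hit = 2

for window in (0, 1, (0, -1), -1):
    start, end = window_bounds(hit, len(sentences), window)
    print(window, sentences[start:end])
```

With the toy paragraph, a window of 0 keeps only the keyword sentence, 1 adds one neighbour on each side, (0, -1) keeps the keyword sentence plus everything after it, and -1 keeps the whole paragraph. The real function additionally handles keyword matching, modifier words, and subsequent paragraphs, none of which this sketch covers.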