infineac.pipeline.pipeline
- infineac.pipeline.pipeline(path: str | None = None, preload_events: bool | str = False, preload_corpus: bool | str = False, keywords=[], nlp_model=None, year=2019, modifier_words=['disregarding', 'except', 'excluding', 'ignoring', 'leaving out', 'not including', 'omitting'], sections='all', context_window_sentence=0, join_adjacent_sentences=True, subsequent_paragraphs=0, extract_answers=True, lemmatize: bool = True, lowercase: bool = True, remove_stopwords: bool = True, remove_punctuation: bool = True, remove_numeric: bool = True, remove_currency: bool = True, remove_space: bool = True, remove_keywords: bool = True, remove_names: bool = True, remove_strategies: bool | dict[str, list[str]] = True, remove_additional_stopwords: bool | list[str] = True, representation_model=None, embedding_model=None, umap_model=None, vectorizer_model=None, nr_topics=None, predefined_topics: bool | list[list[str]] | None = None, threshold: int = 1)
Pipeline to extract topics from a given list of files containing earnings call transcripts or from a list of events.
- Parameters:
path (str | None, default: None) – Path to the directory of earnings call transcripts.
preload_events (bool | str, default: False) – Path to file containing events.
preload_corpus (bool | str, default: False) – Path to file containing corpus.
keywords (list[str] | dict[str, int], default: []) – List of keywords to search for in the events; the passages containing them are extracted. If keywords is a dictionary, the keys are the keywords.
nlp_model (spacy.lang, default: None) – NLP model.
lemmatize (bool, default: True) – If document should be lemmatized.
year (int, default: constants.BASE_YEAR) – Year to filter the events by.
modifier_words (list[str], default: MODIFIER_WORDS) – List of modifier words that must not precede the keyword.
sections (str, default: "all") – Section of the event to extract the passages from. Either “all”, “presentation” or “qa”.
context_window_sentence (tuple[int, int] | int, default: 0) – The context window of the sentences to be extracted. Either an integer or a tuple of length 2. The first element of the tuple is the number of sentences to extract before the sentence in which the keyword was found; the second element is the number of sentences to extract after it. If only an integer is provided, the same number of sentences is extracted before and after that sentence. If one of the elements is -1, all sentences before or after that sentence are extracted. Passing -1 on its own therefore extracts all sentences before and after the keyword, i.e. the entire paragraph.
join_adjacent_sentences (bool, default: True) – Whether to join adjacent sentences.
subsequent_paragraphs (int, default: 0) – Number of subsequent paragraphs to extract after the one containing a keyword.
extract_answers (bool, default: True) – If True, entire answers to questions that include a keyword are also extracted.
return_type (str, default: "list") – The return type of the method. Either “str” or “list”
lowercase (bool, default: True) – If document should be lowercased.
remove_stopwords (bool, default: True) – If stopwords should be removed from document.
remove_punctuation (bool, default: True) – If punctuation should be removed from document.
remove_numeric (bool, default: True) – If numerics should be removed from document.
remove_currency (bool, default: True) – If currency symbols should be removed from document.
remove_space (bool, default: True) – If spaces should be removed from document.
remove_keywords (bool, default: True) – If keywords should be removed from document.
remove_names (bool, default: True) – If participant names should be removed from document.
remove_strategies (bool | dict[str, list[str]], default: True) – If the strategy keywords should be removed from document.
remove_additional_stopwords (bool | list[str], default: True) – If additional stopwords should be removed from document.
representation_model (any, default: None) – Representation model to use.
embedding_model (any, default: None) – Embedding model to use. If None, the default embedding model is used.
umap_model (any, default: None) – UMAP model to use. If None, the default UMAP model is used.
vectorizer_model (any, default: None) – Vectorizer model to use. If None, the default vectorizer model is used.
nr_topics (any, default: None) – Number of topics to extract. If None, the number of topics is determined automatically.
predefined_topics (bool | list[list[str]] | None, default: None) – Whether to use predefined_topics. If True, infineac.constants.TOPICS() is used.
threshold (int, default: 1) – Threshold to remove documents from the corpus. If a document contains fewer words than the threshold, it is removed.
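The context_window_sentence semantics described above can be illustrated with a small standalone sketch. It does not use infineac and is not the library's implementation; the helper name `select_context` and the toy paragraph are assumptions made purely for illustration.

```python
def select_context(sentences, hit_index, context_window_sentence=0):
    """Return the sentences around sentences[hit_index] (hypothetical helper).

    Mirrors the documented rules: an int applies to both sides, a tuple is
    (before, after), and -1 means "all sentences on that side".
    """
    # Normalize a single integer to a (before, after) pair.
    if isinstance(context_window_sentence, int):
        before = after = context_window_sentence
    else:
        before, after = context_window_sentence

    # -1 extends the window to the start or end of the paragraph.
    start = 0 if before == -1 else max(0, hit_index - before)
    end = len(sentences) if after == -1 else min(len(sentences), hit_index + after + 1)
    return sentences[start:end]


# Toy paragraph; the keyword is assumed to be found in the sentence at index 1.
paragraph = ["S0.", "S1 mentions the keyword.", "S2.", "S3."]

only_hit = select_context(paragraph, 1, 0)            # just the matching sentence
one_before = select_context(paragraph, 2, (1, 0))     # one sentence before, none after
whole_paragraph = select_context(paragraph, 1, -1)    # everything before and after
rest_of_paragraph = select_context(paragraph, 1, (0, -1))  # hit plus all following
```

With the default of 0 only the matching sentence is kept, while -1 returns the whole paragraph.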
- Returns:
Tuple of two polars DataFrames. The first DataFrame contains the results for each event. The second DataFrame contains the results aggregated for each company and year.
- Return type:
Tuple[polars.DataFrame, polars.DataFrame]
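A call might look like the following sketch. It relies only on the signature documented above; the wrapper name `run_example`, the keyword list, and the chosen argument values are illustrative assumptions, and whether the defaults (for example nlp_model=None) are sufficient for a given setup has not been verified here.

```python
def run_example(transcript_dir: str):
    """Hypothetical wrapper: run the pipeline on a directory of transcripts."""
    # Imported lazily so the sketch can be loaded without infineac installed.
    from infineac.pipeline import pipeline

    # Per the documentation, the result is a tuple of two polars DataFrames:
    # per-event results and results aggregated by company and year.
    events_df, aggregated_df = pipeline(
        path=transcript_dir,
        keywords=["supply chain"],        # illustrative keyword list
        year=2019,                        # documented default
        sections="qa",                    # "all", "presentation" or "qa"
        context_window_sentence=(1, 1),   # one sentence before and after a hit
        extract_answers=True,
        threshold=1,
    )
    return events_df, aggregated_df
```

The remaining preprocessing and model parameters are left at their defaults in this sketch.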