infineac.process_event

Contains functions for manipulating events and strings and for extracting the corresponding information for the infineac package. For text processing, it uses the infineac.process_text module.

Examples

>>> from pathlib import Path
>>> import infineac.process_event as process_event
>>> import infineac.file_loader as file_loader
>>> import spacy_stanza
>>> nlp = spacy_stanza.load_pipeline("en", processors="tokenize, lemma")
>>> nlp.add_pipe('sentencizer')
>>> PATH_DIR = "data/transcripts/"
>>> files = list(Path(PATH_DIR).rglob("*.xml"))
>>> events = file_loader.load_files_from_xml(files)
>>> keywords = {"russia": 1, "ukraine": 1}
>>> process_event.events_to_corpus(events=events, keywords=keywords, nlp_model=nlp)

Notes

An event is a dictionary with the following key-value pairs:

‘file’: str - the file name
‘year_upload’: int - the year of the upload
‘corp_participants’: list[list[str]] - the corporate participants
‘corp_participants_collapsed’: list[str] - collapsed list
‘conf_participants’: list[list[str]] - the conference call participants
‘conf_participants_collapsed’: list[str] - collapsed list
‘presentation’: list[dict] - the presentation part
‘presentation_collapsed’: list[str] - collapsed list
‘qa’: list[dict] - the Q&A part
‘qa_collapsed’: list[str] - collapsed list
‘action’: str - the action (e.g. publish)
‘story_type’: str - the story type (e.g. transcript)
‘version’: str - the version of the publication (e.g. final)
‘title’: str - the title of the earnings call
‘city’: str - the city of the earnings call
‘company_name’: str - the company of the earnings call
‘company_ticker’: str - the company ticker of the earnings call
‘date’: date - the date of the earnings call
‘id’: int - the id of the publication
‘last_update’: date - the last update of the publication
‘event_type_id’: int - the event type id
‘event_type_name’: str - the event type name
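
The key-value pairs above can be sketched as a typed dictionary. This is only an illustration of the documented layout, not something the package defines: the `Event` class name is hypothetical, every sample value is made up, and the contents of the dictionaries inside ‘presentation’ and ‘qa’ are left unspecified because they are not described here.

```python
import datetime
from typing import TypedDict


class Event(TypedDict, total=False):
    """Hypothetical type sketch of an event; not part of infineac."""

    file: str
    year_upload: int
    corp_participants: list[list[str]]
    corp_participants_collapsed: list[str]
    conf_participants: list[list[str]]
    conf_participants_collapsed: list[str]
    presentation: list[dict]
    presentation_collapsed: list[str]
    qa: list[dict]
    qa_collapsed: list[str]
    action: str
    story_type: str
    version: str
    title: str
    city: str
    company_name: str
    company_ticker: str
    date: datetime.date
    id: int
    last_update: datetime.date
    event_type_id: int
    event_type_name: str


# A partially filled event with invented values, for illustration only.
event: Event = {
    "file": "example_call.xml",
    "year_upload": 2022,
    "company_name": "Example Corp",
    "date": datetime.date(2022, 3, 1),
    "presentation_collapsed": ["Good morning and welcome to the call."],
    "qa_collapsed": ["Can you comment on demand in the region?"],
}
```

At runtime a `TypedDict` is an ordinary `dict`, so a sketch like this only documents the expected keys for readers and type checkers.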

Functions

`check_if_keyword_align_qa`(qa, keywords)	Checks whether a keyword occurs in a question and in the answer to it.
`check_keywords_in_event`(event[, keywords, ...])	Checks whether keywords are present in the presentation or Q&A part of an event.
`corpus_list_to_dataframe`(corpus)	Converts a corpus (nested list of texts) to a polars DataFrame with indices, indicating the position of the texts in the corpus: event - presentation or qa - part - paragraph - sentence.
`create_participants_to_remove`(event)	Creates a list containing the names of the participants of an event to be later removed during the text processing.
`create_samples`(df)	Creates 15 samples for keyword 'russia'.
`events_to_corpus`(events, nlp_model[, ...])	Converts a list of events to a corpus (list of texts).
`excluded_sentences_by_mod_words`(events, ...)	Extracts the sentences that are excluded by the modifier words.
`extract_infos_from_events`(events)	Extracts the id, year, date and company name from a list of events.
`extract_passages_from_event`(event, keywords, ...)	Wrapper function to extract important passages from an event; it combines `extract_passages_from_presentation()` and `extract_passages_from_qa()`.
`extract_passages_from_events`(events, ...[, ...])	Wrapper function to extract important paragraphs from a list of events.
`extract_passages_from_presentation`(...[, ...])	Extracts important passages from the presentation section of an event.
`extract_passages_from_qa`(qa, keywords, nlp_model)	Extracts important passages, like `extract_passages_from_presentation()`, but for the Q&A section of an event.
`filter_events`(events[, year, keywords, ...])	Filters events based on a given year and keywords.
`test_positions`(events)	Checks if all positions of the speakers of the given events are valid.
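
To make the index hierarchy mentioned for `corpus_list_to_dataframe` (event, presentation or qa, part, paragraph, sentence) more concrete, here is a minimal sketch that flattens a nested list into one row per sentence. It is not the package's implementation. The helper name `corpus_to_rows`, the column names, and the nesting order `corpus[event][section][part][paragraph][sentence]` (with section 0 as the presentation and section 1 as the Q&A) are all assumptions made for illustration.

```python
def corpus_to_rows(corpus):
    """Flatten a nested corpus into one row per sentence with position indices.

    Assumed (hypothetical) nesting:
    corpus[event][section][part][paragraph][sentence],
    where section 0 is the presentation and section 1 is the Q&A.
    """
    section_names = ["presentation", "qa"]
    rows = []
    for event_idx, event in enumerate(corpus):
        for section_idx, section in enumerate(event):
            for part_idx, part in enumerate(section):
                for paragraph_idx, paragraph in enumerate(part):
                    for sentence_idx, sentence in enumerate(paragraph):
                        rows.append(
                            {
                                "event": event_idx,
                                "section": section_names[section_idx],
                                "part": part_idx,
                                "paragraph": paragraph_idx,
                                "sentence": sentence_idx,
                                "text": sentence,
                            }
                        )
    return rows


# Invented sample corpus: one event with a presentation and a Q&A section.
corpus = [
    [
        [  # presentation
            [  # part 0
                ["Sales grew.", "Costs fell."],  # paragraph 0
            ],
        ],
        [  # qa
            [  # part 0
                ["No exposure to the region."],  # paragraph 0
            ],
        ],
    ],
]

rows = corpus_to_rows(corpus)
```

Under these assumptions, a list of row dictionaries like this could be passed to `polars.DataFrame(rows)` to obtain a table. The columns the package actually produces may differ.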