infineac.process_event

Contains functions that manipulate events and strings and extract the corresponding information for the infineac package. Text processing is delegated to the infineac.process_text module.

Examples

>>> from pathlib import Path
>>> import infineac.process_event as process_event
>>> import infineac.file_loader as file_loader
>>> import spacy_stanza
>>> nlp = spacy_stanza.load_pipeline("en", processors="tokenize, lemma")
>>> nlp.add_pipe('sentencizer')
>>> PATH_DIR = "data/transcripts/"
>>> files = list(Path(PATH_DIR).rglob("*.xml"))
>>> events = file_loader.load_files_from_xml(files)
>>> keywords = {"russia": 1, "ukraine": 1}
>>> process_event.events_to_corpus(events=events, keywords=keywords, nlp_model=nlp)

Notes

An event is a dictionary with the following key-value pairs:
  • ‘file’: str - the file name

  • ‘year_upload’: int - the year of the upload

  • ‘corp_participants’: list[list[str]] - the corporate participants

  • ‘corp_participants_collapsed’: list[str] - the corporate participants as a collapsed list

  • ‘conf_participants’: list[list[str]] - the conference call participants

  • ‘conf_participants_collapsed’: list[str] - the conference call participants as a collapsed list

  • ‘presentation’: list[dict] - the presentation part

  • ‘presentation_collapsed’: list[str] - the presentation part as a collapsed list

  • ‘qa’: list[dict] - the Q&A part

  • ‘qa_collapsed’: list[str] - the Q&A part as a collapsed list

  • ‘action’: str - the action (e.g. publish)

  • ‘story_type’: str - the story type (e.g. transcript)

  • ‘version’: str - the version of the publication (e.g. final)

  • ‘title’: str - the title of the earnings call

  • ‘city’: str - the city of the earnings call

  • ‘company_name’: str - the company of the earnings call

  • ‘company_ticker’: str - the company ticker of the earnings call

  • ‘date’: date - the date of the earnings call

  • ‘id’: int - the id of the publication

  • ‘last_update’: date - the last update of the publication

  • ‘event_type_id’: int - the event type id

  • ‘event_type_name’: str - the event type name
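
The key list above can be turned into a small sanity check. The following is a minimal sketch, not part of infineac: the helper find_invalid_keys, the EXPECTED_TYPES mapping and every value in example_event are hypothetical and only illustrate the keys and top-level types listed above. The inner layout of the participant lists and of the presentation and Q&A dictionaries is not specified here, so those lists are left empty.

```python
from datetime import date

# Top-level type expected for each key, taken from the list above.
EXPECTED_TYPES = {
    "file": str,
    "year_upload": int,
    "corp_participants": list,
    "corp_participants_collapsed": list,
    "conf_participants": list,
    "conf_participants_collapsed": list,
    "presentation": list,
    "presentation_collapsed": list,
    "qa": list,
    "qa_collapsed": list,
    "action": str,
    "story_type": str,
    "version": str,
    "title": str,
    "city": str,
    "company_name": str,
    "company_ticker": str,
    "date": date,
    "id": int,
    "last_update": date,
    "event_type_id": int,
    "event_type_name": str,
}


def find_invalid_keys(event):
    """Return keys that are missing or hold a value of the wrong top-level type."""
    return [
        key
        for key, expected in EXPECTED_TYPES.items()
        if key not in event or not isinstance(event[key], expected)
    ]


# Hypothetical event: all values are invented placeholders.
example_event = {
    "file": "example_transcript.xml",
    "year_upload": 2022,
    "corp_participants": [],
    "corp_participants_collapsed": [],
    "conf_participants": [],
    "conf_participants_collapsed": [],
    "presentation": [],
    "presentation_collapsed": [],
    "qa": [],
    "qa_collapsed": [],
    "action": "publish",
    "story_type": "transcript",
    "version": "final",
    "title": "Example Corp Q1 Earnings Call",
    "city": "Example City",
    "company_name": "Example Corp",
    "company_ticker": "EXM",
    "date": date(2022, 3, 1),
    "id": 1,
    "last_update": date(2022, 3, 2),
    "event_type_id": 1,
    "event_type_name": "Earnings call",
}

# An empty list means every key is present with the expected type.
print(find_invalid_keys(example_event))
```

Only the outermost type of each value is checked; nested element types such as list[list[str]] would need an additional pass.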

Functions

check_if_keyword_align_qa(qa, keywords)

Checks whether a keyword occurs in a question and in the corresponding answer.

check_keywords_in_event(event[, keywords, ...])

Checks whether keywords are present in the presentation or Q&A part of an event.

corpus_list_to_dataframe(corpus)

Converts a corpus (nested list of texts) to a polars DataFrame with indices indicating the position of each text in the corpus: event - presentation or qa - part - paragraph - sentence.

create_participants_to_remove(event)

Creates a list containing the names of the participants of an event to be later removed during the text processing.

create_samples(df)

Creates 15 samples for keyword 'russia'.

events_to_corpus(events, nlp_model[, ...])

Converts a list of events to a corpus (list of texts).

excluded_sentences_by_mod_words(events, ...)

Extracts the sentences that are excluded by the modifier words.

extract_infos_from_events(events)

Extracts the id, year, date and company name from a list of events.

extract_passages_from_event(event, keywords, ...)

Wrapper function to extract important passages from an event: combines extract_passages_from_presentation() and extract_passages_from_qa().

extract_passages_from_events(events, ...[, ...])

Wrapper function to extract important paragraphs from a list of events.

extract_passages_from_presentation(...[, ...])

Extracts important passages from the presentation section of an event.

extract_passages_from_qa(qa, keywords, nlp_model)

Extracts important passages, like extract_passages_from_presentation(), but for the Q&A section of an event.

filter_events(events[, year, keywords, ...])

Filters events based on a given year and keywords.

test_positions(events)

Checks if all positions of the speakers of the given events are valid.
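
The index scheme described for corpus_list_to_dataframe() (event - presentation or qa - part - paragraph - sentence) can be illustrated with plain Python. This is a sketch under stated assumptions, not the library's implementation: it assumes the corpus is nested as corpus[event][section][part][paragraph][sentence] with section 0 being the presentation and section 1 the Q&A, and the helper corpus_to_rows and the toy corpus are hypothetical.

```python
# Assumed order of the two sections inside each event (hypothetical).
SECTION_NAMES = ("presentation", "qa")


def corpus_to_rows(corpus):
    """Flatten a nested corpus into one row per sentence, keeping its position."""
    rows = []
    for event_idx, event in enumerate(corpus):
        for section_idx, section in enumerate(event):
            for part_idx, part in enumerate(section):
                for paragraph_idx, paragraph in enumerate(part):
                    for sentence_idx, sentence in enumerate(paragraph):
                        rows.append(
                            {
                                "event": event_idx,
                                "section": SECTION_NAMES[section_idx],
                                "part": part_idx,
                                "paragraph": paragraph_idx,
                                "sentence": sentence_idx,
                                "text": sentence,
                            }
                        )
    return rows


# Toy corpus with a single event; all texts are invented placeholders.
toy_corpus = [
    [
        # presentation: one part holding one paragraph of two sentences
        [[["Sales grew.", "Costs fell."]]],
        # qa: one part holding two paragraphs of one sentence each
        [[["What about exports?"], ["They are stable."]]],
    ],
]

rows = corpus_to_rows(toy_corpus)
for row in rows:
    print(row)
```

A list of dictionaries like rows can be passed to polars.DataFrame() to obtain a tabular view; whether the resulting columns match those produced by corpus_list_to_dataframe() is not confirmed here.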