infineac.process_event
Contains functions to manipulate events and strings and extract the
corresponding information for the infineac package. For text processing it uses
the infineac.process_text module.
Examples
>>> from pathlib import Path
>>> import infineac.process_event as process_event
>>> import infineac.file_loader as file_loader
>>> import spacy_stanza
>>> nlp = spacy_stanza.load_pipeline("en", processors="tokenize, lemma")
>>> nlp.add_pipe('sentencizer')
>>> PATH_DIR = "data/transcripts/"
>>> files = list(Path(PATH_DIR).rglob("*.xml"))
>>> events = file_loader.load_files_from_xml(files)
>>> keywords = {"russia": 1, "ukraine": 1}
>>> process_event.events_to_corpus(events=events, keywords=keywords, nlp_model=nlp)
Notes
- An event is a dictionary with the following key-value pairs:
‘file’: str - the file name
‘year_upload’: int - the year of the upload
‘corp_participants’: list[list[str]] - the corporate participants
‘corp_participants_collapsed’: list[str] - collapsed list
‘conf_participants’: list[list[str]] - the conference call participants
‘conf_participants_collapsed’: list[str] - collapsed list
‘presentation’: list[dict] - the presentation part
‘presentation_collapsed’: list[str] - collapsed list
‘qa’: list[dict] - the Q&A part
‘qa_collapsed’: list[str] - collapsed list
‘action’: str - the action (e.g. publish)
‘story_type’: str - the story type (e.g. transcript)
‘version’: str - the version of the publication (e.g. final)
‘title’: str - the title of the earnings call
‘city’: str - the city of the earnings call
‘company_name’: str - the company of the earnings call
‘company_ticker’: str - the company ticker of the earnings call
‘date’: date - the date of the earnings call
‘id’: int - the id of the publication
‘last_update’: date - the last update of the publication
‘event_type_id’: int - the event type id
‘event_type_name’: str - the event type name
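The key-value pairs above can be written down as a typed dictionary. The following is a minimal sketch, not part of the infineac package: the names `Event` and `missing_keys` are hypothetical, and the inner structure of the `presentation` and `qa` dictionaries is left open because it is not specified here.

```python
import datetime as dt
from typing import TypedDict


class Event(TypedDict):
    """Hypothetical type mirroring the documented event keys."""

    file: str
    year_upload: int
    corp_participants: list[list[str]]
    corp_participants_collapsed: list[str]
    conf_participants: list[list[str]]
    conf_participants_collapsed: list[str]
    presentation: list[dict]  # inner keys are not documented here
    presentation_collapsed: list[str]
    qa: list[dict]  # inner keys are not documented here
    qa_collapsed: list[str]
    action: str
    story_type: str
    version: str
    title: str
    city: str
    company_name: str
    company_ticker: str
    date: dt.date
    id: int
    last_update: dt.date
    event_type_id: int
    event_type_name: str


def missing_keys(event: dict) -> set[str]:
    """Return the documented keys that are absent from `event`."""
    return set(Event.__required_keys__) - set(event)


# Placeholder values for illustration only, not real data.
partial_event = {"file": "example.xml", "year_upload": 2022}
print(sorted(missing_keys(partial_event)))
```

A check like this may help catch incomplete events before they are passed on to the processing functions.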
Functions
- Checks whether a keyword occurs in a question and in the answer to it.
- Checks whether keywords are present in the presentation or Q&A part of an event.
- Converts a corpus (nested list of texts) to a polars DataFrame with indices indicating the position of each text in the corpus: event - presentation or qa - part - paragraph - sentence.
- Creates a list of the names of an event's participants, to be removed later during text processing.
- Creates 15 samples for keyword 'russia'.
- Converts a list of events to a corpus (list of texts).
- Extracts the sentences that are excluded by the modifier words.
- Extracts the id, year, date and company name from a list of events.
- Wrapper function to extract important passages from an event: comprises of
- Wrapper function to extract important paragraphs from a list of events.
- Extracts important passages from the presentation section of an event.
- Extracts important passages, like
- Filters events based on a given year and keywords.
- Checks if all positions of the speakers of the given events are valid.
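The positional indexing described for the corpus-to-DataFrame conversion (event - presentation or qa - part - paragraph - sentence) can be illustrated with a small sketch. This is not the package's implementation: the function name `corpus_to_rows`, the column names, and the assumption that the corpus is nested exactly five levels deep in that order are all hypothetical. The sketch uses only the standard library and produces a list of row dictionaries.

```python
def corpus_to_rows(corpus):
    """Flatten a five-level nested corpus into rows with positional indices.

    Assumed nesting (hypothetical):
    corpus[event][section][part][paragraph][sentence] -> str
    where section 0 stands for the presentation and section 1 for the Q&A.
    """
    rows = []
    for event_idx, event in enumerate(corpus):
        for section_idx, section in enumerate(event):
            for part_idx, part in enumerate(section):
                for paragraph_idx, paragraph in enumerate(part):
                    for sentence_idx, sentence in enumerate(paragraph):
                        rows.append(
                            {
                                "event_idx": event_idx,
                                "section_idx": section_idx,
                                "part_idx": part_idx,
                                "paragraph_idx": paragraph_idx,
                                "sentence_idx": sentence_idx,
                                "text": sentence,
                            }
                        )
    return rows


# Toy corpus: one event with a two-sentence presentation paragraph
# and a one-sentence Q&A paragraph (placeholder texts).
toy_corpus = [
    [
        [[["first sentence", "second sentence"]]],  # presentation
        [[["an answer"]]],  # Q&A
    ]
]
rows = corpus_to_rows(toy_corpus)
for row in rows:
    print(row)
```

Each row keeps the full path to a text, so a passage can be traced back to its event and section after filtering. A list of dictionaries like this can then be passed to a DataFrame constructor.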