Data Extraction and Text Analysis of Financial Reports
The objective of this assignment is to extract the sections listed below from SEC / EDGAR financial reports and to perform text analysis on them to compute the variables explained below. Links to the SEC / EDGAR financial reports are given in the Excel spreadsheet “cik_list.xlsx”.
To form the link to a financial report, prepend https://www.sec.gov/Archives/ to the value in each cell of column F (cik_list.xlsx). Example: row 2, column F contains edgar/data/3662/0000950170-98-000413.txt. Prepending https://www.sec.gov/Archives/ gives the financial report link, i.e. https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt
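The link construction described above can be sketched as follows. This is a minimal illustration, not a required implementation: the function name build_report_url is hypothetical, and stripping whitespace and a leading slash from the cell value is an assumption made only to guard against untidy spreadsheet cells.

```python
# Base URL given in the assignment text.
BASE_URL = "https://www.sec.gov/Archives/"


def build_report_url(relative_path: str) -> str:
    """Prefix a relative EDGAR path (a column F value) with the archive base URL.

    Hypothetical helper: whitespace and a leading slash are stripped so the
    result does not contain a double slash (an assumption, not a stated rule).
    """
    return BASE_URL + relative_path.strip().lstrip("/")


# Example from the assignment: row 2, column F.
print(build_report_url("edgar/data/3662/0000950170-98-000413.txt"))
```

The same function can be applied to every value read from column F, whichever spreadsheet library is used to load cik_list.xlsx.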
Sections to extract and their abbreviations:
- “Management's Discussion and Analysis”: MDA
- “Quantitative and Qualitative Disclosures about Market Risk”: QQDMR
- “Risk Factors”: RF
The output dataframe should contain:
- All input variables in “cik_list.xlsx”
- mda_positive_score
- mda_negative_score
- mda_polarity_score
- mda_average_sentence_length
- mda_percentage_of_complex_words
- mda_fog_index
- mda_complex_word_count
- mda_word_count
- mda_uncertainty_score
- mda_constraining_score
- mda_positive_word_proportion
- mda_negative_word_proportion
- mda_uncertainty_word_proportion
- mda_constraining_word_proportion
- qqdmr_positive_score
- qqdmr_negative_score
- qqdmr_polarity_score
- qqdmr_average_sentence_length
- qqdmr_percentage_of_complex_words
- qqdmr_fog_index
- qqdmr_complex_word_count
- qqdmr_word_count
- qqdmr_uncertainty_score
- qqdmr_constraining_score
- qqdmr_positive_word_proportion
- qqdmr_negative_word_proportion
- qqdmr_uncertainty_word_proportion
- qqdmr_constraining_word_proportion
- rf_positive_score
- rf_negative_score
- rf_polarity_score
- rf_average_sentence_length
- rf_percentage_of_complex_words
- rf_fog_index
- rf_complex_word_count
- rf_word_count
- rf_uncertainty_score
- rf_constraining_score
- rf_positive_word_proportion
- rf_negative_word_proportion
- rf_uncertainty_word_proportion
- rf_constraining_word_proportion
- constraining_words_whole_report
Check the output data structure spreadsheet for the format of your output.
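Because the output columns follow a regular pattern (the same 14 variables for each of the three sections, plus one whole-report variable), the column list can be generated rather than typed by hand. The sketch below assumes the column order shown in the list above; the names SECTIONS, METRICS and output_columns are illustrative only, and the input column names must be taken from cik_list.xlsx.

```python
# Section prefixes and per-section variable names, in the order listed above.
SECTIONS = ["mda", "qqdmr", "rf"]
METRICS = [
    "positive_score",
    "negative_score",
    "polarity_score",
    "average_sentence_length",
    "percentage_of_complex_words",
    "fog_index",
    "complex_word_count",
    "word_count",
    "uncertainty_score",
    "constraining_score",
    "positive_word_proportion",
    "negative_word_proportion",
    "uncertainty_word_proportion",
    "constraining_word_proportion",
]


def output_columns(input_columns):
    """Return input columns, then section_metric columns, then the whole-report column."""
    columns = list(input_columns)
    columns += [f"{section}_{metric}" for section in SECTIONS for metric in METRICS]
    columns.append("constraining_words_whole_report")
    return columns


# With no input columns this yields the 43 computed column names.
computed = output_columns([])
print(len(computed))
print(computed[0], computed[-1])
```

Passing the header row of cik_list.xlsx as input_columns gives the full header of the output dataframe, which can then be compared against the output data structure spreadsheet.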