airflow: operator and dag/tasks to sync NTD data via DOT API and XLSX (#3415)

* first take at getting odata api to work

* small changes to script

* got 2022 ntd data scraped

* fix flake8

* removing the script files changed

* change bucket name, append test to test bucket

* changes for testing

* refactor task format, file type

* remove csv file

* ymls for 2022 NTD endpoints

* fixed faulty dag and file name

* fix file name to remove 'raw'

* created tasks for 2022 base ntd tables

* add new operator, dag, task and testing for XLSX NTD tables

* got xlsx airflow operator working, added tasks for 4 important xlsx data sources

* revert changes to docker-compose

* simplified code to accept workbooks with single or multiple tabs

* beginning of adding error/other logging appropriately, to finish tomorrow

* cleaned XLSX operator, refactored

* refactor, get different buckets working

* converted to use json, fixed gcsfs upload error

* added placeholder schedule, first day of the month every month

* revise to all lowercase for dag tasks

* fix naming, delete to make it work

* renamed with lowercase

* removed unnecessary XLSX endpoints

* update to requested schedule

* restore prod variables
charlie-costanzo authored Sep 18, 2024
1 parent 00ace1e commit 65fec8c
Showing 46 changed files with 783 additions and 5 deletions.
19 changes: 19 additions & 0 deletions airflow/dags/sync_ntd_data_api/METADATA.yml
@@ -0,0 +1,19 @@
description: "Scrape NTD endpoints from DOT API monthly"
schedule_interval: "0 11 1 * *" # 11am GMT first day of every month
tags:
  - all_gusty_features
default_args:
  owner: airflow
  depends_on_past: False
  catchup: False
  start_date: "2024-09-15"
  email:
    - "hello@calitp.org"
  email_on_failure: True
  email_on_retry: False
  retries: 1
  retry_delay: !timedelta 'minutes: 2'
  concurrency: 50
  #sla: !timedelta 'hours: 2'
wait_for_defaults:
  timeout: 3600
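To make the schedule concrete, here is a minimal sketch of which instants the cron expression "0 11 1 * *" matches. Airflow's scheduler does this itself; the helper name `next_monthly_run` is hypothetical and exists only for illustration.

```python
from datetime import datetime, timezone


def next_monthly_run(after: datetime) -> datetime:
    """Return the first 11:00 on day 1 of a month strictly after `after`.

    Mirrors the cron expression "0 11 1 * *" (hypothetical helper, not part
    of the commit).
    """
    # Start from 11:00 on the 1st of the month that `after` falls in.
    candidate = after.replace(day=1, hour=11, minute=0, second=0, microsecond=0)
    if candidate <= after:
        # That instant has already passed, so roll forward one month.
        if after.month == 12:
            candidate = candidate.replace(year=after.year + 1, month=1)
        else:
            candidate = candidate.replace(month=after.month + 1)
    return candidate


# The configured start_date, interpreted as midnight UTC.
start = datetime(2024, 9, 15, tzinfo=timezone.utc)
print(next_monthly_run(start).isoformat())
```

With the configured start_date of 2024-09-15, the first matching instant is 11:00 UTC on 2024-10-01.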
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'breakdowns'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: 'amkt-4ehs'
file_format: '.json'
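The operator source is not shown in this view, so the following is only a sketch of how the three URL-related fields in a task file could combine into one request URL. The function name `build_resource_url` and the concatenation order are assumptions, not the operator's confirmed behavior.

```python
def build_resource_url(root_url: str, endpoint_id: str, file_format: str) -> str:
    """Combine task fields into a single URL (assumed composition)."""
    # Normalise the root so exactly one slash separates it from the dataset id.
    return f"{root_url.rstrip('/')}/{endpoint_id}{file_format}"


# Values copied from the 'breakdowns' task file above.
url = build_resource_url(
    root_url="https://data.transportation.gov/resource/",
    endpoint_id="amkt-4ehs",
    file_format=".json",
)
print(url)
```

Under that assumption, each task file differs only in `product` and `endpoint_id`, which is why the remaining task files follow the same seven-line pattern.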
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'breakdowns_by_agency'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: 'fk8n-qvag'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'capital_expenses_by_capital_use'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: 'fphd-jyyj'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'capital_expenses_by_mode'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: '2667-vitc'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'capital_expenses_for_existing_service'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: '7kqv-yqbn'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'capital_expenses_for_expansion_of_service'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: 'nvgd-g6pj'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'employees_by_agency'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: 'brbd-9azc'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'employees_by_mode'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: 'wsxw-2rpq'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'employees_by_mode_and_employee_type'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: 'uyv8-9jek'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'fuel_and_energy'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: '8ehq-7his'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'fuel_and_energy_by_agency'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: 'wwem-ata9'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'funding_sources_by_expense_type'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: '4tmr-gwuu'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'funding_sources_directly_generated'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: 'yuaq-zdvc'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'funding_sources_federal'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: 'qpjk-b3zw'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'funding_sources_local'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: '8tvb-ywj3'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'funding_sources_state'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: 'dd43-h6wv'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'funding_sources_taxes_levied_by_agency'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: 'c8k8-y2cj'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'maintenance_facilities'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: '9yj4-fiiz'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'maintenance_facilities_by_agency'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: 's68b-wvgx'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'metrics'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: 'ekg5-frzt'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'operating_expenses_by_function'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: 'dkxx-zjd6'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'operating_expenses_by_function_and_agency'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: 'i5ki-dc58'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'operating_expenses_by_type'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: 'j5uj-anzx'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'operating_expenses_by_type_and_agency'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: 'i4ua-cjx4'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'service_by_agency'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: '6y83-7vuw'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'service_by_mode'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: '4fir-qbim'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'service_by_mode_and_time_period'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: 'wwdp-t4re'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'stations_and_facilities_by_agency_and_facility_type'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: 'aqct-knjk'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'stations_by_mode_and_age'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: 'wfz2-eft6'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'track_and_roadway_by_agency'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: 'pvgq-a73e'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'track_and_roadway_by_mode'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: 'fzbb-f6kc'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'track_and_roadway_guideway_age_distribution'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: 'j9q7-53ae'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'vehicles_age_distribution'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: '6abt-uhgq'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: '2022'
product: 'vehicles_type_count_by_agency'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: 'nimp-626k'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: 'historical'
product: 'fra_regulated_mode_major_security_events'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: '65fa-qbkf'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: 'historical'
product: 'major_safety_events'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: '9ivb-8ae9'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: 'historical'
product: 'monthly_modal_time_series_safety_and_service'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: '65fa-qbkf'
file_format: '.json'
@@ -0,0 +1,7 @@
operator: operators.NtdDataProductAPIOperator

year: 'historical'
product: 'nonmajor_safety_and_security_events'
root_url: 'https://data.transportation.gov/resource/'
endpoint_id: '9ivb-8ae9'
file_format: '.json'
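The four 'historical' task files above reuse two endpoint_id values across different products. The source does not say whether that is intentional, so the following is only a sketch of a check that surfaces such sharing for review. The helper name `find_shared_endpoints` is hypothetical.

```python
from collections import defaultdict

# (product, endpoint_id) pairs copied from the four 'historical' task files.
HISTORICAL_TASKS = [
    ("fra_regulated_mode_major_security_events", "65fa-qbkf"),
    ("major_safety_events", "9ivb-8ae9"),
    ("monthly_modal_time_series_safety_and_service", "65fa-qbkf"),
    ("nonmajor_safety_and_security_events", "9ivb-8ae9"),
]


def find_shared_endpoints(tasks):
    """Return {endpoint_id: [products]} for ids used by more than one product."""
    by_endpoint = defaultdict(list)
    for product, endpoint_id in tasks:
        by_endpoint[endpoint_id].append(product)
    # Keep only the ids that appear under two or more products.
    return {eid: names for eid, names in by_endpoint.items() if len(names) > 1}


shared = find_shared_endpoints(HISTORICAL_TASKS)
for endpoint_id, products in shared.items():
    print(f"{endpoint_id} -> {', '.join(products)}")
```

If the sharing is unintended, two differently named products would be written from the same upstream dataset, so a check along these lines may be worth running over all task files.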
19 changes: 19 additions & 0 deletions airflow/dags/sync_ntd_data_xlsx/METADATA.yml
@@ -0,0 +1,19 @@
description: "Scrape tables from DOT Ridership XLSX file daily"
schedule_interval: "0 10 * * *" # 10am GMT every day
tags:
  - all_gusty_features
default_args:
  owner: airflow
  depends_on_past: False
  catchup: False
  start_date: "2024-09-15"
  email:
    - "hello@calitp.org"
  email_on_failure: True
  email_on_retry: False
  retries: 1
  retry_delay: !timedelta 'minutes: 2'
  concurrency: 50
  #sla: !timedelta 'hours: 2'
wait_for_defaults:
  timeout: 3600
@@ -0,0 +1,5 @@
operator: operators.NtdDataProductXLSXOperator

product: 'complete_monthly_ridership_with_adjustments_and_estimates'
xlsx_file_url: 'https://www.transit.dot.gov/sites/fta.dot.gov/files/2024-08/June%202024%20Complete%20Monthly%20Ridership%20%28with%20adjustments%20and%20estimates%29_240801.xlsx'
year: 'historical'
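The xlsx_file_url above is percent-encoded, which hides the fact that it names one specific monthly release. As a small sketch using only the standard library, the file name can be decoded like this. The helper name `decoded_filename` is hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit

# URL copied verbatim from the task file above.
XLSX_FILE_URL = (
    "https://www.transit.dot.gov/sites/fta.dot.gov/files/2024-08/"
    "June%202024%20Complete%20Monthly%20Ridership%20%28with%20adjustments"
    "%20and%20estimates%29_240801.xlsx"
)


def decoded_filename(url: str) -> str:
    """Return the human-readable file name at the end of a URL path."""
    # Take only the path component, undo percent-encoding, keep the last segment.
    path = unquote(urlsplit(url).path)
    return PurePosixPath(path).name


print(decoded_filename(XLSX_FILE_URL))
```

The decoded name contains "June 2024" and the suffix "_240801", so the daily schedule would presumably fetch this same release until the URL in the task file is updated.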
2 changes: 2 additions & 0 deletions airflow/plugins/operators/__init__.py
@@ -7,3 +7,5 @@
from operators.littlepay_raw_sync import LittlepayRawSync
from operators.littlepay_to_jsonl import LittlepayToJSONL
from operators.pod_operator import PodOperator
from operators.scrape_ntd_api import NtdDataProductAPIOperator
from operators.scrape_ntd_xlsx import NtdDataProductXLSXOperator