0037 spider stl board of public service #96
"https://www.stlouis-mo.gov/events/"
"past-meetings.cfm?span=-30&department=332"
The department number for the Board of Public Service is `209`, so change that part of the URL to `&department=209`.
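As a minimal sketch of that change, the query string could be built from named values rather than hard-coded. The helper name `past_meetings_url` and the constant names are hypothetical; only the base URL, the `span=-30` parameter, and department `209` come from the diff and the comment above.

```python
from urllib.parse import urlencode

# Base URL taken from the diff hunk above.
BASE_URL = "https://www.stlouis-mo.gov/events/past-meetings.cfm"
# Department number for the Board of Public Service, per the review comment.
BOARD_OF_PUBLIC_SERVICE_DEPARTMENT = 209


def past_meetings_url(span=-30, department=BOARD_OF_PUBLIC_SERVICE_DEPARTMENT):
    """Build the past-meetings URL for a given day span and department (hypothetical helper)."""
    query = urlencode({"span": span, "department": department})
    return f"{BASE_URL}?{query}"
```

Keeping the department number in one named constant makes it harder to carry over a value from another spider by accident.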
custom_settings = {"ROBOTSTXT_OBEY": False}
start_urls = [
    (
        "https://www.stlouis-mo.gov/government/departments/public-service/index.cfm"
The `start_urls` link should point to the page where the meeting materials are posted. I believe the page we want is https://www.stlouis-mo.gov/government/departments/public-service/documents/meeting-materials.cfm.
event_sponsors = response.css("ul.list-group li span.small::text").getall()
urls = []
for url, sponsor in zip(event_urls, event_sponsors):
    if "aldermen" in sponsor.lower() or "aldermanic" in sponsor.lower():
Here, you should change the condition to something like `if "public service" in sponsor.lower()`. The current code only scrapes events for the Board of Aldermen.
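A minimal sketch of that filter, pulled out into a standalone function so it can be checked without a live response. The function name `filter_event_urls` and the `keyword` parameter are hypothetical; the `zip` over URLs and sponsors mirrors the loop in the diff.

```python
def filter_event_urls(event_urls, event_sponsors, keyword="public service"):
    """Keep only URLs whose sponsor text contains the keyword (case-insensitive)."""
    return [
        url
        for url, sponsor in zip(event_urls, event_sponsors)
        if keyword in sponsor.lower()
    ]
```

For example, given sponsors `["Board of Aldermen", "Board of Public Service"]`, only the second URL is kept.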
def _parse_title(self, response):
    """Parse or generate meeting title."""
    title = response.css("div.page-title-row h1::text").get()
    title = title.replace("Meeting", "").replace("Metting", "")
    title = title.replace("-", "- ")
    title = title.replace("(Canceled)", "Cancelled")
    return title.replace("  ", " ").strip()
It looks like the Board of Public Service's meeting titles are either `Board of Public Service` or `Special Board of Public Service Meeting`, so `_parse_title` can simply return one of those two strings.
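The reviewer's suggested snippet is not preserved in this page, so the following is only a sketch of one way to do it, assuming special meetings can be recognized by the word "special" in the page heading. The function name `parse_title` and its plain-string argument are hypothetical; in the spider the string would come from the `h1` selector shown in the diff.

```python
def parse_title(raw_title):
    """Return one of the two known Board of Public Service titles.

    Assumption: a special meeting has "special" somewhere in the page heading.
    """
    if raw_title and "special" in raw_title.lower():
        return "Special Board of Public Service Meeting"
    return "Board of Public Service"
```

Handling a missing heading (`None`) explicitly avoids the `AttributeError` that the original chained `replace` calls would raise when the selector finds nothing.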
def _parse_description(self, response):
    """Parse or generate meeting description."""
    description = response.css(
        "div#EventDisplayBlock div.col-md-8 h4 strong::text"
    ).getall()
    i = 0
    while i < len(description) - 1:
        if "following:" in description[i]:
            return description[i + 1].replace("\xa0", "")
        elif "will" in description[i]:
            return description[i].replace("\xa0", "")
        else:
            i += 1
    else:
        return ""
You can get rid of `_parse_description` and pass `description=""` in `_parse_event`.
def _parse_classification(self, response):
    """Parse or generate classification from allowed options."""
    title = response.css("div.page-title-row h1::text").get()
    if "committee" in title.lower():
        return COMMITTEE
    elif "board" in title.lower():
        return BOARD
    else:
        return NOT_CLASSIFIED
You can get rid of this and pass `classification=BOARD` in `_parse_event`.
def _parse_location(self, response):
    """Parse or generate location."""
    location = response.css("div.col-md-4 div.content-block p *::text").getall()
    temp = []
    for item in location:
        item = item.replace("\n", "")
        if item != "":
            temp.append(item)
    location = temp
    i, location_index, sponsor_index = 0, 0, 0
    while i < len(location):
        if "location" in location[i].lower():
            location_index = i
        if "sponsor" in location[i].lower():
            sponsor_index = i
            break
        i += 1

    if location_index + 1 < len(location) and sponsor_index < len(location):
        name = location[location_index + 1]
        address = []
        for j in range(location_index + 2, sponsor_index):
            address.append(location[j])
        address = (
            " ".join(address).replace("Directions to this address", "").strip()
        )
    else:
        name = ""
        address = ""

    return {
        "address": address,
        "name": name,
    }
When the meeting is virtual (Zoom), we want the name to be `"Zoom"` and the address to be `""`.
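One way to apply that rule is a small post-processing step on the parsed name and address. This is a sketch under the assumption that a virtual meeting can be detected by the words "zoom" or "virtual" in the location text; the function name `normalize_location` and the `extra_text` parameter are hypothetical.

```python
def normalize_location(name, address, extra_text=""):
    """Return the location dict, collapsing virtual meetings to name "Zoom".

    Assumption: "zoom" or "virtual" in any of the inputs marks a virtual meeting.
    """
    haystack = " ".join([name, address, extra_text]).lower()
    if "zoom" in haystack or "virtual" in haystack:
        return {"address": "", "name": "Zoom"}
    return {"address": address, "name": name}
```

In the spider, `_parse_location` could call this just before returning, passing the `name` and `address` it already computes.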
pattern_mmddyy = r"(?P<date>(\d{1,2}-\d{1,2}-\d{2}))"
pattern_mmddyyyy = r"(?P<date>(\d{1,2}-\d{1,2}-\d{4}))"
pattern_monthddyyyy = r"(?P<date>([A-Z]* \d{1,2}, \d{4}))"

rm_mmddyy = re.search(pattern_mmddyy, description)
rm_mmddyyyy = re.search(pattern_mmddyyyy, description)
rm_monthddyyyy = re.search(pattern_monthddyyyy, description)

dt = None
if rm_mmddyy is not None:
    date = rm_mmddyy.group("date")
    dt = datetime.strptime(date, "%m-%d-%y")
if rm_mmddyyyy is not None:
    date = rm_mmddyyyy.group("date")
    dt = datetime.strptime(date, "%m-%d-%Y")
if rm_monthddyyyy is not None:
    date = rm_monthddyyyy.group("date")
    dt = datetime.strptime(date, "%b %d, %Y")
None of these regex patterns match the way the date is formatted for the Public Service Board meeting materials.
pattern = r"(?P<date>[A-Z][a-z]* \d{1,2})"
rm = re.search(pattern, description)
if rm is not None:
    date = rm.group("date")
    dt = datetime.strptime(date, "%B %d")
else:
    dt = None
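One thing to watch with the suggestion above: `strptime` with `"%B %d"` has no year to work with, so the resulting `datetime` defaults to the year 1900. A minimal sketch of a variant that takes the year as an argument follows. The function name `parse_meeting_date` and the `year` parameter are hypothetical, and where the year would come from (for example a heading on the meeting-materials page) is an assumption not covered in this review.

```python
import re
from datetime import datetime

# Same shape as the suggested pattern: a capitalized word followed by a day number.
DATE_PATTERN = re.compile(r"(?P<date>[A-Z][a-z]+ \d{1,2})")


def parse_meeting_date(description, year):
    """Return the first "Month D" date found in description, using the given year.

    Candidates that are not real dates (e.g. "Room 20") are skipped.
    Returns None when no valid date is found.
    """
    for match in DATE_PATTERN.finditer(description):
        try:
            return datetime.strptime(f"{match.group('date')} {year}", "%B %d %Y")
        except ValueError:
            continue  # capitalized word was not a month name, or the day was invalid
    return None
```

Iterating over all matches rather than taking only the first keeps a stray capitalized word followed by a number from hiding a real date later in the string.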
Summary
Issue: #37