Add ability to scrape user pages #12
base: master
Conversation
This is a nice new addition, thank you! Left a detailed review, some are nitpicks here and there, please bear with me. 😅
@@ -1,3 +1,4 @@
umm... why?
:param short_circuit:
    Whether or not to short_circuit total_count loop

Yields url, captions, hashtags, and mentions for provided insta url
- caption*
- Move this to the top, in the docstring.
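Something like this is what I mean (a sketch only; the signature and exact parameter list here are my approximation from the diff fragments, not the real code):

```python
def _single_tag_processing(tag, total_count, existing_links, start, short_circuit=True):
    """
    Yields url, captions, hashtags, and mentions for the provided insta url.

    :param existing:
        URLs to skip
    :param short_circuit:
        Whether or not to short_circuit the total_count loop
    """
```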
:param existing:
    URLs to skip
:param short_circuit:
    Whether or not to short_circuit total_count loop
dedent lines 26-33 by 4 spaces
:param total_count:
    Total number of images to be scraped.
:param existing:
    URLs to skip
:param mode
add a colon after `mode`
    List of users to be scraped
:param total_count:
    total number of images to be scraped
:param should_continue
add a colon after `should_continue`
existing_links.add(row[1])
start = i + 1
_single_tag_processing(tag, total_count, existing_links, start)
print(f'[{target}] downloaded {url} as {file_index}.jpg in data/{target}')
This becomes incorrect, since we are downloading as `f'{count}.jpg'`, which is one less than `file_index`. Replace `count` with `file_index`, the better variable name.
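Roughly this shape (`links`, `start`, and `target` are my assumptions from the surrounding diff; the point is just that the file name and the log line use the same variable):

```python
import requests

for file_index, url in enumerate(links, start=start):
    req = requests.get(url)
    with open(f'data/{target}/{file_index}.jpg', 'wb') as img:
        img.write(req.content)
    # the log line now matches the actual filename
    print(f'[{target}] downloaded {url} as {file_index}.jpg in data/{target}')
```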
try:
    req = requests.get(url)
    with open(f'data/{tag}/{count}.jpg', 'wb') as img:
    with open(f'data/{target}/{count}.jpg', 'wb') as img:
We want the users to be able to distinguish between `user` photos and `tag` photos, since if I scrape `@instagram`, I might mistake it for images scraped from the `instagram` tag. So, mode-specific data directories. :)
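A minimal sketch of the layout I have in mind (the `os.makedirs` call is my addition; `mode`, `target`, `file_index`, and `req` come from the surrounding code):

```python
import os

# keep user scrapes and tag scrapes apart,
# e.g. data/users/instagram vs data/tags/instagram
out_dir = os.path.join('data', mode, target)
os.makedirs(out_dir, exist_ok=True)
with open(os.path.join(out_dir, f'{file_index}.jpg'), 'wb') as img:
    img.write(req.content)
```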
print(f'[{target}] downloaded {url} as {file_index}.jpg in data/{target}')

targets = {'tags': tags, 'users': users}
for mode,lists in targets.items():
space after ,
Scrapes user and hashtag images from Instagram
"""
def _single_input_processing(target: str, total_count: int, existing_links: set, start: int, mode: str='tag'):
Rename this; it is no longer single-input processing.
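e.g. something like this (the name is only a suggestion):

```python
def _single_target_processing(target: str, total_count: int, existing_links: set,
                              start: int, mode: str = 'tag'):
    ...
```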
for i, row in enumerate(reader):
    existing_links.add(row[1])
    start = i + 1
_single_input_processing(target, total_count, existing_links, start, mode=mode)
Account for the rename here too.
Refactored the code so that you can specify both tags and users to be scraped. Also fixed some off by one errors and added more function documentation.
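In short, the dispatch now looks roughly like this (a simplified sketch based on the diff fragments above, using the helper name suggested in review; the exact signature may differ):

```python
targets = {'tags': tags, 'users': users}
for mode, target_list in targets.items():
    for target in target_list:
        # existing_links and start are loaded per target from the CSV
        # of previously scraped URLs (see the diff above)
        _single_target_processing(target, total_count, existing_links, start, mode=mode)
```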