This repository was archived by the owner on Jan 11, 2022. It is now read-only.

Add ability to scrape user pages #12

Open
wants to merge 6 commits into master

Conversation

xraymemory commented:

Refactored the code so that you can specify both tags and users to be scraped. Also fixed some off-by-one errors and added more function documentation.
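A minimal sketch of the routing idea, based on the targets dict visible in the diff below (scrape_target and the sample values are hypothetical stand-ins for illustration only):

```python
def scrape_target(target: str, mode: str) -> None:
    """Hypothetical stand-in for the real download routine."""
    print(f'[{mode}] scraping {target}')

tags = ['nature', 'food']     # sample hashtags
users = ['instagram']         # sample user pages

# route both input types through one loop, as the diff below does;
# `mode` tells the scraper whether `target` is a tag or a username
targets = {'tags': tags, 'users': users}
for mode, target_list in targets.items():
    for target in target_list:
        scrape_target(target, mode=mode)
```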

meetmangukiya (Owner) left a comment:
This is a nice new addition, thank you! I left a detailed review; some are nitpicks here and there, so please bear with me. 😅

@@ -1,3 +1,4 @@

meetmangukiya (Owner):

umm... why?

:param short_circuit:
Whether or not to short_circuit total_count loop

Yields url, captions, hashtags, and mentions for provided insta url
meetmangukiya (Owner):

  1. caption*
  2. Move this to the top, in the docstring.
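A sketch of how the reworked docstring might read once both points are applied (the function name and full signature are assumptions; only the lines quoted in the diff are certain):

```python
def image_details(url: str, existing: set, short_circuit: bool = False):
    """
    Yields url, caption, hashtags, and mentions for the provided insta url.

    :param existing:
        URLs to skip
    :param short_circuit:
        Whether or not to short_circuit the total_count loop
    """
```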

:param existing:
URLs to skip
:param short_circuit:
Whether or not to short_circuit total_count loop
meetmangukiya (Owner):

dedent lines 26-33 by 4 spaces

:param total_count:
Total number of images to be scraped.
:param existing:
URLs to skip
:param mode
meetmangukiya (Owner):

add a colon after mode

List of users to be scraped
:param total_count:
total number of images to be scraped
:param should_continue
meetmangukiya (Owner):

add a colon after should_continue
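For reference, both fields with the trailing colon in place (descriptions elided; Sphinx-style :param lines need the colon to render):

```python
"""
:param mode:
    ...
:param should_continue:
    ...
"""
```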

existing_links.add(row[1])
start = i + 1
_single_tag_processing(tag, total_count, existing_links, start)
print(f'[{target}] downloaded {url} as {file_index}.jpg in data/{target}')
meetmangukiya (Owner):

This becomes incorrect, since we are downloading as f'{count}.jpg', which is one less than file_index. Replace count with file_index; it's the better variable name anyway.
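A fragment in the shape of the diff above, assuming file_index is the counter that should name the saved file:

```python
req = requests.get(url)
# one counter names the file and appears in the log line, so they can't drift
with open(f'data/{target}/{file_index}.jpg', 'wb') as img:
    img.write(req.content)
print(f'[{target}] downloaded {url} as {file_index}.jpg in data/{target}')
```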


try:
req = requests.get(url)
with open(f'data/{tag}/{count}.jpg', 'wb') as img:
with open(f'data/{target}/{count}.jpg', 'wb') as img:
meetmangukiya (Owner):

We want users to be able to distinguish between user photos and tag photos: if I scrape @instagram, I might mistake the result for images scraped from the instagram tag. So, mode-specific data directories. :)
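A minimal sketch of the mode-specific layout (the directory scheme is a suggestion, not the final diff):

```python
import os

# e.g. data/users/instagram/0.jpg vs. data/tags/instagram/0.jpg
out_dir = f'data/{mode}/{target}'
os.makedirs(out_dir, exist_ok=True)
with open(f'{out_dir}/{file_index}.jpg', 'wb') as img:
    img.write(req.content)
```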

print(f'[{target}] downloaded {url} as {file_index}.jpg in data/{target}')

targets = {'tags': tags, 'users': users}
for mode,lists in targets.items():
meetmangukiya (Owner):

space after ,


Scrapes user and hashtag images from Instagram
"""
def _single_input_processing(target: str, total_count: int, existing_links: set, start: int, mode: str='tag'):
meetmangukiya (Owner):

Rename this; it's no longer single-input processing.
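For instance (the new name _single_target_processing is only a suggestion):

```python
def _single_target_processing(target: str, total_count: int,
                              existing_links: set, start: int,
                              mode: str = 'tag'):
    ...
```

The call site quoted in the next comment would change the same way.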

for i, row in enumerate(reader):
existing_links.add(row[1])
start = i + 1
_single_input_processing(target, total_count, existing_links, start, mode=mode)
meetmangukiya (Owner):

Account for the rename here too.
