Reddit #997

Open
wants to merge 49 commits into
base: master
Changes from 29 commits
Commits
6cba1ee
Draft reddit
jpontoire Dec 5, 2024
831537e
Fix reddit posts
jpontoire Dec 5, 2024
8fb9cf0
Updating reddit posts
jpontoire Dec 6, 2024
a88ac13
Adding -t, --text to reddit posts
jpontoire Dec 6, 2024
2735a0a
Fix tests
jpontoire Dec 6, 2024
9434b19
fix tests
jpontoire Dec 6, 2024
1c93157
fix tests
jpontoire Dec 6, 2024
a49a818
Merge branch 'master' into reddit
jpontoire Dec 6, 2024
ef116eb
First version of reddit comments
jpontoire Dec 9, 2024
3ab4b42
Update reddit comments
jpontoire Dec 9, 2024
bc901cb
Optimization with yield
jpontoire Dec 20, 2024
e40d672
Adding user_posts function
jpontoire Dec 20, 2024
f53c451
Fix user_posts
jpontoire Dec 20, 2024
b932a8d
Fixing errors with user_posts
jpontoire Dec 20, 2024
26be5f3
Fixing format
jpontoire Dec 20, 2024
e3a96af
Refacto
jpontoire Dec 20, 2024
2fb4cd2
better refacto
jpontoire Dec 20, 2024
bc9ff73
Adding reddit user_comments
jpontoire Dec 20, 2024
47c1ae5
adding scraped values for points and comments
jpontoire Jan 7, 2025
0632112
Handle broken and banned pages
jpontoire Jan 8, 2025
d770363
Better handling for scores
jpontoire Jan 8, 2025
240f1f2
Draft of edited_date
jpontoire Jan 8, 2025
abff314
Fixing error when no pagination and edited_date
jpontoire Jan 8, 2025
b045fb7
Fixing data in user_comments
jpontoire Jan 8, 2025
65ac1bf
refacto and use of posts with the name of the subreddit
jpontoire Jan 8, 2025
30932a2
Fixing typo
jpontoire Jan 9, 2025
2e2abfc
Fixing typo
jpontoire Jan 9, 2025
39700b0
Fixing error in get_new_url
jpontoire Jan 9, 2025
2c5e078
Merge branch 'master' into reddit
Yomguithereal Jan 9, 2025
4d74228
changes doc and kebab-case
jpontoire Jan 9, 2025
6e32569
removing print and sleep
jpontoire Jan 9, 2025
a49918b
Avoid stack overflow error
jpontoire Jan 9, 2025
3189558
refacto
jpontoire Jan 9, 2025
fa3bc28
changing -n, --number to -l, --limit and fixing errors with comments
jpontoire Jan 9, 2025
2f51fb6
Fixing gh-tests error
jpontoire Jan 9, 2025
5a42b15
Fixing comments and handling detection
jpontoire Jan 10, 2025
7b9bb8c
adding use of spoof-ua
jpontoire Jan 10, 2025
bf8aee9
Fixing tests
jpontoire Jan 10, 2025
c28763c
Fixing error with deleted accounts
jpontoire Jan 10, 2025
622fc24
Compiling the regex outside the function
jpontoire Jan 10, 2025
2fdfb61
refacto
jpontoire Jan 10, 2025
1e311fd
Fixing error with number of posts retrieved
jpontoire Jan 13, 2025
b414d8a
Fixing bug with old posts
jpontoire Jan 14, 2025
392f20b
fixing error with "?..." in url
jpontoire Jan 14, 2025
9d4218f
Fixing test error
jpontoire Jan 14, 2025
8468cb8
refacto
jpontoire Jan 14, 2025
8dc345c
refacto
jpontoire Jan 15, 2025
de29187
fix bug with add_slash
jpontoire Jan 15, 2025
25e2e60
refacto
jpontoire Jan 15, 2025
1 change: 1 addition & 0 deletions .gitignore
@@ -25,6 +25,7 @@ ftest/*.csv
*.sqlar
*-wal
*-shm
*.csv
Yomguithereal marked this conversation as resolved.

/crawl
/downloaded
2 changes: 2 additions & 0 deletions minet/cli/commands.py
@@ -14,6 +14,7 @@
from minet.cli.hyphe import HYPHE_COMMAND
from minet.cli.instagram import INSTAGRAM_COMMAND
from minet.cli.mediacloud import MEDIACLOUD_COMMAND
from minet.cli.reddit import REDDIT_COMMAND
from minet.cli.telegram import TELEGRAM_COMMAND
from minet.cli.tiktok import TIKTOK_COMMAND
from minet.cli.twitter import TWITTER_COMMAND
@@ -42,6 +43,7 @@
HYPHE_COMMAND,
INSTAGRAM_COMMAND,
MEDIACLOUD_COMMAND,
REDDIT_COMMAND,
TELEGRAM_COMMAND,
TIKTOK_COMMAND,
TWITTER_COMMAND,
145 changes: 145 additions & 0 deletions minet/cli/reddit/__init__.py
@@ -0,0 +1,145 @@
# =============================================================================
# Minet Reddit CLI Action
# =============================================================================
#
# Logic of the `rd` action.
#

from minet.cli.argparse import command

REDDIT_POSTS_SUBCOMMAND = command(
"posts",
"minet.cli.reddit.posts",
title="Minet Reddit Posts Command",
description="""
Retrieve reddit posts from a subreddit link or name.
""",
epilog="""
Example:

. Searching posts from the subreddit r/france:
$ minet reddit posts https://www.reddit.com/r/france > r_france_posts.csv
$ minet reddit posts france > r_france_posts.csv
$ minet reddit posts r/france > r_france_posts.csv
""",
variadic_input={
"dummy_column": "subreddit",
"item_label": "subreddit url, subreddit shortcode or subreddit id",
Member

"subreddit url, shortcode or id" would probably read a bit shorter in the help.

"item_label_plural": "subreddit urls, subreddit shortcodes or subreddits ids",
},
arguments=[
{
"flags": ["-n", "--number"],
Member

This argument is usually called -l, --limit in minet. And I think we generally only add it when it is not trivial to do with a xan slice.

Author

Here it is not the maximum to retrieve; it is the number of results to retrieve. Would you prefer that I change this and use a limit instead?

"help": "Number of posts to retrieve.",
"type": int,
},
{
"flags": ["-t", "--text"],
"help": "Retrieve the text of the post. Note that it will require one request per post.",
"action": "store_true",
},
],
)

REDDIT_COMMENTS_SUBCOMMAND = command(
"comments",
"minet.cli.reddit.comments",
title="Minet Reddit Comments Command",
description="""
Retrieve comments from a reddit post link.
Note that it will only retrieve the comments displayed on the page. If you want all the comments you need to use -A, --all but it will require a request per comment, and you can only make 100 requests per 10 minutes.
""",
epilog="""
Example:

. Searching comments from a reddit post:
$ minet reddit comments https://www.reddit.com/r/france/comments/... > r_france_comments.csv
""",
variadic_input={
"dummy_column": "post",
"item_label": "post url, post shortcode or post id",
Member

Same here.

"item_label_plural": "posts urls, posts shortcodes or posts ids",
},
arguments=[
{
"flags": ["-A", "--all"],
"help": "Retrieve all comments.",
"action": "store_true",
},
],
)

REDDIT_USER_POSTS_SUBCOMMAND = command(
"user_posts",
Member

Command names are in kebab case, not snake case :)

The example will need to change too.

"minet.cli.reddit.user_posts",
title="Minet Reddit User Posts Command",
description="""
Retrieve reddit posts from a user link.
""",
epilog="""
Example:

. Searching posts from the user page of u/random_user:
$ minet reddit user_posts https://www.reddit.com/user/random_user/submitted/ > random_user_posts.csv
""",
variadic_input={
"dummy_column": "user",
"item_label": "user url, user shortcode or user id",
"item_label_plural": "user urls, user shortcodes or user ids",
},
arguments=[
{
"flags": ["-n", "--number"],
"help": "Number of posts to retrieve.",
"type": int,
},
{
"flags": ["-t", "--text"],
"help": "Retrieve the text of the post. Note that it will require one request per post.",
"action": "store_true",
},
],
)

REDDIT_USER_COMMENTS_SUBCOMMAND = command(
"user_comments",
Member

Same here.

"minet.cli.reddit.user_comments",
title="Minet Reddit User Comments Command",
description="""
Retrieve reddit comments from a user link.
""",
epilog="""
Example:

. Searching comments from the user page of u/random_user:
$ minet reddit user_comments https://www.reddit.com/user/random_user/comments/ > random_user_comments.csv
""",
variadic_input={
"dummy_column": "user",
"item_label": "user url, user shortcode or user id",
"item_label_plural": "user urls, user shortcodes or user ids",
},
arguments=[
{
"flags": ["-n", "--number"],
"help": "Number of comments to retrieve.",
"type": int,
},
],
)

REDDIT_COMMAND = command(
"reddit",
"minet.cli.reddit",
"Minet Reddit Command",
aliases=["rd"],
description="""
Collect data from Reddit.
""",
subcommands=[
REDDIT_POSTS_SUBCOMMAND,
REDDIT_COMMENTS_SUBCOMMAND,
REDDIT_USER_POSTS_SUBCOMMAND,
REDDIT_USER_COMMENTS_SUBCOMMAND,
],
)
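Note on the review comments above: the two requests (kebab-case subcommand names, and -l, --limit instead of -n, --number) can be sketched outside of minet with the standard library. This is only an illustration using `argparse`; minet's own `command` helper is not reproduced here, and `build_parser` and the program name are hypothetical.

```python
import argparse


def build_parser():
    # Hypothetical stand-in for minet's `command` declarations, using plain argparse.
    parser = argparse.ArgumentParser(prog="reddit-sketch")
    subparsers = parser.add_subparsers(dest="subcommand", required=True)

    # Kebab-case subcommand name, as requested in the review.
    user_posts = subparsers.add_parser("user-posts")
    user_posts.add_argument("user")
    # -l/--limit replaces -n/--number; None means no limit was given.
    user_posts.add_argument("-l", "--limit", type=int, default=None)
    user_posts.add_argument("-t", "--text", action="store_true")

    user_comments = subparsers.add_parser("user-comments")
    user_comments.add_argument("user")
    user_comments.add_argument("-l", "--limit", type=int, default=None)

    return parser


args = build_parser().parse_args(["user-posts", "random_user", "-l", "10", "-t"])
```

Note that argparse still exposes the parsed options with underscores where needed (`args.limit`, `args.text`), so only the command name typed by the user changes.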
41 changes: 41 additions & 0 deletions minet/cli/reddit/comments.py
@@ -0,0 +1,41 @@
# =============================================================================
# Minet Reddit Comments CLI Action
# =============================================================================
#
# Logic of the `rd comments` action.
#
from minet.cli.utils import with_enricher_and_loading_bar
from minet.reddit.scraper import RedditScraper
from minet.reddit.types import RedditComment
from minet.reddit.exceptions import RedditInvalidTargetError


@with_enricher_and_loading_bar(
headers=RedditComment,
title="Scraping comments",
unit="pages",
nested=True,
sub_unit="comments",
)
def action(cli_args, enricher, loading_bar):
scraper = RedditScraper()

for i, row, url in enricher.enumerate_cells(
cli_args.column, with_rows=True, start=1
):
with loading_bar.step(url):
try:
if cli_args.all:
comments = scraper.get_comments(url, True)
Member

Or comments = scraper.get_comments(url, cli_args.all) instead?

else:
comments = scraper.get_comments(url, False)

except RedditInvalidTargetError:
loading_bar.print(
"the script could not complete normally on line %i" % (i)
)
continue

for comment in comments:
loading_bar.nested_advance()
enricher.writerow(row, comment)
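Note on the review comment in this file: since `-A, --all` is declared with `store_true`, the parsed value is already a boolean, so the if/else can be dropped and the flag forwarded as-is. A minimal sketch, assuming `get_comments` takes the URL and a boolean as in the diff; `StubScraper` and `fetch_comments` are hypothetical names used only so the example runs without network access.

```python
class StubScraper:
    # Hypothetical stand-in for RedditScraper: it only records what it was given.
    def get_comments(self, url, all_comments):
        return [{"url": url, "all": all_comments}]


def fetch_comments(scraper, url, all_flag):
    # The store_true flag is already True or False, so it is passed through directly.
    return scraper.get_comments(url, all_flag)


with_all = fetch_comments(StubScraper(), "https://www.reddit.com/r/france/comments/...", True)
without_all = fetch_comments(StubScraper(), "https://www.reddit.com/r/france/comments/...", False)
```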
52 changes: 52 additions & 0 deletions minet/cli/reddit/posts.py
@@ -0,0 +1,52 @@
# =============================================================================
# Minet Reddit Posts CLI Action
# =============================================================================
#
# Logic of the `rd posts` action.
#
from minet.cli.utils import with_enricher_and_loading_bar
from minet.reddit.scraper import RedditScraper
from minet.reddit.types import RedditPost
from minet.reddit.exceptions import RedditInvalidTargetError


@with_enricher_and_loading_bar(
headers=RedditPost,
title="Scraping posts",
unit="pages",
nested=True,
sub_unit="posts",
)
def action(cli_args, enricher, loading_bar):
scraper = RedditScraper()

type_page = "subreddit"

for i, row, url in enricher.enumerate_cells(
cli_args.column, with_rows=True, start=1
):
with loading_bar.step(url):
try:
if cli_args.number:
if cli_args.text:
Member

Same remark.

posts = scraper.get_general_post(
url, type_page, True, cli_args.number
)
else:
posts = scraper.get_general_post(
url, type_page, False, cli_args.number
)
else:
if cli_args.text:
posts = scraper.get_general_post(url, type_page, True)
else:
posts = scraper.get_general_post(url, type_page, False)
except RedditInvalidTargetError:
loading_bar.print(
"the script could not complete normally on line %i" % (i)
)
continue

for post in posts:
loading_bar.nested_advance()
enricher.writerow(row, post)
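Note on the "same remark" comment in this file: the four branches differ only in the boolean and in whether the count is forwarded, so they can be collapsed into one call. A sketch under assumptions: the real signature of `get_general_post` is not shown in this diff, so the stub's parameter names and its default value are hypothetical, as is the helper `call_get_general_post`.

```python
class StubScraper:
    # Hypothetical stand-in for RedditScraper; parameter names and default are assumptions.
    def get_general_post(self, url, type_page, add_text, nb="scraper default"):
        return (url, type_page, add_text, nb)


def call_get_general_post(scraper, url, type_page, text, number=None):
    args = [url, type_page, bool(text)]
    # Forward the count only when it was given, so the scraper's own default
    # applies otherwise (same truthiness test as `if cli_args.number:`).
    if number:
        args.append(number)
    return scraper.get_general_post(*args)


limited = call_get_general_post(StubScraper(), "france", "subreddit", True, 10)
unlimited = call_get_general_post(StubScraper(), "france", "subreddit", False)
```

The same helper would cover the identical four-branch block in user_posts.py, with `type_page` set to "user".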
41 changes: 41 additions & 0 deletions minet/cli/reddit/user_comments.py
@@ -0,0 +1,41 @@
# =============================================================================
# Minet Reddit Comments CLI Action
# =============================================================================
#
# Logic of the `rd user_comments` action.
#
from minet.cli.utils import with_enricher_and_loading_bar
from minet.reddit.scraper import RedditScraper
from minet.reddit.types import RedditUserComment
from minet.reddit.exceptions import RedditInvalidTargetError


@with_enricher_and_loading_bar(
headers=RedditUserComment,
title="Scraping user comments",
unit="pages",
nested=True,
sub_unit="comments",
)
def action(cli_args, enricher, loading_bar):
scraper = RedditScraper()

for i, row, url in enricher.enumerate_cells(
cli_args.column, with_rows=True, start=1
):
with loading_bar.step(url):
try:
if cli_args.number:
posts = scraper.get_user_comments(url, cli_args.number)
else:
posts = scraper.get_user_comments(url)

except RedditInvalidTargetError:
loading_bar.print(
"the script could not complete normally on line %i" % (i)
)
continue

for post in posts:
loading_bar.nested_advance()
enricher.writerow(row, post)
52 changes: 52 additions & 0 deletions minet/cli/reddit/user_posts.py
@@ -0,0 +1,52 @@
# =============================================================================
# Minet Reddit Posts CLI Action
# =============================================================================
#
# Logic of the `rd user_posts` action.
#
from minet.cli.utils import with_enricher_and_loading_bar
from minet.reddit.scraper import RedditScraper
from minet.reddit.types import RedditUserPost
from minet.reddit.exceptions import RedditInvalidTargetError


@with_enricher_and_loading_bar(
headers=RedditUserPost,
title="Scraping user posts",
unit="pages",
nested=True,
sub_unit="posts",
)
def action(cli_args, enricher, loading_bar):
scraper = RedditScraper()

type_page = "user"

for i, row, url in enricher.enumerate_cells(
cli_args.column, with_rows=True, start=1
):
with loading_bar.step(url):
try:
if cli_args.number:
if cli_args.text:
posts = scraper.get_general_post(
url, type_page, True, cli_args.number
)
else:
posts = scraper.get_general_post(
url, type_page, False, cli_args.number
)
else:
if cli_args.text:
posts = scraper.get_general_post(url, type_page, True)
else:
posts = scraper.get_general_post(url, type_page, False)
except RedditInvalidTargetError:
loading_bar.print(
"the script could not complete normally on line %i" % (i)
)
continue

for post in posts:
loading_bar.nested_advance()
enricher.writerow(row, post)
17 changes: 17 additions & 0 deletions minet/reddit/exceptions.py
@@ -0,0 +1,17 @@
# =============================================================================
# Minet Reddit Exceptions
# =============================================================================
#
from minet.exceptions import MinetError


class RedditError(MinetError):
pass


class RedditInvalidTargetError(RedditError):
pass


class RedditNotPostError(RedditError):
pass
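Note on this hierarchy: the CLI actions in this diff catch only `RedditInvalidTargetError`, so a `RedditNotPostError` raised by the scraper would not be handled by that clause. The sketch below illustrates how `except` clauses match against the hierarchy. It assumes `MinetError` derives from `Exception` (its definition is not part of this diff), and `classify` is a hypothetical helper.

```python
class MinetError(Exception):
    # Stand-in for minet.exceptions.MinetError (assumed to derive from Exception).
    pass


class RedditError(MinetError):
    pass


class RedditInvalidTargetError(RedditError):
    pass


class RedditNotPostError(RedditError):
    pass


def classify(exc):
    # except clauses are tried in order, so the most specific class comes first.
    try:
        raise exc
    except RedditInvalidTargetError:
        return "invalid target"
    except RedditError:
        return "other reddit error"
    except MinetError:
        return "other minet error"


results = [
    classify(RedditInvalidTargetError()),
    classify(RedditNotPostError()),
    classify(MinetError()),
]
```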