Reddit-WebScraping

A Python project that uses PRAW to scrape Reddit's r/datascience subreddit, collecting and cleaning data from its top posts and comments of 2024.

Overview

Reddit, a social news and discussion platform, is a vast online community with millions of users. The platform is known for its diverse content, from politics to poultry farming and gaming to green hydrogen. This project focuses on one subreddit in particular, r/datascience, and shows how to collect and explore data from it.

Why Reddit?

Reddit's upvote/downvote system lets users collectively determine the most popular content. Analyzing top Reddit posts on data science can provide valuable insight into what practitioners and soon-to-be data scientists are concerned about. In addition, users are largely anonymous and therefore more likely to share their opinions without reservation. Lastly, Reddit data can be accessed programmatically through a Python library that makes it easy to interact with the Reddit API.

Methodology

To fetch posts from the data science subreddit, we use the Python Reddit API Wrapper (PRAW). Key features:

It can be used to access popular posts, search for content and explore different categories of Reddit data
It simplifies making API requests to Reddit
It offers methods to filter and sort data based on various criteria, such as time range, score, and more
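
The fetching step could look roughly like the sketch below. It is not the notebook's actual code: the function name `fetch_top_posts`, its parameters and the record keys are assumptions made for illustration. It expects an authenticated `praw.Reddit` client and relies on PRAW's `subreddit(...).top(...)` listing and the `comments` forest of each submission. Note that `time_filter="year"` covers the past twelve months rather than calendar year 2024, so a separate year check is still needed afterwards.

```python
from itertools import islice


def fetch_top_posts(reddit, subreddit_name="datascience", limit=100, n_comments=3):
    """Collect top posts of the past year with their first few top comments.

    `reddit` is assumed to be an authenticated praw.Reddit instance.
    Returns a list of plain dicts (hypothetical record layout).
    """
    records = []
    listing = reddit.subreddit(subreddit_name).top(time_filter="year", limit=limit)
    for submission in listing:
        # The sort order must be set before the comments are first accessed.
        submission.comment_sort = "top"
        # Drop "load more comments" placeholders so only real comments remain.
        submission.comments.replace_more(limit=0)
        top_comments = [c.body for c in islice(submission.comments, n_comments)]
        records.append(
            {
                "title": submission.title,
                "body": submission.selftext,
                "score": submission.score,
                "created_utc": submission.created_utc,
                "top_comments": top_comments,
            }
        )
    return records
```

The client itself would be created with `praw.Reddit(client_id=..., client_secret=..., user_agent=...)` using your own API credentials, and then passed in as `fetch_top_posts(reddit)`.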

Criteria for filtering out the posts

Year - Only posts from 2024 are kept, since they are likely to showcase content that currently resonates with the data science subreddit community
Importance of the posts - Only posts with a score above 50 are considered. Those with a lower score may not be valuable for our purpose
No. of words in the title - Posts with fewer than 5 words in the title are excluded, since these are usually jokes and memes
No. of posts - The data set needs to be large enough to explore the concerns of the subreddit community. A limit of 100 posts is adequate to achieve this
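
The four criteria above can be sketched as a small filter. This is an illustrative version only: the names `keep_post` and `filter_posts` and the dict keys `title`, `score` and `created_utc` are assumptions, not taken from the notebook. The year is read from the post's Unix timestamp in UTC.

```python
from datetime import datetime, timezone

TARGET_YEAR = 2024      # only posts created in this calendar year
MIN_SCORE = 50          # keep posts scoring strictly more than this
MIN_TITLE_WORDS = 5     # shorter titles are treated as likely jokes or memes
MAX_POSTS = 100         # upper bound on the size of the data set


def keep_post(post):
    """Return True if a post record passes the year, score and title checks."""
    created = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc)
    return (
        created.year == TARGET_YEAR
        and post["score"] > MIN_SCORE
        and len(post["title"].split()) >= MIN_TITLE_WORDS
    )


def filter_posts(posts, max_posts=MAX_POSTS):
    """Apply keep_post to every record and cap the result at max_posts."""
    kept = [post for post in posts if keep_post(post)]
    return kept[:max_posts]
```

Keeping the thresholds as named constants makes it easy to experiment with a stricter score cutoff or a different year without touching the filtering logic.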

Once the data is collected, it is converted into a dataframe and exported to a more readable format such as a CSV file. The file contains the relevant details of each post: the title, the body, the top 3 comments and the time of the post in a readable format.
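
A minimal sketch of this export step with pandas is shown below. The record keys (`created_utc`, `top_comments`), the output column names and the function names are assumptions for illustration, as is the file name in the usage note. The function expects a non-empty list of records.

```python
import pandas as pd


def posts_to_frame(records, n_comments=3):
    """Build a dataframe with a readable timestamp and one column per top comment."""
    df = pd.DataFrame(records)
    # Convert Unix epoch seconds into a readable UTC timestamp string.
    df["created"] = pd.to_datetime(df["created_utc"], unit="s", utc=True).dt.strftime(
        "%Y-%m-%d %H:%M:%S"
    )
    # Spread the list of top comments over fixed columns, padding with "".
    for i in range(n_comments):
        df[f"comment_{i + 1}"] = df["top_comments"].apply(
            lambda comments, i=i: comments[i] if i < len(comments) else ""
        )
    return df.drop(columns=["created_utc", "top_comments"])


def export_csv(df, path):
    """Write the dataframe to a CSV file without the index column."""
    df.to_csv(path, index=False)
```

For example, `export_csv(posts_to_frame(records), "top_posts_2024.csv")` would write one row per post, where the file name is a placeholder.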

Inference from the data

The posts and comments that we collected highlight several key points:

Data Science Jobs in Sports - Discussions about the use of data in baseball, basketball, cricket and soccer
Skills required to be successful in the field - Comments on what kind of math, coding and interpersonal skills could lead to a successful career
Vagueness of job descriptions - Remarks on which jobs to target for a rewarding career, which may contradict how those jobs are titled
Amount of math required to be successful - Is Data Science only meant for statisticians?

Conclusion and Takeaways

Web scraping of Reddit - We familiarized ourselves with PRAW through its official documentation before gathering the data
Data cleaning - Once the data was gathered, we filtered it according to our custom criteria and exported the resulting dataframe to a CSV file
Current landscape - The top posts reflect job openings in unexpected fields, the concerns of soon-to-be data scientists and key skills that could lead to a rewarding career
Insights - Analyzing the top posts provides valuable insight into user preferences, which could help us make better decisions

Mandar-1007/Reddit-WebScraping
