This repo walks you through an end-to-end data pipeline using the Reddit API, MongoDB, the mongo shell, and Jupyter notebooks.
You will need the following to reproduce the work:
- IU Jetstream account OR a local mongo shell installation
- Python 3.8 or greater
- Reddit API Account
- Files: mongo.py, credentials.py, and requirements.txt
To utilize mongo.py, you will need to set up a Reddit API account. Go to this link here and do the following:
- Give your bot a name
- Select script for the app type
- Write a brief description
- Click create app
You will then have your client ID, client secret, and user agent ready for use. The last step is to copy and paste your credentials into credentials.py as strings; this must be done manually.
client_id = "your client id"
client_secret = "your client secret"
user_agent = "your user agent"
Without this, you will be unable to utilize the API functionality in mongo.py, and data will not be retrieved.
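If you want to sanity-check your credentials before running the full script, the snippet below shows how they are typically passed to the Reddit API via PRAW. This is a minimal sketch assuming mongo.py is PRAW-based; the names here are illustrative, not copied from the repo.
# Minimal sketch: authenticate to the Reddit API with PRAW using the
# values from credentials.py (assumes mongo.py uses PRAW).
import praw
from credentials import client_id, client_secret, user_agent

reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    user_agent=user_agent,
)

# Read-only access is enough for pulling public subreddit posts;
# this prints True for a script app without a username/password.
print(reddit.read_only)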
Reddit data will be ingested via the API and stored in a MongoDB database. To run mongo commands, you will either need to use a VM with the mongo shell pre-installed, or install the mongo shell on your desktop. For the VM, I used IU Jetstream, but any VM will work. To install the mongo shell on your personal device, follow the steps below:
- Download mongo shell here
- Add mongo shell binary to path (to be done in the command line)
touch ~/.bash_profile
open ~/.bash_profile
export PATH="<mongo-directory>/bin:$PATH" # type this into bash_profile
# save and exit file
source ~/.bash_profile
echo $PATH
To confirm, enter the "mongo" command in your command line. The following will appear if your installation and sourcing worked:
MongoDB shell version v4.4.5
connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
The arguments for running mongo.py are shown below:
$ ./mongo.py --help
usage: mongo.py [-h] [--p P] r
This script takes in a subreddit handle (required) and destination path folder
(optional). This outputs a mongodb document.
positional arguments:
  r           subreddit

optional arguments:
  -h, --help  show this help message and exit
  --p P       destination path
The following fields are included in the output (a sample document is sketched after this list):
- Post ID
- Number of Comments
- Post Created (Timestamp)
- Post Score
- Post Upvote Ratio
- Post Ups
- Post Downs
- Post Title
- Post Body
- Replies
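As a rough illustration, a single document in the output might be shaped like the Python dict below. The created_at, score, upvote_ratio, ups, downs, body, and replies keys match the mongoexport field lists later in this guide; the keys used for the post ID, comment count, and title are assumptions.
# Illustrative document shape only; see mongo.py for the actual keys.
sample_doc = {
    "id": "t3_abc123",               # post ID (key name assumed)
    "num_comments": 12,              # number of comments (key name assumed)
    "created_at": 1618077435.0,      # post created (Unix timestamp)
    "score": 42,
    "upvote_ratio": 0.97,
    "ups": 42,
    "downs": 0,
    "title": "Example post title",   # key name assumed
    "body": "Example post body text",
    "replies": ["First reply text", "Second reply text"],
}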
For this demonstration, the subreddit cgrpMigraine was utilized.
Run script without destination path. Without the destination path specified, the JSON file will be written to your current working directory.
./mongo.py cgrpMigraine
Run script with destination path. With the destination path specified, the JSON file will be written to the path you entered.
./mongo.py cgrpMigraine /users/Desktop
Successful completion of the script will result in the following message:
MongoDB Doc Located at C:\Users\evdelph\Desktop\I535\ProjectB\cgrpMigraine.json
A MongoDB database called Reddit has been created on MongoDB Atlas. You will have the ability to create, delete, and edit existing collections in the database. To access the database, run the following commands:
mongo "mongodb+srv://cluster0.c7slc.mongodb.net/Reddit" --username user --password ProjectB!
use Reddit
It is assumed that there will be one collection for every subreddit. To create a collection, run the following command:
db.createCollection('yoursubreddit')
# Successful message
MongoDB Enterprise atlas-8fho22-shard-0:PRIMARY> db.createCollection('cgrpMigraine')
{
"ok" : 1,
"$clusterTime" : {
"clusterTime" : Timestamp(1618077435, 1),
"signature" : {
"hash" : BinData(0,"v1tomRMDk3jnGNKkzLqGHccpiX4="),
"keyId" : NumberLong("6937724079830990851")
}
},
"operationTime" : Timestamp(1618077435, 1)
}
Run the following commands to load your file as a collection.
# load cgrpMigraine.json data
var file = cat('cgrpMigraine.json')
var parsed_file = JSON.parse(file)
db.cgrpMigraine.insert(parsed_file)
# if successful
BulkWriteResult({
"writeErrors" : [ ],
"writeConcernErrors" : [ ],
"nInserted" : 719,
"nUpserted" : 0,
"nMatched" : 0,
"nModified" : 0,
"nRemoved" : 0,
"upserted" : [ ]
})
# Verify the entries loaded into the database
db.cgrpMigraine.count()
>> 719
MongoDB allows you to query based on keywords. In this case, I was interested in finding out how many posts that mentioned the drug "aimovig" received a score greater than 15. The query is shown below:
db.cgrpMigraine.find({body: /.*imovig.*/, score: {$gt: 15}}).count()
>> 20
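The same query can also be run from Python with pymongo rather than the mongo shell. This is a sketch using the same connection details as above, not code from the repo:
# Sketch: the same keyword query via pymongo instead of the mongo shell.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://user:ProjectB!@cluster0.c7slc.mongodb.net/Reddit")
db = client.Reddit

# Count posts whose body mentions "imovig" and whose score exceeds 15.
n = db.cgrpMigraine.count_documents(
    {"body": {"$regex": ".*imovig.*"}, "score": {"$gt": 15}}
)
print(n)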
To export results, follow the format below:
# Basic structure
mongoexport --uri="mongodb+srv://cluster0.c7slc.mongodb.net/Reddit" --username user --password ProjectB! \
--collection=cgrpMigraine --query='' --fields field1,field2 --type csv --out yourfile.csv
# Output posts based on query
mongoexport --uri="mongodb+srv://cluster0.c7slc.mongodb.net/Reddit" --username user --password ProjectB! \
--collection=cgrpMigraine --query='{"body":{"$regex":".*imovig.*","$options":""},"score":{"$gt":15}}' \
--fields body,score --type csv --out posts.csv
# Output replies based on query
mongoexport --uri="mongodb+srv://cluster0.c7slc.mongodb.net/Reddit" --username user --password ProjectB! \
--collection=cgrpMigraine --query='{"body":{"$regex":".*imovig.*","$options":""},"score":{"$gt":15}}' \
--fields replies,score --type csv --out replies.csv
# Output all posts containing aimovig
mongoexport --uri="mongodb+srv://cluster0.c7slc.mongodb.net/Reddit" --username user --password ProjectB! \
--collection=cgrpMigraine --query='{"body":{"$regex":".*imovig.*","$options":""},"score":{"$gt":-1}}' \
--fields created_at,ups,upvote_ratio,downs,body --type csv --out timestamps.csv
The files posts.csv and replies.csv are provided in the repo. I provided another file called exploded_timestamps.csv, which I created with PySpark by exploding the posts into a dataframe of individual words. It is recommended to use a VM for this step.
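The exact PySpark code is not reproduced here, but the word-explosion step is along these lines. This is a sketch that assumes the column names from the timestamps.csv export above:
# Sketch: explode post bodies into one row per (timestamp, word) pair.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, lower, split

spark = SparkSession.builder.appName("RedditWords").getOrCreate()

df = spark.read.csv("timestamps.csv", header=True)

# Lowercase each body, split it on whitespace, and explode the
# resulting array so every word gets its own row.
exploded = df.select(
    col("created_at"),
    explode(split(lower(col("body")), r"\s+")).alias("word"),
)

# Note: Spark writes a directory of part files, not a single CSV.
exploded.write.csv("exploded_timestamps", header=True)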
The RedditNLP.ipynb notebook contains step-by-step instructions for analyzing the exported Reddit data. In this example, I looked at frequent unigrams and bigrams across the three CGRP autoinjector medications: Aimovig, Ajovy, and Emgality. I also conducted LDA topic clustering.
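For a flavor of the n-gram step, frequent unigrams and bigrams can be pulled from posts.csv with scikit-learn's CountVectorizer. This is a sketch of the general approach; the notebook contains the full pipeline:
# Sketch: top unigrams and bigrams from the exported posts.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

posts = pd.read_csv("posts.csv")

# ngram_range=(1, 2) counts both unigrams and bigrams.
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
counts = vectorizer.fit_transform(posts["body"].dropna())

# Sum counts over all posts and print the 20 most frequent n-grams.
totals = counts.sum(axis=0).A1
top = sorted(zip(vectorizer.get_feature_names_out(), totals),
             key=lambda pair: pair[1], reverse=True)[:20]
print(top)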
The RedditTimestamps.ipynb notebook shows word usage over time. Visualizations include a Treemap, Heatmap, and Scatterplot.