This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Mcshiloh patch 1 #89

Open · wants to merge 20 commits into base `master`
1,225 changes: 1,225 additions & 0 deletions Exploratory Data Analysis(EDA)-Classfication(lemmatization).ipynb

183 changes: 35 additions & 148 deletions README.md
# Streamlit-based Web Application
#### EXPLORE Data Science Academy Classification Predict
# Twitter Climate Sentiment Classification Project
## Project Description
In this project, we built a classification model to predict climate-related sentiment in tweets. The aim is to help companies determine how people perceive climate change based on their tweets. This information can assist companies in understanding how their product/service may be received in the context of climate sentiment.

## 1) Overview
We explored several supervised learning models, including Logistic Regression, Support Vector Machine, Naive Bayes, and Random Forest classifiers, to identify the best performer. GridSearchCV was used to select the best hyperparameters for our final model.
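As a rough illustration of this selection process (the toy tweets, labels, and the small parameter grid below are invented for the sketch; the project's real pipeline lives in the notebook), scikit-learn's `GridSearchCV` can tune a text-classification pipeline via cross-validation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in for the tweet dataset: 1 = pro-climate, 0 = anti/neutral.
tweets = [
    "climate change is real and urgent",
    "global warming is a hoax",
    "we must cut carbon emissions now",
    "the climate has always changed naturally",
    "renewable energy will save the planet",
    "climate alarmism is overblown",
] * 3  # repeat so every CV fold contains both classes
labels = [1, 0, 1, 0, 1, 0] * 3

# TF-IDF features feeding a linear classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Small hyperparameter grid; the real project would search a wider space.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(tweets, labels)
print(search.best_params_)
```

Swapping `LogisticRegression` for `SVC`, `MultinomialNB`, or `RandomForestClassifier` (with its own `param_grid`) lets the same scaffold compare all four candidates.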

![Streamlit](resources/imgs/streamlit.png)

This repository forms the basis of *Task 2* for the **Classification Predict** within EDSA's Data Science course. It hosts template code which will enable students to deploy a basic [Streamlit](https://www.streamlit.io/) web application.

As part of the predict, students are expected to expand on this base template; increasing the number of available models, user data exploration capabilities, and general Streamlit functionality.

#### 1.1) What is Streamlit?

[![What is an API](resources/imgs/what-is-streamlit.png)](https://youtu.be/R2nr1uZ8ffc?list=PLgkF0qak9G49QlteBtxUIPapT8TzfPuB8)

If you've ever had the misfortune of having to deploy a model as an API (as was required in the Regression Sprint), you'd know that getting even basic functionality working can be a tricky ordeal. Extending this framework even further to act as a web server with dynamic visuals, multiple responsive pages, and robust deployment of your models... can be a nightmare. That's where Streamlit comes along to save the day! :star:

In its own words:
> Streamlit ... is the easiest way for data scientists and machine learning engineers to create beautiful, performant apps in only a few hours! All in pure Python. All for free.

> It’s a simple and powerful app model that lets you build rich UIs incredibly quickly.

Streamlit takes away much of the background work needed to get a platform which can deploy your models to clients and end users, meaning that you get to focus on the important stuff (related to the data) and can largely ignore the rest. This will allow you to become a lot more productive.

##### Description of files

For this repository, we are only concerned with a single file:

| File Name | Description |
| :--------------------- | :-------------------- |
| `base_app.py` | Streamlit application definition. |

## 2) Usage Instructions

#### 2.1) Creating a copy of this repo

| :zap: WARNING :zap: |
| :-------------------- |
| Do **NOT** *clone* this repository. Instead follow the instructions in this section to *fork* the repo. |

As described within the Predict instructions for the Classification Sprint, this code represents a *template* from which to extend your own work. As such, in order to modify the template, you will need to **[fork](https://help.github.com/en/github/getting-started-with-github/fork-a-repo)** this repository. Failing to do this will lead to complications when trying to work on the web application remotely.

![Fork Repo](resources/imgs/fork-repo.png)

To fork the repo, simply ensure that you are logged into your GitHub account, and then click on the 'fork' button at the top of this page as indicated within the figure above.

#### 2.2) Running the Streamlit web app on your local machine

As a first step to becoming familiar with our web app's functioning, we recommend setting up a running instance on your own local machine.

To do this, follow the steps below by running the given commands within a Git bash (Windows), or terminal (Mac/Linux):

1. Ensure that you have the prerequisite Python libraries installed on your local machine:

```bash
pip install -U streamlit numpy pandas scikit-learn
```

2. Clone the *forked* repo to your local machine.

```bash
git clone https://github.com/{your-account-name}/classification-predict-streamlit-template.git
```

3. Navigate to the base of the cloned repo, and start the Streamlit app.

```bash
cd classification-predict-streamlit-template/
streamlit run base_app.py
```

If the web server was able to initialise successfully, the following message should be displayed within your bash/terminal session:

```
You can now view your Streamlit app in your browser.

Local URL: http://localhost:8501
Network URL: http://192.168.43.41:8501
```

You should also be automatically directed to the base page of your web app. This should look something like:

![Streamlit base page](resources/imgs/streamlit-base-splash-screen.png)

## Getting Started Guide
Follow these steps to get started with the project:

### Step 1: Install Python
Ensure that you have a recent version of Python installed, preferably Python 3.10.11. Note that Python itself cannot be installed with `pip`; if you haven't already installed it, download it from the official Python website. You can verify which version is on your path by running:

```bash
python --version
```
### Step 2: Download Necessary Corpora and Model
To aid with stopword removal and tokenization, you need to download the required corpora and model.
Open a Python environment and execute the following commands:

Congratulations! You've now officially deployed your first web application!

While we leave the modification of your web app up to you, the latter process of cloud deployment is outlined within the next section.

#### 2.4) Running Streamlit on a remote AWS EC2 instance


The following steps will enable you to run your web app on a remote EC2 instance, allowing it to be accessed by any device/application which has internet access.

Within these setup steps, we will be using a remote EC2 instance, which we will refer to as the ***Host***, in addition to our local machine, which we will call the ***Client***. We use these designations for convenience, and to align our terminology with that of common web server practices. In cases where commands are provided, use Git bash (Windows) or Terminal (Mac/Linux) to enter these.

1. Ensure that you have access to a running AWS EC2 instance with an assigned public IP address.

**[On the Host]:**

2. Install the prerequisite python libraries:

```bash
pip install -U streamlit numpy pandas scikit-learn
```

3. Clone your copy of the API repo, and navigate to its root directory:

```bash
git clone https://github.com/{your-account-name}/classification-predict-streamlit-template.git
cd classification-predict-streamlit-template/
```

```python
import nltk
nltk.download(['punkt', 'stopwords'])
```

| :information_source: NOTE :information_source: |
| :-------------------- |
| In the following steps we make use of the `tmux` command. This programme has many powerful functions, but for our purposes, we use it to gracefully keep our web app running in the background - even when we end our `ssh` session. |
### Step 3: Install Dependencies

Install the project dependencies, including pandas, numpy, matplotlib, and scikit-learn, using the following command:

```bash
pip install -U matplotlib numpy pandas scikit-learn
```

4. Enter into a Tmux window within the current directory. To do this, simply type `tmux`.

5. Start the Streamlit web app on port `5000` of the Host:

```bash
streamlit run --server.port 5000 base_app.py
```

If this command ran successfully, output similar to the following should be observed on the Host:

```
You can now view your Streamlit app in your browser.

Network URL: http://172.31.47.109:5000
External URL: http://3.250.50.104:5000

```

Where the specific `Network` and `External` URLs correspond to those assigned to your own EC2 instance. Copy the value of the external URL.

**[On the Client]:**

6. Within your favourite web browser (we hope this isn't Internet Explorer 9), navigate to the external URL you just copied from the Host. This should correspond to the following form:

`http://{public-ip-address-of-remote-machine}:5000`

Where the above public IP address corresponds to the one given to your AWS EC2 instance.

If successful, you should see the landing page of your Streamlit web app:

![Streamlit base page](resources/imgs/streamlit-base-splash-screen.png)

**[On the Host]:**

7. To keep your web app running continuously in the background, detach from the Tmux window by pressing `ctrl + b` and then `d`. This should return you to the view of your terminal before you opened the Tmux window.

To go back to your Tmux window at any time (even if you've left your `ssh` session and then return), simply type `tmux attach-session`.

To see more functionality of the Tmux command, type `man tmux`.
## Usage
- Open your preferred Python environment or notebook.
- Import the necessary libraries.
- Load the data into the notebook, or import the `clean_train.csv` file directly to skip the cleaning process.
- Fit the data to the selected model. The model used for this project is the Support Vector Machine (SVM). You can experiment with different model types and tweak the parameters to suit your requirements.
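The steps above can be sketched as follows; the in-memory frame stands in for `clean_train.csv`, and the `message`/`sentiment` column names are assumptions rather than the dataset's documented schema:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Stand-in for pd.read_csv("clean_train.csv"); column names are assumed.
train = pd.DataFrame({
    "message": [
        "climate change threatens coastal cities",
        "carbon taxes hurt the economy for nothing",
        "support renewable energy research",
        "the warming scare is exaggerated",
    ] * 3,
    "sentiment": [1, -1, 1, -1] * 3,
})

# Vectorise the tweets and fit the SVM described above.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(train["message"], train["sentiment"])

# Predict the sentiment of new, unseen text.
preds = model.predict(["renewables are the future"])
print(preds)
```

Swapping `SVC` for any other scikit-learn classifier (or wrapping the pipeline in `GridSearchCV`) is all it takes to experiment with different models and parameters.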

Having run your web app within Tmux, you should now be free to end your `ssh` session while your web server carries on purring along. Well done :zap:!
## Project Structure
The project repository consists of the following folders/files:

- `train.csv`: Contains the raw tweets and sentiments used for training the model.
- `test_with_no_labels.csv`: Contains raw tweets without labels, which can be used as a testing dataset.
- `clean_train.csv`: Contains the cleaned training data. You can load this file directly to skip the cleaning process.
- `clean_test.csv`: Contains the cleaned test data. You can load this file directly to skip the cleaning process.

## 3) FAQ

This section of the repo will be periodically updated to represent common questions which may arise around its use. If you detect any problems/bugs, please [create an issue](https://help.github.com/en/github/managing-your-work-on-github/creating-an-issue) and we will do our best to resolve it as quickly as possible.
## Development
We also developed a web application using Streamlit for easy interaction with our model. You can find our app repository here: [Streamlit-App](https://github.com/TheZeitgeist-RR12/Streamlit-App.git).
Feel free to explore, experiment, and contribute to the project.

We wish you all the best in your learning experience :rocket:

![Explore Data Science Academy](resources/imgs/EDSA_logo.png)