Commit 8b24110

Training Multi-Class Logistic Regression model with only shared features. (#14)

* added feature space information

* updated data splits

* update with overlapping feature

* updated pipeline

* update download notebook

* Updated Feature selection

* fixed bug in feature selection

* update data split and reran

* re ran and update modeling module

* reran jump analysis and visualizations

* Update data/JUMP_data/nbconverted/download.py

Co-authored-by: Mike Lippincott <58147848+MikeLippincott@users.noreply.github.com>

* update documentation

* reran all

---------

Co-authored-by: Mike Lippincott <58147848+MikeLippincott@users.noreply.github.com>
axiomcura and MikeLippincott authored Apr 30, 2024
1 parent e38d6f5 commit 8b24110
Showing 55 changed files with 9,890 additions and 9,058 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -161,3 +161,4 @@ cython_debug/

# large dataset
data/JUMP_data/JUMP_all_plates_normalized_negcon.csv.gz
notebooks/3.jump-analysis/overlapp-issue-check.ipynb
Binary file not shown.
104 changes: 78 additions & 26 deletions data/JUMP_data/download.ipynb
@@ -1,15 +1,33 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Downloading JUMP Pilot Dataset\n",
"\n",
"This notebook focuses on downloading the JUMP-CellPainting dataset. The pilot dataset comprises aggregate profiles at the well level, spanning 51 plates. These profiles have been normalized using the negative controls within each plate. We downloaded all 51 negative-controlled normalized aggregate profiles and concatenating them into a single dataset file. The JUMP dataset profile will be saved in the `./data/JUMP_data` directory."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import pathlib\n",
"import requests\n",
"import pandas as pd"
"import sys\n",
"import json\n",
"import pandas as pd\n",
"\n",
"sys.path.append(\"../../\")\n",
"from src.utils import split_meta_and_features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading the plate map to get all the Plate ID's "
]
},
{
@@ -87,51 +105,85 @@
}
],
"source": [
"# read\n",
"# loading plate map\n",
"platemap_df = pd.read_csv(\"./barcode_platemap.csv\")\n",
"platemap_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we use the plate IDs to the URL in order to download the aggregated profiles. We use pandas to download and load each profile, and then concatenate them into a single dataframe. The merged dataframe serves as our main JUMP dataset."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# download normalized data\n",
"# download all normalized aggregated profiles\n",
"jump_df = []\n",
"for plate_id in platemap_df[\"Assay_Plate_Barcode\"]:\n",
" url = f\"https://cellpainting-gallery.s3.amazonaws.com/cpg0000-jump-pilot/source_4/workspace/profiles/2020_11_04_CPJUMP1/{plate_id}/{plate_id}_normalized_negcon.csv.gz\"\n",
" df = pd.read_csv(url)\n",
" jump_df.append(df)\n",
"\n",
" # request data\n",
" with requests.get(url) as response:\n",
" response.raise_for_status()\n",
" save_path = pathlib.Path(f\"./{plate_id}_normalized_negcon.csv.gz\").resolve()\n",
"# concat all downloaded concatenate all aggregate profiles\n",
"jump_df = pd.concat(jump_df)\n",
"\n",
" # save content\n",
" with open(save_path, mode=\"wb\") as f:\n",
" for chunk in response.iter_content(chunk_size=8192):\n",
" f.write(chunk)"
"# save concatenated df into ./data/JUMP_data folders\n",
"jump_df.to_csv(\n",
" \"JUMP_all_plates_normalized_negcon.csv.gz\", index=False, compression=\"gzip\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, we obtain information about the feature space by splitting both the meta and feature column names and storing them in a dictionary. This dictionary holds information about the feature space and will be utilized for downstream analysis when identifying shared features across different datasets, such as the Cell-injury dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 4,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NUmber of plates 51\n",
"Number of meta features 13\n",
"Number of features 5792\n"
]
}
],
"source": [
"# after downloading all dataset, concat into a single dataframe\n",
"data_files = list(pathlib.Path.cwd().glob(\"*.csv.gz\"))\n",
"# saving feature space\n",
"jump_meta, jump_feat = split_meta_and_features(jump_df, metadata_tag=True)\n",
"\n",
"# create main df by concatenating all file\n",
"main_df = pd.concat([pd.read_csv(file) for file in data_files])\n",
"# saving info of feature space\n",
"jump_feature_space = {\n",
" \"name\": \"JUMP\",\n",
" \"n_plates\": len(jump_df[\"Metadata_Plate\"].unique()),\n",
" \"n_meta_features\": len(jump_meta),\n",
" \"n_features\": len(jump_feat),\n",
" \"meta_features\": jump_meta,\n",
" \"features\": jump_feat,\n",
"}\n",
"\n",
"# remove single_dfs\n",
"[os.remove(file) for file in data_files]\n",
"# save json file\n",
"with open(\"jump_feature_space.json\", mode=\"w\") as f:\n",
" json.dump(jump_feature_space, f)\n",
"\n",
"# save concatenated df into ./data/JUMP_data folders\n",
"main_df.to_csv(\n",
" \"JUMP_all_plates_normalized_negcon.csv.gz\", index=False, compression=\"gzip\"\n",
")"
"# display\n",
"print(\"Shape of Merged dataset\", jump_df.shape)\n",
"print(\"NUmber of plates\", len(jump_df[\"Metadata_Plate\"].unique()))\n",
"print(\"Number of meta features\", len(jump_meta))\n",
"print(\"Number of features\", len(jump_feat))"
]
}
],
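The new download cell in the diff above can be read as a standalone script. The sketch below mirrors it under the assumptions shown in the diff: plate IDs come from `barcode_platemap.csv`, and each plate's profile lives at the Cell Painting Gallery URL pattern used in the notebook (`pd.read_csv` can stream a gzipped CSV directly over HTTPS). `download_all` is only defined here, not called, since fetching all 51 plates is slow.

```python
import pandas as pd

# URL pattern for the negative-control-normalized aggregate profiles,
# as used in the notebook diff above
URL_TEMPLATE = (
    "https://cellpainting-gallery.s3.amazonaws.com/cpg0000-jump-pilot/"
    "source_4/workspace/profiles/2020_11_04_CPJUMP1/"
    "{plate_id}/{plate_id}_normalized_negcon.csv.gz"
)


def plate_url(plate_id: str) -> str:
    """Build the download URL for one plate's profile."""
    return URL_TEMPLATE.format(plate_id=plate_id)


def download_all(platemap_csv: str = "./barcode_platemap.csv") -> pd.DataFrame:
    """Download every plate listed in the platemap and concatenate them."""
    platemap_df = pd.read_csv(platemap_csv)
    frames = [
        pd.read_csv(plate_url(plate_id))  # pandas reads the .csv.gz over HTTPS
        for plate_id in platemap_df["Assay_Plate_Barcode"]
    ]
    return pd.concat(frames, ignore_index=True)
```

Calling `download_all()` returns the merged dataframe, which the notebook then writes to `JUMP_all_plates_normalized_negcon.csv.gz` with `compression="gzip"`.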

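The `split_meta_and_features` helper the notebook imports from `src.utils` is not shown in this diff. Below is a plausible reimplementation, assuming the `Metadata_` column-name prefix used by CellProfiler-style profiles (the `metadata_tag=True` path in the notebook), together with the kind of feature-space record the notebook serializes to JSON. The toy column names are illustrative, not taken from the real dataset.

```python
import json

import pandas as pd


def split_meta_and_features(df: pd.DataFrame, metadata_tag: bool = False):
    """Split column names into metadata and feature lists.

    Hypothetical reimplementation of the src.utils helper: with
    metadata_tag=True, metadata columns are identified by the
    "Metadata_" prefix used in CellProfiler/pycytominer profiles.
    """
    if metadata_tag:
        meta = [col for col in df.columns if col.startswith("Metadata_")]
    else:
        # fallback: treat anything not in a compartment namespace as metadata
        meta = [
            col
            for col in df.columns
            if not col.startswith(("Cells_", "Cytoplasm_", "Nuclei_"))
        ]
    features = [col for col in df.columns if col not in meta]
    return meta, features


# toy profile: two metadata columns, two morphology features
toy_df = pd.DataFrame(
    [["plate1", "A01", 1.0, 2.0]],
    columns=[
        "Metadata_Plate",
        "Metadata_Well",
        "Cells_AreaShape_Area",
        "Nuclei_Intensity_MeanIntensity_DNA",
    ],
)
meta, feats = split_meta_and_features(toy_df, metadata_tag=True)

# record the feature space as in the notebook's jump_feature_space dict
feature_space = {
    "name": "JUMP",
    "n_plates": 1,
    "n_meta_features": len(meta),
    "n_features": len(feats),
    "meta_features": meta,
    "features": feats,
}
with open("toy_feature_space.json", mode="w") as f:
    json.dump(feature_space, f)
```

Storing the explicit `features` list (not just counts) is what lets downstream code intersect feature names across datasets such as Cell-injury.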