Commit 8b24110

Training Multi-Class Logistic Regression model with only shared features. (#14)

* added feature space information

* updated data splits

* update with overlapping feature

* updated pipeline

* update download notebook

* Updated Feature selection

* fixed bug in feature selection

* update data split and reran

* re ran and update modeling module

* reran jump analysis and visualizations

* Update data/JUMP_data/nbconverted/download.py

Co-authored-by: Mike Lippincott <58147848+MikeLippincott@users.noreply.github.com>

* update documentation

* reran all

---------

Co-authored-by: Mike Lippincott <58147848+MikeLippincott@users.noreply.github.com>
axiomcura and MikeLippincott authored Apr 30, 2024
1 parent e38d6f5 commit 8b24110
Showing 55 changed files with 9,890 additions and 9,058 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -161,3 +161,4 @@ cython_debug/

# large dataset
data/JUMP_data/JUMP_all_plates_normalized_negcon.csv.gz
notebooks/3.jump-analysis/overlapp-issue-check.ipynb
Binary file not shown.
104 changes: 78 additions & 26 deletions data/JUMP_data/download.ipynb
@@ -1,15 +1,33 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Downloading JUMP Pilot Dataset\n",
"\n",
"This notebook focuses on downloading the JUMP-CellPainting dataset. The pilot dataset comprises aggregate profiles at the well level, spanning 51 plates. These profiles have been normalized using the negative controls within each plate. We downloaded all 51 negative-controlled normalized aggregate profiles and concatenating them into a single dataset file. The JUMP dataset profile will be saved in the `./data/JUMP_data` directory."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import pathlib\n",
"import requests\n",
"import pandas as pd"
"import sys\n",
"import json\n",
"import pandas as pd\n",
"\n",
"sys.path.append(\"../../\")\n",
"from src.utils import split_meta_and_features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading the plate map to get all the Plate ID's "
]
},
{
@@ -87,51 +105,85 @@
}
],
"source": [
"# read\n",
"# loading plate map\n",
"platemap_df = pd.read_csv(\"./barcode_platemap.csv\")\n",
"platemap_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we use the plate IDs to the URL in order to download the aggregated profiles. We use pandas to download and load each profile, and then concatenate them into a single dataframe. The merged dataframe serves as our main JUMP dataset."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# download normalized data\n",
"# download all normalized aggregated profiles\n",
"jump_df = []\n",
"for plate_id in platemap_df[\"Assay_Plate_Barcode\"]:\n",
" url = f\"https://cellpainting-gallery.s3.amazonaws.com/cpg0000-jump-pilot/source_4/workspace/profiles/2020_11_04_CPJUMP1/{plate_id}/{plate_id}_normalized_negcon.csv.gz\"\n",
" df = pd.read_csv(url)\n",
" jump_df.append(df)\n",
"\n",
" # request data\n",
" with requests.get(url) as response:\n",
" response.raise_for_status()\n",
" save_path = pathlib.Path(f\"./{plate_id}_normalized_negcon.csv.gz\").resolve()\n",
"# concat all downloaded concatenate all aggregate profiles\n",
"jump_df = pd.concat(jump_df)\n",
"\n",
" # save content\n",
" with open(save_path, mode=\"wb\") as f:\n",
" for chunk in response.iter_content(chunk_size=8192):\n",
" f.write(chunk)"
"# save concatenated df into ./data/JUMP_data folders\n",
"jump_df.to_csv(\n",
" \"JUMP_all_plates_normalized_negcon.csv.gz\", index=False, compression=\"gzip\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, we obtain information about the feature space by splitting both the meta and feature column names and storing them in a dictionary. This dictionary holds information about the feature space and will be utilized for downstream analysis when identifying shared features across different datasets, such as the Cell-injury dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 4,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NUmber of plates 51\n",
"Number of meta features 13\n",
"Number of features 5792\n"
]
}
],
"source": [
"# after downloading all dataset, concat into a single dataframe\n",
"data_files = list(pathlib.Path.cwd().glob(\"*.csv.gz\"))\n",
"# saving feature space\n",
"jump_meta, jump_feat = split_meta_and_features(jump_df, metadata_tag=True)\n",
"\n",
"# create main df by concatenating all file\n",
"main_df = pd.concat([pd.read_csv(file) for file in data_files])\n",
"# saving info of feature space\n",
"jump_feature_space = {\n",
" \"name\": \"JUMP\",\n",
" \"n_plates\": len(jump_df[\"Metadata_Plate\"].unique()),\n",
" \"n_meta_features\": len(jump_meta),\n",
" \"n_features\": len(jump_feat),\n",
" \"meta_features\": jump_meta,\n",
" \"features\": jump_feat,\n",
"}\n",
"\n",
"# remove single_dfs\n",
"[os.remove(file) for file in data_files]\n",
"# save json file\n",
"with open(\"jump_feature_space.json\", mode=\"w\") as f:\n",
" json.dump(jump_feature_space, f)\n",
"\n",
"# save concatenated df into ./data/JUMP_data folders\n",
"main_df.to_csv(\n",
" \"JUMP_all_plates_normalized_negcon.csv.gz\", index=False, compression=\"gzip\"\n",
")"
"# display\n",
"print(\"Shape of Merged dataset\", jump_df.shape)\n",
"print(\"NUmber of plates\", len(jump_df[\"Metadata_Plate\"].unique()))\n",
"print(\"Number of meta features\", len(jump_meta))\n",
"print(\"Number of features\", len(jump_feat))"
]
}
],
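The new download cell in the diff above can be read as a standalone script. The sketch below mirrors it under the assumptions shown in the diff: plate IDs come from `barcode_platemap.csv`, and each plate's profile lives at the Cell Painting Gallery URL pattern used in the notebook (`pd.read_csv` can stream a gzipped CSV directly over HTTPS). `download_all` is only defined here, not called, since fetching all 51 plates is slow.

```python
import pandas as pd

# URL pattern for the negative-control-normalized aggregate profiles,
# as used in the notebook diff above
URL_TEMPLATE = (
    "https://cellpainting-gallery.s3.amazonaws.com/cpg0000-jump-pilot/"
    "source_4/workspace/profiles/2020_11_04_CPJUMP1/"
    "{plate_id}/{plate_id}_normalized_negcon.csv.gz"
)


def plate_url(plate_id: str) -> str:
    """Build the download URL for one plate's profile."""
    return URL_TEMPLATE.format(plate_id=plate_id)


def download_all(platemap_csv: str = "./barcode_platemap.csv") -> pd.DataFrame:
    """Download every plate listed in the platemap and concatenate them."""
    platemap_df = pd.read_csv(platemap_csv)
    frames = [
        pd.read_csv(plate_url(plate_id))  # pandas reads the .csv.gz over HTTPS
        for plate_id in platemap_df["Assay_Plate_Barcode"]
    ]
    return pd.concat(frames, ignore_index=True)
```

Calling `download_all()` returns the merged dataframe, which the notebook then writes to `JUMP_all_plates_normalized_negcon.csv.gz` with `compression="gzip"`.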

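The `split_meta_and_features` helper the notebook imports from `src.utils` is not shown in this diff. Below is a plausible reimplementation, assuming the `Metadata_` column-name prefix used by CellProfiler-style profiles (the `metadata_tag=True` path in the notebook), together with the kind of feature-space record the notebook serializes to JSON. The toy column names are illustrative, not taken from the real dataset.

```python
import json

import pandas as pd


def split_meta_and_features(df: pd.DataFrame, metadata_tag: bool = False):
    """Split column names into metadata and feature lists.

    Hypothetical reimplementation of the src.utils helper: with
    metadata_tag=True, metadata columns are identified by the
    "Metadata_" prefix used in CellProfiler/pycytominer profiles.
    """
    if metadata_tag:
        meta = [col for col in df.columns if col.startswith("Metadata_")]
    else:
        # fallback: treat anything not in a compartment namespace as metadata
        meta = [
            col
            for col in df.columns
            if not col.startswith(("Cells_", "Cytoplasm_", "Nuclei_"))
        ]
    features = [col for col in df.columns if col not in meta]
    return meta, features


# toy profile: two metadata columns, two morphology features
toy_df = pd.DataFrame(
    [["plate1", "A01", 1.0, 2.0]],
    columns=[
        "Metadata_Plate",
        "Metadata_Well",
        "Cells_AreaShape_Area",
        "Nuclei_Intensity_MeanIntensity_DNA",
    ],
)
meta, feats = split_meta_and_features(toy_df, metadata_tag=True)

# record the feature space as in the notebook's jump_feature_space dict
feature_space = {
    "name": "JUMP",
    "n_plates": 1,
    "n_meta_features": len(meta),
    "n_features": len(feats),
    "meta_features": meta,
    "features": feats,
}
with open("toy_feature_space.json", mode="w") as f:
    json.dump(feature_space, f)
```

Storing the explicit `features` list (not just counts) is what lets downstream code intersect feature names across datasets such as Cell-injury.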