Finalized figs (#15)
* update module 1.data_splits added data split summary info

* added suppl tables for data splits information

* updates in suppl. figures.

* reran module 1 and 2 and updated shuffling function

* reran module 3 notebook

* Update confusion matrix

* re-ran all cells and update figures and minor bug fixes

* removed files and unwanted comments

* re-ran module 0

* reran module 0

* reran module 1

* reran module 2

* reran module 3

* reran module 4

* updated panel A and D

* updated main panel

* updated final plot

* update figures added supplemental 4 figure

* update figure panel

* Fixed casing errors and formatting

* updated output name for stable 2

* added data files to git repo

* Update notebooks/3.jump-analysis/3.jump_analysis.py

Co-authored-by: Mike Lippincott <58147848+MikeLippincott@users.noreply.github.com>

* Update notebooks/4.visualization/Figure_2_panel.r

Co-authored-by: Mike Lippincott <58147848+MikeLippincott@users.noreply.github.com>

* added mike's changes

---------

Co-authored-by: Mike Lippincott <58147848+MikeLippincott@users.noreply.github.com>
axiomcura and MikeLippincott authored Jun 7, 2024
1 parent 63ebd6a commit e65ecb7
Showing 59 changed files with 4,616 additions and 3,142 deletions.
1 change: 1 addition & 0 deletions cell_injury.yaml
@@ -9,4 +9,5 @@ dependencies:
- pre-commit
- ipykernel
- requests
- dataframe_image
- conda-forge::pycytominer==1.1.0
16 changes: 16 additions & 0 deletions data/injury_summary_before_holdout.csv
@@ -0,0 +1,16 @@
injury_type,injury_code,n_wells,n_compounds,compound_list
Control,0,9855,1,['DMSO']
Cytoskeletal,1,1472,15,"['Nocodazole', 'Colchicine', 'Paclitaxel', 'Vinblastine', 'Ispinesib', 'ARQ 621', 'SB-743921', 'Epothilone B', 'Cytochalasin B', 'Monastrol', 'Cytochalasin D', 'Latrunculin B', 'Citrinin', 'Podophyllotoxin', 'Citreoviridin']"
Miscellaneous,5,1304,39,"['L-Buthionine-(S,R)-sulfoximine', 'CDDO Im', 'Cinobufagin', 'Puromycin', 'Brefeldin A', 'Tetrandrine', 'Pristimerin', 'Perifosine', 'Chloroquine', 'Niclosamide', 'Withaferin A', 'Bay 11-7821', 'Chelerythrine', 'UMI-77', 'MIM1', 'Chaetocin', 'Fisetin', 'Emetine', 'DZNep', 'BSI-201', 'Quercetin', 'Bruceine A', 'trans-Resveratrol', 'Aloisine RP106', 'Cycloheximide', 'Kenpaullone', 'Aurintricarboxylic acid', 'Gliotoxin', 'MitoPQ', 'Streptozotocin', 'cis-Resveratrol', 'Piceatannol', 'Tert-butylhydroquinone', 'Pyrogallol', 'RRx-001', 'Beauvericin', 'Ochratoxin A', 'Cantharidin', 'Cercosporin']"
Kinase,3,1104,13,"['Wortmannin', 'Staurosporine', 'PI-103', 'BEZ-235', 'AZD 1152-HQPA', 'Saracatinib', 'PKC 412', 'Lestaurtinib', 'Dasatinib', 'LY294002', 'Sorafenib', 'KW 2449', 'Sunitinib']"
Genotoxin,4,944,22,"['Camptothecin', 'CX-5461', 'Doxorubicin', 'Cladribine', 'Etoposide', 'Aphidicolin', 'Gemcitabine', 'Cisplatin', 'Oxaliplatin', 'Carboplatin', 'Dacarbazine', 'Lomustine', 'SN-38', 'Decitabine', 'Busulfan', 'Irinotecan', 'Chlorambucil', 'Thio-TEPA', 'Carmustine', 'Melphalan', 'Cyclophosphamide', 'β-Amanitin']"
Hsp90,2,552,3,"['Radicicol', 'Geldanamycin', '17-AAG']"
Redox,6,312,12,"['Menadione', 'PKF118-310', '4-Amino-1-naphthol (HCl)', 'Dunnione', 'MGR2', 'SIN-1 (chloride)', 'AAPH', 'MGR1', 'Phenazine (methosulfate)', 'DA-3003-2', 'IT-901', 'DMNQ']"
Saponin,10,288,11,"['Digitonin', 'Saikosaponin A', 'Polygalasaponin F', 'Bacopasaponin C', 'Pulsatilla Saponin D', 'Hederacoside C', 'Glycyrrhizic acid', 'Platycodin D', 'Onjisaponin B', 'Ginsenoside Ro', 'Protodioscin']"
HDAC,7,168,5,"['AR-42', 'SAHA', 'ITF 2357', 'Panobinostat', 'Apicidin']"
Mitochondria,11,144,4,"['Antimycin A', 'CCCP', 'Rotenone', 'Oligomycin A']"
Proteasome,9,144,4,"['Carfilzomib', 'Bortezomib', '(S)-MG132', '(R)-MG132']"
Nonspecific reactive,14,128,6,"[nan, 'Ebselen', 'IPA-3', '2-Tert-butyl-1,4-benzoquinone', '2-Chloro-1,4-naphthoquinone', 'Ebselen oxide']"
Ferroptosis,12,96,4,"['ML-162', 'ML-210', '(1S,3R)-RSL3', 'Erastin']"
Tannin,13,96,4,"['Gallotannin', 'Corilagin', 'Chebulagic acid', 'Punicalagin']"
mTOR,8,96,2,"['Torin 1', 'Rapamycin']"
16 changes: 16 additions & 0 deletions data/summary_data_split.csv
@@ -0,0 +1,16 @@
Cellular Injury,Number of Wells (Total Data),Number of Wells (Train Split),Number of Wells (Test Split),Number of Wells (Plate Holdout),Number of Wells (Treatment Holdout),Number of Wells (Well Holdout)
Control,9855,6726,1682,1072,0,375
Cytoskeletal,1472,881,221,181,12,177
Ferroptosis,96,66,16,6,0,8
Genotoxin,944,590,147,73,48,86
HDAC,168,110,28,30,0,0
Hsp90,552,334,84,54,0,80
Kinase,1104,600,150,120,12,222
Miscellaneous,1304,806,201,172,18,107
Mitochondria,144,92,23,12,0,17
Nonspecific reactive,128,84,21,19,0,4
Proteasome,144,94,23,24,0,3
Redox,312,172,43,54,24,19
Saponin,288,131,33,102,12,10
Tannin,96,60,15,18,0,3
mTOR,96,56,14,12,0,14
414 changes: 221 additions & 193 deletions notebooks/0.feature_selection/0.feature_selection.ipynb

Large diffs are not rendered by default.

@@ -55,6 +55,8 @@
suppl_meta_df = pd.read_csv(suppl_meta_path)
cell_injury_df = suppl_meta_df[["Cellular injury category", "Compound alias"]]

print("Cell injury screen shape:", image_profile_df.shape)


# ## Labeling Cell Injury data

@@ -209,7 +211,7 @@

# Save feature space information while maintaining feature space order

# In[ ]:
# In[8]:


# split meta and feature column names
1,715 changes: 1,089 additions & 626 deletions notebooks/1.data_splits/1.data_splits.ipynb

Large diffs are not rendered by default.

16 changes: 16 additions & 0 deletions notebooks/1.data_splits/injury_data_summary_before_holdout.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
injury_type,injury_code,n_wells,n_compounds,compound_list
Control,0,9855,1,['DMSO']
Cytoskeletal,1,1472,15,"['Nocodazole', 'Colchicine', 'Paclitaxel', 'Vinblastine', 'Ispinesib', 'ARQ 621', 'SB-743921', 'Epothilone B', 'Cytochalasin B', 'Monastrol', 'Cytochalasin D', 'Latrunculin B', 'Citrinin', 'Podophyllotoxin', 'Citreoviridin']"
Miscellaneous,5,1304,39,"['L-Buthionine-(S,R)-sulfoximine', 'CDDO Im', 'Cinobufagin', 'Puromycin', 'Brefeldin A', 'Tetrandrine', 'Pristimerin', 'Perifosine', 'Chloroquine', 'Niclosamide', 'Withaferin A', 'Bay 11-7821', 'Chelerythrine', 'UMI-77', 'MIM1', 'Chaetocin', 'Fisetin', 'Emetine', 'DZNep', 'BSI-201', 'Quercetin', 'Bruceine A', 'trans-Resveratrol', 'Aloisine RP106', 'Cycloheximide', 'Kenpaullone', 'Aurintricarboxylic acid', 'Gliotoxin', 'MitoPQ', 'Streptozotocin', 'cis-Resveratrol', 'Piceatannol', 'Tert-butylhydroquinone', 'Pyrogallol', 'RRx-001', 'Beauvericin', 'Ochratoxin A', 'Cantharidin', 'Cercosporin']"
Kinase,3,1104,13,"['Wortmannin', 'Staurosporine', 'PI-103', 'BEZ-235', 'AZD 1152-HQPA', 'Saracatinib', 'PKC 412', 'Lestaurtinib', 'Dasatinib', 'LY294002', 'Sorafenib', 'KW 2449', 'Sunitinib']"
Genotoxin,4,944,22,"['Camptothecin', 'CX-5461', 'Doxorubicin', 'Cladribine', 'Etoposide', 'Aphidicolin', 'Gemcitabine', 'Cisplatin', 'Oxaliplatin', 'Carboplatin', 'Dacarbazine', 'Lomustine', 'SN-38', 'Decitabine', 'Busulfan', 'Irinotecan', 'Chlorambucil', 'Thio-TEPA', 'Carmustine', 'Melphalan', 'Cyclophosphamide', 'β-Amanitin']"
Hsp90,2,552,3,"['Radicicol', 'Geldanamycin', '17-AAG']"
Redox,6,312,12,"['Menadione', 'PKF118-310', '4-Amino-1-naphthol (HCl)', 'Dunnione', 'MGR2', 'SIN-1 (chloride)', 'AAPH', 'MGR1', 'Phenazine (methosulfate)', 'DA-3003-2', 'IT-901', 'DMNQ']"
Saponin,10,288,11,"['Digitonin', 'Saikosaponin A', 'Polygalasaponin F', 'Bacopasaponin C', 'Pulsatilla Saponin D', 'Hederacoside C', 'Glycyrrhizic acid', 'Platycodin D', 'Onjisaponin B', 'Ginsenoside Ro', 'Protodioscin']"
HDAC,7,168,5,"['AR-42', 'SAHA', 'ITF 2357', 'Panobinostat', 'Apicidin']"
Mitochondria,11,144,4,"['Antimycin A', 'CCCP', 'Rotenone', 'Oligomycin A']"
Proteasome,9,144,4,"['Carfilzomib', 'Bortezomib', '(S)-MG132', '(R)-MG132']"
Nonspecific reactive,14,128,6,"[nan, 'Ebselen', 'IPA-3', '2-Tert-butyl-1,4-benzoquinone', '2-Chloro-1,4-naphthoquinone', 'Ebselen oxide']"
Ferroptosis,12,96,4,"['ML-162', 'ML-210', '(1S,3R)-RSL3', 'Erastin']"
Tannin,13,96,4,"['Gallotannin', 'Corilagin', 'Chebulagic acid', 'Punicalagin']"
mTOR,8,96,2,"['Torin 1', 'Rapamycin']"
190 changes: 148 additions & 42 deletions notebooks/1.data_splits/nbconverted/1.data_splits.py
@@ -2,7 +2,9 @@
# coding: utf-8

# # Splitting Data
#
# Here, we utilize the feature-selected profiles generated in the preceding module notebook [here](../0.feature_selection/0.feature_selection.ipynb), focusing on dividing the data into training, testing, and holdout sets for machine learning training.
#

# In[1]:

@@ -17,7 +19,7 @@
from sklearn.model_selection import train_test_split

sys.path.append("../../") # noqa
from src.utils import split_meta_and_features # noqa
from src.utils import get_injury_treatment_info, split_meta_and_features # noqa

# ignoring warnings
warnings.catch_warnings(action="ignore")
@@ -26,6 +28,7 @@
# ## Parameters
#
# Below are the parameters used in this notebook.
#

# In[2]:

@@ -63,9 +66,11 @@

# ## Exploring the data set
#
# Below is an exploration of the selected features dataset. The aim is to identify treatments, extract metadata, and gain an understanding of the experiment's design.
# Below is an exploration of the selected features dataset. The aim is to identify treatments, extract metadata, and gain an understanding of the experiment's design.
#

# The cell below shows the number of wells for each treatment.
#

# In[4]:

@@ -79,6 +84,7 @@


# The cell below shows the number of wells for each cellular injury type.
#

# In[5]:

@@ -93,33 +99,25 @@
# Next, we extract metadata on how many wells and compounds are associated with each injury type.
#
# This will be saved in the `results/0.data_splits` directory
#

# In[6]:


meta_injury = []
for injury_type, df in fs_profile_df.groupby("injury_type"):
    # extract n_wells, n_compounds and unique compounds per injury_type
    n_wells = df.shape[0]
    unique_compounds = list(df["Compound Name"].unique())
    n_compounds = len(unique_compounds)

    # store information
    meta_injury.append([injury_type, n_wells, n_compounds, unique_compounds])

injury_meta_df = pd.DataFrame(
    meta_injury, columns=["injury_type", "n_wells", "n_compounds", "compound_list"]
).sort_values("n_wells", ascending=False)
injury_meta_df.to_csv(data_split_dir / "injury_well_counts_table.csv", index=False)
# get summary information and save it
injury_before_holdout_info_df = get_injury_treatment_info(
    profile=fs_profile_df, groupby_key="injury_type"
).reset_index(drop=True)

# display
print("shape:", injury_meta_df.shape)
injury_meta_df
print("Shape:", injury_before_holdout_info_df.shape)
injury_before_holdout_info_df


# Next, we construct the profile metadata. This provides a structured overview of how the treatments associated with injuries were applied, detailing the treatments administered to each plate.
#
# This will be saved in the `results/0.data_splits` directory
#

# In[7]:
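The inline loop removed in this hunk, together with the call that replaces it, suggests what `get_injury_treatment_info` computes for each injury type. The sketch below mirrors that removed loop. It is an assumption about the helper's behavior, not the code in `src/utils`; the function name `summarize_injury_treatments` and the toy profile are hypothetical.

```python
import pandas as pd


def summarize_injury_treatments(profile: pd.DataFrame, groupby_key: str) -> pd.DataFrame:
    """Hypothetical stand-in for `get_injury_treatment_info` (not the repository code)."""
    rows = []
    for group_name, df in profile.groupby(groupby_key):
        # each row of the profile is one well; collect the unique compounds per group
        unique_compounds = list(df["Compound Name"].unique())
        rows.append(
            [
                group_name,
                df["injury_code"].unique()[0],
                df.shape[0],
                len(unique_compounds),
                unique_compounds,
            ]
        )

    # largest groups first, matching the ordering of the saved CSV summaries
    return pd.DataFrame(
        rows,
        columns=[groupby_key, "injury_code", "n_wells", "n_compounds", "compound_list"],
    ).sort_values("n_wells", ascending=False)


# toy profile (made-up rows, for illustration only)
toy_profile = pd.DataFrame(
    {
        "injury_type": ["Control", "Control", "Control", "Hsp90", "Hsp90"],
        "injury_code": [0, 0, 0, 2, 2],
        "Compound Name": ["DMSO", "DMSO", "DMSO", "Radicicol", "17-AAG"],
    }
)
toy_summary = summarize_injury_treatments(toy_profile, groupby_key="injury_type")
print(toy_summary)
```

The output columns match the header of `data/injury_summary_before_holdout.csv` (`injury_type`, `injury_code`, `n_wells`, `n_compounds`, `compound_list`).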

@@ -154,6 +152,7 @@
# Here we build plate metadata that records the types of treatments and the number of wells with each treatment in the dataset.
#
# This will be saved in `results/0.data_splits`
#

# In[8]:

@@ -177,23 +176,27 @@


# ## Data Splitting
#
# ---
#

# ### Holdout Dataset
#
# Here we collect our holdout dataset. The holdout dataset is a subset of the dataset that is not used during model training or tuning. Instead, it is reserved solely for evaluating the model's performance after it has been trained.
#
# In this notebook, we will include three different types of held-out datasets before proceeding with our machine learning training and evaluation.
# - Plate hold out
# - treatment hold out
# - well hold out
#
# - Plate hold out
# - treatment hold out
# - well hold out
#
# Each of these held-out datasets will be stored in the `results/1.data_splits` directory
#

# ### Plate Holdout
#
# Plates are randomly selected by Plate ID and saved as our `plate_holdout` data.
#

# In[9]:
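The plate holdout described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's actual cell: the column name `Plate`, the function name `split_plate_holdout`, and the number of plates drawn are all hypothetical.

```python
import numpy as np
import pandas as pd


def split_plate_holdout(profile: pd.DataFrame, plate_col: str, n_plates: int, seed: int):
    """Move every well from `n_plates` randomly chosen plates into a holdout set."""
    rng = np.random.default_rng(seed)

    # draw plate IDs without replacement so no plate is picked twice
    plate_ids = sorted(profile[plate_col].unique())
    held_out_plates = rng.choice(plate_ids, size=n_plates, replace=False)

    # split the profile by plate membership
    is_held_out = profile[plate_col].isin(held_out_plates)
    return profile.loc[is_held_out], profile.loc[~is_held_out]


# toy profile: 4 plates with 3 wells each (made-up data)
toy_profile = pd.DataFrame(
    {
        "Plate": ["P1"] * 3 + ["P2"] * 3 + ["P3"] * 3 + ["P4"] * 3,
        "injury_type": ["Control", "Kinase", "Redox"] * 4,
    }
)
plate_holdout_toy, remaining_toy = split_plate_holdout(
    toy_profile, plate_col="Plate", n_plates=1, seed=0
)
print(plate_holdout_toy["Plate"].unique(), len(remaining_toy))
```

Holding out whole plates, rather than individual wells, means the held-out wells share no plate with the training data.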

@@ -241,6 +244,7 @@
# To determine which cell injuries should be considered for a single treatment holdout, we establish a threshold of 10 unique compounds. This means that a cell injury type must have at least 10 unique compounds to qualify for selection in the treatment holdout. Any cell injury types failing to meet this criterion will be disregarded.
#
# Once the cell injuries are identified for treatment holdout, we select our holdout treatment by grouping each injury type and choosing the treatment with the fewest wells. This becomes our treatment holdout dataset.
#

# In[10]:
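The treatment holdout rule above (keep only injury types with at least 10 unique compounds, then hold out the compound with the fewest wells in each) can be sketched like this. It is an illustration, not the notebook's implementation: the function name, the tie-breaking by compound name, and the toy data are assumptions.

```python
import pandas as pd


def select_treatment_holdout(profile: pd.DataFrame, min_compounds: int = 10) -> pd.DataFrame:
    """Pick one holdout compound per eligible injury type (hypothetical helper)."""
    # number of wells for every (injury type, compound) pair
    well_counts = (
        profile.groupby(["injury_type", "Compound Name"])
        .size()
        .rename("n_wells")
        .reset_index()
    )

    # keep injury types that have at least `min_compounds` unique compounds
    n_compounds = well_counts.groupby("injury_type")["Compound Name"].transform("nunique")
    eligible = well_counts[n_compounds >= min_compounds]

    # within each injury type, take the compound with the fewest wells
    # (ties are broken alphabetically here, which is an assumption)
    return (
        eligible.sort_values(["injury_type", "n_wells", "Compound Name"])
        .groupby("injury_type")
        .head(1)
        .reset_index(drop=True)
    )


# toy data: injury "A" has 3 compounds, injury "B" has 2 (made-up values)
toy_rows = (
    [("A", "a1")] * 3 + [("A", "a2")] * 1 + [("A", "a3")] * 2
    + [("B", "b1")] * 2 + [("B", "b2")] * 2
)
toy_profile = pd.DataFrame(toy_rows, columns=["injury_type", "Compound Name"])

# a threshold of 3 is used only so the toy data has one eligible injury type
toy_holdout = select_treatment_holdout(toy_profile, min_compounds=3)
print(toy_holdout)
```

In `data/summary_data_split.csv`, the injury types with nonzero treatment holdout wells (Cytoskeletal, Genotoxin, Kinase, Miscellaneous, Redox, Saponin) are exactly those listed with 10 or more compounds in `data/injury_summary_before_holdout.csv`.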

@@ -320,6 +324,7 @@
# ### Well holdout
#
# To generate the well holdout data, we iterated over each plate and selected wells at random. Because the controls create a large label imbalance, control wells and treated wells were separated first. Then 5 control wells and 10 treated wells were randomly selected from each plate.
#

# In[13]:
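The per-plate well sampling described above could look roughly like the sketch below. The column name `Plate`, the function name, and the reading that 5 wells are controls and 10 are treated are assumptions (the latter matches the 375 control and 750 treated well-holdout counts in `data/summary_data_split.csv`); this is not the notebook's cell.

```python
import pandas as pd


def sample_well_holdout(
    profile: pd.DataFrame, n_control: int = 5, n_treated: int = 10, seed: int = 0
):
    """Sample control and treated wells separately from every plate (hypothetical helper)."""
    sampled = []
    for _, plate_df in profile.groupby("Plate"):
        # separate controls from treated wells to limit the label imbalance
        is_control = plate_df["injury_type"] == "Control"
        controls = plate_df[is_control]
        treated = plate_df[~is_control]

        # never request more wells than the plate actually has
        sampled.append(controls.sample(n=min(n_control, len(controls)), random_state=seed))
        sampled.append(treated.sample(n=min(n_treated, len(treated)), random_state=seed))

    holdout = pd.concat(sampled)
    # assumes the profile index is unique, so dropping by index removes only sampled wells
    remaining = profile.drop(index=holdout.index)
    return holdout, remaining


# toy profile: 2 plates, each with 8 control and 12 treated wells (made-up data)
toy_profile = pd.DataFrame(
    {
        "Plate": ["P1"] * 20 + ["P2"] * 20,
        "injury_type": (["Control"] * 8 + ["Kinase"] * 12) * 2,
    }
)
well_holdout_toy, remaining_toy = sample_well_holdout(toy_profile)
print(len(well_holdout_toy), len(remaining_toy))
```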

@@ -374,34 +379,22 @@


# ## Saving training dataset
#

# Once the data holdout has been generated, the next step is to save the training dataset that will serve as the basis for training the multi-class logistic regression model.
#

# In[14]:


# Showing the amount of data we have after removing the holdout data
meta_injury = []
for injury_type, df in fs_profile_df.groupby("injury_type"):
    # extract n_wells, n_compounds and unique compounds per injury_type
    n_wells = df.shape[0]
    injury_code = df["injury_code"].unique()[0]
    unique_compounds = list(df["Compound Name"].unique())
    n_compounds = len(unique_compounds)

    # store information
    meta_injury.append(
        [injury_type, injury_code, n_wells, n_compounds, unique_compounds]
    )

# creating data frame
injury_meta_df = pd.DataFrame(
    meta_injury,
    columns=["injury_type", "injury_code", "n_wells", "n_compounds", "compound_list"],
).sort_values("n_wells", ascending=False)
# get summary cell injury dataset treatment and well info after holdouts
injury_after_holdout_info_df = get_injury_treatment_info(
    profile=fs_profile_df, groupby_key="injury_type"
)

# display
injury_meta_df
print("shape:", injury_after_holdout_info_df.shape)
injury_after_holdout_info_df


# In[15]:
@@ -448,7 +441,120 @@
    compression="gzip",
    index=False,
)

# display
print("Metadata shape", cell_injury_metadata.shape)
cell_injury_metadata.head()


# ## Generating data split summary file

# In[18]:


def get_and_rename_injury_info(
    profile: pd.DataFrame, groupby_key: str, column_name: str
) -> pd.DataFrame:
    """Gets injury treatment information and renames the specified column.
    Parameters
    ----------
    profile : DataFrame
        The profile DataFrame containing data to be processed.
    groupby_key : str
        The key to group by in the injury treatment information.
    column_name : str
        The new name for the 'n_wells' column.
    Returns
    -------
    DataFrame
        A DataFrame with the injury treatment information and the 'n_wells' column renamed.
    """
    return get_injury_treatment_info(profile=profile, groupby_key=groupby_key).rename(
        columns={"n_wells": column_name}
    )


# name of the columns
data_col_name = [
    "Number of Wells (Total Data)",
    "Number of Wells (Train Split)",
    "Number of Wells (Test Split)",
    "Number of Wells (Plate Holdout)",
    "Number of Wells (Treatment Holdout)",
    "Number of Wells (Well Holdout)",
]


# Total amount summary
injury_before_holdout_info_df = injury_before_holdout_info_df.rename(
    columns={"n_wells": data_col_name[0]}
)

# Data splits train test summary
injury_train_info_df = get_and_rename_injury_info(
    profile=X_train.merge(
        fs_profile_df[meta_cols], how="left", right_index=True, left_index=True
    )[meta_cols + feat_cols],
    groupby_key="injury_type",
    column_name=data_col_name[1],
)

injury_test_info_df = get_and_rename_injury_info(
    profile=X_test.merge(
        fs_profile_df[meta_cols], how="left", right_index=True, left_index=True
    )[meta_cols + feat_cols],
    groupby_key="injury_type",
    column_name=data_col_name[2],
)

# Holdouts summary
injury_plate_holdout_info_df = get_and_rename_injury_info(
    profile=plate_holdout_df, groupby_key="injury_type", column_name=data_col_name[3]
)

injury_treatment_holdout_info_df = get_and_rename_injury_info(
    profile=treatment_holdout_df,
    groupby_key="injury_type",
    column_name=data_col_name[4],
)

injury_well_holdout_info_df = get_and_rename_injury_info(
    profile=wells_heldout_df, groupby_key="injury_type", column_name=data_col_name[5]
)

# Select columns of interest
total_data_summary = injury_before_holdout_info_df[["injury_type", data_col_name[0]]]
train_split_summary = injury_train_info_df[["injury_type", data_col_name[1]]]
test_split_summary = injury_test_info_df[["injury_type", data_col_name[2]]]
plate_holdout_info_df = injury_plate_holdout_info_df[["injury_type", data_col_name[3]]]
treatment_holdout_summary = injury_treatment_holdout_info_df[
    ["injury_type", data_col_name[4]]
]
well_holdout_summary = injury_well_holdout_info_df[["injury_type", data_col_name[5]]]


# In[19]:


# merge the summary data splits into one, update data type to integers
merged_summary_df = (
    total_data_summary.merge(train_split_summary, on="injury_type", how="outer")
    .merge(test_split_summary, on="injury_type", how="outer")
    .merge(plate_holdout_info_df, on="injury_type", how="outer")
    .merge(treatment_holdout_summary, on="injury_type", how="outer")
    .merge(well_holdout_summary, on="injury_type", how="outer")
    .fillna(0)
    .set_index("injury_type")
)[data_col_name].astype(int)

# reset the index and rename 'injury_type' to "Cellular Injury"
merged_summary_df = merged_summary_df.reset_index().rename(
    columns={"injury_type": "Cellular Injury"}
)

# save as csv file
merged_summary_df.to_csv(data_split_dir / "summary_data_split.csv", index=False)

# display
merged_summary_df
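The chained outer merges above can also be written as a fold over a list of per-split count tables. The sketch below shows that variant on toy data. The function name `merge_split_counts` and the short column names are hypothetical, and this is an alternative formulation, not the code added in this commit.

```python
from functools import reduce

import pandas as pd


def merge_split_counts(count_tables: list, key: str = "injury_type") -> pd.DataFrame:
    """Outer-merge per-split well counts on `key`, treating missing counts as 0."""
    merged = reduce(
        lambda left, right: left.merge(right, on=key, how="outer"), count_tables
    )

    # an injury type absent from a split has no wells there, so fill with 0
    count_cols = [col for col in merged.columns if col != key]
    return merged.fillna({col: 0 for col in count_cols}).astype(
        {col: int for col in count_cols}
    )


# toy count tables (made-up numbers): HDAC has no wells in the well holdout
total_counts = pd.DataFrame({"injury_type": ["Control", "HDAC"], "Total": [10, 4]})
train_counts = pd.DataFrame({"injury_type": ["Control", "HDAC"], "Train": [7, 4]})
well_holdout_counts = pd.DataFrame({"injury_type": ["Control"], "Well Holdout": [3]})

toy_merged = merge_split_counts([total_counts, train_counts, well_holdout_counts])
print(toy_merged)
```

The outer join keeps injury types that are missing from a split, which is why rows such as HDAC show 0 in the treatment and well holdout columns of `data/summary_data_split.csv`. In that file the split columns add up to the total for each row, for example Control: 6726 + 1682 + 1072 + 0 + 375 = 9855.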
5 changes: 4 additions & 1 deletion notebooks/2.modeling/2.modeling.ipynb
@@ -229,7 +229,10 @@
"outputs": [],
"source": [
"# shuffle feature space\n",
"shuffled_X_train = shuffle_features(X_train.values, seed=seed)"
"shuffled_X_train = shuffle_features(X_train, features=shared_features, seed=seed)\n",
"\n",
"# checking if the shuffled and original feature space are the same\n",
"assert not X_train.equals(shuffled_X_train), \"DataFrames are the same!\""
]
},
{
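The updated call passes a DataFrame and a feature list to `shuffle_features` and then asserts that the result differs from `X_train`. The repository's implementation is not shown in this diff. The sketch below is one plausible way such a function could work (independently permuting the values of each feature column); the name `shuffle_feature_columns` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def shuffle_feature_columns(profile: pd.DataFrame, features: list, seed: int) -> pd.DataFrame:
    """Independently permute each feature column (an assumed shuffling strategy)."""
    rng = np.random.default_rng(seed)
    shuffled = profile.copy()
    for feature in features:
        # permuting each column separately breaks the link between features and rows
        shuffled[feature] = rng.permutation(shuffled[feature].to_numpy())
    return shuffled


# toy feature matrix (made-up values)
toy_X = pd.DataFrame({"feat_a": np.arange(50.0), "feat_b": np.arange(50.0) * 2})
shuffled_toy = shuffle_feature_columns(toy_X, features=["feat_a", "feat_b"], seed=0)

# same check as in the notebook cell: the shuffled frame must differ from the original
assert not toy_X.equals(shuffled_toy), "DataFrames are the same!"
```

Each column keeps its original set of values, so a model trained on the shuffled data sees the same marginal distributions but no real feature-to-label relationship.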