correct_batch_effects_wdn

Latest commit: f5abbce (Sep 13, 2022) by yilei and copybara-github: Make this code compatible with Python 3.10.

This branch is 1274 commits behind google-research/google-research:master.

Name	Last commit message	Last commit date
README.md	Release code for submission "Correcting for Batch Effects Using Wasse…	Mar 14, 2019
distance.py	Remove unused comments related to Python 2 compatibility.	Apr 8, 2022
distance_test.py	Remove unused comments related to Python 2 compatibility.	Apr 8, 2022
evaluate.py	Remove unused comments related to Python 2 compatibility.	Apr 8, 2022
evaluate_metrics.py	Remove unused comments related to Python 2 compatibility.	Apr 8, 2022
evaluate_test.py	Remove unused comments related to Python 2 compatibility.	Apr 8, 2022
forgetting_nuisance.py	Make this code compatible with Python 3.10.	Sep 13, 2022
forgetting_nuisance_test.py	Remove unused comments related to Python 2 compatibility.	Apr 8, 2022
io_utils.py	Remove unused comments related to Python 2 compatibility.	Apr 8, 2022
io_utils_test.py	Remove unused comments related to Python 2 compatibility.	Apr 8, 2022
ljosa_embeddings_to_h5.py	Remove unused comments related to Python 2 compatibility.	Apr 8, 2022
ljosa_preprocessing.py	Remove unused comments related to Python 2 compatibility.	Apr 8, 2022
metadata.py	Remove unused comments related to Python 2 compatibility.	Apr 8, 2022
requirements.txt	Release code for submission "Correcting for Batch Effects Using Wasse…	Mar 14, 2019
transform.py	Remove unused comments related to Python 2 compatibility.	Apr 8, 2022
transform_test.py	Remove unused comments related to Python 2 compatibility.	Apr 8, 2022

README.md

Correcting for Batch Effects Using Wasserstein Distance

This directory contains reference code for the paper Correcting for Batch Effects Using Wasserstein Distance.

The code is implemented in TensorFlow, and the required packages are listed in requirements.txt.

Datasets

Two types of embeddings are derived from the raw image dataset at https://data.broadinstitute.org/bbbc/BBBC021/: CellProfiler embeddings and deep neural network embeddings.

CellProfiler Embeddings

The original CellProfiler embeddings were downloaded as CSV files from http://pubs.broadinstitute.org/ljosa_jbiomolscreen_2013/.

To convert them into a dataframe and save it as an h5 file:

python -m correct_batch_effects_wdn.ljosa_embeddings_to_h5 \
--ljosa_data_directory=${LJOSA_DATA_DIRECTORY}

The h5 file is saved at

${LJOSA_DATA_DIRECTORY}/ljosa_embeddings_462.h5

We follow the paper https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3884769/ to preprocess the CellProfiler embeddings:

python -m correct_batch_effects_wdn.ljosa_preprocessing \
--original_df=${LJOSA_DATA_DIRECTORY}/ljosa_embeddings_462.h5 \
--post_normalization_path=${LJOSA_DATA_DIRECTORY}/ljosa_embeddings_post_normalized.h5 \
--post_fa_path=${LJOSA_DATA_DIRECTORY}/ljosa_embeddings_post_fa.h5

This generates two h5 files. The first file is at

${LJOSA_DATA_DIRECTORY}/ljosa_embeddings_post_normalized.h5,

where each dimension of the embeddings has been normalized by percentile matching.

The second file is at

${LJOSA_DATA_DIRECTORY}/ljosa_embeddings_post_fa.h5,

where the post-normalized embeddings have been projected down to 50 dimensions by factor analysis.
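
As an illustration of the percentile-matching step, the sketch below rescales one feature so that two percentiles of a reference population map to 0 and 1. It assumes that the reference population is the negative-control cells of the same plate and that the 1st and 99th percentiles are the anchors; the function names, the toy data, and the linear-interpolation percentile are choices made for this example and are not taken from `ljosa_preprocessing.py`, which should be consulted for the exact procedure.

```python
def percentile(values, q):
    """Return the q-th percentile (0-100) of values, with linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must be non-empty")
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] * (1.0 - frac) + ordered[upper] * frac


def percentile_normalize(feature, reference, low_q=1.0, high_q=99.0):
    """Linearly rescale `feature` so the reference percentiles map to 0 and 1."""
    low = percentile(reference, low_q)
    high = percentile(reference, high_q)
    if high == low:
        raise ValueError("reference percentiles coincide; cannot rescale")
    return [(x - low) / (high - low) for x in feature]


# Hypothetical toy data: one feature measured on control and treated cells.
controls = [float(v) for v in range(101)]  # 0.0, 1.0, ..., 100.0
treated = [1.0, 50.0, 99.0]
normalized = percentile_normalize(treated, controls)
```

A real pipeline would apply this per feature and per plate, most likely with NumPy or pandas rather than plain lists.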

Deep Neural Network Embeddings

Deep neural network embeddings are obtained by running a pipeline on the raw image dataset. In the pipeline, the raw images are corrected for imaging artifacts, cell patches are extracted by locating cell centers, and a pre-trained deep neural network is applied to the patch images to obtain embeddings. Each embedding has dimension 192, with 64 dimensions for each of the three stains. More details can be found in the paper https://ai.google/research/pubs/pub46293. For proprietary reasons, the pipeline code and the generated embeddings cannot be open-sourced here. Readers who are interested in testing the code can instead use the feature vectors generated from inception_v3 on TensorFlow Hub.

Model Training

A Wasserstein distance network is trained to correct for batch effects.
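
For intuition about the quantity involved, the sketch below computes the empirical 1-Wasserstein distance between two equal-sized one-dimensional samples, which reduces to the mean absolute difference of the sorted values. This is only an illustration: the network trained by `forgetting_nuisance` works on multi-dimensional embeddings and presumably estimates the distance with a learned discriminator (see `--disc_steps_per_training_step`). The function name and the toy batches are hypothetical.

```python
def wasserstein_1d(sample_a, sample_b):
    """Empirical 1-Wasserstein distance between two equal-sized 1-D samples."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch only handles samples of equal size")
    if not sample_a:
        raise ValueError("samples must be non-empty")
    # In one dimension the optimal transport plan matches sorted values in order.
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(a - b) for a, b in pairs) / len(sample_a)


# Hypothetical toy data: the second batch is the first one shifted by 0.5.
batch_1 = [0.0, 1.0, 2.0, 3.0]
batch_2 = [2.5, 0.5, 3.5, 1.5]
shifted = wasserstein_1d(batch_1, batch_2)
same = wasserstein_1d(batch_1, list(reversed(batch_1)))
```

A pure shift between batches, which is the simplest kind of batch effect, shows up as a distance equal to the size of the shift, while reordering the same values gives zero.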

CellProfiler Embeddings

python -m correct_batch_effects_wdn.forgetting_nuisance \
--network_type=WassersteinNetwork \
--input_df="${LJOSA_DATA_DIRECTORY}/ljosa_embeddings_post_fa.h5" \
--num_steps_pretrain=100000 \
--num_steps=5000 \
--save_dir="${SAVE_DIR}/ljosa_embeddings_post_fa" \
--disc_steps_per_training_step=50 \
--checkpoint_interval=2000 \
--nuisance_levels=batch \
--batch_n=100 \
--target_levels=compound \
--feature_dim=50 \
--layer_width=2 \
--num_layers=2 \
--learning_rate=1e-4

Deep Neural Network Embeddings

python -m correct_batch_effects_wdn.forgetting_nuisance \
--network_type=WassersteinNetwork \
--input_df="${LJOSA_DATA_DIRECTORY}/ljosa_deep_post_tvn.h5" \
--num_steps_pretrain=100000 \
--num_steps=5000 \
--save_dir="${SAVE_DIR}/ljosa_deep_post_tvn" \
--disc_steps_per_training_step=50 \
--checkpoint_interval=2000 \
--nuisance_levels=batch \
--batch_n=100 \
--target_levels=compound \
--feature_dim=192 \
--layer_width=2 \
--num_layers=2 \
--learning_rate=1e-4

Model Evaluation

Model performance is evaluated with several metrics that quantify how much biological signal is preserved in the embeddings and how much of the batch effect is removed by the learned transformation.
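
The evaluation commands in this section pass `--num_bootstrap=200`, which suggests that metric uncertainty is estimated by bootstrap resampling. The sketch below shows a generic percentile bootstrap for an arbitrary metric; it is an assumption about the general technique, not a description of what `evaluate_metrics` does internally, and the function name, the `alpha` and `seed` parameters, and the scores are hypothetical.

```python
import random
import statistics


def bootstrap_interval(values, metric, num_bootstrap=200, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for metric(values)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and recompute the metric each time.
    estimates = sorted(
        metric([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(num_bootstrap)
    )
    lo_index = int((alpha / 2) * (num_bootstrap - 1))
    hi_index = int((1 - alpha / 2) * (num_bootstrap - 1))
    return estimates[lo_index], estimates[hi_index]


# Hypothetical per-compound scores for one metric.
scores = [0.71, 0.64, 0.80, 0.75, 0.69, 0.77, 0.73, 0.66]
low, high = bootstrap_interval(scores, statistics.mean)
```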

CellProfiler Embeddings

DF_DIR="${SAVE_DIR}/ljosa_embeddings_post_fa/(('input_df', \
'ljosa_embeddings_post_fa.h5'), ('network_type', 'WassersteinNetwork'), \
('num_steps_pretrain', 100000), ('num_steps', 5000), ('batch_n', 100), \
('learning_rate', 0.0001), ('feature_dim', 50), \
('disc_steps_per_training_step', 50), ('target_levels', \
('compound',)), ('nuisance_levels', ('batch',)), ('layer_width', 2), \
('num_layers', 2), ('lambda_mean', 0.0), ('lambda_cov', 0.0), \
('cov_fix', 0.001))"

python -m correct_batch_effects_wdn.evaluate_metrics \
--transformation_file="${DF_DIR}/data.pkl" \
--input_df="${LJOSA_DATA_DIRECTORY}/ljosa_embeddings_post_fa.h5" \
--output_file="${DF_DIR}/evals.pkl" \
--num_bootstrap=200

Deep Neural Network Embeddings

DF_DIR="${SAVE_DIR}/ljosa_deep_post_tvn/(('input_df', \
'ljosa_deep_post_tvn.h5'), ('network_type', 'WassersteinNetwork'), \
('num_steps_pretrain', 100000), ('num_steps', 5000), ('batch_n', 100), \
('learning_rate', 0.0001), ('feature_dim', 192), \
('disc_steps_per_training_step', 50), ('target_levels', ('compound',)), \
('nuisance_levels', ('batch',)), ('layer_width', 2), ('num_layers', 2), \
('lambda_mean', 0.0), ('lambda_cov', 0.0), ('cov_fix', 0.001))"

python -m correct_batch_effects_wdn.evaluate_metrics \
--transformation_file="${DF_DIR}/data.pkl" \
--input_df="${LJOSA_DATA_DIRECTORY}/ljosa_deep_post_tvn.h5" \
--output_file="${DF_DIR}/evals.pkl" \
--num_bootstrap=200
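
The directory names stored in `DF_DIR` above look like the Python string form of a tuple of (flag, value) pairs. Assuming that is how the training script names its output directory (this is inferred from the paths shown here, not confirmed from the code), the name can be rebuilt programmatically instead of typed by hand. The helper `results_dir` and the `/tmp/save_dir` path are hypothetical.

```python
import os


def results_dir(save_dir, params):
    """Join save_dir with the string form of a tuple of (flag, value) pairs."""
    return os.path.join(save_dir, str(tuple(params)))


# Same values as the CellProfiler training command, in the order shown in DF_DIR.
params = [
    ("input_df", "ljosa_embeddings_post_fa.h5"),
    ("network_type", "WassersteinNetwork"),
    ("num_steps_pretrain", 100000),
    ("num_steps", 5000),
    ("batch_n", 100),
    ("learning_rate", 1e-4),
    ("feature_dim", 50),
    ("disc_steps_per_training_step", 50),
    ("target_levels", ("compound",)),
    ("nuisance_levels", ("batch",)),
    ("layer_width", 2),
    ("num_layers", 2),
    ("lambda_mean", 0.0),
    ("lambda_cov", 0.0),
    ("cov_fix", 0.001),
]
df_dir = results_dir("/tmp/save_dir/ljosa_embeddings_post_fa", params)
```

Because the name depends on the order and exact values of the pairs, listing the directory produced by training and comparing it with the generated string is a useful sanity check before running the evaluation.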

Sample Code for Loading `evals.pkl`

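import pickle

import tensorflow as tf


def load_contents(file_path):
  """Reads and unpickles the object stored at file_path."""
  # Pickle data is binary, so the file must be opened in "rb" mode on Python 3.
  # tf.io.gfile is assumed to be available (TensorFlow 1.13+ or 2.x).
  with tf.io.gfile.GFile(file_path, mode="rb") as f:
    return pickle.loads(f.read())


# `path` is the --output_file passed to evaluate_metrics, i.e. "${DF_DIR}/evals.pkl".
evals = load_contents(path)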
