The following guidelines contain best practices for running SINGE on a new dataset. Please post an issue with any questions about how to run SINGE for a particular dataset.
We strongly recommend using the complete SINGE workflow, which includes the SINGE GLG and SINGE Aggregate modes. We consider each output of the GLG test resulting from a unique hyperparameter combination and subsampled input to be a partial estimate of the true regulatory network. SINGE's ensembling and modified Borda aggregation are motivated by stability selection from these networks.
Note that our main intent in choosing the hyperparameter values is to demonstrate SINGE usage, and we do not use different hyperparameter values for different datasets.
However, the hyperparameters can be tuned using prior information about the dataset, such as the type of biological process (time resolution and number of lags), number of genes, number of cells, sparsity of the dataset and expected sparsity of the network (lambda, prob-remove-sample, and prob-zero-removal), quality of pseudotime (kernel-width), etc.
For count-based data, our default recommendation is to use the hyperparameter --family poisson.
However, for larger datasets, the glmnet_matlab package encounters memory segmentation violations with this setting (see #32).
Until this glmnet_matlab issue is resolved, we recommend log-transforming the count-based data (X1 = log(1+X)) and using the --family gaussian hyperparameter instead.
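The log transform above can be sketched as follows. This is a minimal illustration in plain Python, assuming a small genes-by-cells count matrix with made-up values. It is not part of SINGE, and in practice the transform would more likely be applied in MATLAB or with an array library before writing the input .mat file.

```python
import math

def log_transform(counts):
    """Apply X1 = log(1 + X) element-wise to a genes-by-cells count matrix."""
    return [[math.log1p(value) for value in row] for row in counts]

# Hypothetical 2-gene x 3-cell count matrix, for illustration only.
X = [[0, 1, 9],
     [3, 0, 0]]
X1 = log_transform(X)
```

Zero counts stay at zero after the transform, so the sparsity pattern of the data is unchanged.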
Setting lambda=0 results in the GLG output being a complete graph between all genes.
When the dataset has a large number of genes (>1000), it is not meaningful or practical to have a complete graph.
With that in mind, we propose that for datasets with a large number of genes, the hyperparameter value lambda=0 should be avoided.
We also observed that setting lambda=0 causes the glmnet_matlab routine to run for longer durations, which is one of the key reasons for the segmentation violations described above.
Thus, avoiding lambda=0 also lowers the risk of segmentation violations.
The default_hyperparameters.txt file corresponds to an ensemble obtained using ten subsampled replicates for each hyperparameter combination.
However, as noted in the manuscript, the user can substantially reduce the computational runtime by using two to five replicates instead of ten at the cost of moderate precision-recall performance degradation.
This strategy combined with using a regulator list as described in the following section can drastically reduce the computational requirements for large datasets.
See the instructions for generating a hyperparameter file for further details.
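To see how the number of replicates affects the size of the ensemble, the sketch below counts GLG runs as the number of hyperparameter combinations times the number of replicates. The grid values are hypothetical and are not SINGE's defaults. The helper `count_glg_runs` is an illustrative name, not a SINGE function.

```python
from itertools import product

# Hypothetical hyperparameter values for illustration only; not SINGE's defaults.
grid = {
    "lambda": [0.01, 0.02, 0.05],
    "kernel-width": [0.5, 1.0],
}

def count_glg_runs(grid, replicates):
    """Count GLG runs: one per hyperparameter combination per subsampled replicate."""
    combinations = list(product(*grid.values()))
    return len(combinations) * replicates

runs_ten = count_glg_runs(grid, 10)  # 6 combinations x 10 replicates
runs_two = count_glg_runs(grid, 2)   # 6 combinations x 2 replicates
```

With this toy grid, dropping from ten replicates to two reduces the number of GLG runs from 60 to 12.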
SINGE version 0.3.0 introduced functionality where a subset of the gene list is earmarked as candidate regulators.
This is achieved by including a vector regix accompanying the expression matrix X in the input .mat file.
The vector regix contains indices corresponding to the rows of X, which themselves correspond to the genes in the vector gene_list.
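The relationship between regix, X, and gene_list can be sketched as below. This is an illustration only: the gene names are made up, the helper `make_regix` is hypothetical, and the indices are assumed to be 1-based because the input is a MATLAB .mat file. Check the SINGE documentation for the exact expected format.

```python
def make_regix(gene_list, regulators):
    """Return 1-based row indices of the candidate regulators in gene_list."""
    wanted = set(regulators)
    missing = wanted - set(gene_list)
    if missing:
        raise ValueError(f"regulators not in gene_list: {sorted(missing)}")
    # Row i of X corresponds to gene_list[i], so the index order follows gene_list.
    return [i + 1 for i, gene in enumerate(gene_list) if gene in wanted]

# Hypothetical gene names; row k of X holds the expression of the k-th gene.
gene_list = ["Sox2", "Pou5f1", "Nanog", "Gata6", "Actb"]
regulators = ["Nanog", "Sox2"]
regix = make_regix(gene_list, regulators)
```

Here regix comes out as [1, 3], pointing at the rows of X for Sox2 and Nanog.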
Given a single-cell dataset, pseudotimes can be generated by inferring a trajectory from the cells.
This process helps to order the cells along the biological process, often also assigning pseudotemporal values to each cell.
We found the dynverse (http://dynverse.org/) package to be extremely useful for this purpose.
This package has access to 50+ trajectory inference algorithms, catering to various trajectory types, dataset sizes, etc.
The dynverse package provides a streamlined user interface that guides the user through the steps for selecting a trajectory inference method.
If the trajectory inference method itself does not assign pseudotimes, they can be assigned after the fact using the add_pseudotime function.
The current SINGE version operates on linear or branching trajectories.
Support for branching trajectories was added in SINGE version 0.5.0.
In branching trajectories, two cells with very different states can have the same pseudotime.
For applying SINGE on such a dataset, we recommend breaking down the complete trajectory into sub-trajectories (branches) corresponding to each cell fate and independently analyzing each trajectory.
SINGE supports separate sub-trajectories via the optional indicator matrix branches in the input data matfile.
The branches matrix has one column for each sub-trajectory, and the 1s in a column indicate which cells belong to that sub-trajectory.
Cells can belong to multiple sub-trajectories.
By default, SINGE will aggregate all GLG tests from all sub-trajectories to produce a single regulatory network.
Alternatively, a user could run the SINGE aggregation step separately on the adjacency matrices generated for each sub-trajectory.
The example in the data_bifurcated directory demonstrates how to construct the branches data structure using dynverse.
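The layout of the branches indicator matrix described above can be sketched as follows. The cell identifiers, the branch assignments, and the helper `make_branches` are hypothetical. The data_bifurcated example remains the reference for building this structure from an actual dynverse trajectory.

```python
def make_branches(cell_ids, sub_trajectories):
    """Build a cells-by-sub-trajectories 0/1 indicator matrix."""
    members = [set(cells) for cells in sub_trajectories]
    return [[1 if cell in m else 0 for m in members] for cell in cell_ids]

# Hypothetical bifurcation: c1 and c2 precede the branch point and are
# shared; c3 follows one cell fate, c4 and c5 follow the other.
cell_ids = ["c1", "c2", "c3", "c4", "c5"]
sub_trajectories = [
    ["c1", "c2", "c3"],
    ["c1", "c2", "c4", "c5"],
]
branches = make_branches(cell_ids, sub_trajectories)
```

Each row corresponds to a cell and each column to a sub-trajectory, so the shared cells c1 and c2 have a 1 in both columns.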