CellHint is an automated tool for cell type harmonisation and integration.
- harmonisation: match and harmonise cell types defined by independent datasets
- integration: integrate cells and cell types with supervision from harmonisation
Using CellHint for cell type harmonisation
Using CellHint for annotation-aware data integration
pip install cellhint
conda install -c conda-forge cellhint
1. Cross-dataset cell type harmonisation
1.1. Cell type harmonisation
The input AnnData needs two columns in .obs, representing the dataset origin and the original cell annotation, respectively. The aim is to harmonise cell types across datasets using cellhint.harmonize.
Internally, transcriptional distances between cells and cell types (each cell type being represented here by its cell centroid) are calculated first. Since a cell type is usually defined at the cluster level and no cluster is 100% pure, you can set filter_cells = True (default: False) to filter out cells whose gene expression profiles do not correlate most strongly with the cell type they belong to. This speeds up the run because only a subset of cells is used, but leaves the filtered cells unannotated (see 2.2.). Distances are calculated in either gene space or a low-dimensional space. The latter is preferred because it denoises the data; provide the latent representation via the argument use_rep (defaults to the PCA coordinates).
#`use_rep` can be omitted here as it defaults to 'X_pca'.
alignment = cellhint.harmonize(adata, dataset = 'dataset_column', cell_type = 'celltype_column', use_rep = 'X_pca')
If X_pca is not detected in .obsm and no other latent representation is provided via use_rep, the gene expression matrix in .X will be used to calculate the distances. In that case, subsetting the AnnData to informative genes (e.g. highly variable genes) is suggested, and .X should be log-normalised (to a constant total count per cell).
The resulting alignment is an instance of the class DistanceAlignment as defined by CellHint, and can be written out as follows.
#Save the harmonisation output.
alignment.write('/path/to/local/folder/some_name.pkl')
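The idea behind filter_cells can be pictured with a toy example. The sketch below is a simplified illustration in plain Python and is not CellHint's implementation: the helper names (pearson, centroids, keep_mask) and the tiny expression profiles are hypothetical, and Pearson correlation to the per-type centroid is assumed as the similarity measure.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def centroids(profiles, labels):
    """Average the profiles of all cells sharing a label."""
    grouped = {}
    for profile, label in zip(profiles, labels):
        grouped.setdefault(label, []).append(profile)
    return {label: [mean(col) for col in zip(*rows)] for label, rows in grouped.items()}

def keep_mask(profiles, labels):
    """True for cells that correlate best with the centroid of their own label."""
    cents = centroids(profiles, labels)
    mask = []
    for profile, label in zip(profiles, labels):
        best = max(cents, key=lambda ct: pearson(profile, cents[ct]))
        mask.append(best == label)
    return mask

# Hypothetical toy data: five cells, three genes.
profiles = [
    [5.0, 1.0, 0.0],  # labelled T, looks like T
    [4.0, 2.0, 0.0],  # labelled T, looks like T
    [0.0, 1.0, 5.0],  # labelled T, but resembles the B centroid
    [0.0, 2.0, 6.0],  # labelled B, looks like B
    [1.0, 1.0, 4.0],  # labelled B, looks like B
]
labels = ['T', 'T', 'T', 'B', 'B']
mask = keep_mask(profiles, labels)
print(mask)
```

In this toy data the third cell would be filtered out, which mirrors the trade-off described above: fewer cells to process, at the cost of leaving that cell unannotated.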
1.2. Cell type harmonisation with PCT
Inferring cell type relationships from directly calculated distances suffices in most cases, owing to a normalisation procedure applied to the derived distances. If a very strong batch effect exists across datasets, you can set use_pct = True (default: False) to predict these distances instead of calculating them. With this parameter, a predictive clustering tree (PCT) is built for each dataset, and distances between cells in query datasets and cell types in the reference dataset are predicted, often resulting in unbiased distance measures.
#Use PCT to predict transcriptional cell-cell distances across datasets.
alignment = cellhint.harmonize(adata, dataset = 'dataset_column', cell_type = 'celltype_column', use_rep = 'X_pca', use_pct = True)
Due to the nonparametric nature of PCT, the format of the expression matrix .X in the AnnData is flexible (normalised, log-normalised, z-scaled, etc.), but subsetting the AnnData to highly variable genes is always suggested. To avoid overfitting, each PCT is pruned at nodes where no further splits are needed according to an F-test; this pruning is turned on by default (F_test_prune = True). You can increase the p-value cutoff (p_thres, default 0.05) to prune fewer nodes for improved accuracy at the cost of reduced generalisability.
1.3. Specify the dataset order
In CellHint, datasets are iteratively incorporated and harmonised. The order of datasets can be specified by providing a list of dataset names to the argument dataset_order. Otherwise, the order is determined by CellHint by iteratively adding the dataset that is most similar (i.e., shares the most cell types) to the datasets already incorporated. This behaviour can be disabled by setting reorder_dataset = False (default: True), in which case an alphabetical order of datasets is used.
#Specify the order of datasets to be harmonised.
alignment = cellhint.harmonize(adata, dataset = 'dataset_column', cell_type = 'celltype_column', use_rep = 'X_pca', dataset_order = a_list_of_datasets)
1.4. Categories of harmonised cell types
Four kinds of harmonisations are anchored with cellhint.harmonize:
- Novel cell types as determined by maximum_novel_percent (default: 0.05). In each harmonisation iteration, a cell type (or meta-cell-type) whose maximal alignment fraction with any cell type in any other dataset is < maximum_novel_percent is designated as a novel cell type (NONE).
- One-to-one aligned cell types as determined by minimum_unique_percents and minimum_divide_percents. If the alignments (in both directions) between two cell types from two respective datasets are greater than minimum_unique_percents, and these alignments are not one-to-many (see the third point below), this is a 1:1 (=) match. Dynamic thresholds of minimum_unique_percents (default: 0.4, 0.5, 0.6, 0.7, 0.8) and minimum_divide_percents (default: 0.1, 0.15, 0.2) are exhaustively tested until the least number of alignments is found between datasets.
- One-to-many (or many-to-one) aligned cell types as determined by minimum_unique_percents and minimum_divide_percents. If one cell type has two or more cell types aligned in the other dataset with a match proportion greater than minimum_divide_percents, and these matched cell types have a back-match proportion greater than minimum_unique_percents, this is a 1:N (∋) or N:1 (∈) match. Dynamic thresholds of minimum_unique_percents (default: 0.4, 0.5, 0.6, 0.7, 0.8) and minimum_divide_percents (default: 0.1, 0.15, 0.2) are exhaustively tested until the least number of alignments is found between datasets.
- Unharmonised cell types. If a cell type remains unharmonised after the above categorisation, it is designated as an unharmonised cell type (UNRESOLVED).
If there are many datasets to harmonise and each dataset has many cell types, harmonisation may take longer. You can restrict the test scope of minimum_unique_percents and minimum_divide_percents to reduce runtime. The default is a 15-combination (5 × 3) test; setting the two parameters to, for example, a 3 × 2 combination can decrease the runtime by about 60%.
#`minimum_unique_percents` is set to three values (default is 0.4, 0.5, 0.6, 0.7, 0.8).
#`minimum_divide_percents` is set to two values (default is 0.1, 0.15, 0.2).
alignment = cellhint.harmonize(adata, dataset = 'dataset_column', cell_type = 'celltype_column', use_rep = 'X_pca', minimum_unique_percents = [0.5, 0.6, 0.7], minimum_divide_percents = [0.1, 0.15])
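The 60% figure follows from counting threshold combinations. The short sketch below only does that arithmetic; it assumes runtime scales roughly linearly with the number of combinations tested, and the variable names are illustrative.

```python
from itertools import product

# Default threshold grids stated above.
default_unique = [0.4, 0.5, 0.6, 0.7, 0.8]
default_divide = [0.1, 0.15, 0.2]

# Restricted grids from the example call.
restricted_unique = [0.5, 0.6, 0.7]
restricted_divide = [0.1, 0.15]

default_combos = list(product(default_unique, default_divide))
restricted_combos = list(product(restricted_unique, restricted_divide))

# Fraction of combinations (and, under the linear assumption, runtime) saved.
reduction = 1 - len(restricted_combos) / len(default_combos)
print(len(default_combos), len(restricted_combos), f"{reduction:.0%}")
```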
2. Inspection of the harmonisation result
2.1. Harmonisation table
The previously saved harmonisation object can be loaded using cellhint.DistanceAlignment.load.
alignment = cellhint.DistanceAlignment.load('/path/to/local/folder/some_name.pkl')
In alignment, the harmonisation table, which summarises cell types across datasets into semantically connected ones, is stored as the attribute .relation (alignment.relation). One illustrative example is:
D1    relation    D2      relation    D3
A     =           B       =           C
D     =           NONE    =           UNRESOLVED
E     ∈           G       =           H
F     ∈           G       =           I
J     =           K       ∋           L
J     =           K       ∋           M
The table columns are the dataset1 name, relation, dataset2 name, ..., all the way to the name of the last dataset. Accordingly, each row of the table is a list of cell types connected by the predefined symbols =, ∈, and ∋. In addition to cell type names, there are two extra definitions, NONE and UNRESOLVED, in the table, representing two levels of novelty (see 1.4.).
The table should be interpreted from left to right. For example, the first row, A = B = C, may look like a 1:1 match between A and B plus a 1:1 match between B and C; the correct interpretation is a 1:1 match between A and B, resulting in a meta cell type of A = B. This meta cell type, as a whole, has a 1:1 match with C, further leading to A = B = C. Similarly, for the second row, D = NONE = UNRESOLVED, rather than a novel cell type D in dataset1, this cell type should be read as a dataset1-specific cell type not existing in dataset2 (D = NONE), which as a whole is unharmonised when aligning with dataset3 (D = NONE = UNRESOLVED).
Extending this interpretation to the third and fourth rows, they denote two cell types (E and F) in dataset1 collectively constituting the cell type G in dataset2. The resulting subtypes (E ∈ G and F ∈ G) are 1:1 matched with H and I in dataset3, respectively. The last two rows describe the subdivision of a meta cell type (J = K) into L and M in dataset3, which is more than a subdivision of K alone.
In the table, each row corresponds to a harmonised low-hierarchy cell type, in other words, the most fine-grained level of annotation that can be achieved by automatic alignment. At a high hierarchy, some cell types such as E ∈ G = H and F ∈ G = I belong to the same group. CellHint defines a high-hierarchy cell type as fully connected rows in the harmonisation table. As a result, each high-hierarchy cell type is a cell type group independent of the others. This information can be accessed in the attribute .groups, which is an array/vector with a length equal to the number of rows in the harmonisation table.
#Access the high-hierarchy cell types (cell type groups).
alignment.groups
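To make the notion of "fully connected rows" concrete, here is a minimal sketch that groups the rows of the illustrative table. It is not CellHint's code: it assumes two rows are connected when they share a cell type name in the same dataset column, it ignores the NONE and UNRESOLVED placeholders, and the function name group_rows is hypothetical.

```python
def group_rows(rows, placeholders=('NONE', 'UNRESOLVED')):
    """Label each row with a group; rows sharing a cell type in a column are merged."""
    parent = list(range(len(rows)))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}
    for i, row in enumerate(rows):
        for column, name in enumerate(row):
            if name in placeholders:
                continue
            key = (column, name)
            if key in first_seen:
                parent[find(i)] = find(first_seen[key])
            else:
                first_seen[key] = i

    labels, names = [], {}
    for i in range(len(rows)):
        root = find(i)
        names.setdefault(root, f'Group{len(names) + 1}')
        labels.append(names[root])
    return labels

# Cell type columns (D1, D2, D3) of the illustrative table; relation symbols omitted.
rows = [
    ['A', 'B', 'C'],
    ['D', 'NONE', 'UNRESOLVED'],
    ['E', 'G', 'H'],
    ['F', 'G', 'I'],
    ['J', 'K', 'L'],
    ['J', 'K', 'M'],
]
groups = group_rows(rows)
print(groups)
```

Under these assumptions the two rows sharing G fall into one group and the two rows sharing J = K fall into another, giving one label per table row.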
2.2. Cell reannotation
After cell type harmonisation, each cell can be assigned a cell type label corresponding to a given row of the harmonisation table, a process denoted as cell reannotation. By default, reannotation is enabled (reannotate = True) when using cellhint.harmonize, and the information on reannotated cell types is already in place as the attribute .reannotation.
#Access the cell reannotation information.
alignment.reannotation
This is a data frame with an example shown below. Unless filter_cells = True is set (see 1.1.), all cells in the AnnData will be present in this data frame.
         dataset    cell_type    reannotation             group
cell1    D1         A            A = B = C                Group1
cell2    D1         D            D = NONE = UNRESOLVED    Group2
cell3    D2         G            E ∈ G = H                Group3
cell4    D2         G            F ∈ G = I                Group3
cell5    D3         L            J = K ∋ L                Group4
cell6    D3         M            J = K ∋ M                Group4
The four columns represent the dataset origin, the original author annotation, and the reannotated low- and high-hierarchy annotations, respectively. The last column contains grouping (high-hierarchy) information, and each group corresponds to a subset of the harmonisation table. You can check this correspondence by coupling the table (alignment.relation) with the grouping (alignment.groups) (see 2.1.).
2.3. Meta-analysis
A distance matrix-like instance, which is from the class Distance as defined by CellHint, is also stashed in alignment as the attribute .base_distance.
#Access the distance object.
alignment.base_distance
The main content of this object is the distance matrix (alignment.base_distance.dist_mat) between all cells (rows) and all cell types (columns). Values in this matrix are either calculated (the default) or inferred (if use_pct is True) by cellhint.harmonize and, after a normalisation procedure, lie between 0 and 1. If there are strong cross-dataset batch effects, an inferred distance matrix obtained from the PCT algorithm is usually more accurate. Metadata of cells and cell types for this matrix can be found in alignment.base_distance.cell and alignment.base_distance.cell_type, which record raw information such as the dataset origin and original author annotation.
During the internal harmonisation process, each cell is assigned the most similar cell type from each dataset. This result is stored in the assignment matrix (alignment.base_distance.assignment), with rows being cells (cell metadata can be found in alignment.base_distance.cell as mentioned above), columns being datasets, and elements being the assigned cell types in different datasets. This matrix can be interpreted as a summary of multi-dataset label transfers.
#Access the cell type assignment result.
alignment.base_distance.assignment
Each column (corresponding to one dataset) of the assignment matrix can be thought of as a unified naming schema in which all cells are named according to that dataset.
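The relationship between a cell-by-cell-type distance matrix and an assignment matrix can be sketched with plain Python lists. This is an illustration under stated assumptions, not CellHint's implementation: "most similar" is taken to mean the smallest distance, and the toy distances, dataset names, and the assign helper are hypothetical.

```python
# Columns of the toy distance matrix: (dataset, cell type).
cell_types = [('D1', 'A'), ('D1', 'D'), ('D2', 'B'), ('D2', 'G')]

# Rows are cells; values are normalised distances in [0, 1].
dist_mat = [
    [0.1, 0.8, 0.2, 0.9],
    [0.7, 0.2, 0.6, 0.3],
    [0.2, 0.9, 0.1, 0.7],
]

def assign(dist_mat, cell_types):
    """For each cell, pick the closest cell type within every dataset."""
    datasets = sorted({dataset for dataset, _ in cell_types})
    assignment = []
    for row in dist_mat:
        per_dataset = {}
        for dataset in datasets:
            candidates = [
                (dist, name)
                for dist, (ds, name) in zip(row, cell_types)
                if ds == dataset
            ]
            per_dataset[dataset] = min(candidates)[1]
        assignment.append(per_dataset)
    return assignment

assignment = assign(dist_mat, cell_types)
for row in assignment:
    print(row)
```

Reading one key (dataset) across all rows gives the naming of every cell in that dataset's vocabulary, which is the sense in which each column acts as a unified naming schema.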
CellHint provides a quick way to summarise the above information including cells' distances and assignments into meta-analysis at the cell type level. Specifically, a distance matrix among all cell types can be obtained by:
#Get the cell-type-to-cell-type distance matrix.
alignment.base_distance.to_meta()
An optional turn_binary = True (default: False) can be added to turn the distance matrix into a cell membership matrix before meta-analysis, showing how cell types are assigned across datasets.
#Get the cell-type-to-cell-type membership matrix.
alignment.base_distance.to_meta(turn_binary = True)
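As a rough picture of what a cell-type-level summary looks like, the sketch below aggregates a toy cell-by-cell-type matrix by the cells' original labels. The aggregation rule (a plain mean) and the binarisation rule (1 for the closest cell type within each dataset) are assumptions made for illustration; this section does not specify how to_meta computes its result, and the function names here are hypothetical.

```python
from statistics import mean

# Columns: (dataset, cell type). Rows: cells, each with its original label.
columns = [('D1', 'A'), ('D2', 'B'), ('D2', 'G')]
cells = [('D1', 'A'), ('D1', 'A'), ('D2', 'B'), ('D2', 'G')]
dist_mat = [
    [0.1, 0.2, 0.9],
    [0.2, 0.6, 0.4],
    [0.3, 0.1, 0.8],
    [0.9, 0.7, 0.1],
]

def binarise(row, columns):
    """Mark, per dataset, the column with the smallest distance as 1 (else 0)."""
    best = {}
    for dist, (dataset, _) in zip(row, columns):
        best[dataset] = min(dist, best.get(dataset, dist))
    return [1 if dist == best[dataset] else 0 for dist, (dataset, _) in zip(row, columns)]

def summarise(dist_mat, cells, columns, turn_binary=False):
    """Average rows (distances or memberships) over cells sharing an original label."""
    rows = [binarise(r, columns) if turn_binary else r for r in dist_mat]
    meta = {}
    for cell_type in dict.fromkeys(cells):
        members = [r for r, c in zip(rows, cells) if c == cell_type]
        meta[cell_type] = [mean(col) for col in zip(*members)]
    return meta

meta_distance = summarise(dist_mat, cells, columns)
meta_membership = summarise(dist_mat, cells, columns, turn_binary=True)
print(meta_distance[('D1', 'A')])
print(meta_membership[('D1', 'A')])
```

In the binary variant, each entry becomes the fraction of cells of one type assigned to another type, which is one way to read "how cell types are assigned across datasets".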
3. Reharmonisation
3.1. Change the dataset order
The order of datasets used by cellhint.harmonize can be found in the attribute .dataset_order (alignment.dataset_order), which is either auto-determined by CellHint or specified by the user (via the dataset_order parameter in cellhint.harmonize). This order is also reflected by the column order of the harmonisation table.
Along the order of datasets, the optimal choices of minimum_unique_percents and minimum_divide_percents (see 1.4.) in each iteration can be found in alignment.minimum_unique_percents and alignment.minimum_divide_percents. For instance, harmonising five datasets requires four iterations, and thus both .minimum_unique_percents and .minimum_divide_percents have a length of four.
CellHint provides a method, best_align, to change the order of datasets post-harmonisation. Through this, datasets will be reharmonised in a different order (this post-harmonisation adjustment is more efficient than re-running cellhint.harmonize with a new order).
#Reharmonise cell types across datasets with a different dataset order.
alignment.best_align(dataset_order = a_list_of_new_dataset_order)
As in cellhint.harmonize, the combinations of minimum_unique_percents and minimum_divide_percents will be tested to find the best alignment in each iteration. Importantly, instead of the full dataset list, you can also provide a subset of datasets for reharmonisation. This is useful for focusing on part of the data for inspection or visualisation (see 4.).
#Reharmonise cell types across datasets with part of datasets.
alignment.best_align(dataset_order = a_subset_of_dataset_names)
A new harmonisation table will be generated in alignment.relation, which only includes the datasets specified in .best_align. The attributes .minimum_unique_percents and .minimum_divide_percents are also overridden by the new values used during reharmonisation.
3.2. Reannotation
After changing the dataset order and reharmonising cell types, cells need to be reannotated based on the newly generated harmonisation table using the method reannotate.
#Reannotate cells based on the new harmonisation table.
alignment.reannotate()
Similarly, information on the reannotated cells is stored in alignment.reannotation.
4. Visualisation
4.1. Tree plot
The most intuitive way to visualise the harmonised cell types is the tree plot using the function cellhint.treeplot.
#Visualise the harmonisation result with a tree plot.
cellhint.treeplot(alignment)
Alternatively, since only the harmonisation table (alignment.relation) is used when plotting this tree, cellhint.treeplot also accepts the table directly as input. This is more convenient because a table is easier to manipulate, for example by writing it out as a csv file and loading it later for the tree plot.
#Write out the harmonisation table as a csv file.
#Note - if cell type names contain commas, set a different `sep` here.
alignment.relation.to_csv('/path/to/local/folder/HT.csv', sep = ',', index = False)
#Read the harmonisation table.
import pandas as pd
HT = pd.read_csv('/path/to/local/folder/HT.csv', sep = ',')
#Visualise the harmonisation result with a tree plot.
cellhint.treeplot(HT)
#Visualise the harmonisation result only for cell types (rows) of interest.
cellhint.treeplot(HT[row_flag])
In a tree plot, each column is a dataset and cell types are connected across datasets. By default, cell types belonging to one low hierarchy (one row in the harmonisation table) are in the same color. You can change the color scheme by providing a data frame to the node_color parameter, with three consecutive columns representing dataset, cell type, and color (in hex code), respectively. node_color can also be a data frame with columns of dataset, cell type, and numeric value (for mapping a color gradient in combination with cmap). Other parameters controlling the appearance of the tree plot (node shape, line width, label size, figure size, etc.) are detailed in cellhint.treeplot.
The tree plot considers all pairs of reference-to-query assignments. Therefore, a representation restricted to two dimensions may overlay some cell types when they have complex 1:1 and 1:N intersections. These cross-connections are usually not solvable in 2D space; you may need to revisit the harmonisation table in some cases.
By changing the dataset (column) order in each high-hierarchy cell type, broader (more divisible) cell types can be positioned to the left, followed by fine-grained cell types to the right. The resulting plot shows how different authors group these cell types, and is therefore more characteristic of the potential underlying biological hierarchy. This hierarchy can be generated and visualised by adding order_dataset = True.
#Visualise the cell type hierarchy.
#Again, the input can also be a harmonisation table.
cellhint.treeplot(alignment, order_dataset = True)
Because the high-hierarchy cell types are independent of each other, the new orders of datasets will differ across groups. To recognise the dataset origin of each cell type within the hierarchy, you can assign the same color or shape to cell types from the same dataset using the parameter node_color or node_shape. An example is:
#Cell types from the same dataset are in the same shape.
#`node_shape` should be the same length as no. datasets in the harmonisation table.
cellhint.treeplot(alignment, order_dataset = True, node_shape = list_of_shapes)
Export the plot if needed.
cellhint.treeplot(alignment, show = False, save = '/path/to/local/folder/some_name.pdf')
4.2. Sankey plot
The other way to visualise harmonised cell types is the Sankey plot by cellhint.sankeyplot. CellHint builds this plot on the plotly package. plotly is not a mandatory dependency of CellHint, so you need to install it first if you want to visualise the result as a Sankey diagram (along with an engine for exporting images, such as kaleido).
#Visualise the harmonisation result with a Sankey plot.
#As with the tree plot, the input can also be a harmonisation table.
cellhint.sankeyplot(alignment)
Similar to the tree plot, this diagram shows how cell types are connected across datasets. Parameters controlling the appearance of the Sankey plot (node color, link color, figure size, etc.) are detailed in cellhint.sankeyplot.
Unlike the tree plot, where novel (NONE) and unharmonised (UNRESOLVED) cell types are blank, in the Sankey plot they are colored in white and light grey, respectively. You can adjust these by changing the values of novel_node_color and remain_node_color.
Export the plot if needed.
#Export the image into html.
cellhint.sankeyplot(alignment, show = False, save = '/path/to/local/folder/some_name.html')
#Export the image into pdf.
cellhint.sankeyplot(alignment, show = False, save = '/path/to/local/folder/some_name.pdf')
1. Supervised data integration
1.1. Specify batch and biological covariates
The input AnnData needs two columns in .obs, representing the batch confounder and the unified cell annotation, respectively. The aim is to integrate cells by correcting batches and preserving biology (cell annotation) using cellhint.integrate.
#Integrate cells with `cellhint.integrate`.
cellhint.integrate(adata, batch = 'a_batch_key', cell_type = 'a_celltype_key')
With this function, CellHint will build the neighborhood graph by searching neighbors across matched cell groups in different batches, on the basis of a low-dimensional representation provided via the argument use_rep (defaults to the PCA coordinates).
#`use_rep` can be omitted here as it defaults to 'X_pca'.
cellhint.integrate(adata, batch = 'a_batch_key', cell_type = 'a_celltype_key', use_rep = 'X_pca')
The batch confounder can be the dataset origin, donor ID, or any relevant covariate. The biological factor is a consistent annotation across cells, such as manual annotations of all cells, cell type labels transferred from a single reference model, or, as in the example here, the harmonised cell types from the CellHint harmonisation pipeline (see the harmonisation section). Specifically, you can add two extra columns to the .obs of the input AnnData using the reannotation information from alignment.reannotation.
#Insert low- and high-hierarchy annotations into the AnnData.
adata.obs[['harmonized_low', 'harmonized_high']] = alignment.reannotation.loc[adata.obs_names, ['reannotation', 'group']]
Perform data integration using either of the two annotation columns.
#Integrate cells using the reannotated high-hierarchy cell annotation.
cellhint.integrate(adata, batch = 'a_batch_key', cell_type = 'harmonized_high')
#Not run; integrate cells using the reannotated low-hierarchy cell annotation.
#cellhint.integrate(adata, batch = 'a_batch_key', cell_type = 'harmonized_low')
Finally, generate a UMAP based on the reconstructed neighborhood graph.
sc.tl.umap(adata)
1.2. Adjust the influence of annotation on integration
The influence of cell annotation on the data structure can range from forcibly merging the same cell types to a more lenient cell grouping. This is achieved by adjusting the parameter n_meta_neighbors.
#The default value of `n_meta_neighbors` is 3.
cellhint.integrate(adata, batch = 'a_batch_key', cell_type = 'a_celltype_key', n_meta_neighbors = 3)
With an n_meta_neighbors of 1, each cell type only has one neighboring cell type, that is, itself. This will result in strongly separated cell types in the final UMAP. Increasing n_meta_neighbors will loosen this restriction. For example, an n_meta_neighbors of 2 allows each cell type to have, in addition to itself, one nearest neighboring cell type based on the transcriptomic distances calculated by CellHint. This parameter defaults to 3, meaning that a linear spectrum of transcriptomic structure can possibly exist for each cell type.
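The effect of n_meta_neighbors can be illustrated by listing, for each cell type, the cell types it is allowed to neighbor. The sketch below is a conceptual toy, not CellHint's internals: the distance values, cell type names, and the meta_neighbors function are hypothetical, and a cell type's distance to itself is taken as 0.

```python
def meta_neighbors(distances, n_meta_neighbors=3):
    """For each cell type, return its n closest cell types (itself included)."""
    return {
        cell_type: sorted(row, key=row.get)[:n_meta_neighbors]
        for cell_type, row in distances.items()
    }

# Hypothetical cell-type-to-cell-type transcriptomic distances.
distances = {
    'CD4 T': {'CD4 T': 0.0, 'CD8 T': 0.2, 'NK': 0.5, 'B': 0.8},
    'CD8 T': {'CD4 T': 0.2, 'CD8 T': 0.0, 'NK': 0.3, 'B': 0.9},
    'NK':    {'CD4 T': 0.5, 'CD8 T': 0.3, 'NK': 0.0, 'B': 0.95},
    'B':     {'CD4 T': 0.8, 'CD8 T': 0.9, 'NK': 0.95, 'B': 0.0},
}

strict = meta_neighbors(distances, n_meta_neighbors=1)
relaxed = meta_neighbors(distances, n_meta_neighbors=2)
default = meta_neighbors(distances, n_meta_neighbors=3)
print(strict['NK'], relaxed['NK'], default['NK'])
```

With a value of 3, each cell type sits between its two nearest neighbors, which is one way to picture the "linear spectrum" mentioned above.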
2. Tips for data integration
2.1. Partial annotation
Partial annotation (an .obs column combining annotated and unannotated cells) is allowed as the cell_type parameter of cellhint.integrate. You need to explicitly name unannotated cells as 'UNASSIGNED' for use in CellHint (definition of symbols can be found here).
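A minimal sketch of preparing such a partially annotated label list is shown below. It uses a plain Python list rather than an AnnData column, and treating both None and empty strings as "unannotated" is an assumption made for illustration; the label values are hypothetical.

```python
# Hypothetical labels with gaps (None or empty string) for unannotated cells.
labels = ['T cell', None, 'B cell', '', 'NK cell']

# Replace every missing label with the 'UNASSIGNED' name expected by CellHint.
cleaned = [label if label else 'UNASSIGNED' for label in labels]
print(cleaned)
```

The cleaned labels can then be stored in the .obs column that is passed as cell_type.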
2.2. Rare cell types
When an abundant cell type is annotated/distributed across multiple batches (e.g., datasets), sometimes not all batches can harbour adequate numbers. This leads to a rare cell type defined within the context of a specific batch. During neighborhood construction, if this batch cannot provide enough neighboring cells for this cell type, search space will be expanded to all cells in this batch.
Although this represents a safe solution in CellHint to anchor nearest neighbors for rare cell types, the runtime of the algorithm will increase and cells from this cell type may not be robustly clustered. Keeping them is fine for CellHint, but you can also remove such rare cell types from the associated batches before running cellhint.integrate (a cell type with only a few cells in a given batch suggests that this batch may not be suitable for hosting this cell type). Example code is:
#Remove cells from cell types that have <=5 cells in a batch.
combined = adata.obs['a_batch_key'].astype(str) + adata.obs['a_celltype_key'].astype(str)
combined_counts = combined.value_counts()
remove_combn = combined_counts.index[combined_counts <= 5]
adata = adata[~combined.isin(remove_combn)].copy()
2.3. Use CellTypist models for annotation and integration
cellhint.integrate requires cell annotation to be stored in the AnnData. This information can be obtained by different means. One quick way is to use available CellTypist models to annotate the data of interest (see the CellTypist model list here).
#Annotate the data with a relevant model (immune model as an example here).
adata = celltypist.annotate(adata, model = 'Immune_All_Low.pkl', majority_voting = True).to_adata()
Then integrate cells on the basis of the predicted cell types.
#`cell_type` can also be 'majority_voting'.
cellhint.integrate(adata, batch = 'a_batch_key', cell_type = 'predicted_labels')
Even if the model does not exactly match the data (e.g., using an immune model to annotate lung data), this approach can still be useful, because cells from the same cell type will probably be assigned the same identity by the model. The predicted labels therefore carry information about which cells should be placed together in the neighborhood graph.
Xu et al., Automatic cell-type harmonization and integration across Human Cell Atlas datasets. Cell 186, 5876–5891.e20 (2023).