The transferase system for retrieving methylomes from methbase.
The transferase program has an alias xfr
(installed as a copy of the
program) which is quicker to type and will be used below. There are
several commands within xfr
and the best way to start is to
understand the dnmtools roi
command, as the information
functionality provided by xfr
is the same. If you need to learn
about dnmtools roi
you can find the docs
here
If you are on a reasonably recent Linux (i.e., no older than 10 yeads), then you can install the binary distribution. First download it like this:
wget https://github.com/andrewdavidsmith/transferase/releases/download/v0.3.0/transferase-0.3.0-Linux.sh
Then run the downloaded installer (likely you want to first install it beneath your home dir):
sh transferase-0.3.0-Linux.sh --prefix=${PREFIX}
This will prompt you to accept the license, and then it will install
the transferase binaries in ${PREFIX}/bin
, along with some config files
in ${PREFIX}/share
. If you want to install it system-wide, and have
the admin privs, you can do:
sh transferase-0.3.0-Linux.sh --prefix=/usr/local
If you are on Debian or Ubuntu, and have admin privileges, you can use the Debian package and then transferase will be tracked in the package management system. Get the file like this:
wget https://github.com/andrewdavidsmith/transferase/releases/download/v0.3.0/transferase-0.3.0-Linux.deb
And then install it like this:
sudo dpkg -i ./transferase-0.3.0-Linux.deb
Not recommended unless you know what you are doing. You will need the following:
- A compiler that can handle most of C++23, one of
- Boost >= 1.85
- CMake >= 3.30
- ZLib, any version, just install it with
apt install zlib1g-dev
,mamba install -c conda-forge zlib
, etc. From source is fast and easy.
However you install these, remember where you put them and update your paths accordingly.
Since transferase uses CMake to generate the build system, there are multiple ways to do it, but I like this:
tar -xf transferase-0.3.0-Source.tar.gz
cd transferase-0.3.0-Source
cmake -B build -DCMAKE_BUILD_TYPE=Release # for a faster xfr
cmake --build build -j64 # i.e., if you have 64 cores
cmake --install build --prefix=${HOME} # or wherever
Before starting, an index file is required. To make the index, you
need the reference genome in a single fasta format file. If your
reference file name is hg38.fa
, then do this:
xfr index -v -g hg38.fa -x hg38.cpg_idx
If hg38.fa
is roughly 3.0G in size, then you should expect the index
file hg38.cpg_idx
to be about 113M in size. This command will also
create an "index metadata" file hg38.cpg_idx.json
, which is named by
just adding the .json
extension to the provided output file. You can
also start with a gzip format file like hg38.fa.gz
.
The starting point is a file in the xcounts
format that involves the
symmetric CpG sites. This is called a symmetric xcounts format file,
and may be gzip compressed, in which case the extension should be
.xsym.gz
. First I will explain how to start with this format, and
below I will explain how to start with the other format used by
dnmtools
. We again assume that the reference genome is hg38.fa
and
that this file was used to create the SRX012345.xsym.gz
file. This
ensures a correspondence between the SRX012345.xsym.gz
file and the
genome index (two files named hg38.cpg_idx
and hg38.cpg_idx.json
).
Here is how to convert such a file into the transferase format:
xfr format -x index_dir -g hg38 -o methylome_dir -m SRX012345.xsym.gz
As with the genome index, the methylome in transferase format will
involve two files: SRX012345.m16
and SRX012345.m16.json
, the
latter containing metadata.
If you begin with a
counts format
file, for example SRX012345.sym
, created using the dnmtools counts
and then dnmtools sym
commands, then you will need to first convert
it into .xsym
format (whether gzipped or not). You can do this as
follows:
dnmtools xcounts -z -o SRX012345.xsym.gz -c hg38.fa -v SRX012345.sym
Once again, be sure to always use the same hg38.fa
file. A hash is
generated and used internally to transferase to ensure that the index
and methylome files correspond to the same reference genome file.
This step is to make sure everything is sensible. Or you might just
want to keep using this tool for your own analysis (it can be
extremely fast). We will assume first that you have a set of genomic
intervals of interest. In this example these will be named
intervals.bed
. You also need the genome index and the methylome
generated in the above steps.
xfr query --local \
-x index_dir -g hg38 -d methylome_dir -m SRX012345 \
-o output.bed -i intervals.bed
The genome index for hg38 and the methylome for SRX012345 are the same
as explained above (each is two files). The intervals.bed
file may
contain any number of columns, but the first 3 columns must be in
3-column BED format: chrom, start, stop for each interval. The output
in the local_output.bed
file should be consistent with the
information in the command:
dnmtools roi -o intervals.roi intervals.bed SRX012345.xsym.gz
The format of these output files are different, but the methylation
levels on each line (i.e., for each interval) should be identical.
Note that dnmtools roi
can fail if intervals are nested, while
xfr query
command will still work.
The server
command can be tested first locally by using two terminal
windows. The steps here assume you have the genome index for hg38 and
a methylome named SRX012345. If you have some other methylome it's
fine, just substitute the name. These files should be in directories
named methylomes
and indexes
for this example. The methylome
directory will contain files with .m16
and .m16.json
extensions;
these correspond to the methylomes that can be served and must be in
pairs. For every .m16
file there must be a .m16.json
file, and
vice-versa. The indexes directory will contain the files ending with
.cpg_idx
and .cpg_idx.json
; these are index files for each
relevant genome assembly. For now, using the above examples, we would
have a single index and a single methylome. Here is a command that
will start the server:
xfr server -v debug -s localhost -p 5000 -m methylomes -x indexes
Above, the arguments indicate to use the host localhost with
port 5000. Note that this will fail with an appropriate error message
if port 5000 is already be in use, and you can just try 5001, etc.,
until one is free. The hostname can also be specified numerically (in
this example, 127.0.0.1). The -v debug
will ensure you see info
beyond just the errors. This informtion is logged to the terminal by
default. The server can also be run in detached mode, but if you don't
know what that means, you don't likely have any reason to do it.
We will assume for now that "remote" server is running on the local
machine (localhost) and using port is 5000 (the default). You would
have started this in the previous step using a different terminal
window. The following command should give identical results to the
earlier xfr query
command:
xfr query -v debug -s localhost -x indexes \
-o remote_output.bed -m SRX012345 -i intervals.bed
Note that now SRX012345
is not a file. Rather, it is a methylome
name, and should be available on the server. If the server can't find
the named methylome, it will respond indicating that methylome is not
available.