Commit

updated database release
qiyunzhu committed Jan 21, 2023
1 parent a682864 commit b67c456
Showing 5 changed files with 26 additions and 14 deletions.
10 changes: 8 additions & 2 deletions CHANGELOG.md
@@ -1,7 +1,13 @@
Change Log
==========

## Version 2.0b3 (2021, ongoing)
## Version 2.0b4 (ongoing)

### Changed
- Updated pre-built default database to 2023-01-02 (after NCBI RefSeq 215).


## Version 2.0b3 (11/25/2021)

### Added
- Predicted HGT list now includes potential donors. Users can optionally specify a taxonomic rank at which they will be reported.
@@ -12,7 +18,7 @@ Change Log
- Added an option `--manual` to export URLs of sampled genomes during database construction, and let the user download them manually.

### Changed
- Updated pre-built default database to 2021-11-21 (after NCBI RefSeq 209)
- Updated pre-built default database to 2021-11-21 (after NCBI RefSeq 209).
- Repository transferred from [DittmarLab](https://github.com/DittmarLab) to [qiyunlab](https://github.com/qiyunlab).
- Updated recommended dependency versions; however, the program should continue to be compatible with previous versions.
- Minor tweaks with no visible impact on program behavior.
4 changes: 2 additions & 2 deletions README.md
@@ -48,7 +48,7 @@ Build a reference [database](doc/database.md) using the default protocol:
hgtector database -o db_dir --default
```

Or [download](https://www.dropbox.com/s/tszxy9etp52id3u/hgtdb_20211121.tar.xz?dl=0) a pre-built database as of 2021-11-21, and [compile](doc/database.md#Manual-compiling) it.
Or download a [pre-built database](doc/database.md#pre-built-databases) as of 2023-01-02, and [compile](doc/database.md#Manual-compiling) it.

Prepare input file(s). They should be multi-Fasta files of amino acid sequences (faa). Each file represents the whole protein set of a complete or partial genome.

@@ -71,7 +71,7 @@ It is recommended that you read the [first run](doc/1strun.md), [second run](doc

## License

Copyright (c) 2013-2021, [Qiyun Zhu](mailto:qiyunzhu@gmail.com) and [Katharina Dittmar](mailto:katharinad@gmail.com). Licensed under [BSD 3-clause](http://opensource.org/licenses/BSD-3-Clause). See full license [statement](LICENSE).
Copyright (c) 2013-2023, [Qiyun Zhu](mailto:qiyunzhu@gmail.com) and [Katharina Dittmar](mailto:katharinad@gmail.com). Licensed under [BSD 3-clause](http://opensource.org/licenses/BSD-3-Clause). See full license [statement](LICENSE).


## Citation
2 changes: 1 addition & 1 deletion doc/2ndrun.md
@@ -23,7 +23,7 @@ Otherwise, let's now make a very small test database for this quick demo. This d
hgtector database -c microbe -t 2 -s 1 -r superkingdom --reference --compile diamond -o <output_dir>
```

There will be around 120 genomes in this catalog. The multi-Fasta file of protein sequences is around 150 MB. If you have any trouble getting this database built automatically, we provide a sample database, **ref107**, for [download](https://www.dropbox.com/s/46v3uc708rvc5rc/ref107.tar.xz?dl=0).
There will be around 120 genomes in this catalog. The multi-Fasta file of protein sequences is around 150 MB. If you have any trouble getting this database built automatically, we provide a sample database, **ref107**, for [download](https://arizonastateu-my.sharepoint.com/:u:/g/personal/qzhu44_asurite_asu_edu/ESdzGQjPMCREsrnyhXw4DvEBEOtFQo_72yhoYN7LG4LjzQ?e=40pCZO) (or [here](https://www.dropbox.com/s/46v3uc708rvc5rc/ref107.tar.xz?dl=0)).

Note, however, that with this small database, one really can't expect very high sensitivity in HGT prediction. We will address this later in the tutorial.

18 changes: 12 additions & 6 deletions doc/database.md
@@ -2,7 +2,7 @@ Database
========

## Index
- [Overview](#overview) | [Default protocol](#default-protocol) | [Pre-built database](#pre-built-database) | [Database files](#database-files) | [Considerations](#considerations) | [Command-line reference](#command-line-reference)
- [Overview](#overview) | [Default protocol](#default-protocol) | [Pre-built databases](#pre-built-databases) | [Database files](#database-files) | [Considerations](#considerations) | [Command-line reference](#command-line-reference)

## Overview

@@ -36,15 +36,21 @@ hgtector database --output <output_dir> --cats microbe --sample 1 --rank species
```


## Pre-built database
## Pre-built databases

A database built using the default protocol on 2021-11-21 is available for [download](https://www.dropbox.com/s/tszxy9etp52id3u/hgtdb_20211121.tar.xz?dl=0) \([MD5](https://www.dropbox.com/s/kdopz946pk088wr/hgtdb_20211121.tar.xz.md5?dl=0)\). It needs to be [compiled](#Manual-compiling) using choice of aligner.
Databases built using the default protocol are available for download from either of the following two repositories. They need to be [compiled](#Manual-compiling) using the aligner of your choice prior to use.

This database, sampled from NCBI RefSeq after release, 209 contains 68,977,351 unique protein sequences from 21,754 microbial genomes, representing 3 domains, 74 phyla, 145 classes, 337 orders, 783 families, 3,753 genera and 15,932 species.
OneDrive for Business by Arizona State University:

Building this database used a maximum of 63 GB memory. Searching this database using DIAMOND v2.0.13 requires ~7 GB memory.
- https://arizonastateu-my.sharepoint.com/:f:/g/personal/qzhu44_asurite_asu_edu/ErLl2qExtFhAiS1J0sCpZqgBEebKHtBilj1IDlaitOVZXg

A previous version of the database built on 2019-10-21 is available [here](https://www.dropbox.com/s/qdnfgzdcjadlm4i/hgtdb_20191021.tar.xz?dl=0).
Dropbox of Dr. Qiyun Zhu:

- https://www.dropbox.com/sh/tevabydz6palfih/AAB-TitXKNfQl5dmnZM1VfRca?dl=0

The current release, built on 2023-01-02 (sampled after NCBI RefSeq release 215), contains 129,809,746 unique protein sequences from 40,310 microbial genomes, representing 3 domains, 75 phyla, 153 classes, 356 orders, 847 families, 4,095 genera and 40,309 species.

Previous releases, as well as a small database for testing purposes ("ref107"), are also provided at the two repositories.
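
A downloaded archive can be checked and unpacked before compiling. The following is a minimal sketch and not part of this commit: it assumes GNU `md5sum` and a `tar` with xz support are installed, and that an MD5 digest is published alongside the archive (as it was for the 2021-11-21 release). The function name `verify_md5` and the archive name in the usage comment are hypothetical.

```bash
# Hypothetical helper: compare a file's MD5 digest with an expected value.
# Usage: verify_md5 <file> <expected_md5_hex>
# Returns 0 if the digests match, non-zero otherwise.
verify_md5() {
    local file="$1" expected="$2" actual
    # md5sum prints "<digest>  <filename>"; keep only the digest.
    actual="$(md5sum "$file" | awk '{print $1}')"
    [ "$actual" = "$expected" ]
}

# Example usage (archive name is an assumption; substitute the file you downloaded):
#   archive=hgtdb_YYYYMMDD.tar.xz
#   verify_md5 "$archive" "$(awk '{print $1}' "$archive.md5")" && tar -xJf "$archive"
```

Unpacking only after the digest matches avoids compiling a database from a truncated or corrupted download.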


## Database files
6 changes: 3 additions & 3 deletions doc/install.md
@@ -25,7 +25,7 @@ pip install git+https://github.com/qiyunlab/HGTector.git

### Option 2: Native installation

Download this [repository](https://github.com/qiyunlab/HGTector/archive/master.zip). Unzip. Then execute:
Download this [repository](https://github.com/qiyunlab/HGTector/archive/master.zip) or any of the [releases](https://github.com/qiyunlab/HGTector/releases). Unzip. Then execute:

```bash
python setup.py install
@@ -49,9 +49,9 @@ conda install -c bioconda diamond blast

HGTector has a command `database` for automated database construction. It defaults to the **NCBI** RefSeq microbial genomes and taxonomy. Meanwhile, we also provide instructions for using **GTDB** and custom databases. See [details](database.md).

A standard database built using the default protocol on 2021-11-21 is available for [download](https://www.dropbox.com/s/tszxy9etp52id3u/hgtdb_20211121.tar.xz?dl=0) \([MD5](https://www.dropbox.com/s/kdopz946pk088wr/hgtdb_20211121.tar.xz.md5?dl=0)\), together with [instruction](database.md#Manual-compiling) for compiling.
A standard database built using the default protocol on 2023-01-02 is available for download ([OneDrive](https://arizonastateu-my.sharepoint.com/:f:/g/personal/qzhu44_asurite_asu_edu/ErLl2qExtFhAiS1J0sCpZqgBEebKHtBilj1IDlaitOVZXg) or [Dropbox](https://www.dropbox.com/sh/tevabydz6palfih/AAB-TitXKNfQl5dmnZM1VfRca?dl=0)), together with [instructions](database.md#Manual-compiling) for compiling.

A small, pre-compiled test database is also available for [download](https://www.dropbox.com/s/46v3uc708rvc5rc/ref107.tar.xz?dl=0).
A small, pre-compiled test database is also available for download from the same websites.
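
For orientation, compiling a protein database with DIAMOND might look roughly like the sketch below. It is not taken from this commit, and the linked compiling instructions remain the authoritative procedure. The function name `compile_diamond_db` and all of its arguments are hypothetical; the sketch assumes `diamond` is on the `PATH`, that the taxonomy mapping file is already in the accession-to-taxid format DIAMOND accepts, and that the taxonomy directory holds the standard NCBI `nodes.dmp` and `names.dmp` files.

```bash
# Hypothetical wrapper around "diamond makedb".
# Usage: compile_diamond_db <proteins.faa> <taxon_map> <taxdump_dir> <out_prefix> [threads]
compile_diamond_db() {
    local faa="$1" taxonmap="$2" taxdump="$3" out_prefix="$4" threads="${5:-1}"
    # Make sure the directory that will hold the compiled database exists.
    mkdir -p "$(dirname "$out_prefix")"
    # Build a taxonomy-aware DIAMOND database from the protein sequences.
    diamond makedb \
        --threads "$threads" \
        --in "$faa" \
        --taxonmap "$taxonmap" \
        --taxonnodes "$taxdump/nodes.dmp" \
        --taxonnames "$taxdump/names.dmp" \
        --db "$out_prefix"
}

# Example usage (all paths are placeholders):
#   compile_diamond_db db.faa accession2taxid.tsv taxdump diamond/db 16
```

Passing every path as an argument keeps the sketch independent of how a particular database release is laid out on disk.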


## Upgrade
