diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index d182bea..f7eb3a6 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -74,9 +74,6 @@ jobs: uses: actions/setup-python@v2 with: python-version: ${{ matrix.python-version }} - - name: Set default for downloads - shell: bash - run: echo "DOWNLOADS_ENABLED=false" >> $GITHUB_ENV - name: Determine if downloads are enabled for this job # for testing, limit downloads from the resource servers to only the selected job for # PRs and the master branch; note that the master branch is tested weekly via `cron`, @@ -94,21 +91,9 @@ jobs: echo "DOWNLOADS_ENABLED=true" >> $GITHUB_ENV fi - name: Install dependencies - shell: bash - env: - AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }} - AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }} run: | - pip install pytest-cov awscli + pip install pytest-cov pip install . - if [[ $DOWNLOADS_ENABLED == "false" ]]; then - # use cached resources on Amazon S3 - aws s3 cp s3://lineage-resources/resources.tar.gz resources.tar.gz - if [[ -f resources.tar.gz ]]; then - tar -xzf resources.tar.gz - rm resources.tar.gz - fi - fi - name: Ensure Python and source code are on same drive (Windows) if: ${{ matrix.os == 'windows-latest' }} shell: cmd diff --git a/CONTRIBUTING.rst b/CONTRIBUTING.rst index ab648a6..9130af0 100644 --- a/CONTRIBUTING.rst +++ b/CONTRIBUTING.rst @@ -56,6 +56,12 @@ To set up ``lineage`` for local development: $ pipenv run pytest --cov-report=html --cov=lineage tests + .. note:: Downloads during tests are disabled by default. To enable downloads, set + the environment variable ``DOWNLOADS_ENABLED=true``. + + .. note:: If you receive errors when running the tests, you may need to specify the temporary + directory with an environment variable, e.g., ``TMPDIR="/path/to/tmp/dir"``. + .. note:: After running the tests, a coverage report can be viewed by opening ``htmlcov/index.html`` in a browser. diff --git a/README.rst b/README.rst index 28ab04e..1b791a6 100644 --- a/README.rst +++ b/README.rst @@ -9,9 +9,9 @@ lineage Capabilities ------------ -- Compute centiMorgans (cMs) of shared DNA between individuals using the HapMap Phase II genetic map +- Find shared DNA and genes between individuals +- Compute centiMorgans (cMs) of shared DNA using a variety of genetic maps (e.g., HapMap Phase II, 1000 Genomes Project) - Plot shared DNA between individuals -- Determine genes shared between individuals (i.e., genes transcribed from shared DNA segments) - Find discordant SNPs between child and parent(s) - Read, write, merge, and remaps SNPs for an individual via the `snps `_ package @@ -139,11 +139,12 @@ Not counting mtDNA SNPs, there are 37 discordant SNPs between these two datasets Find Shared DNA ''''''''''''''' ``lineage`` uses the probabilistic recombination rates throughout the human genome from the -`International HapMap Project `_ to -compute the shared DNA (in centiMorgans) between two individuals. Additionally, ``lineage`` -denotes when the shared DNA is shared on either one or both chromosomes in a pair. For example, -when siblings share a segment of DNA on both chromosomes, they inherited the same DNA from their -mother and father for that segment. +`International HapMap Project `_ +and the `1000 Genomes Project `_ to compute the shared DNA +(in centiMorgans) between two individuals. Additionally, ``lineage`` denotes when the shared DNA +is shared on either one or both chromosomes in a pair. For example, when siblings share a segment +of DNA on both chromosomes, they inherited the same DNA from their mother and father for that +segment. With that background, let's find the shared DNA between the ``User662`` and ``User663`` datasets, calculating the centiMorgans of shared DNA and plotting the results: @@ -151,8 +152,8 @@ calculating the centiMorgans of shared DNA and plotting the results: >>> results = l.find_shared_dna([user662, user663], cM_threshold=0.75, snp_threshold=1100) Downloading resources/genetic_map_HapMapII_GRCh37.tar.gz Downloading resources/cytoBand_hg19.txt.gz -Saving output/shared_dna_User662_User663.png -Saving output/shared_dna_one_chrom_User662_User663_GRCh37.csv +Saving output/shared_dna_User662_User663_HapMap2.png +Saving output/shared_dna_one_chrom_User662_User663_GRCh37_HapMap2.csv Notice that the centiMorgan and SNP thresholds for each DNA segment can be tuned. Additionally, notice that two files were downloaded to facilitate the analysis and plotting - future analyses @@ -177,11 +178,11 @@ created; these files are detailed in the documentation and their generation can ``save_output=False`` argument. In this example, the output files consist of a CSV file that details the shared segments of DNA on one chromosome and a plot that illustrates the shared DNA: -.. image:: https://raw.githubusercontent.com/apriha/lineage/master/docs/images/shared_dna_User662_User663.png +.. image:: https://raw.githubusercontent.com/apriha/lineage/master/docs/images/shared_dna_User662_User663_HapMap2.png Find Shared Genes ''''''''''''''''' -The `Central Dogma of Molecular Biology `_ +The `Central Dogma of Molecular Biology `_ states that genetic information flows from DNA to mRNA to proteins: DNA is transcribed into mRNA, and mRNA is translated into a protein. It's more complicated than this (it's biology after all), but generally, one mRNA produces one protein, and the mRNA / protein is considered a @@ -205,21 +206,27 @@ Loading SNPs('resources/4583.ftdna-illumina.3482.csv.gz') >>> user4584 = l.create_individual('User4584', 'resources/4584.ftdna-illumina.3483.csv.gz') Loading SNPs('resources/4584.ftdna-illumina.3483.csv.gz') -Now let's find the shared genes: +Now let's find the shared genes, specifying a +`population-specific `_ +1000 Genomes Project genetic map (e.g., as predicted by `ezancestry `_!): ->>> results = l.find_shared_dna([user4583, user4584], shared_genes=True) +>>> results = l.find_shared_dna([user4583, user4584], shared_genes=True, genetic_map="CEU") +Downloading resources/CEU_omni_recombination_20130507.tar Downloading resources/knownGene_hg19.txt.gz Downloading resources/kgXref_hg19.txt.gz -Saving output/shared_dna_User4583_User4584.png -Saving output/shared_dna_one_chrom_User4583_User4584_GRCh37.csv -Saving output/shared_dna_two_chroms_User4583_User4584_GRCh37.csv -Saving output/shared_genes_one_chrom_User4583_User4584_GRCh37.csv -Saving output/shared_genes_two_chroms_User4583_User4584_GRCh37.csv +Saving output/shared_dna_User4583_User4584_CEU.png +Saving output/shared_dna_one_chrom_User4583_User4584_GRCh37_CEU.csv +Saving output/shared_dna_two_chroms_User4583_User4584_GRCh37_CEU.csv +Saving output/shared_genes_one_chrom_User4583_User4584_GRCh37_CEU.csv +Saving output/shared_genes_two_chroms_User4583_User4584_GRCh37_CEU.csv The plot that illustrates the shared DNA is shown below. Note that in addition to outputting the shared DNA segments on either one or both chromosomes, the shared genes on either one or both chromosomes are also output. +.. note:: Shared DNA is not computed on the X chromosome with the 1000 Genomes Project genetic + maps since the X chromosome is not included in these genetic maps. + In this example, there are 15,976 shared genes on both chromosomes transcribed from 36 segments of shared DNA: @@ -228,7 +235,7 @@ of shared DNA: >>> len(results['two_chrom_shared_dna']) 36 -.. image:: https://raw.githubusercontent.com/apriha/lineage/master/docs/images/shared_dna_User4583_User4584.png +.. image:: https://raw.githubusercontent.com/apriha/lineage/master/docs/images/shared_dna_User4583_User4584_CEU.png Documentation ------------- diff --git a/docs/images/shared_dna_User4583_User4584.png b/docs/images/shared_dna_User4583_User4584.png deleted file mode 100644 index 0b9cac8..0000000 Binary files a/docs/images/shared_dna_User4583_User4584.png and /dev/null differ diff --git a/docs/images/shared_dna_User4583_User4584_CEU.png b/docs/images/shared_dna_User4583_User4584_CEU.png new file mode 100644 index 0000000..d9f12c9 Binary files /dev/null and b/docs/images/shared_dna_User4583_User4584_CEU.png differ diff --git a/docs/images/shared_dna_User662_User663.png b/docs/images/shared_dna_User662_User663_HapMap2.png similarity index 100% rename from docs/images/shared_dna_User662_User663.png rename to docs/images/shared_dna_User662_User663_HapMap2.png diff --git a/docs/output_files.rst b/docs/output_files.rst index 5008a71..9360b68 100644 --- a/docs/output_files.rst +++ b/docs/output_files.rst @@ -57,25 +57,26 @@ In the filenames below, ``name1`` is the name of the first :class:`~lineage.individual.Individual` and ``name2`` is the name of the second :class:`~lineage.individual.Individual`. (If more individuals are compared, all :class:`~lineage.individual.Individual` names will be included in the filenames and plot titles -using the same conventions.) +using the same conventions.) Additionally, ``genetic_map`` corresponds to the genetic map used +in the calculations of shared DNA, specified as a parameter to :meth:`~lineage.Lineage.find_shared_dna`. .. note:: Genetic maps do not have recombination rates for the Y chromosome since the Y chromosome does not recombine. Therefore, shared DNA will not be shown on the Y chromosome. -shared_dna__.png -`````````````````````````````` +shared_dna___.png +```````````````````````````````````````````` This plot illustrates shared DNA (i.e., no shared DNA, shared DNA on one chromosome, and shared DNA on both chromosomes). The centromere for each chromosome is also detailed. Two examples of this plot are shown below. -.. image:: https://raw.githubusercontent.com/apriha/lineage/master/docs/images/shared_dna_User662_User663.png +.. image:: https://raw.githubusercontent.com/apriha/lineage/master/docs/images/shared_dna_User662_User663_HapMap2.png In the above plot, note that the two individuals only share DNA on one chromosome. In this plot, the larger regions where "No shared DNA" is indicated are due to SNPs not being available in those regions (i.e., SNPs were not tested in those regions). -.. image:: https://raw.githubusercontent.com/apriha/lineage/master/docs/images/shared_dna_User4583_User4584.png +.. image:: https://raw.githubusercontent.com/apriha/lineage/master/docs/images/shared_dna_User4583_User4584_CEU.png In the above plot, the areas where "No shared DNA" is indicated are the regions where SNPs were not tested or where DNA is not shared. The areas where "One chromosome shared" is indicated are @@ -85,8 +86,8 @@ shared" is indicated are regions where the individuals share DNA on both chromos Note that the regions where DNA is shared on both chromosomes is a subset of the regions where one chromosome is shared. -shared_dna_one_chrom___GRCh37.csv -``````````````````````````````````````````````` +shared_dna_one_chrom___GRCh37_.csv +````````````````````````````````````````````````````````````` If DNA is shared on one chromosome, a CSV file details the shared segments of DNA. ======= =========== @@ -100,8 +101,8 @@ cMs CentiMorgans of matching DNA segment snps Number of SNPs in matching DNA segment ======= =========== -shared_dna_two_chroms___GRCh37.csv -```````````````````````````````````````````````` +shared_dna_two_chroms___GRCh37_.csv +`````````````````````````````````````````````````````````````` If DNA is shared on two chromosomes, a CSV file details the shared segments of DNA. ======= =========== @@ -128,11 +129,11 @@ In the filenames below, ``name1`` is the name of the first :class:`~lineage.individual.Individual` names will be included in the filenames using the same convention.) -shared_genes_one_chrom___GRCh37.csv -````````````````````````````````````````````````` +shared_genes_one_chrom___GRCh37_.csv +``````````````````````````````````````````````````````````````` If DNA is shared on one chromosome, this file details the genes shared between the individuals on at least one chromosome; these genes are located in the shared DNA segments specified in -`shared_dna_one_chrom___GRCh37.csv`_. +`shared_dna_one_chrom___GRCh37_.csv`_. =========== ============ Column* Description* @@ -151,10 +152,10 @@ description Description \* `UCSC Genome Browser `_ / `UCSC Table Browser `_ -shared_genes_two_chroms___GRCh37.csv -`````````````````````````````````````````````````` +shared_genes_two_chroms___GRCh37_.csv +```````````````````````````````````````````````````````````````` If DNA is shared on both chromosomes in a pair, this file details the genes shared between the individuals on both chromosomes; these genes are located in the shared DNA segments specified in -`shared_dna_two_chroms___GRCh37.csv`_. +`shared_dna_two_chroms___GRCh37_.csv`_. -The file has the same columns as `shared_genes_one_chrom___GRCh37.csv`_. +The file has the same columns as `shared_genes_one_chrom___GRCh37_.csv`_. diff --git a/src/lineage/__init__.py b/src/lineage/__init__.py index 0e078fd..9ca6d78 100644 --- a/src/lineage/__init__.py +++ b/src/lineage/__init__.py @@ -280,12 +280,13 @@ def find_shared_dna( snp_threshold=1100, shared_genes=False, save_output=True, + genetic_map="HapMap2", ): """ Find the shared DNA between individuals. - Computes the genetic distance in centiMorgans (cMs) between SNPs using the HapMap Phase II - GRCh37 genetic map. Applies thresholds to determine the shared DNA. Plots shared DNA. - Optionally determines shared genes (i.e., genes that are transcribed from the shared DNA). + Computes the genetic distance in centiMorgans (cMs) between SNPs using the specified genetic + map. Applies thresholds to determine the shared DNA. Plots shared DNA. Optionally determines + shared genes (i.e., genes transcribed from the shared DNA). All output is saved to the output directory as `CSV` or `PNG` files. @@ -300,6 +301,16 @@ def find_shared_dna( determine shared genes save_output : bool specifies whether to save output files in the output directory + genetic_map : {'HapMap2', 'ACB', 'ASW', 'CDX', 'CEU', 'CHB', 'CHS', 'CLM', 'FIN', 'GBR', 'GIH', 'IBS', 'JPT', 'KHV', 'LWK', 'MKK', 'MXL', 'PEL', 'PUR', 'TSI', 'YRI'} + genetic map to use for computation of shared DNA; `HapMap2` corresponds to the HapMap + Phase II genetic map from the + `International HapMap Project `_ + and all others correspond to the + `population-specific `_ + genetic maps generated from the + `1000 Genomes Project `_ phased OMNI data. + Note that shared DNA is not computed on the X chromosome with the 1000 Genomes + Project genetic maps since the X chromosome is not included in these genetic maps. Returns ------- @@ -339,6 +350,18 @@ def find_shared_dna( two_chrom_discrepant_snps, ) + genetic_map_dfs = self._resources.get_genetic_map(genetic_map) + + if len(genetic_map_dfs) == 0: + return self._find_shared_dna_return_helper( + one_chrom_shared_dna, + two_chrom_shared_dna, + one_chrom_shared_genes, + two_chrom_shared_genes, + one_chrom_discrepant_snps, + two_chrom_discrepant_snps, + ) + cols = ["genotype{}".format(str(i)) for i in range(len(individuals))] df = individuals[0].snps @@ -351,19 +374,17 @@ def find_shared_dna( one_x_chrom = self._is_one_individual_male(individuals) - genetic_map = self._resources.get_genetic_map_HapMapII_GRCh37() - tasks = [] chroms_to_drop = [] for chrom in df["chrom"].unique(): - if chrom not in genetic_map.keys(): + if chrom not in genetic_map_dfs.keys(): chroms_to_drop.append(chrom) continue tasks.append( { - "genetic_map": genetic_map[chrom], + "genetic_map": genetic_map_dfs[chrom], # get positions for the current chromosome "snps": pd.DataFrame(df.loc[(df["chrom"] == chrom)]["pos"]), } @@ -433,6 +454,7 @@ def find_shared_dna( two_chrom_shared_dna, one_chrom_shared_genes, two_chrom_shared_genes, + genetic_map, ) return self._find_shared_dna_return_helper( @@ -479,6 +501,7 @@ def _find_shared_dna_output_helper( two_chrom_shared_dna, one_chrom_shared_genes, two_chrom_shared_genes, + genetic_map, ): cytobands = self._resources.get_cytoBand_hg19() @@ -498,14 +521,17 @@ def _find_shared_dna_output_helper( two_chrom_shared_dna, cytobands, os.path.join( - self._output_dir, "shared_dna_{}.png".format(individuals_filename) + self._output_dir, + f"shared_dna_{individuals_filename}_{genetic_map}.png", ), - "{} shared DNA".format(individuals_plot_title), + f"{individuals_plot_title} shared DNA", 37, ) if len(one_chrom_shared_dna) > 0: - file = "shared_dna_one_chrom_{}_GRCh37.csv".format(individuals_filename) + file = ( + f"shared_dna_one_chrom_{individuals_filename}_GRCh37_{genetic_map}.csv" + ) save_df_as_csv( one_chrom_shared_dna, self._output_dir, @@ -516,7 +542,9 @@ def _find_shared_dna_output_helper( ) if len(two_chrom_shared_dna) > 0: - file = "shared_dna_two_chroms_{}_GRCh37.csv".format(individuals_filename) + file = ( + f"shared_dna_two_chroms_{individuals_filename}_GRCh37_{genetic_map}.csv" + ) save_df_as_csv( two_chrom_shared_dna, self._output_dir, @@ -527,7 +555,7 @@ def _find_shared_dna_output_helper( ) if len(one_chrom_shared_genes) > 0: - file = "shared_genes_one_chrom_{}_GRCh37.csv".format(individuals_filename) + file = f"shared_genes_one_chrom_{individuals_filename}_GRCh37_{genetic_map}.csv" save_df_as_csv( one_chrom_shared_genes, self._output_dir, @@ -537,7 +565,7 @@ def _find_shared_dna_output_helper( ) if len(two_chrom_shared_genes) > 0: - file = "shared_genes_two_chroms_{}_GRCh37.csv".format(individuals_filename) + file = f"shared_genes_two_chroms_{individuals_filename}_GRCh37_{genetic_map}.csv" save_df_as_csv( two_chrom_shared_genes, self._output_dir, diff --git a/src/lineage/resources.py b/src/lineage/resources.py index b74ee3a..c624120 100644 --- a/src/lineage/resources.py +++ b/src/lineage/resources.py @@ -76,11 +76,61 @@ def __init__(self, resources_dir="resources"): """ super().__init__(resources_dir=resources_dir) - self._genetic_map_HapMapII_GRCh37 = {} + self._genetic_map = {} + self._genetic_map_name = "" self._cytoBand_hg19 = pd.DataFrame() self._knownGene_hg19 = pd.DataFrame() self._kgXref_hg19 = pd.DataFrame() + def get_genetic_map(self, genetic_map): + """ Get specified genetic map. + + Parameters + ---------- + genetic_map : {'HapMap2', 'ACB', 'ASW', 'CDX', 'CEU', 'CHB', 'CHS', 'CLM', 'FIN', 'GBR', 'GIH', 'IBS', 'JPT', 'KHV', 'LWK', 'MKK', 'MXL', 'PEL', 'PUR', 'TSI', 'YRI'} + `HapMap2` corresponds to the HapMap Phase II genetic map from the + `International HapMap Project `_ + and all others correspond to the + `population-specific `_ + genetic maps generated from the + `1000 Genomes Project `_ phased OMNI data. + + Returns + ------- + dict + dict of pandas.DataFrame genetic maps if loading was successful, else {} + """ + if genetic_map not in [ + "HapMap2", + "ACB", + "ASW", + "CDX", + "CEU", + "CHB", + "CHS", + "CLM", + "FIN", + "GBR", + "GIH", + "IBS", + "JPT", + "KHV", + "LWK", + "MKK", + "MXL", + "PEL", + "PUR", + "TSI", + "YRI", + ]: + logger.warning("Invalid genetic map") + return {} + + if genetic_map == "HapMap2": + return self.get_genetic_map_HapMapII_GRCh37() + else: + return self.get_genetic_map_1000G_GRCh37(genetic_map) + def get_genetic_map_HapMapII_GRCh37(self): """ Get International HapMap Consortium HapMap Phase II genetic map for Build 37. @@ -98,12 +148,67 @@ def get_genetic_map_HapMapII_GRCh37(self): paper (Nature, 18th Sept 2007). The conversion from b35 to GRCh37 was achieved using the UCSC liftOver tool. Adam Auton, 08/12/2010" """ - if not self._genetic_map_HapMapII_GRCh37: - self._genetic_map_HapMapII_GRCh37 = self._load_genetic_map( + if self._genetic_map_name != "HapMap2": + self._genetic_map = self._load_genetic_map_HapMapII_GRCh37( self._get_path_genetic_map_HapMapII_GRCh37() ) + self._genetic_map_name = "HapMap2" + + return self._genetic_map + + def get_genetic_map_1000G_GRCh37(self, pop): + """ Get population-specific 1000 Genomes Project genetic map. + + Notes + ----- + From `README_omni_recombination_20130507 `_: + + Genetic maps generated from the 1000G phased OMNI data. + + [Build 37] OMNI haplotypes were obtained from the Phase 1 dataset + (`/vol1/ftp/phase1/analysis_results/supporting/omni_haplotypes/ `_). - return self._genetic_map_HapMapII_GRCh37 + Genetic maps were generated for each population separately using LDhat + (http://ldhat.sourceforge.net/). Haplotypes were split into 2000 SNP windows + with an overlap of 200 SNPs between each window. The recombination rate was + estimated for each window independently, using a block penalty of 5 for a + total of 22.5 million iterations with a sample being taken from the MCMC + chain every 15,000 iterations. The first 7.5 million iterations were + discarded as burn in. Once rates were estimated, windows were merged by + splicing the estimates at the mid-point of the overlapping regions. + + LDhat estimates the population genetic recombination rate, rho = 4Ner. In + order to convert to per-generation rates (measured in cM/Mb), the LDhat + rates were compared to pedigree-based rates from Kong et al. (2010). + Specifically, rates were binned at the 5Mb scale, and a linear regression + performed between the two datasets. The gradient of this line gives an + estimate of 4Ne, allowing the population based rates to be converted to + per-generation rates. + + Adam Auton (adam.auton@einstein.yu.edu) + May 7th, 2013 + + Returns + ------- + dict + dict of pandas.DataFrame population-specific 1000 Genomes Project genetic maps if + loading was successful, else {} + + References + ---------- + 1. Adam Auton (adam.auton@einstein.yu.edu), May 7th, 2013 + 2. The 1000 Genomes Project Consortium., Corresponding authors., Auton, A. et al. A + global reference for human genetic variation. Nature 526, 68–74 (2015). + https://doi.org/10.1038/nature15393 + 3. https://github.com/joepickrell/1000-genomes-genetic-maps + """ + if self._genetic_map_name != pop: + self._genetic_map = self._load_genetic_map_1000G_GRCh37( + self._get_path_genetic_map_1000G_GRCh37(pop) + ) + self._genetic_map_name = pop + + return self._genetic_map def get_cytoBand_hg19(self): """ Get UCSC cytoBand table for Build 37. @@ -210,8 +315,8 @@ def get_all_resources(self): return resources @staticmethod - def _load_genetic_map(filename): - """ Load genetic map (e.g. HapMapII). + def _load_genetic_map_HapMapII_GRCh37(filename): + """ Load HapMapII genetic map. Parameters ---------- @@ -229,7 +334,7 @@ def _load_genetic_map(filename): """ genetic_map = {} - with tarfile.open(filename, "r") as tar: + with tarfile.open(filename, "r:gz") as tar: # http://stackoverflow.com/a/2018576 for member in tar.getmembers(): if "genetic_map" in member.name: @@ -255,6 +360,47 @@ def _load_genetic_map(filename): return genetic_map + @staticmethod + def _load_genetic_map_1000G_GRCh37(filename): + """ Load 1000 Genomes Project genetic map. + + Parameters + ---------- + filename : str + path to archive with compressed genetic map data + + Returns + ------- + genetic_map : dict + dict of pandas.DataFrame genetic maps + + Notes + ----- + Keys of returned dict are chromosomes and values are the corresponding genetic map. + """ + genetic_map = {} + + with tarfile.open(filename, "r") as tar: + # http://stackoverflow.com/a/2018576 + for member in tar.getmembers(): + df = pd.read_csv( + tar.extractfile(member), + compression="gzip", + sep="\s+", + usecols=["Position(bp)", "Rate(cM/Mb)", "Map(cM)"], + ) + df = df.rename( + columns={ + "Position(bp)": "pos", + "Rate(cM/Mb)": "rate", + "Map(cM)": "map", + } + ) + chrom = member.name.split("-")[1] + genetic_map[chrom] = df + + return genetic_map + @staticmethod def _load_cytoBand(filename): """ Load UCSC cytoBand table. @@ -374,6 +520,21 @@ def _get_path_genetic_map_HapMapII_GRCh37(self): "genetic_map_HapMapII_GRCh37.tar.gz", ) + def _get_path_genetic_map_1000G_GRCh37(self, pop): + """ Get local path to population-specific 1000 Genomes Project genetic map, + downloading if necessary. + + Returns + ------- + str + path to {pop}_omni_recombination_20130507.tar + """ + filename = f"{pop}_omni_recombination_20130507.tar" + return self._download_file( + f"ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130507_omni_recombination_rates/{filename}", + filename, + ) + def _get_path_knownGene_hg19(self): """ Get local path to knownGene file for hg19 / GRCh37 from UCSC, downloading if necessary. diff --git a/tests/__init__.py b/tests/__init__.py index db467d2..f420367 100644 --- a/tests/__init__.py +++ b/tests/__init__.py @@ -142,3 +142,16 @@ def generic_snps(self): pos=list(range(101, 109)), genotype=["AA", "CC", "GG", "TT", np.nan, "GC", "TC", "AT"], ) + + @property + def downloads_enabled(self): + """ Property indicating if downloads are enabled. + + Only download from external resources when an environment variable named + "DOWNLOADS_ENABLED" is set to "true". + + Returns + ------- + bool + """ + return True if os.getenv("DOWNLOADS_ENABLED") == "true" else False diff --git a/tests/test_lineage.py b/tests/test_lineage.py index 4be13f3..d329d84 100644 --- a/tests/test_lineage.py +++ b/tests/test_lineage.py @@ -24,6 +24,8 @@ """ import os +import tempfile +from unittest.mock import Mock, mock_open, patch import warnings import numpy as np @@ -43,15 +45,186 @@ def get_discordant_snps(self, ind, df): return ind - def test_download_example_datasets(self): - paths = self.l.download_example_datasets() + def _generate_test_genetic_map( + self, + chrom="1", + pos=(1, 111700001), + rate=(140.443968 / (111700001 / 1e6), 0), + map_cMs=(0.000000, 140.443968), + **kwargs, + ): + return { + chrom: pd.DataFrame({"pos": pos, "rate": rate, "map": map_cMs,}), + } + + def _generate_test_cytoBand_hg19(self): + return pd.DataFrame( + { + "chrom": ["1"], + "start": [0], + "end": [111800001], + "name": ["test"], + "gie_stain": ["gneg"], + } + ) + + def _generate_test_gene_dfs( + self, + chrom="1", + len1=3811, + len2=4000, + txStart1=1000000, + txEnd1=2000000, + txStart2=111600000, + txEnd2=111800000, + **kwargs, + ): + diff = len2 - len1 + txStart = [txStart1] * len1 + txStart.extend([txStart2] * diff) + txEnd = [txEnd1] * len1 + txEnd.extend([txEnd2] * diff) + + kg = pd.DataFrame( + { + "name": [f"g{i}" for i in range(len2)], + "chrom": [chrom] * len2, + "strand": ["+"] * len2, + "txStart": txStart, + "txEnd": txEnd, + "proteinID": ["g"] * len2, + } + ) + kg.set_index("name", inplace=True) + + kgXref = pd.DataFrame( + { + "kgID": [f"g{i}" for i in range(len2)], + "geneSymbol": ["s"] * len2, + "refseq": ["s"] * len2, + "description": ["s"] * len2, + } + ) + kgXref.set_index("kgID", inplace=True) + + return kg, kgXref + + def run_find_shared_dna_test_X(self, f): + self.run_find_shared_dna_test( + f, + chrom="X", + pos=(1, 2695340, 154929412, 155270560), + rate=( + 20.837792 / (2695340 / 1e6), + 180.837755 / ((154929412 - 2695340) / 1e6), + 0.347344 / ((155270560 - 154929412) / 1e6), + 0, + ), + map_cMs=( + 0.000000, + 20.837792, + 20.837792 + 180.837755, + 20.837792 + 180.837755 + 0.347344, + ), + len1=54, + len2=3022, + txStart1=2400000, + txEnd1=2600000, + txStart2=150000000, + txEnd2=155000000, + ) + def run_find_shared_dna_test_1000G(self, f): + self.run_find_shared_dna_test( + f, + HapMap2=False, + pos=(1, 43800001), + rate=(63.0402663602 / (43800001 / 1e6), 0), + map_cMs=(0.0, 63.0402663602), + len1=2188, + ) + + def run_find_shared_dna_test(self, f, HapMap2=True, **kwargs): + if self.downloads_enabled: + f() + else: + genetic_map_patch = ( + "lineage.resources.Resources.get_genetic_map_HapMapII_GRCh37" + if HapMap2 + else "lineage.resources.Resources.get_genetic_map_1000G_GRCh37" + ) + genetic_map = self._generate_test_genetic_map(**kwargs) + cytoband = self._generate_test_cytoBand_hg19() + kg, kgXref = self._generate_test_gene_dfs(**kwargs) + + with patch( + genetic_map_patch, Mock(return_value=genetic_map), + ): + with patch( + "lineage.resources.Resources.get_cytoBand_hg19", + Mock(return_value=cytoband), + ): + with patch( + "lineage.resources.Resources.get_knownGene_hg19", + Mock(return_value=kg), + ): + with patch( + "lineage.resources.Resources.get_kgXref_hg19", + Mock(return_value=kgXref), + ): + f() + + def _assert_exists(self, files, idx): + for i, file in enumerate(files): + if i in idx: + self.assertTrue(os.path.exists(file)) + + def _assert_does_not_exist(self, files, idx): + for i, file in enumerate(files): + if i in idx: + self.assertFalse(os.path.exists(file)) + + def _make_file_exist_assertions(self, inds, exist="all", genetic_map="HapMap2"): + files = [ + f"output/shared_dna_one_chrom_{inds}_GRCh37_{genetic_map}.csv", + f"output/shared_dna_two_chroms_{inds}_GRCh37_{genetic_map}.csv", + f"output/shared_genes_one_chrom_{inds}_GRCh37_{genetic_map}.csv", + f"output/shared_genes_two_chroms_{inds}_GRCh37_{genetic_map}.csv", + f"output/shared_dna_{inds}_{genetic_map}.png", + ] + + if exist == "all": + self._assert_exists(files, list(range(5))) + elif exist == "none": + self._assert_does_not_exist(files, list(range(5))) + elif exist == "one_chrom": + self._assert_exists(files, [0, 2, 4]) + self._assert_does_not_exist(files, [1, 3]) + elif exist == "plots": + self._assert_exists(files, [4]) + self._assert_does_not_exist(files, list(range(4))) + + def _check_example_paths(self, paths): for path in paths: if path is None or not os.path.exists(path): warnings.warn("Example dataset(s) not currently available") return - assert True + def test_download_example_datasets(self): + if self.downloads_enabled: + paths = self.l.download_example_datasets() + self._check_example_paths(paths) + else: + with tempfile.TemporaryDirectory() as tmpdir: + self.l._resources._resources_dir = tmpdir + + # use a temporary directory for test resource data + with patch("urllib.request.urlopen", mock_open(read_data=b"")): + paths = self.l.download_example_datasets() + + self._check_example_paths(paths) + + self.l._resources._resources_dir = "resources" def test_find_discordant_snps(self): df = pd.read_csv( @@ -142,6 +315,9 @@ def test_find_discordant_snps(self): assert os.path.exists("output/discordant_snps_ind1_ind2_ind3_GRCh37.csv") def test_find_shared_dna_one_ind(self): + self.run_find_shared_dna_test(self._test_find_shared_dna_one_ind) + + def _test_find_shared_dna_one_ind(self): ind1 = self.simulate_snps(self.l.create_individual("ind1")) d = self.l.find_shared_dna([ind1], shared_genes=True) @@ -152,17 +328,35 @@ def test_find_shared_dna_one_ind(self): assert len(d["two_chrom_shared_genes"]) == 0 assert len(d["one_chrom_discrepant_snps"]) == 0 assert len(d["two_chrom_discrepant_snps"]) == 0 - assert not os.path.exists("output/shared_dna_one_chrom_ind1_GRCh37.csv") - assert not os.path.exists("output/shared_dna_two_chroms_ind1_GRCh37.csv") - assert not os.path.exists("output/shared_genes_one_chrom_ind1_GRCh37.csv") - assert not os.path.exists("output/shared_genes_two_chroms_ind1_GRCh37.csv") - assert not os.path.exists("output/shared_dna_ind1.png") + self._make_file_exist_assertions("ind1", exist="none") + + def test_find_shared_dna_invalid_genetic_map(self): + self.run_find_shared_dna_test(self._test_find_shared_dna_invalid_genetic_map) + + def _test_find_shared_dna_invalid_genetic_map(self): + ind1 = self.simulate_snps(self.l.create_individual("ind1")) + ind2 = self.simulate_snps(self.l.create_individual("ind2")) + + d = self.l.find_shared_dna([ind1, ind2], genetic_map="test") + + assert len(d["one_chrom_shared_dna"]) == 0 + assert len(d["two_chrom_shared_dna"]) == 0 + assert len(d["one_chrom_shared_genes"]) == 0 + assert len(d["two_chrom_shared_genes"]) == 0 + assert len(d["one_chrom_discrepant_snps"]) == 0 + assert len(d["two_chrom_discrepant_snps"]) == 0 + self._make_file_exist_assertions("ind1_ind2", exist="none") def test_find_shared_dna_two_chrom_shared(self): + self.run_find_shared_dna_test(self._test_find_shared_dna_two_chrom_shared) + + def _test_find_shared_dna_two_chrom_shared(self): ind1 = self.simulate_snps(self.l.create_individual("ind1")) ind2 = self.simulate_snps(self.l.create_individual("ind2")) - d = self.l.find_shared_dna([ind1, ind2], shared_genes=True) + d = self.l.find_shared_dna( + [ind1, ind2], shared_genes=True, genetic_map="HapMap2" + ) assert len(d["one_chrom_shared_dna"]) == 1 assert len(d["two_chrom_shared_dna"]) == 1 @@ -172,13 +366,39 @@ def test_find_shared_dna_two_chrom_shared(self): assert len(d["two_chrom_discrepant_snps"]) == 0 np.testing.assert_allclose(d["one_chrom_shared_dna"].loc[1]["cMs"], 140.443968) np.testing.assert_allclose(d["two_chrom_shared_dna"].loc[1]["cMs"], 140.443968) - assert os.path.exists("output/shared_dna_one_chrom_ind1_ind2_GRCh37.csv") - assert os.path.exists("output/shared_dna_two_chroms_ind1_ind2_GRCh37.csv") - assert os.path.exists("output/shared_genes_one_chrom_ind1_ind2_GRCh37.csv") - assert os.path.exists("output/shared_genes_two_chroms_ind1_ind2_GRCh37.csv") - assert os.path.exists("output/shared_dna_ind1_ind2.png") + self._make_file_exist_assertions("ind1_ind2") + + def test_find_shared_dna_two_chrom_shared_1000G(self): + self.run_find_shared_dna_test_1000G( + self._test_find_shared_dna_two_chrom_shared_1000G + ) + + def _test_find_shared_dna_two_chrom_shared_1000G(self): + ind1 = self.simulate_snps(self.l.create_individual("ind1"), pos_max=43800002) + ind2 = self.simulate_snps(self.l.create_individual("ind2"), pos_max=43800002) + + d = self.l.find_shared_dna([ind1, ind2], shared_genes=True, genetic_map="CEU") + + assert len(d["one_chrom_shared_dna"]) == 1 + assert len(d["two_chrom_shared_dna"]) == 1 + assert len(d["one_chrom_shared_genes"]) == 2188 + assert len(d["two_chrom_shared_genes"]) == 2188 + assert len(d["one_chrom_discrepant_snps"]) == 0 + assert len(d["two_chrom_discrepant_snps"]) == 0 + np.testing.assert_allclose( + d["one_chrom_shared_dna"].loc[1]["cMs"], 63.0402663602 + ) + np.testing.assert_allclose( + d["two_chrom_shared_dna"].loc[1]["cMs"], 63.0402663602 + ) + self._make_file_exist_assertions("ind1_ind2", genetic_map="CEU") def test_find_shared_dna_two_chrom_shared_three_ind(self): + self.run_find_shared_dna_test( + self._test_find_shared_dna_two_chrom_shared_three_ind + ) + + def _test_find_shared_dna_two_chrom_shared_three_ind(self): ind1 = self.simulate_snps(self.l.create_individual("ind1")) ind2 = self.simulate_snps(self.l.create_individual("ind2")) ind3 = self.simulate_snps(self.l.create_individual("ind3")) @@ -193,15 +413,14 @@ def test_find_shared_dna_two_chrom_shared_three_ind(self): assert len(d["two_chrom_discrepant_snps"]) == 0 np.testing.assert_allclose(d["one_chrom_shared_dna"].loc[1]["cMs"], 140.443968) np.testing.assert_allclose(d["two_chrom_shared_dna"].loc[1]["cMs"], 140.443968) - assert os.path.exists("output/shared_dna_one_chrom_ind1_ind2_ind3_GRCh37.csv") - assert os.path.exists("output/shared_dna_two_chroms_ind1_ind2_ind3_GRCh37.csv") - assert os.path.exists("output/shared_genes_one_chrom_ind1_ind2_ind3_GRCh37.csv") - assert os.path.exists( - "output/shared_genes_two_chroms_ind1_ind2_ind3_GRCh37.csv" - ) - assert os.path.exists("output/shared_dna_ind1_ind2_ind3.png") + self._make_file_exist_assertions("ind1_ind2_ind3") def test_find_shared_dna_two_chrom_shared_no_output(self): + self.run_find_shared_dna_test( + self._test_find_shared_dna_two_chrom_shared_no_output + ) + + def _test_find_shared_dna_two_chrom_shared_no_output(self): ind1 = self.simulate_snps(self.l.create_individual("ind1")) ind2 = self.simulate_snps(self.l.create_individual("ind2")) @@ -215,13 +434,12 @@ def test_find_shared_dna_two_chrom_shared_no_output(self): assert len(d["two_chrom_discrepant_snps"]) == 0 np.testing.assert_allclose(d["one_chrom_shared_dna"].loc[1]["cMs"], 140.443968) np.testing.assert_allclose(d["two_chrom_shared_dna"].loc[1]["cMs"], 140.443968) - assert not os.path.exists("output/shared_dna_one_chrom_ind1_ind2_GRCh37.csv") - assert not os.path.exists("output/shared_dna_two_chroms_ind1_ind2_GRCh37.csv") - assert not os.path.exists("output/shared_genes_one_chrom_ind1_ind2_GRCh37.csv") - assert not os.path.exists("output/shared_genes_two_chroms_ind1_ind2_GRCh37.csv") - assert not os.path.exists("output/shared_dna_ind1_ind2.png") + self._make_file_exist_assertions("ind1_ind2", exist="none") def test_find_shared_dna_one_chrom_shared(self): + self.run_find_shared_dna_test(self._test_find_shared_dna_one_chrom_shared) + + def _test_find_shared_dna_one_chrom_shared(self): ind1 = self.simulate_snps(self.l.create_individual("ind1")) ind2 = self.simulate_snps( self.l.create_individual("ind2"), complement_genotype_one_chrom=True @@ -236,13 +454,14 @@ def test_find_shared_dna_one_chrom_shared(self): assert len(d["one_chrom_discrepant_snps"]) == 0 assert len(d["two_chrom_discrepant_snps"]) == 0 np.testing.assert_allclose(d["one_chrom_shared_dna"].loc[1]["cMs"], 140.443968) - assert os.path.exists("output/shared_dna_one_chrom_ind1_ind2_GRCh37.csv") - assert not os.path.exists("output/shared_dna_two_chroms_ind1_ind2_GRCh37.csv") - assert os.path.exists("output/shared_genes_one_chrom_ind1_ind2_GRCh37.csv") - assert not os.path.exists("output/shared_genes_two_chroms_ind1_ind2_GRCh37.csv") - assert os.path.exists("output/shared_dna_ind1_ind2.png") + self._make_file_exist_assertions("ind1_ind2", exist="one_chrom") def test_find_shared_dna_one_chrom_shared_three_ind(self): + self.run_find_shared_dna_test( + self._test_find_shared_dna_one_chrom_shared_three_ind + ) + + def _test_find_shared_dna_one_chrom_shared_three_ind(self): ind1 = self.simulate_snps(self.l.create_individual("ind1")) ind2 = self.simulate_snps( self.l.create_individual("ind2"), complement_genotype_one_chrom=True @@ -258,17 +477,14 @@ def test_find_shared_dna_one_chrom_shared_three_ind(self): assert len(d["one_chrom_discrepant_snps"]) == 0 assert len(d["two_chrom_discrepant_snps"]) == 0 np.testing.assert_allclose(d["one_chrom_shared_dna"].loc[1]["cMs"], 140.443968) - assert os.path.exists("output/shared_dna_one_chrom_ind1_ind2_ind3_GRCh37.csv") - assert not os.path.exists( - "output/shared_dna_two_chroms_ind1_ind2_ind3_GRCh37.csv" - ) - assert os.path.exists("output/shared_genes_one_chrom_ind1_ind2_ind3_GRCh37.csv") - assert not os.path.exists( - "output/shared_genes_two_chroms_ind1_ind2_ind3_GRCh37.csv" - ) - assert os.path.exists("output/shared_dna_ind1_ind2_ind3.png") + self._make_file_exist_assertions("ind1_ind2_ind3", exist="one_chrom") def test_find_shared_dna_X_chrom_two_individuals_male(self): + self.run_find_shared_dna_test_X( + self._test_find_shared_dna_X_chrom_two_individuals_male + ) + + def _test_find_shared_dna_X_chrom_two_individuals_male(self): ind1 = self.simulate_snps( self.l.create_individual("ind1"), chrom="X", @@ -292,15 +508,20 @@ def test_find_shared_dna_X_chrom_two_individuals_male(self): assert len(d["two_chrom_shared_genes"]) == 54 assert len(d["one_chrom_discrepant_snps"]) == 0 assert len(d["two_chrom_discrepant_snps"]) == 0 - np.testing.assert_allclose(d["one_chrom_shared_dna"].loc[1]["cMs"], 202.022891) - np.testing.assert_allclose(d["two_chrom_shared_dna"].loc[1]["cMs"], 20.837792) - assert os.path.exists("output/shared_dna_one_chrom_ind1_ind2_GRCh37.csv") - assert os.path.exists("output/shared_dna_two_chroms_ind1_ind2_GRCh37.csv") - assert os.path.exists("output/shared_genes_one_chrom_ind1_ind2_GRCh37.csv") - assert os.path.exists("output/shared_genes_two_chroms_ind1_ind2_GRCh37.csv") - assert os.path.exists("output/shared_dna_ind1_ind2.png") + np.testing.assert_allclose( + d["one_chrom_shared_dna"].loc[1]["cMs"], 202.022891, rtol=1e-5, + ) + np.testing.assert_allclose( + d["two_chrom_shared_dna"].loc[1]["cMs"], 20.837792, rtol=1e-3, + ) + self._make_file_exist_assertions("ind1_ind2") def test_find_shared_dna_X_chrom_two_individuals_female(self): + self.run_find_shared_dna_test_X( + self._test_find_shared_dna_X_chrom_two_individuals_female + ) + + def _test_find_shared_dna_X_chrom_two_individuals_female(self): ind1 = self.simulate_snps( self.l.create_individual("ind1"), chrom="X", @@ -322,15 +543,20 @@ def test_find_shared_dna_X_chrom_two_individuals_female(self): assert len(d["two_chrom_shared_genes"]) == 3022 assert len(d["one_chrom_discrepant_snps"]) == 0 assert len(d["two_chrom_discrepant_snps"]) == 0 - np.testing.assert_allclose(d["one_chrom_shared_dna"].loc[1]["cMs"], 202.022891) - np.testing.assert_allclose(d["two_chrom_shared_dna"].loc[1]["cMs"], 202.022891) - assert os.path.exists("output/shared_dna_one_chrom_ind1_ind2_GRCh37.csv") - assert os.path.exists("output/shared_dna_two_chroms_ind1_ind2_GRCh37.csv") - assert os.path.exists("output/shared_genes_one_chrom_ind1_ind2_GRCh37.csv") - assert os.path.exists("output/shared_genes_two_chroms_ind1_ind2_GRCh37.csv") - assert os.path.exists("output/shared_dna_ind1_ind2.png") + np.testing.assert_allclose( + d["one_chrom_shared_dna"].loc[1]["cMs"], 202.022891, rtol=1e-5, + ) + np.testing.assert_allclose( + d["two_chrom_shared_dna"].loc[1]["cMs"], 202.022891, rtol=1e-5, + ) + self._make_file_exist_assertions("ind1_ind2") def test_find_shared_dna_two_chrom_shared_discrepant_snps(self): + self.run_find_shared_dna_test( + self._test_find_shared_dna_two_chrom_shared_discrepant_snps + ) + + def _test_find_shared_dna_two_chrom_shared_discrepant_snps(self): # simulate discrepant SNPs so that stitching of adjacent shared DNA segments is performed ind1 = self.simulate_snps(self.l.create_individual("ind1")) ind2 = self.simulate_snps( @@ -349,13 +575,12 @@ def test_find_shared_dna_two_chrom_shared_discrepant_snps(self): assert len(d["two_chrom_discrepant_snps"]) == 2 np.testing.assert_allclose(d["one_chrom_shared_dna"].loc[1]["cMs"], 140.443968) np.testing.assert_allclose(d["two_chrom_shared_dna"].loc[1]["cMs"], 140.443968) - assert os.path.exists("output/shared_dna_one_chrom_ind1_ind2_GRCh37.csv") - assert os.path.exists("output/shared_dna_two_chroms_ind1_ind2_GRCh37.csv") - assert os.path.exists("output/shared_genes_one_chrom_ind1_ind2_GRCh37.csv") - assert os.path.exists("output/shared_genes_two_chroms_ind1_ind2_GRCh37.csv") - assert os.path.exists("output/shared_dna_ind1_ind2.png") + self._make_file_exist_assertions("ind1_ind2") def test_find_shared_dna_no_shared_dna(self): + self.run_find_shared_dna_test(self._test_find_shared_dna_no_shared_dna) + + def _test_find_shared_dna_no_shared_dna(self): ind1 = self.simulate_snps(self.l.create_individual("ind1")) ind2 = self.simulate_snps( self.l.create_individual("ind2"), complement_genotype_two_chroms=True @@ -369,13 +594,14 @@ def test_find_shared_dna_no_shared_dna(self): assert len(d["two_chrom_shared_genes"]) == 0 assert len(d["one_chrom_discrepant_snps"]) == 0 assert len(d["two_chrom_discrepant_snps"]) == 0 - assert not os.path.exists("output/shared_dna_one_chrom_ind1_ind2_GRCh37.csv") - assert not os.path.exists("output/shared_dna_two_chroms_ind1_ind2_GRCh37.csv") - assert not os.path.exists("output/shared_genes_one_chrom_ind1_ind2_GRCh37.csv") - assert not os.path.exists("output/shared_genes_two_chroms_ind1_ind2_GRCh37.csv") - assert os.path.exists("output/shared_dna_ind1_ind2.png") + self._make_file_exist_assertions("ind1_ind2", exist="plots") def test_find_shared_dna_no_shared_dna_three_ind(self): + self.run_find_shared_dna_test( + self._test_find_shared_dna_no_shared_dna_three_ind + ) + + def _test_find_shared_dna_no_shared_dna_three_ind(self): ind1 = self.simulate_snps(self.l.create_individual("ind1")) ind2 = self.simulate_snps( self.l.create_individual("ind2"), complement_genotype_two_chroms=True @@ -390,16 +616,4 @@ def test_find_shared_dna_no_shared_dna_three_ind(self): assert len(d["two_chrom_shared_genes"]) == 0 assert len(d["one_chrom_discrepant_snps"]) == 0 assert len(d["two_chrom_discrepant_snps"]) == 0 - assert not os.path.exists( - "output/shared_dna_one_chrom_ind1_ind2_ind3_GRCh37.csv" - ) - assert not os.path.exists( - "output/shared_dna_two_chroms_ind1_ind2_ind3_GRCh37.csv" - ) - assert not os.path.exists( - "output/shared_genes_one_chrom_ind1_ind2_ind3_GRCh37.csv" - ) - assert not os.path.exists( - "output/shared_genes_two_chroms_ind1_ind2_ind3_GRCh37.csv" - ) - assert os.path.exists("output/shared_dna_ind1_ind2_ind3.png") + self._make_file_exist_assertions("ind1_ind2_ind3", exist="plots") diff --git a/tests/test_resources.py b/tests/test_resources.py index b59a6a3..a052f4b 100644 --- a/tests/test_resources.py +++ b/tests/test_resources.py @@ -23,47 +23,245 @@ """ +import gzip +import io import os +import tarfile +import tempfile +from unittest.mock import mock_open, patch import warnings +import pandas as pd + from lineage.resources import Resources from tests import BaseLineageTestCase class TestResources(BaseLineageTestCase): def setUp(self): - self.resource = Resources(resources_dir="resources") self.del_output_dir_helper() + def _reset_resource(self): + self.resource._genetic_map = {} + self.resource._genetic_map_name = "" + self.resource._cytoBand_hg19 = pd.DataFrame() + self.resource._knownGene_hg19 = pd.DataFrame() + self.resource._kgXref_hg19 = pd.DataFrame() + + def run(self, result=None): + # set resources directory based on if downloads are being performed + # https://stackoverflow.com/a/11180583 + + self.resource = Resources() + self._reset_resource() + if self.downloads_enabled: + self.resource._resources_dir = "resources" + super().run(result) + else: + # use a temporary directory for test resource data + with tempfile.TemporaryDirectory() as tmpdir: + self.resource._resources_dir = tmpdir + super().run(result) + self.resource._resources_dir = "resources" + + def _generate_test_genetic_map_HapMapII_GRCh37_resource(self): + filenames = [f"genetic_map_GRCh37_chr{chrom}.txt" for chrom in range(1, 23)] + filenames.extend( + [ + "genetic_map_GRCh37_chrX.txt", + "genetic_map_GRCh37_chrX_par1.txt", + "genetic_map_GRCh37_chrX_par2.txt", + "README.txt", + ] + ) + + # create compressed tar in memory + tar_file = io.BytesIO() + with tarfile.open(fileobj=tar_file, mode="w:gz") as out_tar: + for filename in filenames: + if filename != "README.txt": + chrom = filename[filename.find("chr") :].split(".")[0] + s = "Chromosome\tPosition(bp)\tRate(cM/Mb)\tMap(cM)\n" + s += f"{chrom}\t0\t0.0\t0.0\n" + else: + s = "test" + + # add file to tar; https://stackoverflow.com/a/40392022 + data = s.encode() + file = io.BytesIO(data) + tar_info = tarfile.TarInfo(name=filename) + tar_info.size = len(data) + out_tar.addfile(tar_info, fileobj=file) + + mock = mock_open(read_data=tar_file.getvalue()) + with patch("urllib.request.urlopen", mock): + self.resource._get_path_genetic_map_HapMapII_GRCh37() + def test_get_genetic_map_HapMapII_GRCh37(self): + def f(): + # mock download of test data + self._generate_test_genetic_map_HapMapII_GRCh37_resource() + return self.resource.get_genetic_map_HapMapII_GRCh37() + + genetic_map_HapMapII_GRCh37 = ( + self.resource.get_genetic_map_HapMapII_GRCh37() + if self.downloads_enabled + else f() + ) + + assert len(genetic_map_HapMapII_GRCh37) == 23 + + # get already loaded resource genetic_map_HapMapII_GRCh37 = self.resource.get_genetic_map_HapMapII_GRCh37() assert len(genetic_map_HapMapII_GRCh37) == 23 + # get already loaded resource + genetic_map = self.resource.get_genetic_map("HapMap2") + assert len(genetic_map) == 23 + + def _generate_test_genetic_map_1000G_GRCh37_resource(self): + filenames = [f"CEU-{chrom}-final.txt.gz" for chrom in range(1, 23)] + + # create tar in memory + tar_file = io.BytesIO() + with tarfile.open(fileobj=tar_file, mode="w") as out_tar: + for filename in filenames: + s = "Position(bp)\tRate(cM/Mb)\tMap(cM)\tFiltered\n" + s += f" 0\t0.0\t0.0\t0\n" + + # add file to tar; https://stackoverflow.com/a/40392022 + data = gzip.compress(s.encode()) + file = io.BytesIO(data) + tar_info = tarfile.TarInfo(name=filename) + tar_info.size = len(data) + out_tar.addfile(tar_info, fileobj=file) + + mock = mock_open(read_data=tar_file.getvalue()) + with patch("urllib.request.urlopen", mock): + self.resource._get_path_genetic_map_1000G_GRCh37("CEU") + + def test_get_genetic_map_1000G_GRCh37(self): + def f(): + # mock download of test data + self._generate_test_genetic_map_1000G_GRCh37_resource() + return self.resource.get_genetic_map_1000G_GRCh37("CEU") + + genetic_map = ( + self.resource.get_genetic_map_1000G_GRCh37("CEU") + if self.downloads_enabled + else f() + ) + + assert len(genetic_map) == 22 + + # get already loaded resource + genetic_map = self.resource.get_genetic_map_1000G_GRCh37("CEU") + assert len(genetic_map) == 22 + + # get already loaded resource + genetic_map = self.resource.get_genetic_map("CEU") + assert len(genetic_map) == 22 + + def test_invalid_genetic_map(self): + # https://stackoverflow.com/a/46767037 + with self.assertLogs() as log: + genetic_map = self.resource.get_genetic_map("test") + self.assertEqual(len(log.output), 1) + self.assertEqual(len(log.records), 1) + self.assertIn("Invalid genetic map", log.output[0]) + self.assertEqual(len(genetic_map), 0) + + def _generate_test_cytoBand_hg19_resource(self): + s = f"s\t0\t0\ts\ts\n" * 862 + mock = mock_open(read_data=gzip.compress(s.encode())) + with patch("urllib.request.urlopen", mock): + self.resource._get_path_cytoBand_hg19() + def test_get_cytoBand_hg19(self): + def f(): + # mock download of test data + self._generate_test_cytoBand_hg19_resource() + return self.resource.get_cytoBand_hg19() + + cytoBand_hg19 = ( + self.resource.get_cytoBand_hg19() if self.downloads_enabled else f() + ) + + assert len(cytoBand_hg19) == 862 + + # get already loaded resource cytoBand_hg19 = self.resource.get_cytoBand_hg19() assert len(cytoBand_hg19) == 862 + def _generate_test_knownGene_hg19_resource(self): + s = "s\ts\ts\t0\t0\t0\t0\t0\ts\ts\ts\ts\n" * 82960 + + mock = mock_open(read_data=gzip.compress(s.encode())) + with patch("urllib.request.urlopen", mock): + self.resource._get_path_knownGene_hg19() + def test_get_knownGene_hg19(self): + def f(): + # mock download of test data + self._generate_test_knownGene_hg19_resource() + return self.resource.get_knownGene_hg19() + + knownGene_hg19 = ( + self.resource.get_knownGene_hg19() if self.downloads_enabled else f() + ) + + assert len(knownGene_hg19) == 82960 + + # get already loaded resource knownGene_hg19 = self.resource.get_knownGene_hg19() assert len(knownGene_hg19) == 82960 + def _generate_test_kgXref_hg19_resource(self): + s = "s\ts\ts\ts\ts\ts\ts\ts\n" * 82960 + + mock = mock_open(read_data=gzip.compress(s.encode())) + with patch("urllib.request.urlopen", mock): + self.resource._get_path_kgXref_hg19() + def test_get_kgXref_hg19(self): + def f(): + # mock download of test data + self._generate_test_kgXref_hg19_resource() + return self.resource.get_kgXref_hg19() + + kgXref_hg19 = self.resource.get_kgXref_hg19() if self.downloads_enabled else f() + + assert len(kgXref_hg19) == 82960 + + # get already loaded resource kgXref_hg19 = self.resource.get_kgXref_hg19() assert len(kgXref_hg19) == 82960 def test_get_all_resources(self): - resources = self.resource.get_all_resources() + def f(): + # mock download of test data for each resource + self._generate_test_genetic_map_HapMapII_GRCh37_resource() + self._generate_test_cytoBand_hg19_resource() + self._generate_test_knownGene_hg19_resource() + self._generate_test_kgXref_hg19_resource() + + return self.resource.get_all_resources() + + resources = self.resource.get_all_resources() if self.downloads_enabled else f() + for k, v in resources.items(): - if v is None: - assert False - assert True + self.assertGreater(len(v), 0) def test_download_example_datasets(self): - paths = self.resource.download_example_datasets() + def f(): + with patch("urllib.request.urlopen", mock_open(read_data=b"")): + return self.resource.download_example_datasets() + + paths = ( + self.resource.download_example_datasets() if self.downloads_enabled else f() + ) for path in paths: if path is None or not os.path.exists(path): warnings.warn("Example dataset(s) not currently available") return - - assert True