
everything about input data

Kamil S. Jaron edited this page Feb 3, 2025 · 4 revisions

Input data

Smudgeplot performs a k-mer spectrum analysis, so anything with a low error rate and decent coverage can be used as input: Illumina, PacBio HiFi or 2d ONT reads are all fine. Some library types other than WGS might not be optimal. In particular, HiC is not a good library for genome profiling, as it disproportionately amplifies different regions of the genome based on their accessibility, so the whole concept of 1n coverage is rather problematic for HiC. For other library types, think a bit about how the library is constructed: would that process mess with the expected coverage? If the answer is yes, or maybe, proceed with caution.

How much coverage do I need?

Smudgeplot scales to any size of input dataset, and more is always better, so we recommend using all the sequencing data you have. The bare minimum coverage that could possibly work is 10x per ploidy, but this really requires libraries to be PCR-free, very nicely done and well sequenced (i.e. coverage variation as low as it can get). The more coverage variation there is, the more coverage is needed to produce a meaningful smudgeplot.
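As a rough way to check where a dataset stands, the sketch below divides the total sequenced bases by the haploid genome size times the ploidy. It reads "per ploidy" as "per haplotype copy" (1n coverage), which is an interpretation, and the function name and example numbers are made up for illustration. The 10x and 25x thresholds are the ones quoted on this page.

```python
def coverage_per_haplotype(total_bases, haploid_genome_size, ploidy):
    """Approximate 1n coverage: sequenced bases spread over all haplotype copies."""
    if total_bases < 0 or haploid_genome_size <= 0 or ploidy <= 0:
        raise ValueError("total_bases must be >= 0; genome size and ploidy must be > 0")
    return total_bases / (haploid_genome_size * ploidy)


BARE_MINIMUM = 10   # "bare minimum" per-haplotype coverage mentioned on this page
COMFORTABLE = 25    # coverage above which the smudgeplot "should be really nice"

# Hypothetical dataset: 60 Gbp of reads, 500 Mbp haploid genome, tetraploid.
cov = coverage_per_haplotype(60e9, 500e6, 4)
print(f"{cov:.1f}x per haplotype")  # 30.0x per haplotype
print("above bare minimum:", cov >= BARE_MINIMUM)
print("comfortable:", cov > COMFORTABLE)
```

Genome size and ploidy are often only guesses at this stage, so treat the result as an order-of-magnitude check rather than a precise number.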

The coverage variation depends on many things:

  • how recent the sequencing is (Illumina actually improved a lot in the last 5 years).
  • how many rounds of PCR were involved in the library prep (the less the better).
  • whether the genome was amplified by a whole genome amplification technique.
  • variation in GC content along the chromosomes (more variation can lead to greater variation in coverage).

We don't directly estimate coverage variation in smudgeplot, but you can roughly see it in how "blurry" the smudges are. You can also see it in the k-mer profile produced by GenomeScope: the wider the coverage distributions, the more variation there is.

To sum up, if you have >25x per haplotype, the smudgeplot should be really nice. If you have less than that, or there is something else going on (like whole genome amplification), the smudgeplot might not be very informative. Always make a k-mer spectrum too; see the joint-interpretation-of-smudgeplot-and-GenomeScope page for details.

Why are PCR libraries or WGA problematic?

PCR, and especially whole genome amplification (WGA), increases variance in coverage. For example, if my genome is sequenced to 30x using PCR-free libraries, the majority of positions in the genome will be covered, let's say, 20 - 40x, and practically all of them will be between 10 - 50x. However, the same coverage with PCR will lead to certain positions with 200x coverage and others not sequenced at all. In practice this means that the smudges on the smudgeplot will be much more smudgy (blurry, blended), and that needs to be compensated with a lot of coverage. We have successfully made a reasonable smudgeplot of a sample with WGA and ~100x; trying with less is playing with fire.
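The 30x example above can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: the PCR-free library is modelled as plain Poisson sampling, and amplification bias is modelled by giving each position its own expected depth drawn from a gamma distribution (a gamma-Poisson mixture). Both models and the shape value are assumptions chosen for illustration, not something measured or used by smudgeplot.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) value (Knuth's method; normal approximation for large lam)."""
    if lam > 100:
        return max(0, round(rng.gauss(lam, math.sqrt(lam))))
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def simulate_depths(mean_cov, n_positions, shape=None, seed=1):
    """Simulate per-position sequencing depth.

    shape=None  -> every position has the same expected depth (idealised PCR-free).
    shape=float -> expected depth per position is gamma distributed with this
                   shape and mean mean_cov (assumed stand-in for amplification bias).
    """
    rng = random.Random(seed)
    depths = []
    for _ in range(n_positions):
        lam = mean_cov if shape is None else rng.gammavariate(shape, mean_cov / shape)
        depths.append(poisson(lam, rng))
    return depths


def fraction_between(depths, low, high):
    """Fraction of positions with depth in [low, high]."""
    return sum(low <= d <= high for d in depths) / len(depths)


pcr_free = simulate_depths(30, 20000)
amplified = simulate_depths(30, 20000, shape=1.0)  # strongly overdispersed

for name, depths in (("PCR-free", pcr_free), ("amplified", amplified)):
    print(
        name,
        "| 20-40x:", round(fraction_between(depths, 20, 40), 2),
        "| uncovered positions:", depths.count(0),
        "| max depth:", max(depths),
    )
```

Both simulated libraries have the same mean coverage; only the spread differs, and that spread is what blurs the smudges.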

Garbage in, garbage out

I am not trying to be mean. But if your smudgeplot looks like a mess, it's very likely that your assembly will also be a mess, because smudgeplot is a direct visualisation of your data. The better the smudges are separated, the better the chances that the assembler will be able to guess how many times a k-mer occurs in the genome.

If you plot a smudgeplot that looks like a "tide", or even like a single dot in the bottom right, the most common reason is not enough coverage. For example:

Image

However, just to make sure there is nothing else going on, you might want to check a few things first. Does the k-mer histogram have distinct peaks? If yes, why are there no smudge pairs? Did you choose an appropriate error threshold, i.e. is the number of error k-mers in the search reasonable? (The -L parameter should separate genomic and error k-mers well.) How to do all that is shown, for example, in the tutorial saccharomyces page; more of the reasoning behind it is on the detailed joint interpretation of smudgeplot and GenomeScope page, which is also relevant to read.
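One quick way to eyeball the histogram check is to look for the first valley between the error k-mers (very low coverage) and the first genomic peak. The sketch below is a simple heuristic and not how smudgeplot itself chooses a threshold. It assumes the histogram is already loaded as (coverage, count) pairs sorted by coverage, and the toy numbers are invented.

```python
def first_valley(histogram):
    """Return the coverage at the first local minimum of the counts, or None.

    histogram: list of (coverage, count) pairs sorted by increasing coverage.
    """
    for i in range(1, len(histogram) - 1):
        prev_count = histogram[i - 1][1]
        cov, count = histogram[i]
        next_count = histogram[i + 1][1]
        if count < prev_count and count <= next_count:
            return cov
    return None


def peak_after(histogram, valley_cov):
    """Return the coverage with the highest count above the valley, or None."""
    above = [(count, cov) for cov, count in histogram if cov > valley_cov]
    return max(above)[1] if above else None


# Invented toy histogram: error k-mers pile up at 1-5x, genomic peak near 10x.
toy_histogram = [
    (1, 900000), (2, 300000), (3, 90000), (4, 30000), (5, 12000),
    (6, 9000), (7, 11000), (8, 16000), (9, 24000), (10, 30000),
    (11, 27000), (12, 20000),
]

valley = first_valley(toy_histogram)
print("first valley at coverage:", valley)                          # 6
print("main peak after valley:", peak_after(toy_histogram, valley))  # 10
```

If no valley is found, or the peak after it is barely higher than the valley, the histogram probably has no distinct genomic peak, which usually points back to insufficient coverage. Real histograms are noisy, so a value near the valley is only a starting point for the error threshold and should be checked against the plotted histogram.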