4DN Data Portal Warning #391

wangfuzhou110 · 2023-12-06T08:31:42Z

Hi,

I noticed that a warning message has recently been attached to many .mcool files in the 4D Nucleome Data Portal. For example, clicking the note here shows the warning message:

WARNING - Due to a bug in the version of cooler (0.8.3) used in the current 4DN standard Hi-C processing pipeline, some pixels may occur multiple times at a single resolution, with different counts being reported for each occurrence. This duplication does not affect the HiGlass display of these files; however, downstream analyses using this file may encounter issues due to this pixel duplication. The counts from the duplicate pixels can be aggregated to determine the correct count value at that location. If this issue is problematic for your needs, you should consider regenerating the matrices from the merged pairs file of the associated dataset using a more recent version of cooler. We are working to update the pipeline but do not yet have a predicted date for when this issue will be resolved.

I am curious what bug caused this duplicate pixel problem. Which functions of cooler are affected by it, and has it been fixed in a later version?

Fuzhou

StefanoCretti · 2024-02-22T10:09:02Z

Hi @wangfuzhou110,
Although I do not know exactly which bug caused the problem (I am a user of the package, not a developer), as far as I know the issue has been resolved. The files on 4DN are still affected, though, so if you need to perform operations that are sensitive to the presence of duplicate or unsorted pixels, I would suggest regenerating the cool files from the pairs files available on the website.

nvictus (Maintainer) · 2024-02-24T10:22:53Z

@wangfuzhou110 @StefanoCretti The bug was an issue during zoomification (i.e. coarsening to lower resolutions from a base resolution), where occasionally a bin would get "split" during aggregation. This would happen at the boundaries of large chunks of data, so it's rare but happens more often in very deep datasets.

In those cases, the identifier for a split pixel (bin1, bin2) ends up being reported twice with different values, but if you add the values associated with those records you would get the correct count for that pixel.
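As a rough illustration of that summation, here is a minimal pandas sketch on a toy pixel table. The column names `bin1_id`, `bin2_id` and `count` follow cooler's pixel table; the helper name `merge_split_pixels` and the toy values are made up for this example.

```python
import pandas as pd


def merge_split_pixels(pixels: pd.DataFrame) -> pd.DataFrame:
    """Sum the counts of records that share the same (bin1_id, bin2_id)."""
    return (
        pixels.groupby(["bin1_id", "bin2_id"], sort=True, as_index=False)["count"]
        .sum()
    )


# Toy pixel table (hypothetical values): pixel (0, 2) was "split"
# into two records with counts 3 and 4.
pixels = pd.DataFrame(
    {
        "bin1_id": [0, 0, 0, 1],
        "bin2_id": [1, 2, 2, 1],
        "count": [5, 3, 4, 7],
    }
)

# Number of records that repeat an earlier (bin1_id, bin2_id) pair.
n_dup = int(pixels.duplicated(["bin1_id", "bin2_id"]).sum())

# One record per pixel, with the split counts added back together.
fixed = merge_split_pixels(pixels)
print(n_dup)
print(fixed)
```

On a real file, the same function could presumably be applied to the table returned by `clr.pixels()[:]` at an affected resolution, keeping in mind that loading every pixel of a deep dataset into memory at once may not be practical.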

Importantly, you will only notice the issue if you use the tabular interfaces, clr.pixels() or clr.matrix(as_pixels=True, ...). It turns out that the way matrix balancing and 2D range querying (as numpy or scipy arrays) are done in cooler implicitly sums split pixels, which cancels out the effect!
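One general mechanism by which such implicit summation can arise is the COO sparse format, where repeated (row, column) entries are added together when the matrix is converted to a dense array. The sketch below shows this behavior of `scipy.sparse.coo_matrix` on hypothetical toy values; it is meant as an illustration of the idea, not as a description of cooler's internal code.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy records (hypothetical): pixel (0, 2) appears twice, with counts 3 and 4.
rows = np.array([0, 0, 0, 1])
cols = np.array([1, 2, 2, 1])
vals = np.array([5, 3, 4, 7])

# Converting COO data to a dense array sums duplicate (row, col) entries,
# so the split pixel ends up with its combined count.
dense = coo_matrix((vals, (rows, cols)), shape=(3, 3)).toarray()
print(dense)
```

In this toy case the dense entry at (0, 2) holds the combined count of 7, even though the underlying records are still duplicated.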

That's why it went unnoticed for a while. But thankfully, it shouldn't matter for most use cases.

Note also that split pixels only affect coarse zoom levels; the base resolution should have no split pixels. Therefore, a complete fix is to re-zoomify from the base resolution (1kb) with a version of cooler >= 0.8.5:

cooler zoomify --balance -r 4DN old.mcool::resolutions/1000 -o new.mcool

(the -r 4DN option is shorthand for --resolutions 1000,2000,5000,10000,25000,50000,100000,250000,500000,1000000,...etc.)

Open2C is working with DCIC to update their processing pipeline for a new release, as well as some other improvements, including filtering the pairs by map quality before binning (regardless of contact matrix format), which is currently not implemented.
