4DN Data Portal Warning #391
Hi, I noticed that a warning message has recently been attached to many .mcool files in the 4D Nucleome Data Portal. For example, clicking the note here shows the warning message:
I am curious what bug caused this duplicate pixel problem. Which Cooler functions are affected by that specific bug, and has it been fixed in a later version? Fuzhou
Replies: 2 comments
Hi @wangfuzhou110,
@wangfuzhou110 @StefanoCretti The bug was an issue during zoomification (i.e. coarsening to lower resolutions from a base resolution), where occasionally a bin would get "split" during aggregation. This would happen at the boundaries of large chunks of data, so it's rare but happens more often in very deep datasets.
In those cases, the identifier for a split pixel (bin1, bin2) ends up being reported twice with different values, but if you add the values associated with those records you get the correct count for that pixel.
Importantly, you will only notice the issue if you use the tabular interfaces:
clr.pixels()
or
clr.matrix(as_pixels=True, ...)
because it turns out that the way …
That's why it went unnoticed for a while. But thankfully, it shouldn't matter for most use cases.
Split pixels only affect coarse zoom levels; the base resolution should have no split pixels. Therefore, a complete fix is to re-zoomify from the base resolution (1kb) with a version of cooler >= 0.8.5:
cooler zoomify --balance -r 4DN old.mcool::resolutions/1000 -o new.mcool
Open2C is working with DCIC to update their processing pipeline for a new release, along with some other improvements, including filtering the pairs by map quality before binning (regardless of contact matrix format), which is currently not implemented.
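For anyone who wants to check a pixel table for split pixels without re-zoomifying, here is a minimal pandas sketch of the idea described above: a split pixel shows up as a repeated (bin1_id, bin2_id) pair, and summing the counts of the repeated records should recover the correct value. The toy table and the helper names `find_split_pixels` and `merge_split_pixels` are made up for illustration and are not part of cooler; the sketch also assumes `count` is the only value column you care about.

```python
import pandas as pd

# Toy pixel table in the layout of a cooler pixel table
# (columns bin1_id, bin2_id, count). Values are invented for illustration;
# the pair (1, 1) is deliberately "split" into two records.
pixels = pd.DataFrame(
    {
        "bin1_id": [0, 0, 1, 1, 2],
        "bin2_id": [0, 1, 1, 1, 2],
        "count": [5, 3, 4, 6, 2],
    }
)


def find_split_pixels(df):
    """Return every record whose (bin1_id, bin2_id) pair occurs more than once."""
    return df[df.duplicated(["bin1_id", "bin2_id"], keep=False)]


def merge_split_pixels(df):
    """Collapse repeated (bin1_id, bin2_id) pairs by summing their counts."""
    return df.groupby(["bin1_id", "bin2_id"], as_index=False)["count"].sum()


split = find_split_pixels(pixels)
merged = merge_split_pixels(pixels)

print(split)   # the two partial records for pixel (1, 1)
print(merged)  # one record per pixel, with the partial counts added up
```

With a real file, the table could come from something like `clr.pixels()[:]` on one of the coarse resolutions, though for large coolers you would probably want to go through it in chunks rather than load everything at once.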