
Dysco is a compressing storage manager for Casacore measurement sets. This Wiki contains the manual for Dysco.

The Dysco compression technique is explained in the article "Compression of interferometric radio-astronomical data", A. R. Offringa (2016). If you use this software, please cite the paper.

Installation instructions

To install, run the following from the Dysco source directory:

mkdir build
cd build
cmake ../
make -j 4
make install

To be able to open compressed measurement sets, the Dysco library ("libdyscostman.so") must be findable at runtime, e.g. by adding its location to LD_LIBRARY_PATH.

By default, Dysco will be built with machine-dependent optimisations (the '-march=native' option of gcc). To be able to use the compiled code on nodes of different types, the option "-DPORTABLE=True" can be added to the cmake command line (i.e., cmake ../ -DPORTABLE=True).

Compressing data

Dysco works best on data that has already been RFI flagged. It also works better on noisy data, a somewhat counter-intuitive aspect of the algorithm. This comes from the fact that Dysco adds some noise to the data which is (more or less) proportional to the signal strength in the data. If the data has a low S/N ratio, that added noise is well below the system noise. If the data has a (very) high S/N ratio, it might be necessary to increase the bit rate -- see below for details about this and other parameters.
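
To verify that a chosen bit rate keeps the added noise acceptable, one can compare a compressed copy of a measurement set against the original. The following is a minimal sketch, not part of Dysco itself; it assumes python-casacore is installed and that 'obs-dysco.ms' is a hypothetical compressed copy of 'obs.ms':

import numpy as np
from casacore.tables import table

original = table('obs.ms')
compressed = table('obs-dysco.ms')

# Read the full DATA column of both sets (fine for small sets;
# read in chunks for large ones).
data_orig = original.getcol('DATA')
data_comp = compressed.getcol('DATA')

# The RMS of the difference, relative to the RMS of the original data,
# estimates the fractional noise added by the quantization.
added = np.sqrt(np.mean(np.abs(data_comp - data_orig) ** 2))
signal = np.sqrt(np.mean(np.abs(data_orig) ** 2))
print('added noise: {:.2f}% of the data RMS'.format(100.0 * added / signal))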

File format compatibility

The file format of Dysco was fixed on 2016-10-01. I've carefully designed the file format so that no compatibility-affecting changes will be necessary; i.e., all future versions of Dysco will be able to open the current files.

Observation requirements

The Dysco storage manager has some constraints on how the data is stored in the measurement set. As far as I know, all observatories/correlators satisfy these constraints, but it is good to be aware of them. These are the requirements:

  • The measurement set has to be completely 'regular', i.e., every timestep needs to have the same structure. In particular, every timestep needs to occupy the same number of rows in the table, and the baselines have to be sorted in the same way. Almost all correlators do this, but I have heard of plans to e.g. drop a row when it is completely flagged, to save space. This is not compatible with Dysco.

    Dysco does not require a measurement set to contain all possible baselines. Auto-correlations for example can be left out, or if an antenna is not working, all correlations with that antenna can be left out, as long as every timestep has the same baselines.

  • When writing data to a compressed data column, the meta data for the row to be written must already be available in the measurement set. This is, for example, because Dysco needs to build a correlation matrix and therefore requires the antenna indices. In particular, the columns ANTENNA1, ANTENNA2, DATA_DESC_ID, FIELD_ID and TIME need to be correctly set before the compressed data row is written.

    It is of course still possible to write a measurement set row by row. The only requirement is that, when a new row is added to a measurement set, the meta data columns are set before the compressed data column is written (see the sketch below).

  • Dysco currently needs at least two time steps (or fields/spws) to be stored in the file, to be able to determine the number of rows that make up a single correlation matrix. This is not a fundamental requirement and could be fixed if required (i.e., if people nag me that this would be useful to fix).

Dysco should work with multi-field and multi-spw sets, but has not been tested well on those.
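
The write ordering of the second requirement can be illustrated with python-casacore. This is a minimal sketch, not taken from the Dysco sources; it assumes 'obs.ms' already has a Dysco-compressed DATA column, and the antenna indices, ids, time and visibility values are hypothetical:

import numpy as np
from casacore.tables import table

ms = table('obs.ms', readonly=False)
row = ms.nrows()
ms.addrows(1)

# First write the meta data that Dysco needs to place the row in its
# correlation matrix... (all values here are hypothetical)
ms.putcell('ANTENNA1', row, 0)
ms.putcell('ANTENNA2', row, 1)
ms.putcell('DATA_DESC_ID', row, 0)
ms.putcell('FIELD_ID', row, 0)
ms.putcell('TIME', row, 4920125123.0)

# ...and only then write the compressed data column itself.
vis = np.zeros((64, 4), dtype=np.complex64)  # hypothetical row shape
ms.putcell('DATA', row, vis)
ms.close()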

dscompress

The program 'dscompress' allows compression of a measurement set. Note that this program is mostly aimed at testing the compression. For production use, I highly recommend performing the compression within the pipeline in which the initial measurement sets are created. This can be implemented by asking Casacore to store a column with the DyscoStMan. For LOFAR, Dysco was implemented in DPPP, and using DPPP is the recommended way to compress a LOFAR measurement set. For the MWA, Dysco can be enabled in Cotter since version 4.

Run 'dscompress' without parameters to get help. Here's an example command:

dscompress -afnormalization -truncgaus 2.5 -data-bit-rate 4 -weight-bit-rate 12 \
   -column DATA -column WEIGHT_SPECTRUM obs.ms

It might be that this command does not decrease the size of the measurement set, because, depending on the previously used storage manager, Casacore might not free the space of the old column after replacing it. Hence, the above command is useful for testing the effects of compression, but not for measuring the size of the measurement set.

To get a measure of the size, one can add '-reorder' to the command line. This will free the space of the old column, but currently, in order to do so, the full measurement set will be rewritten. This causes the compression to be performed twice, and hence the compression noise will be added twice. Therefore, the '-reorder' option is useful for measuring the compressed size, but should not be used in production.
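
After compressing, one can check which storage manager handles each column. A minimal sketch, assuming python-casacore:

from casacore.tables import table

ms = table('obs.ms')
# Each entry describes one data manager; compressed columns should
# list 'DyscoStMan' as their TYPE.
for dm in ms.getdminfo().values():
    print(dm['TYPE'], dm['NAME'], dm['COLUMNS'])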

Parameters

The Casacore system passes parameters to the storage manager through a so-called "Spec" Record. For Dysco, this Record can hold the following fields:

  • dataBitCount

    An integer, representing the number of bits used for each float in a data column, such as DATA or CORRECTED_DATA. Note that each visibility contains two floats (the real and imaginary parts). It must be set to a proper value even if only a weight column is compressed. Typical values for this are 4-12 bits, which compress the data by a factor of 8 to 2.7 (a 32-bit float stored with n bits is compressed by a factor of 32/n). A value of 8 is generally acceptable, and in most cases adds less than 1% noise. Every extra bit halves the added noise. For LOFAR, it was decided that a 1% noise increase is acceptable, but to make sure that this is also not exceeded in exceptional cases, the default was set to 10 bits for LOFAR. Because the MWA has smaller "stations" (tiles), which generate noisier data that is more easily compressed by Dysco, 8 bits is sufficient for MWA data.

    Note that Dysco currently only supports a particular set of bit counts: 2, 3, 4, 6, 8, 10, 12 and 16.

  • weightBitCount

    An integer, the number of bits used for each float in the WEIGHT_SPECTRUM column. Note that Dysco will use a single weight for all polarizations, so Dysco will typically compress the weight volume by a factor of 4 to begin with. Using 12 bits is recommended, as it has a very low error (it changes the weights insignificantly). Lower bit counts are possible, but because 12 bits already compresses the weights by a factor of about 11 (four 32-bit floats are replaced by one 12-bit value, i.e. 128/12 ≈ 10.7), the resulting weight volume from 12-bit compression is often factors lower than the data column volume.

    Similar to the data bit count, the supported settings are 2, 3, 4, 6, 8, 10, 12 and 16 bits.

  • distribution

    A string specifying the distribution assumed for the data. Supported values are:

    • Gaussian
    • Uniform
    • StudentsT
    • TruncatedGaussian

    Generally, the TruncatedGaussian distribution delivers the best results, but the Uniform distribution is a close second. Gaussian and StudentsT are not recommended.

  • normalization

    A string specifying how to normalize the data before quantization. This can be either "AF", "RF" or "Row". AF means Antenna-Frequency normalization, RF means Row-Frequency normalization and Row means per-row normalization. Row normalization is not stable and is not recommended. AF and RF normalization produce very similar results. For reasonably noisy data, AF is the recommended method. In high-SNR cases, RF normalization might be better, but it uses slightly more space. If unknown, use AF.

  • studentTNu

    When using the StudentsT distribution, this value specifies the degrees of freedom of the distribution. It is not relevant for the other distributions.

  • distributionTruncation

    When using the TruncatedGaussian distribution, this value specifies where to truncate the distribution, relative to the standard deviation. In Offringa (2016), values of 1.5, 2.5 and 3.5 were tested, and 2.5 seemed to be generally best and is the recommended value, in particular in combination with AF normalization. With RF normalization, 1.5 was slightly more accurate.

All fields are case sensitive.
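
As an example of how these fields can be passed to Casacore from Python, the sketch below adds a Dysco-compressed copy of the DATA column description to an existing measurement set using python-casacore. The column name 'DATA_DYSCO' and data manager name 'DyscoData' are hypothetical, and the field values follow the recommendations above:

from casacore.tables import table, makecoldesc, maketabdesc

ms = table('obs.ms', readonly=False)

spec = {
    'dataBitCount': 10,           # must be set even for weight-only use
    'weightBitCount': 12,
    'distribution': 'TruncatedGaussian',
    'distributionTruncation': 2.5,
    'normalization': 'AF',
    'studentTNu': 0.0,            # only used with the StudentsT distribution
}
dminfo = {'TYPE': 'DyscoStMan', 'NAME': 'DyscoData', 'SPEC': spec}

# Reuse the description of the existing DATA column for the new column,
# and bind the new column to a DyscoStMan instance with the above spec.
coldesc = makecoldesc('DATA_DYSCO', ms.getcoldesc('DATA'))
ms.addcols(maketabdesc(coldesc), dminfo)

After this, values written to DATA_DYSCO are stored compressed, subject to the observation requirements listed above.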
