raw MSI spectra as MassSpectra objects - size in memory #58
Comments
Dear Denis, thanks for your feature request. I see at least two solutions to the problem.
I would prefer the first method because it greatly reduces the memory usage. The second method roughly halves the memory usage. You could also try to remove useless Unfortunately I am not involved in any MS/MSI project anymore. While I am still maintaining If you would like to develop one of the above solutions I am happy to assist/review/accept pull requests. If you don't want to or can't develop something like this, you may want to have a look at Cardinal, which was written for large IMS data sets and keeps everything on disk. Best wishes, Sebastian |
Hi Sebastian, Thanks for your reply. I also find the first option more compelling, in particular, there is a package called Best, |
I don't think we have to rewrite the data to imzML (btw: this is how
If the m/z are identical for all spectra we could just create symlinks or keep them in memory. As I said above, regardless of the underlying solution I would be more than happy to accept pull requests that implement an on-disk feature. |
Hi Sebastian, Thanks again for your comments. I have been reading the documentation of To replicate the current class structure of Thanks, |
In
## create class unions (number or character/ mz&intensity or filename)
setClassUnion("NumericOrCharacter", c("numeric", "character"))
setClass("AbstractMassObject",
slots=list(mass="NumericOrCharacter", intensity="NumericOrCharacter",
metaData="list"),
contains="VIRTUAL")
setClass("MassSpectrum",
slots=list(mass="numeric", intensity="numeric"),
prototype=list(mass=numeric(), intensity=numeric(),
metaData=list()),
contains="AbstractMassObject")
setClass("OnDiskMassSpectrum",
slots=list(mass="character", intensity="character"),
prototype=list(mass=character(), intensity=character(), metaData=list()),
contains="AbstractMassObject")
ms <- new("MassSpectrum", mass=1:10, intensity=1:10)
oms <- new("OnDiskMassSpectrum", mass="filename", intensity="filename")
A possible class hierarchy would be:
While I find
I don't know which of the solutions is harder. I don't have any experience with moving from CRAN to Bioconductor (IMHO it was the wrong decision in 2011 to send the package to CRAN and not to Bioconductor ...). |
Here is a short example implementation of an on-disk class using the binary filesystem structure I mentioned earlier: https://gist.github.com/sgibb/b46a543d4bd1476a2abd412d23bf3780 I don't know whether there are any performance benefits using this approach or Unfortunately methods like |
Thanks Sebastian for investing your time in the clarifications above. The class hierarchy mentioned above makes a lot of sense. Regarding the whole The following suggestion would probably belong to I find the solution above intriguing, however, at the moment I can't invest time to try it out (not in this year at least), but I will definitely inform you if I get something going regarding this. Thanks, |
Thank you for the thread. Your suggestion to use the ibd file instead of inventing a new file format is great! I would definitely do that. My approach has the disadvantage that it reads all mass/intensity data of a spectrum into memory. That's fine for most methods because they work on a single spectrum (and a single spectrum should fit in memory). While I really appreciate your help and effort and I am very interested in improving |
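To illustrate the trade-off described above (each access reads one whole spectrum into memory, but never the whole data set), here is a minimal base R sketch. It assumes a hypothetical flat binary layout in which the intensities of each spectrum are stored back to back as 8-byte doubles. The helper `readSpectrumSketch` and this layout are illustrative assumptions, not the format used in the gist or in imzML/ibd files.

```r
## Hypothetical helper: read the intensities of a single spectrum from a
## flat binary file that stores `n` doubles per spectrum, back to back.
readSpectrumSketch <- function(path, index, n, size = 8L) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  ## jump to the first byte of the requested spectrum (offsets are 0-based)
  seek(con, where = (index - 1) * n * size, origin = "start")
  readBin(con, what = "double", n = n, size = size)
}

## Toy data: 3 spectra with 4 bins each, written column by column.
tmp <- tempfile(fileext = ".bin")
intensities <- matrix(as.double(1:12), nrow = 4)
writeBin(as.vector(intensities), tmp)

## Only the second spectrum is loaded into memory.
second <- readSpectrumSketch(tmp, index = 2, n = 4)
print(second)

unlink(tmp)
```

Methods that work on one spectrum at a time only ever hold `n` values in memory with such a layout; methods that need all spectra at once would still have to iterate over the file.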
It's a valid point, but honestly I find the processing methods implemented in |
Nice to hear! Thanks. |
Hi Sebastian, I implemented what we were discussing above! I created a new class
As an example, I have an imaging dataset that consists of 10924 spectra with 10000 bins each:
In comparison, loading the spectra into memory:
I made a lot of changes and updated the documentation files for both |
Awesome! Looks promising! Could you please fork the |
Done. For For |
I fixed |
@dsammour sorry for the delay. I partly incorporated your PR and pushed a working (at least for me) version of the
devtools::install_github("sgibb/MALDIquant@OnDiskVector")
devtools::install_github("sgibb/MALDIquantForeign@OnDiskVector")
library("MALDIquant")
library("MALDIquantForeign")
# example file from
# https://ms-imaging.org/wp/wp-content/uploads/2019/03/S042_Continuous_imzML1.1.1.zip
f <- "S042_Continuous.imzML"
mem <- import(f, verbose=FALSE)
ods <- import(f, attachOnly=TRUE, duplicateFiles=TRUE, verbose=FALSE)
odp <- import(f, attachOnly=TRUE, duplicateFiles=FALSE, verbose=FALSE)
mem[1:2]
# [[1]]
# S4 class type : MassSpectrum
# Number of m/z values : 300
# Range of m/z values : 225 - 249.917
# Range of intensity values: 0e+00 - 8.61e+00
# Memory usage : 8.625 KiB
# File : /home/sebastian/tmp/downloads/S042_Continuous.ibd
#
# [[2]]
# S4 class type : MassSpectrum
# Number of m/z values : 300
# Range of m/z values : 225 - 249.917
# Range of intensity values: 0e+00 - 1.827e+01
# Memory usage : 8.625 KiB
# File : /home/sebastian/tmp/downloads/S042_Continuous.ibd
#
ods[1:2]
# [[1]]
# S4 class type : MassSpectrumOnDisk
# Number of m/z values : 300
# Range of m/z values : 225 - 249.917
# Range of intensity values: 0e+00 - 8.61e+00
# Memory usage : 7.234 KiB
# File : /tmp/Rtmpzl41n8/S042_Continuous6fb97b37cbbb.ibd
#
# [[2]]
# S4 class type : MassSpectrumOnDisk
# Number of m/z values : 300
# Range of m/z values : 225 - 249.917
# Range of intensity values: 0e+00 - 1.827e+01
# Memory usage : 7.234 KiB
# File : /tmp/Rtmpzl41n8/S042_Continuous6fb97b37cbbb.ibd
#
odp[1:2]
# [[1]]
# S4 class type : MassSpectrumOnDisk
# Number of m/z values : 300
# Range of m/z values : 225 - 249.917
# Range of intensity values: 0e+00 - 8.61e+00
# Memory usage : 7.281 KiB
# File : /home/sebastian/tmp/downloads/S042_Continuous.ibd
#
# [[2]]
# S4 class type : MassSpectrumOnDisk
# Number of m/z values : 300
# Range of m/z values : 225 - 249.917
# Range of intensity values: 0e+00 - 1.827e+01
# Memory usage : 7.281 KiB
# File : /home/sebastian/tmp/downloads/S042_Continuous.ibd
# |
@sgibb sorry for the huge delay as well. I just wanted to let you know that I have been working with the version I pushed for some time now ("onDiskVector" and "onDisk_support_for_imzML" branches of MALDIquant and MALDIquantForeign, respectively), and everything seems to work smoothly. Note that I had to change the functionality of some methods to accommodate the MassSpectrumOnDisk objects. For example, the "trim" method does the trimming but returns a MassSpectrum object afterwards (I thought that would be the easiest solution; trimming the file on disk would not make much sense!). However, in the "OnDiskVector" branch of your MALDIquant repository I could not see these changes. Did you have a different solution for the problem? |
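The "trim returns an in-memory object" idea can be sketched with two toy S4 classes. Everything here is hypothetical: the class names `SketchSpectrum` and `SketchSpectrumOnDisk`, the generic `trimSketch`, and the file layout (all masses followed by all intensities, as doubles) are assumptions for illustration only and are not the MALDIquant classes or their actual implementation.

```r
library("methods")

## Hypothetical in-memory and on-disk spectrum classes.
setClass("SketchSpectrum", slots = c(mass = "numeric", intensity = "numeric"))
setClass("SketchSpectrumOnDisk", slots = c(path = "character", n = "integer"))

setGeneric("trimSketch",
           function(object, limits) standardGeneric("trimSketch"))

## In-memory: keep only the values inside the requested m/z limits.
setMethod("trimSketch", "SketchSpectrum", function(object, limits) {
  keep <- object@mass >= limits[1] & object@mass <= limits[2]
  new("SketchSpectrum", mass = object@mass[keep],
      intensity = object@intensity[keep])
})

## On-disk: load the spectrum, then trim and return an in-memory object.
## The file on disk is left untouched.
setMethod("trimSketch", "SketchSpectrumOnDisk", function(object, limits) {
  values <- readBin(object@path, what = "double", n = 2L * object@n)
  idx <- seq_len(object@n)
  inMemory <- new("SketchSpectrum", mass = values[idx],
                  intensity = values[object@n + idx])
  trimSketch(inMemory, limits)
})

## Toy file: 5 masses followed by 5 intensities.
tmp <- tempfile(fileext = ".bin")
writeBin(c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50), tmp)
onDisk <- new("SketchSpectrumOnDisk", path = tmp, n = 5L)

trimmed <- trimSketch(onDisk, limits = c(2, 4))
print(class(trimmed))
unlink(tmp)
```

The design choice mirrored here is that a size-changing operation converts to the in-memory class rather than rewriting the shared file, so other spectra attached to the same file are not affected.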
Sorry for this huge delay. Unfortunately the write support in the Please see this stackoverflow question that describes the problem: @dsammour did it really work for you? |
Ok, I found a solution for the above problem. IMHO it is working now as expected.
devtools::install_github("sgibb/MALDIquant@OnDiskVector")
devtools::install_github("sgibb/MALDIquantForeign@OnDiskVector")
library("MALDIquant")
library("MALDIquantForeign")
# example file from
# https://ms-imaging.org/wp/wp-content/uploads/2019/03/S042_Continuous_imzML1.1.1.zip
f <- "S042_Continuous.imzML"
mem <- import(f)
ods <- import(f, attachOnly=TRUE) |
Thank you very much for following up on the on-disk support. I have tried your code in RStudio Cloud with R 3.6.0. Simple commands such as length(ods) and summary(ods) work. But I get the same error for other commands such as:
|
@foellmelanie thanks for testing. Which kind of error do you get? For me it works as expected:
#devtools::install_github("sgibb/MALDIquant@OnDiskVector")
#devtools::install_github("sgibb/MALDIquantForeign@OnDiskVector")
library("MALDIquant")
#>
#> This is MALDIquant version 1.99.1
#> Quantitative Analysis of Mass Spectrometry Data
#> See '?MALDIquant' for more information about this package.
library("MALDIquantForeign")
# example file from
# https://ms-imaging.org/wp/wp-content/uploads/2019/03/S042_Continuous_imzML1.1.1.zip
download.file(
"https://ms-imaging.org/wp/wp-content/uploads/2019/03/S042_Continuous_imzML1.1.1.zip",
"S042_Continuous_imzML1.1.1.zip"
)
unzip("S042_Continuous_imzML1.1.1.zip")
f <- "S042_Continuous.imzML"
mem <- import(f)
ods <- import(f, attachOnly=TRUE)
avgSpectra <- averageMassSpectra(ods, method="mean")
plot(ods[[1]])
any(sapply(ods, isEmpty))
#> [1] FALSE
ods <- transformIntensity(ods, method="sqrt")
ods <- calibrateIntensity(ods, method="TIC")
plot(ods[[1]]) |
@sgibb sorry, I forgot to paste the error. Then it's probably just a problem with the path because I used RStudio Cloud:
|
Mh, I can't reproduce this error in https://rstudio.cloud/project/950000 |
Thanks @sgibb, all good now, I think something was wrong with my file. I downloaded it again and now it works. Works also for other files. |
Sorry for the late reply, I had a fair share of deadlines to meet. I did two tests: one for the branch I suggested and one for the 'onDiskVector' branch. Both are working properly.
dsammour branch of MALDIquant and MALDIquantForeign
Note: on a Linux-based system.
onDiskVector branch of MALDIquant and MALDIquantForeign
Note: on Windows.
The same plots as above were generated.
Notes
I think it would really make sense to generate a note when loading attached files, as the one I added within importImzML for example:
I would also suggest to keep the following part which guards against the possibility that a
So it will override the attachOnly I also didn't understand how the On another note, the above code seems not to work with imzML files generated by
where |
Hi,
I am using MALDIquant extensively to process MSI datasets. These datasets tend to be extremely large (<50GB when loaded into memory as lists of MassSpectrum objects). The issue here is that each MassSpectrum object stores intensities and corresponding m/z values, which means huge redundancy and unnecessary memory usage since the mass axis is exactly identical for every spectrum. I was wondering if it would be possible to exclude the mass axis from MassSpectrum objects and possibly create a new type of object, say MassAxis, which can be passed as an argument to the processing functions. Obviously, this cannot be applied to MassPeak objects.
Thanks,
Denis
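As a rough back-of-the-envelope check of the redundancy described in this request, the following R sketch compares storing the mass axis once per spectrum with storing it once for the whole data set. It uses the example dataset size mentioned earlier in this thread (10924 spectra with 10000 bins each) and assumes 8-byte doubles and no per-object overhead, so the numbers are estimates only.

```r
nSpectra <- 10924        # spectra in the example dataset from this thread
nBins <- 10000           # m/z bins per spectrum
bytesPerDouble <- 8      # assumption: values are stored as doubles

## every spectrum carries its own mass and intensity vector
bytesPerSpectrumAxis <- nSpectra * nBins * bytesPerDouble * 2

## one shared mass axis plus one intensity vector per spectrum
bytesSharedAxis <- nSpectra * nBins * bytesPerDouble + nBins * bytesPerDouble

ratio <- bytesSharedAxis / bytesPerSpectrumAxis

cat("own axis per spectrum:", round(bytesPerSpectrumAxis / 1024^3, 2), "GiB\n")
cat("shared axis          :", round(bytesSharedAxis / 1024^3, 2), "GiB\n")
cat("ratio                :", round(ratio, 3), "\n")
```

Under these assumptions a shared mass axis needs roughly half the memory, which is consistent with the "roughly halves the memory usage" estimate given in the first reply.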