Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Native zip support #45434

Open
arcanis opened this issue Nov 11, 2022 · 33 comments
Open

Native zip support #45434

arcanis opened this issue Nov 11, 2022 · 33 comments
Labels
feature request Issues that request new features to be added to Node.js. never-stale Mark issue so that it is never considered stale zlib Issues and PRs related to the zlib subsystem.

Comments

@arcanis
Copy link
Contributor

arcanis commented Nov 11, 2022

What is the problem this feature will solve?

This issue is to test the water a little bit.

Node supports zlib natively, which takes care of the compression / decompression aspect of sharing files, but it doesn’t include any primitive allowing to manipulate file archives of any kind. It can be implemented in userland (and in a project of mine we’re doing this by compiling the libzip to wasm), but it’s significantly slower (both in runtime, boot time, and size).

Being able to read file archives is especially useful when coupled with the ESM loaders, since it then becomes possible to implement loading modules straight from packaged vendors - a more performant alternative to bundled packages which doesn’t require to precompile the code and allows distributing it along with non-JS files.

What is the feature you are proposing to solve the problem?

I’d like to investigate whether it’d make sense to implement a basic archive manipulation library in Node.js. This isn’t unheard of, multiple interpreters having similar capabilities, sometimes even supporting multiple formats:

The API of this basic archive library would be to define, but generally we can assume some simple requirements:

  • Read an archive’s directory listing
  • Stat files from within an archive
  • Read a buffer from an archive
  • Add a file to an archive

I’d be interested to explore an implementation myself.

What alternatives have you considered?

No response

@nodejs/zlib

@arcanis arcanis added the feature request Issues that request new features to be added to Node.js. label Nov 11, 2022
@bnoordhuis
Copy link
Member

You should probably also enumerate what zip file features you think are in or out of scope. To name a few: large archive support, encryption, bzip2/lzma compression, etc.

@arcanis
Copy link
Contributor Author

arcanis commented Nov 12, 2022

I'd tend to see the following:

Filesystem access (👎 out of scope)

Node already has primitives allowing to access the filesystem, so the native archive implementation shouldn't open/write files directly. It should only accept and output raw buffers, leaving it up to the user to retrieve them (usually with readFile).

A post-mvp follow-up might be to make it work with fs.open file descriptors, if we see significant improvements in perfs by doing so (I can imagine it being useful when you only want to read a couple of files in a large zip archive).

Arbitrary compression (👍 in scope)

Generating data would require to provide the API with pre-compressed buffers (and algorithm), whereas reading data would yield the compressed data - that users would then uncompress themselves. This would also make it possible for users to generate stable archives (same input => same output), which is more difficult if the compression is left to Node.

Would it be worthwhile to have a fast path for deflate, which is the most common compression out there?

Stat / External attributes (👍 in scope)

The API should offer a way to set the files' stat, including external attributes (required for symlink support).

Encryption (🔮 post-mvp scope)

It should be technically fairly easy to do via libzip. Main consideration would be how basic we want the first iteration to be (if we want to start small with only the features that 95% of consumers would use, then encryption might not be needed).

Large Archives (🔮 post-mvp scope)

Zip is limited to 32 bits in a couple of places, but libzip supports Zip 64 which doesn't have those problems. Same thing as the previous point, I'd tend to see it as something that we could expose, but that I'd see as a second iteration (although unlike encryption it probably doesn't impact the public API much).

Splitting a single archive into multiple files (.zip.001, .zip.002, ...) is out of scope.

@GeoffreyBooth
Copy link
Member

I just discovered this issue. Here’s what I’m currently doing; I’d love to be able to do this with zlib instead. My use case is generating a .zip file for upload to AWS lambda, and as this is part of a CI script it would be convenient to be able to do this without any third-party dependencies.

import { readFile } from 'node:fs/promises'
import jszip from 'jszip'

const code = await readFile('./index.js', 'utf8')
const zip = new jszip()
zip.file('index.js', code, { unixPermissions: 755 })
const zipBuffer = await zip.generateAsync({
  type: 'nodebuffer',
  compression: 'DEFLATE',
  platform: 'UNIX',
})

@GeoffreyBooth GeoffreyBooth added the zlib Issues and PRs related to the zlib subsystem. label Nov 22, 2022
@richajak
Copy link

Another approach is by spawning a native shell tool from Linux. In this way, we do not have to install any third-party js dependencies.

first, install the zip tool (e.g. debian),
$ sudo apt-get install zip

then, in the code, simply call
execSync('zip test.zip test.txt')

https://packages.debian.org/bullseye/zip
https://exploringjs.com/nodejs-shell-scripting/ch_nodejs-child-process.html#synchronous-helper-functions-based-on-spawnasync

@MoLow
Copy link
Member

MoLow commented Nov 23, 2022

Another approach is by spawning a native shell tool from Linux. In this way, we do not have to install any third-party js dependencies.

that will only work for linux/unix , and only assuming the tool is installed - which is not the case for many ditros

@arcanis
Copy link
Contributor Author

arcanis commented Nov 28, 2022

I started a prototype implementation over at #45651

@mscdex
Copy link
Contributor

mscdex commented Nov 28, 2022

-1 I know this is an unpopular opinion, but I'm not convinced this should live in node core. Sure it's a common file format, but then again so are tar, rar, 7z, zstd, xz, and others, which also don't belong in node core.

I feel like adding such modules to node core is further leading to feature creep. I get that other platforms like PHP and such may have zip modules, but they also include a ton of other modules that make them "kitchen sink" platforms, which I would hate to see node.js become.

@arcanis
Copy link
Contributor Author

arcanis commented Nov 28, 2022

I understand the general concern, and I wouldn't want to increase the Node.js surface if the value wasn't well identified. But while some modules can happen in userland, I think archive support (whether it's zip or something else) is important:

  • Archive manipulation can be a security risk if done incorrectly. As far as I know, most JS implementations of zip / tar parsing have been vulnerable through their history to some sort of path traversal attack (some recently). At times they even made their ways into package managers - and some of these vulnerabilities were reported through the Node.js security program. Having a trustable common implementation in the core would be beneficial to the ecosystem's security.

  • It answers a fairly common need: how to store multiple files / data structures without encumbering the filesystem? This need becomes more pressing now with the ESM loaders on their way to graduation: loading modules from archives rather than the filesystem will be a common pattern part of the critical bootstrap path (Yarn can already generate such a loader when requested).

    • Another area of exploration - early, but promising - are packaged Node.js applications. For example node-sea already allows packing a full package into a single self-executable Node.js script (using the WASM Zip library I mention below).
  • Parsing archives is one thing, but doing efficiently is another. While WASM is an option (it's what we currently use), it makes the startup slower, blocks the shutdown, the memory footprint much larger, increases the risks of memory leaks, and GC integration is subpar at best. As I mentioned, the need to read from archives will be a regular occurence in Node.js loaders, and primitives making this work easier and more stable would be welcomed.

To summarize, I think an archive manipulation library would have a reasonable amount of practical use cases, for a comparatively low cost. It may not be a feature commonly used by end-users, but I expect tooling authors to benefit from it in ways that will still benefit end-users.

Sure it's a common file format, but then again so are tar, rar, 7z, zstd, xz, and others, which also don't belong in node core.

I don't see it as a slippery slope situation - this feature request is about providing efficient support for a use case that cannot be implemented efficiently today. Once this use case is fulfilled, regardless which format is picked, you will then be able to point out to future requesters that Node.js already supports an archive format well - which isn't the case today.

As for why zip, I went with it for a couple of reasons:

  • It's a format integrated with most OS distributions, interpreters, and text editors.
  • It supports various compression algorithms (including none) on a per-file basis.
  • It offers random access capabilities, which isn't the case of tar, and is important for the loader / packaged app use case.

Other formats may have better compression, but since the goal of this work is first and foremost to address use cases, I went with the format that offered a good compromise between popularity and feature set.

@tniessen
Copy link
Member

Archive manipulation can be a security risk if done incorrectly. (...) Having a trustable common implementation in the core would be beneficial to the ecosystem's security.

I'm pretty sure we've heard this argument for just about any proposed feature. In this issue alone, the same thing can be said about tar, xz, etc., which is the slippery slope that @mscdex mentioned.

Parsing archives is one thing, but doing efficiently is another. While WASM is an option (it's what we currently use), it makes the startup slower, the memory footprint much larger, increases the risks of memory leaks, and GC integration is subpar at best.

This can, again, be said about everything that's best implemented as a native C++ addon right now. Native addons are a burden, and WebAssembly can be a meaningful alternative, but we can't add everything to Node.js that would otherwise require a native addon.

Once this use case is fulfilled, regardless which format is picked, you will then be able to point out to future requesters that Node.js already supports an archive format well

I might be mistaken but I generally consider Linux-like operating systems the primary domain of Node.js, maybe with the exception of Electron apps. And isn't zip terrible at preserving unix file attributes, for example? I don't see how we could possibly reject requests for tar or other archive formats by pointing to zip.

@arcanis
Copy link
Contributor Author

arcanis commented Nov 28, 2022

This can, again, be said about everything that's best implemented as a native C++ addon right now. Native addons are a burden, and WebAssembly can be a meaningful alternative, but we can't add everything to Node.js that would otherwise require a native addon.

I'm not aware of other Node.js C++ addons that would pretend to be used in the critical path of loading an entire Node.js application. Please consider the arguments I exposed as a whole - I'm sure we can find reasons why any single one wouldn't be a good enough reason if alone, but once taken all together I find they make a lot more sense.

And isn't zip terrible at preserving unix file attributes, for example?

Symlinks can be represented just fine, mtime as well, and chmod equally. It doesn't support uid / gid / dev, but I never found that a limitation - at least not one I saw when I used it in my tools, which are used in production across all systems. And since you mention Electron apps, ASAR archives don't contain any of these metadata, with the exception of the +x bit. That seems to satisfy their needs just fine.

To be clear, I'm not rejecting tar, or any other format. As I said, since the goal of this work is first and foremost to address use cases, I went with the format that offered a good compromise between popularity and feature set. Random access is a very important feature, as it's what supports the use cases I described.

@GeoffreyBooth
Copy link
Member

For me this feels like a missing feature that we should add. We have utilities for all sorts of compression and decompression algorithms in zlib; I thought for quite a while that some combination of them would work for creating a .zip file, and I was surprised and frustrated when I discovered that it was impossible. You could argue that maybe we never should have added zlib, but we did, and therefore I think that we should support what’s one of the most popular archive formats in use.

@mscdex
Copy link
Contributor

mscdex commented Nov 29, 2022

We have utilities for all sorts of compression and decompression algorithms in zlib

I believe the original reason for adding a zlib binding was for the purposes of compressing HTTP responses, not just for the sake of having common compression formats available. Now you might argue the relevancy of deflate for HTTP responses in modern day HTTP servers, but that was most likely why it was originally added (back in 2011).

If zlib had some kind of built-in support for zip, that would be one thing and would mean we wouldn't need to pull in an additional dependency and such, but that's not the case today.

@mscdex
Copy link
Contributor

mscdex commented Nov 29, 2022

I should add that there is minizip in the zlib source code, but I'm not entirely sure of the quality/reliability of that code (since it lives in the contrib/ dir) and/or if its feature set/performance would suffice for everyone interested in zip support.

Additionally I'm not sure if the Chromium team would want to keep the minizip files up to date since I presume they're not actively using them? Otherwise we'd have to pull in minizip changes from the madler/zlib repo.

@arcanis
Copy link
Contributor Author

arcanis commented Nov 29, 2022

I believe the original reason for adding a zlib binding was for the purposes of compressing HTTP responses, not just for the sake of having common compression formats available.

And the reason for adding an archive binding is for the purpose of efficiently storing and transferring multiple files inside portable packages, not just for the sake of having common archive formats available 🙂

If zlib had some kind of built-in support for zip, that would be one thing and would mean we wouldn't need to pull in an additional dependency and such, but that's not the case today.

The zlib project only covers the deflate algorithm, not file archives or other compression algorithms1, so the minizip in their contrib folder has little to no maintenance; its very website suggests minizip-ng instead (or libzip, which is used in both Chrome and the PHP-Zip library).

Footnotes

  1. Which wasn't a problem in Feature Idea: Brotli support in core #18964, where Brotli support was added via a new dep on brotli, and whose folder received a single commit since then - it seems to support that file formats have relatively low maintenance cost.

@mscdex
Copy link
Contributor

mscdex commented Nov 29, 2022

And the reason for adding an archive binding is for the purpose of efficiently storing and transferring multiple files inside portable packages

HTTP was a core feature of node from the beginning, storing and transferring multiple files inside a portable package was not.

  1. Which wasn't a problem in Feature Idea: Brotli support in core #18964, where Brotli support was added via a new dep on brotli, and whose folder received a single commit since then - it seems to support that file formats have relatively low maintenance cost

For the record, I wasn't keen on adding brotli either, for similar reasons as zip.

@tniessen
Copy link
Member

I'm not aware of other Node.js C++ addons that would pretend to be used in the critical path of loading an entire Node.js application. Please consider the arguments I exposed as a whole - I'm sure we can find reasons why any single one wouldn't be a good enough reason if alone, but once taken all together I find they make a lot more sense.

And the reason for adding an archive binding is for the purpose of efficiently storing and transferring multiple files inside portable packages, not just for the sake of having common archive formats available 🙂

If the experimental SEA implementation moves forward to the point where Node.js internally needs support for any specific archive format, be it ASAR, tar, or even zip, that is still not the same as adding a public API for a specific archive format. If there is a requirement for user code to access archive members when packaged as some SEA, that also does not necessarily require a general-purpose API for said archive format.

@arcanis
Copy link
Contributor Author

arcanis commented Nov 29, 2022

SEA-related experiments are a use case, not the use case. This issue is about the general need, not an Node.js-internal one.

@bnoordhuis
Copy link
Member

Now you might argue the relevancy of deflate for HTTP responses in modern day HTTP servers, but that was most likely why it was originally added (back in 2011).

Specifically, it was added so npm could do its thing.

(Hi, node's living memory here!)

@tniessen
Copy link
Member

I just discovered this issue. Here’s what I’m currently doing; I’d love to be able to do this with zlib instead.

We have utilities for all sorts of compression and decompression algorithms in zlib; I thought for quite a while that some combination of them would work for creating a .zip file, and I was surprised and frustrated when I discovered that it was impossible. You could argue that maybe we never should have added zlib, but we did, and therefore I think that we should support what’s one of the most popular archive formats in use.

IMHO, the fact that archives (tar) or members of archives (zip) are commonly compressed does not mean that supporting compression is any justification for supporting archives.

@GeoffreyBooth
Copy link
Member

GeoffreyBooth commented Nov 29, 2022

IMHO, the fact that archives (tar) or members of archives (zip) are commonly compressed does not mean that supporting compression is any justification for supporting archives.

I think the average user probably doesn’t grasp the distinction between compressed data and a compressed archive. I was one such user who thought that zlib could do what I wanted; it even has an unzip function.

We don’t have clear criteria (that I’m aware of) for what belongs in core versus not, and lately we’ve been relaxing a bit from the project’s earlier “small core” devotion. It’s obviously always debatable, but this does have several aspects that usually lean in favor of inclusion: it’s something that benefits from native compilation, it’s related to existing APIs, and it covers a use case commonly handled in scripts (where importing third-party dependencies is cumbersome). Yes Node scripts could call their platforms’ zip utility but having this functionality included ensures it’s both available and will work cross-platform.

Ultimately it’s a subjective call whether it’s useful enough, and I think we usually settle decisions like those via votes since that’s a hard question to reach consensus on.

@tniessen
Copy link
Member

tniessen commented Dec 4, 2022

I think the average user probably doesn’t grasp the distinction between compressed data and a compressed archive. I was one such user who thought that zlib could do what I wanted; it even has an unzip function.

I don't know if this really is a common misconception. I don't remember seeing many feature requests related to this.

(node:zlib should already be sufficient for managing every aspect of compression of tar archives, only the archive format itself must be implemented. zip, on the other hand, isn't a compressed archive but rather an archive of (potentially) compressed files, so it is likely more difficult to implement based on node:zlib.)

it covers a use case commonly handled in scripts (where importing third-party dependencies is cumbersome). Yes Node scripts could call their platforms’ zip utility but having this functionality included ensures it’s both available and will work cross-platform.

Scripts are an interesting use case. Again, I have no data to base this claim on, but I would assume that Node.js scripts much more often run on non-Windows platforms than on Windows. On non-Windows platforms, tar is almost universally available whereas zip often is not. (On Windows, unfortunately, the opposite is true.)

Consider, for example, the scripts that produce Node.js builds -- the Node.js 19.2.0 download directory contains 19 tar files (source archives, binaries for non-Windows platforms) and only 2 zip files (binaries for Windows).

This does not seem in line with @arcanis's earlier argument:

Once this use case is fulfilled, regardless which format is picked, you will then be able to point out to future requesters that Node.js already supports an archive format well - which isn't the case today.

@richardlau
Copy link
Member

Scripts are an interesting use case. Again, I have no data to base this claim on, but I would assume that Node.js scripts much more often run on non-Windows platforms than on Windows. On non-Windows platforms, tar is almost universally available whereas zip often is not. (On Windows, unfortunately, the opposite is true.)

FWIW Windows 10 got tar some years ago:

@Neustradamus

This comment was marked as off-topic.

@github-actions
Copy link
Contributor

github-actions bot commented Jun 3, 2023

There has been no activity on this feature request for 5 months and it is unlikely to be implemented. It will be closed 6 months after the last non-automated comment.

For more information on how the project manages feature requests, please consult the feature request management document.

@github-actions github-actions bot added the stale label Jun 3, 2023
@arcanis
Copy link
Contributor Author

arcanis commented Jun 3, 2023

I still want to bring it to the finish line; last blocker so far is a memory leak reported by Asan which I found difficult to investigate. I'll make another attempt in the coming weeks.

@tniessen tniessen removed the stale label Jun 3, 2023
@tniessen
Copy link
Member

tniessen commented Jun 3, 2023

@arcanis I don't think the memory leak is the main issue. There simply is no consensus that this should be added.

@arcanis
Copy link
Contributor Author

arcanis commented Jun 3, 2023

I think both pros and cons have been clearly stated, yes. As you say, consensus on yes/no hasn't been reached yet. My understanding is that in cases like this whether to merge it or not will be a TSC decision.

Given that the major concerns you and other have raised are around scope and maintainability, I'd prefer to avoid theoretical back and forth, and have a working prototype showing exactly what's the envisioned scope and maintenance burden At the very least, this might prove I'm committed to maintain it myself on the long term, and be enough to address some concerns on this side.

Worst case, I lose some of my time, but I believe this feature is worth giving a real try, and seeing the upvotes, I'm not alone 🙂

(In any case, my previous comment was mostly intended to avoid stale close, not to add new elements to the discussion)

Copy link
Contributor

github-actions bot commented Dec 1, 2023

There has been no activity on this feature request for 5 months and it is unlikely to be implemented. It will be closed 6 months after the last non-automated comment.

For more information on how the project manages feature requests, please consult the feature request management document.

@Elkfoot2
Copy link

I would very much like to see this feature.

Copy link
Contributor

There has been no activity on this feature request for 5 months. To help maintain relevant open issues, please add the never-stale Mark issue so that it is never considered stale label or close this issue if it should be closed. If not, the issue will be automatically closed 6 months after the last non-automated comment.
For more information on how the project manages feature requests, please consult the feature request management document.

@github-actions github-actions bot added the stale label Jun 12, 2024
@avivkeller avivkeller moved this from Awaiting Triage to Triaged in Node.js feature requests Jun 26, 2024
@BurningEnlightenment
Copy link

🪄 unstale: The basic feature set outlined in the OP would cover my needs and would allow me to drop a native addon dependency.

@github-actions github-actions bot removed the stale label Jun 30, 2024
Copy link
Contributor

There has been no activity on this feature request for 5 months. To help maintain relevant open issues, please add the never-stale Mark issue so that it is never considered stale label or close this issue if it should be closed. If not, the issue will be automatically closed 6 months after the last non-automated comment.
For more information on how the project manages feature requests, please consult the feature request management document.

@github-actions github-actions bot added the stale label Dec 27, 2024
@Neustradamus
Copy link

To remove stale...

@GeoffreyBooth GeoffreyBooth added never-stale Mark issue so that it is never considered stale and removed stale labels Dec 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request Issues that request new features to be added to Node.js. never-stale Mark issue so that it is never considered stale zlib Issues and PRs related to the zlib subsystem.
Projects
Development

No branches or pull requests