
[WIP] support df.to_parquet and df.read_parquet() #165

Open · wants to merge 15 commits into base: main
Conversation

@machichima commented Jan 27, 2025

Add a write() function for BufferedFileSimple, used when calling fsspec.open().

def _open(self, path, mode="rb", **kwargs):
    """Return raw bytes-mode file-like from the file-system"""
    return BufferedFileSimple(self, path, mode, **kwargs)

class BufferedFileSimple(fsspec.spec.AbstractBufferedFile):
    def __init__(self, fs, path, mode="rb", **kwargs):
        if mode != "rb":
            raise ValueError("Only 'rb' mode is currently supported")
        super().__init__(fs, path, mode, **kwargs)

Related to issue #164

@@ -28,6 +28,7 @@
import fsspec.spec

import obstore as obs
from obstore import open_reader, open_writer
Member

If the fsspec classes are async, shouldn't we use the async reader and writer?

Author

Ah yes, I used this because the original BufferedFileSimple inherits from AbstractBufferedFile, which is not async. I will change it to AbstractAsyncStreamedFile and use the async methods.

Author

It seems that df.to_parquet() and df.read_parquet() do not support async; when trying to change to async, I get an error like RuntimeWarning: coroutine 'AbstractAsyncStreamedFile.write' was never awaited.

@machichima machichima changed the title [WIP] fsspec write method for open() [WIP] support df.to_parquet and df.read_parquet() Jan 30, 2025
@machichima (Author)

I found that there's also a bug in checking whether a parquet file exists in info(), so I renamed the title.

@machichima (Author)

Hi @kylebarron ,

I am wondering about the test here. Originally, fs.info("dir") on a directory raised FileNotFoundError, which caused an error when using df.to_parquet(). After fixing that, the line fs.cat("dir", recursive=True) raises FileNotFoundError for "dir": since fs.info("dir") no longer errors, "dir" itself gets processed.

def test_multi_file_ops(fs):
    data = {"dir/test1": b"test data1", "dir/test2": b"test data2"}
    fs.pipe(data)
    out = fs.cat(list(data))
    assert out == data
    out = fs.cat("dir", recursive=True)
    assert out == data
    fs.cp("dir", "dir2", recursive=True)
    out = fs.find("", detail=False)
    assert out == ["afile", "dir/test1", "dir/test2", "dir2/test1", "dir2/test2"]
    fs.rm(["dir", "dir2"], recursive=True)
    out = fs.find("", detail=False)
    assert out == ["afile"]

Should I try to make its output {"dir/test1": b"test data1", "dir/test2": b"test data2"} here? That requires overriding _cat() in fsspec as follows:

    async def _cat(
        self, path, recursive=False, on_error="raise", batch_size=None, **kwargs
    ):
        paths = await self._expand_path(path, recursive=recursive)
        coros = [self._cat_file(path, **kwargs) for path in paths if not self._isdir(path)]   # ignore dir for cat_file
        batch_size = batch_size or self.batch_size

Referring to fsspec, it simply gives FileNotFoundError when doing so. Maybe we can just remove this line, or assert that FileNotFoundError is raised?

@kylebarron (Member)

@martindurant wrote that test and is obviously more familiar with fsspec than I am... @martindurant do you have any suggestions here?

@martindurant (Contributor)

Keep pinging me until I have a chance to look at this :)

"""Return raw bytes-mode file-like from the file-system"""
assert mode in (
    "rb",
    "wb",
), f"Only 'rb' and 'wb' mode is currently supported, got: {mode}"

_, path = self._split_path(path)
Member

We should assert that the bucket of the path matches the bucket of the store.

Author

Hi @kylebarron ,
I ran print(dir(store)) and it outputs:

['__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'from_env', 'from_session', 'from_url']

It seems like obstore does not provide any attribute to get the bucket name from the store instance. Maybe we should add something like this in Rust?

#[getter]
fn bucket(&self) -> Option<String> {
    self.config.bucket.clone()
}

Member

For the present release I think we could add a private method _bucket; however, bucket won't always be defined, such as when the store is created from a URL.

Member

This is fixed in #210, so you can now access the bucket information.

@martindurant (Contributor) left a comment

I made some comments on the code as it stands.

However, the outstanding issue is: how to construct these instances via fsspec.open(). It would mean

  • registering each of the expected protocols (s3, gs, ab) to override the fsspec default ones. Perhaps a top-level function in obstore would do this explicitly (I wouldn't do it implicitly on import).
  • writing a _get_kwargs_from_urls to create the right obstore instance for the given path(s), including the bucket. This would also be a way to stash the value of the bucket, for later asserting the paths are right.

The alternative way, annoying for the user, would be to explicitly pass a premade instance with filesystem= (sometimes fs=) to the given loading function.

@@ -81,6 +83,32 @@ def __init__(
*args, asynchronous=asynchronous, loop=loop, batch_size=batch_size
)

def _split_path(self, path: str) -> Tuple[str, str]:
Contributor

I would call this _split_bucket to avoid confusion with fsspec's _split_protocol.

Author

I named it _split_path to align with the naming in s3fs, as this function does the same thing as s3fs's split_path.

Comment on lines +107 to +110
path_li = path.split("/")
bucket = path_li[0]
file_path = "/".join(path_li[1:])
return (bucket, file_path)
Contributor

Suggested change
path_li = path.split("/")
bucket = path_li[0]
file_path = "/".join(path_li[1:])
return (bucket, file_path)
return path.split("/", 1)

would do this; but what about the "://" when the protocol is included?

Author

I added the following code in #198 but haven't synced it here yet. This solves the :// issue:

if path.startswith(self.protocol + "://"):
    path = path[len(self.protocol) + 3 :]
elif path.startswith(self.protocol + "::"):
    path = path[len(self.protocol) + 2 :]
path = path.rstrip("/")
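Putting the pieces together, a standalone sketch of the whole split (function name and edge-case behavior are illustrative, not the PR's actual _split_path):

```python
def split_bucket(path: str, protocol: str = "s3") -> tuple[str, str]:
    """Split "bucket/key" (optionally "s3://bucket/key") into its parts.

    Combines the `path.split("/", 1)` suggestion with protocol
    stripping; a sketch only, the PR's method may differ.
    """
    # strip an optional "protocol://" prefix
    if path.startswith(protocol + "://"):
        path = path[len(protocol) + 3 :]
    path = path.strip("/")
    if "/" not in path:
        # a bare bucket name has an empty key
        return path, ""
    bucket, key = path.split("/", 1)
    return bucket, key
```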

Contributor

fsspec also allows "s3a" for the same thing - I don't know if you want to allow that.

Also, I don't know how "::" can appear here - it exists to join path elements, not protocol to path.

Author

Oh, I actually copied this from fsspec and thought obstore might do something similar. It looks like obstore does not need this; I'll take it out then.

"version": head["version"],
}
except FileNotFoundError:
# try ls, refer to the info implementation in fsspec
Contributor

Why does this PR need the extra code? Are you trying open() with globs? I don't know the details of head_async, whether it might already achieve this.

Author

For code that stores the parquet as file.csv/00000, file.csv/00001, etc., reading file.csv/ from S3 makes info() raise FileNotFoundError. As I understand it, an S3 "folder" is not an object but a key prefix, which is what causes this error. So I added the fallback code for when head_async raises FileNotFoundError.
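The prefix-vs-object distinction can be shown with a toy in-memory store (names and structure are illustrative only, not the PR's code):

```python
# Toy in-memory "object store": S3 has no real directories, so
# "file.csv" below exists only as a shared key prefix.
objects = {
    "file.csv/00000": b"part 0",
    "file.csv/00001": b"part 1",
}

def head(key: str) -> dict:
    """HEAD succeeds only for real objects, never for bare prefixes."""
    if key not in objects:
        raise FileNotFoundError(key)
    return {"size": len(objects[key]), "type": "file"}

def info(key: str) -> dict:
    """head() with a listing fallback, mirroring the fix described above."""
    try:
        return {"name": key, **head(key)}
    except FileNotFoundError:
        # fall back to listing: any key under the prefix means "directory"
        prefix = key.rstrip("/") + "/"
        if any(k.startswith(prefix) for k in objects):
            return {"name": key, "size": 0, "type": "directory"}
        raise
```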

Contributor

OK, the same old "is it a folder" problem - I am well familiar with this.

obstore/python/obstore/fsspec.py (resolved comments)
else:
    return False

def close(self):
Contributor

Should also set self.closed = True
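A minimal standalone sketch of the close() contract being suggested (class name and buffer layout are illustrative, not the PR's code):

```python
class BufferedWriter:
    """Toy writer showing why close() should set self.closed = True."""

    def __init__(self):
        self.closed = False
        self._buffer = bytearray()
        self.committed = b""  # stands in for the uploaded object

    def write(self, data: bytes) -> int:
        if self.closed:
            raise ValueError("I/O operation on closed file")
        self._buffer += data
        return len(data)

    def flush(self):
        self.committed += bytes(self._buffer)
        self._buffer.clear()

    def close(self):
        if self.closed:      # idempotent: fsspec may call close() twice
            return
        self.flush()         # push any remaining buffered bytes
        self.closed = True   # later writes now fail fast
```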

obstore/python/obstore/fsspec.py (resolved comment)
"""
Called every time fsspec flushes the write buffer
"""
if self.buffer and len(self.buffer.getbuffer()) > 0:
Contributor

It shouldn't be possible to get here without this condition being True

Author

I think we can remove the self.buffer check here, but we might still need len(self.buffer.getbuffer()) > 0 for flush(force=True) on close, which does not guarantee the buffer contains data. See: https://github.com/fsspec/filesystem_spec/blob/f30bc759f30327dfb499f37e967648f175750fac/fsspec/spec.py#L2041

Co-authored-by: Martin Durant <martindurant@users.noreply.github.com>
@kylebarron (Member)

I made some comments on the code as it stands.

Thank you!

  • registering each of the expected protocols (s3, gs, ab) to override the fsspec default ones. Perhaps a top-level function in obstore would do this explicitly (I wouldn't do it implicitly on import).

I'm in favor of this approach. I definitely wouldn't do it implicitly on import, but I'd propose we have obstore.fsspec.register() which would register these protocols with fsspec's registry.

@machichima (Author)

machichima commented Feb 6, 2025

I made some comments on the code as it stands.

However, the outstanding issue is: how to construct these instances via fsspec.open(). It would mean

  • registering each of the expected protocols (s3, gs, ab) to override the fsspec default ones. Perhaps a top-level function in obstore would do this explicitly (I wouldn't do it implicitly on import).
  • writing a _get_kwargs_from_urls to create the right obstore instance for the given path(s), including the bucket. This would also be a way to stash the value of the bucket, for later asserting the paths are right.

The alternative way, annoying for the user, would be to explicitly pass a premade instance with filesystem= (sometimes fs=) to the given loading function.

Hi @martindurant ,

I've opened a new draft PR for this to ensure consistency in how instances are constructed across methods. My goal is to align the usage with fsspec.

With this PR, obstore can be registered as an fsspec storage backend using:

fsspec.register_implementation("s3", S3FsspecStore)

The bucket is extracted from the file path and used as a cache key when creating obstore objects. Here's an example usage that I would like to achieve:

fsspec.register_implementation("s3", S3FsspecStore)
fs: AsyncFsspecStore = fsspec.filesystem(
    "s3",
    config={
        "endpoint": "http://localhost:30002",
        "access_key_id": "minio",
        "secret_access_key": "miniostorage",
        "virtual_hosted_style_request": True,  # path contains bucket name
    },
    client_options={"timeout": "99999s", "allow_http": "true"},
    retry_config={
        "max_retries": 2,
        "backoff": {
            "base": 2,
            "init_backoff": timedelta(seconds=2),
            "max_backoff": timedelta(seconds=16),
        },
        "retry_timeout": timedelta(minutes=3),
    },
)

fs.cat_file("my-s3-bucket/test.txt")

Does this align with your expectations? Please let me know if you have any suggestions!
Thanks!

@machichima (Author)

I'm in favor of this approach. I definitely wouldn't do it explicitly on import, but I'd propose we have obstore.fsspec.register() which would register these protocols with fsspec's registry.

Hi @kylebarron ,

I think we can directly use fsspec's registration for this, used as fsspec.register_implementation("s3", AsyncFsspecStore). Or do you mean something like obstore.fsspec.register("s3"), so that we do not need to create more classes inheriting from AsyncFsspecStore?

"version": head["version"],
}
except FileNotFoundError:
# try ls, refer to the info implementation in fsspec
@kylebarron (Member) commented Feb 6, 2025

Do we need to duplicate this from upstream? Can we just call self.info, to call the upstream code without vendoring it here?

Author

I'll try this out

Author

It works; I updated it here:

except FileNotFoundError:
    # use info in fsspec.AbstractFileSystem
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, super().info, path, **kwargs)
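The same pattern in a runnable, self-contained form (blocking_info is a stand-in for fsspec's synchronous AbstractFileSystem.info, and the return value here is invented for illustration):

```python
import asyncio

def blocking_info(path: str) -> dict:
    # stand-in for the synchronous upstream AbstractFileSystem.info
    return {"name": path, "type": "directory", "size": 0}

async def info_with_fallback(path: str) -> dict:
    # Run the blocking call in a worker thread so the event loop
    # isn't stalled, mirroring the run_in_executor snippet above.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_info, path)

result = asyncio.run(info_with_fallback("dir"))
```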

@kylebarron (Member)

Or do you mean that we can do something like: obstore.fsspec.register("s3") so that we do not need to create more classes inherit from AsyncFsspecStore?

I like this because it means that our fsspec subclasses could potentially stay private. So in theory the only API exported from obstore.fsspec would be register(). In practice, that might not be enough for all fsspec use cases.

But overall I think having obstore.fsspec.register, even if that function is a one-liner that wraps fsspec.register, is useful for simplicity.
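A sketch of what such a one-liner wrapper could look like. The registry= parameter exists only so the sketch can be exercised without touching fsspec's global state, and the protocol list follows the "(s3, gs, ab)" set mentioned above; the real obstore.fsspec.register may look different:

```python
def register(cls, protocols=("s3", "gs", "ab"), *, registry=None):
    """Register `cls` as the fsspec implementation for each protocol.

    By default this calls fsspec.register_implementation(); passing a
    plain dict as `registry` lets the sketch run without fsspec.
    """
    for protocol in protocols:
        if registry is not None:
            registry[protocol] = cls
        else:
            import fsspec
            # clobber=True overrides fsspec's default implementations
            fsspec.register_implementation(protocol, cls, clobber=True)
```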

@martindurant (Contributor)

I think having obstore.fsspec.register, even if that function is a one-liner that wraps fsspec.register

Exactly what I was thinking: the user can call register themselves as in the example above, but it would be useful to provide a utility function that knows what to register, so the user only needs to call one thing once.

@machichima (Author)

I will continue on this once this PR is merged, so that we can use the new way to construct the obstore instance in open() too.
