====== Using FSMap with Zarr v2 for Remote Zip Stores ======

===== Overview =====

This guide explains how to open Zarr archives stored as zip files in remote object storage (MinIO/S3) using the ''FSMap'' class from ''fsspec'' together with Zarr v2.

===== The Problem =====

Zarr v2's native ''ZipStore'' only accepts local file paths, not remote files or file-like objects. To read a ''.zip'' file stored in MinIO/S3, we need an approach that bridges remote storage and Zarr's storage interface.

===== The Solution Architecture =====

The solution uses a chain of abstractions:

  - **S3FileSystem**: Provides filesystem-like access to MinIO/S3 buckets
  - **ZipFileSystem**: Wraps a zip file (local or remote) as a virtual filesystem
  - **FSMap**: Translates filesystem operations into a key-value mapping that Zarr understands

==== Chain of Components ====

<code>
MinIO/S3 Storage
    ↓ (accessed via)
S3FileSystem (treats bucket as filesystem)
    ↓ (opens file)
S3File object (file-like interface to remote zip)
    ↓ (wrapped by)
ZipFileSystem (treats zip contents as filesystem)
    ↓ (mapped to)
FSMap (key-value store interface)
    ↓ (consumed by)
Zarr (reads arrays and metadata)
</code>

===== How It Works =====

==== 1. S3FileSystem ====

''S3FileSystem'' from the ''s3fs'' library provides an ''fsspec''-compatible filesystem interface to S3/MinIO:

<code python>
import s3fs

fs = s3fs.S3FileSystem(
    client_kwargs={
        'endpoint_url': 'https://minio.example.com',
        'verify': '/path/to/ca.crt'
    },
    key='access_key',
    secret='secret_key',
    use_ssl=True
)
</code>

This object lets you interact with MinIO buckets using familiar filesystem operations such as ''fs.open()'' and ''fs.ls()''.

==== 2. ZipFileSystem ====

''ZipFileSystem'' from ''fsspec.implementations.zip'' takes a file object (which can be remote) and exposes the zip archive's internal structure as a filesystem:

<code python>
from fsspec.implementations.zip import ZipFileSystem

# Open the remote zip file
remote_file = fs.open('bucket/path/archive.zip', 'rb')

# Create a filesystem view into the zip
zip_fs = ZipFileSystem(fo=remote_file)
</code>

The ''fo'' parameter accepts any file-like object, including remote files opened through ''S3FileSystem''. ''zip_fs'' now treats the zip's contents as if they were a directory tree.

==== 3. FSMap ====

''FSMap'' from ''fsspec.mapping'' implements Python's ''MutableMapping'' interface (dict-like behavior) on top of any ''fsspec'' filesystem:

<code python>
from fsspec.mapping import FSMap

# Create a mapping store
store = FSMap(root='', fs=zip_fs)
</code>

The ''root=%%''%%'' parameter means "start at the zip's root directory". ''FSMap'' translates dictionary-style access (''store[key]'') into reads on the underlying filesystem (effectively ''zip_fs.open(key, 'rb').read()'').

==== 4. Zarr Integration ====

Zarr v2 expects stores to implement the ''MutableMapping'' interface, which ''FSMap'' provides. When you open a Zarr group:

<code python>
import zarr

root = zarr.open(store, mode='r')
</code>

Zarr performs operations like:

  * ''store['.zgroup']'' → reads the root metadata
  * ''store['array_name/.zarray']'' → reads array metadata
  * ''store['array_name/0.0.0']'' → reads a specific chunk

Each of these lookups passes through the chain ''FSMap'' → ''ZipFileSystem'' → ''S3File'' → MinIO/S3 HTTP request.

===== Complete Example =====

<code python>
import s3fs
import zarr
from fsspec.implementations.zip import ZipFileSystem
from fsspec.mapping import FSMap

# 1. Configure S3/MinIO access
fs = s3fs.S3FileSystem(
    client_kwargs={
        'endpoint_url': 'https://minio.example.com',
        'verify': '/path/to/ca.crt'
    },
    key='your_access_key',
    secret='your_secret_key',
    use_ssl=True
)

# 2. Open the remote zip file as a filesystem
s3_path = 'my-bucket/data/experiment.zarr.zip'
zip_fs = ZipFileSystem(fo=fs.open(s3_path, 'rb'))

# 3. Create a mapping store for Zarr
store = FSMap(root='', fs=zip_fs)

# 4. Open with Zarr
root = zarr.open(store, mode='r')

# 5. Use the Zarr group normally
print(root.tree())
array = root['my_array'][:]
</code>

===== Key-Value Mapping Internals =====

Under the hood, a Zarr zip archive contains files like:

<code>
.zgroup                # Root group metadata
array_name/.zarray     # Array metadata
array_name/0.0.0       # Chunk at position (0,0,0)
array_name/0.0.1       # Chunk at position (0,0,1)
subgroup/.zgroup       # Nested group metadata
</code>

When Zarr evaluates ''store['array_name/0.0.0']'':

  - **FSMap** translates the lookup into the equivalent of ''zip_fs.open('array_name/0.0.0', 'rb').read()''
  - **ZipFileSystem** locates this file in the zip's central directory
  - **ZipFileSystem** reads the member's data from the underlying ''S3File''
  - **S3File** makes an HTTP range request to MinIO
  - The chunk bytes, with any zip-level compression undone, are returned to Zarr

This happens **lazily**: data is fetched only when Zarr actually accesses it.

===== Mode Considerations =====

For read-only access (''mode="r"''), this approach works seamlessly. For write operations, limitations apply:

  * Zip files are **not designed for random write access**
  * ''ZipFileSystem'' in write mode requires recreating the entire zip
  * For remote storage, writing is impractical because the whole archive would have to be downloaded, rewritten, and re-uploaded

**Recommendation**: Use this approach for **read-only** access to pre-created Zarr zip archives.
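
The key-value lookup described under //Key-Value Mapping Internals// can be tried without any network access. The following is a minimal stdlib-only sketch, not the ''fsspec'' implementation: the ''ZipMapping'' class and the ''build_demo_archive'' helper are hypothetical names introduced for illustration, an in-memory ''io.BytesIO'' stands in for the remote ''S3File'', and the ''.zarray'' content is abbreviated rather than a complete Zarr metadata document.

<code python>
import io
import json
import zipfile
from collections.abc import Mapping


class ZipMapping(Mapping):
    """Read-only key-value view over a zip archive.

    Illustrative stand-in for FSMap + ZipFileSystem (hypothetical class,
    not part of fsspec or zarr).
    """

    def __init__(self, fileobj):
        # zipfile only needs a readable, seekable file-like object,
        # which is the same property the remote S3File provides.
        self._zf = zipfile.ZipFile(fileobj, mode="r")
        # Skip directory entries; only real members are valid keys.
        self._names = [n for n in self._zf.namelist() if not n.endswith("/")]

    def __getitem__(self, key):
        try:
            # Looks the member up in the central directory and reads its bytes.
            return self._zf.read(key)
        except KeyError:
            raise KeyError(key) from None

    def __iter__(self):
        return iter(self._names)

    def __len__(self):
        return len(self._names)


def build_demo_archive():
    """Build a tiny zip in memory with a Zarr-v2-like key layout."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, mode="w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr(".zgroup", json.dumps({"zarr_format": 2}))
        # Abbreviated metadata for illustration only.
        zf.writestr("array_name/.zarray", json.dumps({"shape": [4], "chunks": [2]}))
        zf.writestr("array_name/0", b"\x00\x01")
        zf.writestr("array_name/1", b"\x02\x03")
    buf.seek(0)
    return buf


store = ZipMapping(build_demo_archive())
root_meta = json.loads(store[".zgroup"])   # metadata lookup by key
chunk = store["array_name/1"]              # only this member is read
keys = sorted(store)
print(keys)
</code>

Because ''ZipMapping'' derives from ''Mapping'' rather than ''MutableMapping'', it has no ''%%__setitem__%%'', which mirrors the read-only recommendation above. In the real setup the same lookups are served by ''FSMap'' and ''ZipFileSystem'', with each member read turning into range requests against MinIO/S3.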

===== Performance Notes =====

  * **First access**: May be slower because the zip central directory (stored at the end of the archive) must be read first
  * **Chunk reads**: Each chunk access makes a network request (unless cached by ''s3fs'')
  * **Optimization**: ''s3fs'' has built-in read caching; configure it with the ''cache_type'' argument of ''fs.open()'' or ''default_cache_type'' on ''S3FileSystem''
  * **Best for**: Datasets where you don't need to read all chunks (sparse access patterns)

===== Troubleshooting =====

==== "TypeError: expected str, bytes or os.PathLike object" ====

You passed a file object directly to ''zarr.ZipStore'', which only accepts a local path. Use ''ZipFileSystem'' + ''FSMap'' instead.

==== "SSL Certificate Verify Failed" ====

Add the CA certificate path to ''client_kwargs={'verify': '/path/to/cert.pem'}''.

==== Store appears empty ====

Check that ''root=%%''%%'' in ''FSMap'' is correct. If the Zarr hierarchy sits inside a subdirectory of the zip, use ''root="subdirectory/path"''.

===== Alternative: Direct FSStore =====

You can also use ''zarr.storage.FSStore'' instead of ''FSMap'':

<code python>
store = zarr.storage.FSStore(url='', fs=zip_fs)
root = zarr.open(store, mode='r')
</code>

Both ''FSStore'' and ''FSMap'' provide the same ''MutableMapping'' interface. ''FSMap'' is more lightweight and part of core ''fsspec''.

====== RKNS (Zarr V2) from Minio ZIP ======

Run the following script with ''uv run''; the dependencies are resolved automatically. This assumes you have our internal PyPI registry set up with uv.

<code python>
# /// script
# requires-python = ">=3.8"
# dependencies = [
#     "boto3>=1.40.49",
#     "python-dotenv>=0.9.9",
#     "packaging>=25.0",
#     "rkns==0.6.2",
#     "s3fs[boto3]>=2023.12.0",
#     "typing-extensions>=4.15.0",
# ]
# ///
import os
from pathlib import Path

from dotenv import load_dotenv
from fsspec.implementations.zip import ZipFileSystem
from fsspec.mapping import FSMap
import s3fs
import zarr

import rkns

# load credentials from .env file
load_dotenv()
access_key_id = os.getenv("STORAGE_ACCESS_KEY")
secret_access_key = os.getenv("STORAGE_SECRET_KEY")
endpoint_url = os.getenv("ENDPOINT")
endpoint_url_full = os.getenv("ENDPOINT_FULL")

# Specify the path to your custom CA certificate
ca_cert_path = "ca.crt.cer"
assert Path(ca_cert_path).is_file()

# Create s3fs filesystem with custom cert
fs = s3fs.S3FileSystem(
    client_kwargs={"endpoint_url": endpoint_url_full, "verify": str(ca_cert_path)},
    key=access_key_id,
    secret=secret_access_key,
    use_ssl=True,
)

s3_path = "rekonas-dataset-shhs-rkns/sub-shhs200001_ses-01_task-sleep_eeg.rkns"
zip_fs = ZipFileSystem(fo=fs.open(s3_path, "rb"))
store = zarr.storage.FSStore(url='', fs=zip_fs)

rkns_obj = rkns.from_RKNS(store)
print(rkns_obj.tree)
</code>

datalake/minio/zarrv2.txt Last modified: 2025/10/15 16:37 by fabricio