
Using FSMap with Zarr v2 for Remote Zip Stores

This guide explains how to open Zarr archives stored as zip files in remote object storage (MinIO/S3) using fsspec's FSMap with Zarr v2.

Zarr v2's native ZipStore only accepts local file paths, not remote files or file-like objects. To access a .zip file stored in MinIO/S3, we need a different approach that bridges the gap between remote storage and Zarr's storage interface.

The solution uses a chain of abstractions:

  1. S3FileSystem: Provides filesystem-like access to MinIO/S3 buckets
  2. ZipFileSystem: Wraps a zip file (local or remote) as a virtual filesystem
  3. FSMap: Translates filesystem operations into a key-value mapping that Zarr understands

The full chain looks like this:

MinIO/S3 Storage
    ↓ (accessed via)
S3FileSystem (treats bucket as filesystem)
    ↓ (opens file)
S3File object (file-like interface to remote zip)
    ↓ (wrapped by)
ZipFileSystem (treats zip contents as filesystem)
    ↓ (mapped to)
FSMap (key-value store interface)
    ↓ (consumed by)
Zarr (reads arrays and metadata)

S3FileSystem from the s3fs library provides a Python filesystem interface (fsspec) to S3/MinIO:

import s3fs
 
fs = s3fs.S3FileSystem(
    client_kwargs={
        'endpoint_url': 'https://minio.example.com',
        'verify': '/path/to/ca.crt'
    },
    key='access_key',
    secret='secret_key',
    use_ssl=True
)

This object lets you interact with MinIO buckets using familiar filesystem operations like fs.open(), fs.ls(), etc.

ZipFileSystem from fsspec.implementations.zip takes a file object (which can be remote) and exposes the zip archive's internal structure as a filesystem:

from fsspec.implementations.zip import ZipFileSystem
 
# Open the remote zip file
remote_file = fs.open('bucket/path/archive.zip', 'rb')
 
# Create a filesystem view into the zip
zip_fs = ZipFileSystem(fo=remote_file)

The fo parameter accepts any file-like object, including remote files from S3FileSystem. Now zip_fs treats the zip's contents as if they were a directory tree.
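The same mechanics can be exercised entirely in memory, which is a handy way to sanity-check the approach without MinIO. A minimal sketch, using a BytesIO to stand in for the remote file (the archive contents below are made up for illustration):

```python
import io
import zipfile

from fsspec.implementations.zip import ZipFileSystem

# Build a tiny zip archive in memory (contents invented for illustration)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(".zgroup", '{"zarr_format": 2}')
    zf.writestr("array_name/.zarray", "{}")
buf.seek(0)

# Any file-like object works as `fo` -- here a BytesIO replaces the S3File
zip_fs = ZipFileSystem(fo=buf)
names = zip_fs.find("/")  # list every file inside the archive
with zip_fs.open(".zgroup", "rb") as f:
    data = f.read()
print(names, data)
```

Swapping the BytesIO for `fs.open('bucket/path/archive.zip', 'rb')` is the only change needed for the remote case.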

FSMap from fsspec.mapping implements Python's MutableMapping interface (dict-like behavior) on top of any fsspec filesystem:

from fsspec.mapping import FSMap
 
# Create a mapping store
store = FSMap(root='', fs=zip_fs)

The root='' parameter means “start at the zip's root directory”. FSMap now translates dictionary-style access (store[key]) into filesystem operations (zip_fs.open(key)).
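FSMap's dict-like behavior is easy to see with fsspec's in-memory filesystem, without any zip or remote storage involved (paths here are illustrative):

```python
import fsspec
from fsspec.mapping import FSMap

# Write a fake metadata file into fsspec's in-memory filesystem
mem_fs = fsspec.filesystem("memory")
mem_fs.pipe("/data/.zgroup", b'{"zarr_format": 2}')

# Dict-style access rooted at /data: store[key] -> mem_fs reads /data/<key>
store = FSMap(root="/data", fs=mem_fs)
value = store[".zgroup"]
print(value)
```

The same pattern applies unchanged when `fs` is a ZipFileSystem over a remote file.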

Zarr v2 expects stores to implement the MutableMapping interface, which FSMap provides. When you open a Zarr group:

import zarr
 
root = zarr.open(store, mode='r')

Zarr performs operations like:

  • store['.zgroup'] → reads the root metadata
  • store['array_name/.zarray'] → reads array metadata
  • store['array_name/0.0.0'] → reads a specific chunk

Each of these translates through the chain:

  1. FSMap → ZipFileSystem → S3File → MinIO/S3 HTTP request

Complete example:
import s3fs
import zarr
from fsspec.implementations.zip import ZipFileSystem
from fsspec.mapping import FSMap
 
# 1. Configure S3/MinIO access
fs = s3fs.S3FileSystem(
    client_kwargs={
        'endpoint_url': 'https://minio.example.com',
        'verify': '/path/to/ca.crt'
    },
    key='your_access_key',
    secret='your_secret_key',
    use_ssl=True
)
 
# 2. Open the remote zip file as a filesystem
s3_path = 'my-bucket/data/experiment.zarr.zip'
zip_fs = ZipFileSystem(fo=fs.open(s3_path, 'rb'))
 
# 3. Create a mapping store for Zarr
store = FSMap(root='', fs=zip_fs)
 
# 4. Open with Zarr
root = zarr.open(store, mode='r')
 
# 5. Use the Zarr group normally
print(root.tree())
array = root['my_array'][:]

Under the hood, a Zarr zip archive contains files like:

.zgroup                  # Root group metadata
array_name/.zarray       # Array metadata
array_name/0.0.0         # Chunk at position (0,0,0)
array_name/0.0.1         # Chunk at position (0,0,1)
subgroup/.zgroup         # Nested group metadata

When Zarr does store['array_name/0.0.0']:

  1. FSMap translates to zip_fs.open('array_name/0.0.0', 'rb').read()
  2. ZipFileSystem locates this file in the zip's central directory
  3. ZipFileSystem reads the compressed data from the underlying S3File
  4. S3File makes an HTTP range request to MinIO
  5. The decompressed chunk bytes are returned to Zarr

This happens lazily - only when Zarr actually accesses specific data.
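The chunk-read path above can be reproduced locally by letting a BytesIO stand in for the S3File (the chunk bytes are dummy data):

```python
import io
import zipfile

from fsspec.implementations.zip import ZipFileSystem
from fsspec.mapping import FSMap

# A fake archive containing a single "chunk" file with dummy bytes
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("array_name/0.0.0", b"\x01\x02\x03\x04")
buf.seek(0)

# BytesIO replaces the remote S3File; the rest of the chain is identical
store = FSMap(root="", fs=ZipFileSystem(fo=buf))

# This key lookup is exactly what Zarr issues for a chunk read
chunk = store["array_name/0.0.0"]
print(chunk)
```

ZipFileSystem decompresses the entry transparently, so the store returns the original chunk bytes.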

For read-only access (mode='r'), this approach works seamlessly.

For write operations, limitations apply:

  • Zip files are not designed for random write access
  • ZipFileSystem in write mode requires recreating the entire zip
  • For remote storage, writing is impractical: the entire archive would have to be downloaded, rewritten, and re-uploaded

Recommendation: Use this approach for read-only access to pre-created Zarr zip archives.

Performance notes:

  • First access: May be slower due to reading zip central directory
  • Chunk reads: Each chunk access makes a network request (unless cached by s3fs)
  • Optimization: s3fs has built-in caching - configure with cache_type parameter
  • Best for: Datasets where you don't need to read all chunks (sparse access patterns)
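A sketch of the caching knobs, assuming the parameter names from fsspec's buffered-file API (`default_cache_type`, `default_block_size`, `cache_type`); verify them against your installed s3fs/fsspec versions:

```python
import s3fs

# Endpoint and credentials below are placeholders.
# default_cache_type / default_block_size are forwarded to the
# buffered file objects that s3fs opens.
fs = s3fs.S3FileSystem(
    client_kwargs={"endpoint_url": "https://minio.example.com"},
    key="access_key",
    secret="secret_key",
    default_cache_type="readahead",      # favors sequential chunk reads
    default_block_size=8 * 1024 * 1024,  # 8 MiB per range request
)

# Or per file, when opening the zip:
# remote_file = fs.open("bucket/path/archive.zip", "rb", cache_type="mmap")
```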

Troubleshooting:

  • You passed a file object directly to zarr.ZipStore: ZipStore only accepts local file paths. Use ZipFileSystem + FSMap instead.
  • SSL certificate verification fails against MinIO: add the CA certificate path to client_kwargs={'verify': '/path/to/cert.pem'}.
  • Expected keys are missing: check that root='' in FSMap is correct. If the zip has a subdirectory, use root='subdirectory/path'.

You can also use zarr.storage.FSStore instead of FSMap:

store = zarr.storage.FSStore(url='', fs=zip_fs)
root = zarr.open(store, mode='r')

Both FSStore and FSMap provide the same MutableMapping interface. FSMap is more lightweight and part of core fsspec.

RKNS (Zarr v2) from MinIO ZIP

Execute the following with 'uv run'; the dependencies are resolved automatically. This assumes you have our internal PyPI registry set up with uv.

# /// script
# requires-python = ">=3.8"
# dependencies = [
#     "boto3>=1.40.49",
#     "python-dotenv>=0.9.9",
#     "packaging>=25.0",
#     "rkns==0.6.2",
#     "s3fs[boto3]>=2023.12.0",
#     "typing-extensions>=4.15.0",
# ]
# ///
import os
from pathlib import Path
from dotenv import load_dotenv
from fsspec.implementations.zip import ZipFileSystem
from fsspec.mapping import FSMap
import s3fs
import zarr
import rkns
 
# load credentials from .env file
load_dotenv()
access_key_id = os.getenv("STORAGE_ACCESS_KEY")
secret_access_key = os.getenv("STORAGE_SECRET_KEY")
endpoint_url = os.getenv("ENDPOINT")
endpoint_url_full = os.getenv("ENDPOINT_FULL")
 
 
# Specify the path to your custom CA certificate
ca_cert_path = "ca.crt.cer"
assert Path(ca_cert_path).is_file()
 
 
# Create s3fs filesystem with custom cert
fs = s3fs.S3FileSystem(
    client_kwargs={"endpoint_url": endpoint_url_full, "verify": str(ca_cert_path)},
    key=access_key_id,
    secret=secret_access_key,
    use_ssl=True,
)
 
s3_path = "rekonas-dataset-shhs-rkns/sub-shhs200001_ses-01_task-sleep_eeg.rkns"
 
zip_fs = ZipFileSystem(fo=fs.open(s3_path, "rb"))
store = zarr.storage.FSStore(url='', fs=zip_fs)
rkns_obj = rkns.from_RKNS(store)
print(rkns_obj.tree)
datalake/minio/zarrv2.txt · Last modified: 2025/10/15 16:37 by fabricio