Skip to content

Helpful behaviour when classic-raster mode hits a multidim store #26

Description

@mdsumner

Opening a multidim store with multidim=False currently returns a meaningless empty sentinel rather than telling the user what's actually going on. GDAL's classic raster API on a multidim store gets a stub dataset back with no data variables, made-up x/y coordinates (0.5 to 511.5), and no indication that they probably wanted multidim=True or a specific 2D slice via subdataset syntax.

import os
os.environ["AWS_NO_SIGN_REQUEST"] = "YES"
os.environ["AWS_REGION"] = "us-west-2"

import gdalxarray
from gdalxarray import GDALBackendEntrypoint
backend = GDALBackendEntrypoint()

# Multidim store, classic mode:
xds = backend.open_dataset(
    "/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk",
    multidim=False,
)

# <xarray.Dataset> Size: 8kB
# Dimensions:  (x: 512, y: 512)
# Coordinates:
#   * x        (x) float64 4kB 0.5 1.5 2.5 ... 510.5 511.5
#   * y        (y) float64 4kB 0.5 1.5 2.5 ... 510.5 511.5
# Data variables:
#     *empty*

The user has no signal that they're looking at a stub. They probably wanted multidim=True, or one specific 2D slice of one variable.

What's actually possible

For a single 2D slice from a multidim Icechunk store, GDAL 3.13+ supports:

xds = backend.open_dataset(
    'ZARR:"/vsiicechunk/{/vsis3/.../v0.1.0.icechunk}":/wind_u_10m:{0}:{0}',
    multidim=False,
)
# Returns the lat × lon face at init_time=0, lead_time=0:
# Dimensions:  (y: 721, x: 1440)
# Data variables: band_1 (y, x) float32 4MB

This composition — /vsiicechunk/ VFS + ZARR: driver + :{i}:{j} slice selector — is non-obvious and worth documenting. It's the bridge between multidim cloud-native stores and classic GDAL tools (gdal_translate, gdal_warp, etc.).

Proposed behaviour

When _open_raster is called and detects that the result is a stub (empty data variables, sentinel-looking coordinates), do one of:

  • Warn (or raise, configurable) with a clear message — "store appears to be multidimensional; try multidim=True or use the subdataset syntax for a 2D slice"
  • Inspect the underlying GDAL dataset for OF_MULTIDIM_RASTER capability and, if present, raise immediately rather than returning the stub
  • Add a helper that prints a gdalinfo-like summary listing available subdataset paths and the equivalent multidim=True access

Sketch of detection:

def _open_raster(self, filename_or_obj, chunks, drop_variables):
    dataset = gdal.Open(filename_or_obj, gdal.GA_ReadOnly)
    if dataset is None:
        raise ValueError(f"Could not open {filename_or_obj} with GDAL")
    
    # Detect multidim stubs: no real bands, or sentinel-looking metadata
    if dataset.RasterCount == 0:
        # Try multidim probe
        mdim_ds = gdal.OpenEx(filename_or_obj, gdal.OF_MULTIDIM_RASTER)
        if mdim_ds is not None and mdim_ds.GetRootGroup() is not None:
            rg = mdim_ds.GetRootGroup()
            arrays = rg.GetMDArrayFullNamesRecursive()
            raise ValueError(
                f"{filename_or_obj} appears to be a multidimensional store; "
                f"try multidim=True. Found {len(arrays)} arrays at paths like "
                f"{arrays[:3]}..."
            )
    
    # ... rest of existing _open_raster

Helper for exploration

A small gdalxarray.info(url) function that prints:

  • Whether the store opens as classic, multidim, or both
  • Root group name and immediate children
  • All MDArrays at the root level
  • Suggested xarray-style access pattern
gdalxarray.info("/vsis3/dynamical-ecmwf-aifs-single/.../v0.1.0.icechunk")
# Store type:    Multidim (Icechunk via /vsiicechunk/)
# Root group:    /
# Arrays:        21 variables, 4 dimensions
#                init_time, latitude, lead_time, longitude
#                wind_u_10m (init_time, lead_time, latitude, longitude) float32
#                ...
#
# Suggested:
#   backend.open_dataset(url, multidim=True)
#   for a single 2D slice (lat × lon at init_time=0, lead_time=0):
#   backend.open_dataset(
#       'ZARR:"/vsiicechunk/{<path>}":/wind_u_10m:{0}:{0}',
#       multidim=False,
#   )

This is the kind of "introspect-then-document-paths" tool that's common in R packages (raadtools::glance, gdalraster::ds$info()) but missing from the Python multidim story.

Open questions

  • Warn vs raise on multidim stub? (Arguments both ways — raise is loud and unambiguous; warn is less disruptive to existing usage)
  • Is gdalxarray.info() worth shipping, or is gdal mdim info from the GDAL CLI already enough?
  • Should the info helper print subdataset syntax for every variable, or just summarise? (For the ECMWF AIFS store that's 21 variables — readable; for ERA5 with ~200, less so.)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions