Add map_blocks and some testing support for array query expressions#11398
Add map_blocks and some testing support for array query expressions#11398mrocklin wants to merge 6 commits into
Conversation
Add a --use-dask-array pytest mode that registers the dask-array chunk manager and runs the existing suite through that backend. Generalize dask-specific tests around the active chunk manager and add dask-array expression support for xarray map_blocks.
Avoid importing the dask chunk manager in bare-minimum test environments and tighten datetime accessor types for mypy/stubtest.
|
Gentle ping. |
| ) | ||
| parser.addoption("--run-mypy", action="store_true", help="runs mypy tests") | ||
| parser.addoption( | ||
| "--use-dask-array", |
There was a problem hiding this comment.
nit: let's call this --use-dask-array-with-expr to avoid confusion.
| name == "dask" and manager is chunkmanager | ||
| for name, manager in list_chunkmanagers().items() | ||
| ) | ||
| if isinstance(chunkmanager, DaskManager) or registered_as_dask: |
There was a problem hiding this comment.
| if isinstance(chunkmanager, DaskManager) or registered_as_dask: | |
| if registered_as_dask: |
?
| metadata into a private ``dask_array`` multi-output map expression. Each output | ||
| variable is still a normal ``dask_array.Array`` child expression, so Dask can | ||
| group the children with the composite-collection protocol and ``dask_array`` can | ||
| optimize, cull, persist, and compute those arrays. |
There was a problem hiding this comment.
Can we add a test for culling please? (e.g. ds.pipe(xr.map_blocks(...)).sel(...) ≡ s.sel(...).pipe(xr.map_blocks, ...) )? Or... I guess that's slice pushdown? Do we gain that?
There was a problem hiding this comment.
Done. And yes, I'd call this slice pushdown rather than culling now.
Rename the dask-array expression test mode, centralize dask-array test helpers, and add a map_blocks slice-pushdown regression test that checks executed chunks. Also remove the unnecessary xarray vectorized-indexing chunk-manager flag and preserve explicit DaskManager tokenization behavior. Co-Authored-By: OpenAI <noreply@openai.com>
| dask_array_api = None | ||
| dask_array_type = () | ||
| has_dask_array_expr = False | ||
|
|
||
|
|
||
| def refresh_dask_chunkmanager_helpers() -> None: | ||
| global dask_array_api, dask_array_type, has_dask_array_expr | ||
|
|
||
| dask_array_api = None | ||
| dask_array_type = () | ||
| has_dask_array_expr = False | ||
| if has_dask: | ||
| dask_chunkmanager = get_dask_chunkmanager() | ||
| dask_array_api = dask_chunkmanager.array_api | ||
| dask_array_type = dask_chunkmanager.array_cls | ||
| has_dask_array_expr = dask_array_type.__module__.startswith("dask_array") | ||
|
|
||
|
|
||
| refresh_dask_chunkmanager_helpers() |
There was a problem hiding this comment.
can this just be
if has_dask:
...
else:
...
dcherian
left a comment
There was a problem hiding this comment.
Thanks! Just a couple of nits but LGTM otherwise.
| assert len(v2.__dask_graph__()) < len(v.__dask_graph__()) # type: ignore[arg-type] | ||
| assert v2.__dask_keys__() == v.__dask_keys__() | ||
| if not has_dask_array_expr: | ||
| assert v2.__dask_keys__() == v.__dask_keys__() |
There was a problem hiding this comment.
this seems important; is there an equivalent assertion?
There was a problem hiding this comment.
Yeah, it is important to understand. Optimized expressions will often have a different hash than their unoptimized progenitors. I'm changing this test to assert equivalence in the suffix (key[1:]) of every key rather than key equality.
This is a mild headache for both dask.distributed and frisky in operations like persist, because the client-side will continue to use the expression in things. We end up having to maintain a key-map.
So I believe that this change is good. It points to a genuine issue, but that issue is handled by the related infrastructure.
|
Thanks @dcherian for the help here |
|
Anything I can do to help here? |
2e31329 to
a0a7df1
Compare
|
Rebased on main. CI passes now except for the RTD failure (upstream issue I think). |
This adds a
--use-dask-arraypytest mode that registers the dask-array chunk manager and runs the existing suite through that backend. Most of the work here is making tests a bit more dask.array/dask-array agnostic. I also brought back the xarray map_blocks implementation.I haven't reviewed this thoroughly yet, and it's not complete (there are still sections of tests that would fail if they weren't marked to xfail under dask-array (flox and masked arrays are good examples), but I wanted to push up something before pushing much further forward on this so that this could get some early feedback.
cc @shoyer @dcherian
AI Disclosure
Tools: Claude, Codex