zcollection.collection.Collection.insert

Collection.insert(ds, *, merge_callable=None, npartitions=None, validate=False, **kwargs)

Insert a dataset into the collection.

Parameters:
  • ds (xarray.Dataset | dataset.Dataset) – The dataset to insert. It can be either an xarray.Dataset or a zcollection.dataset.Dataset object.

  • merge_callable (MergeCallable | None) – A function used to merge the data already stored in a partition with the new partitioned data (see the example below). If None, the new partitioned data overwrites the existing partitioned data.

  • npartitions (int | None) – The maximum number of partitions to process in parallel. By default, partitions are processed one by one.

    Note

    When inserting partitions, Dask parallelizes the writing of each partition across its workers. Additionally, the writing of variables within a partition is parallelized on the worker responsible for inserting that partition, using multiple threads. If you’re using a single Dask worker, partition insertion will happen sequentially and changing this parameter will have no effect.

  • kwargs – Additional keyword arguments passed to the merge callable.

  • validate (bool) – Whether to validate the dataset metadata before insertion.

Returns:

A list of the inserted partitions (their paths).

Raises:

ValueError – If the dataset does not match the definition of the collection.

Warns:

UserWarning – If two different partitions use the same chunk file; in that case, the library that handles the storage of chunked arrays (HDF5, NetCDF, Zarr, etc.) must support concurrent access.

Return type:

Iterable[str]

Notes

Each worker processes a set of independent partitions. Be aware, however, that two different partitions can use the same chunk file; in that case, the library that handles the storage of chunked arrays (HDF5, NetCDF, Zarr, etc.) must support concurrent access.
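
Example

A minimal usage sketch. The collection path, the input file, and the use of zcollection.merging.merge_time_series as the merge strategy are illustrative assumptions, not part of this method's definition:

>>> import xarray as xr
>>> import zcollection
>>> import zcollection.merging
>>> # Open an existing collection in write mode (path is illustrative).
>>> collection = zcollection.open_collection("/data/my_collection", mode="w")
>>> # New data to insert; it may overlap partitions already stored.
>>> ds = xr.open_dataset("new_measurements.nc")
>>> # Merge overlapping partitions along the time axis instead of
>>> # overwriting them, processing up to four partitions in parallel.
>>> inserted = collection.insert(
...     ds,
...     merge_callable=zcollection.merging.merge_time_series,
...     npartitions=4,
...     validate=True,
... )
>>> list(inserted)  # paths of the partitions that were written

If merge_callable is left as None, each partition touched by ds is simply overwritten with the new data.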