Having Fun with CMOR#

pyku provides functions to CMORize data, automatically handling a significant number of attributes and unit conversions. However, global attributes must be manually set, as they cannot be inferred. The first step, as usual, is to load the xarray library and the pyku extension.

[1]:

import xarray as xr
import pyku

DRS definition#

pyku includes DRS definition files. You can display the full list of implemented CMOR and CMOR-like DRS definitions as follows:

[2]:

pyku.list_drs_standards()

[2]:

['cmip6',
 'cmip5',
 'cordex',
 'cordex_adjust',
 'cordex_interp',
 'cordex_adjust_interp',
 'cordex_cmip6',
 'hyras',
 'sk_raw',
 'sk_cmor',
 'reanalysis',
 'cosmo-rea6',
 'eobs',
 'obs4mips',
 'seamless']

In this tutorial, we will CMORize COSMO-CLM data using the CORDEX CMOR standard. Behind the scenes, pyku utilizes YAML metadata files, which are structured like this:

cordex:

    # References
    # ----------

    # http://is-enes-data.github.io/
    # http://is-enes-data.github.io/cordex_archive_specifications.pdf
    # http://is-enes-data.github.io/CORDEX_variables_requirement_table.pdf

    # The stem/parent wording stems from the pathlib library
    # https://docs.python.org/3/library/pathlib.html

    metadata:
        - CORDEX_domain
        - driving_model_id
        - driving_experiment_name
        - driving_model_ensemble_member
        - model_id
        - rcm_version_id
        - frequency
        - product
        - institute_id
        - experiment_id

    stem_pattern:
        "{variable_name}_{CORDEX_domain}_{driving_model_id}_\
         {driving_experiment_name}_{driving_model_ensemble_member}_\
         {model_id}_{rcm_version_id}_{frequency}_{start_time}-{end_time}"

    parent_pattern:
        "{product}/{CORDEX_domain}/{institute_id}/{driving_model_id}/\
         {experiment_id}/{driving_model_ensemble_member}/{model_id}/\
         {rcm_version_id}/{frequency}"

Get exemplary ICON data#

Get ICON exemplary data, which consists of raw model output without any post-processing.

[3]:

# List files using UNIX file pattern
# ----------------------------------

grib_files = pyku.resources.get_test_data('icon_grib_files')
grid_file = pyku.resources.get_test_data('icon_grid_file')

/home/runner/work/pyku/pyku/.venv/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

Exploring the dataset#

We open the dataset using standard xarray options. It is important to specify how xarray should handle variables when working with multiple files. By default, georeferencing metadata (like the CRS attribute) is tied to the time dimension. Here, we instruct xarray to treat the CRS as a single variable across all files, independent of dataset dimensions, if it remains the same. Additionally, for this demonstration, we only use a subset of the data to ensure faster execution. In the background, both xarray and pyku are optimized for handling large datasets, allowing terabytes of data to be lazily loaded and processed on demand.

[4]:

ds = xr.open_mfdataset(
    grib_files,
    engine='cfgrib',
    combine='nested',
    compat='no_conflicts',
    coords='different',
    time_dims=["valid_time"]
)

ds

ecCodes provides no latitudes/longitudes for gridType='unstructured_grid'

[4]:

<xarray.Dataset> Size: 2GB
Dimensions:            (valid_time: 744, values: 659156)
Coordinates:
  * valid_time         (valid_time) datetime64[ns] 6kB 2025-01-01 ... 2025-01...
    heightAboveGround  float64 8B ...
Dimensions without coordinates: values
Data variables:
    t2m                (valid_time, values) float32 2GB dask.array<chunksize=(744, 659156), meta=np.ndarray>
Attributes:
    GRIB_edition:            2
    GRIB_centre:             edzw
    GRIB_centreDescription:  Offenbach
    GRIB_subCentre:          255
    Conventions:             CF-1.7
    institution:             Offenbach
    history:                 2026-03-20T07:20 GRIB to CDM+CF via cfgrib-0.9.1...

Modify the data#

This preprocessing step standardizes the raw GRIB data for compatibility with pyku. Specifically, we rename the valid_time dimension to time and drop the original model-run initialization time and the forecast time step. Since the dataset is strictly 2D, the heightAboveGround coordinate is removed to simplify the data structure. Finally, we rename the data variable to cell, which is a prerequisite for correctly mapping the geographic coordinate system in subsequent steps.

[5]:

ds = ds.drop_vars(['heightAboveGround'])
ds = ds.rename({'valid_time': 'time'})
ds = ds.rename_dims({'values': 'cell'})

Attach the georeferencing#

ICON model output is typically stored on an unstructured grid, where meteorological variables are separated from their spatial coordinates. To perform georeferencing, we must retrieve a standalone grid file (external coordinates) containing the latitude and longitude for each cell center. This grid information is essential for mapping the raw data to a geographic coordinate system.

[6]:

grid = xr.open_dataset(
    pyku.resources.get_test_data('icon_grid_file')
)

We can merge the grid cell center latitudes and longitudes to the data

[7]:

ds = xr.merge([
    ds,
    grid[['clon', 'clat']]
])

Regridding the data#

With the coordinates attached, we can now apply one of pyku’s pre-defined geographic projections. Note that pyku does not currently support spatial projections for dask-chunked data on unstructured grids; datasets must be loaded into memory before the transformation.

[8]:

ds = ds.chunk(chunks=-1).pyku.project('GER-0275')

2026-03-20 07:20:55,768 - pyku - geo - 822- WARNING - longitude unit conversion from radians to degrees
2026-03-20 07:20:55,770 - pyku - geo - 826- WARNING - latitude unit conversion from radians to degrees

[9]:

ds.isel(time=0).ana.one_map(var='t2m', crs='GER-0275')

<Figure size 640x480 with 0 Axes>

../_images/tutorials_CMORization_17_1.png

Selecting variables for CMORization#

pyku can select a single variable as a dataset, with all global attributes kept, and perform the CMORization process. It is important to note that CMORization should be done one variable at a time. To list all available georeferenced data that can be CMORized (excluding non-data variables like the CRS or time bounds), you can use the following pyku functionality.

[10]:

ds.pyku.get_geodata_varnames()

[10]:

['t2m']

From this list, you can select a specific variable while preserving all its attributes. Special variables, such as the CRS and time bounds, are automatically handled.

Setting CMOR attibutes and names#

The dataset is now ready for CMORization. Note that all variable attributes and conventions are applied, except for the global metadata, which still need to be set.

[11]:

cmorized = ds.pyku.get_geodataset('t2m').pyku.cmorize()
cmorized

[11]:

<xarray.Dataset> Size: 525MB
Dimensions:       (time: 744, rlat: 423, rlon: 415)
Coordinates:
  * time          (time) datetime64[ns] 6kB 2025-01-01 ... 2025-01-31T23:00:00
  * rlat          (rlat) float64 3kB 7.439 7.411 7.384 ... -4.111 -4.139 -4.166
  * rlon          (rlon) float64 3kB -10.29 -10.27 -10.24 ... 1.036 1.064 1.091
    lat           (rlat, rlon) float64 1MB 56.87 56.88 56.88 ... 46.57 46.57
    lon           (rlat, rlon) float64 1MB -0.9172 -0.869 ... 19.54 19.58
Data variables:
    tas           (time, rlat, rlon) float32 522MB dask.array<chunksize=(744, 423, 415), meta=np.ndarray>
    rotated_pole  int32 4B 1
Attributes:
    GRIB_edition:            2
    GRIB_centre:             edzw
    GRIB_centreDescription:  Offenbach
    GRIB_subCentre:          255
    Conventions:             CF-1.7
    institution:             Offenbach
    history:                 2026-03-20T07:20 GRIB to CDM+CF via cfgrib-0.9.1...
    CORDEX_domain:           undefined
    frequency:               1hr

For verification, run a DRS check using the following command:

[12]:

cmorized.pyku.check_drs(standard='cordex')

[12]:

	key	value	issue
0	CORDEX_domain	undefined	None
1	driving_model_id	None	missing value
2	driving_experiment_name	None	missing value
3	driving_model_ensemble_member	None	missing value
4	model_id	None	missing value
5	rcm_version_id	None	missing value
6	frequency	1hr	None
7	product	None	missing value
8	institute_id	None	missing value
9	experiment_id	None	missing value

Setting global metadata#

Running a sanity check to verify compliance with the standard, we confirm that the global metadata have not been set, as they cannot be inferred automatically. The global metadata must be manually provided to the cmorize function in the form of a dictionary. For operational use, it is recommended to define this metadata in a YAML file, which can then be loaded into a dictionary

[13]:

global_metadata = {
    'title': (
        "COSMO_CLM_5.00_clm16 simulation (0.0275 Deg) with ERA5T (direct nest, since "
        "2022) forcing, domain Germany, FPS-C tuned setup"
    ),
    'CORDEX_domain': 'GER-0275',
    'driving_model_id': 'ECMWF-ERA5T',
    'driving_experiment_name': 'evaluation',
    'experiment_id': 'evaluation',
    'driving_experiment': 'ECMWF-ERA5T, evaluation, r1i1p1',
    'driving_model_ensemble_member': 'r1i1p1',
    'experiment': 'evaluation run with ECMWF-ERA5T reanalysis forcing',
    'model_id': 'CLMcom-DWD-CCLM5-0-16',
    'institute_id': 'CLMcom-DWD',
    'institution': (
        "DWD (Deutscher Wetterdienst, Offenbach, Germany) in collaboration with the "
        "CLM-Community"
    ),
    'rcm_version_id': 'x0n1-v1',
    'contact': 'klima.offenbach@dwd.de',
    'nesting_levels': '1',
    'comment_nesting': 'these are results of a direct downscaling (one nest) approach',
    'comment_1stNest': (
        "convection permitting simulation driven by ECMWF-ERA5T (direct downscaling) "
        "comment_2ndNest: not used "
    ),
    'comment': (
        "Convection-permitting evaluation run for Germany (HoKliSim-DeT) performed by "
        "DWD Offenbach. Configuration adapted from the CORDEX FPS Convection setup, "
        "which was developed and tuned by the CRCS working group of the CLM Community. "
        "Computing system: Aurora NEC at DWD. The Deutscher Wetterdienst (DWD) is the "
        "producer of the data. The General Terms and Conditions of Business and "
        "Delivery apply for services provided by DWD "
        "(http://www.dwd.de/EN/service/terms/terms.html)."
    ),
    "rcm_config_cclm": "GER-0275_CLMcom-DWD-CCLM5-0-16_config",
    "rcm_config_int2lm": "GER-0275_CLMcom-DWD-INT2LM2-0-6-modif_config",
    "source": "Climate Limited-area Modelling Community (CLM-Community)",
    "references": "http://www.clm-community.eu, http://www.dwd.de",
    "product": "output",
    "project_id": "DWD-CPS",
}

Now, we CMORize the data again, this time setting the global attributes for CMOR compliance. Afterward, we perform another sanity check and observe that the global attributes have been correctly set.

[14]:

cmorized = ds.pyku.get_geodataset('t2m')
cmorized = cmorized.pyku.cmorize(global_metadata=global_metadata)
cmorized.pyku.check_drs(standard='cordex')

[14]:

	key	value	issue
0	CORDEX_domain	GER-0275	None
1	driving_model_id	ECMWF-ERA5T	None
2	driving_experiment_name	evaluation	None
3	driving_model_ensemble_member	r1i1p1	None
4	model_id	CLMcom-DWD-CCLM5-0-16	None
5	rcm_version_id	x0n1-v1	None
6	frequency	1hr	None
7	product	output	None
8	institute_id	CLMcom-DWD	None
9	experiment_id	evaluation	None

An additional sanity check can be performed using the pyku check function to identify any potential issues:

[15]:

cmorized.pyku.check_metadata(standard='cordex')

[15]:

	key	value	issue	description
0	y_projection_coordinate_exist	True	None	Checking if y projection coordinate available
1	x_projection_coordinate_exist	True	None	Checking if x projection coordinate available
2	lat_geographic_coordinate_exist	True	None	Checking if lat geographic coordinate available
3	lon_geographic_coordinate_exist	True	None	Checking if lon geographic coordinate available
4	y_projection_coordinate_unit_correct	True	None	Checking if y projection coordinates units
5	x_projection_coordinate_unit_correct	True	None	Checking if x projection coordinates units
6	y_projection_coordinate_standard_name_correct	True	None	Checking y projection coordinates standard_name
7	x_projection_coordinate_standard_name_correct	True	None	Checking x projection coordinates standard_name
8	lat_geographic_coordinate_unit_correct	True	None	Checking lat geographic coordinate units
9	lon_geographic_coordinate_unit_correct	True	None	Checking if lat geographic coordinate units
10	lat_geographic_coordinate_standard_name_correct	True	None	Checking lat geographic coordinate standard_name
11	lon_geographic_coordinate_standard_name_correct	True	None	Checking lon geographic coordinate standard_name
12	cf_area_def_readable	True	None	Check if CF projection metadata are readable
13	area_extent_is_readable	True	None	Check if the area extent can be determined fro...
14	longitudes_within_180W_and_180E	True	None	Check that the longitudes are within 180 degre...
15	tas_units_can_be_read	True	None	Check if units can be read automatically
16	tas	tas	None	NaN
17	is_cmor_standard_name	True	None	If possible, check if standard name is CMOR co...
18	is_cmor_long_name	True	None	Check if long_name is CMOR conform
19	is_cmor_units	True	None	Check if units is CMOR conform
20	frequency_can_be_inferred_from_data	True	None	Tried to infer frequency from the time labels
21	frequency_can_be_determined	True	None	Check if frequency can be determined from the ...
22	geodata_vars	[tas]	None	NaN
23	geographic_latlon	(lat, lon)	None	NaN
24	projection_yx	(rlat, rlon)	None	NaN
25	time_dependent_vars	[tas]	None	NaN
26	time_bounds_var	None	None	NaN
27	spatial_bounds_vars	[]	None	NaN
28	spatial_vertices_vars	[]	None	NaN
29	crs_var	rotated_pole	None	NaN
30	unidentified_vars	[]	None	NaN
31	has_time_dimension	True	None	Check if data have a time dimension
32	time_is_numpy_datetime64_or_cftime	True	None	Check the data type of the time stamps
33	CORDEX_domain	GER-0275	None	NaN
34	driving_model_id	ECMWF-ERA5T	None	NaN
35	driving_experiment_name	evaluation	None	NaN
36	driving_model_ensemble_member	r1i1p1	None	NaN
37	model_id	CLMcom-DWD-CCLM5-0-16	None	NaN
38	rcm_version_id	x0n1-v1	None	NaN
39	frequency	1hr	None	NaN
40	product	output	None	NaN
41	institute_id	CLMcom-DWD	None	NaN
42	experiment_id	evaluation	None	NaN

CMORized data filename#

From here, we can verify the name that the final data file should have.

[16]:

cmorized.pyku.drs_filename(standard='cordex')

[16]:

'output/GER-0275/CLMcom-DWD/ECMWF-ERA5T/evaluation/r1i1p1/CLMcom-DWD-CCLM5-0-16/x0n1-v1/1hr/tas/tas_GER-0275_ECMWF-ERA5T_evaluation_r1i1p1_CLMcom-DWD-CCLM5-0-16_x0n1-v1_1hr_2025010100-2025013123.nc'

Geographic projections#

The data can be reprojected as needed, with all available CMOR metadata set automatically. Note how the grid has been modified in the metadata to comply with the CMOR standard.

[17]:

cmorized = ds.pyku.get_geodataset('t2m')
cmorized = cmorized.pyku.cmorize(global_metadata=global_metadata)
cmorized = cmorized.pyku.project('HYR-LAEA-5')
cmorized.pyku.check_drs(standard='cordex')

[17]:

	key	value	issue
0	CORDEX_domain	undefined	None
1	driving_model_id	ECMWF-ERA5T	None
2	driving_experiment_name	evaluation	None
3	driving_model_ensemble_member	r1i1p1	None
4	model_id	CLMcom-DWD-CCLM5-0-16	None
5	rcm_version_id	x0n1-v1	None
6	frequency	1hr	None
7	product	output	None
8	institute_id	CLMcom-DWD	None
9	experiment_id	evaluation	None

The filename should now be as follows:

[18]:

cmorized.pyku.drs_filename(standard='cordex')

[18]:

'output/undefined/CLMcom-DWD/ECMWF-ERA5T/evaluation/r1i1p1/CLMcom-DWD-CCLM5-0-16/x0n1-v1/1hr/tas/tas_undefined_ECMWF-ERA5T_evaluation_r1i1p1_CLMcom-DWD-CCLM5-0-16_x0n1-v1_1hr_2025010100-2025013123.nc'

Resampling data#

Similarly, we can transform the data through resampling.

[19]:

cmorized = ds.pyku.get_geodataset('t2m')
cmorized = cmorized.pyku.project('HYR-LAEA-5')
cmorized = cmorized.pyku.resample_datetimes(frequency='1D', how='mean')
cmorized = cmorized.pyku.cmorize(global_metadata=global_metadata)
cmorized

2026-03-20 07:21:09,201 - pyku - timekit - 314- WARNING - t2m: 'cell_methods' was not set automatically

[19]:

<xarray.Dataset> Size: 8MB
Dimensions:    (time: 31, y: 220, x: 258, bnds: 2)
Coordinates:
  * time       (time) datetime64[ns] 248B 2025-01-01T12:00:00 ... 2025-01-31T...
  * y          (y) float64 2kB 3.586e+06 3.582e+06 ... 2.496e+06 2.492e+06
  * x          (x) float64 2kB 3.808e+06 3.814e+06 ... 5.088e+06 5.094e+06
    lat        (y, x) float64 454kB 55.13 55.13 55.14 ... 45.09 45.08 45.08
    lon        (y, x) float64 454kB 1.95 2.028 2.106 2.184 ... 19.7 19.76 19.83
Dimensions without coordinates: bnds
Data variables:
    tas        (time, y, x) float32 7MB dask.array<chunksize=(31, 220, 258), meta=np.ndarray>
    crs        int32 4B 1
    time_bnds  (time, bnds) datetime64[ns] 496B 2025-01-01 ... 2025-02-01
Attributes: (12/30)
    GRIB_edition:                   2
    GRIB_centre:                    edzw
    GRIB_centreDescription:         Offenbach
    GRIB_subCentre:                 255
    Conventions:                    CF-1.7
    institution:                    DWD (Deutscher Wetterdienst, Offenbach, G...
    ...                             ...
    rcm_config_cclm:                GER-0275_CLMcom-DWD-CCLM5-0-16_config
    rcm_config_int2lm:              GER-0275_CLMcom-DWD-INT2LM2-0-6-modif_config
    source:                         Climate Limited-area Modelling Community ...
    references:                     http://www.clm-community.eu, http://www.d...
    product:                        output
    project_id:                     DWD-CPS

[20]:

cmorized.time_bnds

[20]:

<xarray.DataArray 'time_bnds' (time: 31, bnds: 2)> Size: 496B
array([['2025-01-01T00:00:00.000000000', '2025-01-02T00:00:00.000000000'],
       ['2025-01-02T00:00:00.000000000', '2025-01-03T00:00:00.000000000'],
       ['2025-01-03T00:00:00.000000000', '2025-01-04T00:00:00.000000000'],
       ['2025-01-04T00:00:00.000000000', '2025-01-05T00:00:00.000000000'],
       ['2025-01-05T00:00:00.000000000', '2025-01-06T00:00:00.000000000'],
       ['2025-01-06T00:00:00.000000000', '2025-01-07T00:00:00.000000000'],
       ['2025-01-07T00:00:00.000000000', '2025-01-08T00:00:00.000000000'],
       ['2025-01-08T00:00:00.000000000', '2025-01-09T00:00:00.000000000'],
       ['2025-01-09T00:00:00.000000000', '2025-01-10T00:00:00.000000000'],
       ['2025-01-10T00:00:00.000000000', '2025-01-11T00:00:00.000000000'],
       ['2025-01-11T00:00:00.000000000', '2025-01-12T00:00:00.000000000'],
       ['2025-01-12T00:00:00.000000000', '2025-01-13T00:00:00.000000000'],
       ['2025-01-13T00:00:00.000000000', '2025-01-14T00:00:00.000000000'],
       ['2025-01-14T00:00:00.000000000', '2025-01-15T00:00:00.000000000'],
       ['2025-01-15T00:00:00.000000000', '2025-01-16T00:00:00.000000000'],
       ['2025-01-16T00:00:00.000000000', '2025-01-17T00:00:00.000000000'],
       ['2025-01-17T00:00:00.000000000', '2025-01-18T00:00:00.000000000'],
       ['2025-01-18T00:00:00.000000000', '2025-01-19T00:00:00.000000000'],
       ['2025-01-19T00:00:00.000000000', '2025-01-20T00:00:00.000000000'],
       ['2025-01-20T00:00:00.000000000', '2025-01-21T00:00:00.000000000'],
       ['2025-01-21T00:00:00.000000000', '2025-01-22T00:00:00.000000000'],
       ['2025-01-22T00:00:00.000000000', '2025-01-23T00:00:00.000000000'],
       ['2025-01-23T00:00:00.000000000', '2025-01-24T00:00:00.000000000'],
       ['2025-01-24T00:00:00.000000000', '2025-01-25T00:00:00.000000000'],
       ['2025-01-25T00:00:00.000000000', '2025-01-26T00:00:00.000000000'],
       ['2025-01-26T00:00:00.000000000', '2025-01-27T00:00:00.000000000'],
       ['2025-01-27T00:00:00.000000000', '2025-01-28T00:00:00.000000000'],
       ['2025-01-28T00:00:00.000000000', '2025-01-29T00:00:00.000000000'],
       ['2025-01-29T00:00:00.000000000', '2025-01-30T00:00:00.000000000'],
       ['2025-01-30T00:00:00.000000000', '2025-01-31T00:00:00.000000000'],
       ['2025-01-31T00:00:00.000000000', '2025-02-01T00:00:00.000000000']],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 248B 2025-01-01T12:00:00 ... 2025-01-31T12...
Dimensions without coordinates: bnds

xarray.DataArray

'time_bnds'

time: 31
bnds: 2

2025-01-01 2025-01-02 2025-01-02 ... 2025-01-31 2025-01-31 2025-02-01

array([['2025-01-01T00:00:00.000000000', '2025-01-02T00:00:00.000000000'],
       ['2025-01-02T00:00:00.000000000', '2025-01-03T00:00:00.000000000'],
       ['2025-01-03T00:00:00.000000000', '2025-01-04T00:00:00.000000000'],
       ['2025-01-04T00:00:00.000000000', '2025-01-05T00:00:00.000000000'],
       ['2025-01-05T00:00:00.000000000', '2025-01-06T00:00:00.000000000'],
       ['2025-01-06T00:00:00.000000000', '2025-01-07T00:00:00.000000000'],
       ['2025-01-07T00:00:00.000000000', '2025-01-08T00:00:00.000000000'],
       ['2025-01-08T00:00:00.000000000', '2025-01-09T00:00:00.000000000'],
       ['2025-01-09T00:00:00.000000000', '2025-01-10T00:00:00.000000000'],
       ['2025-01-10T00:00:00.000000000', '2025-01-11T00:00:00.000000000'],
       ['2025-01-11T00:00:00.000000000', '2025-01-12T00:00:00.000000000'],
       ['2025-01-12T00:00:00.000000000', '2025-01-13T00:00:00.000000000'],
       ['2025-01-13T00:00:00.000000000', '2025-01-14T00:00:00.000000000'],
       ['2025-01-14T00:00:00.000000000', '2025-01-15T00:00:00.000000000'],
       ['2025-01-15T00:00:00.000000000', '2025-01-16T00:00:00.000000000'],
       ['2025-01-16T00:00:00.000000000', '2025-01-17T00:00:00.000000000'],
       ['2025-01-17T00:00:00.000000000', '2025-01-18T00:00:00.000000000'],
       ['2025-01-18T00:00:00.000000000', '2025-01-19T00:00:00.000000000'],
       ['2025-01-19T00:00:00.000000000', '2025-01-20T00:00:00.000000000'],
       ['2025-01-20T00:00:00.000000000', '2025-01-21T00:00:00.000000000'],
       ['2025-01-21T00:00:00.000000000', '2025-01-22T00:00:00.000000000'],
       ['2025-01-22T00:00:00.000000000', '2025-01-23T00:00:00.000000000'],
       ['2025-01-23T00:00:00.000000000', '2025-01-24T00:00:00.000000000'],
       ['2025-01-24T00:00:00.000000000', '2025-01-25T00:00:00.000000000'],
       ['2025-01-25T00:00:00.000000000', '2025-01-26T00:00:00.000000000'],
       ['2025-01-26T00:00:00.000000000', '2025-01-27T00:00:00.000000000'],
       ['2025-01-27T00:00:00.000000000', '2025-01-28T00:00:00.000000000'],
       ['2025-01-28T00:00:00.000000000', '2025-01-29T00:00:00.000000000'],
       ['2025-01-29T00:00:00.000000000', '2025-01-30T00:00:00.000000000'],
       ['2025-01-30T00:00:00.000000000', '2025-01-31T00:00:00.000000000'],
       ['2025-01-31T00:00:00.000000000', '2025-02-01T00:00:00.000000000']],
      dtype='datetime64[ns]')

Coordinates: (1)

time

(time)

datetime64[ns]

2025-01-01T12:00:00 ... 2025-01-...

standard_name :: time
long_name :: time
axis :: T

array(['2025-01-01T12:00:00.000000000', '2025-01-02T12:00:00.000000000',
       '2025-01-03T12:00:00.000000000', '2025-01-04T12:00:00.000000000',
       '2025-01-05T12:00:00.000000000', '2025-01-06T12:00:00.000000000',
       '2025-01-07T12:00:00.000000000', '2025-01-08T12:00:00.000000000',
       '2025-01-09T12:00:00.000000000', '2025-01-10T12:00:00.000000000',
       '2025-01-11T12:00:00.000000000', '2025-01-12T12:00:00.000000000',
       '2025-01-13T12:00:00.000000000', '2025-01-14T12:00:00.000000000',
       '2025-01-15T12:00:00.000000000', '2025-01-16T12:00:00.000000000',
       '2025-01-17T12:00:00.000000000', '2025-01-18T12:00:00.000000000',
       '2025-01-19T12:00:00.000000000', '2025-01-20T12:00:00.000000000',
       '2025-01-21T12:00:00.000000000', '2025-01-22T12:00:00.000000000',
       '2025-01-23T12:00:00.000000000', '2025-01-24T12:00:00.000000000',
       '2025-01-25T12:00:00.000000000', '2025-01-26T12:00:00.000000000',
       '2025-01-27T12:00:00.000000000', '2025-01-28T12:00:00.000000000',
       '2025-01-29T12:00:00.000000000', '2025-01-30T12:00:00.000000000',
       '2025-01-31T12:00:00.000000000'], dtype='datetime64[ns]')

Again, note that the metadata have been set in accordance with the CMOR standard during this operation.

[21]:

cmorized.pyku.check_drs(standard='cordex')

[21]:

	key	value	issue
0	CORDEX_domain	GER-0275	None
1	driving_model_id	ECMWF-ERA5T	None
2	driving_experiment_name	evaluation	None
3	driving_model_ensemble_member	r1i1p1	None
4	model_id	CLMcom-DWD-CCLM5-0-16	None
5	rcm_version_id	x0n1-v1	None
6	frequency	day	None
7	product	output	None
8	institute_id	CLMcom-DWD	None
9	experiment_id	evaluation	None

Pay particular attention to how the variable attributes and the cell_methods have been set:

[22]:

cmorized.tas.attrs

[22]:

{'GRIB_paramId': 167,
 'GRIB_dataType': 'fc',
 'GRIB_numberOfPoints': 659156,
 'GRIB_typeOfLevel': 'heightAboveGround',
 'GRIB_stepUnits': 1,
 'GRIB_stepType': 'instant',
 'GRIB_gridType': 'unstructured_grid',
 'GRIB_NV': 0,
 'GRIB_cfName': 'air_temperature',
 'GRIB_cfVarName': 't2m',
 'GRIB_gridDefinitionDescription': 'General unstructured grid',
 'GRIB_missingValue': 3.4028234663852886e+38,
 'GRIB_name': '2 metre temperature',
 'GRIB_shortName': '2t',
 'GRIB_units': 'K',
 'long_name': 'Near-Surface Air Temperature',
 'standard_name': 'air_temperature',
 'grid_mapping': 'crs',
 'units': 'K',
 'cell_methods': 'time: point'}

Writing NetCDF data#

The final step is to write the data, which can be quite large, into packets that conform to the CMOR standard. Specifically, this means:

Hourly data will be split into packets of one year.
Three-hourly data will be split into packets of one year.
Six-hourly data will be split into packets of one year.
Twelve-hourly data will be split into packets of five years.
Daily data will be split into packets of five years.
Monthly data will be split into packets of ten years.
Seasonal data will be split into packets of ten years.
Yearly data and above will be split into packets of one hundred years.

For this demonstration, we are using only a subset of data to ensure the tutorial runs in a timely manner, rather than showcasing the full capabilities. However, this functionality is built-in and will operate automatically for large datasets without requiring user input. First, we create a temporary directory to store the data. This directory will be deleted at the end of this session.

[23]:

import tempfile

temp_dir = tempfile.TemporaryDirectory()
print("Temporary directory:", temp_dir.name)

Temporary directory: /tmp/tmp1ru5xm7b

[24]:

pyku.drs.to_drs_netcdfs(cmorized, base_dir=temp_dir.name, standard='cordex')

[25]:

!find {temp_dir.name} -type f

/tmp/tmp1ru5xm7b/output/GER-0275/CLMcom-DWD/ECMWF-ERA5T/evaluation/r1i1p1/CLMcom-DWD-CCLM5-0-16/x0n1-v1/day/tas/tas_GER-0275_ECMWF-ERA5T_evaluation_r1i1p1_CLMcom-DWD-CCLM5-0-16_x0n1-v1_day_20250101-20250131.nc

[26]:

file = !find {temp_dir.name} -type f
!ncdump -h {file[0]}

/bin/bash: line 1: ncdump: command not found

And finally cleanup the temporary directory

[27]:

temp_dir.cleanup()

Having Fun with CMOR

Contents