pyku.meta

The pyku.meta module provides functions for working with the metadata of xarray.Dataset objects, particularly climate and geospatial data. These functions help manage coordinate variables, spatial information, and temporal metadata while staying compatible with common conventions and formats.

Metadata retrieval

Functions such as pyku.meta.get_geographic_latlon_varnames(), pyku.meta.get_crs_varname(), pyku.meta.get_geodata_varnames(), and pyku.meta.get_spatial_varnames() retrieve the names of standard variables (geographic latitude/longitude, CRS, georeferenced data, and spatial variables) from an xarray.Dataset.

Spatial metadata

pyku.meta.is_georeferenced() determines whether a dataset is georeferenced, and pyku.meta.has_projection_coordinates() determines whether it has projection coordinates.

Temporal metadata

pyku.meta.get_frequency() detects the temporal frequency of a dataset, with support for time-bounds checks and multiple output formats (freqstr, DateOffset, Timedelta). pyku.meta.get_time_bounds() and pyku.meta.has_time_bounds() provide tools to inspect and validate time bounds.

Example Usage

Below are examples of typical usage:

import pyku

# Retrieve a test dataset
# -----------------------

ds = pyku.resources.get_test_data('hyras')

# Find variable names of georeferenced data in dataset
# ----------------------------------------------------

ds.pyku.get_geodata_varnames()

# Get dataset frequency
# ---------------------

ds.pyku.get_frequency(dtype='freqstr')

# Check if the dataset is georeferenced
# -------------------------------------

ds.pyku.is_georeferenced()

For more detailed information on each function, refer to their respective docstrings.

pyku.meta.filter_incomplete_datetimes(*args, **kwargs)

This function has moved to pyku.timekit.filter_incomplete_datetimes().

pyku.meta.find_match(searched_words, words, excluded_words=None)

Finds the best match for a set of searched words among the available words, such as coordinate names.

Parameters:
  • searched_words (list) – List of potential names to match, e.g., [‘lat’, ‘lats’, ‘latitude’].

  • words (list) – List of available coordinate names, e.g., [‘time’, ‘lat_3’, ‘lon_3’, ‘x’, ‘y’].

  • excluded_words (list) – Optional. List of names to exclude from matching, e.g., [‘rlat’, ‘lat_bnds’].

Returns:

The best matching coordinate name.

Return type:

str

Example

For example, if we are looking for latitude, which could be represented by names such as [‘lat’, ‘lats’, ‘latitude’], we want to identify the best match from a set of available coordinates like [‘time’, ‘lat_3’, ‘lon_3’, ‘x’, ‘y’].

To refine the search, certain words should be excluded to prevent them from being returned as matches. For instance, when searching for geographic latitude, terms like rlat or lat_bnds should not be considered valid matches.

In [1]: import pyku.meta as meta
   ...: meta.find_match(
   ...:    searched_words=['lat', 'lats', 'latitude'],
   ...:    words=['time', 'lat_3', 'lon_3', 'y_3', 'x_3'],
   ...:    excluded_words=['rlat', 'clats']
   ...: )
   ...: 
Out[1]: 'lat_3'
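
The matching idea described above can be illustrated with a small standard-library sketch. This is not pyku's implementation: the function name find_match_sketch, the substring test, and the difflib-based scoring are all assumptions chosen for illustration only.

```python
from difflib import SequenceMatcher


def find_match_sketch(searched_words, words, excluded_words=None):
    """Illustrative matcher (hypothetical, not pyku's actual algorithm)."""
    excluded = excluded_words or []
    best_word, best_score = None, 0.0
    for word in words:
        # Skip candidates that contain an excluded term, e.g. 'rlat'.
        if any(term in word for term in excluded):
            continue
        for target in searched_words:
            # Only consider candidates that contain the searched word.
            if target not in word:
                continue
            # Prefer the candidate most similar to a searched word.
            score = SequenceMatcher(None, target, word).ratio()
            if score > best_score:
                best_word, best_score = word, score
    return best_word


match = find_match_sketch(
    searched_words=['lat', 'lats', 'latitude'],
    words=['time', 'lat_3', 'lon_3', 'y_3', 'x_3'],
    excluded_words=['rlat', 'clats'],
)
print(match)  # lat_3
```

In this sketch a candidate is rejected as soon as it contains any excluded term, so a name such as lat_bnds never competes with lat.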

pyku.meta.get_crs_varname(ds)

Get name of the crs variable

Parameters:

ds (xarray.Dataset) – The input Dataset.

Returns:

Name of the crs variable.

Return type:

str

Example

In [1]: import pyku
   ...: ds = pyku.resources.get_test_data('hyras')
   ...: ds.pyku.get_crs_varname()
   ...: 
Out[1]: 'crs'

pyku.meta.get_dataset_size(ds)

Get the dataset size in GB.

Parameters:

ds (xarray.Dataset) – The input dataset.

Returns:

Dataset size

Return type:

str

pyku.meta.get_frequency(ds, dtype='Timedelta')

Get the temporal frequency of the dataset. Unlike the standard xarray function xarray.infer_freq(), this function additionally checks the time bounds.

Parameters:
  • ds (xarray.Dataset) – The input dataset.

  • dtype (str) –

    Specifies the desired data type for frequency representation. Choose one of the following:

    • 'freqstr': Represents the frequency as a string. This is the recommended choice, although the default in the signature is 'Timedelta'.

    • 'DateOffset': Represents the frequency using pandas’ DateOffset.

    • 'Timedelta': Represents the frequency using pandas’ Timedelta.

Returns:

The inferred frequency of the dataset, returned as a frequency string (str), a pandas.tseries.offsets.DateOffset, or a pandas.Timedelta, depending on dtype.

Examples

In [1]: import pyku
   ...: 
   ...: # Get the dataset
   ...: ds = pyku.resources.get_test_data('hyras')
   ...: 

In [2]: # Get the frequency string
   ...: ds.pyku.get_frequency(dtype='freqstr')
   ...: 
Out[2]: 'D'

In [3]: # Get the frequency as DateOffset
   ...: ds.pyku.get_frequency(dtype='DateOffset')
   ...: 
Out[3]: <Day>

To create an offset that can be compared, use to_offset, which converts a frequency string into an offset object. This ensures that the frequency of your data can be compared unambiguously.

In [5]: import pyku
   ...: from pandas.tseries.frequencies import to_offset
   ...: ds = pyku.resources.get_test_data('hyras')
   ...: myoffset = ds.pyku.get_frequency(dtype='DateOffset')
   ...: to_offset('1D') == myoffset
   ...: 
Out[5]: True
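
To illustrate why time bounds are useful when determining a frequency, the following standard-library sketch derives a frequency from the width of each bounds interval. It is only an illustration under stated assumptions: the helper frequency_from_bounds is hypothetical, and the source does not describe how pyku performs its bounds check.

```python
from datetime import datetime, timedelta


def frequency_from_bounds(time_bounds):
    """Return the common width of all bounds, or None if the widths differ.

    Hypothetical helper for illustration; not part of pyku.
    """
    widths = {upper - lower for lower, upper in time_bounds}
    return widths.pop() if len(widths) == 1 else None


# Five daily intervals starting on 1981-01-01.
start = datetime(1981, 1, 1)
daily_bounds = [
    (start + timedelta(days=i), start + timedelta(days=i + 1))
    for i in range(5)
]

freq = frequency_from_bounds(daily_bounds)
print(freq)  # 1 day, 0:00:00
```

When the intervals do not all have the same width, the sketch returns None instead of guessing a frequency.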

pyku.meta.get_geodata_varnames(ds)

Get variable names of georeferenced data from dataset.

The minimal requirement for a variable to be deemed georeferenced is to have either geographic or projection coordinates.

Parameters:

ds (xarray.Dataset) – The input dataset.

Returns:

Names of the georeferenced variables.

Return type:

list

Example

In [1]: import pyku
   ...: ds = pyku.resources.get_test_data('hyras-tas-monthly')
   ...: ds.pyku.get_geodata_varnames()
   ...: 
Out[1]: ['tas']
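
The minimal requirement stated above (geographic or projection coordinates) can be sketched without xarray. The mapping layout, the function name geodata_varnames_sketch, and the default coordinate names are assumptions made for this illustration and are not taken from pyku.

```python
def geodata_varnames_sketch(variables,
                            geographic=('lat', 'lon'),
                            projection=('y', 'x')):
    """Return the variables that carry geographic or projection coordinates.

    `variables` maps each variable name to the coordinate names attached
    to it. Hypothetical helper; not pyku's implementation.
    """
    names = []
    for name, coords in variables.items():
        has_geographic = all(c in coords for c in geographic)
        has_projection = all(c in coords for c in projection)
        if has_geographic or has_projection:
            names.append(name)
    return names


variables = {
    'tas': ('time', 'y', 'x', 'lat', 'lon'),
    'time_bnds': ('time',),
    'crs': (),
}
geodata = geodata_varnames_sketch(variables)
print(geodata)  # ['tas']
```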

pyku.meta.get_geodataset(ds, var)

Get a dataset containing the requested georeferenced variable(s). This function is useful because it returns the variable(s) together with all associated climate data variables, such as the coordinates, the CRS, and the time bounds.

Parameters:
  • ds (xarray.Dataset) – The input dataset.

  • var (str, List(str)) – The variable name(s).

Returns:

The geodata variable(s) with all associated climate data variables.

Return type:

xarray.Dataset

Example

In [1]: import pyku
   ...: ds = pyku.resources.get_test_data('hyras')
   ...: ds.pyku.get_geodataset(var='tas')
   ...: 
Out[1]: 
<xarray.Dataset> Size: 139MB
Dimensions:    (time: 730, y: 178, x: 133, bnds: 2)
Coordinates:
  * time       (time) datetime64[ns] 6kB 1981-01-01 1981-01-02 ... 1982-12-31
  * y          (y) float64 1kB 3.562e+06 3.556e+06 ... 2.682e+06 2.676e+06
  * x          (x) float64 1kB 4.024e+06 4.028e+06 ... 4.678e+06 4.684e+06
    lat        (y, x) float64 189kB dask.array<chunksize=(178, 133), meta=np.ndarray>
    lon        (y, x) float64 189kB dask.array<chunksize=(178, 133), meta=np.ndarray>
Dimensions without coordinates: bnds
Data variables:
    tas        (time, y, x) float64 138MB dask.array<chunksize=(730, 178, 133), meta=np.ndarray>
    crs        int32 4B ...
    time_bnds  (time, bnds) datetime64[ns] 12kB dask.array<chunksize=(730, 2), meta=np.ndarray>
Attributes: (12/23)
    source:                 surface observations
    institution:            Deutscher Wetterdienst (DWD)
    Conventions:            CF-1.11
    title:                  gridded_temperature_dataset_(HYRAS TAS)
    realization:            v6-1
    project_id:             HYRAS
    ...                     ...
    ConventionsURL:         http://cfconventions.org/Data/cf-conventions/cf-c...
    license:                The HYRAS data, produced by DWD, is licensed unde...
    filename:               tas_hyras_1_1981_v6-1_de.nc
    comment:                Please be aware that the parameters are stored as...
    unique_dataset_id:      DWD_HYRAS_DE_tas_v6-1_1981_3a0bd428-c11d-47f6-9fb...
    CORDEX_domain:          undefined

pyku.meta.get_geographic_latlon_varnames(ds)

Identify the variables holding geographic latitudes and longitudes within the dataset.

Parameters:

ds (xarray.Dataset) – The input dataset.

Returns:

Names of the variables holding geographic latitudes and longitudes.

Return type:

tuple[str]

Example

In [1]: import pyku
   ...: ds = pyku.resources.get_test_data('hyras-tas-monthly')
   ...: ds.pyku.get_geographic_latlon_varnames()
   ...: 
Out[1]: ('lat', 'lon')

pyku.meta.get_latlon_bounds_varnames(ds)

Get the names of the geographic latitude/longitude bounds variables.

Parameters:

ds (xarray.Dataset) – The input dataset.

Returns:

Names of the geographic latitude and longitude bounds variables.

Return type:

tuple

Example

In [1]: import pyku
   ...: ds = pyku.resources.get_test_data('hyras')
   ...: ds.pyku.get_latlon_bounds_varnames()
   ...: 
Out[1]: (None, None)

pyku.meta.get_projection_yx_varnames(ds)

Get the names of the projection coordinates.

Parameters:

ds (xarray.Dataset) – The input dataset.

Returns:

Names (y, x) of the projection coordinates in the dataset.

Return type:

tuple[str]

Example

In [1]: import pyku
   ...: ds = pyku.resources.get_test_data('hyras-tas-monthly')
   ...: ds.pyku.get_projection_yx_varnames()
   ...: 
Out[1]: ('y', 'x')

pyku.meta.get_pyku_metadata()

Get pyku metadata.

Returns:

Dictionary of pyku metadata.

Return type:

dict

pyku.meta.get_spatial_bounds_varnames(ds)

Get the names of the spatial bounds variables.

Parameters:

ds (xarray.Dataset) – The input dataset.

Returns:

Names of the spatial bounds variables.

Return type:

list

Example

In [1]: import pyku
   ...: ds = pyku.resources.get_test_data('hyras')
   ...: ds.pyku.get_spatial_bounds_varnames()
   ...: 
Out[1]: []

pyku.meta.get_spatial_varnames(ds)

Get the names of the spatial variables, which combine:

  • spatial_vertices_varnames

  • spatial_bounds_varnames

  • geographic_latlon_varnames

  • projection_yx_varnames

  • crs_varname

Parameters:

ds (xarray.Dataset) – The input dataset

Returns:

Names of the spatial variables.

Return type:

list

Example

In [1]: import pyku
   ...: ds = pyku.resources.get_test_data('hyras')
   ...: ds.pyku.get_spatial_varnames()
   ...: 
Out[1]: ['lat', 'lon', 'y', 'x', 'crs']

pyku.meta.get_spatial_vertices_varnames(ds)

Get the names of the spatial vertices variables.

Parameters:

ds (xarray.Dataset) – The input dataset.

Returns:

Names of the spatial vertices variables.

Return type:

list

Example

In [1]: import pyku
   ...: ds = pyku.resources.get_test_data('hyras')
   ...: ds.pyku.get_spatial_vertices_varnames()
   ...: 
Out[1]: []

pyku.meta.get_time_bounds(ds, which=None)

Get time bounds from dataset

Parameters:
  • ds (xarray.Dataset) – The input dataset.

  • which (str) – One of None, ‘lower’, or ‘upper’. Default is None.

Returns:

Array of time bounds.

Return type:

numpy.ndarray

Example

In [1]: import pyku
   ...: ds = pyku.resources.get_test_data('hyras')
   ...: ds.pyku.get_time_bounds()[0:5]
   ...: 
Out[1]: 
array([['1981-01-01T00:00:00.000000000', '1981-01-02T00:00:00.000000000'],
       ['1981-01-02T00:00:00.000000000', '1981-01-03T00:00:00.000000000'],
       ['1981-01-03T00:00:00.000000000', '1981-01-04T00:00:00.000000000'],
       ['1981-01-04T00:00:00.000000000', '1981-01-05T00:00:00.000000000'],
       ['1981-01-05T00:00:00.000000000', '1981-01-06T00:00:00.000000000']],
      dtype='datetime64[ns]')
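
The which argument is documented only as None, lower, or upper. One plausible reading, which is an assumption here and is not confirmed by the source, is that None returns both columns (as in the example above), while lower and upper each return a single column. The hypothetical helper select_bounds sketches that reading on plain Python lists.

```python
def select_bounds(time_bounds, which=None):
    """Select both bounds, or only the lower or upper one.

    Hypothetical helper illustrating one possible meaning of `which`;
    pyku's actual behaviour may differ.
    """
    if which is None:
        return time_bounds
    column = {'lower': 0, 'upper': 1}[which]
    return [pair[column] for pair in time_bounds]


bounds = [
    ('1981-01-01', '1981-01-02'),
    ('1981-01-02', '1981-01-03'),
]
lower = select_bounds(bounds, which='lower')
upper = select_bounds(bounds, which='upper')
print(lower)  # ['1981-01-01', '1981-01-02']
print(upper)  # ['1981-01-02', '1981-01-03']
```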

pyku.meta.get_time_bounds_varname(ds)

Get name of time bounds variable

Parameters:

ds (xarray.Dataset) – The input dataset.

Returns:

Name of the time bounds.

Return type:

str

Example

In [1]: import pyku
   ...: ds = pyku.resources.get_test_data('hyras')
   ...: ds.pyku.get_time_bounds_varname()
   ...: 
Out[1]: 'time_bnds'

pyku.meta.get_time_dependent_varnames(ds)

Get time dependent variables

Parameters:

ds (xarray.Dataset) – The input dataset.

Returns:

List of variables depending on time

Return type:

list(str)

Example

In [1]: import pyku
   ...: ds = pyku.resources.get_test_data('hyras')
   ...: ds.pyku.get_time_dependent_varnames()
   ...: 
Out[1]: ['number_of_stations', 'tas', 'time_bnds']

pyku.meta.get_time_intervals(ds)

Get time intervals between consecutive datapoints.

Parameters:

ds (xarray.Dataset) – The input dataset.

Returns:

Dataset with time intervals

Return type:

xarray.Dataset

Example

In [1]: import pyku
   ...: ds = pyku.resources.get_test_data('hyras')
   ...: ds.pyku.get_time_intervals().interval.values[0:5]
   ...: 
Out[1]: array([86400., 86400., 86400., 86400., 86400.])
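
The value 86400 in the output above matches the number of seconds in one day, which suggests, though the source does not state it, that intervals are expressed in seconds. The same computation for consecutive daily timestamps can be sketched with the standard library alone:

```python
from datetime import datetime, timedelta

# Six consecutive daily timestamps starting on 1981-01-01.
times = [datetime(1981, 1, 1) + timedelta(days=i) for i in range(6)]

# Difference between each timestamp and the next one, in seconds.
intervals = [
    (later - earlier).total_seconds()
    for earlier, later in zip(times, times[1:])
]
print(intervals)  # [86400.0, 86400.0, 86400.0, 86400.0, 86400.0]
```

Six timestamps yield five intervals, since each interval spans a pair of consecutive datapoints.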

pyku.meta.get_unidentified_varnames(ds)

Get name of unidentified variables

Parameters:

ds (xarray.Dataset) – The input dataset.

Returns:

The names of unidentified variables.

Return type:

List[str]

Example

In [1]: import pyku
   ...: ds = pyku.resources.get_test_data('hyras')
   ...: ds.pyku.get_unidentified_varnames()
   ...: 
Out[1]: ['number_of_stations']

pyku.meta.has_geographic_coordinates(dat)

Determine if the data has geographic coordinates.

Parameters:

dat (xarray.Dataset) – The input data.

Returns:

True if the data has geographic coordinates.

Return type:

bool

Example

In [1]: import pyku
   ...: ds = pyku.resources.get_test_data('hyras')
   ...: ds.pyku.has_geographic_coordinates()
   ...: 
Out[1]: True

pyku.meta.has_ordered_dimensions_and_coordinates(ds)

Checks whether the dimensions and coordinates of the dataset are ordered according to Pyku’s recommendations:

  • ‘time’ appears first (if it exists)

  • ‘lat’ and ‘lon’ are positioned last (if they exist)

  • All other coordinates retain their relative order

While Pyku can handle any order of dimensions and coordinates, following this recommended structure ensures a more standardized data layout, reducing the likelihood of encountering edge cases.

Parameters:

ds (xarray.Dataset) – The input dataset.

Returns:

Whether the ordering of the dataset's dimensions and coordinates corresponds to pyku’s recommendations.

Return type:

bool
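
The ordering rule listed above can be sketched on a plain list of names. The helpers recommended_order and is_ordered_sketch are hypothetical and only mirror the three bullet points; the assumption that ‘lat’ and ‘lon’ keep their relative order at the end is not stated in the source.

```python
def recommended_order(names):
    """Put 'time' first and 'lat'/'lon' last, keeping relative order otherwise.

    Hypothetical helper; not pyku's implementation.
    """
    first = [n for n in names if n == 'time']
    last = [n for n in names if n in ('lat', 'lon')]
    middle = [n for n in names if n not in ('time', 'lat', 'lon')]
    return first + middle + last


def is_ordered_sketch(names):
    """Check whether the names already follow the recommended order."""
    return list(names) == recommended_order(names)


print(recommended_order(['lon', 'height', 'lat', 'time']))
# ['time', 'height', 'lon', 'lat']
print(is_ordered_sketch(['time', 'lat', 'lon']))  # True
print(is_ordered_sketch(['lon', 'lat', 'time']))  # False
```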

pyku.meta.has_projection_coordinates(dat)

Determine if the data has y/x projection coordinates.

Parameters:

dat (xarray.Dataset) – The input data

Returns:

True if data has projection coordinates.

Return type:

bool

Example

In [1]: import pyku
   ...: ds = pyku.resources.get_test_data('hyras')
   ...: ds.pyku.has_projection_coordinates()
   ...: 
Out[1]: True

pyku.meta.has_time_bounds(ds)

Check if dataset has time bounds

Parameters:

ds (xarray.Dataset) – The input dataset

Returns:

True if dataset has time bounds, False otherwise.

Return type:

bool

Example

In [1]: import pyku
   ...: ds = pyku.resources.get_test_data('hyras')
   ...: ds.pyku.has_time_bounds()
   ...: 
Out[1]: True

pyku.meta.has_unstructured_geographic_coordinates(ds)

Determine if the lat/lon geographic coordinates are unstructured.

Parameters:

ds (xarray.Dataset) – The input dataset.

Returns:

True if the lat/lon geographic coordinates are unstructured.

Return type:

bool

pyku.meta.is_georeferenced(ds)

Determine if the dataset is georeferenced.

A dataset is considered georeferenced if projection information is available in any supported format (CF, EPSG, WKT, or PROJ string) and either geographic or projected coordinates are present to compute the lower-left and upper-right corners.

Parameters:

ds (xarray.Dataset) – The input dataset.

Returns:

True if the dataset is georeferenced, False otherwise.

Return type:

bool

Example

In [1]: import pyku
   ...: ds = pyku.resources.get_test_data('hyras')
   ...: ds.pyku.is_georeferenced()
   ...: 
Out[1]: True
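
The definition above combines two conditions: projection information must be available, and at least one kind of coordinates must be present. A minimal sketch of that logic follows, with the hypothetical function is_georeferenced_sketch taking three booleans instead of inspecting a real dataset.

```python
def is_georeferenced_sketch(has_projection_info,
                            has_geographic_coords,
                            has_projected_coords):
    """Combine the two conditions of the definition.

    Hypothetical helper; pyku inspects the dataset itself.
    """
    has_coords = has_geographic_coords or has_projected_coords
    return has_projection_info and has_coords


# Projection information plus projected y/x coordinates.
print(is_georeferenced_sketch(True, False, True))   # True

# Coordinates are present, but projection information is missing.
print(is_georeferenced_sketch(False, True, True))   # False
```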

pyku.meta.reorder_dimensions_and_coordinates(ds)

Reorders dataset dimensions and coordinates to ensure that:

  • ‘time’ comes first (if it exists)

  • ‘lat’ and ‘lon’ come last (if they exist)

  • All other coordinates maintain their relative order

While Pyku can handle any order of dimensions and coordinates, following this recommended structure ensures a more standardized data layout, reducing the likelihood of encountering edge cases.

Parameters:

ds (xarray.Dataset) – The input dataset.

Returns:

Dataset with reordered dimensions and coordinates.

Return type:

xarray.Dataset

Examples

In [1]: import pyku
   ...: ds = pyku.resources.get_test_data('fake_cmip6_data')
   ...: 
   ...: # Shuffle dimensions and coordinates
   ...: # ----------------------------------
   ...: 
   ...: ds = ds.transpose('lon', 'lat', 'time')
   ...: ds = ds.assign_coords({
   ...:     'lon': ds.lon, 'lat': ds.lat, 'time': ds.time
   ...: })
   ...: 
   ...: # Apply pyku default dimensions and coordinates ordering
   ...: # ------------------------------------------------------
   ...: 
   ...: ds.pyku.reorder_dimensions_and_coordinates()
   ...: 
Out[1]: 
<xarray.Dataset> Size: 189MB
Dimensions:  (time: 365, lon: 360, lat: 180)
Coordinates:
  * time     (time) datetime64[ns] 3kB 2023-01-01 2023-01-02 ... 2023-12-31
  * lon      (lon) float64 3kB -180.0 -179.0 -178.0 -177.0 ... 178.0 179.0 180.0
  * lat      (lat) float64 1kB -90.0 -88.99 -87.99 -86.98 ... 87.99 88.99 90.0
Data variables:
    tas      (time, lon, lat) float64 189MB 28.06 25.39 20.84 ... 6.376 7.319
Attributes: (12/51)
    name:                   /ccc/work/cont003/gencmip6/checagar/IGCM_OUT/IPSL...
    Conventions:            CF-1.7 CMIP-6.2
    creation_date:          2020-10-18T15:18:15Z
    tracking_id:            hdl:21.14100/4f03accf-6a30-44d9-a20e-8ac4fde7055f
    description:            CMIP6 historical
    title:                  IPSL-CM6A-LR-INCA model output prepared for CMIP6...
    ...                     ...
    variable_id:            zg
    variant_label:          r1i1p1f1
    EXPID:                  historical
    CMIP6_CV_version:       cv=6.2.15.1
    dr2xml_md5sum:          b6f602401512e82e2d7cadc2c6f36c2a
    model_version:          6.1.11

pyku.meta.select_common_datetimes(*args, **kwargs)

This function has moved to pyku.timekit.select_common_datetimes().

pyku.meta.set_time_bounds(*args, **kwargs)

This function has been renamed and moved to pyku.timekit.set_time_bounds_from_time_labels().

pyku.meta.set_time_labels_from_time_bounds(*args, **kwargs)

This function has moved to pyku.timekit.set_time_labels_from_time_bounds().

pyku.meta.to_gregorian_calendar(*args, **kwargs)

This function has moved to pyku.timekit.to_gregorian_calendar().

pyku.meta.to_netcdf(ds, output_file)

Deprecated. Use pyku.magic.to_netcdf() instead.