Data structures - IO demo

28/06/22

This notebook outlines various IO methods for reading and writing data. For a general data structures overview, see the Data structures intro doc.

Note that this page details the functional forms; for class usage see the base class intro page. However, as of June 2022, file writers are not fully implemented for the data class.

Load libraries

[1]:
from pathlib import Path

import epsproc as ep

# Set data path
# Note this is set here from ep.__path__, but may not be correct in all cases - depends on where the Github repo is.
epDemoDataPath = Path(ep.__path__[0]).parent/'data'
OMP: Info #273: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
* sparse not found, sparse matrix forms not available.
* natsort not found, some sorting functions not available.
* Setting plotter defaults with epsproc.basicPlotters.setPlotters(). Run directly to modify, or change options in local env.
* Set Holoviews with bokeh.
* pyevtk not found, VTK export not available.

Loading ePolyScat data

To start with, load data from an ePolyScat output file. This will load matrix elements & cross-sections; for more details see the basic demo page and the ePolyScat basics page.

[2]:
# Load data from modPath\data
dataPath = Path(epDemoDataPath, 'photoionization')
dataFile = Path(dataPath, 'n2_3sg_0.1-50.1eV_A2.inp.out')  # Set for sample N2 data for testing

# Scan data file
dataSet = ep.readMatEle(fileIn = dataFile.as_posix())
data = dataSet[0]
# dataXS = ep.readMatEle(fileIn = dataFile.as_posix(), recordType = 'CrossSection')  # XS info currently not set in NO2 sample file.
*** ePSproc readMatEle(): scanning files for DumpIdy segments.

*** Scanning file(s)
['/home/jovyan/github/epsproc/data/photoionization/n2_3sg_0.1-50.1eV_A2.inp.out']

*** FileListSort
  Prefix: /home/jovyan/github/epsproc/data/photoionization/n2_3sg_0.1-50.1eV_A2.inp.out
  1 groups.

*** Reading ePS output file:  /home/jovyan/github/epsproc/data/photoionization/n2_3sg_0.1-50.1eV_A2.inp.out
*** IO.fileParse() found 1 segments with
        Start: ScatEng
        End: ['#'].
Expecting 51 energy points.
*** IO.fileParse() found 2 segments with
        Start: ScatSym
        End: ['FileName', '\n'].
Expecting 2 symmetries.
*** IO.fileParse() found 102 segments with
        Start: DumpIdy - dump
        End: ['+ Command', 'Time Now'].
Found 102 dumpIdy segments (sets of matrix elements).

Processing segments to Xarrays...
Processed 102 sets of DumpIdy file segments, (0 blank)
[3]:
# All data is pushed to Xarrays
data
[3]:
<xarray.DataArray 'n2_3sg_0.1-50.1eV_A2.inp.out' (LM: 18, Eke: 51, Sym: 2,
                                                  mu: 3, it: 1, Type: 2)>
array([[[[[[           nan          +nanj,
                       nan          +nanj]],

          [[           nan          +nanj,
                       nan          +nanj]],

          [[           nan          +nanj,
                       nan          +nanj]]],


         [[[           nan          +nanj,
                       nan          +nanj]],

          [[           nan          +nanj,
                       nan          +nanj]],

          [[-1.7757076e+00+6.3474768e-01j,
            -1.9403462e+00+6.9465999e-01j]]]],


...


        [[[[           nan          +nanj,
                       nan          +nanj]],

          [[           nan          +nanj,
                       nan          +nanj]],

          [[           nan          +nanj,
                       nan          +nanj]]],


         [[[ 8.9213389e-06-4.6971505e-06j,
                       nan          +nanj]],

          [[           nan          +nanj,
                       nan          +nanj]],

          [[           nan          +nanj,
                       nan          +nanj]]]]]])
Coordinates:
  * LM       (LM) MultiIndex
  - l        (LM) int64 1 1 1 3 3 3 5 5 5 7 7 7 9 9 9 11 11 11
  - m        (LM) int64 -1 0 1 -1 0 1 -1 0 1 -1 0 1 -1 0 1 -1 0 1
  * mu       (mu) int64 -1 0 1
  * Type     (Type) <U1 'L' 'V'
  * it       (it) int64 1
  * Sym      (Sym) MultiIndex
  - Cont     (Sym) object 'SU' 'PU'
  - Targ     (Sym) object 'SG' 'SG'
  - Total    (Sym) object 'SU' 'PU'
  * Eke      (Eke) float64 0.1 1.1 2.1 3.1 4.1 5.1 ... 46.1 47.1 48.1 49.1 50.1
    Ehv      (Eke) float64 15.68 16.68 17.68 18.68 ... 62.68 63.68 64.68 65.68
    SF       (Eke) complex128 (2.1560627+3.741674j) ... (4.4127053+1.8281945j)
Attributes:
    dataType:  matE
    file:      n2_3sg_0.1-50.1eV_A2.inp.out
    fileBase:  /home/jovyan/github/epsproc/data/photoionization
    fileList:  n2_3sg_0.1-50.1eV_A2.inp.out

Xarray methods

Xarrays can be written to disk and read back with ep.IO.writeXarray() and ep.IO.readXarray().

These wrap various backends, including Xarray’s netCDF writer, and implement some additional complex number and dim handling for ePSproc cases (depending on the backend). In general, the file writers aim to preserve all dimensions, including stacking, if possible.

For other low-level data IO options, see the Xarray documentation. Higher level routines are currently in development for ePSproc & PEMtk, see further notes at https://github.com/phockett/ePSproc/issues/8 and https://github.com/phockett/PEMtk/issues/6.

Complex number and attribute handling

Note that native Xarray methods have some limitations…

  1. Issues with complex data, which is not supported by netCDF - either convert to Re+Im format, or use the h5netcdf backend with forceComplex=True as a workaround (but note this must also be set at file read).

  2. Issues with nested dict attribs.

  3. Issues with MultiIndex coords, esp. in non-dim coords.

With h5netcdf and the invalid_netcdf option, issue (1) is OK (see the Xarray docs for details), although the data still needs to be unstacked, and ‘to_dataset’ may also be needed for more control, otherwise arbitrarily-named items can appear in the file (if the DataArray name is missing).
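As a minimal sketch of the Re+Im split workaround for point (1), using only native Xarray and numpy (generic values, not the demo data, and omitting any backend specifics):

```python
import numpy as np
import xarray as xr

# Complex-valued test array
da = xr.DataArray(np.array([1 + 2j, 3 - 4j]), dims='x', name='demo')

# Split to netCDF-safe float components before writing...
ds = xr.Dataset({'Re': da.real, 'Im': da.imag})
# ds.to_netcdf(...)  # ...write with any standard engine...

# ...and recombine after reading back
daRecon = ds['Re'] + 1j * ds['Im']
assert daRecon.equals(da)
```

This is essentially what the default writeXarray case (forceComplex = False) does on the backend, plus attribute and dim handling.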

For general attrib handling, ep.IO.sanitizeAttrsNetCDF() attempts to clean this up at file IO if an exception is raised, although this may be lossy in some cases.

For non-Xarray file types, e.g. HDF5, some additional custom handling is included, mainly via dictionary conversion routines, which are discussed below.

TODO:

NetCDF methods

For Xarray’s netCDF writer, valid engines (handlers/libraries/formats) are h5netcdf (preferred), netcdf4 and scipy. The default case uses engine = 'h5netcdf' and forceComplex = False; the output file will include separate real and imaginary components and a flat (unstacked) representation of the array.

TODO: test for multiple Xarray per file.

[4]:
dataPath = Path(epDemoDataPath, 'photoionization','fileIOtests')
dataFile = Path(dataPath, 'n2_3sg_0.1-50.1eV_A2')  # Set for sample N2 data for testing

ep.IO.writeXarray(data, fileName = dataFile.as_posix(), filePath = dataPath.as_posix())   # Default case set as: engine = 'h5netcdf', forceComplex = False
['Written to h5netcdf format', '/home/jovyan/github/epsproc/data/photoionization/fileIOtests/n2_3sg_0.1-50.1eV_A2.nc']
[4]:
['Written to h5netcdf format',
 '/home/jovyan/github/epsproc/data/photoionization/fileIOtests/n2_3sg_0.1-50.1eV_A2.nc']
[5]:
dataIn = ep.IO.readXarray(fileName = dataFile.as_posix() + '.nc', filePath = dataPath.as_posix())  #, forceComplex=forceComplex, forceArray=False)
[6]:
dataIn  # Note dims are restacked, although ordering may change
[6]:
<xarray.DataArray (Eke: 51, mu: 3, it: 1, Type: 2, Sym: 4, LM: 18)>
array([[[[[[           nan          +nanj,
                       nan          +nanj,
            -1.7757076e+00+6.3474768e-01j, ...,
                       nan          +nanj,
                       nan          +nanj,
                       nan          +nanj],
           [           nan          +nanj,
                       nan          +nanj,
                       nan          +nanj, ...,
                       nan          +nanj,
                       nan          +nanj,
                       nan          +nanj],
           [           nan          +nanj,
                       nan          +nanj,
                       nan          +nanj, ...,
                       nan          +nanj,
                       nan          +nanj,
                       nan          +nanj],
           [           nan          +nanj,
                       nan          +nanj,
...
                       nan          +nanj,
                       nan          +nanj],
           [           nan          +nanj,
                       nan          +nanj,
                       nan          +nanj, ...,
                       nan          +nanj,
                       nan          +nanj,
                       nan          +nanj],
           [           nan          +nanj,
                       nan          +nanj,
                       nan          +nanj, ...,
                       nan          +nanj,
                       nan          +nanj,
                       nan          +nanj],
           [           nan          +nanj,
                       nan          +nanj,
                       nan          +nanj, ...,
                       nan          +nanj,
                       nan          +nanj,
                       nan          +nanj]]]]]])
Coordinates:
    Ehv      (Eke) float64 15.68 16.68 17.68 18.68 ... 62.68 63.68 64.68 65.68
  * Eke      (Eke) float64 0.1 1.1 2.1 3.1 4.1 5.1 ... 46.1 47.1 48.1 49.1 50.1
  * Type     (Type) object 'L' 'V'
  * it       (it) int64 1
  * mu       (mu) int64 -1 0 1
    SF       (Eke) complex128 (2.1560627+3.741674j) ... (4.4127053+1.8281945j)
  * Sym      (Sym) MultiIndex
  - Cont     (Sym) object 'PU' 'PU' 'SU' 'SU'
  - Targ     (Sym) object 'SG' 'SG' 'SG' 'SG'
  - Total    (Sym) object 'PU' 'SU' 'PU' 'SU'
  * LM       (LM) MultiIndex
  - l        (LM) int64 1 1 1 3 3 3 5 5 5 7 7 7 9 9 9 11 11 11
  - m        (LM) int64 -1 0 1 -1 0 1 -1 0 1 -1 0 1 -1 0 1 -1 0 1
Attributes:
    dataType:  matE
    file:      n2_3sg_0.1-50.1eV_A2.inp.out
    fileBase:  /home/jovyan/github/epsproc/data/photoionization
    fileList:  n2_3sg_0.1-50.1eV_A2.inp.out
[7]:
# Testing for equality with the original currently returns false - dim ordering might be the issue here?
dataIn.equals(data)
[7]:
False
[8]:
# Subtraction indicates identical data however.
(dataIn - data).max()
[8]:
<xarray.DataArray ()>
array(0.+0.j)

Low-level Xarray routines

The Xarray native routines can be used directly, although they may require additional args and/or data reformatting.

Note that the netCDF writer always writes to Dataset format, although there are readers for both Datasets and DataArrays. Other available methods include Pickle, Zarr…

[10]:
# Read file with base open_dataset
import xarray as xr
dataIn = xr.open_dataset(dataFile.as_posix() + '.nc', engine = 'h5netcdf')
dataIn
[10]:
<xarray.Dataset>
Dimensions:  (Cont: 2, Eke: 51, mu: 3, it: 1, Type: 2, l: 6, m: 3, Targ: 1,
              Total: 2)
Coordinates:
  * Cont     (Cont) object 'PU' 'SU'
    Ehv      (Eke) float64 15.68 16.68 17.68 18.68 ... 62.68 63.68 64.68 65.68
  * Eke      (Eke) float64 0.1 1.1 2.1 3.1 4.1 5.1 ... 46.1 47.1 48.1 49.1 50.1
    SFi      (Eke) float64 3.742 3.628 3.524 3.428 ... 1.871 1.857 1.842 1.828
    SFr      (Eke) float64 2.156 2.224 2.289 2.353 ... 4.311 4.345 4.379 4.413
  * Targ     (Targ) object 'SG'
  * Total    (Total) object 'PU' 'SU'
  * Type     (Type) object 'L' 'V'
  * it       (it) int64 1
  * l        (l) int64 1 3 5 7 9 11
  * m        (m) int64 -1 0 1
  * mu       (mu) int64 -1 0 1
Data variables:
    Im       (Eke, mu, it, Type, l, m, Cont, Targ, Total) float64 nan ... nan
    Re       (Eke, mu, it, Type, l, m, Cont, Targ, Total) float64 nan ... nan
Attributes:
    complex:   split
    dataType:  matE
    file:      n2_3sg_0.1-50.1eV_A2.inp.out
    fileBase:  /home/jovyan/github/epsproc/data/photoionization
    fileList:  n2_3sg_0.1-50.1eV_A2.inp.out
[11]:
# Read file with base open_dataarray - this may fail for datasets
dataIn = xr.open_dataarray(dataFile.as_posix() + '.nc', engine = 'h5netcdf')
dataIn
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [11], in <cell line: 2>()
      1 # Read file with base open_dataarray - this may fail for datasets
----> 2 dataIn = xr.open_dataarray(dataFile.as_posix() + '.nc', engine = 'h5netcdf')
      3 dataIn

File /opt/conda/lib/python3.9/site-packages/xarray/backends/api.py:670, in open_dataarray(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
    652 dataset = open_dataset(
    653     filename_or_obj,
    654     decode_cf=decode_cf,
   (...)
    666     **kwargs,
    667 )
    669 if len(dataset.data_vars) != 1:
--> 670     raise ValueError(
    671         "Given file dataset contains more than one data "
    672         "variable. Please read with xarray.open_dataset and "
    673         "then select the variable you want."
    674     )
    675 else:
    676     (data_array,) = dataset.data_vars.values()

ValueError: Given file dataset contains more than one data variable. Please read with xarray.open_dataset and then select the variable you want.
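As the error message suggests, multi-variable files should be opened as a Dataset and the variable(s) selected manually. A minimal, self-contained sketch of the pattern (using a temporary file and the scipy netCDF3 backend with generic values, not the demo file above):

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Write a split Re/Im dataset (as produced by the default writer case)
ds = xr.Dataset({'Re': ('x', np.array([1.0, 3.0])),
                 'Im': ('x', np.array([2.0, -4.0]))})
fOut = os.path.join(tempfile.mkdtemp(), 'split_demo.nc')
ds.to_netcdf(fOut, engine='scipy')

# Read back as a Dataset, then select & recombine the variables
dsIn = xr.open_dataset(fOut, engine='scipy')
dataC = dsIn['Re'] + 1j * dsIn['Im']
```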

HDF5

HDF5 support is available if the h5py library is present, and is wrapped for Xarray use with the routines in epsproc.ioBackends.hdf5IO. These handle most python native and numpy data types. The routines also include string wrappers for unsupported data types (and/or they can be purged for saving). For more details see the dictionary conversion options section below (the Xarray objects are serialized to dictionaries prior to file IO).

The main benefits of HDF5 (vs. netCDF) are broader type handling (esp. complex data) and cross-compatibility with other languages. String wrapping is currently applied to all fields except the numerical data, ensuring handling of nested dictionaries. (See source for writeXarrayToHDF5 and sanitizeAttrsNetCDF.)

NOTE: string unwrapping is done via ast.literal_eval(), which supports: strings, bytes, numbers, tuples, lists, dicts, sets, booleans, and None. This should be safe in general (it cannot evaluate arbitrary code), but this behaviour can be controlled with the evalStrings option at file read if required.
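The scope of ast.literal_eval() can be illustrated directly (generic values, not the demo file contents):

```python
from ast import literal_eval

# String-wrapped attributes unwrap to native types...
wrapped = "{'dataType': 'matE', 'Eke': [0.1, 1.1, 2.1], 'it': 1}"
attrs = literal_eval(wrapped)
assert attrs['Eke'] == [0.1, 1.1, 2.1]

# ...but arbitrary code is rejected (raises ValueError)
try:
    literal_eval("__import__('os').getcwd()")
except ValueError:
    print('Unsafe string rejected')
```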

TODO:

  • more options here, e.g. nested dict unpacking to HDF5 as well as string wrap.

  • fix XR name, which currently appears as a byte string.

  • Test multiple items per file.

  • Test IO with, e.g., Matlab.

Other libs/tools for this to be investigated…

[12]:
dataPath = Path(epDemoDataPath, 'photoionization','fileIOtests')
dataFile = Path(dataPath, 'n2_3sg_0.1-50.1eV_A2.h5')  # Set for sample N2 data for testing

ep.IO.writeXarray(data, fileName = dataFile.as_posix(), filePath = dataPath.as_posix(), engine = 'hdf5')   # Default case set as: engine = 'h5netcdf', forceComplex = False
[12]:
PosixPath('/home/jovyan/github/epsproc/data/photoionization/fileIOtests/n2_3sg_0.1-50.1eV_A2.h5')
[13]:
dataPath = Path(epDemoDataPath, 'photoionization','fileIOtests')
dataFile = Path(dataPath, 'n2_3sg_0.1-50.1eV_A2.h5')  # Set for sample N2 data for testing

# HDF5 file reader, note this returns both dict and Xarray formats
dataInDict, dataInXR = ep.IO.readXarray(fileName = dataFile.as_posix(), filePath = dataPath.as_posix(), engine = 'hdf5')
*** Opened file /home/jovyan/github/epsproc/data/photoionization/fileIOtests/n2_3sg_0.1-50.1eV_A2.h5
Found top-level group(s) ['matE']
Reading group matE
[14]:
# Reconstructed array. Note this also now includes self.attrs['dimMaps'], which was added by the file writer to track dimensions.
dataInXR
[14]:
<xarray.DataArray 'n2_3sg_0.1-50.1eV_A2.inp.out' (Eke: 51, mu: 3, it: 1,
                                                  Type: 2, Sym: 2, LM: 18)>
array([[[[[[           nan          +nanj,
                       nan          +nanj,
            -1.7757076e+00+6.3474768e-01j, ...,
                       nan          +nanj,
                       nan          +nanj,
                       nan          +nanj],
           [           nan          +nanj,
                       nan          +nanj,
                       nan          +nanj, ...,
                       nan          +nanj,
                       nan          +nanj,
                       nan          +nanj]],

          [[           nan          +nanj,
                       nan          +nanj,
            -1.9403462e+00+6.9465999e-01j, ...,
                       nan          +nanj,
                       nan          +nanj,
                       nan          +nanj],
           [           nan          +nanj,
...
                       nan          +nanj],
           [           nan          +nanj,
                       nan          +nanj,
                       nan          +nanj, ...,
                       nan          +nanj,
                       nan          +nanj,
                       nan          +nanj]],

          [[-4.1606509e-01-6.4490113e-01j,
                       nan          +nanj,
                       nan          +nanj, ...,
             5.3667894e-06-2.4842580e-06j,
                       nan          +nanj,
                       nan          +nanj],
           [           nan          +nanj,
                       nan          +nanj,
                       nan          +nanj, ...,
                       nan          +nanj,
                       nan          +nanj,
                       nan          +nanj]]]]]])
Coordinates:
  * mu       (mu) int64 -1 0 1
  * Type     (Type) <U1 'L' 'V'
  * it       (it) int64 1
  * Eke      (Eke) float64 0.1 1.1 2.1 3.1 4.1 5.1 ... 46.1 47.1 48.1 49.1 50.1
    Ehv      (Eke) float64 15.68 16.68 17.68 18.68 ... 62.68 63.68 64.68 65.68
    SF       (Eke) complex128 (2.1560627+3.741674j) ... (4.4127053+1.8281945j)
  * Sym      (Sym) MultiIndex
  - Cont     (Sym) object 'PU' 'SU'
  - Targ     (Sym) object 'SG' 'SG'
  - Total    (Sym) object 'PU' 'SU'
  * LM       (LM) MultiIndex
  - l        (LM) int64 1 1 1 3 3 3 5 5 5 7 7 7 9 9 9 11 11 11
  - m        (LM) int64 -1 0 1 -1 0 1 -1 0 1 -1 0 1 -1 0 1 -1 0 1
Attributes:
    dataType:  matE
    file:      n2_3sg_0.1-50.1eV_A2.inp.out
    fileBase:  /home/jovyan/github/epsproc/data/photoionization
    fileList:  n2_3sg_0.1-50.1eV_A2.inp.out
    dimMaps:   {'dataDims': ('LM', 'Eke', 'Sym', 'mu', 'it', 'Type'), 'dataDi...
[15]:
# data.attrs['dimMaps'] = dataInXR.attrs['dimMaps']
print(dataInXR.equals(data))  # False...
(dataInXR - data).max()   # But data seems OK
False
[15]:
<xarray.DataArray 'n2_3sg_0.1-50.1eV_A2.inp.out' ()>
array(0.+0.j)
[16]:
# Additional data can be appended to the file
# Note that the top-level group defaults to data.dataType, so a different name may be required for this
ep.IO.writeXarray(data, fileName = dataFile.as_posix(), filePath = dataPath.as_posix(), engine = 'hdf5', dataName = 'matE2')
[16]:
PosixPath('/home/jovyan/github/epsproc/data/photoionization/fileIOtests/n2_3sg_0.1-50.1eV_A2.h5')

Low-level HDF5 IO

Native HDF5 file IO can also be used (the files should be readable by any standard HDF5 tools); the file contains a dictionary-like representation of the data structure. For more details on h5py see the docs.

[17]:
import h5py

dataPath = Path(epDemoDataPath, 'photoionization','fileIOtests')
dataFile = Path(dataPath, 'n2_3sg_0.1-50.1eV_A2.h5')

# Open file with h5py
fIn = Path(dataPath,dataFile)
hf = h5py.File(fIn.as_posix(), 'r')
[18]:
# Top-level groups
hf.keys()
[18]:
<KeysViewHDF5 ['matE', 'matE2']>
[19]:
# Group items
hf['matE'].keys()
[19]:
<KeysViewHDF5 ['attrs', 'coords', 'data', 'dims', 'name']>
[20]:
# Get data from a subgroup
hf['matE']['name'][()]
[20]:
b'n2_3sg_0.1-50.1eV_A2.inp.out'
[21]:
# Get data from a subgroup
hf['matE']['dims'][()]
[21]:
b"('Eke', 'mu', 'it', 'Type', 'l', 'm', 'Cont', 'Targ', 'Total')"
[22]:
# Close file stream
hf.close()

Numpy and Pandas

Conversion to other classes can be used to access their native IO routines. Note, however, that this may be lossy, depending on the type of object converted.

Pandas

Convert to a Pandas DataFrame with ep.multiDimXrToPD(), then use any standard Pandas IO functionality.

In general this should work for all cases (including complex data), but may lose metadata/attributes.

For general data sharing, to_hdf and to_csv methods are quite useful. Note Pandas to_hdf uses PyTables on the backend.

TODO: wrapper for attributes & restacking? See PEMtk methods.

[23]:
# Convert to Pandas dataframe & save to HDF5
dataFile = Path(dataPath, 'n2_3sg_0.1-50.1eV_A2_pd.h5')
pdDF, *_ = ep.multiDimXrToPD(data, colDims = 'Eke')

# Pandas .to_hdf - note this also needs a key set, which will be used as the top-level node in the output file.
# Similarly to the above HDF5 case, multiple datasets can be written with different group names (keys).
pdDF.to_hdf(dataFile, key = 'N2')
pdDF.to_hdf(dataFile, key = 'N2v2')

# Pandas to_csv
pdDF.to_csv(dataFile.with_suffix('.csv'))
[24]:
# Read back
import pandas as pd

pdIn = pd.read_hdf(dataFile, key = 'N2')
[25]:
# Convert back to Xarray if desired
import xarray as xr

xrNew = xr.DataArray(pdIn)  # This will produce a flat array
xrRecon, *_ = ep.util.misc.restack(xrNew, refDims = 'matE')   # Restack to specified dataType
xrRecon
[25]:
<xarray.DataArray (Eke: 51, Type: 2, it: 1, mu: 3, Sym: 4, LM: 18)>
array([[[[[[           nan          +nanj,
                       nan          +nanj,
            -1.7757076e+00+6.3474768e-01j, ...,
                       nan          +nanj,
                       nan          +nanj,
                       nan          +nanj],
           [           nan          +nanj,
                       nan          +nanj,
                       nan          +nanj, ...,
                       nan          +nanj,
                       nan          +nanj,
                       nan          +nanj],
           [           nan          +nanj,
                       nan          +nanj,
                       nan          +nanj, ...,
                       nan          +nanj,
                       nan          +nanj,
                       nan          +nanj],
           [           nan          +nanj,
                       nan          +nanj,
...
                       nan          +nanj,
                       nan          +nanj],
           [           nan          +nanj,
                       nan          +nanj,
                       nan          +nanj, ...,
                       nan          +nanj,
                       nan          +nanj,
                       nan          +nanj],
           [           nan          +nanj,
                       nan          +nanj,
                       nan          +nanj, ...,
                       nan          +nanj,
                       nan          +nanj,
                       nan          +nanj],
           [           nan          +nanj,
                       nan          +nanj,
                       nan          +nanj, ...,
                       nan          +nanj,
                       nan          +nanj,
                       nan          +nanj]]]]]])
Coordinates:
  * Eke      (Eke) float64 0.1 1.1 2.1 3.1 4.1 5.1 ... 46.1 47.1 48.1 49.1 50.1
  * Type     (Type) object 'L' 'V'
  * it       (it) int64 1
  * mu       (mu) int64 -1 0 1
  * Sym      (Sym) MultiIndex
  - Cont     (Sym) object 'PU' 'PU' 'SU' 'SU'
  - Targ     (Sym) object 'SG' 'SG' 'SG' 'SG'
  - Total    (Sym) object 'PU' 'SU' 'PU' 'SU'
  * LM       (LM) MultiIndex
  - l        (LM) int64 1 1 1 3 3 3 5 5 5 7 7 7 9 9 9 11 11 11
  - m        (LM) int64 -1 0 1 -1 0 1 -1 0 1 -1 0 1 -1 0 1 -1 0 1
[26]:
print(xrRecon.equals(data))  # False...
(xrRecon - data).max()   # But data seems OK
False
[26]:
<xarray.DataArray ()>
array(0.+0.j)
[27]:
# Check file with h5py
hf = h5py.File(dataFile.as_posix(), 'r')
# Top-level groups
print(hf.keys())
hf.close()
<KeysViewHDF5 ['N2', 'N2v2']>

Numpy

Raw data can be saved directly as numpy arrays; see the numpy docs for options and methods. Note, however, that no metadata will be saved in this case, just the raw arrays.

TODO: wrappers for coords, and metadata/side-car files?
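One possible side-car pattern (hypothetical, not an ePSproc routine): save the raw array with numpy, plus a JSON file of dims/coords/attrs alongside, assuming JSON-serializable coordinate values.

```python
import json
import os
import tempfile

import numpy as np

# Raw array plus metadata to keep alongside it
arr = np.arange(6.0).reshape(2, 3)
meta = {'dims': ['x', 'y'],
        'coords': {'x': [0, 1], 'y': [0.1, 1.1, 2.1]},
        'attrs': {'dataType': 'demo'}}

tmpDir = tempfile.mkdtemp()
np.save(os.path.join(tmpDir, 'demo.npy'), arr)
with open(os.path.join(tmpDir, 'demo.json'), 'w') as f:
    json.dump(meta, f)

# Read back both parts
arrIn = np.load(os.path.join(tmpDir, 'demo.npy'))
with open(os.path.join(tmpDir, 'demo.json')) as f:
    metaIn = json.load(f)
```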

[28]:
import numpy as np

dataFile = Path(dataPath, 'n2_3sg_0.1-50.1eV_A2.npy')

# For a Numpy ND array from Xarray, use self.to_numpy() or self.data
# Note this is the main array contents only, coords and attrs are discarded
np.save(dataFile, data.to_numpy())
[29]:
# Load and check data
npIn = np.load(dataFile)
print(npIn.dtype, npIn.shape)
(data-npIn).max()
complex128 (18, 51, 2, 3, 1, 2)
[29]:
<xarray.DataArray 'n2_3sg_0.1-50.1eV_A2.inp.out' ()>
array(0.+0.j)

Dictionary methods

For more general/basic IO, Xarray objects can be serialized to native types (this is, in fact, already done on the backend for the netCDF and HDF5 file types).

Some general routines are provided for this, which wrap Xarray’s native .to_dict() method with some additional functionality. In particular full serialization of all coordinates is performed (including non-dimensional coords), and some additional wrappers/converters are implemented for problematic data types.

TODO: better wrapping/implementation of sanitizeAttrsNetCDF() routine, should be generalised.
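The native Xarray round trip these routines build on can be sketched as follows (generic values; note this does not handle MultiIndex coords, which is part of the extra functionality the ePSproc wrappers add):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(3.0), dims='x',
                  coords={'x': [0.1, 1.1, 2.1]},
                  attrs={'dataType': 'demo'}, name='demo')

d = da.to_dict()                     # Serialize to a nested dict of native types
daRecon = xr.DataArray.from_dict(d)  # Rebuild from the dict
assert daRecon.identical(da)
```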

[30]:
# Full dictionary conversion with deconstructDims
dataDict = ep.util.misc.deconstructDims(data, returnType='dict')

# Deconstruction + split complex data types to Re + Im components
dataDictSplit = ep.util.misc.deconstructDims(data, returnType='dict', splitComplex=True)

# Deconstruction + split complex + clean up attrs for safe types only
dataCleanAttrs, fullAttrs, log = ep.util.xrIO.sanitizeAttrsNetCDF(data)
dataCleanDictSplit = ep.util.misc.deconstructDims(dataCleanAttrs, returnType='dict', splitComplex=True)
[31]:
# Convert back to XR including restacking dimensions
xrRecon = ep.util.misc.reconstructDims(dataDict)
xrRecon
[31]:
<xarray.DataArray 'n2_3sg_0.1-50.1eV_A2.inp.out' (Eke: 51, mu: 3, it: 1,
                                                  Type: 2, Sym: 2, LM: 18)>
array([[[[[[           nan          +nanj,
                       nan          +nanj,
            -1.7757076e+00+6.3474768e-01j, ...,
                       nan          +nanj,
                       nan          +nanj,
                       nan          +nanj],
           [           nan          +nanj,
                       nan          +nanj,
                       nan          +nanj, ...,
                       nan          +nanj,
                       nan          +nanj,
                       nan          +nanj]],

          [[           nan          +nanj,
                       nan          +nanj,
            -1.9403462e+00+6.9465999e-01j, ...,
                       nan          +nanj,
                       nan          +nanj,
                       nan          +nanj],
           [           nan          +nanj,
...
                       nan          +nanj],
           [           nan          +nanj,
                       nan          +nanj,
                       nan          +nanj, ...,
                       nan          +nanj,
                       nan          +nanj,
                       nan          +nanj]],

          [[-4.1606509e-01-6.4490113e-01j,
                       nan          +nanj,
                       nan          +nanj, ...,
             5.3667894e-06-2.4842580e-06j,
                       nan          +nanj,
                       nan          +nanj],
           [           nan          +nanj,
                       nan          +nanj,
                       nan          +nanj, ...,
                       nan          +nanj,
                       nan          +nanj,
                       nan          +nanj]]]]]])
Coordinates:
  * mu       (mu) int64 -1 0 1
  * Type     (Type) <U1 'L' 'V'
  * it       (it) int64 1
  * Eke      (Eke) float64 0.1 1.1 2.1 3.1 4.1 5.1 ... 46.1 47.1 48.1 49.1 50.1
    Ehv      (Eke) float64 15.68 16.68 17.68 18.68 ... 62.68 63.68 64.68 65.68
    SF       (Eke) complex128 (2.1560627+3.741674j) ... (4.4127053+1.8281945j)
  * Sym      (Sym) MultiIndex
  - Cont     (Sym) object 'PU' 'SU'
  - Targ     (Sym) object 'SG' 'SG'
  - Total    (Sym) object 'PU' 'SU'
  * LM       (LM) MultiIndex
  - l        (LM) int64 1 1 1 3 3 3 5 5 5 7 7 7 9 9 9 11 11 11
  - m        (LM) int64 -1 0 1 -1 0 1 -1 0 1 -1 0 1 -1 0 1 -1 0 1
Attributes:
    dataType:  matE
    file:      n2_3sg_0.1-50.1eV_A2.inp.out
    fileBase:  /home/jovyan/github/epsproc/data/photoionization
    fileList:  n2_3sg_0.1-50.1eV_A2.inp.out
    dimMaps:   {'dataDims': ('LM', 'Eke', 'Sym', 'mu', 'it', 'Type'), 'dataDi...
[32]:
# Dictionaries can be handled as usual, e.g. dump to Pickle, JSON

import json

# JSON dump to file example
with open(Path(dataPath, 'n2_3sg_0.1-50.1eV_A2_dict.json'), 'w') as f:

    # json.dumps(dataDict)  # Fails for complex - not supported by JSON

    # OK with split complex to re + im format
    # Note this may still fail in some cases if any unsupported data types are in data.attrs.
    try:
        json.dump(dataDictSplit, f)
        print('JSON OK')
    except:
        print('JSON failed, trying with clean attrs')
        json.dump(dataCleanDictSplit, f)   # Try with cleaned-up attrs
        print('JSON with cleaned attrs OK')

# General JSON complex handling, see notes elsewhere...?
# E.g. https://stackoverflow.com/questions/27909658/json-encoder-and-decoder-for-complex-numpy-arrays
JSON OK
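For reference, a minimal custom encoder/decoder pair for complex values, following one common approach from the discussion linked above (the 're'/'im' key names here are arbitrary):

```python
import json

class ComplexEncoder(json.JSONEncoder):
    """Encode complex values as {'re': ..., 'im': ...} dicts."""
    def default(self, obj):
        if isinstance(obj, complex):
            return {'re': obj.real, 'im': obj.imag}
        return super().default(obj)

def asComplex(d):
    """Object hook to rebuild complex values on decode."""
    if set(d) == {'re', 'im'}:
        return complex(d['re'], d['im'])
    return d

s = json.dumps({'SF': 2.156 + 3.742j}, cls=ComplexEncoder)
dataIn = json.loads(s, object_hook=asComplex)
assert dataIn['SF'] == 2.156 + 3.742j
```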
[33]:
# The sanitizeAttrsNetCDF log notes any changes made
log
[33]:
{}

Pickle

As usual, Pickle can be used as a general method, but the usual caveats apply: it requires the relevant data structures and classes to be available at read-in, so can break with library versions, and should not be regarded as archival.

Note, however, that Pickle is the fastest way to save many objects, including complex class objects containing multiple dataarrays. (Other methods to be implemented soon.)

[34]:
import pickle

# Save a dictionary
with open(Path(dataPath, 'n2_3sg_0.1-50.1eV_A2_dict.pickle'), 'wb') as handle:
    pickle.dump(dataDict, handle, protocol=pickle.HIGHEST_PROTOCOL)


# Save an Xarray
with open(Path(dataPath, 'n2_3sg_0.1-50.1eV_A2_XR.pickle'), 'wb') as handle:
    pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)
[35]:
# Read back in to test
with open(Path(dataPath, 'n2_3sg_0.1-50.1eV_A2_dict.pickle'), 'rb') as handle:
    dataDictPklIn = pickle.load(handle)

print(dataDict==dataDictPklIn)  # This is False...
data.equals(ep.util.misc.reconstructDims(dataDictPklIn))  # Also False
(ep.util.misc.reconstructDims(dataDictPklIn) - data).max()   # But data looks OK. Dim ordering and/or attrs are different?
False
[35]:
<xarray.DataArray 'n2_3sg_0.1-50.1eV_A2.inp.out' ()>
array(0.+0.j)
[36]:
# Read back in to test
with open(Path(dataPath, 'n2_3sg_0.1-50.1eV_A2_XR.pickle'), 'rb') as handle:
    dataDictPklIn = pickle.load(handle)

print(data.equals(dataDictPklIn))  # This is True
(dataDictPklIn - data).max()   # And data looks OK.
True
[36]:
<xarray.DataArray 'n2_3sg_0.1-50.1eV_A2.inp.out' ()>
array(0.+0.j)

Versions

[37]:
import scooby
scooby.Report(additional=['epsproc', 'holoviews', 'hvplot', 'xarray', 'matplotlib', 'bokeh'])
[37]:
Wed Jun 29 18:27:52 2022 UTC
OS Linux CPU(s) 32 Machine x86_64 Architecture 64bit
RAM 50.1 GiB Environment Jupyter
Python 3.9.10 | packaged by conda-forge | (main, Feb 1 2022, 21:24:11) [GCC 9.4.0]
epsproc 1.3.2-dev holoviews 1.14.8 hvplot 0.8.0 xarray 2022.3.0
matplotlib 3.5.1 bokeh 2.4.2 numpy 1.21.5 scipy 1.8.0
IPython 8.1.1 scooby 0.5.12
[38]:
# Check current Git commit for local ePSproc version
from pathlib import Path
!git -C {Path(ep.__file__).parent} branch
!git -C {Path(ep.__file__).parent} log --format="%H" -n 1
* dev
  master
  numba-tests
  pkgUpdates
ede98a5ad1ab5240d31f30e7a1eebab43b2da810
[39]:
# Check current remote commits
!git ls-remote --heads https://github.com/phockett/ePSproc
ede98a5ad1ab5240d31f30e7a1eebab43b2da810        refs/heads/dev
54b929025381f1c2cbf371180fa870f247e7f627        refs/heads/master
69cd89ce5bc0ad6d465a4bd8df6fba15d3fd1aee        refs/heads/numba-tests
ea30878c842f09d525fbf39fa269fa2302a13b57        refs/heads/revert-9-master