Data structures - IO demo
28/06/22
This notebook outlines various IO methods for reading and writing data. For a general data structures overview, see the Data structures intro doc.
Note that this page details functional forms; for class usage, see the base class intro page. However, as of June 2022, file writers are not fully implemented for the data class.
Load libraries
[1]:
from pathlib import Path
import epsproc as ep
# Set data path
# Note this is set here from ep.__path__, but may not be correct in all cases - depends on where the Github repo is.
epDemoDataPath = Path(ep.__path__[0]).parent/'data'
OMP: Info #273: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
* sparse not found, sparse matrix forms not available.
* natsort not found, some sorting functions not available.
* Setting plotter defaults with epsproc.basicPlotters.setPlotters(). Run directly to modify, or change options in local env.
* Set Holoviews with bokeh.
* pyevtk not found, VTK export not available.
Loading ePolyScat data
To start with, load data from an ePolyScat output file. This will load matrix elements & cross-sections; for more details, see the basic demo page and the ePolyScat basics page.
[2]:
# Load data from modPath\data
dataPath = Path(epDemoDataPath, 'photoionization')
dataFile = Path(dataPath, 'n2_3sg_0.1-50.1eV_A2.inp.out') # Set for sample N2 data for testing
# Scan data file
dataSet = ep.readMatEle(fileIn = dataFile.as_posix())
data = dataSet[0]
# dataXS = ep.readMatEle(fileIn = dataFile.as_posix(), recordType = 'CrossSection') # XS info currently not set in NO2 sample file.
*** ePSproc readMatEle(): scanning files for DumpIdy segments.
*** Scanning file(s)
['/home/jovyan/github/epsproc/data/photoionization/n2_3sg_0.1-50.1eV_A2.inp.out']
*** FileListSort
Prefix: /home/jovyan/github/epsproc/data/photoionization/n2_3sg_0.1-50.1eV_A2.inp.out
1 groups.
*** Reading ePS output file: /home/jovyan/github/epsproc/data/photoionization/n2_3sg_0.1-50.1eV_A2.inp.out
*** IO.fileParse() found 1 segments with
Start: ScatEng
End: ['#'].
Expecting 51 energy points.
*** IO.fileParse() found 2 segments with
Start: ScatSym
End: ['FileName', '\n'].
Expecting 2 symmetries.
*** IO.fileParse() found 102 segments with
Start: DumpIdy - dump
End: ['+ Command', 'Time Now'].
Found 102 dumpIdy segments (sets of matrix elements).
Processing segments to Xarrays...
Processed 102 sets of DumpIdy file segments, (0 blank)
[3]:
# All data is pushed to Xarrays
data
[3]:
<xarray.DataArray 'n2_3sg_0.1-50.1eV_A2.inp.out' (LM: 18, Eke: 51, Sym: 2, mu: 3, it: 1, Type: 2)> array([[[[[[ nan +nanj, nan +nanj]], [[ nan +nanj, nan +nanj]], [[ nan +nanj, nan +nanj]]], [[[ nan +nanj, nan +nanj]], [[ nan +nanj, nan +nanj]], [[-1.7757076e+00+6.3474768e-01j, -1.9403462e+00+6.9465999e-01j]]]], ... [[[[ nan +nanj, nan +nanj]], [[ nan +nanj, nan +nanj]], [[ nan +nanj, nan +nanj]]], [[[ 8.9213389e-06-4.6971505e-06j, nan +nanj]], [[ nan +nanj, nan +nanj]], [[ nan +nanj, nan +nanj]]]]]]) Coordinates: * LM (LM) MultiIndex - l (LM) int64 1 1 1 3 3 3 5 5 5 7 7 7 9 9 9 11 11 11 - m (LM) int64 -1 0 1 -1 0 1 -1 0 1 -1 0 1 -1 0 1 -1 0 1 * mu (mu) int64 -1 0 1 * Type (Type) <U1 'L' 'V' * it (it) int64 1 * Sym (Sym) MultiIndex - Cont (Sym) object 'SU' 'PU' - Targ (Sym) object 'SG' 'SG' - Total (Sym) object 'SU' 'PU' * Eke (Eke) float64 0.1 1.1 2.1 3.1 4.1 5.1 ... 46.1 47.1 48.1 49.1 50.1 Ehv (Eke) float64 15.68 16.68 17.68 18.68 ... 62.68 63.68 64.68 65.68 SF (Eke) complex128 (2.1560627+3.741674j) ... (4.4127053+1.8281945j) Attributes: dataType: matE file: n2_3sg_0.1-50.1eV_A2.inp.out fileBase: /home/jovyan/github/epsproc/data/photoionization fileList: n2_3sg_0.1-50.1eV_A2.inp.out
Xarray methods
For Xarrays, these can be written to disk and read back with ep.IO.writeXarray() and ep.IO.readXarray().
These wrap various backends, including Xarray’s netCDF writer, and implement some additional complex number and dim handling for ePSproc cases (depending on the backend). In general, the file writers aim to preserve all dimensions, including stacking, if possible.
For other low-level data IO options, see the Xarray documentation. Higher level routines are currently in development for ePSproc & PEMtk, see further notes at https://github.com/phockett/ePSproc/issues/8 and https://github.com/phockett/PEMtk/issues/6.
Complex number and attribute handling
Note that native Xarray methods have some limitations…
Issues with complex data, which is not supported by netCDF - either convert to Re+Im format, or use the h5netcdf backend with forceComplex=True as a workaround (note this must also be set on file read).
Issues with nested dict attribs.
Issues with MultiIndex coords, esp. in non-dim coords.
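The Re + Im split used by the default writer can be sketched directly; the snippet below is a minimal illustration with a hypothetical array, not the ePSproc implementation itself.

```python
import numpy as np
import xarray as xr

# Hypothetical small complex array - netCDF can't store complex natively.
da = xr.DataArray(np.array([1+2j, 3-4j]), dims='x', name='matE')

# Write side: split to a Dataset with separate Re and Im variables,
# flagging the split in attrs (cf. the 'complex: split' attrib seen in the files below).
ds = xr.Dataset({'Re': da.real, 'Im': da.imag}, attrs={'complex': 'split'})

# Read side: recombine to complex form.
daRecon = (ds['Re'] + 1j * ds['Im']).rename('matE')
print(daRecon.equals(da))   # True
```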
With the h5netcdf engine and the invalid_netcdf option, case (1) is OK (see the Xarray docs for details), although the data still needs to be unstacked, and 'to_dataset' may also be needed for more control - otherwise arbitrarily named items can appear in the file (if the dataarray name is missing).
For general attrib handling, ep.IO.sanitizeAttrsNetCDF() attempts to clear this up at file IO if an exception is raised, although it may be lossy in some cases.
For non-Xarray file types, e.g. HDF5, some additional custom handling is included, mainly via dictionary conversion routines, which are discussed below.
TODO:
further work on this, and alternative file writers, see notes at https://github.com/phockett/ePSproc/issues/8 and https://github.com/phockett/PEMtk/issues/6.
more attrib testing.
NetCDF methods
For Xarray’s netCDF writer, valid engines (handlers/libraries/formats) are h5netcdf (preferred), netcdf4 and scipy. The default case uses engine = 'h5netcdf' and forceComplex = False - the output file will include separate real and imaginary components and a flat (unstacked) representation of the array.
TODO: test for multiple Xarray per file.
[4]:
dataPath = Path(epDemoDataPath, 'photoionization','fileIOtests')
dataFile = Path(dataPath, 'n2_3sg_0.1-50.1eV_A2') # Set for sample N2 data for testing
ep.IO.writeXarray(data, fileName = dataFile.as_posix(), filePath = dataPath.as_posix()) # Default case set as: engine = 'h5netcdf', forceComplex = False
['Written to h5netcdf format', '/home/jovyan/github/epsproc/data/photoionization/fileIOtests/n2_3sg_0.1-50.1eV_A2.nc']
[4]:
['Written to h5netcdf format',
'/home/jovyan/github/epsproc/data/photoionization/fileIOtests/n2_3sg_0.1-50.1eV_A2.nc']
[5]:
dataIn = ep.IO.readXarray(fileName = dataFile.as_posix() + '.nc', filePath = dataPath.as_posix()) #, forceComplex=forceComplex, forceArray=False)
[6]:
dataIn # Note dims are restacked, although ordering may change
[6]:
<xarray.DataArray (Eke: 51, mu: 3, it: 1, Type: 2, Sym: 4, LM: 18)> array([[[[[[ nan +nanj, nan +nanj, -1.7757076e+00+6.3474768e-01j, ..., nan +nanj, nan +nanj, nan +nanj], [ nan +nanj, nan +nanj, nan +nanj, ..., nan +nanj, nan +nanj, nan +nanj], [ nan +nanj, nan +nanj, nan +nanj, ..., nan +nanj, nan +nanj, nan +nanj], [ nan +nanj, nan +nanj, ... nan +nanj, nan +nanj], [ nan +nanj, nan +nanj, nan +nanj, ..., nan +nanj, nan +nanj, nan +nanj], [ nan +nanj, nan +nanj, nan +nanj, ..., nan +nanj, nan +nanj, nan +nanj], [ nan +nanj, nan +nanj, nan +nanj, ..., nan +nanj, nan +nanj, nan +nanj]]]]]]) Coordinates: Ehv (Eke) float64 15.68 16.68 17.68 18.68 ... 62.68 63.68 64.68 65.68 * Eke (Eke) float64 0.1 1.1 2.1 3.1 4.1 5.1 ... 46.1 47.1 48.1 49.1 50.1 * Type (Type) object 'L' 'V' * it (it) int64 1 * mu (mu) int64 -1 0 1 SF (Eke) complex128 (2.1560627+3.741674j) ... (4.4127053+1.8281945j) * Sym (Sym) MultiIndex - Cont (Sym) object 'PU' 'PU' 'SU' 'SU' - Targ (Sym) object 'SG' 'SG' 'SG' 'SG' - Total (Sym) object 'PU' 'SU' 'PU' 'SU' * LM (LM) MultiIndex - l (LM) int64 1 1 1 3 3 3 5 5 5 7 7 7 9 9 9 11 11 11 - m (LM) int64 -1 0 1 -1 0 1 -1 0 1 -1 0 1 -1 0 1 -1 0 1 Attributes: dataType: matE file: n2_3sg_0.1-50.1eV_A2.inp.out fileBase: /home/jovyan/github/epsproc/data/photoionization fileList: n2_3sg_0.1-50.1eV_A2.inp.out
[7]:
# Testing for equality with the original currently returns false - dim ordering might be the issue here?
dataIn.equals(data)
[7]:
False
[8]:
# Subtraction indicates identical data however.
(dataIn - data).max()
[8]:
<xarray.DataArray ()> array(0.+0.j)
Low-level Xarray routines
The Xarray native routines can be used directly, although they may require additional args and/or data reformatting.
Note that the netCDF writer always writes to Dataset format, although there are readers for both Datasets and DataArrays. Other available methods include Pickle, Zarr…
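As a sketch of the native route (the array, file name and temp dir here are illustrative): unstack any MultiIndex dims, split the complex data, then use the standard to_netcdf/open_dataset pair.

```python
import tempfile
from pathlib import Path

import numpy as np
import xarray as xr

# Hypothetical small array standing in for the matrix elements.
da = xr.DataArray(np.random.rand(3, 2) + 1j * np.random.rand(3, 2),
                  dims=['Eke', 'Type'], name='matE')

# Native write: netCDF needs real data, so split complex first.
ds = xr.Dataset({'Re': da.real, 'Im': da.imag})
fOut = Path(tempfile.mkdtemp(), 'test.nc')
ds.to_netcdf(fOut)   # default engine; pass engine='h5netcdf' if preferred

# Native read & recombine.
dsIn = xr.open_dataset(fOut)
daRecon = (dsIn['Re'] + 1j * dsIn['Im']).rename('matE')
print(np.allclose(daRecon, da))   # True
```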
[10]:
# Read file with base open_dataset
import xarray as xr
dataIn = xr.open_dataset(dataFile.as_posix() + '.nc', engine = 'h5netcdf')
dataIn
[10]:
<xarray.Dataset> Dimensions: (Cont: 2, Eke: 51, mu: 3, it: 1, Type: 2, l: 6, m: 3, Targ: 1, Total: 2) Coordinates: * Cont (Cont) object 'PU' 'SU' Ehv (Eke) float64 15.68 16.68 17.68 18.68 ... 62.68 63.68 64.68 65.68 * Eke (Eke) float64 0.1 1.1 2.1 3.1 4.1 5.1 ... 46.1 47.1 48.1 49.1 50.1 SFi (Eke) float64 3.742 3.628 3.524 3.428 ... 1.871 1.857 1.842 1.828 SFr (Eke) float64 2.156 2.224 2.289 2.353 ... 4.311 4.345 4.379 4.413 * Targ (Targ) object 'SG' * Total (Total) object 'PU' 'SU' * Type (Type) object 'L' 'V' * it (it) int64 1 * l (l) int64 1 3 5 7 9 11 * m (m) int64 -1 0 1 * mu (mu) int64 -1 0 1 Data variables: Im (Eke, mu, it, Type, l, m, Cont, Targ, Total) float64 nan ... nan Re (Eke, mu, it, Type, l, m, Cont, Targ, Total) float64 nan ... nan Attributes: complex: split dataType: matE file: n2_3sg_0.1-50.1eV_A2.inp.out fileBase: /home/jovyan/github/epsproc/data/photoionization fileList: n2_3sg_0.1-50.1eV_A2.inp.out
[11]:
# Read file with base open_dataarray - this may fail for datasets
dataIn = xr.open_dataarray(dataFile.as_posix() + '.nc', engine = 'h5netcdf')
dataIn
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [11], in <cell line: 2>()
1 # Read file with base open_dataarray - this may fail for datasets
----> 2 dataIn = xr.open_dataarray(dataFile.as_posix() + '.nc', engine = 'h5netcdf')
3 dataIn
File /opt/conda/lib/python3.9/site-packages/xarray/backends/api.py:670, in open_dataarray(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
652 dataset = open_dataset(
653 filename_or_obj,
654 decode_cf=decode_cf,
(...)
666 **kwargs,
667 )
669 if len(dataset.data_vars) != 1:
--> 670 raise ValueError(
671 "Given file dataset contains more than one data "
672 "variable. Please read with xarray.open_dataset and "
673 "then select the variable you want."
674 )
675 else:
676 (data_array,) = dataset.data_vars.values()
ValueError: Given file dataset contains more than one data variable. Please read with xarray.open_dataset and then select the variable you want.
HDF5
HDF5 support is available if the h5py library is present, and is wrapped for Xarray use with the routines in epsproc.ioBackends.hdf5IO. These handle most Python native and numpy data types. The routines also include string wrappers for unsupported data types (and/or they can be purged for saving). For more details see the dictionary conversion options section below (the Xarray objects are serialized to dictionaries prior to file IO).
The main benefits of HDF5 (vs. netCDF) are better type handling (esp. for complex data) and cross-compatibility with other languages. String wrapping is currently applied to all fields except the numerical data, ensuring handling of nested dictionaries. (See source for writeXarrayToHDF5 and sanitizeAttrsNetCDF.)
NOTE: string unwrapping is done via ast.literal_eval(), which supports: strings, bytes, numbers, tuples, lists, dicts, sets, booleans, and None. This should be safe in general (it cannot evaluate arbitrary code), but pass evalStrings = True at file read to avoid if required.
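The string wrap/unwrap idea can be sketched as follows - a minimal illustration of the general approach, not the exact ePSproc code:

```python
import ast

# Nested dict attribs of the kind that break netCDF/HDF5 writers.
attrs = {'dimMaps': {'dataDims': ('LM', 'Eke'), 'stacked': {'LM': ['l', 'm']}}}

# Write side: wrap unsupported values as their literal string form.
wrapped = {k: repr(v) for k, v in attrs.items()}

# Read side: recover with ast.literal_eval() - literals only, no arbitrary code.
unwrapped = {k: ast.literal_eval(v) for k, v in wrapped.items()}
print(unwrapped == attrs)   # True
```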
TODO:
more options here, nested dict unpacking to HDF5 as well as string wrap (see .
fix XR name, which currently appears as a byte string.
Test multiple items per file.
Test IO with, e.g., Matlab.
Other libs/tools for this to be investigated…
General discussion of dicts to HDF5: https://stackoverflow.com/questions/16494669/how-to-store-dictionary-in-hdf5-dataset
Native methods: serialise to JSON, wrap as string.
Either method should be easy to implement.
Maybe Benedict for dict handling? https://github.com/fabiocaccamo/python-benedict#io-methods
HDFdict for nested dicts to HDF5: https://pypi.org/project/hdfdict/ (small, dicts only)
Deepdish: https://deepdish.readthedocs.io/en/latest/io.html (general)
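The "serialise to JSON, wrap as string" option noted above can be sketched with h5py directly (file name and temp dir are illustrative):

```python
import json
import tempfile
from pathlib import Path

import h5py

# Nested metadata dict to store alongside numerical data.
meta = {'dataType': 'matE', 'dims': {'LM': ['l', 'm']}}

fOut = Path(tempfile.mkdtemp(), 'meta_test.h5')

# Write side: JSON-encode the dict and store as a single string dataset.
with h5py.File(fOut, 'w') as hf:
    hf.create_dataset('attrs', data=json.dumps(meta))

# Read side: pull the string back out and decode.
with h5py.File(fOut, 'r') as hf:
    metaIn = json.loads(hf['attrs'][()])

print(metaIn == meta)   # True
```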
[12]:
dataPath = Path(epDemoDataPath, 'photoionization','fileIOtests')
dataFile = Path(dataPath, 'n2_3sg_0.1-50.1eV_A2.h5') # Set for sample N2 data for testing
ep.IO.writeXarray(data, fileName = dataFile.as_posix(), filePath = dataPath.as_posix(), engine = 'hdf5') # Default case set as: engine = 'h5netcdf', forceComplex = False
[12]:
PosixPath('/home/jovyan/github/epsproc/data/photoionization/fileIOtests/n2_3sg_0.1-50.1eV_A2.h5')
[13]:
dataPath = Path(epDemoDataPath, 'photoionization','fileIOtests')
dataFile = Path(dataPath, 'n2_3sg_0.1-50.1eV_A2.h5') # Set for sample N2 data for testing
# HDF5 file reader, note this returns both dict and Xarray formats
dataInDict, dataInXR = ep.IO.readXarray(fileName = dataFile.as_posix(), filePath = dataPath.as_posix(), engine = 'hdf5')
*** Opened file /home/jovyan/github/epsproc/data/photoionization/fileIOtests/n2_3sg_0.1-50.1eV_A2.h5
Found top-level group(s) ['matE']
Reading group matE
[14]:
# Reconstructed array. Note this also now includes self.attrs['dimMaps'], which was added by the file writer to track dimensions.
dataInXR
[14]:
<xarray.DataArray 'n2_3sg_0.1-50.1eV_A2.inp.out' (Eke: 51, mu: 3, it: 1, Type: 2, Sym: 2, LM: 18)> array([[[[[[ nan +nanj, nan +nanj, -1.7757076e+00+6.3474768e-01j, ..., nan +nanj, nan +nanj, nan +nanj], [ nan +nanj, nan +nanj, nan +nanj, ..., nan +nanj, nan +nanj, nan +nanj]], [[ nan +nanj, nan +nanj, -1.9403462e+00+6.9465999e-01j, ..., nan +nanj, nan +nanj, nan +nanj], [ nan +nanj, ... nan +nanj], [ nan +nanj, nan +nanj, nan +nanj, ..., nan +nanj, nan +nanj, nan +nanj]], [[-4.1606509e-01-6.4490113e-01j, nan +nanj, nan +nanj, ..., 5.3667894e-06-2.4842580e-06j, nan +nanj, nan +nanj], [ nan +nanj, nan +nanj, nan +nanj, ..., nan +nanj, nan +nanj, nan +nanj]]]]]]) Coordinates: * mu (mu) int64 -1 0 1 * Type (Type) <U1 'L' 'V' * it (it) int64 1 * Eke (Eke) float64 0.1 1.1 2.1 3.1 4.1 5.1 ... 46.1 47.1 48.1 49.1 50.1 Ehv (Eke) float64 15.68 16.68 17.68 18.68 ... 62.68 63.68 64.68 65.68 SF (Eke) complex128 (2.1560627+3.741674j) ... (4.4127053+1.8281945j) * Sym (Sym) MultiIndex - Cont (Sym) object 'PU' 'SU' - Targ (Sym) object 'SG' 'SG' - Total (Sym) object 'PU' 'SU' * LM (LM) MultiIndex - l (LM) int64 1 1 1 3 3 3 5 5 5 7 7 7 9 9 9 11 11 11 - m (LM) int64 -1 0 1 -1 0 1 -1 0 1 -1 0 1 -1 0 1 -1 0 1 Attributes: dataType: matE file: n2_3sg_0.1-50.1eV_A2.inp.out fileBase: /home/jovyan/github/epsproc/data/photoionization fileList: n2_3sg_0.1-50.1eV_A2.inp.out dimMaps: {'dataDims': ('LM', 'Eke', 'Sym', 'mu', 'it', 'Type'), 'dataDi...
[15]:
# data.attrs['dimMaps'] = dataInXR.attrs['dimMaps']
print(dataInXR.equals(data)) # False...
(dataInXR - data).max() # But data seems OK
False
[15]:
<xarray.DataArray 'n2_3sg_0.1-50.1eV_A2.inp.out' ()> array(0.+0.j)
[16]:
# Additional data can be appended to the file
# Note that the top-level group defaults to data.dataType, so a different name may be required for this
ep.IO.writeXarray(data, fileName = dataFile.as_posix(), filePath = dataPath.as_posix(), engine = 'hdf5', dataName = 'matE2')
[16]:
PosixPath('/home/jovyan/github/epsproc/data/photoionization/fileIOtests/n2_3sg_0.1-50.1eV_A2.h5')
Low-level HDF5 IO
Native HDF5 file IO can also be used (and the files should be readable by any standard HDF5 tools); the files contain a dictionary-like representation of the data structure. For more details on h5py see the docs.
[17]:
import h5py
dataPath = Path(epDemoDataPath, 'photoionization','fileIOtests')
dataFile = Path(dataPath, 'n2_3sg_0.1-50.1eV_A2.h5')
# Open file with h5py
fIn = Path(dataPath,dataFile)
hf = h5py.File(fIn.as_posix(), 'r')
[18]:
# Top-level groups
hf.keys()
[18]:
<KeysViewHDF5 ['matE', 'matE2']>
[19]:
# Group items
hf['matE'].keys()
[19]:
<KeysViewHDF5 ['attrs', 'coords', 'data', 'dims', 'name']>
[20]:
# Get data from a subgroup
hf['matE']['name'][()]
[20]:
b'n2_3sg_0.1-50.1eV_A2.inp.out'
[21]:
# Get data from a subgroup
hf['matE']['dims'][()]
[21]:
b"('Eke', 'mu', 'it', 'Type', 'l', 'm', 'Cont', 'Targ', 'Total')"
[22]:
# Close file stream
hf.close()
Numpy and Pandas
Conversion to other classes can be used to access their native IO routines. Note, however, that this may be lossy, depending on the type of object converted.
Pandas
Convert to a Pandas DataFrame with ep.multiDimXrToPD(), then use any standard Pandas IO functionality.
In general this should work for all cases (including complex data), but may lose metadata/attributes.
For general data sharing, the to_hdf and to_csv methods are quite useful. Note that Pandas to_hdf uses PyTables on the backend.
TODO: wrapper for attributes & restacking? See PEMtk methods.
[23]:
# Convert to Pandas dataframe & save to HDF5
dataFile = Path(dataPath, 'n2_3sg_0.1-50.1eV_A2_pd.h5')
pdDF, *_ = ep.multiDimXrToPD(data, colDims = 'Eke')
# Pandas .to_hdf - note this also needs a key set, which will be used as the top-level node in the output file.
# Similarly to the above HDF5 case, multiple datasets can be written with different group names (keys).
pdDF.to_hdf(dataFile, key = 'N2')
pdDF.to_hdf(dataFile, key = 'N2v2')
# Pandas to_csv
pdDF.to_csv(dataFile.with_suffix('.csv'))
[24]:
# Read back
import pandas as pd
pdIn = pd.read_hdf(dataFile, key = 'N2')
[25]:
# Convert back to Xarray if desired
import xarray as xr
xrNew = xr.DataArray(pdIn) # This will produce a flat array
xrRecon, *_ = ep.util.misc.restack(xrNew, refDims = 'matE') # Restack to specified dataType
xrRecon
[25]:
<xarray.DataArray (Eke: 51, Type: 2, it: 1, mu: 3, Sym: 4, LM: 18)> array([[[[[[ nan +nanj, nan +nanj, -1.7757076e+00+6.3474768e-01j, ..., nan +nanj, nan +nanj, nan +nanj], [ nan +nanj, nan +nanj, nan +nanj, ..., nan +nanj, nan +nanj, nan +nanj], [ nan +nanj, nan +nanj, nan +nanj, ..., nan +nanj, nan +nanj, nan +nanj], [ nan +nanj, nan +nanj, ... nan +nanj, nan +nanj], [ nan +nanj, nan +nanj, nan +nanj, ..., nan +nanj, nan +nanj, nan +nanj], [ nan +nanj, nan +nanj, nan +nanj, ..., nan +nanj, nan +nanj, nan +nanj], [ nan +nanj, nan +nanj, nan +nanj, ..., nan +nanj, nan +nanj, nan +nanj]]]]]]) Coordinates: * Eke (Eke) float64 0.1 1.1 2.1 3.1 4.1 5.1 ... 46.1 47.1 48.1 49.1 50.1 * Type (Type) object 'L' 'V' * it (it) int64 1 * mu (mu) int64 -1 0 1 * Sym (Sym) MultiIndex - Cont (Sym) object 'PU' 'PU' 'SU' 'SU' - Targ (Sym) object 'SG' 'SG' 'SG' 'SG' - Total (Sym) object 'PU' 'SU' 'PU' 'SU' * LM (LM) MultiIndex - l (LM) int64 1 1 1 3 3 3 5 5 5 7 7 7 9 9 9 11 11 11 - m (LM) int64 -1 0 1 -1 0 1 -1 0 1 -1 0 1 -1 0 1 -1 0 1
[26]:
print(xrRecon.equals(data)) # False...
(xrRecon - data).max() # But data seems OK
False
[26]:
<xarray.DataArray ()> array(0.+0.j)
[27]:
# Check file with h5py
hf = h5py.File(dataFile.as_posix(), 'r')
# Top-level groups
print(hf.keys())
hf.close()
<KeysViewHDF5 ['N2', 'N2v2']>
Numpy
Raw data can be saved directly as numpy arrays, see the numpy docs for options and methods. Note, however, that no metadata will be saved in this case, just the raw arrays.
TODO: wrappers for coords, and metadata/side-car files?
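A minimal side-car pattern might look like the following - illustrative only, with assumed file names and structure rather than an existing ePSproc routine:

```python
import json
import tempfile
from pathlib import Path

import numpy as np
import xarray as xr

# Small hypothetical array with coords & attrs.
da = xr.DataArray(np.arange(6.).reshape(2, 3),
                  coords={'Eke': [0.1, 1.1], 'mu': [-1, 0, 1]},
                  dims=['Eke', 'mu'], attrs={'dataType': 'matE'})

fDir = Path(tempfile.mkdtemp())

# Raw array to .npy; coords, dims & attrs to a JSON side-car file.
np.save(fDir/'raw.npy', da.to_numpy())
sidecar = {'dims': list(da.dims),
           'coords': {k: da.coords[k].values.tolist() for k in da.coords},
           'attrs': da.attrs}
with open(fDir/'raw.json', 'w') as f:
    json.dump(sidecar, f)

# Reconstruct from the pair of files.
with open(fDir/'raw.json') as f:
    meta = json.load(f)
daIn = xr.DataArray(np.load(fDir/'raw.npy'), dims=meta['dims'],
                    coords=meta['coords'], attrs=meta['attrs'])
print(daIn.equals(da))   # True
```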
[28]:
import numpy as np
dataFile = Path(dataPath, 'n2_3sg_0.1-50.1eV_A2.npy')
# For a Numpy ND array from Xarray, use self.to_numpy() or self.data
# Note this is the main array contents only, coords and attrs are discarded
np.save(dataFile, data.to_numpy())
[29]:
# Load and check data
npIn = np.load(dataFile)
print(npIn.dtype, npIn.shape)
(data-npIn).max()
complex128 (18, 51, 2, 3, 1, 2)
[29]:
<xarray.DataArray 'n2_3sg_0.1-50.1eV_A2.inp.out' ()> array(0.+0.j)
Dictionary methods
For more general/basic IO, Xarray objects can be serialized to native types (this is, in fact, already done on the backend for the netCDF and HDF5 cases).
Some general routines are provided for this, which wrap Xarray’s native .to_dict() method with some additional functionality. In particular, full serialization of all coordinates is performed (including non-dimensional coords), and some additional wrappers/converters are implemented for problematic data types.
TODO: better wrapping/implementation of the sanitizeAttrsNetCDF() routine, which should be generalised.
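The native method these routines wrap can be seen in a simple (non-MultiIndex) round trip; the ePSproc versions add handling for stacked dims and non-dim coords on top of this.

```python
import numpy as np
import xarray as xr

# Simple array - no MultiIndex, so the native methods suffice here.
da = xr.DataArray(np.arange(4.), coords={'Eke': [0.1, 1.1, 2.1, 3.1]},
                  dims='Eke', attrs={'dataType': 'matE'})

d = da.to_dict()                    # nested dict: data, coords, dims & attrs
daRecon = xr.DataArray.from_dict(d)
print(daRecon.equals(da))   # True
```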
[30]:
# Full dictionary conversion with deconstructDims
dataDict = ep.util.misc.deconstructDims(data, returnType='dict')
# Deconstruction + split complex data types to Re + Im components
dataDictSplit = ep.util.misc.deconstructDims(data, returnType='dict', splitComplex=True)
# Deconstruction + split complex + clean up attrs for safe types only
dataCleanAttrs, fullAttrs, log = ep.util.xrIO.sanitizeAttrsNetCDF(data)
dataCleanDictSplit = ep.util.misc.deconstructDims(dataCleanAttrs, returnType='dict', splitComplex=True)
[31]:
# Convert back to XR including restacking dimensions
xrRecon = ep.util.misc.reconstructDims(dataDict)
xrRecon
[31]:
<xarray.DataArray 'n2_3sg_0.1-50.1eV_A2.inp.out' (Eke: 51, mu: 3, it: 1, Type: 2, Sym: 2, LM: 18)> array([[[[[[ nan +nanj, nan +nanj, -1.7757076e+00+6.3474768e-01j, ..., nan +nanj, nan +nanj, nan +nanj], [ nan +nanj, nan +nanj, nan +nanj, ..., nan +nanj, nan +nanj, nan +nanj]], [[ nan +nanj, nan +nanj, -1.9403462e+00+6.9465999e-01j, ..., nan +nanj, nan +nanj, nan +nanj], [ nan +nanj, ... nan +nanj], [ nan +nanj, nan +nanj, nan +nanj, ..., nan +nanj, nan +nanj, nan +nanj]], [[-4.1606509e-01-6.4490113e-01j, nan +nanj, nan +nanj, ..., 5.3667894e-06-2.4842580e-06j, nan +nanj, nan +nanj], [ nan +nanj, nan +nanj, nan +nanj, ..., nan +nanj, nan +nanj, nan +nanj]]]]]]) Coordinates: * mu (mu) int64 -1 0 1 * Type (Type) <U1 'L' 'V' * it (it) int64 1 * Eke (Eke) float64 0.1 1.1 2.1 3.1 4.1 5.1 ... 46.1 47.1 48.1 49.1 50.1 Ehv (Eke) float64 15.68 16.68 17.68 18.68 ... 62.68 63.68 64.68 65.68 SF (Eke) complex128 (2.1560627+3.741674j) ... (4.4127053+1.8281945j) * Sym (Sym) MultiIndex - Cont (Sym) object 'PU' 'SU' - Targ (Sym) object 'SG' 'SG' - Total (Sym) object 'PU' 'SU' * LM (LM) MultiIndex - l (LM) int64 1 1 1 3 3 3 5 5 5 7 7 7 9 9 9 11 11 11 - m (LM) int64 -1 0 1 -1 0 1 -1 0 1 -1 0 1 -1 0 1 -1 0 1 Attributes: dataType: matE file: n2_3sg_0.1-50.1eV_A2.inp.out fileBase: /home/jovyan/github/epsproc/data/photoionization fileList: n2_3sg_0.1-50.1eV_A2.inp.out dimMaps: {'dataDims': ('LM', 'Eke', 'Sym', 'mu', 'it', 'Type'), 'dataDi...
[32]:
# Dictionaries can be handled as usual, e.g. dump to Pickle, JSON
import json
# JSON dump to file example
with open(Path(dataPath, 'n2_3sg_0.1-50.1eV_A2_dict.json'), 'w') as f:
# json.dumps(dataDict) # Fails for complex - not supported by JSON
# OK with split complex to re + im format
# Note this may still fail in some cases if any unsupported data types are in data.attrs.
try:
json.dump(dataDictSplit, f)
print('JSON OK')
except:
print('JSON failed, trying with clean attrs')
json.dump(dataCleanDictSplit, f) # Try with cleaned-up attrs
print('JSON with cleaned attrs OK')
# General JSON complex handling, see notes elsewhere...?
# E.g. https://stackoverflow.com/questions/27909658/json-encoder-and-decoder-for-complex-numpy-arrays
JSON OK
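A general complex-safe JSON round trip (cf. the Stack Overflow link in the cell above) can be sketched with a custom encoder; this is illustrative, not part of ePSproc:

```python
import json

class ComplexEncoder(json.JSONEncoder):
    """Encode complex values as {'re': ..., 'im': ...} pairs."""
    def default(self, obj):
        if isinstance(obj, complex):
            return {'re': obj.real, 'im': obj.imag}
        return super().default(obj)

def asComplex(d):
    """Object hook to rebuild complex values on load."""
    if set(d) == {'re', 'im'}:
        return complex(d['re'], d['im'])
    return d

s = json.dumps({'SF': 2.156 + 3.742j}, cls=ComplexEncoder)
dIn = json.loads(s, object_hook=asComplex)
print(dIn['SF'] == 2.156 + 3.742j)   # True
```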
[33]:
# The sanitizeAttrsNetCDF log notes any changes made
log
[33]:
{}
Pickle
As usual, Pickle can be used as a general method, but the usual caveats apply: it requires the relevant data structures and classes to be available on read-in, so it can break between library versions, and should not be regarded as archival.
Note, however, that Pickle is the fastest way to save many objects, including complex class objects containing multiple dataarrays. (Other methods to be implemented soon.)
[34]:
import pickle
# Save a dictionary
with open(Path(dataPath, 'n2_3sg_0.1-50.1eV_A2_dict.pickle'), 'wb') as handle:
pickle.dump(dataDict, handle, protocol=pickle.HIGHEST_PROTOCOL)
# Save an Xarray
with open(Path(dataPath, 'n2_3sg_0.1-50.1eV_A2_XR.pickle'), 'wb') as handle:
pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)
[35]:
# Read back in to test
with open(Path(dataPath, 'n2_3sg_0.1-50.1eV_A2_dict.pickle'), 'rb') as handle:
dataDictPklIn = pickle.load(handle)
print(dataDict==dataDictPklIn) # This is False...
data.equals(ep.util.misc.reconstructDims(dataDictPklIn)) # Also False
(ep.util.misc.reconstructDims(dataDictPklIn) - data).max() # But data looks OK. Dim ordering and/or attrs are different?
False
[35]:
<xarray.DataArray 'n2_3sg_0.1-50.1eV_A2.inp.out' ()> array(0.+0.j)
[36]:
# Read back in to test
with open(Path(dataPath, 'n2_3sg_0.1-50.1eV_A2_XR.pickle'), 'rb') as handle:
dataDictPklIn = pickle.load(handle)
print(data.equals(dataDictPklIn)) # This is True
(dataDictPklIn - data).max() # And data looks OK.
True
[36]:
<xarray.DataArray 'n2_3sg_0.1-50.1eV_A2.inp.out' ()> array(0.+0.j)
Versions
[37]:
import scooby
scooby.Report(additional=['epsproc', 'holoviews', 'hvplot', 'xarray', 'matplotlib', 'bokeh'])
[37]:
Wed Jun 29 18:27:52 2022 UTC
OS: Linux | CPU(s): 32 | Machine: x86_64 | Architecture: 64bit
RAM: 50.1 GiB | Environment: Jupyter
Python 3.9.10 | packaged by conda-forge | (main, Feb 1 2022, 21:24:11) [GCC 9.4.0]
epsproc: 1.3.2-dev | holoviews: 1.14.8 | hvplot: 0.8.0 | xarray: 2022.3.0
matplotlib: 3.5.1 | bokeh: 2.4.2 | numpy: 1.21.5 | scipy: 1.8.0
IPython: 8.1.1 | scooby: 0.5.12
[38]:
# Check current Git commit for local ePSproc version
from pathlib import Path
!git -C {Path(ep.__file__).parent} branch
!git -C {Path(ep.__file__).parent} log --format="%H" -n 1
* dev
master
numba-tests
pkgUpdates
ede98a5ad1ab5240d31f30e7a1eebab43b2da810
[39]:
# Check current remote commits
!git ls-remote --heads https://github.com/phockett/ePSproc
ede98a5ad1ab5240d31f30e7a1eebab43b2da810 refs/heads/dev
54b929025381f1c2cbf371180fa870f247e7f627 refs/heads/master
69cd89ce5bc0ad6d465a4bd8df6fba15d3fd1aee refs/heads/numba-tests
ea30878c842f09d525fbf39fa269fa2302a13b57 refs/heads/revert-9-master