Description
The following is from tensorflow/tensorflow#27510
I currently store my data in HDF5 files (.h5 or .hdf5), a file format frequently used in scientific research. (See #2089 for a similar but different request that makes the case for an HDF5 interface nicely.) The format is convenient and widely used; MATLAB, for example, uses HDF5 for large files. In my experience it is much more convenient to use than TensorFlow's TFRecord format.
However, TensorFlow has no native support for HDF5 files in the tf.data.Dataset API, which is supposed to be the new API for all data loading. Currently, I am using tf.py_function to load my data, for the simple reason that a tf.data Dataset always runs in graph mode and hence cannot give me the concrete values I need in order to read the files.
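For readers unfamiliar with this workaround, here is a minimal sketch of reading HDF5 rows through tf.py_function. It is an illustration under assumptions, not the code from the original report: the dataset key `features`, the helper names `write_example_file` and `make_hdf5_dataset`, and the float32 layout are all hypothetical. It also assumes TensorFlow 2.x with eager execution and an installed h5py.

```python
import h5py
import numpy as np
import tensorflow as tf


def write_example_file(path, num_samples=8, feature_dim=4):
    """Create a small HDF5 file with a hypothetical 'features' dataset."""
    data = np.arange(num_samples * feature_dim, dtype=np.float32)
    with h5py.File(path, "w") as f:
        f.create_dataset("features", data=data.reshape(num_samples, feature_dim))


def make_hdf5_dataset(path, num_samples, feature_dim, key="features"):
    """Build a tf.data pipeline that reads one row per element via h5py."""

    def _read_row(index):
        # This function runs eagerly inside tf.py_function, so the index
        # tensor has a concrete value that h5py can use.
        # Reopening the file for every element is simple but slow.
        with h5py.File(path, "r") as f:
            return f[key][int(index.numpy())]

    def _load(index):
        row = tf.py_function(_read_row, inp=[index], Tout=tf.float32)
        # tf.py_function drops static shape information, so restore it.
        row.set_shape([feature_dim])
        return row

    return tf.data.Dataset.range(num_samples).map(_load)
```

Because every element passes through the Python interpreter, this pattern does not benefit from graph-level I/O optimizations, which is consistent with the motivation for a native reader.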
Moreover, I have found that reading an HDF5 file in this way DRAMATICALLY slows down data I/O for unknown reasons. When I used the tf.keras.utils.Sequence API to read HDF5 files instead, without the supposed optimizations that TensorFlow is making, an operation that previously took hours took just a few seconds. (However, I suspect that using tf.defun somehow got tangled up with this. I am not sure why, but when I removed some lines the code sped up, although it was still much slower than even a single-threaded for loop.)
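As a rough illustration of the tf.keras.utils.Sequence approach, the sketch below serves batches straight from an HDF5 file with h5py. The class name `HDF5Sequence` and the dataset keys `x` and `y` are hypothetical, and the code makes no claim about the speed difference reported above.

```python
import math

import h5py
import numpy as np
import tensorflow as tf


class HDF5Sequence(tf.keras.utils.Sequence):
    """Serve (inputs, targets) batches from two datasets in one HDF5 file."""

    def __init__(self, path, batch_size, x_key="x", y_key="y"):
        super().__init__()
        self.path = path
        self.batch_size = batch_size
        self.x_key = x_key
        self.y_key = y_key
        # Read only the metadata here; the data stays on disk.
        with h5py.File(path, "r") as f:
            self.num_samples = f[x_key].shape[0]

    def __len__(self):
        # Number of batches per epoch; the last batch may be smaller.
        return math.ceil(self.num_samples / self.batch_size)

    def __getitem__(self, idx):
        start = idx * self.batch_size
        stop = min(start + self.batch_size, self.num_samples)
        # A contiguous slice is read from disk as NumPy arrays.
        with h5py.File(self.path, "r") as f:
            return f[self.x_key][start:stop], f[self.y_key][start:stop]
```

An instance can be indexed directly (`seq[0]`) or passed to `model.fit` in place of a dataset.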
Therefore, I would like to propose creating a new API in tf Dataset for HDF5 files. It could be called HDF5Dataset, similar to TFRecordDataset or CSVDataset.
Moreover, this would allow Tensorflow to make I/O optimizations for reading using the C++ API for HDF5 instead of h5py, which has many limitations and factors that newbie users might not be familiar with.
For example, most builds of h5py cannot do multiprocessing. Also, most people do not know how to chunk their data slices, though this can make a 5-fold difference in read/write speed.
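To make the chunking point concrete, here is a small h5py sketch that stores a dataset in chunks aligned with the batches that will later be read. The helper names, the `images` key, and the choice of 32 rows per chunk are assumptions for illustration; the actual speedup depends on the data, the storage medium, and the access pattern.

```python
import h5py
import numpy as np


def write_chunked(path, data, rows_per_chunk=32, key="images"):
    """Write `data` so that each chunk holds a block of whole rows."""
    chunk_shape = (min(rows_per_chunk, data.shape[0]),) + data.shape[1:]
    with h5py.File(path, "w") as f:
        f.create_dataset(key, data=data, chunks=chunk_shape)
    return chunk_shape


def read_batch(path, start, stop, key="images"):
    """Read rows [start, stop); fastest when the range lines up with chunks."""
    with h5py.File(path, "r") as f:
        return f[key][start:stop]


def stored_chunk_shape(path, key="images"):
    """Return the chunk shape recorded in the file, or None if contiguous."""
    with h5py.File(path, "r") as f:
        return f[key].chunks
```

Choosing a chunk shape that matches the read pattern (for example, one training batch per chunk) avoids reading and decoding chunks that are only partially needed.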
I believe that adding this API would make TensorFlow much friendlier to scientific computing.
Will this change the current api? How?
This would add a new Dataset type in tf.data.Dataset, or a new method/function for making a dataset from an HDF5 file of an arbitrary format. This may require some low-level integration with the HDF5 format.
Who will benefit with this feature?
People working in medical imaging, video, astronomy, or any other field with very large datasets, which are often stored in HDF5. See the HDF5 website for information on its utility. Also people who don't want to go through the difficult process of making TFRecord files.
Any Other info.
Perhaps integrating aspects of h5py will make the process easier.