Switching Dataset Formats

During the last week at Karl’s autonomous RC car lab, we made significant progress in fixing the slow training speed and memory leaks. Essentially, there were two main problems. One problem was that the internal structure of the dataset within our HDF5 files was much too complicated. The second was that our autonomous driving dataset is simply too large and complex to do any substantial on-the-fly data processing during training.

Improving HDF5 Layout

To address the first problem, Karl created a new, somewhat simplified layout inside the HDF5 files containing the dataset, which made random access significantly more efficient. Previously, random access within the dataset required multiple dictionary lookups, which are significantly slower than indexing into an array. All of this was flattened out into a single, large, multi-dimensional array that contained all of the data indexed with integers.


The second issue had only one answer: pre-processing. We finally created pre-processing code that worked. This required a pre-processing pipeline made up of multiple stages. I won’t go through all of them here, but I’ll discuss the crucial part.

We made a complete pass through the dataset to make it fully ready for training. In the old system, we had a couple hundred HDF5 files called “runs.” Each run is a set of data collected in an uninterrupted timeline. However, the entire timeline wasn’t necessarily good for training. When the car was picked up or not moving, data was still recorded in the run, but it would not have been useful to train on this. Instead, we had a system of converting runs into “segments” on the fly. Each segment is a set of data collected in an uninterrupted timeline, and consisting entirely of usable data. In this new system, when we passed through the data during pre-processing, we broke up each run HDF5 file into several segment HDF5 files, each containing a continuous stream of trainable data. Any data that wasn’t trainable was discarded. We ended up with clean, compact files that each contained only continuous, usable data.


One thought on “Switching Dataset Formats

  1. Also made some additional changes to shuffle individual runs and segments and then go through segments sequentially. This ended up speeding things up a lot. Also because of this we also open and close hdf5 files as we move along in the data so memory doesn’t accumulate only one segment is loaded to memory at a time.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s