Research and Logistics Update

For the past week, I’ve been running the same experiments I outlined before. It turned out that I had an error in my code that only showed up after a full epoch, which takes 12 hours. As a result, I had to restart training from scratch. I need to train these networks for 20 or more epochs, so I won’t be getting full results any time soon.

In the meantime, I’m working on the manuscript for the paper on this work. The paper submission deadline I’m working towards is at the end of October, so I’m getting going on the manuscript now so that when experiment results come in, I’ll already have some of the paper writing done.

From a logistics perspective, I also managed to get one of our training machines into a Berkeley data center. Now, the machine is properly cooled and ventilated, has 1 Gbps download speeds, and has proper UPS backup. I’m hoping to start getting the rest of the machines into that data center as well since we continue to have unnecessary outages and other issues with the machines.


HDF5 Memory Leak

For the past month or so, my colleague Sauhaarda and I have been trying to solve a strange problem in our training codebase. The longer we train our network, the more RAM it uses, even though there’s no in-scope variable that’s getting accumulated in our own code. After lots of memory profiling, we figured out that the issue wasn’t in our own code, but in the Python library h5py. This is the library that we use to read our data files. So why was it leaking and what was the solution?


It turns out that the underlying HDF5 library libhdf5 and the Python bindings h5py don’t properly clean up as they accumulate accesses to our data files. I haven’t yet tracked down whether this is an actual bug in the library or if it’s caused by the way the data files are organized. As a result, as training continues, RAM usage increases more and more until the entire program crashes because it runs out of memory.
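One way to confirm this kind of growth from outside the library is to record the process's peak memory after each batch of reads. The following is a minimal sketch using only Python's standard `resource` module (Unix only). The helper names `peak_rss_mb` and `track_growth` are made up for illustration, and `step` stands in for whatever reads from the data files; this is not the profiling setup we actually used.

```python
import resource
import sys


def peak_rss_mb():
    """Return this process's peak resident set size in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def track_growth(step, n_steps):
    """Call step() n_steps times and record the peak RSS after each call."""
    readings = []
    for _ in range(n_steps):
        step()  # hypothetical: one batch of reads from the data files
        readings.append(peak_rss_mb())
    return readings
```

Because this measures a peak, the readings never go down. A healthy loop should level off after the first few steps, while a leak shows up as readings that keep climbing.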


I initially went with pre-processing to try to solve this problem, which is where that post on pre-processing came from. However, in practice, I was never able to get it to work without having some sort of bug that I couldn’t track down. As a result, my colleague came up with a stop-gap measure.


While pre-processing did speed up the training as I explained in the post linked above, there was some bug in the code that I couldn’t find quickly enough. Somehow, the validation loss was coming out 2 orders of magnitude worse than without pre-processing. Since I couldn’t track down the bug, I ended up shelving this method for a time when I don’t have a paper deadline coming up.

Stop-gap Script

For a few days, Sauhaarda and I tried to use the code without pre-processing and just hope that the RAM wouldn’t run out, but that quickly became impossible as we started training more and more experiments. Sauhaarda then came up with a stop-gap measure in which the training code runs only for one epoch each time it’s called, and a Bash script calls the training program in a loop. This way, the memory usage is reset after each epoch. As long as the RAM doesn’t run out before even one epoch can complete, our code runs. This is the method that I’m using to train the experiments I wrote about yesterday.
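The idea can be sketched in a few lines. The real script is written in Bash; this is a Python equivalent using `subprocess`, shown only to illustrate the pattern. The `--epoch` flag is a hypothetical argument, and the sketch assumes the training program saves a checkpoint at the end of each epoch and reloads it at the start of the next.

```python
import subprocess


def run_epochs(command, n_epochs):
    """Run `command` once per epoch, each time in a fresh process."""
    return_codes = []
    for epoch in range(n_epochs):
        # Memory held by the child is returned to the OS when it exits,
        # so any leak is bounded by what a single epoch accumulates.
        result = subprocess.run(command + ["--epoch", str(epoch)])
        return_codes.append(result.returncode)
        if result.returncode != 0:
            break  # stop instead of training on top of a failed epoch
    return return_codes
```

Calling something like `run_epochs(["python", "train.py"], 20)` would then train for 20 epochs, where `train.py` is a placeholder for the actual training program.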


Although the memory issue bothers me a lot, I’ve resigned myself to using the stop-gap script for now so that research work can continue without being bogged down by the bug. I will eventually fix it after my paper deadline passes and I have some time to spend on it. For now, I will go with what my professor told me regarding this problem: “Do what works.”

SqueezeNet Research Status

As you may know from my first blog post, my research work is on autonomous driving with SqueezeNet. Despite the post a few days ago about speeding up the model training, I haven’t had much success applying the pre-processing and multiprocessing approaches to this problem. It turns out that the HDF5 data storage format that we’re using at my lab to store our training data doesn’t easily support concurrent access from multiple processes. For now, since I need to get some models trained, I’m training them using our old training code that takes 12 hours per epoch.


I’m currently running 5 experiments: 1 baseline, 2 to explore the ability of SqueezeNet to learn to drive and 2 on modified versions of SqueezeNet that have memory.


Z2Color is the baseline model that Karl, my professor, has been researching for over a year. This network has done all of our autonomous driving so far, and Karl hadn’t been able to find a better one when he first hired me. Now, of course, I’m working on creating several networks that I hope will be better.

There are two main problems with this network. The first is that the maximum receptive field size (the effective vision size of the network) is smaller than the image, so it requires fully connected layers at the end of the network that are not spatially invariant. The other big issue is that the network has no memory, so if the car gets stuck in a position where the camera can’t help it get out, it has no memory of where it was before.
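To make the receptive field idea concrete, here is a minimal sketch of the standard calculation for a plain stack of convolution and pooling layers. It ignores padding and dilation, and the layer lists in the usage note are made-up examples, not the actual Z2Color or SqueezeNet configurations.

```python
def receptive_field(layers):
    """Maximum receptive field, in input pixels, of a stack of layers.

    `layers` is a sequence of (kernel_size, stride) pairs, first layer first.
    """
    field = 1  # a single input pixel sees only itself
    jump = 1   # distance in input pixels between adjacent outputs
    for kernel_size, stride in layers:
        # Each layer widens the field by (kernel_size - 1) steps of the
        # current jump, then multiplies the jump by its own stride.
        field += (kernel_size - 1) * jump
        jump *= stride
    return field
```

For example, two 3x3 layers with stride 1 give `receptive_field([(3, 1), (3, 1)])`, which is 5, while making the first layer stride 2 gives 7. Comparing this number against the input image size tells you whether the last convolutional layer can see the whole image.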


SqueezeNet is my primary means of fixing the first problem with Z2Color. Its receptive field size is bigger than the input camera image size, so it’s able to see everything at once even though there are no fully connected layers. This saves on training computation and allows me to use that computation in other parts of the network.


AlexNet is what inspired the creators of SqueezeNet. They wanted to have AlexNet-level performance on image classification with a much more efficient network. Of course, I’m not doing image classification, so I’m training both SqueezeNet and AlexNet to drive so that I can compare them against each other.

SqueezeNet with LSTM

LSTMs, a form of neural network memory, should help SqueezeNet remember what its past actions were when it’s trying to make a decision for future actions. If this experiment ends up performing very well, this will solve problem #2 with Z2Color.

SqueezeNet with GRU

GRUs are a newer, simpler form of neural network memory that have been shown to perform as well as LSTMs. Thus, I figured I should try them as well, since I’m inexperienced with LSTMs and GRUs and don’t know which one will work better here.


Overall, I have several exciting experiments running that will greatly improve our understanding of SqueezeNet.  I will post an update on these experiments when I start seeing some results. Unfortunately, the training is very slow so it may take a week or so.

Docker for Deep Learning

Recently, I’ve started using Docker to manage the development environment required for deep learning. Docker allows me to decide exactly which versions of which software are installed, and how. It’s very useful to be able to script the development environment and easily switch between different environments. For example, in my lab, the codebase is currently in Python 2, but we’re about to change it to Python 3. This change would normally require changing over the host machine’s environment to Python 3, but using Docker, I can just update the Dockerfile to say I want Python 3. Allow me to explain by example.

Using a Docker Image

Suppose we have a machine with Ubuntu 16.04, but the code we’re trying to run needs the software package versions of an Ubuntu 14.04 distribution. Rather than installing Ubuntu 14.04 just to run our code, we can run it inside a Docker container that has Ubuntu 14.04 packages. Let’s say our code can be run with the command my_code. Then to run it inside an Ubuntu 14.04 Docker container, we can run it like this:

docker pull ubuntu:14.04
docker run -it ubuntu:14.04 my_code

The first line will pull the latest Ubuntu 14.04 image from Docker Hub, an online repository of Docker images that have already been built from Dockerfiles. The second line will launch a Docker container from the Ubuntu 14.04 image and execute our code in it. When our code executes inside a Docker container, it still has access to the host machine’s CPUs and RAM as it normally would outside the container. However, the filesystem that our code sees is the Docker image filesystem, not our host machine filesystem.

But what if we want to access files in our host machine’s filesystem? We can run the following snippet instead:

docker run -v /path/to/my/data:/foo -w /foo -it ubuntu:14.04 my_code

The -v argument binds the host machine’s /path/to/my/data to /foo in the Docker container. The -w argument sets the working directory of the Docker container to /foo.

Making a Docker Image

Now, suppose that we aren’t happy with the Ubuntu 14.04 Docker image that’s provided on Docker Hub and we want to make our own modified version that adds the libjpeg-dev package, for example. How do we do that? First, we create our own file named Dockerfile:

mkdir ubuntu-custom
cd ubuntu-custom
touch Dockerfile

Then, we open our Dockerfile in an editor and write this:

FROM ubuntu:14.04

RUN apt-get update && apt-get install -y --no-install-recommends \
libjpeg-dev && \
rm -rf /var/lib/apt/lists/*

This tells Docker that we want to inherit from the Ubuntu 14.04 image and then install libjpeg-dev. We clean up the Apt package lists after installing libjpeg-dev so that we aren’t packaging unnecessary cached files into our new Docker image.

Finally, to build a new image from our Dockerfile, we run the following:

docker build -t ubuntu-custom .

Now, we can execute our code in our new image with:

docker run -it ubuntu-custom my_code

Using Docker with GPU Training

One caveat of Docker is that it cannot easily access GPUs belonging to the host machine. To make it easier, Nvidia has created nvidia-docker, a modified version of Docker that does support GPUs. Its usage is identical to Docker’s.

My Docker Hub Images

For my own convenience, I’ve uploaded my own Docker images for deep learning to Docker Hub. tpankaj/dl-pytorch is built for PyTorch users and tpankaj/dl-keras is built for Keras (TensorFlow backend) users. Both use the latest stable version of their respective libraries. Here is an example of how to use them:

nvidia-docker pull tpankaj/dl-keras
nvidia-docker run -it tpankaj/dl-keras python


I just started using Docker this summer, but it’s already sped up the setup process for training on a new machine, because I only need to install CUDA and Docker, and then pull my image from Docker Hub. I’ve also slowly begun to convince people in my lab to use it instead of dealing with installing everything on each training machine themselves.

Giving CUDA 9 Another Shot

A few days ago I posted about my test of CUDA 9 and CuDNN 7. As I said at the end of that post, more experimentation was required before making conclusions. Today, I worked on running better benchmarking experiments. This time, I trained 10 epochs of a simple network using the CIFAR10 dataset. I ran the experiment 10 times on CUDA 8/CuDNN 6 and 10 times on CUDA 9/CuDNN 7 and took the best of the 10 for each of the cases. I believe this experiment is better at measuring the performance because the training was more GPU intensive than the MNIST test, so the training time should be more heavily affected by improvements in performance of the GPU.
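The best-of-N measurement described above can be sketched with Python's standard `time` module. This is only an illustration of the method: `best_of` is a made-up helper name, and `run` stands in for one full training run on a given CUDA/CuDNN setup.

```python
import time


def best_of(run, repeats=10):
    """Time run() `repeats` times and return the fastest wall-clock time."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()  # hypothetical: one complete 10-epoch training run
        timings.append(time.perf_counter() - start)
    return min(timings)
```

Taking the minimum rather than the mean filters out runs that were slowed down by unrelated activity on the machine.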


I trained the network with CIFAR10 for 10 epochs on an AWS p2.xlarge instance using an Nvidia Tesla K80 GPU. On the instance with CUDA 8/CuDNN 6, it took 110 seconds. On the instance with CUDA 9/CuDNN 7, it took 103 seconds. CUDA 9 was approximately 6% faster.
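The percentage depends slightly on how it is defined, which the small sketch below works through using the two timings above. The function names are just for illustration.

```python
def time_saved_percent(old_seconds, new_seconds):
    """Reduction in training time, relative to the old time."""
    return 100.0 * (old_seconds - new_seconds) / old_seconds


def speedup_percent(old_seconds, new_seconds):
    """Increase in training throughput, relative to the old throughput."""
    return 100.0 * (old_seconds / new_seconds - 1.0)
```

With 110 and 103 seconds, the training time drops by about 6.4% and the throughput rises by about 6.8%, so either way the result is roughly 6 to 7%.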

While this is only one data point, it does appear to show that CUDA 9 can be faster than CUDA 8 when training GPU-intensive models.

Resources for Deep Learning

Many people have asked me how they should learn how to use deep learning, so I figured I should consolidate my answer in one place. First off, I think the best way to understand deep learning is to have a strong mathematical background. That doesn’t mean that you have to have taken a formal course in calculus or linear algebra, but learning those things on your own is helpful. The best resource for this is Khan Academy.

After building a strong mathematical foundation, my recommendation isn’t to jump straight into deep learning, but to first start with classical machine learning. This includes algorithms such as Principal Component Analysis, Support Vector Machines, K-Means clustering, and some others. While these aren’t directly used in deep learning, they are a lot simpler to understand and more grounded in theory. These algorithms provide a good understanding of data analysis, supervised learning, and unsupervised learning. The best course for this is Andrew Ng’s Machine Learning course. I took this course in my third year of high school and found it to be very useful. The course also goes over important topics such as how to engineer a machine learning system to properly learn from data without having issues such as underfitting or overfitting.

Finally, after building up skills in math and classical machine learning, you’d be ready to dive into deep learning. I personally haven’t taken any course in deep learning because Andrew Ng hadn’t built one when I was learning it, so I learned it by poring over code and papers for months. However, to learn deep learning properly, I would suggest Andrew Ng’s new Deep Learning course series. Though it came too late for me to take it, I did look carefully at the syllabus for each of the 5 courses he has built.

This may all seem like a lot. Plenty of people do jump straight to deep learning without the mathematical foundation or the understanding of classical machine learning. However, I believe this doesn’t provide sufficient insight into the machinery of deep learning to be able to use it most effectively. I find that I personally benefit in my work from understanding the math behind deep learning and the classical techniques that were used prior to deep learning becoming a popular tool.

GPU Training Bottlenecks

Since I started at my research lab, model training has taken about 12 hours per epoch, and models need to be trained for at least 10 epochs. Over the past couple weeks, I started to analyze the code to understand why it was taking so long even though we have powerful GPUs for training. After profiling the code line by line, it turned out that the bottleneck wasn’t the actual computation, but simply loading in the training data and processing it on the fly. In this post, I will discuss a couple ways I used to eliminate this bottleneck and push up the GPU utilization to nearly 100%.

Pre-processing Data

The biggest portion of the data loading bottleneck isn’t the I/O, but the processing of the raw data into the form that’s inputted to the network. For example, in my lab, we have what we call “runs” of human driving data which we need to break up into time steps for the network. Even if this processing isn’t very complicated, it’s still done on the CPU while the GPU waits to be fed data, thus slowing down the entire training speed.

I solved this issue by separating the processing and the training. By processing all of the data beforehand into the form the network accepts as input, I eliminated most of the steps required to go from data sitting on the hard disk to training on the GPU. Now, the code only has to load the pre-processed data into memory and queue it up to feed to the GPU. This is significantly faster and shifts the bottleneck from processing the data to loading the data from disk. I needed another technique to eliminate that bottleneck.
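A minimal sketch of this separation is shown below. It is only an illustration: the function names are made up, a "run" is treated as a plain list of frames, the windowing is a guess at what breaking a run into time steps looks like, and `pickle` stands in for the HDF5 files my lab actually uses.

```python
import pickle


def split_into_windows(run, n_steps):
    """Split one recorded run into every window of n_steps consecutive frames."""
    return [run[i:i + n_steps] for i in range(len(run) - n_steps + 1)]


def preprocess(runs, n_steps, out_path):
    """Do all of the processing once and store the result on disk."""
    samples = [w for run in runs for w in split_into_windows(run, n_steps)]
    with open(out_path, "wb") as f:
        pickle.dump(samples, f)
    return len(samples)


def load_samples(path):
    """At training time, only this cheap load remains."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

The point of the design is that `preprocess` runs once, ahead of time, while the training loop only ever calls `load_samples`.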


Once I was able to eliminate the data processing bottleneck, I used multiprocessing to eliminate the data loading bottleneck.

To speed up data loading, I first looked into multithreading, where I would launch several threads and each one would simultaneously read data from disk and put it in a queue. However, I ran into trouble with this approach because of a Python lock called the Global Interpreter Lock (GIL). From what I understand, the GIL allows only one thread at a time to execute Python code in the interpreter, to prevent race conditions in code that isn’t thread-safe. This of course ended any hope for multithreading, since the GIL would naturally become the bottleneck.

It turned out that PyTorch has a built-in data loading class. Using it simply required me to extend a dataset class with an overridden function that enables random access to an element in my data. Armed with this, PyTorch could handle launching multiple processes, rather than threads, that each had their own Python interpreter. This is what finally eliminated all of the data bottlenecks and left the GPU as the primary bottleneck in the system.
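Here is a minimal sketch of what this can look like. The class names are my assumption, since they aren't given above: I'm taking the loading class to be `torch.utils.data.DataLoader` and the random access function to be `__getitem__`. `WindowDataset` and `make_loader` are made-up names, and the batch size and worker count are arbitrary. In real PyTorch code the dataset would usually subclass `torch.utils.data.Dataset`; a plain class with `__len__` and `__getitem__` is used here so the sketch can be read without PyTorch installed.

```python
class WindowDataset:
    """Map-style dataset: random access to one pre-processed sample."""

    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        # This is the random access function the loader calls.
        return self.samples[index]


def make_loader(dataset, batch_size=32, num_workers=4):
    """Wrap the dataset in a DataLoader backed by worker processes."""
    # Imported here so the dataset class above works without PyTorch.
    from torch.utils.data import DataLoader
    # num_workers > 0 makes the loader fetch samples in separate
    # processes, each with its own Python interpreter and its own GIL.
    return DataLoader(dataset, batch_size=batch_size, shuffle=True,
                      num_workers=num_workers)
```

The training loop then just iterates over the loader, and the worker processes keep batches queued up while the GPU is busy.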