SqueezeNet Research Status

As you may know from my first blog post, my research work is on autonomous driving with SqueezeNet. Despite my post a few days ago about speeding up model training, I haven’t had much success applying the pre-processing and multiprocessing approaches to this problem. It turns out that the HDF5 data storage format we use at my lab to store our training data doesn’t easily support concurrent access from multiple processes. For now, since I need to get some models trained, I’m training them using our old training code, which takes 12 hours per epoch.


I’m currently running 5 experiments: 1 baseline, 2 to explore the ability of SqueezeNet to learn to drive, and 2 on modified versions of SqueezeNet that have memory.


Z2Color

Z2Color is the baseline model that Karl, my professor, has been researching for over a year. The network has done all of our autonomous driving; when Karl first hired me, he hadn’t yet been able to find a better network. Now, of course, I’m working on creating many networks that are better.

There are two main problems with this network. The first is that its maximum receptive field (the effective vision size of the network) is smaller than the image, so it requires fully connected layers at the end of the network that are not spatially invariant. The other big issue is that the network has no memory, so if the car gets stuck in a position where the camera can’t help it get out, it has no memory of where it was before.


SqueezeNet

SqueezeNet is my primary means of fixing the first problem with Z2Color. Its receptive field is bigger than the input camera image, so it’s able to see everything at once even though there are no fully connected layers. This saves on training computation and allows me to use that computation in other parts of the network.


AlexNet

AlexNet is what inspired the creators of SqueezeNet: they wanted AlexNet-level performance on image classification from a much more efficient network. Of course, I’m not doing image classification, so I’m training both SqueezeNet and AlexNet to drive so that I can compare them against each other.

SqueezeNet with LSTM

LSTMs, a form of neural network memory, should help SqueezeNet remember its past actions when it’s trying to decide on future ones. If this experiment ends up performing very well, it will solve problem #2 with Z2Color.

SqueezeNet with GRU

GRUs are a newer, simpler form of neural network memory that have been shown to perform as well as LSTMs. Thus, I figured I should try them as well, since I’m inexperienced with LSTMs and GRUs and don’t know which one will work better here.
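To make the memory experiments concrete, here is a minimal sketch of the general pattern, not our actual research code: a tiny convolutional trunk (standing in for SqueezeNet) extracts per-frame features, and a GRU integrates them over time. All layer sizes, names, and the two-value output head are placeholders of my own:

```python
# Hypothetical sketch: CNN feature extractor feeding a GRU over frames.
import torch
import torch.nn as nn

class ConvGRUDriver(nn.Module):
    def __init__(self, feat_dim=64, hidden_dim=32, n_outputs=2):
        super().__init__()
        # Stand-in for the SqueezeNet trunk: a tiny conv stack.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_outputs)  # e.g. steer, throttle

    def forward(self, frames):  # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        f = self.features(frames.flatten(0, 1)).flatten(1)  # (b*t, feat_dim)
        out, _ = self.gru(f.view(b, t, -1))                 # (b, t, hidden)
        return self.head(out[:, -1])                        # last time step

model = ConvGRUDriver()
pred = model(torch.randn(2, 5, 3, 64, 64))  # 2 clips of 5 frames each
```

Swapping the nn.GRU for an nn.LSTM gives the other memory variant; the rest of the pattern is the same.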


Overall, I have several exciting experiments running that should greatly improve our understanding of SqueezeNet. I will post an update on these experiments when I start seeing results. Unfortunately, training is very slow, so it may take a week or so.


Docker for Deep Learning

Recently, I’ve started using Docker to manage the development environment required for deep learning. Docker allows me to decide exactly which versions of which software are installed, and how. It’s very useful to be able to script the development environment and easily switch between different environments. For example, in my lab, the codebase is currently in Python 2, but we’re about to change it to Python 3. This change would normally require switching the host machine’s environment over to Python 3, but with Docker, I can just update the Dockerfile to say I want Python 3. Allow me to explain by example.

Using a Docker Image

Suppose we have a machine with Ubuntu 16.04, but the code we’re trying to run needs the software package versions of an Ubuntu 14.04 distribution. Rather than installing Ubuntu 14.04 just to run our code, we can run it inside a Docker container that has Ubuntu 14.04 packages. Let’s say our code can be run with the command my_code. Then to run it inside an Ubuntu 14.04 Docker container, we can run it like this:

docker pull ubuntu:14.04
docker run -it ubuntu:14.04 my_code

The first line pulls the latest Ubuntu 14.04 image from Docker Hub, an online repository of pre-built Docker images. The second line launches a Docker container running the Ubuntu 14.04 image and executes our code. When our code executes inside a Docker container, it still has access to the host machine’s CPUs and RAM as it normally would outside the container. However, the filesystem our code sees is the Docker image’s filesystem, not the host machine’s.

But what if we want to access files in our host machine’s filesystem? We can run the following snippet instead:

docker run -v /path/to/my/data:/foo -w /foo -it ubuntu:14.04 my_code

The -v argument binds the host machine’s /path/to/my/data to /foo inside the container. The -w argument sets the working directory of the container to /foo.

Making a Docker Image

Now, suppose that we aren’t happy with the Ubuntu 14.04 Docker image that’s provided on Docker Hub and we want to make our own modified version that adds the libjpeg-dev package, for example. How do we do that? First, we create a file named Dockerfile:

mkdir ubuntu-custom
cd ubuntu-custom
touch Dockerfile

Then, we open our Dockerfile in an editor and write this:

FROM ubuntu:14.04

RUN apt-get update && apt-get install -y --no-install-recommends \
        libjpeg-dev && \
    rm -rf /var/lib/apt/lists/*

This tells Docker that we want to inherit from the Ubuntu 14.04 image and then install libjpeg-dev. We clean up the Apt package lists after installing libjpeg-dev so that we aren’t packaging unnecessary cached files into our new Docker image.

Finally, to build a new image from our Dockerfile, we run the following:

docker build -t ubuntu-custom .

Now, we can execute our code in our new image with:

docker run -it ubuntu-custom my_code

Using Docker with GPU Training

One caveat of Docker is that containers cannot easily access GPUs belonging to the host machine. To make this easier, Nvidia has created nvidia-docker, a wrapper around Docker that exposes the host’s Nvidia GPUs to containers. Its usage is otherwise identical to Docker’s.

My Docker Hub Images

For my own convenience, I’ve uploaded my own Docker images for deep learning to Docker Hub. tpankaj/dl-pytorch is built for PyTorch users and tpankaj/dl-keras is built for Keras (TensorFlow backend) users. Both use the latest stable version of their respective libraries. Here is an example of how to use them:

nvidia-docker pull tpankaj/dl-keras
nvidia-docker run -it tpankaj/dl-keras python Train.py


I just started using Docker this summer, but it’s already sped up the setup process for training on a new machine, because I only need to install CUDA and Docker, and then pull my image from Docker Hub. I’ve also slowly begun to convince people in my lab to use it instead of dealing with installing everything on each training machine themselves.

Giving CUDA 9 Another Shot

A few days ago I posted about my test of CUDA 9 and CuDNN 7. As I said at the end of that post, more experimentation was required before drawing conclusions. Today, I worked on running better benchmarking experiments. This time, I trained 10 epochs of a simple network on the CIFAR10 dataset. I ran the experiment 10 times on CUDA 8/CuDNN 6 and 10 times on CUDA 9/CuDNN 7 and took the best time in each case. I believe this experiment measures performance better because the training is more GPU-intensive than the MNIST test, so the training time should be more heavily affected by improvements in GPU performance.


I trained the network with CIFAR10 for 10 epochs on an AWS p2.xlarge instance using an Nvidia Tesla K80 GPU. On the instance with CUDA 8/CuDNN 6, it took 110 seconds. On the instance with CUDA 9/CuDNN 7, it took 103 seconds. CUDA 9 was approximately 6% faster.
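The speedup figure works out as follows:

```python
cuda8_time = 110  # seconds for 10 CIFAR10 epochs on CUDA 8 / CuDNN 6
cuda9_time = 103  # seconds on CUDA 9 / CuDNN 7
speedup = (cuda8_time - cuda9_time) / cuda8_time
print(f"CUDA 9 was about {speedup:.0%} faster")  # prints "CUDA 9 was about 6% faster"
```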

While this is only one data point, it does appear to show that CUDA 9 can be faster than CUDA 8 when training GPU-intensive models.

Resources for Deep Learning

Many people have asked me how they should learn deep learning, so I figured I should consolidate my answer in one place. First off, I think the best way to understand deep learning is to have a strong mathematical background. That doesn’t mean you have to have taken a formal course in calculus or linear algebra, but learning those subjects on your own is helpful. The best resource for this is Khan Academy.

After building a strong mathematical foundation, my recommendation isn’t to jump straight into deep learning, but to first start with classical machine learning. This includes algorithms such as Principal Component Analysis, Support Vector Machines, K-Means clustering, and others. While these aren’t directly used in deep learning, they are much simpler to understand and more grounded in theory. These algorithms provide a good understanding of data analysis, supervised learning, and unsupervised learning. The best course for this is Andrew Ng’s Machine Learning course. I took this course in my third year of high school and found it very useful. The course also covers important topics such as how to engineer a machine learning system to properly learn from data without running into issues such as underfitting or overfitting.

Finally, after building up skills in math and classical machine learning, you’d be ready to dive into deep learning. I personally haven’t taken any course in deep learning because Andrew Ng hadn’t built one when I was learning it, so I learned it by poring over code and papers for months. However, to learn deep learning properly, I would suggest Andrew Ng’s new Deep Learning course series. Though it came too late for me to take it, I did look carefully at the syllabus for each of the 5 courses he has built.

This may all seem like a lot. Plenty of people do jump straight to deep learning without the mathematical foundation or the understanding of classical machine learning. However, I believe this doesn’t provide sufficient insight into the machinery of deep learning to be able to use it most effectively. I find that I personally benefit in my work from understanding the math behind deep learning and the classical techniques that were used prior to deep learning becoming a popular tool.

GPU Training Bottlenecks

Since I started at my research lab, model training has taken about 12 hours per epoch, and models need to be trained for at least 10 epochs. Over the past couple weeks, I started to analyze the code to understand why it was taking so long even though we have powerful GPUs for training. After profiling the code line by line, it turned out that the bottleneck wasn’t the actual computation, but simply loading in the training data and processing it on the fly. In this post, I will discuss a couple ways I used to eliminate this bottleneck and push up the GPU utilization to nearly 100%.

Pre-processing Data

The biggest portion of the data loading bottleneck isn’t the I/O, but the processing of the raw data into the form that’s fed to the network. For example, in my lab, we have what we call “runs” of human driving data, which we need to break up into time steps for the network. Even though this processing isn’t very complicated, it’s done on the CPU while the GPU waits to be fed data, slowing down the entire training process.

I solved this issue by separating the processing and the training. By processing all of the data beforehand into the form the network accepts as input, I eliminated most of the steps required to go from data sitting on the hard disk to training on the GPU. Now, the code only has to load the pre-processed data into memory and queue it up to feed to the GPU. This is significantly faster and shifts the bottleneck from processing the data to loading the data from disk. I needed another technique to eliminate that bottleneck.
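As a toy illustration of the pre-processing step (the field names and window length here are made up, not our actual data format), here is how a run might be sliced into training samples ahead of time:

```python
# Toy illustration (made-up format): break one "run" of driving data into
# fixed-length windows of frames paired with the steering value to predict.
def make_samples(run, n_frames=10):
    """run is a list of (frame, steer) pairs, one per time step."""
    samples = []
    for i in range(len(run) - n_frames):
        frames = [frame for frame, _ in run[i:i + n_frames]]
        _, target = run[i + n_frames]  # steering command right after the window
        samples.append((frames, target))
    return samples

run = [(f"img{i}", i) for i in range(12)]  # toy run of 12 time steps
samples = make_samples(run)
# 12 - 10 = 2 windows, each pairing 10 frames with the next steering value
```

In the real pipeline, the resulting samples would be written to disk once, so the training loop only has to read them back.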


Once I was able to eliminate the data processing bottleneck, I used multiprocessing to eliminate the data loading bottleneck.

To speed up data loading, I first looked into multithreading, where I would launch several threads that each simultaneously read data from disk and put it in a queue. However, I ran into trouble with this approach because of a Python lock called the Global Interpreter Lock (GIL). From what I understand, the GIL prohibits more than one thread from accessing the Python interpreter at once, to prevent race conditions due to code that isn’t thread-safe. This of course ended any hope for multithreading, since the GIL itself would become the bottleneck.

It turned out that PyTorch has a built-in data loading class called torch.utils.data.DataLoader. This class simply required me to extend torch.utils.data.Dataset, overriding the methods that enable random access to elements of my data. Armed with this, PyTorch could handle launching multiple processes, rather than threads, each with its own Python interpreter. This is what finally eliminated the data bottlenecks and left the GPU as the primary bottleneck in the system.

CUDA 9 and CuDNN 7 with PyTorch

I’ve been trying to find ways to speed up training of my autonomous driving networks, since the current training time is about 12 hours per epoch. One of my most recent efforts has been trying to upgrade to the newly released CUDA 9.0 RC and CuDNN 7 packages from Nvidia. While these are optimized for Nvidia’s new Volta architecture, Nvidia claims they also speed up operations on Pascal GPUs like the 1080 Tis my lab has.

To get CUDA 9 and CuDNN 7 working with PyTorch, the deep learning framework all of my group’s research code is written in, I had to clone Pull Request #2263 from the PyTorch GitHub, which is written by an Nvidia engineer to add CUDA 9 and CuDNN 7 support to PyTorch. However, it turned out there were some other issues with this. To get everything to work, here are the steps I had to follow:

  1. Download and install CUDA 9
  2. Download and install CuDNN 7
  3. Download and install NCCL
  4. Download and install Anaconda for Python 3.6
  5. Run the following workaround for NCCL:
    mkdir -p ~/nccl/include ~/nccl/lib
    ln -s /usr/include/nccl.h ~/nccl/include/
    ln -s /usr/lib/libnccl.so ~/nccl/lib/
    export LIBRARY_PATH="$HOME/nccl/lib:$LIBRARY_PATH"
  6. Clone the CUDA 9 branch for PyTorch:
    git clone https://github.com/csarofeen/pytorch
    cd pytorch
    git checkout cuda9
  7. Compile PyTorch
    export CMAKE_PREFIX_PATH="$(dirname $(which conda))/../"
    conda install numpy pyyaml mkl setuptools cmake gcc cffi
    conda install -c soumith magma-cuda80
    python setup.py install

Once I finally got it working, I ran a speed test by running the PyTorch MNIST example on an AWS p2.xlarge instance with an Nvidia Tesla K80 using all of the default settings in the example code. Unfortunately, the speed test didn’t show CUDA 9 speeding up training in this case. On an instance with CUDA 8 and CuDNN 6, the MNIST example took 88 seconds to train 10 epochs. On an instance where I did the above steps to get CUDA 9 and CuDNN 7 working, it took 89 seconds. More experimentation is required to see if extra performance can be squeezed out of CUDA 9 and CuDNN 7.

Starting a Blog

Today, I am starting a blog to keep track of my deep learning research. I’m an undergraduate research assistant at the Berkeley DeepDrive lab. I work under Dr. Karl Zipser on the Autonomous Driving in Unstructured Conditions project. I’ve been on this project since January 2017, and I just finished up a summer internship at Qualcomm before returning to the Berkeley DeepDrive project.

My research on this project focuses on studying SqueezeNet as a neural network capable of autonomous driving. This includes comparing it to our existing network, which was designed by Karl, as well as building upon SqueezeNet to achieve higher performance on our validation dataset.

My goal is to update this blog at least once a week with my latest research work and deep learning experience, and to use it as an engineering notebook to refer back to.