Saturday, September 23, 2017

Building Deep Learning Machine Under $2000 with Dual GPUs

With my experimental model getting larger and larger, training takes too long time. This is especially true as most of my model is vision-based, so it requires a lot of computation and memory. Yes, I could use cloud computing, such as AWS or GCP, but I did some calculation.

I usually run training 24x7, and therefore assuming $2/hour, it would cost $1440 per month. This is for training a single model. If I train multiple models, the cost would be more than I am comfortable spending.

Instead, I could spend $2000 once building my own system with 2 GPUs, and some $50 or less each month for electric bill and train two models simultaneously. I could even sell the rig later on when I need to upgrade. I am expecting resale value of 1/3 to 1/2 of my system in 2 years.

The answer is quite obvious at this point. I need to build my own rig. After some research, below is the list of parts and justification if necessary.

CPU: AMD Ryzen 1700
This is a 8-core 16-thread processor from AMD. Since most of the computation during training is performed by GPU and not CPU, I did not want to spend more than $300 on CPU. I debated whether to get even cheaper one, 1600, which has 6-cores and 12-threads with higher clock speed. This could be a better option for neural network training. They are both good options. However, at the time of buying, I could not get Ryzen 1600 at its retail price, because 1600 was in such a high demand.

I did not get Intel CPU because it is over-priced at the moment, as the new generation Coffee Lake is imminent. If I could wait a few more months, maybe Coffee Lake processors could be much better candidates than Kaby Lake.

This was a tough call. I could get 1060 6G, 1070 8G, 1080 8G, or 1080 Ti 11G. The best bang for the buck would be 1060 6G, but I wanted more VRAM than 6GB. Next up is 1070 8G, but this was too expensive at the time due to high demand, costing around $500. Next up is 1080 8G, which is around $550 with 30% more performance than 1070. Next up is 1080 Ti 11G at $750, but this is too expansive compared to 1080 8G and the performance gain does not justify it. I therefore went with GTX 1080 8G. In fact, I got 2x GTX 1080 8G to train two models simultaneously.

AMD GPUs are not considered, as most deep learning libraries do not fully support AMD GPUs at the moment. I really hope AMD catches up with GPGPU support for deep learning libraries soon.

Mainboard: ASRock Fatal1ty X370 Gaming K4
This was one of the cheapest mainboards that support AMD Ryzen series CPUs and 2x PCI Express3 8 lanes each. Since I was getting two GPUs, I wanted to make sure that both GPUs get at least PCI Express3 8 lanes.

Yes I could have chosen CPU and mainboard to support dual PCI Express3 16 lanes, but this would sky-rocket my rig cost, and I don't think there will be much performance difference between PCI Express3 8 lanes vs 16 lanes for GTX 1080 graphics cards (source). If you are getting GTX 1080 Ti series, then perhaps you may want to opt for high-end CPU that supports PCI Express 32+ lanes and mainboard to fully support PCI Express3 16 lanes for each GPU, but you would have to spend $3000 or so on the system.

RAM: 2x DDR4 2400 8G
I will get more RAM when the cost goes down a bit. Currently the memory price is just too expansive.

SSD: Samsung Evo 860 500G
Just get a decent SSD with >= 500G. Absolutely no HDD, as this will significantly lower the performance. Samsung's SSDs are renowned for speed and stability.

Power Supply: 850W Gold-rated
Maybe 850W is too much for my config, but it is always better to choose power supply with abundant output. A cheap low-quality low-power output supply can actually destroy the whole system! I roughly estimated 100W for CPU, 200W for each GPU, and 100W for the rest. This is 600W in total, and I wanted 200W margin just to be safe, but 700W+ should have worked just fine.

Case: ATX Mid-Tower
Choose whatever you like as long as it is large enough to fit two GPUs and the motherboard. Most of ATX mid-tower cases should do.

OK, so the grand total excluding monitor/mouse/keyboards, etc is a bit more than $1900 before tax. With this config, you can train a network that requires up to 16GB of VRAM, since you have 2x GTX 1080 8G, although you will need to make sure to distribute the work load between the two GPUs manually in the code.

I installed Ubuntu 16.04 LTS for now, although I may switch to Cent OS later on. After installing NVIDIA CUDA toolkit, I can successfully detect both GPUs and use them simultaneously with no problem. I did not connect them with SLI though, since it is not needed for my purpose.

Good luck with configuring your system!

Tuesday, September 19, 2017

Tensorflow Fundamentals - K-means Cluster Part 1

Now that we are familiar with Tensorflow, let us actually write code. For this series of posts, we are going to implement K-means clustering algorithm with Tensorflow.

K-means clustering algorithm is to divide a set of points into k-clusters. The simplest algorithm is
1. choose k random points
2. cluster all points into corresponding k groups, where each point in the group is closest to the centroid
3. update the centroids by finding geometric centroids of the clusters
4. repeat steps 2 & 3 until satisfied

Below is my bare-minimum implementation in Tensorflow.

Tensorflow Fundamentals - Interactive Session

As I have discussed in my previous posts, Tensorflow's computation graphs will not evaluate the expression unless one explicitly asks it to do so. One may find this quite annoying during debugging, so Tensorflow provides what is called Interactive Session, which let's you evaluate the expressions as you go, with minimal code.

This is really simple; just call
sess = tf.InteractiveSession()

in the beginning, and this will act like the block
with tf.Session() as sess:

See the code below.

Sunday, September 17, 2017

Tensorflow Fundamentals - Computation Graph Part 3

Next up, we want to now evaluate an expression from given input values. Let us construct a function (graph)
f(x,y) = x + y

where x,y are input values to fed into the graph f. To do this, we need to use tf.placeholder methods to define input nodes, and supply feed_dict parameter to run method as shown below:

The code is easy enough to be self-explanatory. The output of the code shall look similar to
$ python 2>/dev/null
8 + -6 = 2
-6 + 4 = -2
-10 + -1 = -11
-6 + 0 = -6
5 + 9 = 14
7 + 6 = 13
3 + 8 = 11
3 + 6 = 9
5 + -4 = 1
0 + -3 = -3

Tensorflow Fundamentals - Computation Graph Part 2

Let's continue our journey on Tensorflow's computation graph.

We will now make use of Tensorflow's variables. They are different from constant in that
1. they are mutable during the execution, i.e., they can change their values
2. they will store their state (value), which shall equal to that from the lastest execution

For example, you can define a counter variable that will increment its value on each execution:
counter = counter + 1

Below is the simple demo code for doing this in Tensorflow:

The only catch here is that
1. tf.assign is another operation that will assign a new value to the variable on each execution (run)
2. one must initialize all variables

Running it will output

So far so good!

Tensorflow Fundamentals - Computation Graph Part 1

This post is intended for anyone, including myself, who is having difficulty grasping the very basic concepts of Tensorflow: Computation Graph. This post is heavily based on Tensorflow's official documentation.

The way Tensorflow and Theano operates is just different from all those other ones in that there are two distinct phases. In the first phase one builds the computation graph that defines the computations to be performed. It is like a function where given whatever inputs, it will spit out outputs. For example, let us say you define
f(x,y) = x + y

Then f is the computation graph in Tensroflow. This phase is called construction phase.

In the second phase, one executes the graph by feeding in the inputs. For example,
f(1,2) = 3
f(3,2) = 5

and so on for any x,y pairs you feed in. The graph will output the numerical values given numerical inputs. This phase is called execution phase.

One difference, however, is that not only the mathematical operations, such as additions, subtractions, multiplications, and divisions but also each variables are considered as operation nodes in Tensorflow's computation graphs. Thus, with the example above, we now have three ops:

Here, x y are constant ops, meaning that their values will be some constant directly fed in during the execution phase. The output of the graph f can be fed into a more complex graph.

Let's do a very simple example in code. We will define the computation graph for
f = pi + 1

and compute for constant pi = 3.14159...

The output of the script shall yield

So far so easy. We will progressively construct and execute more complex and useful graphs, so stay with me.

Saturday, September 16, 2017

Examining the Bottleneck between CPU and NVIDIA GPU

I was investigating which part of my computer is the culprit for slowing down neural net training. I first thought it was CPU doing the image preprocessing, as my CPU is Intel's low-end series G4560, which only costs about $90, whereas my GPU is NVIDIA's high-end series GTX 1070 that costs more than whopping $400, thanks to cryptocurrency booming.

To my surprise, it was actually the GPU that was lagging behind this time, at least for the current network that I am training. I would like to share how I found out whether GPU or CPU was lagging. Below is the code, most of which is taken from Patrick Rodriguez's repository keras-multiprocess-image-data-generator.

To run the script, you first need to install necessary modules. Save the following as requirement.txt

Next, run the command below to automate installing all the necessary modules:
$ pip install -r requirement.txt

Lastly, you also need python-tk module, so install it via
$ sudo apt-get install python-tk

Now, you can run the script
$ python

Note that you must have NVIDIA GPU in order for the script to work.

Friday, September 15, 2017

Multithreading in Python

In the previous post, I investigated a way to preprocess images using multiple processes. In this post, I will investigate a way to preprocess images using multiple threads.

The real difference from the multiprocessing code is not much. Instead of using multiprocessing.Pool class, use multiprocessing.pool.ThreadPool class. Below is the code:

The execution time for multithreading is a bit slower than that of multiprocessing, but I am not sure if this is always the case, as the difference is not significant.

single thread elapsed time: 364
threads elapsed time: 184
threads elapsed time: 115

Multiprocessing with Python

I have been training a simple neural network on my desktop, and I realized that GPU wasn't running at its full capacity, i.e., there must be some bottleneck from, most likely, CPU side. My guess is image preprocessing from CPU is taking longer than GPU computation for each batch. In order to reduce the time for CPU to preprocess the images, I started investigating multiprocessing option in Python.

Below is a simple code for running OpenCV's Canny function across multiple processes using Python's built-in multiprocess module:

Running the script yields approximately linear time reduction for 2 processes and sub-linear for 4 processes, due to other bottleneck, such as disk IO.

single process elapsed time: 364
2 processes elapsed time: 181
4 processes elapsed time: 108

Saturday, September 2, 2017

Execute User Scripts on Linux

I wanted to have my VirtualBox guest system start automatically when I turn on the host computer. After some research, here is what I have found out:

First, write a script file that is to be executed when the computer boots. For my case, it was file that reads
vboxmanage startvm CentOS --type headless

Make sure that this script is runnable.
$ chmod u+x

Next, run the script yourself to test it works
$ ./

When everything works fine, edit crontab to run the script on boot. Make sure to run the command below as the user who shall execute the script on boot. That is, don't run it as root unless you want the script to be executed as root.
$ crontab -e

Add the following line when in the crontab
@reboot /path/to/

That's it! The script shall be executed on each system boot!

Friday, September 1, 2017

Enable X11 Forwarding from macOS X

In order to enable X11 forwarding from macOS client, you must first download XQuartz from and install it. Then, you must log out and log back in.

Next, you can remote into your server via ssh with -X or -Y option:
$ ssh username@server_ip -X

To verify that you have X11 forwarding, simply examine $DISPLAY variable while you are remotely logged in:
$ echo $DISPLAY

By the way, some Linux servers may not have X11 forwarding feature disabled by default. In this case, you must enable it by editing /etc/ssh/sshd_config file:
X11Forwarding yes

Enjoy Mac Life!

Monitor NVIDIA GPU Status

If you have properly installed NVIDIA driver, then you can easily check your GPU's temperature by running
$ nvidia-smi -q -d temperature

In case you are not sure how to install NVIDIA drivers, refer to this page for excellent answer.

To display the GPU status in general, run
$ nvidia-smi

To watch the GPU status real time, run
$ watch nvidia-smi