HW selection benchmark for deep learning
Our company is doing active research in deep learning; to be a bit more specific, we are building very wide embedding layers. We need to be able to do interactive research and to train many networks on demand.
The number of weights in our networks grew well above 5 million, so training performance on a regular CPU became an issue, and we had to make a decision about hardware.
We decided to run some tests to figure out what works best for us. The main options were:
- Buy Titan X based machines;
- Go for the new AWS P2 instances with K80 accelerators;
- Keep working on existing laptops with built-in 840M graphics cards.
We tested all these hardware options with different batch sizes and different embedding sizes. You can think of the embedding size simply as a scale factor for the network. Since we work with Keras, it also made sense to test both available back-ends, Theano and TensorFlow.
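To give an idea of what such a benchmark looks like, here is a minimal sketch of a timing loop in Keras. It is not our actual research code: the vocabulary size, sequence length, and the trivial classifier head are placeholder assumptions; only the way the batch size, embedding size, and backend are varied mirrors the setup described above.

```python
# Minimal sketch of a per-epoch timing loop (placeholder model and data, not
# our research network). The Keras backend (Theano vs. TensorFlow) is selected
# via the KERAS_BACKEND environment variable or ~/.keras/keras.json.
import time
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

VOCAB_SIZE = 100000   # placeholder; 100k tokens * 50 dims = 5M embedding weights
SEQ_LEN = 20          # placeholder sequence length
N_SAMPLES = 1000000   # about 1 million samples, as in the benchmark

def build_model(embedding_size):
    model = Sequential()
    model.add(Embedding(VOCAB_SIZE, embedding_size, input_length=SEQ_LEN))
    model.add(Flatten())
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model

X = np.random.randint(0, VOCAB_SIZE, size=(N_SAMPLES, SEQ_LEN))
y = np.random.randint(0, 2, size=(N_SAMPLES,))

for embedding_size in (10, 50):
    for batch_size in (100, 500, 1000, 2000, 4000):
        model = build_model(embedding_size)
        start = time.time()
        # Keras 1.x API; newer Keras versions take epochs= instead of nb_epoch=
        model.fit(X, y, batch_size=batch_size, nb_epoch=1, verbose=0)
        print('embedding=%2d batch=%4d epoch time: %.2f s'
              % (embedding_size, batch_size, time.time() - start))
```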
| Batch size | Embedding size | Backend | GeForce 840M | Titan X (Pascal) | Tesla K80 (single GPU) |
|---|---|---|---|---|---|
| 100 | 50 | TensorFlow | 593 | 50.3 | 116 |
| 100 | 50 | Theano | 326 | 65.25 | 134 |
| 100 | 10 | TensorFlow | 121.5 | 30.5 | 70 |
| 100 | 10 | Theano | 117.25 | 44 | 75 |
| 500 | 50 | TensorFlow | 153.75 | 12 | 30.5 |
| 500 | 50 | Theano | 110.75 | 18.5 | 42 |
| 500 | 10 | TensorFlow | 27.75 | 6.25 | 15.25 |
| 500 | 10 | Theano | 29.5 | 9.25 | 18.25 |
| 1000 | 50 | TensorFlow | 100 | 7.25 | 20 |
| 1000 | 50 | Theano | 89.5 | 14 | 30 |
| 1000 | 10 | TensorFlow | 16.5 | 3.25 | 9 |
| 1000 | 10 | Theano | 19.75 | 5.25 | 12 |
| 2000 | 50 | TensorFlow | 73 | 5.25 | 14 |
| 2000 | 50 | Theano | 69 | 7 | 19 |
| 2000 | 10 | TensorFlow | 10.25 | 2.25 | 6 |
| 2000 | 10 | Theano | 13 | 4 | 8 |
| 4000 | 50 | TensorFlow | 60 | 4 | 12 |
| 4000 | 50 | Theano | 62 | 6 | 16 |
| 4000 | 10 | TensorFlow | 7.25 | 1.25 | 4 |
| 4000 | 10 | Theano | 9.75 | 3 | 6 |
Table 1. Benchmark results. The three GPU columns show the average time per epoch, in seconds, measured on about 1 million samples.
We used a p2.xlarge instance, which provides a single GPU (half of a K80 board). It should be mentioned that the Titan X and half of a K80 both have 12 GB of memory.
Main observations:
- On our workload, the K80 accelerator gives roughly half the performance of the Titan X.
- Batch size is a very important optimization factor.
- In our case TensorFlow is considerably faster than Theano.
- A simple graphics card such as the 840M is an order of magnitude slower than the high-end cards on big networks.
- For smaller networks, the gap between the 840M and the high-end cards is much less significant.
Here is an illustration of how batch size affects training time (showing only TensorFlow with embedding size 50, to keep things clear):
Thus, our main decision came down to a choice: go with the K80 on AWS, or buy our own Titan X hardware.
A Titan X costs $1,200, and a good machine to host it brings the total to about $2,500.
An AWS p2.xlarge costs $0.90 per hour.
If we take electricity prices into account, we reach break-even at about 3,000 hours ($2,500 at $0.90 per hour is roughly 2,800 hours, and the electricity a local machine would burn pushes it a bit further). That is a lot: at 8 hours of training per working day, it comes to more than a year.
If we trained networks 24 hours a day, it would take far less calendar time to break even, but that is not how our research works.
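For reference, here is the back-of-the-envelope calculation behind that break-even estimate; the per-hour electricity cost is a rough assumption on our part.

```python
# Back-of-the-envelope break-even calculation (prices from the text above;
# the electricity cost per hour is a rough assumption).
titan_x_machine_cost = 2500.0   # USD: Titan X ($1,200) plus the rest of the machine
aws_hourly_rate = 0.90          # USD per hour for a p2.xlarge (on-demand, at the time)
electricity_per_hour = 0.07     # USD per hour, rough assumption for a GPU workstation

# Owning breaks even once the accumulated AWS bill exceeds the machine cost
# plus the electricity burned while running it locally.
break_even_hours = titan_x_machine_cost / (aws_hourly_rate - electricity_per_hour)
working_days = break_even_hours / 8   # 8 hours of training per working day

print('Break-even after about %.0f GPU-hours (~%.0f working days)'
      % (break_even_hours, working_days))
# -> roughly 3,000 hours, i.e. more than a year of 8-hour working days
```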
So money-wise, AWS was the winner. Moreover, if we want the flexibility to train one network one day and dozens the next, AWS wins hands down.
Another important point is that AWS provides us with a very simple way to share configured environments among our team members.
It has to be mentioned that a local card has the advantage of simpler UI access to data and results. At the same time, a properly configured Jupyter setup should cover most of that (we are currently working on it).
Note also that we train neural networks on big data sets, and the ability to scale is critical for us, which is another vote for the AWS-based solution.
Interactivity of the research was also one of our goals. The fact that the K80 gives us about half the performance of a Titan X is not pleasant, but it does not kill interactivity. If the difference were bigger, we would consider it a problem.
So, eventually, we decided to go with Amazon AWS p2.xlarge instances.
Many thanks to our deep learning ninja Sergii Myskov, who did all the heavy lifting: data preparation, devops, Keras programming, and more.