Performance Report
- Objective: We train AlexNet on the ILSVRC 2012 dataset.
- Environment: Throughput is measured on a distributed GPU cluster in which every node is equipped with one K20 GPU card and 40 Gigabit Ethernet (GbE). Training data are partitioned and stored on the local HDD of each node. cuDNN R2 is enabled.
- Setting: See the net prototxt and solver. The training script and PS settings are provided here.
The following figure shows PMLS-Caffe's throughput speedup when training AlexNet under different staleness values and numbers of nodes. For 1 node, the performance of the original Caffe is reported. Throughput is evaluated with cuDNN R2 and CUDA 6.5.
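This page does not define the staleness value. Assuming it is the bound used by stale synchronous parallel (SSP) consistency in the PMLS parameter server, the fastest worker may run at most that many clocks (iterations) ahead of the slowest one. The Python sketch below illustrates that rule only; `can_advance` and the clock list are hypothetical names, not PMLS-Caffe APIs.

```python
def can_advance(worker_clocks, worker_id, staleness):
    """Return True if the worker may start its next clock.

    Assumption: a bounded-staleness (SSP-style) rule, where a worker that
    has completed c clocks may continue only if it is at most `staleness`
    clocks ahead of the slowest worker. staleness=0 behaves like fully
    synchronous training.
    """
    gap = worker_clocks[worker_id] - min(worker_clocks)
    return gap <= staleness


if __name__ == "__main__":
    # Completed clocks of four hypothetical workers.
    clocks = [5, 3, 4, 3]
    for s in (0, 1, 2):
        allowed = [can_advance(clocks, i, s) for i in range(len(clocks))]
        print(f"staleness={s}: {allowed}")
```

Under this assumption, a larger staleness value lets fast workers wait less often, at the cost of computing on older parameters.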
On our cluster, training AlexNet with 8 nodes takes PMLS-Caffe only 1 day to converge (compared to 5-7 days with single-machine Caffe) and achieves 56.5% top-1 accuracy on the validation set.
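The convergence times above imply a time-to-convergence speedup and a scaling efficiency. The Python sketch below works through that arithmetic using only the numbers quoted above (1 day on 8 nodes, 5-7 days on one machine); the function names are illustrative, and the result describes convergence time rather than the raw throughput shown in the figure.

```python
def speedup(baseline_days, distributed_days):
    """Ratio of single-machine time to distributed time."""
    return baseline_days / distributed_days


def scaling_efficiency(baseline_days, distributed_days, nodes):
    """Speedup divided by node count; 1.0 would be perfect linear scaling."""
    return speedup(baseline_days, distributed_days) / nodes


if __name__ == "__main__":
    NODES = 8
    DISTRIBUTED_DAYS = 1.0
    # The single-machine baseline is quoted as a range of 5 to 7 days.
    for baseline_days in (5.0, 7.0):
        s = speedup(baseline_days, DISTRIBUTED_DAYS)
        e = scaling_efficiency(baseline_days, DISTRIBUTED_DAYS, NODES)
        print(f"baseline {baseline_days:.0f} days: speedup {s:.1f}x, efficiency {e:.1%}")
```

With these inputs the speedup is between 5x and 7x on 8 nodes, which corresponds to a scaling efficiency between 62.5% and 87.5%.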
The following figures show how the validation error decreases with training time and iterations. For 1 node, the performance of the original Caffe is reported.
- Objective: We train GoogLeNet on the ILSVRC 2012 dataset.
- Environment: Throughput is measured on a distributed GPU cluster in which every node is equipped with one K20 GPU card and 40 Gigabit Ethernet (GbE). Training data are partitioned and stored on the local HDD of each node. cuDNN R2 is enabled.
- Setting: See the net prototxt and solver. The training script and PS settings are provided here.
The following figure shows PMLS-Caffe's throughput speedup when training GoogLeNet under different staleness values and numbers of nodes, compared to single-machine Caffe. Throughput is evaluated with cuDNN R2 and CUDA 6.5.
When training GoogLeNet with 8 nodes, PMLS-Caffe takes less than 48 hours to reach 50% top-1 accuracy and less than 75 hours to reach 57% top-1 accuracy, and it finally achieves 67.1% top-1 accuracy. This is about a 4x speedup over single-machine Caffe, which usually takes 15-20 days to converge, as shown in the following figures.
- Objective and dataset: We train a CNN on all available images in ImageNet, comprising 14,197,087 labeled images from 21,841 categories. We randomly split the whole set into two parts, using the first 7.1 million images for training and the remainder for testing. The whole dataset is about 3.2 TB, with 1.6 TB for training and 1.6 TB for testing.
- Environment: We train the CNN with full data parallelism on a GPU cluster with 8 nodes, each equipped with one K20 GPU card and 40 Gigabit Ethernet (GbE). Training data are partitioned and stored on the local HDD of each node. cuDNN R2 is enabled.
- Settings: The network and solver configurations will be released soon.
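The random train/test split and the per-node partitioning described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the script used for the experiment: `split_and_shard`, the fixed seed, and the round-robin sharding are all hypothetical choices, and the demo uses a tiny image count instead of the full 14,197,087 images.

```python
import random


def split_and_shard(num_images, num_train, num_nodes, seed=0):
    """Randomly split image indices into train/test, then shard train across nodes.

    Assumptions (not from the original setup): a seeded shuffle defines the
    random split, and training images are dealt to nodes round-robin so that
    each node stores a disjoint, near-equal share on its local disk.
    """
    indices = list(range(num_images))
    random.Random(seed).shuffle(indices)
    train, test = indices[:num_train], indices[num_train:]
    # Node k receives every num_nodes-th training index, starting at offset k.
    shards = [train[node::num_nodes] for node in range(num_nodes)]
    return shards, test


if __name__ == "__main__":
    # Toy sizes for illustration; the experiment uses 8 nodes.
    shards, test = split_and_shard(num_images=20, num_train=10, num_nodes=4)
    for node, shard in enumerate(shards):
        print(f"node {node}: {len(shard)} training images")
    print(f"test set: {len(test)} images")
```

In a data-parallel run, each node would then read only its own shard while all nodes train replicas of the same network.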
The following table compares our result with previous work on ImageNet 22K in terms of experimental settings, machine resources, training time, and train/test accuracy. Note that prediction performance depends primarily on the CNN architecture chosen, so it could be substantially improved with a different or better model.
Framework | Data (train/test) | # machines/cores | Time | Train accuracy | Test accuracy |
---|---|---|---|---|---|
PMLS-Caffe | 7.1M / 7.1M | 8 / 8 GPUs | 3 days | 41% | 23.7% |
Adam | 7.1M / 7.1M | 62 machines / ? | 10 days | N/A | 29.8% |
Le et al., w/ pretrain | 7.1M+10M unlabeled images / 7.1M | 1000 / 16000 cores | 3 days | N/A | 15.8% |
MxNet | 14.2M / No test | 1 / 4 GPUs | 8.5 days | 37.19% | N/A |
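One rough way to read the resource and time columns together is to multiply them into a resource-time product. The Python sketch below does this with the values from the table; the unit labels are an interpretation of the "# machines/cores" column, and because the rows use different hardware (GPUs versus CPU machines) the products are not directly comparable across units.

```python
# (framework, unit count, unit label, training days), taken from the table above.
# Assumption: the first number in "# machines/cores" is the machine count, and
# GPU rows are counted by GPU.
RUNS = [
    ("PMLS-Caffe", 8, "GPU", 3.0),
    ("Adam", 62, "machine", 10.0),
    ("Le et al., w/ pretrain", 1000, "machine", 3.0),
    ("MxNet", 4, "GPU", 8.5),
]


def resource_days(units, days):
    """Resource-time product, e.g. GPU-days or machine-days."""
    return units * days


if __name__ == "__main__":
    for name, units, label, days in RUNS:
        print(f"{name}: {resource_days(units, days):g} {label}-days")
```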