Page 25 - Fister jr., Iztok, and Andrej Brodnik (eds.). StuCoSReC. Proceedings of the 2016 3rd Student Computer Science Research Conference. Koper: University of Primorska Press, 2016
Property        GeForce GTX 650   AMD Radeon R9 280X
Shading Units   384               2048
Memory          1024 MB           3072 MB
GPU Clock       1058 MHz          850 MHz
Memory Clock    1250 MHz          1500 MHz
Table 1: GPU hardware comparison

algorithm for each run and computed averages for each sample size. The results for tests run on the GeForce GTX 650 are shown in Table 2.

Sample size   Griewank   Sphere    Rosenbrock   Rastrigin
10            6ms        6ms       6ms          7ms
100           9ms        9ms       9ms          9ms
500           79ms       73ms      73ms         72ms
1000          226ms      232ms     231ms        225ms
2500          1274ms     1272ms    1278ms       1271ms
5000          5260ms     5273ms    5270ms       5275ms
Table 2: Run times on Nvidia GeForce GTX 650

[Figure 1: Runtime comparison on logarithmic scale. Axes: Problem Size (x), Runtime in milliseconds (y); series: Nvidia GeForce GTX 650, Intel i7-3820]

To illustrate the performance improvement we ran the same tests on the Intel i7-3820 CPU. The results are shown in Table 3.

Sample size   Griewank   Sphere     Rosenbrock   Rastrigin
10            11ms       11ms       11ms         12ms
100           108ms      112ms      110ms        113ms
500           856ms      881ms      879ms        886ms
1000          5428ms     5533ms     5588ms       5610ms
2500          123.39s    121.57s    153.94s      127.89s
5000          1208.84s   1184.37s   1188.17s     1265.53s
Table 3: Run times on Intel i7-3820 CPU

Sample size   Griewank   Sphere    Rosenbrock   Rastrigin
10            31ms       26ms      28ms         25ms
100           32ms       31ms      37ms         30ms
500           86ms       85ms      92ms         94ms
1000          219ms      216ms     232ms        221ms
2500          1090ms     1092ms    1102ms       1085ms
5000          4119ms     4125ms    4140ms       4134ms
Table 4: Run times on AMD Radeon R9 280X

From Table 5 we can observe the performance of the CPU implementation on the AMD CPU. Compared with the Intel CPU, the AMD CPU is clearly much slower, which in turn explains the bandwidth bottleneck between the CPU and the GPU. The performance improvement of the GPU over the CPU on the AMD configuration is shown in Figure 2, plotted on a logarithmic scale.
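The table columns name four benchmark functions: Griewank, Sphere, Rosenbrock and Rastrigin. The paper's own implementations live inside its framework and are not shown on this page, so the following is only a minimal Python sketch using the standard textbook definitions. The function names and the list-based interface are assumptions made for illustration.

```python
import math

# Textbook definitions of the four benchmark functions named in the tables.
# These are illustrative sketches, not the framework's actual code.

def sphere(x):
    # Sum of squares; global minimum 0 at the origin.
    return sum(v * v for v in x)

def rosenbrock(x):
    # Sum over consecutive coordinate pairs; global minimum 0 at (1, ..., 1).
    return sum(
        100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
        for i in range(len(x) - 1)
    )

def rastrigin(x):
    # Highly multimodal; global minimum 0 at the origin.
    return 10.0 * len(x) + sum(
        v * v - 10.0 * math.cos(2.0 * math.pi * v) for v in x
    )

def griewank(x):
    # Quadratic term minus a product of cosines; global minimum 0 at the origin.
    quadratic = sum(v * v for v in x) / 4000.0
    product = math.prod(
        math.cos(v / math.sqrt(i)) for i, v in enumerate(x, start=1)
    )
    return quadratic - product + 1.0
```

All four functions are cheap to evaluate for a single point, which is consistent with the four columns of each table showing very similar run times.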
Sample size   Griewank   Sphere     Rosenbrock   Rastrigin
10            17ms       17ms       17ms         17ms
100           152ms      130ms      140ms        139ms
500           1864ms     1824ms     1803ms       1871ms
1000          24128ms    24028ms    24106ms      24191ms
2500          232.52s    229.29s    223.91s      234.76s
5000          3416.5s    3446s      3431.2s      3445.2s
Table 5: Run times on AMD FX-6300

Comparing Tables 2 and 3, we can observe the improvement of the GPU implementation over the CPU. On the largest problems (sample size 5000), the GPU implementation outperforms the CPU by a factor of about 230. On smaller tests, however, the improvement is not substantial. This was expected, because the kernels and the data need to be loaded from RAM to VRAM. On smaller problems this latency dominates, while on larger problems it does not impact the overall computation time as much. The performance increase can also be observed in Figure 1, which is plotted on a logarithmic scale.
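The quoted factor of 230 can be checked directly from the Griewank columns of Tables 2 and 3. The sketch below is a small Python calculation, with the CPU values given in seconds converted to milliseconds. The dictionary and function names are chosen here for illustration only.

```python
# Griewank run times in milliseconds, keyed by sample size.
# GPU values are from Table 2 (Nvidia GeForce GTX 650),
# CPU values are from Table 3 (Intel i7-3820), with seconds converted to ms.
gpu_ms = {10: 6, 100: 9, 500: 79, 1000: 226, 2500: 1274, 5000: 5260}
cpu_ms = {10: 11, 100: 108, 500: 856, 1000: 5428,
          2500: 123_390, 5000: 1_208_840}

def speedup(cpu, gpu):
    # Ratio of CPU time to GPU time for each sample size.
    return {n: cpu[n] / gpu[n] for n in sorted(cpu)}

for n, ratio in speedup(cpu_ms, gpu_ms).items():
    print(f"sample size {n:>5}: GPU is {ratio:6.1f}x faster")
```

The ratio stays below 2 at sample size 10 and rounds to 230 at sample size 5000, which matches the observation that the fixed cost of loading kernels and data matters mainly for small problems.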
We performed the same tests on the second configuration, using an AMD FX-6300 CPU and an AMD Radeon R9 280X GPU; the results are shown in Table 4. The results on the AMD GPU are a bit surprising. The graphics card is generally one of the fastest currently on the market, yet the Nvidia card outperformed it on smaller problems. After profiling the execution using JProfiler [11], we noticed that the bottleneck was the AMD CPU, which took much longer to load the kernels and data onto the GPU. On larger problems, where the actual computation takes longer, the slow bandwidth between the CPU and the GPU is not noticeable.

4. CONCLUSION

Using an open source meta-heuristic algorithm framework, we implemented a GPGPU version of the CMA-ES algorithm using OpenCL, which is an open standard. We empirically compared the computational results between the CPU and GPU versions, which show an improvement of the GPU over the CPU version. Additionally, we compared two different configurations in an attempt to eliminate the possible bias of the framework towards certain manufacturers. Even though the configurations were not in the same price range and hence
StuCoSReC Proceedings of the 2016 3rd Student Computer Science Research Conference 25
Ljubljana, Slovenia, 12 October