Page 24 - Fister jr., Iztok, and Andrej Brodnik (eds.). StuCoSReC. Proceedings of the 2016 3rd Student Computer Science Research Conference. Koper: University of Primorska Press, 2016
P. 24
cific compute unit. This units are then grouped into one, All this actions are implemented in the init() method.
two or three dimensional set, which are called work-groups.
After the initialization process has finished the main loop of
For the implementation we obtained the source code of the the algorith is executed as shown in algorithm 2. In each
iterative CMA-ES algorithm from the jMetal Java based iteration of the main loop the SamplePopulation method
framework for multi-objective optimization [1]. To speed- executes the OpenCL kernels with the OpenCL command
up the algorithm we integrated OpenCL into the execution clEnqueueN DRangeKernel. When the SamplePopulation
process. In the first part the algorithm selects the OpenCL method has finished the OpenCL device holds the current
device with the largest number of compute units in addiction best solution found according to the fitness function which
to the work group size. is implemented in the computeF itnessKernel.

After that it initializes the OpenCL context with the OpenCL In the last phase of the main loop the algorithm executes
clCreateContext command. The OpenCL context is re- all the kernels in the UpdateDistribution method, which up-
quired for managing OpenCL objects such as command- dates the covariance matrix as described in [5]. When the
queues, memory, program and kernel objects and for execut- number of evaluations is bigger than the predefined maxi-
ing kernels on one or more devices specified in the context. It mum evaluation number, the host machine reads the best so-
continues with creating the OpenCL command-queue with lution from the device memory, with the OpenCL command
the command clCreateCommandQueue. clEnqueueReadBuf f er and outputs the result as shown in
algorithm 2.
The main loop of the CMA-ES algorithm is divided into
kernels which are enqueued to the command-queue with the Algorithm 2 CMA-ES (OpenCL version)
OpenCL clEnqueueN DRangeKernel command. This ker-
nels are than executed on the device. When the termina- 1: procedure execute()
tion criterion is met, the host machine reads the best solu- 2: ...
tion from the device memory with the OpenCL command 3: // Host: initialize variables
clEnqueueReadBuf f er and outputs the result. 4: // Host: initializes kernels and input arguments
5: // Host: send data to OpenCL device
Algorithm 1 CMA-ES (Sequential version) 6: init()
7: ...
1: procedure execute() 8: // Host: main loop
2: ... 9: while counteval < maxEvaluations do
3: // initialize variables 10: // Host: execute kernels to sample population
4: init() 11: samplePopulation();
5: ... 12: // Host: update counteval
6: while counteval < maxEvaluations do 13: counteval += (populationSize*populationSize);
7: // get a new population of solutions 14: // Host: execute kernels to update the covariance
8: population = samplePopulation()
9: for i ∈ 0...populationSize do matrix
10: // evaluate solution 15: updateDistribution();
11: evaluateSolution(population.get(i));
12: // update counteval 16: ...
13: counteval += populationSize; 17: // Host: read the best solution from the device
14: // returns the best solution using a Comparator 18: clEnqueueReadBuffer(...);
15: storeBest(comparator); 19: ...
16: // update the covariance matrix 20: // Host: store the best solution
17: updateDistribution(); 21: resultPopulation.add(bestSolutionEver);
22: ...
18: ... 23: // Host: return the best solution
19: // store the best solution 24: return resultPopulation;
20: resultPopulation.add(bestSolutionEver);
21: ... 3. RESULTS
22: // return the best solution
23: return resultPopulation; The tests were performed on two main hardware configu-
rations. Besides testing the performance improvement of
The analysis of the iterative CMA-ES algorithm has shown our GPU implementation, we tested the differences between
that the most time expensive steps are in the SamplePopu- Nvidia and ATI graphic cards. The first machine was equipped
lation method, in the inner for loop and in the UpdateDistri- with an i7-3820 CPU clocked at 3.60 GHz coupled with a
bution method. To speed-up the algorithm we implemented GeForce GTX 650. The second configuration was an AMD
eight OpenCL kernels, which encapsulate the logic of those FX-6300 clocked at 3500 MHz with a AMD Radeon R9
methods. In order to execute this kernels on the device, 280X. The details of both cards are shown in Table 1
the implementation firstly initializes these kernels and their
input arguments with the OpenCL clSetKernelArg com- The jMetal framework offers plentiful problems to test opti-
mand. After that it sends all the precomputed data from mization algorithms. We have chosen four typical problems,
the host to the device with the clCreateBuf f er command. namely Rosenbrock, Sphere, Griewank, and Rastrigin. Each
problem was tested on each hardware configuration and with
different sample sizes. We measured the running time of the

StuCoSReC Proceedings of the 2016 3rd Student Computer Science Research Conference 24
Ljubljana, Slovenia, 12 October
   19   20   21   22   23   24   25   26   27   28   29