Within the GPGPU team, Gil Pitney has been working on Shamrock, an open source OpenCL implementation. It's really a friendly fork of the clover project, taken in a somewhat new direction.
Over the past few months Gil has updated it to use the new MCJIT from LLVM, which works much better on ARM processors. He has also updated Shamrock to build against current LLVM. I have a build based on 3.5.0 on my chromebook.
In time, Gil's Shamrock work will also be able to drive Keystone hardware, TI's computing solution with ARM cores and DSPs on board. Being able to drive DSPs with OpenCL is quite an awesome capability. I do wish I had one of those boards.
The other capability Shamrock has is to provide a CPU driver for OpenCL on ARM. How does it perform? Good question!
I took my OpenCL accelerated sqlite prototype and built it to use the Shamrock CPU-only driver. Which would you expect to be faster: SQL SELECT queries offloaded to a CPU-only OpenCL driver, or the same queries run by the sqlite engine?
If you guessed OpenCL running on a CPU-only driver, you're right. Now remember, the Samsung ARM based chromebook is a dual A15. The queries run against 100,000 rows in a single-table database with 7 columns. Lower numbers are better, and times are in microseconds.
sql1: sqlite engine 43653 microseconds; OpenCL on Shamrock (handcoded-opencl/sql1.cl) 17738 microseconds; 2.46x faster
sql2: sqlite engine 62530 microseconds; OpenCL on Shamrock (handcoded-opencl/sql2.cl) 18168 microseconds; 3.44x faster
sql3: sqlite engine 110095 microseconds; OpenCL on Shamrock (handcoded-opencl/sql3.cl) 18711 microseconds; 5.88x faster
sql4: sqlite engine 143278 microseconds; OpenCL on Shamrock (handcoded-opencl/sql4.cl) 19612 microseconds; 7.30x faster
sql5: sqlite engine 140398 microseconds; OpenCL on Shamrock (handcoded-opencl/sql5.cl) 18698 microseconds; 7.5x faster
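For anyone who wants to check the arithmetic, here is a small sketch that recomputes the speedups from the timings above. The variable and function names are mine, made up for illustration, and the figures may differ from the ones quoted in the last digit depending on how they are rounded.

```python
# Timings in microseconds, copied from the results above:
# query -> (sqlite engine, OpenCL on the Shamrock CPU-only driver)
timings_us = {
    "sql1": (43653, 17738),
    "sql2": (62530, 18168),
    "sql3": (110095, 18711),
    "sql4": (143278, 19612),
    "sql5": (140398, 18698),
}


def speedup(sqlite_us, opencl_us):
    """Return how many times faster the OpenCL run is than the sqlite run."""
    return sqlite_us / opencl_us


speedups = {query: speedup(*times) for query, times in timings_us.items()}

for query, factor in speedups.items():
    print(f"{query}: {factor:.2f}x faster")
```

One thing the raw numbers show: the OpenCL times stay in a narrow band of roughly 17.7 to 19.6 milliseconds, while the sqlite times more than triple from sql1 to sql4.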
These numbers for running on the CPU are pretty consistent, and I was concerned there was some error in the process. Yet the number of matching rows returned is the same for both the sqlite engine and the OpenCL versions, a check that helps detect functional problems. I've clipped the result row counts from the results above for brevity.
Frankly, I wasn't expecting this kind of speedup, especially with a CPU-only driver. Yet there it is in black and white. It does speak highly of OpenCL's ability to compute more efficiently when you have data parallel problems.
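To make "data parallel" a bit more concrete, here is a rough sketch of why a SELECT with a WHERE clause fits the model: each row can be tested on its own, without looking at any other row. This is not my actual kernel. The columns, the predicate and the function names are all made up for illustration, and a plain Python loop stands in for the OpenCL runtime launching one work-item per row. The final comparison mirrors the row count check described above.

```python
# Hypothetical column data: 100,000 rows, two integer columns.
# This is NOT the schema used in the prototype, just illustrative values.
NUM_ROWS = 100_000
col_a = [(i * 7) % 100 for i in range(NUM_ROWS)]
col_b = [(i * 13) % 50 for i in range(NUM_ROWS)]


def select_kernel(gid, a, b, out):
    """Body of one work-item: test row `gid` against the WHERE clause.

    Models a made-up query: SELECT ... WHERE a > 60 AND b < 10.
    Each call reads and writes only its own row, so calls are independent.
    """
    out[gid] = 1 if (a[gid] > 60 and b[gid] < 10) else 0


# An OpenCL runtime would run one work-item per row in parallel;
# here a sequential loop stands in for that launch.
matches = [0] * NUM_ROWS
for gid in range(NUM_ROWS):
    select_kernel(gid, col_a, col_b, matches)

opencl_style_count = sum(matches)

# Reference: a straightforward sequential scan, standing in for the sqlite engine.
reference_count = sum(1 for a, b in zip(col_a, col_b) if a > 60 and b < 10)

# Same functional check as in the post: both paths must return the same row count.
print(opencl_style_count == reference_count)
```

Because no work-item depends on another, the runtime is free to spread the rows across however many cores or lanes the device offers.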
Another interesting point in this comparison: the best results achieved so far have been on the Mali GPU, using vload/vstore to take advantage of SIMD vector instructions. On a CPU this would equate to using NEON. The Shamrock CPU-only driver doesn't support vload/vstore at the moment, so the compiled OpenCL kernel isn't even using NEON on the CPU to achieve these results.