Details & Early Benchmarks of OpenCL accelerated SQLite on ARM Mali

Posted: March 26, 2014 in OpenCL

I’ve done some more tuning over the past couple of days. I’ve also done some reading about how to make OpenCL perform better on ARM Mali.  In this post I’m going to retrace some of my steps, share what my tests looks like, share what some of my OpenCL looks like, share current performance numbers and last discuss next steps.

Gentle Reminder

This is in many ways still a prototype / proof of concept. My early goals are to get a very good sense what the possible performance of an OpenCL accelerated SQLite would be for the general case.  From this prototype I want to be able to iterate to complete implementation.

 Performance Testing the Prototype

I’m comparing the c apis that SQLite provides as well as my own API that I’ve developed. My API works with the same data structures and is compiled with the SQLite source code. I’m able to have an apples to apples comparison. The end of the performance measurement is always the caller of the API has all the data to the SQL statement that they requested.

While OpenCL has some nice mechanisms to measure the beginning and end of operations, I’m instead using clock_gettime to for time measurements. Example

 if (clock_gettime(CLOCK_MONOTONIC,&startSelectTime)) {
 return -1; 
 }
operation
 if (clock_gettime(CLOCK_MONOTONIC,&endSelectTime)) {
 return -1; 
 }

I am using 13 SQL statements that were defined in the Cuda accelerated SQLite paper by Peter Bakkum and Kevin Skadron. I am further using their same 100,000 row x 7 column database.

The 13 SQL statements are:

char *sql1 = "SELECT id, uniformi, normali5 FROM test WHERE uniformi > 60 AND normali5 < 0";
 char* sql2 = "SELECT id, uniformf, normalf5 FROM test WHERE uniformf > 60 AND normalf5 < 0";
 char *sql3 ="SELECT id, uniformi, normali5 FROM test WHERE uniformi > -60 AND normali5 < 5";
 char *sql4 ="SELECT id, uniformf, normalf5 FROM test WHERE uniformf > -60 AND normalf5 < 5";
 char *sql5 ="SELECT id, normali5, normali20 FROM test WHERE (normali20 + 40) > (uniformi - 10)";
 char *sql6 ="SELECT id, normalf5, normalf20 FROM test WHERE (normalf20 + 40) > (uniformf - 10)";
 char *sql7 ="SELECT id, normali5, normali20 FROM test WHERE normali5 * normali20 BETWEEN -5 AND 5";
 char *sql8 ="SELECT id, normalf5, normalf20 FROM test WHERE normalf5 * normalf20 BETWEEN -5 AND 5";
 char *sql9 ="SELECT id, uniformi, normali5, normali20 FROM test WHERE NOT uniformi OR NOT normali5 OR NOT normali20";
 char *sql10 ="SELECT id, uniformf, normalf5, normalf20 FROM test WHERE NOT uniformf OR NOT normalf5 OR NOT normalf20";
 char *sql11 ="SELECT SUM(normalf20) FROM test";
 char *sql12 ="SELECT AVG(uniformi) FROM test WHERE uniformi > 0";
 char *sql13 ="SELECT MAX(normali5), MIN(normali5) FROM test";

The queries are a good starting point however I do feel the need to add to them as they avoided the use of character data for instance. For a start however I find them reasonable.

A bit about SQLite

There isn’t much magic when it comes to SQLite or really most other databases when you think about it. You have rows of data. Within those rows there are some number of columns. Your query might be interested in the 1st, 3rd and 5th column but nothing else. You might have some sort of test or sum or average you want performed as part of your query. As it stands today, SQLite will parse out the SQL statement and turn that into a bytecode program. The bytecode program is run in a virtual machine that knows how to look in the right spots for where the columns are, their types, perform operations on the data and so on.

The SQLite virtual machine does not have a just in time compiler.

My OpenCL acceleration essentially picks up at the point after which the bytecode program has been created but not executed. The normal case you’d execute the bytecode program on the CPU and get your data. The OpenCL accelerated case, the OpenCL kernel takes the place of the bytecode program with the database data as input.

In the future instead of bytecode I think a utilization of llvm’s MCJIT to just in time compile for either a CPU or dispatch to a GPU makes the most sense. At a future date, SPIR (OpenCL’s IR) which was generated by the just in time compiler could be then fed into a SPIR supporting driver for GPU offload. In the case of CPUs you’d of course use MCJIT to generate machine instructions.

Evolution of OpenCL SQLite queries

Consider the following SQL statement:

SELECT id, uniformi, normali5 FROM test WHERE uniformi > 60 AND normali5 < 0

Internal to SQLite this comes from a database with 7 columns. id, uniformi and normali5 are the first three columns in the row.

So in our OpenCL code if we can use the following mapping:

typedef struct tag_my_struct {
 int id; 
 int uniformi;
 int normali5;
 } Row;

(Note I did try to use an int3 or an int2 and an int in the data structure but I could never get it to work. With queries involving say 4 ints, I have used int4 vectors and that works really well)

This gives us the following code to run for each row:

 do {
    if ((data[offset].uniformi > 60) && (data[offset].normali5 < 0)) {
       resultArray[roffset].id = data[offset].id;
       resultArray[roffset].uniformi = data[offset].uniformi;
       resultArray[roffset].normali5 = data[offset].normali5; 
       roffset++;
    } 
    offset++;
    endRow--;
 } while (endRow);

Next it comes down to how many rows should a work unit handle?

We’re fortunate in that within our computational problem, we don’t have a dependency between rows. If we had 100000 rows to consider and 100000 GPU cores every GPU core could work on one row and that would be that. Silly, but you could do that.

Now if you’re thinking to yourself the output in the resultArray has to be different from each run of the kernel and the roffset will be different from each work unit run and have to be picked up and considered as part of the results you’re entirely correct.

There is some amount of post processing work that needs to be performed.

The time it takes to do any sort of data copying / munging into the GPUs memory + the time it takes to run the OpenCL kernel + the time to copy results out and do any post processing needs to be compared to the time the original non OpenCL accelerated SQLite implementation would have taken.

Tuning for Mali

It’s important to understand the architecture of the device(s) you maybe running on. In my case I’m using a Mali T604 on a ARM based Samsung Chromebook.  There a couple of papers I recommend reading, especially the first one.

http://malideveloper.arm.com/downloads/GPU_Pro_5/GronqvistLokhmotov_white_paper.pdf 

http://malideveloper.arm.com/develop-for-mali/tutorials-developer-guides/developer-guides/mali-t600-series-gpu-opencl-developer-guide/

For me there were are couple of pieces of advice that had significant gains.

 Avoid array indexing arithmetic

Using the first sql1 query I first accessed the elements of data in the following way:

 if ((data[offset+1] > 60) && (data[offset+2] < 0)) {

This is slower as compared to the method of using an array of data structures such as:

typedef struct tag_my_struct {
 int id; 
 int uniformi;
 int normali5;
 } Row;
 if ((data[offset].uniformi > 60) && (data[offset].normali5 < 0)) {

Vectors are even better but as mentioned I wasn’t able to get a structure with an int3 or an int2 and an int to work. A structure with float4 or int4 vectors however worked just fine.

typedef struct tag_my_struct {
 float4 v;
 } Row;

How about the use of vload/vstore? For my data I haven’t as of yet seen a noticeable performance improvement using them.  This is on my list to revisit.

The memory architecture of Mali is important to understand and take advantage of.  First access to OpenCL global and local data has the same cost.

Buffers

How one creates memory buffers is important and has a performance impact.

clCreateBuffer(CL_MEM_ALLOC_HOST_PTR)  is the preferred method.

clCreateBuffer(CL_MEM_USE_HOST_PTR)  uses a buffer allocated by malloc. Unfortunately it means that the contents of that malloced region must be copied into storage accessible by Mali.  CL_MEM_ALLOC_HOST_PTR avoid the copy.

On desktop GPUs memory on the graphics card is disjoint from memory that the CPU core(s) have direct access to. As such operations such as

 err = clEnqueueReadBuffer(s->queue, s->results_gpu->roffsetResults ....

make sense. However this is slower with ARM and Mali. Instead

 t = clEnqueueMapBuffer(s->queue, s->results_gpu->roffsetResults ....

is must faster. In my case I started with clEnqueueReadBuffer and via my new found knowledge in the guides moved to clEneueMapBuffer.

Kernel Work Units

How many work units one should use is not always a straight forward value. Gronqvist & Lokhmotov note that the theoretical maximum for Mali is 256 but the actual number is based on how many registers are being used by the OpenCL kernel. Four and under 256 is the right value. Between Four and Eight, 128 and between Eight and Sixteen 64 work units.

For my kernels I noticed that my integer based kernels 64 work units was the right choice. For my kernels that mix integer and floating point 128 was the right setting. I don’t have an explanation for this.

Early Results

Here’s what I’m seeing for results across the 13 queries. CPU times are with SQLite version built with -O2 and running on a Cortex-A15. GPU times are from a Mali T604.

SELECT id, uniformi, normali5 FROM test WHERE uniformi > 60 AND normali5 < 0

CPU sql1 took 43631 microseconds

OpenCL sql1  took 14545 microseconds  (2.99x better or 199% better)


SELECT id, uniformf, normalf5 FROM test WHERE uniformf > 60 AND normalf5 < 0

CPU sql2 took 62785 microseconds

OpenCL sql2 took 7756 microseconds (8.09x better or 709% better)


SELECT id, uniformi, normali5 FROM test WHERE uniformi > -60 AND normali5 < 5

CPU sql3 took 114448 microseconds

OpenCL sql3.cl took 44533 microseconds (2.56x better or 156% better)


SELECT id, uniformf, normalf5 FROM test WHERE uniformf > -60 AND normalf5 < 5

CPU sql4 took 139694 microseconds

OpenCL sql4.cl took 20911 microseconds (6.68x better or 568% better)


SELECT id, normali5, normali20 FROM test WHERE (normali20 + 40) > (uniformi – 10)

CPU sql5 took 138859 microseconds

OpenCL sql5.cl took 48834 microseconds (2.84x better or 184% better)


SELECT id, normalf5, normalf20 FROM test WHERE (normalf20 + 40) > (uniformf – 10)

CPU sql6 took 163830 microseconds

OpenCL sql6.cl took 22712 microseconds (7.21x better or 621% better)


SELECT id, normali5, normali20 FROM test WHERE normali5 * normali20 BETWEEN -5 AND 5

CPU sql7 took 82662 microseconds

OpenCL sql7.cl took 20669 microseconds (3.99x better or 299% better)


SELECT id, normalf5, normalf20 FROM test WHERE normalf5 * normalf20 BETWEEN -5 AND 5

CPU sql8 took 96882 microseconds

OpenCL sql8.cl took 10854 microseconds (8.92x better or 792% better)


SELECT id, uniformi, normali5, normali20 FROM test WHERE NOT uniformi OR NOT normali5 OR NOT normali20

CPU sql9 took 74317 microseconds

OpenCL sql9.cl took 12955 microseconds (5.73x better of 473% better)


SELECT id, uniformf, normalf5, normalf20 FROM test WHERE NOT uniformf OR NOT normalf5 OR NOT normalf20

CPU sql10 took 91617 microseconds

OpenCL sql10.cl took 7524 microseconds (12.17x better or 1117% better)


SELECT SUM(normalf20) FROM test

CPU sql11 took 44995 microseconds

OpenCL sql11 took 2190 microseconds (20.54x or 1954% better)


SELECT AVG(uniformi) FROM test WHERE uniformi > 0

CPU sql12 took 41000 microseconds

OpenCL sql12.cl took 4236 microseconds (9.67x or 867% better)


SELECT MAX(normali5), MIN(normali5) FROM test

CPU sql13 took 52354 microseconds
OpenCL sql13.cl took 4619 microseconds (11.33x or 1034% better)


This gives us a range of improvements between 2.56x and 20.54x better when OpenCL is used for these queries.

Next Steps

I will post the prototype code to git.linaro.org. I plan to do this once I’m able to handle character data. Guessing that should be in about a week. It’ll give me some time to also cut off a useless prehensile tail or two.

From here I want to add character data into the mix and then begin the construction of the general purpose path to generate and execute the resulting OpenCL kernel. I want to gather a wider range of  performance numbers to understand what the minimum number of rows of data needs to be before it makes sense to use OpenCL.

I would like to run this code on a SoC with a 628 Mali  to see how the performance changes.

I would like to run this code on a different GPU such as Qualcomm’s Adreno.

Last  this approach certainly could use Renderscript instead of OpenCL. Obviously it would only work on Android but that’s perfectly fine and I think well worth the time. My “current” stable Android platform is an original Nexus 7 which while it runs KitKat I’m not sure would be the best choice as compared to the current Nexus 7 which has a 400 MHz quad-core Adreno 320.   Time to review the hardware budget.

Advertisements
Comments

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s