Archive for the ‘OpenCL’ Category

I’ve been working on moving the OpenCL accelerated sqlite prototype toward being able to support the general case instead of just the contrived set of initial SQL SELECTs.

First, why did I have to start out with a contrived set of SQL SELECTs to accelerate? Consider:

SELECT id, uniformi, normali5 FROM test WHERE uniformi > 60 AND normali5 < 0

For a query we need to have the equivalent in OpenCL. For the prototype I hand coded these OpenCL kernels and called the kernels with the data as obtained from the sqlite infrastructure.  I had to start somewhere. A series of SQL statements to try and shake out patterns for generation I thought would be the best path to validate this idea.

The next evolutionary step is to generate an OpenCL kernel by reading the parse tree that sqlite generates as it pulls apart the SQL statement.

This is what a machine generated kernel looks like for previously mentioned SQL statement:

__kernel void x2_entry (__global int * id, __global int * uniformi, __global int * normali5, __global int * _cl_resultMask) {
__private int4 v0;
__private int4 v1;
__private int4 v2;
__private int4 _cl_r;
int i = get_global_id(0);
size_t offset = i * (totalRows/workUnits);
do {
v0 = vload4(0, id + offset);
v1 = vload4(0, uniformi + offset);
v2 = vload4(0, normali5 + offset);
_cl_r = (( uniformi  >  60 ) && ( normali5  <  0 ));
vstore4(_cl_r, 0, _cl_resultMask + offset);
        offset+=4
        totalRows--;
} while(totalRows);
}

Why are we generating OpenCL kernel code there? Isn’t there a better way? Well there is. In later versions of the OpenCL standard (and HSA) there is something called an intermediate representation (IR) form which is very much akin to what compilers translate high level languages to before targeting the native instruction set of whatever that code will run on.

Unfortunately OpenCL’s IR otherwise known as SPIR isn’t available to us since the OpenCL drivers for ARM’s Mali currently don’t support it. Imagination’s PowerVR doesn’t either. (Heck Imagination requires an NDA to be signed to even get there drivers, talk about unfriendly!)  They might someday but that day isn’t today. Likewise HSA has an IRA as part of it’s standard called HSAIL.

Either one would be much better to emit of course presuming that the OpenCL drivers could take that IR as input.

None the less, as soon as I have “parity” with the prototype and a little testing I’ll commit the code that machine generates these OpenCL kernels to git. I’m getting close. The next step after that will be to make a few changes internal to sqlite use those kernels.

Advertisements

Within the GPGPU team Gil Pitney has been working on Shamrock which is an open source OpenCL implementation. It’s really a friendly fork of the clover project but taken in a bit of a new direction.

Over the past few months Gil has updated it to make use of the new MCJIT from llvm which works much better for ARM processors. Further he’s updated Shamrock so that it uses current llvm. I have a build based on 3.5.0 on my chromebook.

The other part about Gil’s Shamrock work is it will in time also have the ability to drive Keystone hardware which is TI’s ARM + DPSs on board computing solution. Being able to drive DSPs with OpenCL is quite an awesome capability. I do wish I had one of those boards.

The other capability Shamrock has is to provide a CPU driver for OpenCL on ARM. How does it perform? Good question!

I took my OpenCL accelerated sqlite prototype and built it to use the Shamrock CPU only driver. Would you expect that a CPU only OpenCL driver offloading SQL SELECT queries to be faster or would the  sqlite engine?

If you guessed OpenCL running on a CPU only driver, you’re right. Now remember the Samsung ARM based chromebook is a dual A15. The queries are against 100,000 rows in a single table database with 7 columns. Lower numbers are better and times

sql1 took 43653 microseconds
OpenCL handcoded-opencl/sql1.cl Interval took 17738 microseconds
OpenCL Shamrock 2.46x faster
sql2 took 62530 microseconds
OpenCL handcoded-opencl/sql2.cl Interval took 18168 microseconds
OpenCL Shamrock 3.44x faster
sql3 took 110095 microseconds
OpenCL handcoded-opencl/sql3.cl Interval took 18711 microseconds
OpenCL Shamrock 5.88x faster
sql4 took 143278 microseconds
OpenCL handcoded-opencl/sql4.cl Interval took 19612 microseconds
OpenCL Shamrock 7.30x faster
sql5 took 140398 microseconds
OpenCL handcoded-opencl/sql5.cl Interval took 18698 microseconds
OpenCL Shamrock 7.5x faster

These numbers for running on the CPU are pretty consistent and I was concerned there was some error in the process. Yet the returned number of matching rows is the same for both the sqlite engine and the OpenCL versions which helps detect functional problems. I’ve clipped the result row counts from the results above for brevity.

I wasn’t frankly expecting this kind of speed up, especially with a CPU only driver. Yet there it is in black and white. It does speak highly of the capabilities of OpenCL to be more efficient at computing when you have data parallel problems.

Another interesting thing to note in this comparison, the best results achieved have been with the Mali GPU using vload/vstores and thus take advantage of SIMD vector instructions. On a CPU this would equate to use of NEON. The Shamrock CPU only driver doesn’t at the moment have support for vload/vstore so the compiled OpenCL kernel isn’t even using NEON on the CPU to achieve these results.

I’ve posted my initial OpenCL accelerated sqlite prototype code:

http://git.linaro.org/people/tom.gall/sq-cl.git

Don’t get excited. Remember, it’s a prototype and a quite contrived one at that. It doesn’t handle the general case yet and of course it has bugs. But!  It’s interesting and I think shows what’s possible.

Over at the mali developer community that ARM hosts. I happened to mention this work which in a post that ended up resulting in some good suggestions to use of vectors as well as other good feedback. While working with vectors was a bit painful due to the introduction of some bugs on my part, I made my way through it and have some initial numbers with a couple of kernels so I can get an idea just what a difference it makes.

Alot.

The core of the algorithm for sql1 changes from:

    do {
        if ((data[offset].v > 60) && (data[offset].w < 0)) {
            resultArray[roffset].id = data[offset].id;
            resultArray[roffset].v = data[offset].v;
            resultArray[roffset].w = data[offset].w;
            roffset++;
        }
        offset++;
        endRow--;
    } while (endRow);

To

    do {
        v1 = vload4(0, data1+offset);
        v2 = vload4(0, data2+offset);
        r = (v1 > 60) && ( 0 > v2);
        vstore4(r,0, resultMask+offset);
        offset+=4;
        totalRows--;
    } while (totalRows);

With each spin through the loop, the vectorized version of course is operating over 4 values at once to check for a match. Obvious win. To do this the data has to come in in pure columns and I’m using an vector as essentially a bitmask to indicate if that row is a match or not. This requires a post processing loop to spin through and assemble the resulting data into a useful state. For the 100,000 row database I’m using it doesn’t seem to have as much of a performance impact as I thought it might.

For the first sql1 test query the numbers look like this:

CPU sql1 took 43631 microseconds
OpenCL sql1  took 14545 microseconds  (2.99x or 199% better)
OpenCL (using vectors) 4114 microseconds (10.6x better or 960%)

Not bad. sql3 sees even better results:

CPU sql3 took 111020 microseconds
OpenCL sql3 took 44533 microseconds (2.49x  or 149% better)
OpenCL (using vectors) took 4436 microseconds (25.02x or 2402% better)

There’s another factor why these vectorized versions are doing better. With the newer code I am using less registers on the Mali GPU and thus am able to up the number of work units from 64 to 128.

I do have one bug that I need to track down. I am (of course) validating that all the versions are coming up with the same matches. The new vector versions are off by a couple of rows. The missing rows don’t seem to follow a pattern. I’m sure I’ve done something dumb. Now that there is the ability for more eyes on the code perhaps someone will spot it.

I’ve done some more tuning over the past couple of days. I’ve also done some reading about how to make OpenCL perform better on ARM Mali.  In this post I’m going to retrace some of my steps, share what my tests looks like, share what some of my OpenCL looks like, share current performance numbers and last discuss next steps.

Gentle Reminder

This is in many ways still a prototype / proof of concept. My early goals are to get a very good sense what the possible performance of an OpenCL accelerated SQLite would be for the general case.  From this prototype I want to be able to iterate to complete implementation.

 Performance Testing the Prototype

I’m comparing the c apis that SQLite provides as well as my own API that I’ve developed. My API works with the same data structures and is compiled with the SQLite source code. I’m able to have an apples to apples comparison. The end of the performance measurement is always the caller of the API has all the data to the SQL statement that they requested.

While OpenCL has some nice mechanisms to measure the beginning and end of operations, I’m instead using clock_gettime to for time measurements. Example

 if (clock_gettime(CLOCK_MONOTONIC,&startSelectTime)) {
 return -1; 
 }
operation
 if (clock_gettime(CLOCK_MONOTONIC,&endSelectTime)) {
 return -1; 
 }

I am using 13 SQL statements that were defined in the Cuda accelerated SQLite paper by Peter Bakkum and Kevin Skadron. I am further using their same 100,000 row x 7 column database.

The 13 SQL statements are:

char *sql1 = "SELECT id, uniformi, normali5 FROM test WHERE uniformi > 60 AND normali5 < 0";
 char* sql2 = "SELECT id, uniformf, normalf5 FROM test WHERE uniformf > 60 AND normalf5 < 0";
 char *sql3 ="SELECT id, uniformi, normali5 FROM test WHERE uniformi > -60 AND normali5 < 5";
 char *sql4 ="SELECT id, uniformf, normalf5 FROM test WHERE uniformf > -60 AND normalf5 < 5";
 char *sql5 ="SELECT id, normali5, normali20 FROM test WHERE (normali20 + 40) > (uniformi - 10)";
 char *sql6 ="SELECT id, normalf5, normalf20 FROM test WHERE (normalf20 + 40) > (uniformf - 10)";
 char *sql7 ="SELECT id, normali5, normali20 FROM test WHERE normali5 * normali20 BETWEEN -5 AND 5";
 char *sql8 ="SELECT id, normalf5, normalf20 FROM test WHERE normalf5 * normalf20 BETWEEN -5 AND 5";
 char *sql9 ="SELECT id, uniformi, normali5, normali20 FROM test WHERE NOT uniformi OR NOT normali5 OR NOT normali20";
 char *sql10 ="SELECT id, uniformf, normalf5, normalf20 FROM test WHERE NOT uniformf OR NOT normalf5 OR NOT normalf20";
 char *sql11 ="SELECT SUM(normalf20) FROM test";
 char *sql12 ="SELECT AVG(uniformi) FROM test WHERE uniformi > 0";
 char *sql13 ="SELECT MAX(normali5), MIN(normali5) FROM test";

The queries are a good starting point however I do feel the need to add to them as they avoided the use of character data for instance. For a start however I find them reasonable.

A bit about SQLite

There isn’t much magic when it comes to SQLite or really most other databases when you think about it. You have rows of data. Within those rows there are some number of columns. Your query might be interested in the 1st, 3rd and 5th column but nothing else. You might have some sort of test or sum or average you want performed as part of your query. As it stands today, SQLite will parse out the SQL statement and turn that into a bytecode program. The bytecode program is run in a virtual machine that knows how to look in the right spots for where the columns are, their types, perform operations on the data and so on.

The SQLite virtual machine does not have a just in time compiler.

My OpenCL acceleration essentially picks up at the point after which the bytecode program has been created but not executed. The normal case you’d execute the bytecode program on the CPU and get your data. The OpenCL accelerated case, the OpenCL kernel takes the place of the bytecode program with the database data as input.

In the future instead of bytecode I think a utilization of llvm’s MCJIT to just in time compile for either a CPU or dispatch to a GPU makes the most sense. At a future date, SPIR (OpenCL’s IR) which was generated by the just in time compiler could be then fed into a SPIR supporting driver for GPU offload. In the case of CPUs you’d of course use MCJIT to generate machine instructions.

Evolution of OpenCL SQLite queries

Consider the following SQL statement:

SELECT id, uniformi, normali5 FROM test WHERE uniformi > 60 AND normali5 < 0

Internal to SQLite this comes from a database with 7 columns. id, uniformi and normali5 are the first three columns in the row.

So in our OpenCL code if we can use the following mapping:

typedef struct tag_my_struct {
 int id; 
 int uniformi;
 int normali5;
 } Row;

(Note I did try to use an int3 or an int2 and an int in the data structure but I could never get it to work. With queries involving say 4 ints, I have used int4 vectors and that works really well)

This gives us the following code to run for each row:

 do {
    if ((data[offset].uniformi > 60) && (data[offset].normali5 < 0)) {
       resultArray[roffset].id = data[offset].id;
       resultArray[roffset].uniformi = data[offset].uniformi;
       resultArray[roffset].normali5 = data[offset].normali5; 
       roffset++;
    } 
    offset++;
    endRow--;
 } while (endRow);

Next it comes down to how many rows should a work unit handle?

We’re fortunate in that within our computational problem, we don’t have a dependency between rows. If we had 100000 rows to consider and 100000 GPU cores every GPU core could work on one row and that would be that. Silly, but you could do that.

Now if you’re thinking to yourself the output in the resultArray has to be different from each run of the kernel and the roffset will be different from each work unit run and have to be picked up and considered as part of the results you’re entirely correct.

There is some amount of post processing work that needs to be performed.

The time it takes to do any sort of data copying / munging into the GPUs memory + the time it takes to run the OpenCL kernel + the time to copy results out and do any post processing needs to be compared to the time the original non OpenCL accelerated SQLite implementation would have taken.

Tuning for Mali

It’s important to understand the architecture of the device(s) you maybe running on. In my case I’m using a Mali T604 on a ARM based Samsung Chromebook.  There a couple of papers I recommend reading, especially the first one.

http://malideveloper.arm.com/downloads/GPU_Pro_5/GronqvistLokhmotov_white_paper.pdf 

http://malideveloper.arm.com/develop-for-mali/tutorials-developer-guides/developer-guides/mali-t600-series-gpu-opencl-developer-guide/

For me there were are couple of pieces of advice that had significant gains.

 Avoid array indexing arithmetic

Using the first sql1 query I first accessed the elements of data in the following way:

 if ((data[offset+1] > 60) && (data[offset+2] < 0)) {

This is slower as compared to the method of using an array of data structures such as:

typedef struct tag_my_struct {
 int id; 
 int uniformi;
 int normali5;
 } Row;
 if ((data[offset].uniformi > 60) && (data[offset].normali5 < 0)) {

Vectors are even better but as mentioned I wasn’t able to get a structure with an int3 or an int2 and an int to work. A structure with float4 or int4 vectors however worked just fine.

typedef struct tag_my_struct {
 float4 v;
 } Row;

How about the use of vload/vstore? For my data I haven’t as of yet seen a noticeable performance improvement using them.  This is on my list to revisit.

The memory architecture of Mali is important to understand and take advantage of.  First access to OpenCL global and local data has the same cost.

Buffers

How one creates memory buffers is important and has a performance impact.

clCreateBuffer(CL_MEM_ALLOC_HOST_PTR)  is the preferred method.

clCreateBuffer(CL_MEM_USE_HOST_PTR)  uses a buffer allocated by malloc. Unfortunately it means that the contents of that malloced region must be copied into storage accessible by Mali.  CL_MEM_ALLOC_HOST_PTR avoid the copy.

On desktop GPUs memory on the graphics card is disjoint from memory that the CPU core(s) have direct access to. As such operations such as

 err = clEnqueueReadBuffer(s->queue, s->results_gpu->roffsetResults ....

make sense. However this is slower with ARM and Mali. Instead

 t = clEnqueueMapBuffer(s->queue, s->results_gpu->roffsetResults ....

is must faster. In my case I started with clEnqueueReadBuffer and via my new found knowledge in the guides moved to clEneueMapBuffer.

Kernel Work Units

How many work units one should use is not always a straight forward value. Gronqvist & Lokhmotov note that the theoretical maximum for Mali is 256 but the actual number is based on how many registers are being used by the OpenCL kernel. Four and under 256 is the right value. Between Four and Eight, 128 and between Eight and Sixteen 64 work units.

For my kernels I noticed that my integer based kernels 64 work units was the right choice. For my kernels that mix integer and floating point 128 was the right setting. I don’t have an explanation for this.

Early Results

Here’s what I’m seeing for results across the 13 queries. CPU times are with SQLite version built with -O2 and running on a Cortex-A15. GPU times are from a Mali T604.

SELECT id, uniformi, normali5 FROM test WHERE uniformi > 60 AND normali5 < 0

CPU sql1 took 43631 microseconds

OpenCL sql1  took 14545 microseconds  (2.99x better or 199% better)


SELECT id, uniformf, normalf5 FROM test WHERE uniformf > 60 AND normalf5 < 0

CPU sql2 took 62785 microseconds

OpenCL sql2 took 7756 microseconds (8.09x better or 709% better)


SELECT id, uniformi, normali5 FROM test WHERE uniformi > -60 AND normali5 < 5

CPU sql3 took 114448 microseconds

OpenCL sql3.cl took 44533 microseconds (2.56x better or 156% better)


SELECT id, uniformf, normalf5 FROM test WHERE uniformf > -60 AND normalf5 < 5

CPU sql4 took 139694 microseconds

OpenCL sql4.cl took 20911 microseconds (6.68x better or 568% better)


SELECT id, normali5, normali20 FROM test WHERE (normali20 + 40) > (uniformi – 10)

CPU sql5 took 138859 microseconds

OpenCL sql5.cl took 48834 microseconds (2.84x better or 184% better)


SELECT id, normalf5, normalf20 FROM test WHERE (normalf20 + 40) > (uniformf – 10)

CPU sql6 took 163830 microseconds

OpenCL sql6.cl took 22712 microseconds (7.21x better or 621% better)


SELECT id, normali5, normali20 FROM test WHERE normali5 * normali20 BETWEEN -5 AND 5

CPU sql7 took 82662 microseconds

OpenCL sql7.cl took 20669 microseconds (3.99x better or 299% better)


SELECT id, normalf5, normalf20 FROM test WHERE normalf5 * normalf20 BETWEEN -5 AND 5

CPU sql8 took 96882 microseconds

OpenCL sql8.cl took 10854 microseconds (8.92x better or 792% better)


SELECT id, uniformi, normali5, normali20 FROM test WHERE NOT uniformi OR NOT normali5 OR NOT normali20

CPU sql9 took 74317 microseconds

OpenCL sql9.cl took 12955 microseconds (5.73x better of 473% better)


SELECT id, uniformf, normalf5, normalf20 FROM test WHERE NOT uniformf OR NOT normalf5 OR NOT normalf20

CPU sql10 took 91617 microseconds

OpenCL sql10.cl took 7524 microseconds (12.17x better or 1117% better)


SELECT SUM(normalf20) FROM test

CPU sql11 took 44995 microseconds

OpenCL sql11 took 2190 microseconds (20.54x or 1954% better)


SELECT AVG(uniformi) FROM test WHERE uniformi > 0

CPU sql12 took 41000 microseconds

OpenCL sql12.cl took 4236 microseconds (9.67x or 867% better)


SELECT MAX(normali5), MIN(normali5) FROM test

CPU sql13 took 52354 microseconds
OpenCL sql13.cl took 4619 microseconds (11.33x or 1034% better)


This gives us a range of improvements between 2.56x and 20.54x better when OpenCL is used for these queries.

Next Steps

I will post the prototype code to git.linaro.org. I plan to do this once I’m able to handle character data. Guessing that should be in about a week. It’ll give me some time to also cut off a useless prehensile tail or two.

From here I want to add character data into the mix and then begin the construction of the general purpose path to generate and execute the resulting OpenCL kernel. I want to gather a wider range of  performance numbers to understand what the minimum number of rows of data needs to be before it makes sense to use OpenCL.

I would like to run this code on a SoC with a 628 Mali  to see how the performance changes.

I would like to run this code on a different GPU such as Qualcomm’s Adreno.

Last  this approach certainly could use Renderscript instead of OpenCL. Obviously it would only work on Android but that’s perfectly fine and I think well worth the time. My “current” stable Android platform is an original Nexus 7 which while it runs KitKat I’m not sure would be the best choice as compared to the current Nexus 7 which has a 400 MHz quad-core Adreno 320.   Time to review the hardware budget.

Late on a Saturday night and I’m working on my monitor tan. It’s spring, can’t be too early to prepare for summer of course!

I’ve taken the following sql queries and run them both with the traditional sqlite c apis as well as with my OpenCL accelerated APIs. These queries are the same that Bakkum et all used in their cuda accelerated sqlite paper.

char *sql11 ="SELECT SUM(normalf20) FROM test";
char *sql12 ="SELECT AVG(uniformi) FROM test WHERE uniformi > 0";
char *sql13 ="SELECT MAX(normali5), MIN(normali5) FROM test";

Straight sqlite with my A15 based Samsung Chromebook yields:

sql11 took 95399 microseconds
sql12 took 86576 microseconds
sql13 took 121898 microseconds

My OpenCL APIs yields the following for the same queries:

OpenCL sql11 took 46098 microseconds
OpenCL sql12 took 55524 microseconds
OpenCL sql13 took 64802 microseconds

The data is the same for both straight C sqlite apis and OpenCL apis 100,000 rows to process from one database with one table. The time measured is the time to perform the query across all selected data and for the end user API to obtain the data. For OpenCL this includes the copying out of the data. For the straight c apis this includes the time accessing the one row.

I’m not applying any sort of statistical process or test to these results. That’ll be a later step to assert a confidence interval based on a distribution of collected results.

All in all I don’t think the results are too bad but I’d like to think OpenCL should be able do better. Time to spend a little time with perf as well as do a little digging to see what might be available from Mali developer to analyze performance on the GPU.

These microbenchmarks are important to me. They give a guide as far as what might be accomplished with a general purpose solution which is yet to be written.  They also are helping me to form opinions about how to best approach it.

People have side projects. This one is mine.

What if you accelerate the popular sqlite database with OpenCL? This is one of the ideas that was floated as part of the GPGPU team to get a feel for what might be accomplished on ARM hardware with a mobile GPU.

In my case I’m using the Mali opencl drivers, running with ubuntu linux on a Samsung Chromebook which includes a dual core A15 and a Mali T604. You can replicate this same setup following these instructions.

At Linaro Connect Asia 2014 as part of the GPGPU session I gave an overview of the effort but I wasn’t able to give any initial performance numbers since my free time is highly variable and Connect arrived before I was quite ready. At the time I was about a week out from being able to run a microbenchmark or two since I was just getting to the step of writing some of the OpenCL.

Before I get to some initial numbers let me review a bit of what I talked about at Connect.

To accelerate sqlite I’ve initially added an api that sits next to the sqlite C api. My API in time should be able to blend right into the sqlite API so that no code changes would be needed by end user applications.  With sqlite you usually have a call sequence something like :

sqlite3_open(databaseName, &db);
c= sqlite3_prepare_v2(db, sql, -1, &selectAll_statement, NULL);
while (sqlite3_step(selectAll_statement) == SQLITE_ROW) {
    sqlite3_column_TYPE(selectAll_statement,0);
}

The prepare call takes sql and converts it to an expression tree that is translated into a bytecode which is run inside of a VM. The virtual machine is really nothing more than an big switch statement and each case handles an op code that the VM operates over. sqlite doesn’t do any sort of JIT to accelerate it’s operation. (I know what you’re thinking, hold that thought.)

The challenge to make a general purpose acceleration is to take the operation of the VM and move that onto the GPU. I see a few ways to accomplish this. In the past work that Peter Bakkum and Kevin Skadron had done they basically moved the implementation of the VM into the GPU using Cuda. This kind of approach really doesn’t work in my opinion for using OpenCL. Instead I’m currently of the opinion that the output of the sql expression tree ought to be a bit more than just VM bytecodes. I do wonder if utilizing llvm couldn’t offer interesting possibilities including SPIR (the Khronos intermediate representation standard for OpenCL) . Further research for sure.

The opencl accelerated API sequence looks like:

opencl_init(s, db);
opencl_prepare_data(s, sql);
opencl_transfer_data(s);
opencl_select(s, sql, 0);
opencl_transfer_results(s);

At this point, what I’ve managed to do is using a 100,000 row database with 7 columns run the same query using the sqlite c interface and my opencl accelerated interface.

With the sqlite c API the query took 420274 microseconds on my chromebook a dual core A15 cpu running at 1.7 Gz.

The OpenCL accelerated API running on the Mali T604 GPU at 533Mhz(?) from the same Chromebook yields 110289 microseconds. This measured time includes both the running of the OpenCL kernel and the data transfer from the result buffer.

These are early results. Many grains of salt should be applied but over all this seems like good results for a mobile GPU.

OpenCL on ARM (part 1)

Posted: March 29, 2013 in linaro, OpenCL

A few weeks past before Linaro Connect I had started to see what might be available for OpenCL implementations on ARM. Via a little bit of googling it seemed that the only choice would going to be for boards with a Mali 6xx GPU. This was ok since that basically boils down to the Arndale board and the Samsung Chromebook. Both good options since I happened to have a Chromebook.

I downloaded the Mali OpenCL SDK which can be found from their site.

It didn’t take long following the instructions when I realized the SDK isn’t like most SDKs. Not contained within this SDK was any form of Mali OpenCL driver. Within the Mali SDK it contains a lib directory which when you type make (and you probably have to fix their Makefile to make it work) it will yield a libOpenCL.so it’s just that it’s essentially an empty stub library. You can compile and link against what is provided but when you try and run nothing will happen. Within this library is just a long list of functions with no implementation behind it. None. Not very useful.

Via this discussion, at the very bottom we see a bit of an explanation as to why.

We (ARM) do provide a build of Linux containing the OpenCL driver to select partners under a specific license, but this is not public at this time

So they leave it to the maker of the board to at their open distribute a driver. This gives the board maker the option to not support OpenCL at all if they so choose. Ok, I respect that and it makes sense, since just because a Mali T6xx part is on a board doesn’t mean that it’s wired up universally the same way which may require some driver specific change. It’s conjecture on my part since obviously we’ve no view into the source code as it’s not Open Source.

That said, the Insignal discussion can be found on their boards here. Simply put, not yet available for Linux but supposedly available for their Android Jelly Bean.

Hrumph!

I like Android but the problem is at Connect I gave up my Arndale board to one of my coworkers. I haven’t ordered a new one since the wait times are impressively long and currently they are sold out again at HowChip.

Android to my knowledge doesn’t run on the ARM based Samsung Chromebook so I’m out of options.

Next I did a little splunking within the ChromeOS file system on my Chromebook to see if I might find something to suggest that OpenCL was there. I didn’t find libOpenCL.so in any of the usual places so it’s probably safe to say ChromeOS doesn’t make use of OpenCL. No chance of copying over any binaries for use on Linux.

So backing up what other options do I have? Well I do have an OSX option. Putting together an OpenCL HelloWorld there is quite easy. Still.

I’ve a couple Intel Linux boxes, at least it would be a place to get my feet wet in the meantime and be more in line with what OpenCL on ARM linux will be like. So on Ubuntu I proceeded. There are two options. Either Intel’s VCSource or AMD’s APP SDK both proclaiming OpenCL support.

Let’s talk about how the OpenCL infrastructure is installed. First the includes that are best placed at /usr/include/CL. Not needed of course for runtime. Next if you put the contents of each respective SDK’s lib directory into /usr/lib/OpenCL/vendor/  intel or amd then you can have both SDK’s installed at the same time. These are needed at runtime. Next you have /etc/OpenCL/vendors which will have a number of .icd files. You only need one but with multiple SDKs you’ll have more than one.  The ICD is Installable Client Driver. IE This points to the real driver. This is required at runtime. For libOpenCL.so it looks to the icd to specify which driver(s) to use making libOpenCL.so more of a traffic cop between your application that uses libOpenCL.so and the real driver. Next within /etc/ld.so.conf.d you’ll have a new file that points to where the shared libraries are. So in my case these point to /usr/lib/OpenCL/vendors/intel and /usr/lib/OpenCL/vendors/amd in separate files. Last I have symlinks for OpenCL.so OpenCL.so.1 and OpenCL.so.1.2 that all point into the libOpenCL.so implementation I’m using such as the one in /usr/lib/OpenCL/vendors/amd.

AMD’s APP wants to set the environment variable AMDAPPSDKROOT=”/opt/AMDAPP” and does so in /etc/profile.

So knowing these aspects of setup, I proceeded to try out a simple HelloWorld app that would get the list of devices, create a context and spawn off some simple work to validation things are sane.

Let’s talk about how well things work with the Intel and AMD SDKs.

Intel’s SDK for Linux indicates they only support a limited set of CPUs. GPUs are not supported. Neither is the i7 CPU which is what my laptop has. Tried to run. Fail!  Intel’s SDK does not support any of their GPUs. If you want to use their SDK with an i7 for instance you can only do so on Windows! Lame! Further why Intel would have a dep on a very limited set of CPUs is beyond me.

Ok so obviously this wasn’t going to work. Next I switched over to the AMD APP SDK. As it turns out they support OpenCL for just CPUs IE without using a GPUs or for submitting work on both CPUs and GPUs. My laptop and my main intel desktop does not have an ATI GPU so this was essential for me to use the AMD implementation since they only support ATI GPUs, and as it turns out “any” Intel based CPU. Using the AMD supplied HelloWorld OpenCL app, it ran. But.

./HelloWorld
Setting of real/effective user Id to 0/0 failed
FATAL: Module fglrx not found.
Error! Fail to load fglrx kernel module! Maybe you can switch to root user to load kernel module directly
No GPU device available.
Choose CPU as default device.
input string:
GdkknVnqkc
output string:
HelloWorld

fglrx of course is the ATI kernel module. Via OpenCL you can specify that your workload is only going to be directed at CPUs. Even tho you might do so you’ll still get this error every time. Awesome! Least as compared to the Intel offering it runs on any CPU.