sq-cl code posted and some words about vectors

Posted: April 2, 2014 in linaro, OpenCL, open_source

I’ve posted my initial OpenCL accelerated sqlite prototype code:

http://git.linaro.org/people/tom.gall/sq-cl.git

Don’t get excited. Remember, it’s a prototype and a quite contrived one at that. It doesn’t handle the general case yet and of course it has bugs. But!  It’s interesting and I think shows what’s possible.

Over at the mali developer community that ARM hosts. I happened to mention this work which in a post that ended up resulting in some good suggestions to use of vectors as well as other good feedback. While working with vectors was a bit painful due to the introduction of some bugs on my part, I made my way through it and have some initial numbers with a couple of kernels so I can get an idea just what a difference it makes.

Alot.

The core of the algorithm for sql1 changes from:

    do {
        if ((data[offset].v > 60) && (data[offset].w < 0)) {
            resultArray[roffset].id = data[offset].id;
            resultArray[roffset].v = data[offset].v;
            resultArray[roffset].w = data[offset].w;
            roffset++;
        }
        offset++;
        endRow--;
    } while (endRow);

To

    do {
        v1 = vload4(0, data1+offset);
        v2 = vload4(0, data2+offset);
        r = (v1 > 60) && ( 0 > v2);
        vstore4(r,0, resultMask+offset);
        offset+=4;
        totalRows--;
    } while (totalRows);

With each spin through the loop, the vectorized version of course is operating over 4 values at once to check for a match. Obvious win. To do this the data has to come in in pure columns and I’m using an vector as essentially a bitmask to indicate if that row is a match or not. This requires a post processing loop to spin through and assemble the resulting data into a useful state. For the 100,000 row database I’m using it doesn’t seem to have as much of a performance impact as I thought it might.

For the first sql1 test query the numbers look like this:

CPU sql1 took 43631 microseconds
OpenCL sql1  took 14545 microseconds  (2.99x or 199% better)
OpenCL (using vectors) 4114 microseconds (10.6x better or 960%)

Not bad. sql3 sees even better results:

CPU sql3 took 111020 microseconds
OpenCL sql3 took 44533 microseconds (2.49x  or 149% better)
OpenCL (using vectors) took 4436 microseconds (25.02x or 2402% better)

There’s another factor why these vectorized versions are doing better. With the newer code I am using less registers on the Mali GPU and thus am able to up the number of work units from 64 to 128.

I do have one bug that I need to track down. I am (of course) validating that all the versions are coming up with the same matches. The new vector versions are off by a couple of rows. The missing rows don’t seem to follow a pattern. I’m sure I’ve done something dumb. Now that there is the ability for more eyes on the code perhaps someone will spot it.

Advertisements
Comments
  1. Emanuele says:

    This lib, if stable, would really change the way I work. I am a data scientist, usually I keep data in Sqlite for consistency, I retrieve them and I use OpenCL-accelerated Python for computations. Being able to make some pre-computation efficiently in sqlite would be amazing. I’ll follow the project very carefully!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s