Archive for the ‘open_source’ Category

Within the GPGPU team, Gil Pitney has been working on Shamrock, an open source OpenCL implementation. It's a friendly fork of the Clover project, taken in a bit of a new direction.

Over the past few months Gil has updated it to make use of LLVM's new MCJIT, which works much better on ARM. He's also updated Shamrock to track current LLVM; I have a build based on 3.5.0 on my Chromebook.

The other part of Gil's Shamrock work is that in time it will also be able to drive Keystone hardware, which is TI's ARM + DSP on-board computing solution. Being able to drive DSPs with OpenCL is quite an awesome capability. I do wish I had one of those boards.

The other capability Shamrock has is to provide a CPU driver for OpenCL on ARM. How does it perform? Good question!

I took my OpenCL-accelerated sqlite prototype and built it to use the Shamrock CPU-only driver. Which would you expect to be faster: a CPU-only OpenCL driver offloading SQL SELECT queries, or the sqlite engine itself?

If you guessed OpenCL running on a CPU-only driver, you're right. Now remember the Samsung ARM-based Chromebook is a dual-core A15. The queries run against a single-table database with 100,000 rows and 7 columns. Lower numbers are better; times are in microseconds.

sql1: sqlite took 43653 microseconds, OpenCL took 17738 microseconds (Shamrock 2.46x faster)
sql2: sqlite took 62530 microseconds, OpenCL took 18168 microseconds (Shamrock 3.44x faster)
sql3: sqlite took 110095 microseconds, OpenCL took 18711 microseconds (Shamrock 5.88x faster)
sql4: sqlite took 143278 microseconds, OpenCL took 19612 microseconds (Shamrock 7.30x faster)
sql5: sqlite took 140398 microseconds, OpenCL took 18698 microseconds (Shamrock 7.5x faster)

The OpenCL numbers are so consistent that I was concerned there was some error in the process. Yet the returned number of matching rows is the same for both the sqlite engine and the OpenCL versions, which helps detect functional problems. I've clipped the result row counts from the output above for brevity.

Frankly, I wasn't expecting this kind of speedup, especially with a CPU-only driver. Yet there it is in black and white. It speaks highly of OpenCL's ability to compute more efficiently when you have data-parallel problems.

Another interesting thing to note in this comparison: the best results so far have been achieved with the Mali GPU using vload/vstore, thus taking advantage of SIMD vector instructions. On a CPU that would equate to using NEON. The Shamrock CPU-only driver doesn't at the moment support vload/vstore, so the compiled OpenCL kernel isn't even using NEON on the CPU to achieve these results.

I’ve posted my initial OpenCL accelerated sqlite prototype code:

Don’t get excited. Remember, it’s a prototype and a quite contrived one at that. It doesn’t handle the general case yet and of course it has bugs. But!  It’s interesting and I think shows what’s possible.

Over at the Mali developer community that ARM hosts, I happened to mention this work in a post that ended up resulting in some good suggestions on the use of vectors, as well as other good feedback. While working with vectors was a bit painful due to some bugs I introduced, I made my way through it and have initial numbers for a couple of kernels, so I can get an idea of just what a difference it makes.


The core of the algorithm for sql1 changes from:

    do {
        if ((data[offset].v > 60) && (data[offset].w < 0)) {
            resultArray[roffset].id = data[offset].id;
            resultArray[roffset].v = data[offset].v;
            resultArray[roffset].w = data[offset].w;
            roffset++;
        }
        offset++;
    } while (offset < endRow);

to:

    do {
        v1 = vload4(0, data1 + offset);
        v2 = vload4(0, data2 + offset);
        r = (v1 > 60) && (0 > v2);
        vstore4(r, 0, resultMask + offset);
        offset += 4;
    } while (offset < totalRows);

With each spin through the loop, the vectorized version is of course operating over 4 values at once to check for a match. Obvious win. To do this the data has to come in as pure columns, and I'm using a vector as essentially a bitmask to indicate whether each row is a match or not. This requires a post-processing loop to spin through and assemble the resulting data into a useful state. For the 100,000 row database I'm using, it doesn't seem to have as much of a performance impact as I thought it might.
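To make that post-processing step concrete, here's a minimal sketch in plain C of how a per-row mask produced by the vectorized kernel can be compacted into a dense result set. The names (resultMask, ids, gather_matches) are illustrative, not from the prototype:

```c
#include <stddef.h>

/* Compact a per-row match mask into a dense list of matching ids.
 * OpenCL vector compares set matching lanes to all-ones (-1), so any
 * nonzero mask entry marks a hit.  Returns the number of matches. */
static size_t gather_matches(const int *resultMask, const int *ids,
                             size_t rows, int *out_ids)
{
    size_t matches = 0;
    for (size_t i = 0; i < rows; i++) {
        if (resultMask[i])
            out_ids[matches++] = ids[i];
    }
    return matches;
}
```

For 100,000 rows this is a single linear pass on the host, which is presumably why its cost stays small relative to the kernel time.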

For the first sql1 test query the numbers look like this:

CPU sql1 took 43631 microseconds
OpenCL sql1  took 14545 microseconds  (2.99x or 199% better)
OpenCL (using vectors) 4114 microseconds (10.6x better or 960%)

Not bad. sql3 sees even better results:

CPU sql3 took 111020 microseconds
OpenCL sql3 took 44533 microseconds (2.49x  or 149% better)
OpenCL (using vectors) took 4436 microseconds (25.02x or 2402% better)

There's another factor in why these vectorized versions do better: with the newer code I'm using fewer registers on the Mali GPU and thus am able to up the number of work units from 64 to 128.

I do have one bug that I need to track down. I am (of course) validating that all the versions are coming up with the same matches. The new vector versions are off by a couple of rows. The missing rows don’t seem to follow a pattern. I’m sure I’ve done something dumb. Now that there is the ability for more eyes on the code perhaps someone will spot it.

People have side projects. This one is mine.

What if you accelerate the popular sqlite database with OpenCL? This is one of the ideas that was floated as part of the GPGPU team to get a feel for what might be accomplished on ARM hardware with a mobile GPU.

In my case I'm using the Mali OpenCL drivers, running Ubuntu Linux on a Samsung Chromebook, which includes a dual-core A15 and a Mali T604. You can replicate this same setup by following these instructions.

At Linaro Connect Asia 2014 as part of the GPGPU session I gave an overview of the effort but I wasn’t able to give any initial performance numbers since my free time is highly variable and Connect arrived before I was quite ready. At the time I was about a week out from being able to run a microbenchmark or two since I was just getting to the step of writing some of the OpenCL.

Before I get to some initial numbers let me review a bit of what I talked about at Connect.

To accelerate sqlite I've initially added an API that sits next to the sqlite C API. In time my API should be able to blend right into the sqlite API so that no code changes would be needed by end-user applications. With sqlite you usually have a call sequence something like:

sqlite3_open(databaseName, &db);
c = sqlite3_prepare_v2(db, sql, -1, &selectAll_statement, NULL);
while (sqlite3_step(selectAll_statement) == SQLITE_ROW) {
    /* read columns from the current result row */
}

The prepare call takes SQL and converts it to an expression tree, which is translated into bytecode that is run inside a VM. The virtual machine is really nothing more than a big switch statement, where each case handles an opcode the VM operates over. sqlite doesn't do any sort of JIT to accelerate its operation. (I know what you're thinking; hold that thought.)
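To illustrate the shape of such a dispatch loop, here's a toy bytecode VM reduced to its essence. The opcodes and program layout below are invented for illustration; sqlite's real VM (the VDBE) has a great many opcodes but the same basic structure:

```c
/* A toy bytecode VM: one big switch, one case per opcode. */
enum { OP_PUSH, OP_ADD, OP_HALT };

static int run_vm(const int *code)
{
    int stack[16];
    int sp = 0;     /* stack pointer */
    int pc = 0;     /* program counter */

    for (;;) {
        switch (code[pc++]) {
        case OP_PUSH:
            stack[sp++] = code[pc++];   /* operand follows the opcode */
            break;
        case OP_ADD:
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_HALT:
            return stack[sp - 1];
        }
    }
}
```

A program like {OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT} evaluates 2 + 3 through that switch, one opcode per iteration, which is essentially how a SELECT executes inside sqlite's VM.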

The challenge in making a general-purpose acceleration is to take the operation of the VM and move it onto the GPU. I see a few ways to accomplish this. In past work, Peter Bakkum and Kevin Skadron basically moved the implementation of the VM onto the GPU using CUDA. In my opinion that kind of approach doesn't really work for OpenCL. Instead I'm currently of the opinion that the output of the SQL expression tree ought to be a bit more than just VM bytecodes. I do wonder whether utilizing LLVM couldn't offer interesting possibilities, including SPIR (the Khronos intermediate representation standard for OpenCL). Further research for sure.

The opencl accelerated API sequence looks like:

opencl_init(s, db);
opencl_prepare_data(s, sql);
opencl_select(s, sql, 0);

At this point, what I've managed to do is run the same query against a 100,000 row, 7 column database using both the sqlite C interface and my OpenCL-accelerated interface.

With the sqlite C API the query took 420274 microseconds on my Chromebook, a dual-core A15 running at 1.7 GHz.

The OpenCL-accelerated API running on the Mali T604 GPU at 533 MHz(?) in the same Chromebook yields 110289 microseconds. This measured time includes both the running of the OpenCL kernel and the data transfer from the result buffer.

These are early results. Many grains of salt should be applied, but overall this seems like a good result for a mobile GPU.

Updated June 3rd.

Here are the instructions for building your own OpenEmbedded-based aarch64 image that can run an Xfce-based graphical environment. Consider this a draft that will evolve, as there are some hacky bits and some steps that I want to make disappear over time.

mkdir openembedded

cd openembedded

git clone git://

cd jenkins-setup

sudo bash

Edit the script and delete the very last line, which is a call to bitbake. bitbake starts a build, and we want to delay that for a bit.


  • Pull from my xfce branch

cd meta-linaro
git remote add linaro git://
git fetch linaro
git checkout -b xfce linaro/xfce

cd ../meta-openembedded

git remote add linaro git://
git fetch linaro
git checkout -b aarch64 linaro/aarch64

cd ../openembedded-core

git remote add linaro git://
git fetch linaro
git checkout -b aarch64 linaro/aarch64

cd ..

  • Next we need to expand the set of recipes the build will use.

cd build

edit conf/bblayers.conf and add

BBLAYERS += "/your-full-path/jenkins-setup/meta-linaro/meta-browser"
BBLAYERS += "/your-full-path/jenkins-setup/meta-openembedded/meta-xfce"
BBLAYERS += "/your-full-path/jenkins-setup/meta-openembedded/meta-gnome"

  • Now it’s time to build

cd openembedded-core
. oe-init-build-env ../build
cd ../build
bitbake linaro-image-xfce

  • The output of the build is in the build directory under tmp-eglibc/deploy/images.
  • Package the resulting rootfs into an image using linaro-media-create. This implies you have a current copy of linaro-image-tools from bzr (bzr clone lp:linaro-image-tools). You also need the hwpack from the first step.

~/bzr/linaro-image-tools/linaro-media-create --dev fastmodel \
--image_size 2000M --output-directory gfx --binary \
linaro-image-xfce-genericarmv8-20130524144429.rootfs.tar.gz --hwpack

We’ll do some work to smooth this out and get rid of this step and use the OE built kernel.

  • Boot the rtsm model

I use a script for running with either the Foundation model or the RTSM model. Note I keep the Foundation model and the RTSM models inside of ~/aarch64.




#!/bin/sh

if [ -z $1 ]; then
echo "Usage: aarch64-sim KERNEL ROOTFS foundation|rtsm"
exit 1
fi
kernel=`realpath $1`

if [ ! -z $2 ]; then
rootfs=`realpath $2`
fi

if [ ! -z $3 ]; then
model=$3
fi

case $model in
foundation)
if [ ! -z $rootfs ]; then
rootfs=" --block-device $rootfs"
fi
# sudo ip tuntap add tap0 mode tap
# sudo ifconfig tap0
~/aarch64/Foundation/Foundation_v8pkg/Foundation_v8 --image $kernel --network nat $rootfs
;;
rtsm)
if [ ! -z $rootfs ]; then
rootfs=" -C motherboard.mmc.p_mmc_file=$rootfs "
fi
~/aarch64/RTSM_AEMv8_VE/bin/model_shell64 \
-a $kernel \
$rootfs
;;
esac

I put this in a script (don't forget to chmod +x) and run it as: aarch64-sim gfx/img.axf gfx/sd.img rtsm

  • After the Linux system has booted you need to run the following by hand.

fc-cache -f -v

pango-querymodules > /etc/pango/pango.modules

  • and now finally run:


That's it for now!

Here’s another useful example with the Android build system.

How do you install a number of files in the build system without actually specifying the files?

In this case, I have a single test case that uses lots and lots of data files. Who wants to name each and every data file? I don't! I have better things to do! Let's look at an example:

LOCAL_PATH := $(call my-dir)

piglit_top := $(LOCAL_PATH)/../../..

piglit_shared_libs := libGLESv2 \
 libwaffle-1 \
 libpiglitutil_gles2

piglit_c_includes := $(piglit_top)/tests/util \
 bionic \
 $(piglit_top)/src \
 external/waffle/include/waffle \
 external/mesa3d/include

include $(CLEAR_VARS)
LOCAL_SHARED_LIBRARIES := $(piglit_shared_libs)
LOCAL_C_INCLUDES := $(piglit_c_includes)
LOCAL_MODULE := glslparsertest_gles2
LOCAL_SRC_FILES := glslparsertest.c
include $(BUILD_EXECUTABLE)
systemtarball: glslparsertest_gles2

define all-vert-frag-files-under
$(patsubst ./%,%, \
 $(shell cd $(1) ; \
 find $(2) -name "*.vert" -or -name "*.frag" -and -not -name ".*" -printf "%P\n" ))
endef

define glsl2_add_test_data
include $(CLEAR_VARS)
LOCAL_MODULE := $1
LOCAL_MODULE_CLASS := DATA
LOCAL_MODULE_PATH := $(TARGET_OUT_DATA)/glsl2
LOCAL_SRC_FILES := glsl2/$1
$(warning $(LOCAL_SRC_FILES))
include $(BUILD_PREBUILT)
datatarball: $1
endef

glsl2_files := $(call all-vert-frag-files-under, external/piglit/tests/glslparsertest/glsl2, .)
$(foreach item,$(glsl2_files),$(eval $(call glsl2_add_test_data,$(item))))

Walking you through: all-vert-frag-files-under contains the find which looks for those files. The next define, glsl2_add_test_data, creates a BUILD_PREBUILT block for each one of the data files. After the defines, we call the find, naming the directory to look in. The line after goes through the contents of glsl2_files, which holds the list of data files the find found, iterating through each entry and passing each $(item) as a parameter to the BUILD_PREBUILT block.

Simple as that, but given I couldn’t find any documentation or examples, I thought I’d save you the trouble. Enjoy!

Or how to do some fairly useful things with the Android build system that were not intuitively obvious to me.

I've been working with Android a fair amount this week, both the upstream Android Open Source Project sources and the Linaro sources. For both I've been getting piglit to build, which has required me to construct a number of files. In this case, piglit and waffle are the two packages that need to build: waffle for its library, and piglit for its library, test cases and test case data.

How do you build a number of binaries in a loop using Android's build system?

I wondered the same. As the Google examples go, for each binary you want to build you need to specify several LOCAL_* values. If you had to do this for each and every test binary in piglit, it'd be quite tedious. Here's an example of how to do it the easy way:

include $(call all-subdir-makefiles)

LOCAL_PATH := $(call my-dir)

piglit_top := $(LOCAL_PATH)/../../..
piglit_shared_libs := libGLESv2 libwaffle-1 libpiglitutil_gles2
piglit_c_includes := $(piglit_top)/tests/util \
 bionic \
 $(piglit_top)/src \
 external/waffle/include/waffle \
 external/mesa3d/include

module_name = piglit-spec-gles2

define $(module_name)_etc_add_executable
include $(CLEAR_VARS)
LOCAL_SHARED_LIBRARIES := $(piglit_shared_libs)
LOCAL_C_INCLUDES := $(piglit_c_includes)
LOCAL_CFLAGS := $(piglit_c_flags)
LOCAL_SRC_FILES := $1.c
LOCAL_MODULE := $1_gles2
include $(BUILD_EXECUTABLE)
systemtarball: $1_gles2
endef

test_names := invalid-es3-queries minmax
$(foreach item,$(test_names),$(eval $(call $(module_name)_etc_add_executable,$(item))))

It’s fairly straightforward once you have the solution but getting there isn’t exactly documented in the google docs from what I could see. So let me talk you through it.

At the top, I have the list of shared libs that will be linked in and the directories that should be searched.

Within define $(module_name)_etc_add_executable is the set of values we need to specify for each test case. Observe that $1 is a parameter passed in. We have to keep to a pattern where test_case.c ends up as the name of the test_case binary, but who cares, that's quite acceptable.

Last test_names is the list of all the test cases we’re going to build. Followed last but not least by the loop that iterates over the test case list.

Next. Data!

How do you copy test case user data to the data partition of the image?

Again this was something I scratched my head about, looking through the docs, googled often and came up empty. It’s not hard but it’s not necessarily intuitive.

LOCAL_PATH := $(call my-dir)
include $(CLEAR_VARS)

LOCAL_MODULE := sanity.shader_test
LOCAL_MODULE_CLASS := DATA
# installs under /data; the subdirectory name here is illustrative
LOCAL_MODULE_PATH := $(TARGET_OUT_DATA)/shader_tests
LOCAL_SRC_FILES := sanity.shader_test
include $(BUILD_PREBUILT)
userdatatarball: $(LOCAL_MODULE)

In this example there is but one data file. But I think you can see where it’s going. Combine with the loop example above and things get quite useful.

Some things to point out: we need BUILD_PREBUILT since we're not building anything. LOCAL_SRC_FILES is our list of data. LOCAL_MODULE_PATH is where to install to. TARGET_OUT_DATA is /data. It will create the directories for you if they don't exist.

userdatatarball: is a rule to make sure that the data file ends up in the tarball. I couldn't figure out a way to avoid this, though perhaps there is one; I didn't run across it.

So there you have it. There might be bugs yet. There might be a better more official way. Based on what I figured out this is what works for me.

1047, to be exact.

Hmm? 1047 Piglit tests running on ARM Linux.

What’s piglit?  Piglit is a test suite for OpenGL. For the most part it’s used for testing Mesa which is the open source implementation of OpenGL and OpenGL ES. It’s used on Android and Linux. If you’re writing a GL or GLES driver Piglit is a way to run graphical tests to check your implementation.

Just 60 days ago, there were basically zero OpenGL ES 2.0 tests that ran. None. I've been working to fix that, largely concentrating on the glslparser and shader_runner tests, both stressing GLSL ES 1.00, which is an important component of OpenGL ES 2.0. To put it into perspective, a full piglit run of all OpenGL tests on my Intel box is about 6,900 tests.

Also, the piglit team is in the midst of doing something rather fun: converting over to the use of waffle. What's waffle? Waffle is a way to defer until runtime which variation of the GL API you're going to use (GL, GLES) as well as which windowing system (X11, Wayland, etc). So one could, for instance, do complete test runs for GLES 1.1, 2.0 and 3.0 in time without a recompile.

Thus far the code in question is not yet all upstream and things are still under rapid development but via my git tree you’re able to follow along and run. I’m not running on Android quite yet but that is being worked on by myself and others.


Then you’ll want the gles2-all branch.

git checkout -b my-copy origin/gles2-all

(Be warned, the gles2-all branch is fairly dynamic)

Before you build piglit you’ll need to obtain, build and install waffle. Grab it from git://

See the README.txt for instructions as to how to build and install.

Turning our attention back to piglit. Make sure you read the README.txt and that the software prereqs are installed. Then:

ccmake .

Turn on just the gles2 tests. Hit 'c' to configure, then hit 'g' to generate, and finally run make to build.



./ tests/all_es2.tests results/es2

and format the results.

./ sum/es2 results/es2

After that point your web browser to the sum directory that you’ve created to see the results.

What's next? Getting this all running on Android, getting it integrated into our LAVA automated test system, and of course more tests.

java vs c on ARM

Posted: July 11, 2012 in linaro, open_source

To go hand in hand with my initial Hadoop experiments, one of the questions on my mind is the state of Java on ARM; specifically, how is the performance of OpenJDK?


I've picked SciMark2, which has both a C version and a Java implementation. This allows us to compare the two for each algorithm implemented in SciMark2. We should be able to measure the ratio of C results to Java results on the same architecture and then compare those ratios between the two architectures. If the ratio is greater on ARM than on Intel, then we know that the implementation of Java on ARM needs improvement.

In the case of SciMark2 there are 5 components to the benchmark, all measured in MFLOPS.

  1. FFT (1024)
  2. SOR (100×100)
  3. MonteCarlo
  4. Sparse matmult (N=1000, nz=5000)
  5. LU (100×100)

I ran 10 measurements of the C version and the Java version on ARM, and then the same on Intel x86_64. The C version is built with -O3. In the case of Java, we trust that the JIT will do the right thing.

The JDK measured on intel x86_64 is:

java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.1) (6b24-1.11.1-4ubuntu3)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

The JDK measured on ARM is:

java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.1) (6b24-1.11.1-4ubuntu3)
OpenJDK Zero VM (build 20.0-b12, mixed mode)

gcc on ARM is version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)

gcc  on Intel x86_64 is version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)

The Intel x86_64 machine is an i5 with 16 gigs of memory running Ubuntu 12.04. The ARM machine is an OMAP4 TI PandaBoard ES with 1 gig of memory.


The spreadsheet includes calculations for standard deviation, the 95% confidence interval using Student's t with 9 degrees of freedom, and last the ratios of C to Java performance on each platform.

Given the large difference in the ratios observed between ARM and Intel, I believe the evidence from this benchmark supports the assertion that OpenJDK on ARM needs work. For each of the tests on Intel, the ratio between C and Java performance is reasonably close to 1:1, or the same performance, yet on ARM the ratios between C and Java for the same code yield multiples ranging from 3.6x to 8.9x.

nginx & Calxeda

Posted: June 22, 2012 in images, linaro, open_source, server

In progress I have an update to the Linaro based server image I’ve created. It fixes a couple of notable bugs.

  1. linaro-media-create would fail due to an already installed kernel.
  2. openssh-server removed for now – the package, while previously installed, wouldn't have its keys generated, so unless you knew to manually generate your keys, slogin and friends would fail in unhelpful ways.

Besides the update to this LAMP image, I've created another image which replaces Apache with nginx. Never heard of nginx? Read more about it here.

Also of note Calxeda has posted ApacheBench numbers using their new chips. That can be found here.

The new LAMP server image is located here. It is as yet untested.

The nginx-based image isn't complete.

Today I went looking for the Linaro server image that I had put together last year and well .. umm .. err … yeah, not there. Now don't get upset. Maintaining lots of different reference images takes time, effort and resources. It's cool.

Rolling up my sleeves, I took the time to create a version of the past server image using armhf with Precise. I've test-booted it on my Panda boards. It works. Tomorrow I intend to run ApacheBench against it.

The live-build config is located at : lp:~tom-gall/linaro/live-helper.config.precise.server

I've cross-built this on my Intel box with the a45 version of Linaro's live-build. You can too.

If you’ve never cross built an image before you can find instructions here.

Or if I have a little time tomorrow I’ll post the image somewhere so you don’t have to rebuild it.

Updated 6/19 : Server image can be downloaded from here.