There is an interesting article I ran across today. Entitled “Why would anybody buy an Apple Watch?”, the article asks an interesting question through the lens of history. In 2007, many predicted that the iPhone would fail, and they had plenty of data to back up their stories. Based on the data available at the time, they were all right: it should have been a complete and utter failure. But none of that data took into account the human condition: the experience of being exposed to a mobile device with converged functionality and a multi-touch display. People liked it, and smartphones across the board evolved in a new direction. How many years did it take Apple to get to that point?

Next week Linaro Connect begins. Many experts within the ARM ecosystem will assemble in Burlingame, California to interact and set plans for the next six months of engineering activity. Our collective job is not just to predict the future; it is to implement it.

At the heart of Open Source development is the mantra of release early, release often. Apple does not do this. They work and work and work and work some more and eventually release something. Open Source on the other hand iterates quickly. We strive to hit the stage where the human condition can be exposed to a design and implementation as soon as possible and subject our work to the rigors of many eyes so that evolutionary dead ends don’t last long.

The longer you wait to release something, the larger your risk.

Member companies that join Linaro are at an advantage. Through their membership they live at the nexus of fast, iterative upstream engineering united with technical leadership. Failure happens. The faster you can fail by exposing the code to experts, the more you lower your risk and the quicker, through iteration, you get onto the right track. Our members in turn are the first to receive the fruits of those labors for their future products.

At a website called Kickstarter, inventors bring their ideas and expose them to a marketplace where people evaluate and fund the promising inventions.

Linaro is like Kickstarter, but better for our member companies. The ideas flow in from our members and engineering teams, they are discussed at Connect and beyond, great engineering happens, and the promising idea becomes the next great thing. At Kickstarter you don’t get to influence the design; at Linaro a member company does.

See you at Connect. It’s going to be a great week.

Back to Gentoo

Posted: July 12, 2014 in Uncategorized

Back in 2003 I became a Gentoo developer. I had been using Gentoo prior to that as my Linux distro since it had good amd64 hardware support pretty much out of the gate. I had pieced together an amd64 box, and at the time I thought trying out a new Linux distro was a good idea.

Then I worked really hard on getting ppc64 up and running. At that time, while you could run 64-bit kernels on Power and ppc64 hardware, the user space was pretty much all 32-bit.

Gentoo today in 2014 is still, in my opinion, a good distro. There are essentially two modes of operation: you either build a package at the time you install it, or you install from prebuilt binary packages.

As an open source developer I treasure the ability to easily install and test anything from source. Further, I very much enjoy being able to change compilation options, fiddling with -O3, -mtune, and similar flags to test out new compilers and see how performance improvements in codegen are coming along. I find it a much better environment for this than OpenEmbedded.
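For illustration, this kind of compiler-flag fiddling lives in Portage’s /etc/portage/make.conf; the flags below are an assumed example for an ARM box, not a recommendation:

```shell
# /etc/portage/make.conf -- illustrative flags only
CFLAGS="-O3 -mtune=cortex-a15"   # experiment with optimization/tuning here
CXXFLAGS="${CFLAGS}"             # keep C and C++ flags in sync
MAKEOPTS="-j2"                   # parallel build jobs
```

After changing the flags, rebuilding a single package (for example with `emerge --oneshot <pkg>`) is enough to see how the new codegen behaves.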

For me, I’ve been adding arm64 support to gentoo and this will be my primary focus in my “copious spare time.”

Both the Samsung Gear Live and the LG G Watch are first-generation Android Wear hardware and software implementations. I don’t have either. They are about the cost of a dev board, so in the grand scheme it’s not necessarily hard for a developer to justify the cost to leap in and get involved.

From the WSJ review by Joanna Stern, it feels like as an industry we had best roll up our sleeves and get to work optimizing:

Performance wise, the Samsung edged out the LG, which tended to stutter and lag. And for their bulk, both watches’ battery lives should be better. They had to be charged at least once a day in proprietary charging cradles.

Really, when you think about it, this is far more than just a wearables problem: we’ve got to evolve mobile devices so that a daily charge cycle isn’t the norm.


Linaro Mobile Group

Posted: July 8, 2014 in aarch64, android, linaro

I’m pleased to say I’ve taken on the responsibility to help get the newly formed Linaro Mobile Group off the ground. Officially I’m the Acting Director of LMG. In many ways the group is actually old, as advancement of the ARM ecosystem for mobile has always been, and continues to be, a top goal of Linaro. What is happening is that we’ve refined the structure so that LMG functions like the other segment groups in Linaro. Linaro already has groups for Enterprise (LEG), Networking (LNG), and Home Entertainment (LHG), so it makes perfect sense that Mobile was destined to become a group.

I am quite grateful to Kanta Vekaria for getting the LMG’s steering committee up and running. This committee, called MOBSCOM, started last fall and will morph into LMG-SC, the SC of course being short for Steering Committee. Members of LMG are in the driver’s seat, setting LMG’s direction. It’s my job to run the day-to-day and deliver on the goals set by the steering committee.

Mobile is our middle name and also a big term. For LMG, our efforts are largely around Android. This is not to say that embedded Linux or other mobile operating systems like ChromeOS aren’t interesting; they are. We have performed, and will continue to perform, work that can be applied across more than one ARM-based mobile operating system. Our media library optimization work using ARM’s NEON SIMD instructions is a very good example. Android is the top priority and the main focus, but it’s not all we do.

It’s a great time to form LMG. For a number of years, June has brought many gifts for developers in mobile with Apple’s WWDC and Google I/O. Competition is alive and well between these two environments, which in turn fuels innovation. It challenges us as developers to continue to improve. It also makes the existence of Linaro all the more important: it benefits all our members to collaborate and accomplish engineering goals together instead of each shouldering the cost of improving Android alone.

Android64 is a great example. We were quite excited to make Android64 available for Juno, ARM’s ARMv8 reference board, as well as a version for running in software emulators like QEMU. We also did quite a bit of work in QEMU so that it could emulate ARMv8 hardware. The world doesn’t need 20 different Android64 implementations. The world DOES need one great Android64, and in this way the collaborative environment in and around Linaro is important. While the 06.14 release of Android64 for Juno by Linaro is just a first release with much still to do, things are on track for some great products from our member companies in the months ahead.

Stay tuned!

Android64 on ARM’s Juno

Posted: July 2, 2014 in Uncategorized

I’m very pleased to point to the announcement of the initial Android64 release by Linaro for ARM “Juno” hardware.

The Linaro Android team has been working very hard on this for some time and a very big congratulations is due to them.

It speaks volumes about what a team of companies who work together can achieve. Linaro is a very special player in the ARM ecosystem and I’m very pleased to be a part of it.

What other fun things might be running on Juno? :-D Stay tuned.

I’ve been working on moving the OpenCL accelerated sqlite prototype toward being able to support the general case instead of just the contrived set of initial SQL SELECTs.

First, why did I have to start out with a contrived set of SQL SELECTs to accelerate? Consider:

SELECT id, uniformi, normali5 FROM test WHERE uniformi > 60 AND normali5 < 0

For a query, we need to have the equivalent in OpenCL. For the prototype I hand-coded these OpenCL kernels and called them with the data obtained from the sqlite infrastructure. I had to start somewhere, and I thought a series of SQL statements to shake out patterns for generation would be the best path to validate this idea.

The next evolutionary step is to generate an OpenCL kernel by reading the parse tree that sqlite generates as it pulls apart the SQL statement.
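As a minimal sketch of that step, the WHERE clause of the parse tree can be walked recursively to emit the OpenCL comparison expression. The tuple-based tree shape and the `emit` helper below are assumptions for illustration, not sqlite’s actual parse-tree API:

```python
# Hypothetical sketch: turn a tiny WHERE-clause expression tree into
# the body of an OpenCL comparison. Node shape: (op, lhs, rhs).

def emit(node):
    """Recursively emit an OpenCL expression for a parse-tree node."""
    op, lhs, rhs = node
    if op in ("AND", "OR"):
        glue = "&&" if op == "AND" else "||"
        return "( %s %s %s )" % (emit(lhs), glue, emit(rhs))
    # Leaf comparison: a column name against a constant.
    return "( %s %s %s )" % (lhs, op, rhs)

# WHERE uniformi > 60 AND normali5 < 0
tree = ("AND", (">", "uniformi", "60"), ("<", "normali5", "0"))
print(emit(tree))
# -> ( ( uniformi > 60 ) && ( normali5 < 0 ) )
```

The real generator also has to map column names to kernel arguments and wrap the expression in the surrounding load/store boilerplate, but the recursive walk is the core idea.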

This is what a machine generated kernel looks like for previously mentioned SQL statement:

__kernel void x2_entry (__global int * id, __global int * uniformi, __global int * normali5, __global int * _cl_resultMask) {
    __private int4 v0;
    __private int4 v1;
    __private int4 v2;
    __private int4 _cl_r;
    int i = get_global_id(0);
    /* totalRows and workUnits are substituted in at generation time */
    size_t offset = i * (totalRows / workUnits);
    size_t end = offset + (totalRows / workUnits);
    do {
        v0 = vload4(0, id + offset);
        v1 = vload4(0, uniformi + offset);
        v2 = vload4(0, normali5 + offset);
        _cl_r = ((v1 > 60) && (v2 < 0));
        vstore4(_cl_r, 0, _cl_resultMask + offset);
        offset += 4;
    } while (offset < end);
}

Why are we generating OpenCL kernel source here? Isn’t there a better way? Well, there is. In later versions of the OpenCL standard (and in HSA) there is an intermediate representation (IR), very much akin to what compilers translate high-level languages into before targeting the native instruction set of whatever hardware the code will run on.

Unfortunately OpenCL’s IR, otherwise known as SPIR, isn’t available to us, since the OpenCL drivers for ARM’s Mali currently don’t support it. Imagination’s PowerVR drivers don’t either. (Heck, Imagination requires an NDA to be signed just to get their drivers; talk about unfriendly!) They might someday, but that day isn’t today. Likewise, HSA has an IR as part of its standard, called HSAIL.

Either one would be much better to emit, of course, presuming the OpenCL drivers could take that IR as input.

Nonetheless, as soon as I have “parity” with the prototype and a little testing, I’ll commit the code that machine-generates these OpenCL kernels to git. I’m getting close. The next step after that will be to make a few changes internal to sqlite to use those kernels.

Within the GPGPU team, Gil Pitney has been working on Shamrock, an open source OpenCL implementation. It’s really a friendly fork of the Clover project, but taken in a bit of a new direction.

Over the past few months Gil has updated it to make use of LLVM’s new MCJIT, which works much better for ARM processors. Further, he’s updated Shamrock so that it uses current LLVM; I have a build based on 3.5.0 on my Chromebook.

Another part of Gil’s Shamrock work is that it will, in time, also be able to drive Keystone hardware, TI’s ARM + DSP onboard computing solution. Being able to drive DSPs with OpenCL is quite an awesome capability. I do wish I had one of those boards.

The other capability Shamrock has is to provide a CPU-only driver for OpenCL on ARM. How does it perform? Good question!

I took my OpenCL accelerated sqlite prototype and built it to use the Shamrock CPU-only driver. Would you expect a CPU-only OpenCL driver offloading SQL SELECT queries to be faster, or the stock sqlite engine?

If you guessed the OpenCL CPU-only driver, you’re right. Remember that the Samsung ARM-based Chromebook is a dual-core A15. The queries run against 100,000 rows in a single-table database with 7 columns. Lower numbers are better; times are in microseconds.

sql1: sqlite 43653 µs; hand-coded OpenCL via Shamrock 17738 µs (2.46x faster)
sql2: sqlite 62530 µs; hand-coded OpenCL via Shamrock 18168 µs (3.44x faster)
sql3: sqlite 110095 µs; hand-coded OpenCL via Shamrock 18711 µs (5.88x faster)
sql4: sqlite 143278 µs; hand-coded OpenCL via Shamrock 19612 µs (7.30x faster)
sql5: sqlite 140398 µs; hand-coded OpenCL via Shamrock 18698 µs (7.51x faster)
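The speedup figures are simply the ratio of the sqlite time to the OpenCL time; for example, for sql1:

```python
# Speedup = sqlite time / OpenCL time, using the sql1 numbers above.
sqlite_us = 43653   # sqlite engine, microseconds
opencl_us = 17738   # OpenCL via Shamrock CPU driver, microseconds
speedup = sqlite_us / opencl_us
print(round(speedup, 2))  # -> 2.46
```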

These numbers for running on the CPU are so consistent that I was concerned there was some error in the process. Yet the number of matching rows returned is the same for both the sqlite engine and the OpenCL versions, which helps rule out functional problems. I’ve clipped the result row counts from the output above for brevity.

I wasn’t frankly expecting this kind of speed up, especially with a CPU only driver. Yet there it is in black and white. It does speak highly of the capabilities of OpenCL to be more efficient at computing when you have data parallel problems.

Another interesting thing to note in this comparison: the best results so far have been achieved with the Mali GPU using vload/vstore, thus taking advantage of SIMD vector instructions. On a CPU this would equate to using NEON. The Shamrock CPU-only driver doesn’t at the moment have vectorized support for vload/vstore, so the compiled OpenCL kernel isn’t even using NEON on the CPU to achieve these results.