Back to Gentoo

Posted: July 12, 2014 in Uncategorized

Back in 2003 I became a gentoo developer. I had been using gentoo prior to that as my Linux distro since it had good amd64 hardware support pretty much out of the gate. I had pieced together an amd64 box and at the time I thought trying out a new Linux distro was a good idea.

Then, I worked really hard on getting ppc64 up and running. At that time, while you could run 64 bit kernels on Power and ppc64 hardware, the user space was pretty much all 32 bit.

Gentoo today in 2014 is still in my opinion a good distro. There are essentially two modes of operation where you either build a package at the time you install it, or you can install from binaries via http://www.sabayon.org/.

As an open source developer I treasure the ability to easily install and test anything from source. Further, I very much enjoy being able to change compilation options, fiddling with -O3, -mtune and the like, to test out new compilers and see how performance improvements in codegen are coming along. I find it a much better environment than Open Embedded.

For me, I’ve been adding arm64 support to gentoo and this will be my primary focus in my “copious spare time.”

Both the Samsung Gear Live and LG G Watch are first generation Android Wear hardware and software implementations. I don’t have a copy of either. They are about the cost of a dev board, so in the grand scheme of things it’s not necessarily hard for a developer to justify the cost to leap in and get involved.

From the WSJ review by Joanna Stern it feels like as an industry we best roll up our sleeves and get to work optimizing:

Performance wise, the Samsung edged out the LG, which tended to stutter and lag. And for their bulk, both watches’ battery lives should be better. They had to be charged at least once a day in proprietary charging cradles.

Really, when you think about it, this is far more than just a wearable problem. We’ve got to evolve mobile devices so a daily charge cycle isn’t the norm.

 

Linaro Mobile Group

Posted: July 8, 2014 in aarch64, android, linaro

I’m pleased to say I’ve taken on the responsibility to help get the newly formed Linaro Mobile Group off the ground. Officially I’m the Acting Director of LMG. In many ways the Group is actually old as advancement of the ARM ecosystem for Mobile has always been and continues to be a top goal of Linaro. What is happening is we’ve refined the structure so that LMG will function like the other segment groups in Linaro. Linaro has groups formed for Enterprise (LEG), Networking (LNG), Home Entertainment (LHG), so it makes perfect sense that Mobile was destined to become a group.

I am quite grateful to Kanta Vekaria for getting the LMG’s steering committee up and running. This committee, called MOBSCOM, started last fall and will morph to be called LMG-SC, the SC of course being short for Steering Committee. Members of LMG are in the driver’s seat, setting LMG direction. It’s my job to run the day to day and deliver on the goals set by the steering committee.

Mobile is our middle name and also a big term. For LMG, our efforts are largely around Android. This is not to say that embedded Linux or other mobile operating systems like ChromeOS aren’t interesting. They are. We have and will continue to perform work that can be applied across more than one ARM based mobile operating system. Our media library optimization work using ARM’s SIMD NEON is a very good example. Android is top priority and the main focus, but it’s not all we do.

It’s a great time to form LMG. June, for a number of years, has brought many gifts for developers in mobile with Apple’s WWDC and Google I/O. Competition is alive and well between these two environments, which in turn fuels innovation. It challenges us as developers to continue to improve. It also makes the existence of Linaro all the more important. It benefits all our members to collaborate and accomplish engineering goals together instead of each shouldering the engineering costs to improve Android.

Android 64 is a great example. We were quite excited to make available Android64 for Juno, ARM’s armv8 reference board as well as a version for running in software emulators like qemu. We also did quite a bit of work in qemu so that it could emulate armv8 hardware. The world doesn’t need 20 different Android 64 implementations. The world DOES need one great Android 64 and in this way the collaborative environment in and around Linaro is important. While the 06.14 release of Android64 for Juno by Linaro is just a first release with much to do yet, things are on track for some great products by our member companies in the months ahead.

Stay tuned!

Android64 on ARM’s Juno

Posted: July 2, 2014 in Uncategorized

I’m very pleased to point to the announcement of the initial Android64 release by Linaro for ARM “Juno” hardware: http://www.linaro.org/news/aosp-on-64-bit/

The Linaro Android team has been working very hard on this for some time and a very big congratulations is due to them.

It speaks volumes about what a team of companies who work together can achieve. Linaro is a very special player in the ARM ecosystem and I’m very pleased to be a part of it.

What other fun things might be running on Juno? :-D Stay tuned.

I’ve been working on moving the OpenCL accelerated sqlite prototype toward being able to support the general case instead of just the contrived set of initial SQL SELECTs.

First, why did I have to start out with a contrived set of SQL SELECTs to accelerate? Consider:

SELECT id, uniformi, normali5 FROM test WHERE uniformi > 60 AND normali5 < 0

For a query we need to have the equivalent in OpenCL. For the prototype I hand coded these OpenCL kernels and called the kernels with the data as obtained from the sqlite infrastructure.  I had to start somewhere. A series of SQL statements to try and shake out patterns for generation I thought would be the best path to validate this idea.
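
To make "the equivalent in OpenCL" a bit more concrete, here is a minimal Python sketch (not the prototype's code) that evaluates the same WHERE clause column by column and produces one flag per row, which is roughly the shape of work a kernel does. Only the table and column names come from the query above; the row data, its distributions, and the helper name where_mask are made up for illustration.

```python
import random
import sqlite3

def where_mask(uniformi, normali5):
    """Return one 0/1 flag per row for: uniformi > 60 AND normali5 < 0."""
    return [int(u > 60 and n < 0) for u, n in zip(uniformi, normali5)]

# Build a small in-memory table with made-up data (assumed distributions).
random.seed(1)
rows = [(i, random.randint(0, 99), int(random.gauss(0, 5))) for i in range(1000)]
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE test (id INTEGER, uniformi INTEGER, normali5 INTEGER)")
db.executemany("INSERT INTO test VALUES (?, ?, ?)", rows)

# What the sqlite engine returns for the query.
sql_ids = [r[0] for r in db.execute(
    "SELECT id, uniformi, normali5 FROM test "
    "WHERE uniformi > 60 AND normali5 < 0 ORDER BY id")]

# The data-parallel view: each row is tested independently of every other row.
ids = [r[0] for r in rows]
mask = where_mask([r[1] for r in rows], [r[2] for r in rows])
mask_ids = [i for i, m in zip(ids, mask) if m]

print(len(sql_ids), "rows from sqlite,", len(mask_ids), "rows from the mask")
```

Because no row depends on any other row, the mask can be computed in any order and in parallel, which is what makes this kind of query a candidate for offloading.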

The next evolutionary step is to generate an OpenCL kernel by reading the parse tree that sqlite generates as it pulls apart the SQL statement.

This is what a machine generated kernel looks like for the previously mentioned SQL statement:

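// totalRows and workUnits are not declared in this listing; it assumes they are
// supplied as build-time constants (for example as -D defines).
__kernel void x2_entry (__global int * id, __global int * uniformi, __global int * normali5, __global int * _cl_resultMask) {
    __private int4 v0;
    __private int4 v1;
    __private int4 v2;
    __private int4 _cl_r;
    int i = get_global_id(0);
    // Each work item handles one contiguous slice of the rows.
    size_t rowsPerItem = totalRows / workUnits;
    size_t offset = i * rowsPerItem;
    size_t end = offset + rowsPerItem;
    do {
        // Load four rows of each column at a time.
        v0 = vload4(0, id + offset);
        v1 = vload4(0, uniformi + offset);
        v2 = vload4(0, normali5 + offset);
        // Evaluate the WHERE clause on the loaded vectors.
        _cl_r = (( v1  >  60 ) && ( v2  <  0 ));
        vstore4(_cl_r, 0, _cl_resultMask + offset);
        offset += 4;
    } while (offset < end);
}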
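
As a rough illustration of the generation step, here is a small Python sketch that walks a tiny expression tree and emits OpenCL C source. It is not the generator from the prototype, which reads sqlite's own parse tree. The tuple-based tree format and the function names emit_expr and emit_kernel are hypothetical, and the emitted kernel is simplified to one work item per group of four rows.

```python
# Hypothetical expression tree: ("and", left, right) or (op, column, constant).
def emit_expr(node, regs):
    """Turn a tiny WHERE tree into an OpenCL C expression string."""
    op = node[0]
    if op in ("and", "or"):
        joiner = " && " if op == "and" else " || "
        return "(" + emit_expr(node[1], regs) + joiner + emit_expr(node[2], regs) + ")"
    _, column, constant = node
    return "(%s %s %d)" % (regs[column], op, constant)

def emit_kernel(name, columns, where):
    """Emit kernel source that loads each column and stores a result mask."""
    # Map every column to the private vector register it is loaded into.
    regs = {c: "v%d" % i for i, c in enumerate(columns)}
    params = ", ".join("__global int * %s" % c for c in columns)
    loads = "\n".join(
        "    int4 %s = vload4(0, %s + offset);" % (regs[c], c) for c in columns)
    return (
        "__kernel void %s (%s, __global int * _cl_resultMask) {\n"
        "    size_t offset = get_global_id(0) * 4;\n"
        "%s\n"
        "    int4 _cl_r = %s;\n"
        "    vstore4(_cl_r, 0, _cl_resultMask + offset);\n"
        "}\n"
    ) % (name, params, loads, emit_expr(where, regs))

where = ("and", (">", "uniformi", 60), ("<", "normali5", 0))
source = emit_kernel("x2_entry", ["id", "uniformi", "normali5"], where)
print(source)
```

The point of the sketch is only that the comparison expression falls out of a recursive walk over the tree, with column names swapped for the registers they were loaded into.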

Why are we generating OpenCL kernel code there? Isn’t there a better way? Well there is. In later versions of the OpenCL standard (and HSA) there is something called an intermediate representation (IR) form which is very much akin to what compilers translate high level languages to before targeting the native instruction set of whatever that code will run on.

Unfortunately OpenCL’s IR, otherwise known as SPIR, isn’t available to us since the OpenCL drivers for ARM’s Mali currently don’t support it. Imagination’s PowerVR doesn’t either. (Heck, Imagination requires an NDA to be signed to even get their drivers, talk about unfriendly!) They might someday but that day isn’t today. Likewise HSA has an IR as part of its standard called HSAIL.

Either one would be much better to emit, presuming of course that the OpenCL drivers could take that IR as input.

Nonetheless, as soon as I have “parity” with the prototype and a little testing I’ll commit the code that machine generates these OpenCL kernels to git. I’m getting close. The next step after that will be to make a few changes internal to sqlite to use those kernels.

Within the GPGPU team Gil Pitney has been working on Shamrock which is an open source OpenCL implementation. It’s really a friendly fork of the clover project but taken in a bit of a new direction.

Over the past few months Gil has updated it to make use of the new MCJIT from llvm which works much better for ARM processors. Further he’s updated Shamrock so that it uses current llvm. I have a build based on 3.5.0 on my chromebook.

The other part about Gil’s Shamrock work is it will in time also have the ability to drive Keystone hardware, which is TI’s ARM + DSPs on board computing solution. Being able to drive DSPs with OpenCL is quite an awesome capability. I do wish I had one of those boards.

The other capability Shamrock has is to provide a CPU driver for OpenCL on ARM. How does it perform? Good question!

I took my OpenCL accelerated sqlite prototype and built it to use the Shamrock CPU only driver. Would you expect a CPU only OpenCL driver offloading SQL SELECT queries to be faster, or the sqlite engine itself?

If you guessed OpenCL running on a CPU only driver, you’re right. Now remember the Samsung ARM based chromebook is a dual A15. The queries are against 100,000 rows in a single table database with 7 columns. Lower numbers are better, and times are in microseconds.

sql1 took 43653 microseconds
OpenCL handcoded-opencl/sql1.cl Interval took 17738 microseconds
OpenCL Shamrock 2.46x faster
sql2 took 62530 microseconds
OpenCL handcoded-opencl/sql2.cl Interval took 18168 microseconds
OpenCL Shamrock 3.44x faster
sql3 took 110095 microseconds
OpenCL handcoded-opencl/sql3.cl Interval took 18711 microseconds
OpenCL Shamrock 5.88x faster
sql4 took 143278 microseconds
OpenCL handcoded-opencl/sql4.cl Interval took 19612 microseconds
OpenCL Shamrock 7.30x faster
sql5 took 140398 microseconds
OpenCL handcoded-opencl/sql5.cl Interval took 18698 microseconds
OpenCL Shamrock 7.5x faster

These numbers for running on the CPU are pretty consistent and I was concerned there was some error in the process. Yet the returned number of matching rows is the same for both the sqlite engine and the OpenCL versions which helps detect functional problems. I’ve clipped the result row counts from the results above for brevity.
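
For anyone who wants to check the arithmetic, the "x faster" figures are just the sqlite time divided by the OpenCL time. A short Python sketch using the timings listed above (the dictionary layout and the function name speedup are mine, not part of the test harness):

```python
# Timings in microseconds, copied from the results above: (sqlite, OpenCL).
timings = {
    "sql1": (43653, 17738),
    "sql2": (62530, 18168),
    "sql3": (110095, 18711),
    "sql4": (143278, 19612),
    "sql5": (140398, 18698),
}

def speedup(sqlite_us, opencl_us):
    """How many times faster the OpenCL run was than the sqlite engine."""
    return sqlite_us / opencl_us

for name, (sqlite_us, opencl_us) in sorted(timings.items()):
    print("%s: %.2fx faster" % (name, speedup(sqlite_us, opencl_us)))
```

Note how the OpenCL times barely move across the five queries while the sqlite times more than triple, which is where the growing ratio comes from.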

I wasn’t frankly expecting this kind of speed up, especially with a CPU only driver. Yet there it is in black and white. It does speak highly of the capabilities of OpenCL to be more efficient at computing when you have data parallel problems.

Another interesting thing to note in this comparison: the best results achieved have been with the Mali GPU using vload/vstore, which takes advantage of SIMD vector instructions. On a CPU this would equate to use of NEON. The Shamrock CPU only driver doesn’t at the moment have support for vload/vstore, so the compiled OpenCL kernel isn’t even using NEON on the CPU to achieve these results.

I run OSX on my laptop. (gasp!) I ssh into my various linux boxes to work on various projects. As I’m doing a little work with Renderscript and my sqlite acceleration project I thought it would be handy to build Android on my OS X laptop. Turns out it’s not entirely difficult and required just one fix to the code.

Ports

There are several projects to bring various linux/unix tools onto OSX. I use MacPorts. Brew is probably another good option. Either way this gives us a foundation of tools that the android build system is going to need.

The install instructions offer an extra easy pkg option.

Next we need to install some software.

sudo port install coreutils findutils pngcrush gsed gnupg

Xcode

Xcode is of course Apple’s development environment for OSX and iOS. You need it, and it can be installed directly out of the App Store.

Java

Make sure you have java installed.

java -version
java version "1.6.0_65"

If you don’t, you’ll get a popup dialog that will ask if you want to install it. Do!

Python

Make sure you have python installed. If I recall correctly that’s a default install with OSX Mavericks.  There is an option to install via ports.

sudo port install python

Repo

Pull down repo.

mkdir -p ~/bin && curl http://commondatastorage.googleapis.com/git-repo-downloads/repo > ~/bin/repo
chmod a+x ~/bin/repo

Make sure you add your ~/bin to your PATH

export PATH="$PATH:$HOME/bin"

Android SDK tools

You need to download the android sdk tools built for the Mac. Download these from here. Extract. At this point I created an android directory and put the tools inside of it.

mkdir -p ~/android
mv <wherever>/android-sdk ~/android

Filesystem setup

OSX for all its joys doesn’t deal with case differences in its file system unless you specifically created the file system to do so. The default doesn’t. It’s not 8.3, but it’s still 1990s lame. So you’ll need to create a file system for the Android source code to live in.

Make sure you have the space in your file system. I created a 100 gig file system. I wouldn’t go below 50. I also put this onto my desktop. Makes it easy to double click later to mount it. Feel free to mount it where it works best for you. However remember this location!

hdiutil create -type SPARSE -fs "Case-sensitive Journaled HFS+" -size 100g -volname "android" -attach ~/Desktop/Android
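
If you want to double check that a volume really is case sensitive before syncing many gigabytes into it, a quick probe is easy to write. This is a minimal Python sketch; the function name is_case_sensitive is mine, and it simply creates a temporary file with a mixed-case name and checks whether the lower-cased spelling resolves to it.

```python
import os
import tempfile

def is_case_sensitive(directory):
    """Return True if the file system holding `directory` tells 'a' from 'A'."""
    with tempfile.NamedTemporaryFile(prefix="CaseCheck", dir=directory) as handle:
        # If a lower-cased spelling of the name also resolves, the volume folds case.
        folded = os.path.join(directory, os.path.basename(handle.name).lower())
        return not os.path.exists(folded)

print(is_case_sensitive(tempfile.gettempdir()))
```

Pass the directory where the new volume is mounted instead of the temporary directory to test the file system you just created.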

Android source code

Download as you normally would. (Note the cd to the location where the new case-sensitive HFS+ file system was just attached. With the hdiutil command above, the image file lands on the desktop and the volume mounts under /Volumes.)

cd /Volumes/android
git clone http://android.googlesource.com/platform/manifest.git
(cd manifest && git branch -r)   # this will show you all the branch options. I was after the latest.
repo init -u https://android.googlesource.com/platform/manifest -b android-4.4_r1.2
repo sync

Environment Setup

We need to set up a few environment variables. First add the android sdk tools to your path.

export PATH=~/android/android-sdk/sdk/platform-tools:$PATH
export BUILD_MAC_SDK_EXPERIMENTAL=1
export LC_CTYPE=C
export LANG=C
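
Since there are a fair number of moving parts by this point, a small preflight check can save a failed build. Below is a Python sketch under a few assumptions: the tool list is my guess at what matters from the steps above (adjust it to your setup), and the function names missing_tools and env_problems are made up for this example.

```python
import os
import shutil

# Tools this walkthrough installs or relies on; adjust the list to taste.
TOOLS = ["repo", "java", "python", "gsed", "adb"]

def missing_tools(tools, path=None):
    """Return the tools that cannot be found on PATH (or on `path` if given)."""
    return [t for t in tools if shutil.which(t, path=path) is None]

def env_problems(environ):
    """List the variables from the exports above that are unset or different."""
    wanted = {"BUILD_MAC_SDK_EXPERIMENTAL": "1", "LC_CTYPE": "C", "LANG": "C"}
    return [k for k, v in wanted.items() if environ.get(k) != v]

print("missing tools:", missing_tools(TOOLS))
print("env problems:", env_problems(os.environ))
```

Two empty lists mean the shell you are about to build from can see everything set up so far.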

The One Fix

So in jni_generator.py there is a slight issue where it doesn’t handle the situation where one of the tool parameters isn’t available. So we need to defensively work around it. (Yeah, yeah, I should just post the patch.)

In external/chromium_org/base/android/jni_generator/jni_generator.py

At the top of the file (around line 20) add

import platform

Then lower down add the following if to check for Darwin so that -fpreprocessed isn’t passed:

  def _RemoveComments(self, contents):
    # We need to support both inline and block comments, and we need to handle
    # strings that contain '//' or '/*'. Rather than trying to do all that with
    # regexps, we just pipe the contents through the C preprocessor. We tell cpp
    # the file has already been preprocessed, so it just removes comments and
    # doesn't try to parse #include, #pragma etc.
    #
    # TODO(husky): This is a bit hacky. It would be cleaner to use a real Java
    # parser. Maybe we could ditch JNIFromJavaSource and just always use
    # JNIFromJavaP; or maybe we could rewrite this script in Java and use APT.
    # http://code.google.com/p/chromium/issues/detail?id=138941
    system = platform.system()
    if system == 'Darwin':
      cpp_args = ['cpp']
    else:
      cpp_args = ['cpp', '-fpreprocessed']
    p = subprocess.Popen(args=cpp_args,
                         stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
    stdout, _ = p.communicate(contents)
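
If you would rather see the platform check in isolation, here it is pulled out into a standalone Python sketch. The helper name cpp_args_for is mine and does not exist in jni_generator.py; the logic mirrors the if/else added above.

```python
import platform

def cpp_args_for(system):
    """Pick the preprocessor command line for a given platform.system() value."""
    if system == "Darwin":
        # Skip -fpreprocessed on OS X, as in the change above.
        return ["cpp"]
    return ["cpp", "-fpreprocessed"]

print(cpp_args_for(platform.system()))
```

Taking the system name as a parameter, rather than calling platform.system() inside the function, makes it easy to check both branches from any machine.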

Ready To Build

That’s it. At least I hope I captured everything I had to do. Build away.