Archive for the ‘linaro’ Category

Linaro Mobile Group

Posted: July 8, 2014 in aarch64, android, linaro

I’m pleased to say I’ve taken on the responsibility to help get the newly formed Linaro Mobile Group off the ground. Officially I’m the Acting Director of LMG. In many ways the Group is actually old as advancement of the ARM ecosystem for Mobile has always been and continues to be a top goal of Linaro. What is happening is we’ve refined the structure so that LMG will function like the other segment groups in Linaro. Linaro has groups formed for Enterprise (LEG), Networking (LNG), Home Entertainment (LHG), so it makes perfect sense that Mobile was destined to become a group.

I am quite grateful to Kanta Vekaria for getting the LMG’s steering committee up and running. This committee, called MOBSCOM, started last fall and will morph to be called LMG-SC, the SC of course being short for Steering Committee. Members of LMG are in the driver’s seat, setting LMG direction. It’s my job to run the day to day and deliver on the goals set by the steering committee.

Mobile is our middle name and also a big term. For LMG, our efforts are largely around Android. This is not to say that embedded Linux or other mobile operating systems like ChromeOS aren’t interesting. They are. We have and will continue to perform work that can be applied across more than one ARM based mobile operating system. Our media library optimization work using ARM’s SIMD NEON is a very good example. Android is top priority and the main focus, but it’s not all we do.

It’s a great time to form LMG. June, for a number of years, has brought many gifts for developers in mobile with Apple’s WWDC, and Google I/O. Competition is alive and well between these two environments, which in turn fuels innovation. It challenges us as developers to continue to improve. It also makes the existence of Linaro all the more important. It benefits all our members to collaborate together, accomplish engineering goals together instead of each shouldering the engineering costs to improve Android.

Android 64 is a great example. We were quite excited to make available Android64 for Juno, ARM’s armv8 reference board as well as a version for running in software emulators like qemu. We also did quite a bit of work in qemu so that it could emulate armv8 hardware. The world doesn’t need 20 different Android 64 implementations. The world DOES need one great Android 64 and in this way the collaborative environment in and around Linaro is important. While the 06.14 release of Android64 for Juno by Linaro is just a first release with much to do yet, things are on track for some great products by our member companies in the months ahead.

Stay tuned!


Within the GPGPU team Gil Pitney has been working on Shamrock which is an open source OpenCL implementation. It’s really a friendly fork of the clover project but taken in a bit of a new direction.

Over the past few months Gil has updated it to make use of the new MCJIT from llvm which works much better for ARM processors. Further he’s updated Shamrock so that it uses current llvm. I have a build based on 3.5.0 on my chromebook.

The other part about Gil’s Shamrock work is it will in time also have the ability to drive Keystone hardware which is TI’s ARM + DSPs on board computing solution. Being able to drive DSPs with OpenCL is quite an awesome capability. I do wish I had one of those boards.

The other capability Shamrock has is to provide a CPU driver for OpenCL on ARM. How does it perform? Good question!

I took my OpenCL accelerated sqlite prototype and built it to use the Shamrock CPU only driver. Would you expect a CPU only OpenCL driver offloading SQL SELECT queries to be faster, or the sqlite engine?

If you guessed OpenCL running on a CPU only driver, you’re right. Now remember the Samsung ARM based chromebook is a dual A15. The queries are against 100,000 rows in a single table database with 7 columns. Lower numbers are better and times are in microseconds.

sql1 took 43653 microseconds
OpenCL handcoded-opencl/sql1.cl Interval took 17738 microseconds
OpenCL Shamrock 2.46x faster
sql2 took 62530 microseconds
OpenCL handcoded-opencl/sql2.cl Interval took 18168 microseconds
OpenCL Shamrock 3.44x faster
sql3 took 110095 microseconds
OpenCL handcoded-opencl/sql3.cl Interval took 18711 microseconds
OpenCL Shamrock 5.88x faster
sql4 took 143278 microseconds
OpenCL handcoded-opencl/sql4.cl Interval took 19612 microseconds
OpenCL Shamrock 7.30x faster
sql5 took 140398 microseconds
OpenCL handcoded-opencl/sql5.cl Interval took 18698 microseconds
OpenCL Shamrock 7.5x faster

These numbers for running on the CPU are pretty consistent and I was concerned there was some error in the process. Yet the returned number of matching rows is the same for both the sqlite engine and the OpenCL versions which helps detect functional problems. I’ve clipped the result row counts from the results above for brevity.

I wasn’t frankly expecting this kind of speed up, especially with a CPU only driver. Yet there it is in black and white. It does speak highly of the capabilities of OpenCL to be more efficient at computing when you have data parallel problems.

Another interesting thing to note in this comparison, the best results achieved have been with the Mali GPU using vload/vstores and thus take advantage of SIMD vector instructions. On a CPU this would equate to use of NEON. The Shamrock CPU only driver doesn’t at the moment have support for vload/vstore so the compiled OpenCL kernel isn’t even using NEON on the CPU to achieve these results.

I’ve posted my initial OpenCL accelerated sqlite prototype code:

http://git.linaro.org/people/tom.gall/sq-cl.git

Don’t get excited. Remember, it’s a prototype and a quite contrived one at that. It doesn’t handle the general case yet and of course it has bugs. But!  It’s interesting and I think shows what’s possible.

Over at the Mali developer community that ARM hosts, I happened to mention this work in a post, which resulted in some good suggestions to use vectors as well as other good feedback. While working with vectors was a bit painful due to the introduction of some bugs on my part, I made my way through it and have some initial numbers with a couple of kernels so I can get an idea just what a difference it makes.

A lot.

The core of the algorithm for sql1 changes from:

    do {
        if ((data[offset].v > 60) && (data[offset].w < 0)) {
            resultArray[roffset].id = data[offset].id;
            resultArray[roffset].v = data[offset].v;
            resultArray[roffset].w = data[offset].w;
            roffset++;
        }
        offset++;
        endRow--;
    } while (endRow);

to:

    do {
        v1 = vload4(0, data1+offset);
        v2 = vload4(0, data2+offset);
        r = (v1 > 60) && ( 0 > v2);
        vstore4(r,0, resultMask+offset);
        offset+=4;
        totalRows--;
    } while (totalRows);

With each spin through the loop, the vectorized version of course is operating over 4 values at once to check for a match. Obvious win. To do this the data has to come in as pure columns, and I’m using a vector as essentially a bitmask to indicate if that row is a match or not. This requires a post processing loop to spin through and assemble the resulting data into a useful state. For the 100,000 row database I’m using it doesn’t seem to have as much of a performance impact as I thought it might.
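The post processing pass could look roughly like the following host-side C. This is a sketch under assumptions, not the code from the prototype: the `row_t` layout and all the names are hypothetical, and it relies on the fact that OpenCL vector comparisons set a lane to -1 when it matches and 0 when it does not.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result row; the prototype's real layout may differ. */
typedef struct {
    int id;
    int v;
    int w;
} row_t;

/*
 * Walk the per-row mask written by the vectorized kernel and copy the
 * matching rows into a dense result array. Returns the number of matches.
 * The out array must have room for up to `rows` entries.
 */
size_t gather_matches(const int *mask, const int *ids,
                      const int *v, const int *w,
                      size_t rows, row_t *out)
{
    assert(mask && ids && v && w && out);
    size_t n = 0;
    for (size_t i = 0; i < rows; i++) {
        if (mask[i] != 0) {          /* nonzero lane means "matched" */
            out[n].id = ids[i];
            out[n].v = v[i];
            out[n].w = w[i];
            n++;
        }
    }
    return n;
}
```

The pass is a single linear scan with one branch per row, which may be part of why its cost stays small next to the query itself.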

For the first sql1 test query the numbers look like this:

CPU sql1 took 43631 microseconds
OpenCL sql1  took 14545 microseconds  (2.99x or 199% better)
OpenCL (using vectors) 4114 microseconds (10.6x better or 960%)

Not bad. sql3 sees even better results:

CPU sql3 took 111020 microseconds
OpenCL sql3 took 44533 microseconds (2.49x  or 149% better)
OpenCL (using vectors) took 4436 microseconds (25.02x or 2402% better)
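For anyone checking the arithmetic, the "x" and "% better" figures both come from the same two timings. A minimal C sketch of the calculation (the function names are mine and are not part of the prototype):

```c
#include <assert.h>

/* Speedup factor: how many times faster the optimized run is. */
double speedup(double baseline_us, double optimized_us)
{
    assert(optimized_us > 0.0);
    return baseline_us / optimized_us;
}

/* "Percent better": the gain over the baseline, as a percentage. */
double percent_better(double baseline_us, double optimized_us)
{
    return (speedup(baseline_us, optimized_us) - 1.0) * 100.0;
}
```

For the vectorized sql1 run, 43631 and 4114 microseconds give a factor of about 10.6, or roughly 960% better.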

There’s another factor in why these vectorized versions are doing better. With the newer code I am using fewer registers on the Mali GPU and thus am able to up the number of work units from 64 to 128.

I do have one bug that I need to track down. I am (of course) validating that all the versions are coming up with the same matches. The new vector versions are off by a couple of rows. The missing rows don’t seem to follow a pattern. I’m sure I’ve done something dumb. Now that there is the ability for more eyes on the code perhaps someone will spot it.

Thursday featured the UMM user space allocator helper discussion and the GPGPU status talk.

The UMM User Space Allocators discussion was given by Sumit Semwal and Benjamin Gaignard. The problem involves the need from user space to allocate and work with memory for sharing between devices. Consider a video pipeline or a web camera that is rendering to the screen. This work will help achieve a zero copy design without user space having to know hardware details such as memory ranges, and other device constraints.

Gil Pitney and I gave the GPGPU talk which covered the current efforts involving the GPGPU subteam. Gil is working on Shamrock which is the old Clover project evolved. He’s upgraded it to use current top of tree llvm and MCJIT for code gen. There’s still testing to do but these are excellent steps forward as getting off the old JIT was important. Shamrock provides a CPU only OpenCL implementation which is great for those that don’t want to implement their own drivers but still want to provide at least the basic functionality. In addition there will be via Shamrock a driver for TI DSP hardware. This is also quite a great step forward. Via this route, everyone can collaborate on the open source portion which takes care of the base library, and this leaves just the driver/codegen as something that needs to be created by the board creator.

The other part of the talk was about accelerating SQLite with OpenCL. There was a past project that accomplished something similar but with CUDA. I’m working on this and it’s quite the enjoyable project. I’m just implementing OpenCL kernels so there is a ways to go.  It will serve as a good reference for what can be accomplished on ARM SoC systems which have OpenCL drivers. We typically don’t have as many shader units as modern desktop PCIe solutions in the intel universe. I do find it encouraging that the SQLite design is quite flexible and fits well with this kind of experiment.

I also attended the session on Ne10 and Chromium optimizations for Cocos2D/Cocos2D-HTML5. These are ARM projects. Ne10 is essentially a library that sits above Neon intrinsics to give easier access to that functionality. Cocos is a popular cross platform engine that is particularly popular in the Android world for 2D UIs and game creation. There was some nice optimization work in and around various drawing primitives done by the ARM team for Chromium that ends up helping Cocos.

Thursday included the first bit of quiet time I had all week to actually write some code. It didn’t last long but it did feel good as I’m in a very fun portion of implementing the optimized SQLite with OpenCL and it was hard to set that work aside while Connect is on.

Updated June 3rd.

Here are the instructions for building your own Open Embedded based aarch64 image which is able to run an xfce based graphical environment. Consider this a draft that will evolve as there are some hacking bits and some steps that over time I want to make disappear.

mkdir openembedded

cd openembedded

git clone git://git.linaro.org/openembedded/jenkins-setup.git

cd jenkins-setup

sudo bash pre-build-root-install-dependencies.sh

edit init-and-build.sh and delete the very last line which is a call to bitbake. bitbake starts a build, we want to delay that for a bit.

./init-and-build.sh

  • Pull from my xfce branch.

cd meta-linaro
git remote add linaro git://git.linaro.org/people/tomgall/oe/meta-linaro.git
git fetch linaro
git checkout -b xfce linaro/xfce

cd ../meta-openembedded

git remote add linaro git://git.linaro.org/people/tomgall/oe/meta-openembedded.git
git fetch linaro
git checkout -b aarch64 linaro/aarch64

cd ../openembedded-core

git remote add linaro git://git.linaro.org/people/tomgall/oe/openembedded-core.git
git fetch linaro
git checkout -b aarch64 linaro/aarch64

cd ..

  • Next we need to expand the set of recipes the build will use.

cd build

edit conf/bblayers.conf and add

BBLAYERS += '/your-full-path/jenkins-setup/meta-linaro/meta-browser'
BBLAYERS += '/your-full-path/jenkins-setup/meta-openembedded/meta-xfce'
BBLAYERS += '/your-full-path/jenkins-setup/meta-openembedded/meta-gnome'

  • Now it’s time to build

cd ../openembedded-core
. oe-init-build-env ../build
cd ../build
bitbake linaro-image-xfce

  • The output of the build is in the build directory in tmp-eglibc/deploy/images
  • Package the resulting rootfs into an img using linaro-media-create. This implies you have a current copy from bzr of linaro-image-tools (bzr clone lp:linaro-image-tools). You also need the hwpack from the first step.

~/bzr/linaro-image-tools/linaro-media-create --dev fastmodel \
    --image_size 2000M --output-directory gfx --binary \
    linaro-image-xfce-genericarmv8-20130524144429.rootfs.tar.gz --hwpack \
    ../linux-gfx-model/hwpack_linaro-vexpress64-rtsm_20130521-337_arm64_supported.tar.gz

We’ll do some work to smooth this out and get rid of this step and use the OE built kernel.

  • Boot the rtsm model

I use a script for running with either the Foundation model or the RTSM model. Note I keep the Foundation model and the RTSM models inside of ~/aarch64.

————————————————————————————

#!/bin/bash

model=foundation
kernel=
rootfs=

# Optional third argument selects the model; the default is foundation.
if [ -n "$3" ]; then
    model=$3
fi

# The kernel image is required, so stop here if it is missing.
if [ -z "$1" ]; then
    echo "Usage: aarch64-sim KERNEL ROOTFS foundation|rtsm"
    exit 1
else
    kernel=$(realpath "$1")
fi

if [ -n "$2" ]; then
    rootfs=$(realpath "$2")
fi

case $model in
foundation)
    if [ -n "$rootfs" ]; then
        rootfs=" --block-device $rootfs"
    fi
    # sudo ip tuntap add tap0 mode tap
    # sudo ifconfig tap0 192.168.168.1
    # $rootfs stays unquoted so the option and its path split into two words
    ~/aarch64/Foundation/Foundation_v8pkg/Foundation_v8 --image "$kernel" --network nat $rootfs
    ;;
rtsm)
    if [ -n "$rootfs" ]; then
        rootfs=" -C motherboard.mmc.p_mmc_file=$rootfs "
    fi
    ~/aarch64/RTSM_AEMv8_VE/bin/model_shell64 \
        -a "$kernel" \
        $rootfs \
        ~/aarch64/RTSM_AEMv8_VE/./models/Linux64_GCC-4.1/RTSM_VE_AEMv8A.so
    ;;
esac

———————————————————————————————

I put this in aarch64-sim.sh.  (Don’t forget to chmod +x)

aarch64-sim.sh gfx/img.axf  gfx/sd.img rtsm

  • After the linux system has booted you need to run the following by hand.

fc-cache -f -v

pango-querymodules > /etc/pango/pango.modules

  • and now finally run:

startxfce4

That’s it for now!

Wouldn’t it be great to have an Open Embedded linaro image with a graphical environment for aarch64? We thought so too.

So in the Linaro Graphics Working Group we’ve been creating one. In our case we picked xfce as our environment. It’s reasonably lightweight, fairly simple, has reasonable package dependencies and is already supported in OE.

The environment for running aarch64 these days that I have access to is the simulator. There are two, one called the Foundation model and the other the RTSM commercial model. I’ve been using the latter but the results should work with the Foundation model as well. Emphasis on should, I haven’t tried yet.

So let’s retrace the journey a bit so you can enjoy the adventure. With framebuffer on in the kernel and the usual suspects, X of course is the first stop along the way.

Image

xclock, xterm in all their early 1990s lack of any bells and whistles glory. While the output looks reasonable, at this stage I didn’t have the mouse working and couldn’t enter any keyboard input.

Pressing ahead I started adding xfce packages to my image. That got me to here.

Image

Ew! Obvious font problems and still no ability to get keyboard or mouse input. I quickly discovered from the Xorg log I didn’t have xf86-input-evdev. Once installed both keyboard and mouse worked.

Fonts were a bit more tricky. fc-cache -f -v alone wasn’t fixing it. Given the squares that doesn’t look like a rendering problem. Pango!  Ah yes.  /etc/pango/pango.modules was empty. One pango-querymodules > /etc/pango/pango.modules later and things were much better.

Image

Ahhh much better!  Still you can see in the window titlebars something is amiss and our xfce environment is still missing things. So fleshing things out even more brings us to this current point.

Image

That certainly looks much more like a normal xfce setup. Still have some rendering issues to address, between the mouse pointer and something with the windows but otherwise, this is something usable and moving in the right direction.

In my next post, I’ll point you at images and the early instructions as to how you can build and use this yourself!

OpenCL on ARM (part 1)

Posted: March 29, 2013 in linaro, OpenCL

A few weeks ago, before Linaro Connect, I had started to see what might be available for OpenCL implementations on ARM. Via a little bit of googling it seemed that the only choice was going to be boards with a Mali T6xx GPU. This was ok since that basically boils down to the Arndale board and the Samsung Chromebook. Both are good options, and I happened to have a Chromebook.

I downloaded the Mali OpenCL SDK which can be found from their site.

It didn’t take long following the instructions before I realized this SDK isn’t like most SDKs. It doesn’t contain any form of Mali OpenCL driver. The SDK has a lib directory, and when you type make there (you will probably have to fix their Makefile first) it yields a libOpenCL.so, but that is essentially an empty stub library. You can compile and link against what is provided, but when you try to run, nothing happens. The library is just a long list of functions with no implementation behind them. None. Not very useful.

Via this discussion, at the very bottom we see a bit of an explanation as to why.

We (ARM) do provide a build of Linux containing the OpenCL driver to select partners under a specific license, but this is not public at this time

So they leave it to the maker of the board to distribute a driver at their option. This gives the board maker the option to not support OpenCL at all if they so choose. Ok, I respect that and it makes sense: just because a Mali T6xx part is on a board doesn’t mean that it’s wired up universally the same way, which may require some driver specific change. It’s conjecture on my part since obviously we’ve no view into the source code as it’s not Open Source.

That said, the Insignal discussion can be found on their boards here. Simply put, not yet available for Linux but supposedly available for their Android Jelly Bean.

Hrumph!

I like Android but the problem is at Connect I gave up my Arndale board to one of my coworkers. I haven’t ordered a new one since the wait times are impressively long and currently they are sold out again at HowChip.

Android to my knowledge doesn’t run on the ARM based Samsung Chromebook so I’m out of options.

Next I did a little spelunking within the ChromeOS file system on my Chromebook to see if I might find something to suggest that OpenCL was there. I didn’t find libOpenCL.so in any of the usual places so it’s probably safe to say ChromeOS doesn’t make use of OpenCL. No chance of copying over any binaries for use on Linux.

So backing up what other options do I have? Well I do have an OSX option. Putting together an OpenCL HelloWorld there is quite easy. Still.

I’ve a couple Intel Linux boxes, at least it would be a place to get my feet wet in the meantime and be more in line with what OpenCL on ARM linux will be like. So on Ubuntu I proceeded. There are two options. Either Intel’s VCSource or AMD’s APP SDK both proclaiming OpenCL support.

Let’s talk about how the OpenCL infrastructure is installed. First come the includes, which are best placed in /usr/include/CL. These are of course not needed at runtime. Next, if you put the contents of each SDK’s lib directory into /usr/lib/OpenCL/vendors/intel or /usr/lib/OpenCL/vendors/amd, you can have both SDKs installed at the same time. These are needed at runtime. Then there is /etc/OpenCL/vendors, which holds a number of .icd files. You only need one, but with multiple SDKs you’ll have more than one. ICD stands for Installable Client Driver, i.e. the file points to the real driver, and it is required at runtime. libOpenCL.so reads the .icd files to find out which driver(s) to use, which makes libOpenCL.so more of a traffic cop between your application and the real driver. Next, within /etc/ld.so.conf.d you’ll have a new file that points to where the shared libraries are. In my case there are separate files pointing to /usr/lib/OpenCL/vendors/intel and /usr/lib/OpenCL/vendors/amd. Last, I have symlinks for OpenCL.so, OpenCL.so.1 and OpenCL.so.1.2 that all point to the libOpenCL.so implementation I’m using, such as the one in /usr/lib/OpenCL/vendors/amd.
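To make the "points to the real driver" part concrete: an .icd file is a small text file whose content is the name or path of the vendor’s driver library. Below is a minimal C sketch of pulling that name out of one line of such a file. The function name and the library name in the usage note are hypothetical, and the real ICD loader does considerably more than this.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the library name from one line of an .icd file into out, dropping
 * surrounding whitespace and the trailing newline. Returns 0 on success,
 * or -1 if the line is blank or does not fit in out.
 */
int icd_library_name(const char *line, char *out, size_t out_size)
{
    assert(line && out && out_size > 0);
    /* Skip leading whitespace. */
    while (*line && isspace((unsigned char)*line))
        line++;
    /* Trim trailing whitespace, including the newline. */
    size_t len = strlen(line);
    while (len > 0 && isspace((unsigned char)line[len - 1]))
        len--;
    if (len == 0 || len >= out_size)
        return -1;
    memcpy(out, line, len);
    out[len] = '\0';
    return 0;
}
```

Given a line such as "libvendor_opencl.so" followed by a newline (a made-up library name), the function leaves just the name in out, which is what a loader would then hand to the dynamic linker.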

AMD’s APP wants to set the environment variable AMDAPPSDKROOT="/opt/AMDAPP" and does so in /etc/profile.

So knowing these aspects of setup, I proceeded to try out a simple HelloWorld app that would get the list of devices, create a context and spawn off some simple work to validate that things are sane.

Let’s talk about how well things work with the Intel and AMD SDKs.

Intel’s SDK for Linux indicates that it supports only a limited set of CPUs. GPUs are not supported at all. Neither is the i7 CPU, which is what my laptop has. I tried to run anyway. Fail! If you want to use their SDK with an i7, for instance, you can only do so on Windows! Lame! Why Intel would have a dependency on a very limited set of CPUs is beyond me.

Ok so obviously this wasn’t going to work. Next I switched over to the AMD APP SDK. As it turns out, it supports running OpenCL on just CPUs, i.e. without using a GPU, as well as submitting work to both CPUs and GPUs. Neither my laptop nor my main Intel desktop has an ATI GPU, so the CPU only path was essential for me: the AMD implementation only supports ATI GPUs but, as it turns out, “any” Intel based CPU. Using the AMD supplied HelloWorld OpenCL app, it ran. But.

./HelloWorld
Setting of real/effective user Id to 0/0 failed
FATAL: Module fglrx not found.
Error! Fail to load fglrx kernel module! Maybe you can switch to root user to load kernel module directly
No GPU device available.
Choose CPU as default device.
input string:
GdkknVnqkc
output string:
HelloWorld
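Judging purely from the input and output strings, the sample adds one to every character. Here is a plain C equivalent of that per element work, with an ordinary loop standing in for the OpenCL work-items. This is my reading of the output and not AMD’s kernel source, and the function name is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Add one to each character of in and store the result in out, the way
 * each work-item would for its own index. The out buffer must hold at
 * least strlen(in) + 1 bytes.
 */
void shift_chars(const char *in, char *out)
{
    assert(in && out);
    size_t n = strlen(in);
    for (size_t i = 0; i < n; i++)
        out[i] = (char)(in[i] + 1);
    out[n] = '\0';
}
```

Every element is independent of the others, which is exactly the kind of data parallel work OpenCL can spread across however many compute units the device has.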

fglrx of course is the ATI kernel module. Via OpenCL you can specify that your workload is only going to be directed at CPUs. Even though you might do so, you’ll still get this error every time. Awesome! At least, as compared to the Intel offering, it runs on any CPU.