Late on a Saturday night and I’m working on my monitor tan. It’s spring, can’t be too early to prepare for summer of course!

I’ve taken the following SQL queries and run them both with the traditional sqlite C APIs and with my OpenCL accelerated APIs. These are the same queries that Bakkum et al. used in their CUDA accelerated sqlite paper.

const char *sql11 = "SELECT SUM(normalf20) FROM test";
const char *sql12 = "SELECT AVG(uniformi) FROM test WHERE uniformi > 0";
const char *sql13 = "SELECT MAX(normali5), MIN(normali5) FROM test";

Straight sqlite with my A15 based Samsung Chromebook yields:

sql11 took 95399 microseconds
sql12 took 86576 microseconds
sql13 took 121898 microseconds

My OpenCL APIs yield the following for the same queries:

OpenCL sql11 took 46098 microseconds
OpenCL sql12 took 55524 microseconds
OpenCL sql13 took 64802 microseconds
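To put the two sets of timings side by side, here is a small sketch of the speedup arithmetic. The helper name speedup is my own invention for illustration and is not part of either API. Fed the numbers above, it works out to roughly 2.1x for sql11, 1.6x for sql12 and 1.9x for sql13.

```c
#include <assert.h>

/* Hypothetical helper: ratio of baseline time to accelerated time.
 * Both arguments are elapsed times in microseconds. */
double speedup(long baseline_us, long accelerated_us)
{
    assert(accelerated_us > 0);   /* avoid dividing by zero */
    return (double)baseline_us / (double)accelerated_us;
}
```

Calling speedup(95399, 46098) gives the sql11 ratio; the other two queries follow the same pattern.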

The data is the same for both the straight C sqlite APIs and the OpenCL APIs: 100,000 rows to process from one database with one table. The time measured is the time to perform the query across all selected data and for the end user API to obtain the data. For OpenCL this includes copying the data out. For the straight C APIs this includes the time to access the one result row.
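For anyone who wants to take similar measurements, here is a minimal sketch of a timing wrapper. It is not the harness used for the numbers above. The names query_fn, elapsed_us and fake_query are assumptions made up for this example, and fake_query is only a stand-in workload where a real query (including fetching the result) would go.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical callback type: one complete query, from issuing it
 * to having the result in hand. */
typedef void (*query_fn)(void *ctx);

/* Returns the wall-clock microseconds spent inside fn(ctx).
 * timespec_get() is standard C11; a monotonic clock would be more
 * robust if the system time can change during the run. */
int64_t elapsed_us(query_fn fn, void *ctx)
{
    struct timespec start, end;

    assert(fn != NULL);
    timespec_get(&start, TIME_UTC);
    fn(ctx);
    timespec_get(&end, TIME_UTC);

    int64_t ns = (int64_t)(end.tv_sec - start.tv_sec) * 1000000000
               + (int64_t)(end.tv_nsec - start.tv_nsec);
    return ns / 1000;
}

/* Stand-in workload so the sketch runs without a database:
 * adds 0..99999 into the int64_t that ctx points at. */
void fake_query(void *ctx)
{
    volatile int64_t *sum = ctx;
    for (int64_t i = 0; i < 100000; i++)
        *sum += i;
}
```

The important design point is that the timed region covers everything the end user waits for, which is why the copy out of the result belongs inside the callback.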

I’m not applying any sort of statistical process or test to these results. That’ll be a later step to assert a confidence interval based on a distribution of collected results.

All in all I don’t think the results are too bad but I’d like to think OpenCL should be able to do better. Time to spend a little time with perf as well as do a little digging to see what might be available from Mali developer to analyze performance on the GPU.

These microbenchmarks are important to me. They give a guide as far as what might be accomplished with a general purpose solution which is yet to be written.  They also are helping me to form opinions about how to best approach it.

People have side projects. This one is mine.

What if you accelerate the popular sqlite database with OpenCL? This is one of the ideas that was floated as part of the GPGPU team to get a feel for what might be accomplished on ARM hardware with a mobile GPU.

In my case I’m using the Mali OpenCL drivers, running Ubuntu Linux on a Samsung Chromebook which includes a dual core A15 and a Mali T604. You can replicate this same setup following these instructions.

At Linaro Connect Asia 2014 as part of the GPGPU session I gave an overview of the effort but I wasn’t able to give any initial performance numbers since my free time is highly variable and Connect arrived before I was quite ready. At the time I was about a week out from being able to run a microbenchmark or two since I was just getting to the step of writing some of the OpenCL.

Before I get to some initial numbers let me review a bit of what I talked about at Connect.

To accelerate sqlite I’ve initially added an API that sits next to the sqlite C API. In time my API should be able to blend right into the sqlite API so that no code changes would be needed by end user applications. With sqlite you usually have a call sequence something like:

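sqlite3_open(databaseName, &db);
rc = sqlite3_prepare_v2(db, sql, -1, &selectAll_statement, NULL);
/* step through the result set one row at a time */
while (sqlite3_step(selectAll_statement) == SQLITE_ROW) {
    /* TYPE stands in for int, double, text and so on */
    sqlite3_column_TYPE(selectAll_statement, 0);
}
sqlite3_finalize(selectAll_statement);
sqlite3_close(db);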

The prepare call takes SQL and converts it to an expression tree that is translated into bytecode which is run inside of a VM. The virtual machine is really nothing more than a big switch statement, and each case handles an opcode that the VM operates over. sqlite doesn’t do any sort of JIT to accelerate its operation. (I know what you’re thinking, hold that thought.)
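To make the "big switch statement" idea concrete, here is a toy sketch of a bytecode interpreter. The opcodes, the instr struct and vm_run are all made up for illustration. sqlite’s real VM has its own, much larger instruction set and works quite differently in the details.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes, not sqlite's real instruction set. */
enum opcode { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

struct instr {
    enum opcode op;
    long operand;   /* only used by OP_PUSH */
};

#define STACK_MAX 16

/* Runs prog on a small operand stack and returns the top value.
 * Each case of the switch handles exactly one opcode. */
long vm_run(const struct instr *prog, size_t len)
{
    long stack[STACK_MAX];
    size_t sp = 0;

    for (size_t pc = 0; pc < len; pc++) {
        switch (prog[pc].op) {
        case OP_PUSH:
            assert(sp < STACK_MAX);
            stack[sp++] = prog[pc].operand;
            break;
        case OP_ADD:
            assert(sp >= 2);
            stack[sp - 2] += stack[sp - 1];
            sp--;
            break;
        case OP_MUL:
            assert(sp >= 2);
            stack[sp - 2] *= stack[sp - 1];
            sp--;
            break;
        case OP_HALT:
            assert(sp >= 1);
            return stack[sp - 1];
        }
    }
    assert(sp >= 1);
    return stack[sp - 1];
}
```

A program of PUSH 2, PUSH 3, ADD, PUSH 4, MUL, HALT evaluates (2 + 3) * 4. The loop around the switch is the whole interpreter, which is why there is no per-query compiled code to hand to a GPU.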

The challenge in making a general purpose acceleration is to take the operation of the VM and move that onto the GPU. I see a few ways to accomplish this. In their past work, Peter Bakkum and Kevin Skadron basically moved the implementation of the VM onto the GPU using CUDA. In my opinion this kind of approach really doesn’t work with OpenCL. Instead I’m currently of the opinion that the output of the SQL expression tree ought to be a bit more than just VM bytecodes. I do wonder if utilizing LLVM couldn’t offer interesting possibilities, including SPIR (the Khronos intermediate representation standard for OpenCL). Further research for sure.

The OpenCL accelerated API sequence looks like:

opencl_init(s, db);
opencl_prepare_data(s, sql);
opencl_transfer_data(s);
opencl_select(s, sql, 0);
opencl_transfer_results(s);
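The kernels behind opencl_select aren’t shown here. As a rough illustration only, an aggregate like SUM on a GPU is commonly done as a two stage reduction: each work-group folds its slice of the column down to one partial result, then the partials are combined. The sketch below models that pattern in plain C rather than OpenCL C. GROUP_SIZE, group_sum and column_sum are hypothetical names, and this is an assumption about the general technique, not the code that produced any of the timings.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 64   /* hypothetical work-group size, a power of two */

/* Stage 1: one "work-group" reduces up to GROUP_SIZE values to a
 * single partial sum by repeatedly halving the stride. On a GPU the
 * inner loop over lid would be separate work-items running in
 * parallel on local memory. */
double group_sum(const double *in, size_t n)
{
    double scratch[GROUP_SIZE] = { 0.0 };

    assert(n <= GROUP_SIZE);
    for (size_t i = 0; i < n; i++)
        scratch[i] = in[i];

    for (size_t stride = GROUP_SIZE / 2; stride > 0; stride /= 2)
        for (size_t lid = 0; lid < stride; lid++)
            scratch[lid] += scratch[lid + stride];

    return scratch[0];
}

/* Stage 2: combine the per-group partial sums into the final result. */
double column_sum(const double *col, size_t rows)
{
    double total = 0.0;

    for (size_t base = 0; base < rows; base += GROUP_SIZE) {
        size_t left = rows - base;
        size_t n = left < GROUP_SIZE ? left : GROUP_SIZE;
        total += group_sum(col + base, n);
    }
    return total;
}
```

MAX and MIN fit the same shape with the addition swapped for a comparison, and AVG is a sum plus a count.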

At this point, what I’ve managed to do is run the same query against a 100,000 row database with 7 columns, using both the sqlite C interface and my OpenCL accelerated interface.

With the sqlite C API the query took 420274 microseconds on my Chromebook, a dual core A15 CPU running at 1.7 GHz.

The OpenCL accelerated API running on the Mali T604 GPU at 533 MHz(?) in the same Chromebook yields 110289 microseconds. This measured time includes both the running of the OpenCL kernel and the data transfer from the result buffer.

These are early results. Many grains of salt should be applied, but overall these seem like good results for a mobile GPU.

The last session has passed. As I write this, it’s sort of a situation out of the Twilight Zone. I’ve managed to break my glasses. I’m fairly nearsighted, so my vision isn’t good enough for my screen to be in focus at an average distance.

The last day of Connect we had two sessions. Friday is a tough day to run a session on account of people being tired. Numbers suffer and it seems like we are all subject to a -20 IQ modifier to the technical discussion at times.

Friday Sessions

The ION session covered the current work in progress, with John Stultz from the Android team and Sumit Semwal tag team presenting. Since Plumbers there’s been a good amount of activity. Colin Cross updated this code a fair amount as it was reviewed. He created a number of tests which John Stultz ported outside of Android, a dummy driver has been put together for testing on non ION enabled graphics stacks, and the 115+ patch set was pushed up into staging. As of right now these patches build and run on ARM, x86_64 and ARM64. There’s more to do; the team is working to get the tests running in LAVA. There are a number of design issues yet to be worked out on the dma-buf side of things. There needs to be a set of constraint aware dma-buf allocation helper functions. Dma-buf needs to try and reuse some of the ION code so they are both using the same heaps. Then there needs to be a set of functions within dma-buf that will examine heap flags and allocate memory from the correct ION heap. It all boils down to having a more common infrastructure between dma-buf and ION such that the ION and dma-buf interfaces will rest nicely on what is there instead of being two separate and divergent things.

Benjamin Gaignard presented on Wayland / Weston. He reviewed the current status of the future replacement for X, its status on ARM and how people can use it today. He covered current efforts to address some of the design failings of Wayland/Weston that assume all systems have graphics architectures like Intel’s. The use of hardware compositors over GPUs on ARM shows a disjoint view of the world as compared to Intel, which just assumes everyone has a GPU and will want to use things like EGLImage. This is a common theme which we must introduce time and time again to various developer communities who have limited ARM experience. At this point Benjamin’s focus has been more to try and introduce into Wayland/Weston the ability to take advantage of dma-buf to promote the sharing of buffers over copying. It’s a slow go, especially without the user space dma-buf helpers which were the subject of yesterday’s session. Wayland/Weston is viable for use on ARM. It’s not perfect and we anticipate more work in this space, first with dma-buf and then to take advantage of compositing hardware often found on ARM SoCs.

Summary for the week

Media & Lib Team

I’m pleased that the media team was able to get a list formulated of the next media libraries to port and optimize for AARCH64. We synced with the ARM team which is also working in the area of optimization. This is vital so that we don’t replicate efforts accidentally.

Display Team

Bibhuti and I were able to sit down and discuss the hwcomposer project. We’ve set the milestones and we’ll set the schedule. I think it’s more than fair to say we’ll be showing code at the next Connect. The next step for Mali driver support on the graphics LSK branch includes further boards depending on their kernel status as well as some discussion about the potential to try and formally upstream the Mali kernel driver even in the face of likely community opposition. We had great discussions with the LHG and no doubt we’ll be working together to support LHG like we do with other groups.

UMM

As discussed above the UMM team is heads down on creating their initial PoC for connecting the heap allocators to provide map-time post attach allocation. This is code in progress and a very important step in knitting the ION and dma-buf worlds closer together.

GPGPU

This was more of a quiet Connect for GPGPU since projects are mid stream. GPGPU is more in a “sprint” like mode than a “connect” like mode. We did release the GPGPU whitepaper to the Friends of OCTO.

Thursday featured the UMM user space allocator helper discussion and the GPGPU status talk.

The UMM User Space Allocators discussion was given by Sumit Semwal and Benjamin Gaignard. The problem involves the need from user space to allocate and work with memory for sharing between devices. Consider a video pipeline or a web camera that is rendering to the screen. This work will help achieve a zero copy design without user space having to know hardware details such as memory ranges, and other device constraints.

Gil Pitney and I gave the GPGPU talk which covered the current efforts involving the GPGPU subteam. Gil is working on Shamrock which is the old Clover project evolved. He’s upgraded it to use current top of tree LLVM and MCJIT for code gen. There’s still testing to do but these are excellent steps forward as getting off the old JIT was important. Shamrock provides a CPU only OpenCL implementation which is great for those that don’t want to implement their own drivers but still want to provide at least the basic functionality. In addition there will be via Shamrock a driver for TI DSP hardware. This is also quite a great step forward. Via this route, everyone can collaborate on the open source portion which takes care of the base library, and this leaves just the driver/codegen as something that needs to be created by the board creator.

The other part of the talk was about accelerating SQLite with OpenCL. There was a past project that accomplished something similar but with CUDA. I’m working on this and it’s quite the enjoyable project. I’m just implementing OpenCL kernels so there is a ways to go. It will serve as a good reference for what can be accomplished on ARM SoC systems which have OpenCL drivers. We typically don’t have as many shader units as modern desktop PCIe solutions in the Intel universe. I do find it encouraging that the SQLite design is quite flexible and fits well with this kind of experiment.

I did also attend the session on Ne10 and Chromium optimizations for Cocos2D/Cocos2D-HTML5. These are ARM projects. Ne10 is essentially a library that sits above NEON intrinsics to give easier access to that functionality. Cocos is a cross platform engine that is particularly popular in the Android world for 2D UIs and game creation. There was some nice optimization work in and around various drawing primitives done by the ARM team for Chromium that ends up helping Cocos.

Thursday included the first bit of quiet time I had all week to actually write some code. It didn’t last long but it did feel good as I’m in a very fun portion of implementing the optimized SQLite with OpenCL and it was hard to set that work aside while Connect is on.

Our first Graphics Working Group session of this Connect was today. We reviewed the ongoing efforts to optimize various media libraries for AARCH64. James Yu and Ragesh Radhakrishnan talked about their work involving libpng, libjpeg-turbo, libvpx and pixman. The libjpeg-turbo work features a refresh of the Android port. This allows libjpeg-turbo to be a drop in replacement for libjpeg, which is currently part of AOSP. The performance difference is clear. Libjpeg-turbo has received a good deal of optimization work over the past few years while libjpeg has languished for a variety of unfortunate political reasons. The result? libjpeg-turbo is better than twice as fast as libjpeg.

James talked about his libvpx work, which of course includes VP8 and VP9 support. He’s essentially replaced the past hand coded assembler with a version that uses NEON intrinsics. Comparing hand coded vs NEON intrinsics on ARMv7, there’s a bit of a degrade in performance that needs to be looked into. It’s nearly 10%. In prior efforts with libpng, a similar switch yielded statistically no change in performance.

On the good news side, progress continues on porting past optimized libraries so that they are also optimized for AARCH64 in preparation for the arrival of real hardware.

The other important goal of the session was to gather input on the next set of libraries that should be optimized for AARCH64. We’ve a limited amount of time. We’ve a limited amount of hands. We won’t have everything optimized by the time real hardware shows up. We want to optimize first what is viewed as most important.

Unfortunately we didn’t get a lot of direction. There seems to be a feeling that codecs and such that are in Android should be considered first priority. Tho there was some rightful dissent on that concept from those interested in traditional Linux.

Besides leaning towards Android, the other concept that seemed to be there was that we should give priority to video over audio. This makes sense: while default fallbacks for audio use more CPU, they generally won’t completely consume a CPU with loss of quality, as compared to a video codec that will suffer in frame rate and quality.

libav was discussed as being potentially important to 3rd party application developers. There don’t appear to be any solid usage numbers identified as of yet as far as how much libav is in use by 3rd party application developers on Android. I would presume there must be some as libav is a good choice for “odd” formats.

In the afternoon we reviewed our internal list of codecs and media libs for porting and optimizing for AARCH64, and with some other attendees from Connect we assembled our list of the “next” libs to put time and attention into. We’ll be reviewing this with others first but generally speaking it looks something like this:

  1. mpeg4
  2. webrtc (audio portion)
  3. aac  (from AOSP)
  4. flac
  5. vorbis
  6. mp3
  7. h265
  8. speex

libav is a discussion point yet so that might very well find its way onto the list.

LCA14 GWG Day 2

Posted: March 4, 2014 in Uncategorized

Tuesday was yet another day with no sessions hosted by the Graphics Working Group. Wednesday, Thursday and Friday are our big days. So for us it was a day of our own meetings and attending working group sessions.

I went to the 64 bit toolchain status meeting. It was good to hear that LLVM’s MCJIT is at least able to pass its tests on AARCH64. This is important for future versions of OpenCL, for instance, on AARCH64. I do wish there was a bit more focus on LLVM for AARCH64 tho. With significant projects like Chromium seriously considering a move to LLVM it seems wise.

The next session I went to was the one on SQLite optimization. SQLite is a database at the heart of, and in very common use across, Android, iOS, Linux and so on. It’s an important foundation block so its optimization can be valuable. Using the cortex strings work and applying that to SQLite on Android yielded the Linaro Android team some significant results (20-35%). The Android use case is an interesting one in that the databases tend to be smaller, so while speed is one factor to optimize for, space and battery usage are also important. This work came at it only from the speed angle. It’s still a work in progress so I’ll stay tuned.

Related to that as part of our GPGPU efforts I’m in the midst of optimizing SQLite with OpenCL. I’m at the stage of writing the OpenCL kernels so it’s too early to start talking about microbenchmark results. I’m also aiming at a slightly different use case. I’m looking at more sizable databases and more common operations that would be found with use as part of a LAMP or LAMP-like stack. I’ll be talking about this effort as part of the GPGPU session on Thursday.

During the afternoon hacking time, I was in solid meetings all day. This is the life of a tech lead sometimes. While it would be great to be heads down in vi this week, doing so would mean losing out on many wonderful opportunities to talk to many people.

On Tuesday about 1/2 of GWG and the LHG got together and were talking about a variety of topics. Mostly at an architectural level and some of the factors that go into making good technology choices when it comes to video playback and so on. As LHG gets off the ground they have a number of interesting (and fun!) challenges ahead of them.

The Media and Libs subteam got together and we held a project review. This is a perfect activity for Connect. It is quite good to perform detailed reviews from time to time. The good news is that the libvpx (VP8 & VP9), libjpeg-turbo and pixman optimization efforts for AARCH64 are coming along quite well with patches flowing upstream. Today (Wednesday) we’ll be meeting to set the next list of libs that we will optimize for AARCH64.

Monday was the first day of Linaro Connect here in Macau. It was also a somewhat lighter day for the Graphics Working Group as our team wasn’t hosting any sessions today. 

I attended the ARM VM standards meeting in the morning. It’s a proposal to put together a standard or whitepaper to give guidance on VMs on ARM systems. What drew me in was the seeming possibility the document would come down with a position on graphics drivers within an ARM VM. Through the course of the meeting it became clear that individual VM implementations such as KVM or Xen would not be mandated to as far as drivers were concerned from a graphics perspective. So all’s well.

I also attended the RDK overview. That was an informational meeting about the RDK that Linaro had helped move over to use OE as a basis. They did a great job to spread out the RDK to use the layer system within OE effectively. 70% of the packages in the resulting new OE based RDK are OE infrastructure. It was good to hear and see the success.

The afternoon was the first day of hacking. We had a short team meeting as I needed to go and present the engineering status for the Graphics Working Group to the Linaro TSC.

Our team goals for the week include:

1 – Display Subteam

Set hwcomposer milestones and dates

Further LSK board support discussions with members

Meet with LHG, come up to speed on LHG directions

2 – Media & Libs Subteam

Set direction for next 6-12 months on which libraries to optimize for AARCH64 next based on member and attendee input

Sync with ARM team also doing work within this space

What tools is the ARM team using in the course of their work?

3 – UMM Subteam

Interactive design and feedback on the user space allocator helper and vetting with the Wayland/Weston zero copy real world

gralloc discussion

4 – GPGPU Subteam

Heads down working out bugs with Shamrock, discuss further feature implementation 

GPGPU whitepaper release

Going crazy with hardware

Posted: February 3, 2014 in Uncategorized

So with the release of the AMD A10-7850 Kaveri, I thought I would jump in and get one. My current “server” is a mere System 76  i5 from about 3 years ago so it’s starting to get a bit old…  The larger reason tho is I wanted to get an HSA compliant system as well as something I could drop a R9 R290 into and use for OpenCL for a whole heck of a lot of GPU goodness. That’s two sets of GPUs,  one on the processor die and one on the graphics card. Instant environment for some benchmarks 🙂 to compare OpenCL on card vs OpenCL where there are shared page tables with the CPU vs eventually HSA.

So what did I pick up?

Well, first the processor, an AMD A10-7850K.

The motherboard it’ll go into is an Asus ATX A88X-Pro which is able to overclock memory to 2400 MHz.

The memory I picked up is two sticks of Corsair Dominator DDR3 PC3 19200.

For cooling I did get a Corsair Hydro H100i. Be interesting to see how well that works. I very much enjoy the ARM world where we don’t have to think about this class of cooling. Having to drop $50-$100 for cooling is less than fun.

Then for a case, and I honestly thought about letting it just sit on the desk, a 550D mid tower also by Corsair.

All in all should be a pretty sweet system. I have a couple of SSD drives I’ll be dropping into it so it should rock out pretty well. It’s not a Mac Pro… which *sigh* I would really like, but for a Linux box it should do well.

After it’s assembled I’ll post again.

chromium & aarch64 ( I wish )

Posted: December 31, 2013 in Uncategorized

I’d rather like to be posting about my various efforts to port Chromium to aarch64. Unfortunately this isn’t that blog post.

I do have a good portion of the chromium test suite built and I do have the actual Chromium binary built too for aarch64. The problem is the environment to test it in. 

When I started working on this I was using the rtsm model and basically a good old fashioned aarch64 kernel with framebuffer. I’d built a more or less slimmed down fvwm and that worked well for running a linux based Chromium for aarch64. Cool.

Times change and progress continues, or at least it should be progress. Unfortunately the rtsm I was using was considered “old” so I moved up to the newer FVP Fastmodel. I’ve pulled in the latest lsk VE kernel hwpack_linaro-lsk-vexpress64-rtsm_20131204-545_arm64_supported.tar.gz and should be good to go, right?

No. Immediate crash trying to start the kernel.

Errrmmm ok. So switching to hwpack_linaro-vexpress64-rtsm_20131125-536_arm64_supported.tar.gz, the resulting image boots. But I had switched over to using the ALIP OE image. As it starts to boot, it crashes with psplash. Awesome. So mounting the file system and disabling that I at least get to a command line prompt. 

So next I’ve built xf86-video-armsoc for aarch64 and pl111 (one small patch needed). Now to try and get X up and running again. It’s really no fun to regress and it’s especially no fun when you have some shiny new binaries you really want to run but can’t but such is the nature of the cutting edge sometimes. Wish me luck. 

Linux Plumbers 2013

Posted: September 26, 2013 in Uncategorized

I attended 2013 Linux Plumbers in New Orleans.  The Graphics and Display mini-conf and the Android and Display mini-conf were the two items of biggest interest to me. Besides myself Sumit Semwal and Ross Oldfield from the Linaro Graphics Working Group were also there. I’m sure they’ll be posting their own thoughts as well.

I won’t cover every topic nor will I dive too deep for all that was part of the mini-confs.

Graphics and Display – Rob Clark

Rob Clark gave a good rundown of all the various completely Open Source driver development activities going on and demonstrated his own project, the open source Freedreno project. Vivante, Adreno, Mali, and Tegra are all seeing various attempts. Some like Freedreno and Lima are getting to be impressively good.

Media Decode and Composition: Bridging the Gap  – Daniel Stone

This was good to see, especially since it’s an issue for set top boxes. It does point the way that EGL probably isn’t the tech of the future here but rather it’s Wayland.

Common Display Framework – Laurent Pinchart

I had high hopes for Laurent’s session; given the makeup of the room I was optimistic we might see some concrete agreements that would help things move along with Laurent’s proposal. From my perspective Laurent received some general guidance after he covered the current state of the design, but general direction such as asking for something more incremental and more based on KMS didn’t give me hope that CDF is close to being merged. I share the opinion that fitting into the KMS infrastructure with appropriate proposals to change KMS seems like the right thing. There wasn’t (or I missed it) any discussion touching on the various protocols which panels use and how to make them into something that would be KMS based.

Sync/dmabuf-fences/dmabuf-sync

The sync discussion seemed quite productive. After setting the context of where and why sync is useful on Android, there was agreement that dma-buf with dma-fences makes sense and should go forward.

Atomic Display Framework (ADF)

ADF is a “new” patch set recently posted that is sort of a next generation hwcomposer. Essentially it covers how to atomically get a number of buffers put together and up on the screen. It’s entirely Android specific which is one of its problems. I wouldn’t be the least bit surprised if this showed up in a future version of Android, which if that happened before it was upstreamed would be unfortunate. :-/

ION

The ION discussion was centered on various issues that should be addressed on the road towards upstreaming, at least that was the goal. Between how constraints should (or shouldn’t) be handled, how to keep performance good when the slow case, especially on Android, is never acceptable, and quite a number of complaints, there didn’t seem to be an “ah ha” moment. As such it seemed more like the discussion was a review of things previously touched on but without exact consensus. It IS a complicated problem. In the spirit of incremental improvement, that kind of complexity makes it difficult to divide and conquer through a series of smaller acceptable patches.

If you haven’t read John Stultz’s lwn.net article on ION, you should. http://lwn.net/Articles/565469/