ARM Profiling and nginx

Posted: July 5, 2012 in linaro, server

In what is becoming a series, I've been working with my linaro-server image and experimenting with nginx to see how much performance one might be able to eke out of a Pandaboard pressed into service as a WordPress server.

This morning I thought I would do a little profiling to get an idea where nginx is processor bound.
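The load itself comes from apache bench (ab). The URL, request count, and concurrency below are illustrative stand-ins rather than the exact values used; -n is the total number of requests and -c the number of concurrent connections:

ab -n 10000 -c 400 http://linaro-server/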

While driving load with apache bench, in top I see:

top - 10:46:24 up 8 days, 19:08,  3 users,  load average: 0.80, 1.05, 0.63
Tasks:  99 total,   4 running,  95 sleeping,   0 stopped,   0 zombie
Cpu(s):  7.6%us, 21.4%sy,  0.0%ni, 15.4%id,  0.0%wa,  0.0%hi, 55.6%si,  0.0%st
Mem:    974668k total,   825288k used,   149380k free,   142748k buffers
Swap:  1847468k total,       32k used,  1847436k free,   529984k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                            
21165 www-data  20   0 25656 3132  904 R   28  0.3   5:37.05 nginx                               
    3 root      20   0     0    0    0 S   27  0.0   1:47.97 ksoftirqd/0                         
21168 www-data  20   0 25728 3124  908 R   25  0.3   0:31.56 nginx                               
31721 root      20   0     0    0    0 S   24  0.0   0:03.63 kworker/0:1                         
21164 www-data  20   0 25576 3040  904 S   18  0.3   5:39.60 nginx                               
    7 root      RT   0     0    0    0 S   15  0.0   1:19.26 watchdog/0                          
21166 www-data  20   0 25932 3332  904 S   13  0.3   5:43.53 nginx                               
21167 www-data  20   0 25528 2908  904 S    9  0.3   5:35.88 nginx                               
21169 www-data  20   0 25160 2668  920 R    6  0.3   5:48.92 nginx                               
31804 root      20   0  2144 1016  760 R    1  0.1   0:00.94 top                                 
30605 root      20   0     0    0    0 S    1  0.0   0:02.44 kworker/1:1                         
31688 root      20   0  8172 2632 2008 S    1  0.3   0:01.10 sshd

This is probably a good indication that we're CPU bound in the nginx code, although the 55.6% si (softirq) figure in the Cpu(s) line hints that much of that time is spent servicing interrupts. Let's not jump to conclusions, however.

I was going to install oprofile (old habits die hard) but it's been removed from the archive (bug #653168). RIP, oprofile. People have moved on to perf.

I installed perf and quickly ran into the following:

root@linaro-server:~# perf record -a -g sleep 20 
perf_3.3.1-39 not found
You may need to install linux-tools-3.3.1-39

You can work around this bit of brain damage by calling the versioned perf command explicitly; /usr/bin/perf is nothing more than a shell script that tries to match the perf version to the kernel you're running.
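For the curious, the wrapper logic boils down to something like this. This is a paraphrase of the idea, not the actual script shipped in the package:

# /usr/bin/perf, roughly: build a version string from the running kernel
# and exec the matching versioned binary.
version=$(uname -r | cut -d- -f1,2)   # e.g. "3.3.1-39"
exec "perf_${version}" "$@"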

To collect system-wide data:

root@linaro-server:~# perf_3.2.0-26 record -a -g sleep 20
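The recorded samples are then browsed with the matching versioned report command:

root@linaro-server:~# perf_3.2.0-26 report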

Which yields

+  17.34%          nginx  [kernel.kallsyms]   [k] __do_softirq
+  11.85%        swapper  [kernel.kallsyms]   [k] omap_default_idle
+   7.93%          nginx  [kernel.kallsyms]   [k] _raw_spin_unlock_irqrestore
+   3.55%          nginx  [kernel.kallsyms]   [k] __aeabi_llsr
+   3.13%        swapper  [kernel.kallsyms]   [k] __do_softirq
+   2.55%    ksoftirqd/0  [kernel.kallsyms]   [k] _raw_spin_unlock_irqrestore
+   1.30%        swapper  [kernel.kallsyms]   [k] _raw_spin_unlock_irqrestore
+   0.95%    ksoftirqd/0  [kernel.kallsyms]   [k] __aeabi_llsr
+   0.91%          nginx  [kernel.kallsyms]   [k] sub_preempt_count.part.61
+   0.71%          nginx  [kernel.kallsyms]   [k] sk_run_filter
+   0.61%          nginx  [kernel.kallsyms]   [k] add_preempt_count
+   0.60%          nginx  [kernel.kallsyms]   [k] tcp_clean_rtx_queue
+   0.60%          nginx  [kernel.kallsyms]   [k] kfree
+   0.57%          nginx  [kernel.kallsyms]   [k] __copy_skb_header
+   0.56%          nginx  [kernel.kallsyms]   [k] __rcu_read_unlock
+   0.53%          nginx  [kernel.kallsyms]   [k] tcp_v4_rcv
+   0.52%          nginx  [kernel.kallsyms]   [k] kmem_cache_alloc
+   0.52%          nginx  [kernel.kallsyms]   [k] _raw_spin_unlock
+   0.49%          nginx  [kernel.kallsyms]   [k] __netif_receive_skb
+   0.44%          nginx  [kernel.kallsyms]   [k] kmem_cache_free
+   0.40%          nginx  [kernel.kallsyms]   [k] skb_release_data
+   0.38%          nginx  [kernel.kallsyms]   [k] skb_release_head_state
+   0.38%          nginx  [kernel.kallsyms]   [k] __inet_lookup_established
+   0.36%          nginx  [kernel.kallsyms]   [k] sub_preempt_count
+   0.36%          nginx  [kernel.kallsyms]   [k] pfifo_fast_dequeue
+   0.36%          nginx  [kernel.kallsyms]   [k] packet_rcv_spkt
+   0.35%        swapper  [kernel.kallsyms]   [k] __aeabi_llsr
+   0.35%          nginx  [kernel.kallsyms]   [k] __copy_from_user
+   0.34%          nginx  [kernel.kallsyms]   [k] __memzero
+   0.32%          nginx  [kernel.kallsyms]   [k] __rcu_read_lock
+   0.30%          nginx  [kernel.kallsyms]   [k] __kfree_skb
+   0.30%          nginx  [kernel.kallsyms]   [k] skb_push
+   0.30%          nginx  [kernel.kallsyms]   [k] tcp_ack
+   0.30%    ksoftirqd/0  [kernel.kallsyms]   [k] tcp_v4_rcv
+   0.29%          nginx  [kernel.kallsyms]   [k] kfree_skbmem
+   0.28%          nginx  [kernel.kallsyms]   [k] ip_rcv
+   0.27%          nginx  [kernel.kallsyms]   [k] sch_direct_xmit
+   0.27%          nginx  [kernel.kallsyms]   [k] ip_route_input_common
+   0.27%    ksoftirqd/0  [kernel.kallsyms]   [k] tcp_clean_rtx_queue
+   0.26%          nginx  [kernel.kallsyms]   [k] enqueue_to_backlog
+   0.26%    ksoftirqd/0  [kernel.kallsyms]   [k] kfree
+   0.24%          nginx  [kernel.kallsyms]   [k] __skb_clone
+   0.24%          nginx  [kernel.kallsyms]   [k] _raw_spin_unlock_irq
+   0.24%    ksoftirqd/0  [kernel.kallsyms]   [k] sk_run_filter
+   0.24%          nginx  [kernel.kallsyms]   [k] dev_queue_xmit
+   0.24%          nginx  [kernel.kallsyms]   [k] __qdisc_run
+   0.24%          nginx  [kernel.kallsyms]   [k] tcp_rcv_state_process
+   0.24%          nginx  [kernel.kallsyms]   [k] tcp_transmit_skb
+   0.23%          nginx  [kernel.kallsyms]   [k] tcp_validate_incoming
+   0.23%          nginx  [kernel.kallsyms]   [k] skb_clone
+   0.23%    ksoftirqd/0  [kernel.kallsyms]   [k] __do_softirq
+   0.22%          nginx  [kernel.kallsyms]   [k] ip_local_deliver_finish
+   0.22%          nginx  [kernel.kallsyms]   [k] __raw_spin_lock_bh
+   0.22%    ksoftirqd/0  [kernel.kallsyms]   [k] __copy_skb_header
+   0.21%          nginx  [kernel.kallsyms]   [k] memcpy

And then collecting from a worker bee

root@linaro-server:~# perf_3.2.0-26 record -a -g -p 21164 sleep 20 
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.178 MB perf.data (~7759 samples) ]

Worker bee data:

+  24.15%  nginx  [kernel.kallsyms]   [k] __do_softirq
+  11.56%  nginx  [kernel.kallsyms]   [k] _raw_spin_unlock_irqrestore
+   3.91%  nginx  [kernel.kallsyms]   [k] __aeabi_llsr
+   1.81%  nginx  [kernel.kallsyms]   [k] sub_preempt_count.part.61
+   1.08%  nginx  [kernel.kallsyms]   [k] tcp_v4_rcv
+   1.02%  nginx  [kernel.kallsyms]   [k] __rcu_read_unlock
+   0.96%  nginx  [kernel.kallsyms]   [k] sub_preempt_count
+   0.91%  nginx  [kernel.kallsyms]   [k] __copy_skb_header
+   0.85%  nginx  [kernel.kallsyms]   [k] kmem_cache_alloc
+   0.79%  nginx  [kernel.kallsyms]   [k] __copy_from_user
+   0.79%  nginx  [kernel.kallsyms]   [k] skb_push
+   0.79%  nginx  [kernel.kallsyms]   [k] add_preempt_count
+   0.68%  nginx  [kernel.kallsyms]   [k] kfree
+   0.68%  nginx  [kernel.kallsyms]   [k] memcpy
+   0.68%  nginx  [kernel.kallsyms]   [k] _raw_spin_unlock
+   0.62%  nginx  [kernel.kallsyms]   [k] __rcu_read_lock
+   0.57%  nginx  [kernel.kallsyms]   [k] vector_swi
+   0.57%  nginx  [kernel.kallsyms]   [k] kfree_skbmem
+   0.57%  nginx  [kernel.kallsyms]   [k] dev_hard_start_xmit
+   0.57%  nginx  [kernel.kallsyms]   [k] sk_run_filter
+   0.57%  nginx  [kernel.kallsyms]   [k] __inet_lookup_established
+   0.51%  nginx  [kernel.kallsyms]   [k] __netif_receive_skb
+   0.51%  nginx  [kernel.kallsyms]   [k] dev_queue_xmit
+   0.51%  nginx  [kernel.kallsyms]   [k] ip_rcv
+   0.51%  nginx  [kernel.kallsyms]   [k] tcp_recvmsg
+   0.51%  nginx  [kernel.kallsyms]   [k] packet_rcv_spkt
+   0.45%  nginx  [kernel.kallsyms]   [k] get_parent_ip
+   0.45%  nginx  [kernel.kallsyms]   [k] __kmalloc
+   0.45%  nginx  [kernel.kallsyms]   [k] usbnet_start_xmit
+   0.45%  nginx  [kernel.kallsyms]   [k] ip_local_deliver_finish
+   0.45%  nginx  [kernel.kallsyms]   [k] tcp_rcv_state_process
+   0.45%  nginx  [kernel.kallsyms]   [k] _raw_spin_lock
+   0.40%  nginx  libc-2.15.so        [.] memcpy
+   0.40%  nginx  [kernel.kallsyms]   [k] kmem_cache_free
+   0.40%  nginx  [kernel.kallsyms]   [k] smsc95xx_tx_fixup
+   0.40%  nginx  [kernel.kallsyms]   [k] sk_stream_alloc_skb
+   0.40%  nginx  [kernel.kallsyms]   [k] tcp_clean_rtx_queue
    0.34%  nginx  libc-2.15.so        [.] _int_free
+   0.34%  nginx  [kernel.kallsyms]   [k] local_bh_enable
+   0.34%  nginx  [kernel.kallsyms]   [k] __aeabi_idiv
+   0.34%  nginx  [kernel.kallsyms]   [k] strlen
+   0.34%  nginx  [kernel.kallsyms]   [k] usbnet_bh
+   0.34%  nginx  [kernel.kallsyms]   [k] kfree_skb
+   0.34%  nginx  [kernel.kallsyms]   [k] net_rx_action
+   0.34%  nginx  [kernel.kallsyms]   [k] sch_direct_xmit
+   0.34%  nginx  [kernel.kallsyms]   [k] tcp_sendmsg
+   0.34%  nginx  [kernel.kallsyms]   [k] tcp_ack
+   0.34%  nginx  [kernel.kallsyms]   [k] __raw_spin_lock_irqsave
+   0.28%  nginx  nginx               [.] ngx_radix_tree_create
    0.28%  nginx  nginx               [.] ngx_http_core_merge_loc_conf
    0.28%  nginx  nginx               [.] ngx_http_header_filter
+   0.28%  nginx  nginx               [.] 0x5e5be
+   0.28%  nginx  [kernel.kallsyms]   [k] in_lock_functions
+   0.28%  nginx  [kernel.kallsyms]   [k] aa_revalidate_sk
+   0.28%  nginx  [kernel.kallsyms]   [k] __aeabi_uidiv
+   0.28%  nginx  [kernel.kallsyms]   [k] md5_transform
+   0.28%  nginx  [kernel.kallsyms]   [k] illegal_highdma
+   0.28%  nginx  [kernel.kallsyms]   [k] enqueue_to_backlog

Pretty consistent between the system-wide profile and the single nginx worker bee process. We're spending a lot of time in the kernel in __do_softirq, which makes sense as we're driving a lot of network activity. There doesn't appear to be any nginx-related low-hanging fruit. It's probably time to take a look at the network device driver for the Pandaboard, both for performance and for the page allocation failures and connection resets that apache bench reports when the number of concurrent connections is in the 400+ range.
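If you want to confirm that it really is network receive and transmit processing eating those softirq cycles, the per-CPU counters in /proc/softirqs are a quick sanity check. This wasn't part of the run above, just an easy follow-up:

root@linaro-server:~# watch -n1 cat /proc/softirqs

Under an apache bench run, the NET_RX and NET_TX rows should be the ones climbing fastest.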

Update:

Now at this point you might be thinking the next step in the journey is into the kernel source code. Not quite. Don't forget that /proc/sys/ exposes a number of kernel tunables which can be adjusted for different workloads without recompiling the kernel.
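The files under /proc/sys map directly onto sysctl keys (swap the slashes for dots), so each value can be read or written either through the filesystem or through the sysctl command:

sysctl fs.file-max                # same value as cat /proc/sys/fs/file-max
sysctl -w fs.file-max=209708      # takes effect immediately, but is lost on reboot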

In our case let’s walk through the list.

cd /proc/sys
cat fs/file-max 
80615

file-max is the maximum number of file handles the kernel will allocate system-wide. 80,000 might seem like a big number, but recall that between the front end and back end we are using /dev/shm/php-fpm-www.sock, so as traffic increases the number of open file descriptors consumed by sockets rises dramatically. Here's how you raise the limit so it persists across boots:

echo 'fs.file-max=209708' >> /etc/sysctl.conf
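To see how close the box is getting to that limit while a benchmark run is in flight, fs/file-nr reports the number of allocated file handles, the number free, and the maximum:

cat fs/file-nr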

Next is netdev_max_backlog, the maximum number of packets queued on the INPUT side when the interface receives packets faster than the kernel can process them.

cat net/core/netdev_max_backlog 
1000

1000 is quite low for the amount of network traffic we'd like to be able to handle.

echo 'net.core.netdev_max_backlog = 5000' >> /etc/sysctl.conf

Next is net.core.somaxconn, which limits the backlog a socket can request with listen(), and so caps how many pending connections can queue up waiting for nginx to accept them.

cat net/core/somaxconn 
128

128! A low number, so another value to increase.

echo 'net.core.somaxconn = 4096' >> /etc/sysctl.conf
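One caveat: somaxconn is only the ceiling the kernel will allow; the listening application still asks for its own backlog. In nginx that's the backlog parameter of the listen directive (the Linux default is 511 in the versions I've looked at, so check yours), so to actually take advantage of the larger queue the server block needs something like:

listen 80 backlog=4096;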

Last, net.ipv4.tcp_max_syn_backlog limits the number of half-open connections, i.e. sockets for which the three-way handshake has not yet completed.

cat net/ipv4/tcp_max_syn_backlog 
512

For our purposes this is low, so it's another value to increase.

echo 'net.ipv4.tcp_max_syn_backlog = 4096' >> /etc/sysctl.conf

At this point you might think you need to reboot the machine for the new values to take effect. No. All you need to do is:

sysctl -p
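You can spot check that the new values took effect by querying them back, for example:

root@linaro-server:~# sysctl net.core.somaxconn
net.core.somaxconn = 4096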

With the new values in place, apache bench runs with concurrency set in the 500, 600, and 700 range no longer fail to complete. Things are now stable enough that I can take more scientific measurements, compute confidence intervals, and so on, to put some real meaning behind the numbers.

Thus far, across the three blog posts on tuning an nginx WordPress server on ARM, what have we done that was ARM specific? Not much. These last kernel settings are a concern on any machine with modest resources; they aren't ARM specific. When tuning a Linux server, you're typically going to look at these kinds of values.

Are we done? If we want to go hardcore, I think we're at the point where profiling the Pandaboard ethernet driver would be the next step. No small piece of work, but probably quite interesting.
