In what is seeming to be a serial, I’ve been working with my linaro-server image and experimenting with nginx to see how much performance one might be able to eek out of a pandaboard pressed into being a word press server.
This morning I thought I would do a little profiling to get an idea where nginx is processor bound.
When driving work with apache bench, in top I see:
top - 10:46:24 up 8 days, 19:08, 3 users, load average: 0.80, 1.05, 0.63
Tasks: 99 total, 4 running, 95 sleeping, 0 stopped, 0 zombie
Cpu(s): 7.6%us, 21.4%sy, 0.0%ni, 15.4%id, 0.0%wa, 0.0%hi, 55.6%si, 0.0%st
Mem: 974668k total, 825288k used, 149380k free, 142748k buffers
Swap: 1847468k total, 32k used, 1847436k free, 529984k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21165 www-data 20 0 25656 3132 904 R 28 0.3 5:37.05 nginx
3 root 20 0 0 0 0 S 27 0.0 1:47.97 ksoftirqd/0
21168 www-data 20 0 25728 3124 908 R 25 0.3 0:31.56 nginx
31721 root 20 0 0 0 0 S 24 0.0 0:03.63 kworker/0:1
21164 www-data 20 0 25576 3040 904 S 18 0.3 5:39.60 nginx
7 root RT 0 0 0 0 S 15 0.0 1:19.26 watchdog/0
21166 www-data 20 0 25932 3332 904 S 13 0.3 5:43.53 nginx
21167 www-data 20 0 25528 2908 904 S 9 0.3 5:35.88 nginx
21169 www-data 20 0 25160 2668 920 R 6 0.3 5:48.92 nginx
31804 root 20 0 2144 1016 760 R 1 0.1 0:00.94 top
30605 root 20 0 0 0 0 S 1 0.0 0:02.44 kworker/1:1
31688 root 20 0 8172 2632 2008 S 1 0.3 0:01.10 sshd
This is probably a good indication that we’re cpu bound in the nginx code. Let’s however not jump to conclusions.
I was going to install oprofile (old habits die hard) but it’s been removed from the archive. (bug #653168). RIP oprofile. People have moved on to perf.
I installed perf. I quickly ran into the following. Consider
root@linaro-server:~# perf record -a -g sleep 20
perf_3.3.1-39 not found
You may need to install linux-tools-3.3.1-39
You can work around this by calling the versioned perf command explicitly to get around the drain bramage. /usr/bin/perf is nothing more than a shell script anyway that tries to match the version of perf to the kernel you’re running.
to collect system wide
root@linaro-server:~# perf_3.2.0-26 record -a -g sleep 20
Which yields
+ 17.34% nginx [kernel.kallsyms] [k] __do_softirq
+ 11.85% swapper [kernel.kallsyms] [k] omap_default_idle
+ 7.93% nginx [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
+ 3.55% nginx [kernel.kallsyms] [k] __aeabi_llsr
+ 3.13% swapper [kernel.kallsyms] [k] __do_softirq
+ 2.55% ksoftirqd/0 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
+ 1.30% swapper [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
+ 0.95% ksoftirqd/0 [kernel.kallsyms] [k] __aeabi_llsr
+ 0.91% nginx [kernel.kallsyms] [k] sub_preempt_count.part.61
+ 0.71% nginx [kernel.kallsyms] [k] sk_run_filter
+ 0.61% nginx [kernel.kallsyms] [k] add_preempt_count
+ 0.60% nginx [kernel.kallsyms] [k] tcp_clean_rtx_queue
+ 0.60% nginx [kernel.kallsyms] [k] kfree
+ 0.57% nginx [kernel.kallsyms] [k] __copy_skb_header
+ 0.56% nginx [kernel.kallsyms] [k] __rcu_read_unlock
+ 0.53% nginx [kernel.kallsyms] [k] tcp_v4_rcv
+ 0.52% nginx [kernel.kallsyms] [k] kmem_cache_alloc
+ 0.52% nginx [kernel.kallsyms] [k] _raw_spin_unlock
+ 0.49% nginx [kernel.kallsyms] [k] __netif_receive_skb
+ 0.44% nginx [kernel.kallsyms] [k] kmem_cache_free
+ 0.40% nginx [kernel.kallsyms] [k] skb_release_data
+ 0.38% nginx [kernel.kallsyms] [k] skb_release_head_state
+ 0.38% nginx [kernel.kallsyms] [k] __inet_lookup_established
+ 0.36% nginx [kernel.kallsyms] [k] sub_preempt_count
+ 0.36% nginx [kernel.kallsyms] [k] pfifo_fast_dequeue
+ 0.36% nginx [kernel.kallsyms] [k] packet_rcv_spkt
+ 0.35% swapper [kernel.kallsyms] [k] __aeabi_llsr
+ 0.35% nginx [kernel.kallsyms] [k] __copy_from_user
+ 0.34% nginx [kernel.kallsyms] [k] __memzero
+ 0.32% nginx [kernel.kallsyms] [k] __rcu_read_lock
+ 0.30% nginx [kernel.kallsyms] [k] __kfree_skb
+ 0.30% nginx [kernel.kallsyms] [k] skb_push
+ 0.30% nginx [kernel.kallsyms] [k] tcp_ack
+ 0.30% ksoftirqd/0 [kernel.kallsyms] [k] tcp_v4_rcv
+ 0.29% nginx [kernel.kallsyms] [k] kfree_skbmem
+ 0.28% nginx [kernel.kallsyms] [k] ip_rcv
+ 0.27% nginx [kernel.kallsyms] [k] sch_direct_xmit
+ 0.27% nginx [kernel.kallsyms] [k] ip_route_input_common
+ 0.27% ksoftirqd/0 [kernel.kallsyms] [k] tcp_clean_rtx_queue
+ 0.26% nginx [kernel.kallsyms] [k] enqueue_to_backlog
+ 0.26% ksoftirqd/0 [kernel.kallsyms] [k] kfree
+ 0.24% nginx [kernel.kallsyms] [k] __skb_clone
+ 0.24% nginx [kernel.kallsyms] [k] _raw_spin_unlock_irq
+ 0.24% ksoftirqd/0 [kernel.kallsyms] [k] sk_run_filter
+ 0.24% nginx [kernel.kallsyms] [k] dev_queue_xmit
+ 0.24% nginx [kernel.kallsyms] [k] __qdisc_run
+ 0.24% nginx [kernel.kallsyms] [k] tcp_rcv_state_process
+ 0.24% nginx [kernel.kallsyms] [k] tcp_transmit_skb
+ 0.23% nginx [kernel.kallsyms] [k] tcp_validate_incoming
+ 0.23% nginx [kernel.kallsyms] [k] skb_clone
+ 0.23% ksoftirqd/0 [kernel.kallsyms] [k] __do_softirq
+ 0.22% nginx [kernel.kallsyms] [k] ip_local_deliver_finish
+ 0.22% nginx [kernel.kallsyms] [k] __raw_spin_lock_bh
+ 0.22% ksoftirqd/0 [kernel.kallsyms] [k] __copy_skb_header
+ 0.21% nginx [kernel.kallsyms] [k] memcpy
And then collecting from a worker bee
root@linaro-server:~# perf_3.2.0-26 record -a -g -p 21164 sleep 20
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.178 MB perf.data (~7759 samples) ]
Worker bee data:
+ 24.15% nginx [kernel.kallsyms] [k] __do_softirq
+ 11.56% nginx [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
+ 3.91% nginx [kernel.kallsyms] [k] __aeabi_llsr
+ 1.81% nginx [kernel.kallsyms] [k] sub_preempt_count.part.61
+ 1.08% nginx [kernel.kallsyms] [k] tcp_v4_rcv
+ 1.02% nginx [kernel.kallsyms] [k] __rcu_read_unlock
+ 0.96% nginx [kernel.kallsyms] [k] sub_preempt_count
+ 0.91% nginx [kernel.kallsyms] [k] __copy_skb_header
+ 0.85% nginx [kernel.kallsyms] [k] kmem_cache_alloc
+ 0.79% nginx [kernel.kallsyms] [k] __copy_from_user
+ 0.79% nginx [kernel.kallsyms] [k] skb_push
+ 0.79% nginx [kernel.kallsyms] [k] add_preempt_count
+ 0.68% nginx [kernel.kallsyms] [k] kfree
+ 0.68% nginx [kernel.kallsyms] [k] memcpy
+ 0.68% nginx [kernel.kallsyms] [k] _raw_spin_unlock
+ 0.62% nginx [kernel.kallsyms] [k] __rcu_read_lock
+ 0.57% nginx [kernel.kallsyms] [k] vector_swi
+ 0.57% nginx [kernel.kallsyms] [k] kfree_skbmem
+ 0.57% nginx [kernel.kallsyms] [k] dev_hard_start_xmit
+ 0.57% nginx [kernel.kallsyms] [k] sk_run_filter
+ 0.57% nginx [kernel.kallsyms] [k] __inet_lookup_established
+ 0.51% nginx [kernel.kallsyms] [k] __netif_receive_skb
+ 0.51% nginx [kernel.kallsyms] [k] dev_queue_xmit
+ 0.51% nginx [kernel.kallsyms] [k] ip_rcv
+ 0.51% nginx [kernel.kallsyms] [k] tcp_recvmsg
+ 0.51% nginx [kernel.kallsyms] [k] packet_rcv_spkt
+ 0.45% nginx [kernel.kallsyms] [k] get_parent_ip
+ 0.45% nginx [kernel.kallsyms] [k] __kmalloc
+ 0.45% nginx [kernel.kallsyms] [k] usbnet_start_xmit
+ 0.45% nginx [kernel.kallsyms] [k] ip_local_deliver_finish
+ 0.45% nginx [kernel.kallsyms] [k] tcp_rcv_state_process
+ 0.45% nginx [kernel.kallsyms] [k] _raw_spin_lock
+ 0.40% nginx libc-2.15.so [.] memcpy
+ 0.40% nginx [kernel.kallsyms] [k] kmem_cache_free
+ 0.40% nginx [kernel.kallsyms] [k] smsc95xx_tx_fixup
+ 0.40% nginx [kernel.kallsyms] [k] sk_stream_alloc_skb
+ 0.40% nginx [kernel.kallsyms] [k] tcp_clean_rtx_queue
0.34% nginx libc-2.15.so [.] _int_free
+ 0.34% nginx [kernel.kallsyms] [k] local_bh_enable
+ 0.34% nginx [kernel.kallsyms] [k] __aeabi_idiv
+ 0.34% nginx [kernel.kallsyms] [k] strlen
+ 0.34% nginx [kernel.kallsyms] [k] usbnet_bh
+ 0.34% nginx [kernel.kallsyms] [k] kfree_skb
+ 0.34% nginx [kernel.kallsyms] [k] net_rx_action
+ 0.34% nginx [kernel.kallsyms] [k] sch_direct_xmit
+ 0.34% nginx [kernel.kallsyms] [k] tcp_sendmsg
+ 0.34% nginx [kernel.kallsyms] [k] tcp_ack
+ 0.34% nginx [kernel.kallsyms] [k] __raw_spin_lock_irqsave
+ 0.28% nginx nginx [.] ngx_radix_tree_create
0.28% nginx nginx [.] ngx_http_core_merge_loc_conf
0.28% nginx nginx [.] ngx_http_header_filter
+ 0.28% nginx nginx [.] 0x5e5be
+ 0.28% nginx [kernel.kallsyms] [k] in_lock_functions
+ 0.28% nginx [kernel.kallsyms] [k] aa_revalidate_sk
+ 0.28% nginx [kernel.kallsyms] [k] __aeabi_uidiv
+ 0.28% nginx [kernel.kallsyms] [k] md5_transform
+ 0.28% nginx [kernel.kallsyms] [k] illegal_highdma
+ 0.28% nginx [kernel.kallsyms] [k] enqueue_to_backlog
Pretty consistant system wide and with an nginx worker bee process. We’re spending a lot of time in kernel in __do_softirq. This makes sense as we’re driving a lot of network activity. There doesn’t appear to be any nginx related low hanging fruit. It’s probably time to take a look at the network device driver for the panda board for both performance, the page allocation failures and connection resets that apache bench reports when the number of concurrent connections is in the 400+ range.
Update:
Now at this point you might be thinking the next step in the journey is into the kernel source code. Not quite. Don’t forget that located in /proc/sys/ are a number of kernel values which can be set for various work loads without a recompile of the kernel.
In our case let’s walk through the list.
cd /proc/sys
cat fs/file-max
80615
file-max is the maximum number of open files on your system. 80,000 might seem like a big number but recall that between the front end and back end, we are using /dev/shm/php-fpm-www.sock so as traffic increases the number of open files through sockets is going to rise dramatically. Here’s how you set the limit for each boot.
echo 'fs.file-max=209708' >> /etc/sysctl.conf
Next is netdev_max_backlog which is the maximum number of packets, queued on the INPUT side.
cat net/core/netdev_max_backlog
1000
A 1000 for all the network traffic we’d like to be able to handle is quite low.
echo 'net.core.netdev_max_backlog = 5000' >> /etc/sysctl.conf
Next is net.core.somaxconn which limits the socket listen() backlog. Another value to increase in order to handle lots of incoming connections.
cat net/core/somaxconn
128
128! A low number so another value to increase.
echo 'net.core.somaxconn = 4096' >> /etc/sysctl.conf
Last net.ipv4.tcp_max_syn_backlog sets the limit on sockets which are incomplete sockets.
cat net/ipv4/tcp_max_syn_backlog
512
For our purposes this is low so this is another value to increase.
echo 'net.ipv4.tcp_max_syn_backlog = 4096' >> /etc/sysctl.conf
At this point you might think you need to reboot your machine for the new values to take effect. No. All you need do is this:
sysctl -p
Doing so with our new values in place and now apache bench runs with concurrent requests set in the 500, 600, 700 ranges no longer fail to complete. Things are stable such that I can do more scientific measurements, compute confident intervals and so on as you’d expect to put some strong meaning behind the numbers.
Thus far between the 3 blog posts to tune a nginx wordpress server on ARM, what have we done that was ARM specific? Not much. These last kernel settings are more of a concern on any machine with modest resources, they aren’t ARM specific. When tuning a Linux server, you’re typically going to look at these kinds of values.
Are we done? If we want to go hardcore, I think we’d up to the point where profiling the pandaboard ethernet driver would be the next step. No small piece of work but probably quite interesting.