This work was done by our intern Shravya Kaudki Srinivas during the summer of 2018.
The Edge team at Salesforce has a mission to improve network performance and deliver the best end-user performance for all Salesforce applications. To accomplish this, we work on a traffic-path component called Accelerator (Accd), which has a user-space TCP stack that terminates TCP connections from clients and delivers optimizations driven by machine-learning algorithms. The primary motivation for a user-space TCP stack is that it provides a quick and safe mechanism to incorporate the latest features, whereas upgrading the host kernel is quite challenging.
Accd uses a TCP stack that is a derivative of the FreeBSD 9.1 kernel. The stack is incorporated into a user-space daemon as a static library. Over time the user-space FreeBSD stack has accumulated many inline changes that make it hard to pull in updates from upstream FreeBSD. Hence, we need an architecture in which the inline changes can be pulled out and applied on top of an unmodified user-space stack that can be updated independently. Of the various kernel options, we prefer a Linux-based stack since it has many of the recent transport-layer enhancements we are interested in (e.g., QUIC, TCP BBR).
Linux Kernel Library (LKL) is an architecture port of the Linux kernel into an open-source library. It is a viable approach for us since it allows us to use the Linux networking stack in user space. Moreover, it has been designed so that recent kernel updates can be easily and regularly merged in, which meets our requirement of combining the latest open-source advances with our congestion control algorithms.
In order to assess the performance of LKL, we used a testbed as shown in Fig. 1. It consists of a client, a proxy, and a server running on three containers. Since Accd behaves like a TCP proxy, this architecture is very relevant to us. In the LKL setup, the TCP proxy links with LKL and uses the LKL-based user-space TCP stack. In the non-LKL setup, the TCP proxy uses the Linux kernel TCP stack. We used wrk as the benchmarking tool since it can generate significant load and reports the number of requests/second handled by the server and the average latency across all requests. We observed a large gap, roughly 10x, in requests/second and average latency between LKL and the Linux host stack.
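At its core, the proxy's data path is a relay loop that copies bytes between the client-facing and server-facing sockets. The minimal sketch below (the function name is our own for illustration, not from Accd) shows that forwarding skeleton; the real proxy multiplexes many such connection pairs and applies its optimizations in between:

```c
#include <unistd.h>
#include <sys/types.h>

/* Forward one chunk of data from src to dst.
 * Returns bytes forwarded, 0 on EOF, -1 on error.
 * A real proxy would call this (or its event-driven
 * equivalent) for both directions of every
 * client<->server connection pair. */
ssize_t relay_once(int src, int dst)
{
    char buf[4096];
    ssize_t n = read(src, buf, sizeof buf);
    if (n <= 0)
        return n;              /* EOF or read error */
    ssize_t off = 0;
    while (off < n) {          /* handle short writes */
        ssize_t w = write(dst, buf + off, n - off);
        if (w < 0)
            return -1;
        off += w;
    }
    return n;
}
```

In the LKL setup, the reads and writes on the client side would go through LKL's syscall interface instead of the host kernel's, which is where the per-syscall overhead discussed below comes in.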
The main issue with the version of LKL (commit hash 4ff4382) we used is that it did not support an SMP architecture. In addition, each LKL syscall costs about twice as much as a host syscall because of two additional context switches. Finally, LKL threads contend for a global lock, essentially making it behave like a single-threaded process. We applied the following optimizations:
- Pinning LKL threads to a single core to improve cache locality and avoid inter-processor interrupts
- Offloading checksum computation and TCP segmentation to the NIC
- Incorporating mergeable RX mode, which generates one interrupt for receive events across multiple connections instead of one interrupt per receive event per connection
- Increasing the size of the read and write buffers to reduce the number of LKL syscalls made
- Working around the lack of SMP support by running multiple LKL instances (Fig. 1.3), each pinned to a separate core, with a load balancer splitting requests among the instances; this also reduces the number of LKL syscalls made by any one instance
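Two of these optimizations have direct host-side analogues that are easy to sketch: pinning a thread to one core with `pthread_setaffinity_np`, and enlarging socket buffers with `setsockopt` so each syscall moves more data. The helper names below are illustrative only, not taken from the LKL or Accd code:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <sys/socket.h>

/* Pin the calling thread to a single CPU, analogous to pinning
 * LKL threads to one core for cache locality and to avoid IPIs.
 * Returns 0 on success. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}

/* Enlarge a socket's send and receive buffers so each read/write
 * (and hence each syscall) can move more data. Returns 0 on success. */
int tune_buffers(int fd, int bytes)
{
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof bytes) != 0)
        return -1;
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof bytes);
}
```

The NIC offload and mergeable-RX changes, by contrast, live in LKL's virtio device layer and have no such small user-space equivalent.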
As shown in Fig. 2 and Fig. 3, with these optimizations and multiple LKL instances, the gap between LKL and the host narrows to around 2x in both latency and requests/second. LKL is a viable approach for us since optimized LKL not only improves upon our current FreeBSD stack but also comes within 2x of vanilla Linux performance.
LKL opens up exciting possibilities to leverage the latest transport-layer advances while providing a viable user-space TCP stack.