Kernel-bypass networking is gaining popularity. This means moving control of Ethernet hardware directly into userspace processes to avoid the overhead of communicating with the operating system kernel. This gives userspace all of the raw performance traditionally enjoyed by the kernel – and all of the responsibility too. This is important for certain specialized applications that can gain as much as 20x more performance.
splice() is a Linux system call that moves data between two file descriptors (one of which must be a pipe), so it can be used to copy data from a socket to another socket or to a file. Its big advantage is that the transfer is zero-copy: the data stays in kernel space and is never copied to user space, and that makes a big difference.
This code shows how to use splice from Python. From the description:
Using the ‘splice’ syscall from Python, in this demonstration to transfer the output of some process to a client through a socket, using zero-copy transfers.
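To give an idea of what this looks like, here is a minimal sketch (not the linked code) that uses os.splice(), which is available on Linux with Python 3.10 or newer. The helper names splice_all and pipe_to_socket_demo are made up for this illustration, and a plain pipe stands in for the output of a process.

```python
import os
import socket


def splice_all(src_fd, dst_fd, chunk=64 * 1024):
    """Move bytes from src_fd to dst_fd inside the kernel until EOF.

    At least one of the two descriptors must be a pipe.
    Returns the total number of bytes moved.
    """
    total = 0
    while True:
        moved = os.splice(src_fd, dst_fd, chunk)
        if moved == 0:  # EOF on the source
            break
        total += moved
    return total


def pipe_to_socket_demo():
    """Send the contents of a pipe to a socket peer without a user-space copy."""
    read_end, write_end = os.pipe()
    client, server = socket.socketpair()
    try:
        # Stand-in for the output of some process.
        os.write(write_end, b"hello from a pipe\n")
        os.close(write_end)  # so that splice() sees EOF
        splice_all(read_end, server.fileno())
        server.close()
        return client.recv(1024)
    finally:
        os.close(read_end)
        client.close()
```

Calling pipe_to_socket_demo() returns the bytes that were written to the pipe, now read from the other end of the socket pair. With a real child process you would pass the file descriptor of its stdout pipe as src_fd.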
I didn’t know Linux implemented this kind of optimization: code generation in the kernel for parsing network packets. When you want to filter some network packets that go through the kernel, this JIT mechanism generates native code for your filtering rules, so parsing is several orders of magnitude faster…
Contrary to common sense, large buffers are a real problem on servers and routers. These are some of the countermeasures included in the latest Linux kernel (see this as well).
Kernel SamePage Merging is a recent Linux kernel feature that combines identical memory pages from multiple processes into a single copy-on-write memory region.
Why you (probably) don’t need AIO on Linux
The need for a reliable and fast asynchronous I/O implementation on Linux has always been a hot topic in the Linux world. The current AIO in the Linux kernel is not at the same level as in some other operating systems, and it imposes some requirements that make it difficult (when not painful) to use. Some people argue that Linux provides enough facilities for doing fast asynchronous I/O operations and that we don’t really need a complete AIO implementation, while other people think that there are some remaining cases where applications could get an important boost from a real AIO system.
In my view, these are the main alternatives you have when doing asynchronous I/O on Linux:
- if you want to read/write just a few kBytes, it will be fast anyway, so don’t worry about asynchronous things. Reading a couple of megs will be done almost immediately: the readahead mechanism would load the file to memory anyway. However, there are some corner cases I will explain later on.
- small writes to a file will go to some dirty memory page in the buffer cache, so the operation will be almost immediate. Those pages will be written to disk after some time, or when the memory pressure increases.
- if you want to read/write more than a couple of pages, mmap() the file to memory and then do the reads or writes. If you know the access pattern, give the kernel some advice (with posix_fadvise()) before trying to read from the file, and use msync() when you are done with that memory.
- if you want to move or copy files, either on the local machine or through the network, and especially if they are large files, you really should use splice() (if your kernel supports it). This operation avoids kernel-to-user memory copies and frees your application from doing the real transfer (as it is performed by the OS), so it reduces CPU and memory consumption and can result in a performance improvement of several orders of magnitude.
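The mmap() item in the list above can be sketched in a few lines of Python. This is only an illustration under some assumptions: patch_and_count is a made-up helper, the file is assumed to be non-empty, and mem.flush() is what ends up calling msync() for the mapping.

```python
import mmap
import os


def patch_and_count(path, offset, data):
    """Overwrite len(data) bytes at offset through a memory mapping.

    Returns the number of newline bytes found in the (patched) file.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        size = os.fstat(fd).st_size
        # Advisory hint only: we are going to scan the file sequentially.
        os.posix_fadvise(fd, 0, size, os.POSIX_FADV_SEQUENTIAL)
        # On Unix the default mapping is shared and read/write.
        with mmap.mmap(fd, size) as mem:
            # A write is just a store into the mapped memory.
            mem[offset:offset + len(data)] = data
            newlines = 0
            pos = mem.find(b"\n")
            while pos != -1:
                newlines += 1
                pos = mem.find(b"\n", pos + 1)
            # Push the dirty pages back to the file (msync).
            mem.flush()
        return newlines
    finally:
        os.close(fd)
```

For example, patch_and_count("data.log", 0, b"HELLO") would rewrite the first five bytes of a hypothetical data.log and return how many lines it contains, with no explicit read() or write() calls.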
So, with all these alternatives, should we ever use AIO? Short answer, it depends. All AIO libraries have complex interfaces and make your program flow more difficult to follow. You will have lots of callbacks and asynchronous events, so your program will get messy very easily. But sometimes AIO is the only solution if you don’t want to block in your program.
Then, when should we use AIO? First of all, use AIO when you don’t want to block. Yes, this is the most important reason, and sometimes you don’t really care about blocking. Think about a server that spawns lots of processes for serving connections to clients. For a modest number of clients, maybe you don’t really care about blocking: clients will have to wait anyway until we read/write data from/to disk, and other clients will not be blocked by these reads/writes as they are served by different processes.
However, if you care about blocking in your program, there are some important cases where you will probably need AIO:
- when you are reading and/or writing very frequently from/to disk, not to send that data to the network but to inspect or modify it. sendfile() is useless in this case, but think twice about using AIO, because it may be easier to split your control plane into multiple threads or processes. You have to evaluate this option very carefully, as threads usually involve locking: if you have very frequent disk accesses and that means many locks/unlocks, performance can degrade very fast. If you cannot use threads/processes, AIO will be your last resort.
- when the I/O subsystem is very busy doing lots of big reads/writes, you can run into problems when trying to write some data or read some unpopular blocks. There is no easy solution here either. Check the multiple processes/threads alternative here too.
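The multiple-threads alternative mentioned in both cases above could look roughly like the following Python sketch. It is not AIO and it is not taken from any particular server: read_block and read_many are hypothetical helpers, and the pool size of 4 is an arbitrary choice. The idea is simply that blocking reads are handed to worker threads, so the thread that submits them is free to do other work until it asks for the results.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_block(path, offset, length):
    """Blocking positional read; meant to run in a worker thread."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)


def read_many(path, requests, workers=4):
    """Run several (offset, length) reads concurrently.

    Returns the blocks in the same order as the requests.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(read_block, path, off, n) for off, n in requests]
        # result() waits only for that particular read to finish.
        return [f.result() for f in futures]
```

Each read opens its own descriptor and uses pread(), so the workers share no file position and need no locks, which is the cost the text above warns about.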
Linux socket auto-tuning
I’ve been struggling the past week with some performance issues in my network code. Our new network code base, built upon libevent 2.x, was giving us half the performance we got with our previous version, based on libevent 1.x. At first, I thought it was a problem with the way we handled buffers, as we have switched to the “automatically” handled buffers that libevent provides, but it seems the problem was in some other socket tweaks I added in the middle of the port.
Looking at the tcpdumps, data was being sent in segments with a maximum size of 16 KB, and that was not good for sending big files. Even when sendfile() reported that it was sending large chunks of data (300 KB or so), the underlying sockets were really sending things in 16 KB blocks, reducing the global throughput to unacceptable levels.
Then I re-read this page on TCP tuning, and I found that:
Manually adjusting socket buffer sizes with setsockopt() disables autotuning. Application that are optimized for other operating systems may implicitly defeat Linux autotuning.
And it is true! Skipping code like:
    int newBufSize = (1024 * 64);
    ::setsockopt(socket, SOL_SOCKET, SO_RCVBUF, &newBufSize, sizeof(newBufSize));
    ::setsockopt(socket, SOL_SOCKET, SO_SNDBUF, &newBufSize, sizeof(newBufSize));
in our sockets, and only for kernels newer than 2.6.7, results in larger segment sizes and increased performance. Tricky, isn’t it?
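If you want to check what a socket is actually using, you can ask for its buffer sizes instead of setting them. Here is a small Python sketch of that idea, where make_socket and buffer_sizes are names made up for this example. Leaving fixed_size as None keeps the kernel defaults, and with them autotuning; passing a value reproduces the tweak shown above.

```python
import socket


def make_socket(fixed_size=None):
    """Create a TCP socket, optionally pinning both buffer sizes."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if fixed_size is not None:
        # This is the tweak that turns autotuning off for this socket.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, fixed_size)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, fixed_size)
    return sock


def buffer_sizes(sock):
    """Return (receive, send) buffer sizes as reported by the kernel."""
    rcv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    snd = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    return rcv, snd
```

Note that on Linux the value reported by getsockopt() after an explicit setsockopt() is twice the requested one, because the kernel doubles it to leave room for bookkeeping, and it is capped by the net.core.rmem_max and net.core.wmem_max limits.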