TCP Network Performance

(Abstracted and extended from a discussion from the NuttX Google group)

Question

For some unknown reason, I am seeing poor TCP network performance.

Answer

First let’s talk about TCP send performance.

Source of Performance Bottlenecks

General TCP send performance is not determined by the TCP stack as much as it is by the network device driver. Bad network performance is due to time lost BETWEEN packet transfers. The packet transfers themselves go at the wire speed*. So if you want to improve performance on a given network, you have to reduce time lost between transfers. There is no other way.

Ignoring Ethernet issues like collisions, back-off delays, inter-packet gaps (IPG), etc.

The time between packets is limited primarily by the buffering design of the network driver. If you want to improve performance, then you must improve the buffering at the network driver. You need to support many full size (1500 byte) packet buffers. You must be able to query the network for new data to transfer, and queue those transfers in packet buffers. In order to reach peak performance, the network driver must have the next transfer buffered and ready-to-go before the previous transfer is finished to minimize the GAP between packet transfers.

Different network devices also support more or less efficient interfaces: The worst performing support interfaces that can handle only one packet at a time, the best performing are able to retain linked lists of packet buffers in memory and perform scatter-gather DMA for a sequence of packets.

In the NuttX TCP stack, you can also improve performance by enabling TCP write buffering. But the driver is the real key.

It would be good to have a real in-depth analysis of the network stack performance to identify bottlenecks and generate ideas for performance improvement. No one has ever done that. If I were aware of any stack related performance issue, I would certainly address it.

RFC 1122

There is one important feature missing the NuttX TCP that can help when there is no write buffering: Without write buffering send() will not return until the transfer has been ACKed by the recipient. But under RFC 1122, the host need not ACK each packet immediately; the host may wait for 500 MS before ACKing. This combination can cause very slow performance when small, non-buffered transfers are made to an RFC 1122 client. However, the RFC 1122 must ACK at least every second (odd) packet so sequences of packets with write buffering enabled do not suffer from this problem.

Update: RFC 1122 support was added to the NuttX TCP stack with commit 66ef6d143a627738ad7f3ce1c065f9b1f3f303b0 in December of 2019. That, however, that affects only received packet ACK behavior and has no impact on transmitted packet performance; write buffering is still recommended.

TCPBlaster

I created a new test application at apps/examples/tcpblaster to measure TCP performance and collected some data for the configuration that happens to be on my desk. The tcpblaster test gives you the read and write transfer rates in Kb/sec (I won’t mention the numbers because I don’t believe they would translate any other setup and, hence, would be misleading).

There is a nifty TCP Throughput Tool that gives some theoretical upper limits on performance. The tool needs to know the MSS (which is the Ethernet packet size that you configured minus the size of the Ethernet header, 14), the round-trip time (RTT)in milliseconds (which you can get from the Linux host ping), and a loss constant (which I left at the default). With these values, I can determine that the throughput for the NuttX TCP stack is approximately at the theoretical limits. You should not be able to do better any better than that (actually, it performs above the theoretical limit, but I suppose that is why it is “theoretical”).

So, If you are unhappy with your network performance, the I suggest you run the tcpblaster test, use that data (along with the RTT from ping) with the TCP Throughput Tool. If you are still unhappy with the performance, don’t go immediately pointing fingers at the stack (which everyone does). Instead, you should focus on optimizing your network configuration settings and reviewing the buffer handling of the Ethernet driver in you MCU.

If you do discover any significant performance issues with the stack I will of course gladly help you resolve them. Or if you have ideas for improved performance, I would also be happy to hear those.

What about Receive Performance?

All of the above discussion concerns transmit performance, i.e., “How fast can we send data over the network?” The other side is receive performance. Receive performance is very different thing. In this case it is the remote peer who is in complete control of the rate at which packets appear on the network and, hence, responsible for all of the raw bit transfer rates.

However, we might also redefine performance as the number of bytes that were successfully transferred. In order for the bytes to be successfully transferred they must be successfully received and processed on the NuttX target. If we fail in this if the packet is lost or dropped. A packet is lost if the network driver is not prepared to receive the packet when it was sent. A packet is dropped by the network if it is received but could not be processed either because there is some logical issue with the packet (not the case here) or if we have no space to buffer the newly received packet.

If a TCP packet is lost or dropped, then the penalty is big: The packet will not be ACKed, the remote peer may send a few more out-of-sequence packets which will also be dropped. Eventually, the remote peer will time out and retransmit the data from the point of the lost packet.

There is logic in the TCP protocol to help manage these data overruns. The TCP header includes a TCP receive window which tells the remote peer how much data the receiver is able to buffer. This value is sent in the ACK to each received packet. If well tuned, this receive window could possibly prevent packets from being lost due to the lack of read-ahead storage. This is a little better. The remote peer will hold off sending data instead of timing out and re-transmitting. But this is still a loss of performance; the gap between the transfer of packets caused by the hold-off will result in a reduced transfer rate.

So the issues for good reception are buffering and processing time. Buffering again applies to handling within the driver but unlike the transmit performance, this is not typically the bottleneck. And there is also a NuttX configuration option that controls read-ahead buffering of TCP packets. The buffering in the driver must be optimized to avoid lost packets; the ` buffering can be tuned to minimize the number packets dropped because we have no space to buffer them.

But the key to receive perform is management of processing delays. Small processing delays can occur in the network driver or in the TCP stack. But the major source of processing delay is the application which is the ultimate consumer of the incoming data. Imagine, for example, and FTP application that is receiving a file over a TCP and writing the file into FLASH memory. The primary bottleneck here will be the write to FLASH memory which is out of the control of software.

We obtain optimal receive performance when the processing delays keep up with the rate of the incoming packets. If the processing data rate is even slightly slower then the receive data rate, then there will be a growing backlog of buffered, incoming data to be processed. If this backlog continues to grow then eventually our ability to buffer data will be exhausted, packets will be held off or dropped, and performance will deteriorate. In an environment where a high-end, remote peer is interacting with the low-end, embedded system, that remote peer can easily overrun the embedded system due to the embedded system’s limited buffering space, its much lower processing capability, and its slower storage peripherals.