Packet Generation Options - A Quantitative Comparison

Determining the packet processing speed of a firewall / router largely removes the need to consider the transport protocol (TCP or UDP). To this end, packet generators are commonly used to measure the performance of routers, switches, and firewalls, while traditional bandwidth measurement at higher levels in the network stack makes more sense for client devices running end-user applications. Many commercial packet-generating hardware appliances and applications from vendors such as Ixia and Spirent are used in the industry. For software developers and homelab enthusiasts, and even for many hardware developers, PC software such as TRex and Ostinato fits the bill. While these software tools have a bit of a learning curve, there are simple command-line applications that can deliver quick performance measurement results.

FreeBSD includes netmap, a framework for fast packet I/O. It allows applications to access network interfaces without going through the host stack (assuming support in the device driver). Packet generators taking advantage of this framework can sustain line rate even for fairly small packet sizes. The netmap source also includes pkt-gen, a sample packet generator application built on the framework. The open-source community has created a number of applications utilizing netmap and pkt-gen that allow easier interactive testing as well as easy automation of common scenarios. One such application is ipgen, which also includes a built-in option to benchmark packet generation. iPerf, meanwhile, is a popular network performance measurement tool that outputs easy-to-understand bandwidth numbers particularly relevant to end users of client devices. iPerf3 includes a length parameter that controls the UDP datagram size, allowing packet generation to be simulated in a manner similar to pkt-gen and ipgen.
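For readers who want to try this themselves, the sketch below (an illustration based on standard FreeBSD conventions, not taken from our test scripts) shows how netmap availability might be checked and the bundled pkt-gen built, assuming the source tree is installed at /usr/src.

```sh
# Sketch: confirm netmap support and build the sample netmap tools.
# Assumes a FreeBSD system with the source tree at /usr/src.
ls /dev/netmap || echo "netmap not available in this kernel"
cd /usr/src/tools/tools/netmap && make
# The resulting pkt-gen binary lands under /usr/obj/, matching the path
# used in the pkt-gen command later in this section.
```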

In the rest of this section, we benchmark each of these options on various machines in our testbed under different conditions. This includes the diminutive Compulab fitlet-XA10-LAN with four gigabit LAN ports - an attractive x86-64 system for embedded networking applications requiring multiple network ports. While it is not in the same class as the other server systems being tested in this section, it does provide context for folks adopting these types of systems for packet generation / testbed applications.

iPerf3

The iPerf3 benchmarking tool is typically used to get a quick idea of the networking capabilities of end-user devices. In its most common usage, options such as the TCP window size and UDP packet length are left at their defaults. The ability to alter the latter, however, provides an avenue to explore the packet generation capabilities of iPerf3. Though iPerf3 allows the length parameter for the UDP datagram size to be set to very high values (up to the theoretical maximum of around 64KB), going above the MTU results in IP fragmentation.

`iperf3 -u -c ${ServerIP} -t ${RunDuration} -O 5 -f m -b 10G --length ${pktsize} 2>&1`

As part of our testing, the source was configured to send UDP datagrams of various lengths ranging from 16 bytes to 1500 bytes across the DUT in router mode, as shown in the testing script extract above.
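A minimal sketch of the surrounding sweep is shown below; the variable values and the exact size list are illustrative assumptions, while the iperf3 invocation itself comes from the extract above.

```sh
# Sketch of the datagram-size sweep (values are assumptions, not our exact script).
ServerIP=192.168.1.10   # iperf3 -s instance on the far side of the DUT
RunDuration=60
for pktsize in 16 64 128 256 512 1024 1472 1500; do
    iperf3 -u -c ${ServerIP} -t ${RunDuration} -O 5 -f m -b 10G \
        --length ${pktsize} 2>&1 | tail -n 4   # keep just the summary lines
done
```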

The bandwidth drop when going from 1472 to 1500 bytes in the datagram length is explained by fragmentation: the protocol headers add more bytes on top of the length parameter passed to iPerf3, and a 1472-byte payload plus the 8-byte UDP header and 20-byte IPv4 header exactly fills a 1500-byte MTU (1472 + 8 + 20 = 1500). Anything larger must be fragmented somewhere in the network path. Packet generators are expected to saturate the link bandwidth for all but the smallest packet sizes, and the results above suggest that iPerf3 is not advisable for this purpose.

ipgen

The ipgen tool is considered next because it has a built-in benchmark mode. This mode doesn't actually place the generated packets on a network interface; rather, it is a pure test of the ability of the CPU and memory subsystem to generate raw packets of different sizes. Multiple instances of the packet generator running simultaneously need to be bound to different cores in order to obtain the best performance.

`timeout 10s cpuset -l $cpuset ipgen -X -s $pktsize 2>&1`

The ipgen benchmark involves generating packets of various sizes for 10 seconds each. The first set involves the generation of a single stream, the second involves two simultaneous streams, and so on up to four simultaneous streams (a sketch of such a loop appears below). On systems where the logical core count differs from the physical core count, the processes are bound to distinct physical cores. The average packet generation rate across all enabled streams (measured in million packets per second - Mpps) is presented in the graph below.
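The sketch below illustrates how such a run can be scripted; the packet-size list and core IDs are illustrative assumptions layered on top of the command shown above.

```sh
# Sketch: for each packet size, run 1 to 4 simultaneous ipgen benchmark
# instances, each pinned to its own core with cpuset(1).
# Core IDs 0-3 assume four physical cores.
for pktsize in 64 128 256 512 1024 1518; do
    for nstreams in 1 2 3 4; do
        for cpu in $(seq 0 $((nstreams - 1))); do
            timeout 10s cpuset -l ${cpu} ipgen -X -s ${pktsize} 2>&1 &
        done
        wait   # collect per-stream rates before the next configuration
    done
done
```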

In order to maintain wire speed with minimum-sized packets, a generator must be able to output 1.488 Mpps on a 1G interface and 14.88 Mpps on a 10G interface. Considering the network interfaces on the machines in the graphs above, the CPUs are suitably equipped for this best-case scenario, in which no attempt is made to dump the generated packet contents or drive them onto a network interface. Enabling such activities is bound to introduce some performance penalty.
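Those wire-speed figures follow directly from Ethernet framing: a minimum-sized 64-byte frame (including the 4-byte FCS) also costs an 8-byte preamble / start-of-frame delimiter and a 12-byte inter-frame gap on the wire, i.e. 84 bytes or 672 bits per packet. The short sketch below works out the numbers.

```sh
# 64-byte frame + 8-byte preamble/SFD + 12-byte inter-frame gap
# = 84 bytes = 672 bits occupied on the wire per minimum-sized packet.
bits_per_pkt=672
for speed_gbps in 1 10; do
    pps=$(( speed_gbps * 1000000000 / bits_per_pkt ))
    echo "${speed_gbps}G link: ${pps} packets/sec"   # 1488095 and 14880952
done
```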

pkt-gen

The pkt-gen benchmark described here adds a practical layer on top of the benchmark mode seen in the previous sub-section: the generated packets are driven out of the network interface to an external device (in this case, the E302-9D pfSense firewall) configured to drop them. For large frame sizes, the line rate often acts as the limiting factor.

`timeout ${RunDuration}s /usr/obj/usr/src/amd64.amd64/tools/tools/netmap/pkt-gen -i ${IntfName} -l ${pktsize} -s ${SrcIP} -d ${DestIP} -D ${DestMAC} -f tx -N -B 2>&1`

With the network interface as the limiting factor, benchmark numbers are presented only for a single stream. As expected, CPU speed and cache organization play a major role in this task, with the 5019D-4C-FN8TP (equipped with an actively cooled 2.2 GHz Intel Xeon D-2123IT) able to generate packets at line rate even for minimum-sized packets.
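For completeness, the sketch below shows how the invocation above can be swept across frame sizes; the interface name, addresses, MAC, and size list are placeholder assumptions for illustration, while the pkt-gen flags come straight from the extract above.

```sh
# Sketch of a single-stream pkt-gen sweep (placeholder values throughout).
IntfName=ix0               # assumed 10G interface on the generator
SrcIP=192.168.10.2         # assumed testbed addressing
DestIP=192.168.20.2
DestMAC=00:00:00:00:00:01  # assumed MAC of the DUT port facing the generator
RunDuration=60
for pktsize in 64 128 256 512 1024 1518; do
    timeout ${RunDuration}s \
        /usr/obj/usr/src/amd64.amd64/tools/tools/netmap/pkt-gen \
        -i ${IntfName} -l ${pktsize} -s ${SrcIP} -d ${DestIP} \
        -D ${DestMAC} -f tx -N -B 2>&1
done
```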

Based on the above results, it is clear why pkt-gen is widely adopted as a reliable packet generator for performance verification. It may not offer the flexibility and additional features needed for other purposes (fulfilled by offerings such as TRex and Ostinato), but it suffices for the majority of the testing we set out to do. Tools such as ipgen and iPerf3 are still used in a few sections but, as we shall see further down, pkt-gen stresses the DUT most effectively without being bottlenecked by the stimulus generators.

Comments

  • Jorgp2 - Thursday, July 30, 2020 - link

    Maybe you should learn the difference between a switch and a router first.
  • newyork10023 - Thursday, July 30, 2020 - link

    Why do you people have to troll everywhere you go?
  • Gonemad - Wednesday, July 29, 2020 - link

    Oh boy. My mother once got Wi-Fi "AC" 5GHz, 5Gbps, and 5G mobile networks mixed up. It took a while to explain those to her.

    Don't use 10G to mean 10 Gbps, please! HAHAHA.
  • timecop1818 - Wednesday, July 29, 2020 - link

    Fortunately, when Ethernet says 10Gbps, that's what it means.
  • imaheadcase - Wednesday, July 29, 2020 - link

    Put the name Supermicro on it and you know its not for consumers.
  • newyork10023 - Wednesday, July 29, 2020 - link

    The Supermicro manual states that a PCIe card installed is limited to networking (and will require a fan installed). An HBA card can't be installed?
  • abufrejoval - Wednesday, July 29, 2020 - link

    Since I use both pfSense as a firewall and a D-1541 Xeon machine (but not for the firewall) and I share the dream of systems that are practically silent, I feel compelled to add some thoughts:

    I started using pfSense on a passive J1900 Atom board which had dual Gbit on-board and cost less than €100. That worked pretty well until my broadband exceeded 200Mbit/s, mostly because it wasn’t just a firewall, but also added Suricata traffic inspection (tried Snort, too, very similar results).

    And that’s what’s wrong with this article: 10Gbit Xeon-Ds are great when all you do is push packets, but don’t look at them. They are even greater when you terminate SSL connections on them with the QuickAssist variants. They are great when they work together with their bigger CPU brothers, who will then crunch on the logic of the data.

    In the home-appliance context that you allude to, you won’t have ten types of machines to optimally distribute that work. QuickAssist won’t deliver benefits while the CPU will run out of steam far before even a Gbit connection is saturated when you use it just for the front end of the DMZ (firewall/SSL termination/VPN/deep inspection/load-balancing-failover).

    Put proxies, caches or even application servers on them as well, even a single 10Gbit interface may be a total waste.

    I had to resort to an i7-7700T which seems a bit quicker than the D-2123IT at only 35 Watts TDP (and much cheaper) to sustain 500Mbit/s download bandwidth with the best gratis Suricata rule set. Judging by CPU load observations it will just about manage the Gbit loads its ports can handle, pretty sure that 2.5/5/10 Gbit will just throttle on inspection load, like the J1900 did at 200Mbit/s.

    I use a D-1541 as an additional compute node in an oVirt 3 node HCI gluster with 3x 2.5Gbit J5005 storage nodes. I can probably go to 6x 2.5Gbit before its 10Gbit NIC becomes a bottleneck.

    The D-1541’s benefit there is lots of RAM and cores, while it’s practically silent with 45 Watts TDP and none of the applications on it require vast amounts of CPU power.

    I am waiting for an 8-core AMD 4000 Pro 35 Watt TDP APU to come as Mini-ITX capable of handling 64 or 128GB of ECC-RAM to replace the Xeon D-1541 and bring the price for such a mini server below that of a laptop with the same ingredients.
  • newyork10023 - Wednesday, July 29, 2020 - link

    With an HBA (were it possible, hence my question), the 10Gbps serves a possible use (storage). Pushing and inspection exceeds x86 limits now. See TNSR for real x86 limits (without inspection).
  • abufrejoval - Wednesday, July 29, 2020 - link

    That would seem to apply to the chassis, not to the mainboard or SoC.
    There is nothing to prevent it from working per se.

    I am pretty sure you can add a 16-port SAS HBA or even NVMeOF card and plenty of external storage, if thermals and power fit. A Mellanox 100Gbit card should be fine electrically, logically etc, even if there is nothing behind to sustain that throughput.

    I've had an Nvidia GTX1070 GPU in the SuperMicro Mini-ITX D-1541 for a while, no problem at all, functionally, even if games still seem to prefer Hertz over cores. Actually, GPU-accelerated machine learning inference was the original use case of that box.
  • newyork10023 - Wednesday, July 29, 2020 - link

    As pointed out, the D2123IT has no QAT, so a QAT accelerator would take up an available PCIe slot. It could push 10G packets then, but not save them or think (AI) on them.
