Overview
This article gives a detailed introduction to Queueing Modes on the Exinda and explains how they work.
Information
There are three queueing modes for the transmission of packets out of the Exinda, and each one changes how packets flow at the hardware level. The three modes are:
- Single Mode
- Multi Mode
- Multi-Per-VC
When traffic comes in, each packet is hashed (a very fast computation that reduces the packet's header fields, such as its source/destination addresses and ports, to a single numeric value), and the result of that hash determines which receive queue it goes into. The queue is explicitly determined by taking the hash and applying what is called modulus math to reduce it to a number between 0 and (number of receive queues - 1); that result is the receive (RX) queue the packet flows into.
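As a rough sketch of this hash-and-modulus step (the NIC uses its own hardware hash over the packet's header fields; the CRC32 function and the queue count below are just stand-ins for illustration):

```python
import zlib

NUM_RX_QUEUES = 4  # hypothetical queue count

def pick_rx_queue(src_ip, dst_ip, src_port, dst_port, protocol):
    # Hash the flow's header fields down to a single number...
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}/{protocol}".encode()
    h = zlib.crc32(key)
    # ...then use modulus math to reduce it to 0 .. (number of RX queues - 1).
    return h % NUM_RX_QUEUES

print(pick_rx_queue("10.0.0.1", "192.168.1.20", 51515, 443, "tcp"))
```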
Each receive queue is mapped to a CPU core via an interrupt request (IRQ). When a packet arrives on an RX queue, it interrupts the CPU core that queue is assigned to, and the core drops everything it is doing to handle the new packet. Once the core has the packet, it is processed by everything in the firmware (collector, optimizer, etc.).
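As a rough illustration of that mapping (the real assignment is done through IRQ affinity in the operating system; the queue counts and helper names below are hypothetical):

```python
# Hypothetical 4-queue, 4-core box: each RX queue's interrupt is pinned to one core.
RX_QUEUE_TO_CORE = {0: 0, 1: 1, 2: 2, 3: 3}

def firmware_pipeline(packet):
    # Stand-in for the real firmware stages (collector, optimizer, etc.).
    return f"processed {packet}"

def on_interrupt(rx_queue, packet):
    # The interrupt preempts whatever the mapped core was doing so it can handle the packet.
    core = RX_QUEUE_TO_CORE[rx_queue]
    return core, firmware_pipeline(packet)

print(on_interrupt(1, "pkt-A"))  # -> (1, 'processed pkt-A')
```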
When the firmware processing is done, the packet is put on a transmit (TX) queue. Which TX queue it lands on depends on the queueing mode that is set (see the sketch after this list):
- Single Mode puts every packet on a single TX queue
- Multi-Per-VC puts every packet belonging to a specific Virtual Circuit (VC) on that VC's own TX queue
- Multi Mode allows packets to go on any TX queue (each queue is 1/n of the total, where n is the number of cores)
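The difference between the modes can be pictured as three selection rules (a simplified sketch only; the function names, core count, and VC numbering are made up for this example, not Exinda configuration options):

```python
NUM_CORES = 4  # hypothetical core count; Multi Mode uses one TX queue per core

def tx_queue_single(core_id, vc_id):
    return 0            # Single Mode: every packet shares TX queue 0

def tx_queue_multi_per_vc(core_id, vc_id):
    return vc_id        # Multi-Per-VC: one TX queue per Virtual Circuit

def tx_queue_multi(core_id, vc_id):
    return core_id      # Multi Mode: one TX queue per CPU core (each 1/n of the total)

# Core 2 processing a packet for VC 5 targets queue 0, 5, or 2 depending on the mode.
print(tx_queue_single(2, 5), tx_queue_multi_per_vc(2, 5), tx_queue_multi(2, 5))
```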
When a CPU core puts a packet on a TX queue, it needs to be sure that nothing else takes the slot the packet is going to occupy. That kind of collision is possible on a multi-core CPU, where each core can be doing different tasks simultaneously. To prevent this, a CPU core 'locks' the queue, so nobody else but it can access the queue until the lock is released.
- Think about modifying a Word document. Two people can open and view the document at the same time, but only one person can edit it at a time; if more than one could, the file would not know which changes to make or keep. The person who opens the file first takes a write lock so nobody else can write to the file until they close it. The queue lock is the same type of thing.
- Locks are related to the 'deadlocks' that processes can get into. A deadlock is the following scenario: two processes/tasks are working simultaneously. Process 1 locks a resource it needs, and Process 2 locks a resource it needs. Process 1 then needs to access what Process 2 has locked, while Process 2 needs to access what Process 1 has locked. Both are waiting for the other to release its resource, but neither will, leading to a 'deadlocked' state that must either be dealt with in the code (e.g., timing out the blocked request so the process can fail gracefully) or by restarting the process.
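A minimal sketch of that locking, using Python threads in place of CPU cores and a plain list in place of the hardware TX ring (all names here are illustrative):

```python
import threading

tx_queue = []               # stand-in for one hardware TX queue
tx_lock = threading.Lock()  # the 'write lock' on that queue

def place_packet(core_id, packet):
    # A core must hold the lock before it can safely take a slot on the queue.
    with tx_lock:
        tx_queue.append((core_id, packet))

threads = [threading.Thread(target=place_packet, args=(core, f"pkt-{core}"))
           for core in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(tx_queue)  # every packet lands exactly once, in whichever order the lock was won
```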
If multiple packets are headed for the same TX queue, the CPU cores processing them will keep asking for the lock until each receives it and can place its packet; then they all move on to new packets. Changing the TX queueing mode changes how the CPU cores process and place the packets (see the sketch after this list):
- Single Mode: every CPU core places packets on one single TX queue, so there is constant fighting for the lock.
- Multi Mode: each CPU core places packets on one queue assigned to it (each core has a different one). This means no fighting for a lock (but it also cuts down the total bandwidth available per queue).
- Multi-Per-VC: each VC has its own transmit queue. After packets are processed, the CPU core puts each one on the TX queue for the VC it belongs to. This can still result in fighting for the lock, because multiple cores can be handling packets for the same VC at once.
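One way to picture this difference is to give each TX queue its own lock and let the mode decide which lock a given core ends up competing for (a sketch only; the queue count and names are hypothetical):

```python
import threading

NUM_TX_QUEUES = 4
tx_locks = [threading.Lock() for _ in range(NUM_TX_QUEUES)]

def place(core_id, vc_id, mode):
    if mode == "single":
        q = 0                        # every core fights over the one lock
    elif mode == "multi":
        q = core_id                  # each core has its own queue: no fighting
    else:                            # "multi-per-vc"
        q = vc_id % NUM_TX_QUEUES    # cores serving the same VC fight over its lock
    with tx_locks[q]:
        pass                         # place the packet on queue q while holding the lock
    return q

print(place(2, 5, "single"), place(2, 5, "multi"), place(2, 5, "multi-per-vc"))
```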
The ntuple switch on the NIC essentially forces each RX queue to map to a TX queue. With it enabled, a CPU core never has to fight with another core for the semaphore/lock to put a packet on the TX queue, but it will be interrupted more frequently with new packets, which slows it down.
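Conceptually, with the switch on, a packet's TX queue follows the RX queue it arrived on (a sketch of the idea only; the real NIC behaviour is more involved than this):

```python
def tx_queue_with_ntuple(rx_queue):
    # 1:1 RX-to-TX mapping: the core that owns this RX queue also owns the TX queue,
    # so it never competes with another core for the lock.
    return rx_queue

print(tx_queue_with_ntuple(2))  # a packet received on RX queue 2 is sent on TX queue 2
```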
In an ideal world, the hashes would map evenly across the RX queues, putting all CPU cores under the same load rather than burdening one more than the others. In practice this does not happen, because the hashing step that decides which RX queue a packet goes into is not perfectly even. In theory, the incoming packets could be round-robined into the RX queues and forced into an equal load, but that would be very hard on system resources, because the appliance would have to keep track of every single flow manually instead of just recomputing a hash, so it is not feasible.
Because of this randomness in the incoming packets, it is impossible to fully guarantee full throughput. If one CPU core is getting all the load, there will be some throughput problems. The more flows you have, the more the load should theoretically even out, but there is still a chance that everything maps to only one or two queues, and there is nothing we can do about it because that is how Intel has designed the hardware. The best we can do is work with the ntuple switch, turning it on or off.
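A quick way to see this unevenness is to hash a set of made-up flows into four queues and count where they land (illustrative only; CRC32 stands in for the NIC's hash). With a handful of flows the split is often lopsided; with many flows it tends to even out, but it is never guaranteed to:

```python
import random
import zlib
from collections import Counter

def queue_for(flow, num_queues=4):
    return zlib.crc32(flow.encode()) % num_queues

for n_flows in (8, 8000):
    flows = [f"10.0.{random.randint(0, 255)}.{random.randint(1, 254)}:{random.randint(1024, 65535)}"
             for _ in range(n_flows)]
    print(n_flows, "flows ->", Counter(queue_for(f) for f in flows))
```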