* Networking

Linux Socket Kernel Buffers (SKBs, struct sk_buff): called "mbufs" in BSD-based OSs.

* Device drivers (part 1: receiving packets)

[last discussed what happens if not enough RAM]

If there's enough RAM inside the NIC, it'll receive the packet and store it in the NIC's RAM. Next step: give the packet to the OS. The NIC interrupts the CPU ("network interrupt"). The CPU is interrupted (if it can be), preserves the state of the running task/process (CPU registers, etc., in the task struct), and executes a network interrupt handler.

Note: for every interrupt number, there's an interrupt handler function in a global interrupt handler dispatch table. When the CPU accepts an interrupt, it DISABLES other interrupts at the same level. Therefore interrupt handlers should run fast and never block (you can't really interrupt one handler with another -- it causes recursion).

OS interrupt handler:
1. Just got invoked for "networking" for a given NIC.
2. Handler needs to get the NIC's data into an skb.
3. Handler asks the SKB subsystem for a fast, pre-allocated SKB of a suitable size (faster than asking to allocate, or calling kmalloc, which can block).
4. Handler then transfers data from NIC to skb, either:
   (a) set up Direct Memory Access (DMA): tell the DMA processor to copy the data asynchronously from the NIC to kernel memory, and give it a callback fxn to execute when the copy is done; or
   (b) issue processor I/O copy instructions to copy from the NIC's address space to the kernel's address space. Slow instructions: copy one word/byte at a time.
5. Set up the packet data so it can be further processed by an async queue in the kernel after it's been fully copied into an skb -- but don't do that processing right now!
6. Handler is effectively done; return; all interrupts are re-enabled.
7. Signal the NIC that you're done (e.g., DMA response or other electronics).

Back inside the NIC: when it gets the "done" signal from the OS's network interrupt handler, it can free up the buffer used for that received packet.
* OS sending a packet via NIC

The bottom-most layer of the OS, e.g., the net driver, has a packet to transmit, in an skb. It needs to give it to the NIC.

Simple case: the NIC has memory free. Either set up DMA with a callback fxn, or copy bytes/words one at a time to the NIC. When the copy's done and the NIC has the packet, the OS can free up the SKB.

What if the NIC is busy or its memory is full, so it can't receive the packet from the OS? Use a single bit (wire) called "tx on/off" (tx: transmit). If the NIC is busy, it'll set TX to "off" (telling the OS "don't transmit to me"). The CPU checks the status of the TX bit: if off, the OS will NOT send the packet and will just wait. In other words, the OS (a "heavy writer" to the NIC) throttles itself.

When the NIC has a packet to transmit:
1. Sample the wire (or network).
2. Make sure it's quiet.
3. Then start transmitting.
4a. If all bits were transmitted correctly on the wire, the NIC can free up the buffer.
4b. If someone else transmits at the same time, signals on the wire can get mixed up: corrupted bits. If that happens, stop transmitting and back off a bit (possibly exponential back-off). Then try again until success.

* What happens inside the OS?

After the OS receives a packet into an skb from the NIC (upon receiving a new packet), it puts the skb into a queue for further processing. Linux sets up a system of async interrupts to process many queues of different types, called "Soft IRQs" (soft interrupts).

Soft IRQ system:
1. Defines different types of processing. If the softirq is NET_RX, it means there's work to do for network receiving of packets; NET_TX is for transmitting packets; there are other softirqs, and you can even define your own. NET_RX, NET_TX, etc. are bits in a global bitmap of the softirq subsystem.
2. The kernel starts N kthreads for processing softirqs, usually N == #cores, e.g., ksoftirqd/cpu0, ksoftirqd/cpu1, etc. If you run ps -ef on a Linux system, every "process" in [brackets] is a kernel thread.
3.
The scheduler checks the global softirq bitmap, and if any of the bits are on, it wakes up one or more of the ksoftirqd/cpuN kthreads. These kthreads will then get scheduled, eventually...

ksoftirqd/cpuN kthreads: when they run, they check the global softirq bitmap, then execute specific "soft interrupt handlers" for each type of softirq: NET_RX will invoke code to receive packets in some queue, NET_TX will invoke code to send packets in some queue, etc.

In networking, there are many layers. Each layer has its own queues with data in skbs, and handlers for dealing with that data. A dev. drv. ethernet queue will process a packet, remove the ethernet headers, then add the pkt to an IP queue; when the IP queue's consumer runs, it'll again check headers, then move the pkt to another queue (UDP, TCP, etc.). This processing from queue to queue continues upward until we have data to give a waiting user process. At the final stage, the kernel will copy_to_user the packet payload, and change the process state from waiting to ready (the scheduler will let it run some time later).

On writing to a socket, the kernel copies data from the user, then returns from the syscall (unless the syscall asked to block). The data is then put into a VFS/socket-layer queue, then processed and moved down the queues: TCP or UDP, IP, ethernet, and eventually the dev. driver. Note: there are even more layers (firewalling, etc. TBD).

When softirqs wake up, they have to decide WHICH queues to process first.
Upon NET_TX: process packets from the lowest queues first, then go up the chain. This allows upper queues to drain asynchronously.
Upon NET_RX: process packets from the topmost queues first, then down the chain. This allows bottom queues and the NIC to move newly received data up the chain.

Next time: some more data structures and details of network architecture ... and a cautionary tale about locking and networking...