The journey of a packet through the Linux 2.6.10 network stack
Harald Welte <laforge@gnumonks.org>, netfilter core team
Copyright 2004 Harald Welte <laforge@gnumonks.org>
Sep 14, 2004, $Revision: 1.4 $
This document describes the journey of a network packet inside the Linux kernel 2.6.x. This has changed quite a bit since 2.2 because the globally serialized bottom half was abandoned in favor of the new softirq system.
Preface
I have to apologize for my ignorance, but this document has a strong focus on the "default case": x86 architecture and IP packets which get forwarded. If you want to contribute your favourite part, feel free to send me a patch. While I've been working on netfilter/iptables for quite some time, I am definitely no core networking guru, and the information provided by this document may be wrong. So don't expect too much; I'll always appreciate your comments and bugfixes. The document tries to reflect the latest kernel at the time of its writing, which is 2.6.10-rc2. If you are working on an earlier or later kernel, parts of the network stack might already have changed again.
Receiving the packet
The receive interrupt
If the network card receives an Ethernet frame which matches the local MAC address, an address programmed into the multicast filter, or the link-layer broadcast address, it issues an interrupt. The network driver for this particular card handles the interrupt and fetches the packet data via DMA / PIO / whatever into RAM. It then allocates an skb and calls a function of the protocol-independent device support routines: net/core/dev.c:netif_rx(skb).
Please note that in Linux 2.6.x, drivers can also be written to support the so-called NAPI (New API). NAPI tries to prevent DoS attacks caused by packet floods that make the CPU spin in the hardirq handler. Instead of using netif_rx() the way described above, NAPI drivers disable interrupt generation on the card and schedule polling by calling the function include/linux/netdevice.h:netif_rx_schedule(dev).
netif_rx()
At this early time, the kernel checks whether there are any users of netpoll registered. Netpoll is a low-level mechanism for network access to incoming packets, used by code that wants to avoid the full network stack, like netconsole. If the driver didn't already timestamp the skb, and some piece of code inside the kernel requested timestamps by asserting netstamp_needed, the kernel timestamps the skb now by calling include/net/sock.h:net_timestamp().
Afterwards the skb gets enqueued in the appropriate queue for the processor handling this packet. If the backlog queue is full, or the queue is already throttled (queue->throttle != 0), the packet is dropped at this point rather than enqueued. After enqueuing the skb, the receive softirq is marked for execution via include/linux/netdevice.h:netif_rx_schedule(). The attentive reader will have noticed that this function was previously mentioned in relation to NAPI drivers. And yes, indeed, this is the point where the two codepaths rejoin and continue their common way through the rest of the stack.
netif_rx() returns the queue congestion level to give some feedback to the driver. The congestion level is one of NET_RX_SUCCESS, NET_RX_CN_LOW, NET_RX_CN_MOD, NET_RX_CN_HIGH or NET_RX_DROP.
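The enqueue-or-drop decision of netif_rx() can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions: toy_backlog, toy_netif_rx(), toy_drain() and the TOY_ constants are names I made up for illustration, the queue limit is arbitrary, and the real function additionally deals with per-CPU data, interrupt masking and the intermediate congestion levels.

```c
#include <stddef.h>

/* Hypothetical return codes, modelled on NET_RX_SUCCESS / NET_RX_DROP. */
enum toy_rx_result { TOY_RX_SUCCESS = 0, TOY_RX_DROP = 1 };

#define TOY_MAX_BACKLOG 4   /* assumed limit, kept small for the example */

/* Toy stand-in for the backlog queue of one CPU. */
struct toy_backlog {
    int pkts[TOY_MAX_BACKLOG]; /* packet ids instead of real skbs */
    size_t qlen;               /* number of queued packets */
    int throttle;              /* non-zero once the queue has overflowed */
    int softirq_raised;        /* models "receive softirq marked for execution" */
};

/* Enqueue a packet, or drop it if the queue is full or throttled. */
static enum toy_rx_result toy_netif_rx(struct toy_backlog *q, int pkt)
{
    if (q->throttle)
        return TOY_RX_DROP;
    if (q->qlen >= TOY_MAX_BACKLOG) {
        q->throttle = 1;       /* stay closed until the queue is drained */
        return TOY_RX_DROP;
    }
    q->pkts[q->qlen++] = pkt;
    q->softirq_raised = 1;
    return TOY_RX_SUCCESS;
}

/* Drain the queue, as the softirq would, and lift the throttle. */
static size_t toy_drain(struct toy_backlog *q)
{
    size_t n = q->qlen;
    q->qlen = 0;
    q->throttle = 0;
    q->softirq_raised = 0;
    return n;
}
```

The point of the throttle flag in this model is that a flooded queue stays closed until it has been emptied completely, instead of accepting one packet for every packet that is dequeued.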
The interrupt handler now exits and all interrupts are re-enabled.
The network RX softirq
Like in Linux 2.4, the whole network stack is running in softirq context. Softirqs have the major advantage that they may run on more than one CPU simultaneously (as opposed to the old "bottom halves" in Linux 2.2.x). Our network receive softirq is registered in net/core/dev.c:net_dev_init() using the function kernel/softirq.c:open_softirq() provided by the softirq subsystem.
Further handling of our packet is done in the network receive softirq (NET_RX_SOFTIRQ), which is called from kernel/softirq.c:__do_softirq() via kernel/softirq.c:do_softirq(). do_softirq() itself is called from four places within the kernel:
from kernel/irq/handle.c:irq_exit(), which is called by architecture-specific code after the hardware interrupt handler has finished.
from kernel/softirq.c:ksoftirqd(), that is the kernel softirq daemon.
from kernel/softirq.c:local_bh_enable(), which re-enables softirq processing after a local_bh_disable() section and runs any softirqs that have become pending in the meantime.
from net/core/dev.c:netif_rx_ni(), which is a special version of netif_rx(), used by bluetooth bnep and the tun driver.
So if execution passes one of these points, __do_softirq() is called, detects that NET_RX_SOFTIRQ is marked, and calls net/core/dev.c:net_rx_action(). Here the skbs are dequeued from the local CPU's backlog queue (net/core/dev.c:process_backlog()) using a weighting scheme between the different incoming devices.
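The register / mark / run cycle of a softirq can be illustrated with a pending bitmask and a table of handlers. The following is a minimal sketch, not the kernel implementation: all toy_ names are hypothetical, there is only one "CPU", and things like ksoftirqd, restart limits and interrupt masking are left out.

```c
#include <stdint.h>

/* Hypothetical softirq numbers; the real kernel defines more of them. */
enum { TOY_NET_TX_SOFTIRQ = 0, TOY_NET_RX_SOFTIRQ = 1, TOY_NR_SOFTIRQS = 2 };

typedef void (*toy_softirq_fn)(void);

static toy_softirq_fn toy_vec[TOY_NR_SOFTIRQS]; /* registered handlers */
static uint32_t toy_pending;                    /* one bit per softirq */
static int toy_rx_runs;                         /* counts handler invocations */

/* Counterpart of open_softirq(): register a handler for a softirq number. */
static void toy_open_softirq(int nr, toy_softirq_fn fn)
{
    toy_vec[nr] = fn;
}

/* Marking a softirq for execution only sets its pending bit. */
static void toy_raise_softirq(int nr)
{
    toy_pending |= (uint32_t)1 << nr;
}

/* Counterpart of __do_softirq(): run every handler whose bit is set. */
static void toy_do_softirq(void)
{
    uint32_t pending = toy_pending;

    toy_pending = 0;   /* cleared first, so handlers may raise softirqs again */
    for (int nr = 0; nr < TOY_NR_SOFTIRQS; nr++)
        if ((pending & ((uint32_t)1 << nr)) && toy_vec[nr])
            toy_vec[nr]();
}

/* Stand-in for net_rx_action(). */
static void toy_net_rx_action(void)
{
    toy_rx_runs++;
}
```

Note that raising the same softirq twice before it runs still results in a single handler invocation; the handler itself is responsible for working through everything that has been queued.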
netif_receive_skb()
The next function is net/core/dev.c:netif_receive_skb(), which is the main input function for the receive softirq. First there is again a check for any netpoll users via netpoll_rx(). If there is no timestamp, and timestamps have been requested somewhere in the kernel, net_timestamp() is called.
In case the incoming interface is part of a group of bonded interfaces, skb_bond() saves skb->dev to skb->real_dev and changes skb->dev to point to the master device structure.
Now the packet is delivered to all layer 3 protocol handlers that have registered for all packets (such as PF_PACKET sockets) by calling deliver_skb().
If the kernel supports 'tc actions' (i.e. it was compiled with CONFIG_NET_CLS_ACT enabled), the ingress filter is now run via ing_filter(). If the filter verdict is TC_ACT_SHOT or TC_ACT_STOLEN, the skb is dropped by kfree_skb() and all further processing of the packet stops.
Next, include/linux/divert.h:handle_diverter() checks whether somebody uses the packet diverter, another obscure feature of the Linux kernel. If yes, processing continues at net/core/dv.c:divert_frame().
If the kernel has support for Ethernet bridging (i.e. CONFIG_BRIDGE is enabled), it is handled via net/core/dev.c:handle_bridge() and br_handle_frame_hook().
Finally, the regular layer 3 packet handlers are called by a lookup in the ptype hash and a subsequent call to net/core/dev.c:deliver_skb().
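The two delivery rounds in netif_receive_skb() (first the handlers registered for all packets, then the handler matching the protocol) can be sketched as follows. Please treat this as a simplified model: every toy_ name is hypothetical, and I use one linked list for both kinds of handlers where the kernel keeps a separate list and a hash table. Only the ethertype values are the real ones.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_ETH_P_ALL 0x0003  /* "every packet", as used by PF_PACKET taps */
#define TOY_ETH_P_IP  0x0800  /* IPv4 ethertype */
#define TOY_ETH_P_ARP 0x0806  /* ARP ethertype */

struct toy_skb {
    uint16_t protocol;   /* ethertype of the frame (host byte order here) */
    int seen_by_tap;     /* bumped by the tap handler */
    int seen_by_ip;      /* bumped by the IPv4 handler */
};

/* Hypothetical counterpart of struct packet_type. */
struct toy_packet_type {
    uint16_t type;
    void (*func)(struct toy_skb *skb);
    struct toy_packet_type *next;
};

static struct toy_packet_type *toy_ptype_list;

/* Counterpart of dev_add_pack(): register a protocol handler. */
static void toy_dev_add_pack(struct toy_packet_type *pt)
{
    pt->next = toy_ptype_list;
    toy_ptype_list = pt;
}

static void toy_tap_rcv(struct toy_skb *skb) { skb->seen_by_tap++; }
static void toy_ip_rcv(struct toy_skb *skb)  { skb->seen_by_ip++; }

/* Deliver to the "all packets" handlers first, then to the handlers
 * for this protocol. Returns the number of handlers that were called. */
static int toy_deliver(struct toy_skb *skb)
{
    int delivered = 0;
    struct toy_packet_type *pt;

    for (pt = toy_ptype_list; pt != NULL; pt = pt->next)
        if (pt->type == TOY_ETH_P_ALL) {
            pt->func(skb);
            delivered++;
        }
    for (pt = toy_ptype_list; pt != NULL; pt = pt->next)
        if (pt->type != TOY_ETH_P_ALL && pt->type == skb->protocol) {
            pt->func(skb);
            delivered++;
        }
    return delivered;
}
```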
The IPv4 packet handler
The IP packet handler is registered via net/core/dev.c:dev_add_pack(), called from net/ipv4/ip_output.c:ip_init().
The IPv4 packet handling function is net/ipv4/ip_input.c:ip_rcv(). After some initial checks (whether the packet header is a correct IPv4 header and whether it is at least 20 bytes long), the IPv4 header checksum is verified. Every packet failing one of the sanity checks is dropped at this point. If the packet passes the tests, we determine the size of the IP packet and trim the skb in case the transport medium has appended some padding.
Now one of the netfilter hooks (NF_IP_PRE_ROUTING) is called for the first time. Netfilter provides a generic and abstract interface to the standard routing code. This is currently used for packet filtering, mangling, NAT and queuing packets to userspace. For further reference see my conference paper 'The netfilter subsystem in Linux 2.4' or one of Rusty's unreliable guides, e.g. the netfilter-hacking-guide.
After successful traversal of the netfilter hook (i.e. no registered module returned NF_DROP or NF_STOLEN), net/ipv4/ip_input.c:ip_rcv_finish() is called. Inside net/ipv4/ip_input.c:ip_rcv_finish(), the 'virtual path cache', also referred to as destination cache, is initialized by calling the routing function net/ipv4/route.c:ip_route_input().
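The sanity checks that ip_rcv() performs before the first netfilter hook can be tried out on a raw byte buffer. The following is a sketch of the idea, not the kernel code: toy_ip_checksum() and toy_ip_rcv_check() are hypothetical helper names, and the checksum routine is the plain one's-complement sum over 16-bit words rather than the optimized assembler the kernel uses.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum over the header, folded to 16 bits and inverted.
 * For a header whose checksum field is correct the result is 0. */
static uint16_t toy_ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)(hdr[i] << 8 | hdr[i + 1]);
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Returns 0 if the packet has to be dropped, otherwise the length the
 * buffer should be trimmed to (the total length from the IP header). */
static size_t toy_ip_rcv_check(const uint8_t *pkt, size_t len)
{
    if (len < 20)
        return 0;                        /* too short for any IPv4 header */

    unsigned int version = pkt[0] >> 4;
    size_t ihl = (size_t)(pkt[0] & 0x0f) * 4;

    if (version != 4 || ihl < 20 || ihl > len)
        return 0;                        /* not a sane IPv4 header */
    if (toy_ip_checksum(pkt, ihl) != 0)
        return 0;                        /* header checksum mismatch */

    size_t tot_len = (size_t)(pkt[2] << 8 | pkt[3]);

    if (tot_len < ihl || tot_len > len)
        return 0;                        /* bogus or truncated packet */
    return tot_len;                      /* anything beyond this is padding */
}
```

If the buffer is longer than the returned length, the surplus is link-layer padding, which is what the trimming of the skb removes.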
ip_route_input()
First, the routing cache (rt_hash_table) is looked up. In case we already have an entry in the hash, skb->dst will be set up to point at the routing cache entry. (Please note that the first member of rtable is a dst_entry, so this kind of typecast will work.)
If there is no routing cache hit, the packet hits the multicast recognition logic. This was moved from the route cache to net/ipv4/route.c:ip_route_input() since there appear to be too many Ethernet devices with broken or even missing multicast filters. As a result, Linux ended up creating way too many routing cache entries for whatever incoming multicast packets arrived. If the packet is valid multicast, processing continues at net/ipv4/route.c:ip_route_input_mc(). For non-multicast packets, net/ipv4/route.c:ip_route_input_slow() is the next step.
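The remark about the typecast deserves a tiny example. C guarantees that a pointer to a structure, suitably converted, points to its first member, which is why a routing cache entry can be handed around as a generic destination entry. The structures below are hypothetical and heavily reduced; the real struct rtable and struct dst_entry have many more fields.

```c
#include <stddef.h>

/* Hypothetical, reduced stand-in for struct dst_entry. */
struct toy_dst_entry {
    int refcnt;             /* number of users of this entry */
    unsigned long lastuse;  /* time of last use */
};

/* Hypothetical, reduced stand-in for struct rtable. */
struct toy_rtable {
    struct toy_dst_entry dst;  /* has to stay the first member */
    unsigned int rt_dst;       /* destination address */
    unsigned int rt_src;       /* source address */
};

/* View a routing cache entry as the generic destination entry it starts with. */
static struct toy_dst_entry *toy_rt_to_dst(struct toy_rtable *rt)
{
    return (struct toy_dst_entry *)rt;
}

/* Generic code only knows about toy_dst_entry. */
static void toy_dst_hold(struct toy_dst_entry *dst)
{
    dst->refcnt++;
}
```

Generic code such as toy_dst_hold() never needs to know that the entry it works on is actually embedded in a larger, routing-specific structure.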
ip_route_input_slow()
After lots of safety checks (martian sources, loopback, local source addresses, ...) the 'real' routing table (aka FIB, the forwarding information base) is looked up. If our packet has IP options, they are parsed at this time. The parsed options end up inside IPCB(skb)->opt.
Further processing of the packet happens as determined by the destination cache within the call to include/net/dst.h:dst_input(), which iterates over the destination stack and calls the input functions of the destination entries.
If there is no key in the hash table, a new one is calculated and inserted into the hash table. Then we call net/ipv4/route.c:ip_route_input_slow().
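How the routing decision steers the packet through dst_input() is easiest to see with function pointers. The following is a minimal sketch under my own assumptions: the toy_ structures and toy_route_input() are hypothetical, the "routing table" consists of one local address and one known network, and the handlers only return a marker value instead of processing anything.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_skb;

/* Hypothetical destination entry, reduced to the input function pointer. */
struct toy_dst {
    int (*input)(struct toy_skb *skb);
};

struct toy_skb {
    uint32_t daddr;        /* destination address, host byte order */
    struct toy_dst *dst;   /* filled in by the routing decision */
};

enum { TOY_DELIVERED_LOCALLY = 1, TOY_FORWARDED = 2, TOY_UNREACHABLE = 3 };

/* Stand-ins for ip_local_deliver(), ip_forward() and ip_error(). */
static int toy_ip_local_deliver(struct toy_skb *skb) { (void)skb; return TOY_DELIVERED_LOCALLY; }
static int toy_ip_forward(struct toy_skb *skb)       { (void)skb; return TOY_FORWARDED; }
static int toy_ip_error(struct toy_skb *skb)         { (void)skb; return TOY_UNREACHABLE; }

static struct toy_dst toy_dst_local   = { toy_ip_local_deliver };
static struct toy_dst toy_dst_forward = { toy_ip_forward };
static struct toy_dst toy_dst_error   = { toy_ip_error };

/* Toy routing decision: attach the matching destination entry to the skb. */
static void toy_route_input(struct toy_skb *skb, uint32_t local_addr,
                            uint32_t net, uint32_t mask)
{
    if (skb->daddr == local_addr)
        skb->dst = &toy_dst_local;
    else if ((skb->daddr & mask) == net)
        skb->dst = &toy_dst_forward;
    else
        skb->dst = &toy_dst_error;
}

/* Counterpart of dst_input(): just follow the function pointer. */
static int toy_dst_input(struct toy_skb *skb)
{
    return skb->dst->input(skb);
}
```

The caller of toy_dst_input() does not contain any "local or forward?" logic of its own; that decision was made once by the routing code and is carried along with the packet.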

The packet journey goes on with the call to the input method of the skb, which has been set up in net/ipv4/route.c:ip_route_input() to point to one of the following functions:

net/ipv4/ip_input.c:ip_local_deliver(): The packet's destination is local. We have to process the layer 4 protocol and pass it to a userspace process.

net/ipv4/ip_forward.c:ip_forward(): The packet's destination is not local. We have to forward it to another network.

net/ipv4/route.c:ip_error(): An error occurred. We are unable to find an appropriate routing table entry for this packet.

net/ipv4/ipmr.c:ip_mr_input(): It is a multicast packet and we have to do some multicast routing.

Packet forwarding to another device

If the routing decided that this packet has to be forwarded to another device, the function net/ipv4/ip_forward.c:ip_forward() is called.

The first task of this function is to check the ip header's TTL. If it is <= 1 we drop the packet and return an ICMP time exceeded message to the sender.

We check whether the skb has enough headroom for the destination device's link layer header and expand the skb if necessary.

Next the TTL is decremented by one.
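Changing the TTL also changes the header checksum, and since only one byte is modified the checksum can be patched instead of recomputed. The sketch below shows the arithmetic on a raw header buffer. It is my own illustration with a hypothetical function name (toy_decrease_ttl()), working on bytes in network order so that it does not depend on the host's endianness.

```c
#include <stdint.h>

/* Decrement the TTL of a raw IPv4 header and patch the header checksum
 * incrementally. Byte 8 is the TTL, bytes 10 and 11 hold the checksum.
 * Returns the new TTL. The caller must have checked that the TTL is > 1. */
static uint8_t toy_decrease_ttl(uint8_t *hdr)
{
    uint32_t check = (uint32_t)(hdr[10] << 8 | hdr[11]);

    /* The TTL is the high byte of its 16-bit word, so decrementing it
     * lowers the header sum by 0x0100 and raises the checksum by 0x0100. */
    check += 0x0100;
    check += (check >= 0xffff);   /* end-around carry of one's complement */
    check &= 0xffff;

    hdr[10] = (uint8_t)(check >> 8);
    hdr[11] = (uint8_t)(check & 0xff);
    return --hdr[8];
}
```

For a header with TTL 0x40 and checksum 0xb861, the function leaves TTL 0x3f and checksum 0xb961 behind, which is the same value a full recomputation over the modified header gives.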

If our new packet is bigger than the MTU of the destination device and the don't fragment bit in the IP header is set, we drop the packet and send an ICMP frag needed message to the sender.
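The checks described so far boil down to a small decision function. This is only a sketch following the order of the steps as listed above; the structure, the verdict names and toy_ip_forward_check() are hypothetical, and the headroom handling as well as the actual sending of ICMP messages are left out.

```c
#include <stdint.h>

/* Hypothetical verdicts of the forwarding checks. */
enum toy_fwd_verdict {
    TOY_FWD_CONTINUE,       /* go on to the NF_IP_FORWARD hook */
    TOY_FWD_TIME_EXCEEDED,  /* drop, answer with ICMP time exceeded */
    TOY_FWD_FRAG_NEEDED     /* drop, answer with ICMP frag needed */
};

/* Only the header fields the checks look at. */
struct toy_fwd_pkt {
    uint8_t ttl;            /* time to live from the IP header */
    unsigned int len;       /* total length of the packet in bytes */
    int dont_fragment;      /* non-zero if the DF bit is set */
};

static enum toy_fwd_verdict toy_ip_forward_check(struct toy_fwd_pkt *p,
                                                 unsigned int mtu)
{
    if (p->ttl <= 1)
        return TOY_FWD_TIME_EXCEEDED;   /* TTL would expire at this hop */

    p->ttl--;                           /* we count as one hop */

    if (p->len > mtu && p->dont_fragment)
        return TOY_FWD_FRAG_NEEDED;     /* too big and may not be split */

    return TOY_FWD_CONTINUE;
}
```

A packet that is too big but does not have the don't fragment bit set passes these checks; it is split up later on the output path.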

Finally it is time to call another one of the netfilter hooks - this time it is the NF_IP_FORWARD hook.

Assuming that the netfilter hook returns an NF_ACCEPT verdict, the function net/ipv4/ip_forward.c:ip_forward_finish() is the next step in our packet's journey.

ip_forward_finish() itself checks if we need to set any additional options in the IP header, and has net/ipv4/ip_options.c:ip_forward_options() doing this. Afterwards it calls include/net/ip.h:ip_send().

If we need some fragmentation, net/ipv4/ip_output.c:ip_fragment() gets called; otherwise we continue in net/ipv4/ip_output.c:ip_finish_output().
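To get a feeling for when fragmentation kicks in and how many packets come out of it, here is a small calculation sketch. It is not taken from ip_fragment(); toy_fragment_count() is a hypothetical helper, and it assumes a 20 byte IP header without options. The one hard rule it relies on is that every fragment except the last has to carry a multiple of 8 bytes of payload, because fragment offsets are counted in units of 8 bytes.

```c
#include <stddef.h>

#define TOY_IP_HDR_LEN 20   /* assumption: IP header without options */

/* Number of packets needed to send payload_len bytes of IP payload over
 * a link with the given MTU. 1 means no fragmentation is needed, 0 means
 * the MTU is too small to carry even a single 8 byte fragment. */
static size_t toy_fragment_count(size_t payload_len, size_t mtu)
{
    if (mtu < TOY_IP_HDR_LEN + 8)
        return 0;
    if (TOY_IP_HDR_LEN + payload_len <= mtu)
        return 1;                       /* fits, no need to fragment */

    /* Payload per fragment, rounded down to a multiple of 8 bytes. */
    size_t per_frag = (mtu - TOY_IP_HDR_LEN) & ~(size_t)7;

    return (payload_len + per_frag - 1) / per_frag;
}
```

With an MTU of 1500, each fragment carries up to 1480 bytes of payload, so a packet with 3980 bytes of payload leaves as three fragments.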

ip_finish_output() calls the netfilter postrouting hook NF_IP_POST_ROUTING, which is where NAT for outgoing packets takes place, and then ip_finish_output2() on successful traversal of the hook.

ip_finish_output2() prepends the hardware (link layer) header to our skb and calls dst->hh->hh_output(), which seems to usually be net/core/dev.c:dev_queue_xmit().

dev_queue_xmit() enqueues the packet for transmission by the network device.

Acknowledgements

Of course I wouldn't have been able to write this document if lots of other people had not influenced me in some way, enabling me to understand all that code in the first place.

I want to list here:

Linus Torvalds, who got us started with that whole thing in the first place.

Alan Cox, David Miller, Alexey Kuznetsov, Andi Kleen: The net.gods

Rusty Russell for his great work on netfilter and his help at LBW2000

The following people have directly contributed to this document so far: Alexandre Dagan <alexandre.dagan@linuxmail.org>