IPIP Source Code Analysis · network

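[TOC]

### **Reading the source with questions in mind**

* Why does the inner header of an IPIP packet consist of an IP header only, with no MAC header?
* On the receiving host, after the physical NIC has received the packet, how is the outer IP header removed and the packet handed over to the tunl0 device?

Note: the kernel source used in this article is version 5.4.200.

### **Transmit path**

Suppose we ping container Pod0 on host 122 from container Pod0 on host 121. The original IP header of the packet is then 172.24.121.2 -> 172.24.122.2. Once the packet enters the kernel network stack of host 121, the routing decision sends it out through the tunl0 interface. The last network-layer function the packet passes through is `ip_finish_output2()`. Its core source is as follows:

```
static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *skb)
{
    ...
    neigh = ip_neigh_for_gw(rt, skb, &is_v6gw);    /* look up the neighbour by device and gateway (here tunl0 and 192.168.92.122) */
    if (!IS_ERR(neigh)) {
        int res;

        sock_confirm_neigh(skb, neigh);
        /* if crossing protocols, can not use the cached header */
        res = neigh_output(neigh, skb, is_v6gw);    /* the packet leaves through neigh_output() */
        rcu_read_unlock_bh();
        return res;
    }
    ...
}
```

First, the function looks up the neighbour for the device and gateway (tunl0 and 192.168.92.122). Note that this neighbour does not carry the MAC address of the next hop (for example, when the next hop is in a different subnet, this host has no way to learn its MAC address directly). In the kernel, a neighbour is a fairly complex data structure that holds not only data but also function pointers. The function then calls `neigh_output()`, whose core source is:

```
static inline int neigh_output(struct neighbour *n, struct sk_buff *skb,
                               bool skip_cache)
{
    const struct hh_cache *hh = &n->hh;

    /* n->nud_state and hh->hh_len could be changed under us.
     * neigh_hh_output() is taking care of the race later.
     */
    if (!skip_cache &&
        (READ_ONCE(n->nud_state) & NUD_CONNECTED) &&
        READ_ONCE(hh->hh_len))
        return neigh_hh_output(hh, skb);    /* taken when the neighbour is connected and has a cached header */

    return n->output(n, skb);    /* tunl0 has no MAC address, so the packet leaves through this call */
}
```

`n->output()` is the output function of the neighbour found above. We need to work out which function this actually is. Let us first look at `struct neighbour`:

```
struct neighbour {
    ...
    int (*output)(struct neighbour *, struct sk_buff *);    /* output is a function pointer */
    ...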
    struct net_device *dev;    /* the device this neighbour belongs to */
    u8 primary_key[0];         /* the L3 address of this neighbour; an IP address for IPv4 */
} __randomize_layout;
```

As shown, `output` is a function pointer in `struct neighbour`. Which function does it point to? To answer that we have to look at how this neighbour instance is initialised and which function is assigned to `output` at that time. Going back to `ip_finish_output2()`, the neigh instance is the return value of `ip_neigh_for_gw()`, so we continue with that function:

```
static inline struct neighbour *ip_neigh_for_gw(struct rtable *rt,
                                                struct sk_buff *skb,
                                                bool *is_v6gw)
{
    struct net_device *dev = rt->dst.dev;
    struct neighbour *neigh;

    if (likely(rt->rt_gw_family == AF_INET)) {    /* IPv4 gateway: call ip_neigh_gw4() */
        neigh = ip_neigh_gw4(dev, rt->rt_gw4);
    } else if (rt->rt_gw_family == AF_INET6) {
        neigh = ip_neigh_gw6(dev, &rt->rt_gw6);
        *is_v6gw = true;
    } else {
        neigh = ip_neigh_gw4(dev, ip_hdr(skb)->daddr);
    }
    return neigh;
}
```

Next, `ip_neigh_gw4()`:

```
static inline struct neighbour *ip_neigh_gw4(struct net_device *dev,
                                             __be32 daddr)
{
    struct neighbour *neigh;

    neigh = __ipv4_neigh_lookup_noref(dev, (__force u32)daddr);    /* look up by device and daddr (destination address) */
    if (unlikely(!neigh))                                          /* not found: create one */
        neigh = __neigh_create(&arp_tbl, &daddr, dev, false);

    return neigh;
}
```

To find out which function the `output` pointer really refers to, we follow `__neigh_create()` and see how it initialises `output`. `__neigh_create()` is a thin wrapper around `___neigh_create()`, whose source is:

```
static struct neighbour *
___neigh_create(struct neigh_table *tbl, const void *pkey,
                struct net_device *dev, u8 flags,
                bool exempt_from_gc, bool want_ref)
{
    ...
    n = neigh_alloc(tbl, dev, flags, exempt_from_gc);    /* allocate memory for the neighbour */
    ...
    /* Protocol specific setup. */
    /* If the table has a constructor, call it to construct (initialise) the neighbour. */
    if (tbl->constructor && (error = tbl->constructor(n)) < 0) {
        rc = ERR_PTR(error);
        goto out_neigh_release;
    }
    ...
    neigh_dbg(2, "neigh %p is created\n", n);
    rc = n;
out:
    return rc;
out_tbl_unlock:
    write_unlock_bh(&tbl->lock);
out_neigh_release:
    if (!exempt_from_gc)
        atomic_dec(&tbl->gc_entries);
    neigh_release(n);
    goto out;
}
```

The interesting part of the code above is `tbl->constructor(n)`. As the name suggests, it is a constructor that initialises `n`. To read the constructor we first need to know what `tbl` is. Looking back at `ip_neigh_gw4()`, the `tbl` pointer is `&arp_tbl`. Searching for `arp_tbl` gives the following definition:

```
struct neigh_table arp_tbl = {
    .family      = AF_INET,
    .key_len     = 4,
    .protocol    = cpu_to_be16(ETH_P_IP),
    .hash        = arp_hash,
    .key_eq      = arp_key_eq,
    .constructor = arp_constructor,    /* the constructor pointer refers to arp_constructor() */
    ...
};
EXPORT_SYMBOL(arp_tbl);
```

`arp_tbl` is an instance of `struct neigh_table`. That structure has a member `constructor`, which is a function pointer (the definition of `neigh_table` is not listed here; see the source). In `arp_tbl` the pointer refers to `arp_constructor`. The core source of `arp_constructor` is:

```
static int arp_constructor(struct neighbour *neigh)
{
    __be32 addr;
    struct net_device *dev = neigh->dev;
    ...
    if (!dev->header_ops) {    /* the device's header_ops is NULL */
        neigh->nud_state = NUD_NOARP;
        neigh->ops = &arp_direct_ops;
        neigh->output = neigh_direct_output;    /* output now points to neigh_direct_output() */
    } else {
        ...
    }
    return 0;
}
```

For an ipip network device, `header_ops` is NULL (see "Appendix 1: header_ops of the ipip network device" at the end of this article). The constructor therefore takes the `if` branch, and the `output` pointer ends up referring to `neigh_direct_output()`. Its source is:

```
int neigh_direct_output(struct neighbour *neigh, struct sk_buff *skb)
{
    return dev_queue_xmit(skb);
}
EXPORT_SYMBOL(neigh_direct_output);
```

It does nothing but call `dev_queue_xmit()`, which is the first function of the L2 (device) layer on the transmit path.

**The main job of the neighbour layer is to set the L2 hardware address of a packet. The analysis above shows that for an IPIP tunnel, neigh->output() is neigh_direct_output(), which goes straight to the L2 layer. This is why the inner header of an IPIP packet has no L2 address.**

Next we look at why the packet is not sent out of a NIC directly, but instead gets another IP header added in front. The source of `dev_queue_xmit()` is:

```
int dev_queue_xmit(struct sk_buff *skb)
{
    return __dev_queue_xmit(skb, NULL);
}
```

And the source of `__dev_queue_xmit()`:

```
static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
{
    ...
    if (q->enqueue) {    /* per the comment below, an ipip tunnel has no queue, so this branch is not taken */
        rc = __dev_xmit_skb(skb, q, dev, txq);
        goto out;
    }

    /* The device has no queue. Common case for software devices:
     * loopback, all the sorts of tunnels...
     * ...
     */
    if (dev->flags & IFF_UP) {
        ...
        if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
            if (dev_xmit_recursion())
                goto recursion_alert;

            skb = validate_xmit_skb(skb, dev, &again);
            if (!skb)
                goto out;

            HARD_TX_LOCK(dev, txq, cpu);

            if (!netif_xmit_stopped(txq)) {
                dev_xmit_recursion_inc();
                skb = dev_hard_start_xmit(skb, dev, txq, &rc);    /* the packet leaves through here */
                dev_xmit_recursion_dec();
                if (dev_xmit_complete(rc)) {
                    HARD_TX_UNLOCK(dev, txq);
                    goto out;
                }
            }
            ...
        } else {
            ...
        }
    }
    ...
}
```

For an IPIP tunnel the packet therefore leaves through `dev_hard_start_xmit()`:

```
struct sk_buff *dev_hard_start_xmit(struct sk_buff *first, struct net_device *dev,
                                    struct netdev_queue *txq, int *ret)
{
    struct sk_buff *skb = first;
    int rc = NETDEV_TX_OK;

    while (skb) {
        struct sk_buff *next = skb->next;

        skb_mark_not_on_list(skb);
        rc = xmit_one(skb, dev, txq, next != NULL);    /* xmit_one() sends one packet per call; the packet leaves here */
        if (unlikely(!dev_xmit_complete(rc))) {
            skb->next = next;
            goto out;
        }

        skb = next;
        if (netif_tx_queue_stopped(txq) && skb) {
            rc = NETDEV_TX_BUSY;
            break;
        }
    }

out:
    *ret = rc;
    return skb;
}
```

Next, the source of `xmit_one()`:

```
static int xmit_one(struct sk_buff *skb, struct net_device *dev,
                    struct netdev_queue *txq, bool more)
{
    unsigned int len;
    int rc;

    if (dev_nit_active(dev))
        dev_queue_xmit_nit(skb, dev);

    len = skb->len;
    trace_net_dev_start_xmit(skb, dev);
    rc = netdev_start_xmit(skb, dev, txq, more);    /* execution continues in this function */
    trace_net_dev_xmit(skb, rc, dev, len);

    return rc;
}
```

We continue with `netdev_start_xmit()`:

```
static inline netdev_tx_t netdev_start_xmit(struct sk_buff *skb, struct net_device *dev,
                                            struct netdev_queue *txq, bool more)
{
    const struct net_device_ops *ops = dev->netdev_ops;
    netdev_tx_t rc;

    rc = __netdev_start_xmit(ops, skb, dev, more);    /* which finally calls this function */
    if (rc == NETDEV_TX_OK)
        txq_trans_update(txq);

    return rc;
}
```
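Before moving on to the device's own transmit function, the neighbour-layer decision analysed earlier can be condensed into a small user-space model. This is only a sketch: every `toy_*` name is hypothetical and merely stands in for the kernel structure or function it is named after, and the "resolving" branch is heavily simplified. It shows how a constructor that checks `header_ops` selects the output function, and why a device without `header_ops` never gets an L2 header.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for kernel structures; this is not kernel code. */
struct toy_header_ops { int unused; };

struct toy_dev {
    const char *name;
    const struct toy_header_ops *header_ops;    /* NULL for tunnel-like devices */
};

struct toy_skb {
    int has_l2_header;    /* 1 once a link-layer header has been prepended */
    int handed_to_dev;    /* 1 once the packet has reached the device layer */
};

struct toy_neigh {
    struct toy_dev *dev;
    int (*output)(struct toy_neigh *, struct toy_skb *);
};

/* Stand-in for dev_queue_xmit(): only records that the device layer was reached. */
static int toy_dev_queue_xmit(struct toy_skb *skb)
{
    skb->handed_to_dev = 1;
    return 0;
}

/* Models neigh_direct_output(): no L2 header, straight to the device layer. */
static int toy_direct_output(struct toy_neigh *n, struct toy_skb *skb)
{
    (void)n;
    return toy_dev_queue_xmit(skb);
}

/* Simplified stand-in for the path taken by Ethernet-like devices. */
static int toy_resolve_output(struct toy_neigh *n, struct toy_skb *skb)
{
    (void)n;
    skb->has_l2_header = 1;
    return toy_dev_queue_xmit(skb);
}

/* Models the branch on dev->header_ops in arp_constructor(). */
static void toy_constructor(struct toy_neigh *n, struct toy_dev *dev)
{
    n->dev = dev;
    n->output = dev->header_ops ? toy_resolve_output : toy_direct_output;
}

/* Sends one packet through a freshly constructed neighbour.
 * Returns 1 if an L2 header was added on the way, 0 otherwise. */
static int toy_send_adds_l2(struct toy_dev *dev)
{
    struct toy_neigh n;
    struct toy_skb skb = {0, 0};

    toy_constructor(&n, dev);
    n.output(&n, &skb);
    return skb.has_l2_header;
}
```

Calling `toy_send_adds_l2()` with a device whose `header_ops` is NULL (the tunl0 case) returns 0, while a device with a non-NULL `header_ops` returns 1.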
Now for `__netdev_start_xmit()`:

```
static inline netdev_tx_t __netdev_start_xmit(const struct net_device_ops *ops,
                                              struct sk_buff *skb, struct net_device *dev,
                                              bool more)
{
    __this_cpu_write(softnet_data.xmit.more, more);
    return ops->ndo_start_xmit(skb, dev);
}
```

It ends by calling `ops->ndo_start_xmit()`, which is the network device's real transmit function. It is a function pointer (ndo stands for net device operations). To find out which function it refers to, we need to know how the `ops` object assigns this pointer at initialisation time. In `netdev_start_xmit()` above there is the line `ops = dev->netdev_ops`, so all we need is the `netdev_ops` of an ipip device. Every type of network device has a setup function `xxx_setup()`: for ipip devices it is `ipip_tunnel_setup()`, for vxlan devices it is `vxlan_setup()`. The setup function initialises the `netdev_ops` for that type of device. Here is `ipip_tunnel_setup()`:

```
static void ipip_tunnel_setup(struct net_device *dev)
{
    dev->netdev_ops = &ipip_netdev_ops;    /* netdev_ops is assigned here */

    dev->type = ARPHRD_TUNNEL;
    dev->flags = IFF_NOARP;
    dev->addr_len = 4;
    dev->features |= NETIF_F_LLTX;
    netif_keep_dst(dev);

    dev->features |= IPIP_FEATURES;
    dev->hw_features |= IPIP_FEATURES;
    ip_tunnel_setup(dev, ipip_net_id);
}
```

And the definition of `ipip_netdev_ops`:

```
static const struct net_device_ops ipip_netdev_ops = {
    .ndo_init        = ipip_tunnel_init,
    .ndo_uninit      = ip_tunnel_uninit,
    .ndo_start_xmit  = ipip_tunnel_xmit,    /* ndo_start_xmit of an ipip tunnel device is ipip_tunnel_xmit */
    .ndo_do_ioctl    = ipip_tunnel_ioctl,
    .ndo_change_mtu  = ip_tunnel_change_mtu,
    .ndo_get_stats64 = ip_tunnel_get_stats64,
    .ndo_get_iflink  = ip_tunnel_get_iflink,
};
```

So the `ndo_start_xmit` pointer of an ipip tunnel device refers to `ipip_tunnel_xmit()`. This is the transmit function that the ipip network device registers. Its code is:

```
/*
 * This function assumes it is being called from dev_queue_xmit()
 * and that skb is filled properly by that function.
 */
static netdev_tx_t ipip_tunnel_xmit(struct sk_buff *skb,
                                    struct net_device *dev)
{
    ...
    if (tunnel->collect_md)
        ip_md_tunnel_xmit(skb, dev, ipproto, 0);
    else
        ip_tunnel_xmit(skb, dev, tiph, ipproto);
    return NETDEV_TX_OK;
    ...
}
```

Which of the two branches is taken is not yet clear at this point (it depends on `tunnel->collect_md`). However, the source of `ip_md_tunnel_xmit()` shows that it also ends up calling `iptunnel_xmit()`, the same function that `ip_tunnel_xmit()` ends up calling:

```
void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
                    const struct iphdr *tnl_params, u8 protocol)
{
    ...
    dst = tnl_params->daddr;    /* the tunnel's daddr, i.e. its remote address */
    if (dst == 0) {             /* remote address 0: this is not a point-to-point tunnel */
        /* NBMA tunnel */       /* i.e. a Non-Broadcast Multi-Access tunnel */
        ...
        tun_info = skb_tunnel_info(skb);
        if (tun_info && (tun_info->mode & IP_TUNNEL_INFO_TX) &&
            ip_tunnel_info_af(tun_info) == AF_INET &&
            tun_info->key.u.ipv4.dst) {
            ...
        } else if (skb->protocol == htons(ETH_P_IP)) {
            rt = skb_rtable(skb);                      /* the route attached to the skb */
            dst = rt_nexthop(rt, inner_iph->daddr);    /* next hop of that route: its gateway, or the inner daddr if the route has no gateway */
        }
        ...
    }
    ...
    iptunnel_xmit(NULL, rt, skb, fl4.saddr, fl4.daddr, protocol, tos, ttl,
                  df, !net_eq(tunnel->net, dev_net(dev)));
    return;
    ...
}
EXPORT_SYMBOL_GPL(ip_tunnel_xmit);
```

As shown, the packet finally leaves through `iptunnel_xmit()`:

```
void iptunnel_xmit(struct sock *sk, struct rtable *rt, struct sk_buff *skb,
                   __be32 src, __be32 dst, __u8 proto,
                   __u8 tos, __u8 ttl, __be16 df, bool xnet)
{
    ...
    /* Push down and install the IP header. */
    skb_push(skb, sizeof(struct iphdr));    /* make room in front of the packet for the outer header */
    skb_reset_network_header(skb);

    iph = ip_hdr(skb);                      /* position of the outer header */

    iph->version  = 4;                      /* from here on, fill in the outer IP header */
    iph->ihl      = sizeof(struct iphdr) >> 2;
    iph->frag_off = ip_mtu_locked(&rt->dst) ? 0 : df;
    iph->protocol = proto;
    iph->tos      = tos;
    iph->daddr    = dst;
    iph->saddr    = src;
    iph->ttl      = ttl;
    __ip_select_ident(net, iph, skb_shinfo(skb)->gso_segs ?: 1);

    err = ip_local_out(net, sk, skb);       /* loop back into the network layer */
    ...
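}
EXPORT_SYMBOL_GPL(iptunnel_xmit);
```

As shown, this function fills in the outer header of the packet and then calls `ip_local_out()` to hand the packet back to the network layer.

### **Receive path**

When the physical NIC of the host receives an IPIP packet, the packet is treated as an ordinary IP packet and passed up layer by layer until it reaches `ip_rcv_finish()` in the network layer. The routing decision then determines that the packet is addressed to this host, so it ends up in the kernel function `ip_local_deliver_finish()`. Its source is:

```
static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
{
    __skb_pull(skb, skb_network_header_len(skb));

    rcu_read_lock();
    ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);    /* read the upper-layer protocol from the outer IP header */
    rcu_read_unlock();

    return 0;
}
```

`ip_hdr(skb)->protocol` above reads the L4 protocol number from the outer IP header. The IP header has a Protocol field that states which upper-layer protocol the IP packet encapsulates; see [Wikipedia](https://en.wikipedia.org/wiki/IPv4):

![](https://img.kancloud.cn/eb/5b/eb5b6434e719ba955795c15f21a39547_2279x639.png)

An IP packet can encapsulate many upper-layer protocols, such as TCP, UDP, ICMP, GRE and so on. **If the protocol encapsulated in the IP packet is again IP, we call the inner protocol IPIP and treat IPIP as one more L4 protocol.** The Protocol field read from the IP header is a protocol number. Every protocol has a unique number, and the mapping can be found on [Wikipedia](https://en.wikipedia.org/wiki/List_of_IP_protocol_numbers): TCP is 6 and IPIP is 4. If we capture packets with tcpdump and inspect them in wireshark, we can also see that the Protocol field of the outer IP header is 4, meaning that the upper layer is IPIP:

![](https://img.kancloud.cn/e8/81/e881362533651e1c2b73f6bb7c31f8c4_2053x738.png)

With the upper-layer protocol number in hand, the kernel calls `ip_protocol_deliver_rcu()`:

```
void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int protocol)
{
    const struct net_protocol *ipprot;
    ...
    ipprot = rcu_dereference(inet_protos[protocol]);    /* look up the net_protocol object registered for this protocol number */
    if (ipprot) {
        ...
        /* call the handler function that the protocol registered in its net_protocol */
        ret = INDIRECT_CALL_2(ipprot->handler, tcp_v4_rcv, udp_rcv,
                              skb);
        ...
    } else {
        ...
    }
}
```

This function first looks up the net_protocol object registered for the protocol number and then calls the handler function in that object. Here is the definition of the net_protocol structure:

```
/* This is used to register protocols. */
struct net_protocol {
    int (*early_demux)(struct sk_buff *skb);
    int (*early_demux_handler)(struct sk_buff *skb);
    int (*handler)(struct sk_buff *skb);    /* function pointer; an L4 protocol sets it to its own receive function when registering */

    /* This returns an error if we weren't able to handle the error.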
     */
    int (*err_handler)(struct sk_buff *skb, u32 info);

    unsigned int no_policy:1,
                 netns_ok:1,
                 /* does the protocol do more stringent
                  * icmp tag validation than simple
                  * socket lookup?
                  */
                 icmp_strict_tag_validation:1;
};
```

Next, the registration function of the IPIP protocol:

```
static int __init tunnel4_init(void)
{
    if (inet_add_protocol(&tunnel4_protocol, IPPROTO_IPIP))    /* the net_protocol object registered for IPIP is tunnel4_protocol */
        goto err;
    ...
err:
    pr_err("%s: can't add protocol\n", __func__);
    return -EAGAIN;
}
```

The net_protocol object registered for IPIP is tunnel4_protocol. Its definition is:

```
static const struct net_protocol tunnel4_protocol = {
    .handler     = tunnel4_rcv,    /* the handler function is tunnel4_rcv */
    .err_handler = tunnel4_err,
    .no_policy   = 1,
    .netns_ok    = 1,
};
```

Its handler function is `tunnel4_rcv()`. We now know that for an IPIP packet, `ip_local_deliver_finish()` goes on to call `tunnel4_rcv()`. The source of that function is:

```
static int tunnel4_rcv(struct sk_buff *skb)
{
    struct xfrm_tunnel *handler;

    if (!pskb_may_pull(skb, sizeof(struct iphdr)))
        goto drop;

    /* equivalent to: for (handler = tunnel4_handlers; handler != NULL; handler = handler->next) */
    for_each_tunnel_rcu(tunnel4_handlers, handler)
        if (!handler->handler(skb))
            return 0;

    icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);

drop:
    kfree_skb(skb);
    return 0;
}
```

The key part is the for loop. It is a macro, and expanding it gives the equivalent shown in the comment. `tunnel4_handlers` is a global variable that is the head of a linked list; tunnel protocols of the tunnel4 kind add themselves to this list in the form of a `struct xfrm_tunnel`. The packet is actually processed by `handler->handler(skb)`, where the second `handler` is a function pointer in `struct xfrm_tunnel`. The definition of `xfrm_tunnel` is simple:

```
struct xfrm_tunnel {
    int (*handler)(struct sk_buff *skb);
    int (*err_handler)(struct sk_buff *skb, u32 info);

    struct xfrm_tunnel __rcu *next;
    int priority;
};
```

The most important fields are the `handler` function pointer and `priority`. A protocol adds itself to the list by calling the following function:

```
int xfrm4_tunnel_register(struct xfrm_tunnel *handler, unsigned short family)
{
    struct xfrm_tunnel __rcu **pprev;
    struct xfrm_tunnel *t;

    int ret = -EEXIST;
    int priority = handler->priority;

    mutex_lock(&tunnel4_mutex);

    for (pprev = fam_handlers(family);
         (t = rcu_dereference_protected(*pprev,
                lockdep_is_held(&tunnel4_mutex))) != NULL;
         pprev = &t->next) {
        if (t->priority > priority)    /* stop at the first entry with a larger priority value, so the list stays in ascending priority order */
            break;
        if (t->priority == priority)
            goto err;
    }

    handler->next = *pprev;
    rcu_assign_pointer(*pprev, handler);

    ret = 0;

err:
    mutex_unlock(&tunnel4_mutex);

    return ret;
}
```

The main job of this function is to insert the protocol's `xfrm_tunnel` object into the list. The loop stops at the first entry whose priority value is greater than that of the new handler and inserts the new handler in front of it, so the list is kept in ascending order of `priority`: the smaller the value, the closer the handler sits to the head of the list and the earlier it gets to process a packet in `tunnel4_rcv()`. A handler whose priority is already present in the list is rejected with `-EEXIST`. Here is how the IPIP protocol inserts itself into the list:

```
static struct xfrm_tunnel ipip_handler __read_mostly = {
    .handler     = ipip_rcv,
    .err_handler = ipip_err,
    .priority    = 1,
};

static int __init ipip_init(void)
{
    int err;
    ...
    err = xfrm4_tunnel_register(&ipip_handler, AF_INET);    /* register the handler */
    ...
}
```

In the structure that IPIP registers, the `handler` function pointer refers to `ipip_rcv` and `priority` is 1. (A global search for `xfrm4_tunnel_register` also turns up a second registration, `xfrm_tunnel_handler`, with priority 3. Its purpose is not examined here.) So the packet finally reaches `ipip_rcv`, whose source is:

```
static int ipip_rcv(struct sk_buff *skb)
{
    return ipip_tunnel_rcv(skb, IPPROTO_IPIP);
}
```

Next, the source of `ipip_tunnel_rcv()`:

```
static int ipip_tunnel_rcv(struct sk_buff *skb, u8 ipproto)
{
    ...
    /* look up the tunnel; there may be many, for example tunnels created with specific local and remote addresses */
    tunnel = ip_tunnel_lookup(itn, skb->dev->ifindex, TUNNEL_NO_KEY,
                              iph->saddr, iph->daddr, 0);
    if (tunnel) {
        ...
        return ip_tunnel_rcv(tunnel, skb, tpi, tun_dst, log_ecn_error);
    }

    return -1;

drop:
    kfree_skb(skb);
    return 0;
}
```

Finally it calls `ip_tunnel_rcv()`:

```
int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb,
                  const struct tnl_ptk_info *tpi, struct metadata_dst *tun_dst,
                  bool log_ecn_error)
{
    ...
    /* set the network header again so that it points at the inner IP header, i.e. drop the outer IP header */
    skb_set_network_header(skb, (tunnel->dev->type == ARPHRD_ETHER) ? ETH_HLEN : 0);
    ...
    gro_cells_receive(&tunnel->gro_cells, skb);    /* re-inject into the kernel stack, received on the tunnel's network device */
    ...
}
```

Once the packet has been re-injected on the tunnel's network device, it goes through the kernel network stack a second time, starting from the L2 protocol layer.

The call chain after `gro_cells_receive` is as follows (see [this article](https://www.daimajiaoliu.com/daima/a598f3911129803) or trace it with an online source browser):

```
gro_cells_receive
-> netif_rx
-> netif_rx_internal
-> enqueue_to_backlog
-> ____napi_schedule
-> __raise_softirq_irqoff
```

`enqueue_to_backlog` puts the packet on the receive queue of the network device (here the IPIP virtual device) and finally raises a softirq. According to Figure 2.2 on page 13 of Zhang Yanfei's book 《深入理解Linux网络》, when the ksoftirqd thread handles the softirq it calls the poll function registered by the NIC driver to start receiving packets. The last two lines of page 280 of the same book say:

> The receive poll function is the same for all virtual devices: it is initialised to process_backlog in the device layer.

So we continue with `process_backlog` and find that it calls `__netif_receive_skb`. At this point the whole flow is clear.

### **Summary of kernel functions**

```
# Transmit
ip_finish_output2
-> ip_neigh_for_gw
-> neigh_output
-> n->output (neigh_direct_output)
-> dev_queue_xmit
-> __dev_queue_xmit
-> dev_hard_start_xmit
-> xmit_one
    -> dev_queue_xmit_nit    // where tcpdump captures the packet
    -> netdev_start_xmit
-> __netdev_start_xmit
-> ops->ndo_start_xmit (ipip_tunnel_xmit)

ipip_tunnel_xmit
-> ip_tunnel_xmit
-> iptunnel_xmit
-> ip_local_out
-> ...
-> ip_finish_output2
```

```
# Receive
ip_local_deliver_finish
-> ip_protocol_deliver_rcu
-> ipprot->handler (tunnel4_rcv)
-> ipip_rcv
-> ipip_tunnel_rcv
-> ip_tunnel_rcv
-> gro_cells_receive
-> netif_rx
-> netif_rx_internal
-> enqueue_to_backlog
-> ____napi_schedule
-> __raise_softirq_irqoff

process_backlog
-> __netif_receive_skb
```

### **Appendix 1: header_ops of the ipip network device**

First, the declaration of header_ops in the net_device structure:

```
struct net_device {
    ...
    const struct header_ops *header_ops;
    ...
};
```

It is a pointer. `ipip_tunnel_setup()` never assigns this field, so it keeps its default value, NULL:

```
static void ipip_tunnel_setup(struct net_device *dev)
{
    dev->netdev_ops = &ipip_netdev_ops;

    dev->type = ARPHRD_TUNNEL;
    dev->flags = IFF_NOARP;
    dev->addr_len = 4;
    dev->features |= NETIF_F_LLTX;
    netif_keep_dst(dev);

    dev->features |= IPIP_FEATURES;
    dev->hw_features |= IPIP_FEATURES;
    ip_tunnel_setup(dev, ipip_net_id);
}
```

### **References**

* https://morven.life/posts/networking-3-ipip/
* https://github.com/beacer/notes/blob/master/kernel/ipip.md
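
### **Appendix 2: a user-space sketch of encapsulation and decapsulation**

As a supplement to the transmit and receive paths analysed in this article, the following is a minimal user-space sketch of what happens to the bytes of a packet: an outer IPv4 header with protocol number 4 is put in front of the inner packet on transmit, and on receive the outer header is skipped once its Protocol field is found to be 4. All `toy_*` / `TOY_*` names are hypothetical, the TTL value is arbitrary, and the checksum, identification and fragment fields are left at zero, so this is an illustration of the header layout only, not a substitute for `iptunnel_xmit()` or `ip_tunnel_rcv()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_IPHDR_LEN  20    /* IPv4 header without options */
#define TOY_PROTO_IPIP 4     /* protocol number of IP-in-IP */

/* Writes a minimal outer IPv4 header followed by `inner` into `out`.
 * Returns the total length, or 0 if `out` is too small or the packet too long.
 * Assumption: checksum, identification and fragment fields stay zero. */
static size_t toy_encap(const uint8_t *inner, size_t inner_len,
                        const uint8_t src[4], const uint8_t dst[4],
                        uint8_t *out, size_t out_cap)
{
    size_t total = TOY_IPHDR_LEN + inner_len;

    if (total > out_cap || total > 0xffff)
        return 0;

    memset(out, 0, TOY_IPHDR_LEN);
    out[0] = (uint8_t)((4 << 4) | (TOY_IPHDR_LEN >> 2));    /* version = 4, ihl = 5 */
    out[2] = (uint8_t)(total >> 8);                          /* total length, big endian */
    out[3] = (uint8_t)(total & 0xff);
    out[8] = 64;                                             /* ttl, arbitrary choice */
    out[9] = TOY_PROTO_IPIP;                                 /* upper-layer protocol: IPIP */
    memcpy(out + 12, src, 4);                                /* outer source address */
    memcpy(out + 16, dst, 4);                                /* outer destination address */
    memcpy(out + TOY_IPHDR_LEN, inner, inner_len);
    return total;
}

/* Returns a pointer to the inner packet if the outer header is IPv4 with
 * protocol 4, and stores the inner length in *inner_len; returns NULL otherwise. */
static const uint8_t *toy_decap(const uint8_t *pkt, size_t len, size_t *inner_len)
{
    size_t hlen;

    if (len < TOY_IPHDR_LEN || (pkt[0] >> 4) != 4)
        return NULL;

    hlen = (size_t)(pkt[0] & 0x0f) * 4;    /* ihl is counted in 32-bit words */
    if (hlen < TOY_IPHDR_LEN || hlen > len || pkt[9] != TOY_PROTO_IPIP)
        return NULL;

    *inner_len = len - hlen;
    return pkt + hlen;
}
```

Encapsulating a 24-byte inner packet yields 44 bytes on the wire, and `toy_decap()` returns a pointer 20 bytes into that buffer. If the Protocol field is changed to anything other than 4, `toy_decap()` returns NULL, which loosely corresponds to the packet never reaching `tunnel4_rcv()`.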