Garden | Socket

socket 在所有的网络操作系统中都是必不可少的，而且在所有的网络应用程序中也是必不可少的。它是网络通信中应用程序对应的进程和网络协议之间的接口之间的接口。本文将介绍其使用和内核实现原理，参考 Linux 内核版本为 v5.4。

Background

socket 在网络系统中的作用是：

socket 位于协议之上，屏蔽了不同网络协议之间的差异；
socket 是网络编程的入口，它提供了大量的系统调用，构成了网络程序的主体；
在 Linux 系统中，socket 属于文件系统的一部分，网络通信可以被看作是对文件的读取，使得我们对网络的控制和对文件的控制一样方便。

结构

无论是用 socket 操作 TCP，还是 UDP，我们首先都要调用 socket 函数。

1

int socket(int domain, int type, int protocol);

socket 函数用于创建一个 socket 的文件描述符，唯一标识一个 socket。我们把它叫作文件描述符，因为在内核中，我们会创建类似文件系统的数据结构，并且后续的操作都有用到它。

在创建 socket 的时候，有三个参数：

family

表示地址族。不是所有的 Socket 都要通过 IP 进行通信，还有其他的通信方式。例如，下面的定义中，domain sockets 就是通过本地文件进行通信的，不需要 IP 地址。只不过，通过 IP 地址只是最常用的模式，所以我们这里着重分析这种模式。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17


/* Supported address families. */
#define AF_UNSPEC	0
#define AF_UNIX		1	/* Unix domain sockets 		*/
#define AF_LOCAL	1	/* POSIX name for AF_UNIX	*/
#define AF_INET		2	/* Internet IP Protocol 	*/
#define AF_AX25		3	/* Amateur Radio AX.25 		*/
#define AF_IPX		4	/* Novell IPX 			*/
#define AF_APPLETALK	5	/* AppleTalk DDP 		*/
#define AF_NETROM	6	/* Amateur Radio NET/ROM 	*/
#define AF_BRIDGE	7	/* Multiprotocol bridge 	*/
#define AF_ATMPVC	8	/* ATM PVCs			*/
#define AF_X25		9	/* Reserved for X.25 project 	*/
#define AF_INET6	10	/* IP version 6			*/
// ...
#define AF_XDP		44	/* XDP sockets			*/

#define AF_MAX		45	/* For now.. */

对应的也有 Protocol Famliy

1
2
3
4
5
6
7
8
9


/* Protocol families, same as address families. */
#define PF_UNSPEC	AF_UNSPEC
#define PF_UNIX		AF_UNIX
#define PF_LOCAL	AF_LOCAL
#define PF_INET		AF_INET
#define PF_AX25		AF_AX25
#define PF_IPX		AF_IPX
#define PF_APPLETALK	AF_APPLETALK
// ...

type

常用的 Socket 类型有三种，分别是 SOCK_STREAM、SOCK_DGRAM 和 SOCK_RAW。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


enum sock_type {
	SOCK_STREAM	= 1,       // stream (connection) socket
	SOCK_DGRAM	= 2,       // datagram (conn.less) socket
	SOCK_RAW	= 3,         // raw socket
	SOCK_RDM	= 4,         // reliably-delivered message
	SOCK_SEQPACKET	= 5,   // sequential packet socket
	SOCK_DCCP	= 6,         // Datagram Congestion Control Protocol socket
	SOCK_PACKET	= 10,      // linux specific way of getting packets at the dev level.
                         // For writing rarp and other similar things on the user level.
};
#define SOCK_MAX (SOCK_PACKET + 1)

SOCK_STREAM 是面向数据流的，协议 IPPROTO_TCP 属于这种类型。
SOCK_DGRAM 是面向数据报的，协议 IPPROTO_UDP 属于这种类型。如果在内核里面看的话，IPPROTO_ICMP 也属于这种类型。
SOCK_RAW 是原始的 IP 包，IPPROTO_IP 属于这种类型。

这一节，我们重点看 SOCK_STREAM 类型和 IPPROTO_TCP 协议。

protocol

第三个参数是 protocol，是协议。协议数目是比较多的，也就是说，多个协议会属于同一种类型。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56


#if __UAPI_DEF_IN_IPPROTO
/* Standard well-defined IP protocols.  */
enum {
  IPPROTO_IP = 0,		/* Dummy protocol for TCP		*/
#define IPPROTO_IP		IPPROTO_IP
  IPPROTO_ICMP = 1,		/* Internet Control Message Protocol	*/
#define IPPROTO_ICMP		IPPROTO_ICMP
  IPPROTO_IGMP = 2,		/* Internet Group Management Protocol	*/
#define IPPROTO_IGMP		IPPROTO_IGMP
  IPPROTO_IPIP = 4,		/* IPIP tunnels (older KA9Q tunnels use 94) */
#define IPPROTO_IPIP		IPPROTO_IPIP
  IPPROTO_TCP = 6,		/* Transmission Control Protocol	*/
#define IPPROTO_TCP		IPPROTO_TCP
  IPPROTO_EGP = 8,		/* Exterior Gateway Protocol		*/
#define IPPROTO_EGP		IPPROTO_EGP
  IPPROTO_PUP = 12,		/* PUP protocol				*/
#define IPPROTO_PUP		IPPROTO_PUP
  IPPROTO_UDP = 17,		/* User Datagram Protocol		*/
#define IPPROTO_UDP		IPPROTO_UDP
  IPPROTO_IDP = 22,		/* XNS IDP protocol			*/
#define IPPROTO_IDP		IPPROTO_IDP
  IPPROTO_TP = 29,		/* SO Transport Protocol Class 4	*/
#define IPPROTO_TP		IPPROTO_TP
  IPPROTO_DCCP = 33,		/* Datagram Congestion Control Protocol */
#define IPPROTO_DCCP		IPPROTO_DCCP
  IPPROTO_IPV6 = 41,		/* IPv6-in-IPv4 tunnelling		*/
#define IPPROTO_IPV6		IPPROTO_IPV6
  IPPROTO_RSVP = 46,		/* RSVP Protocol			*/
#define IPPROTO_RSVP		IPPROTO_RSVP
  IPPROTO_GRE = 47,		/* Cisco GRE tunnels (rfc 1701,1702)	*/
#define IPPROTO_GRE		IPPROTO_GRE
  IPPROTO_ESP = 50,		/* Encapsulation Security Payload protocol */
#define IPPROTO_ESP		IPPROTO_ESP
  IPPROTO_AH = 51,		/* Authentication Header protocol	*/
#define IPPROTO_AH		IPPROTO_AH
  IPPROTO_MTP = 92,		/* Multicast Transport Protocol		*/
#define IPPROTO_MTP		IPPROTO_MTP
  IPPROTO_BEETPH = 94,		/* IP option pseudo header for BEET	*/
#define IPPROTO_BEETPH		IPPROTO_BEETPH
  IPPROTO_ENCAP = 98,		/* Encapsulation Header			*/
#define IPPROTO_ENCAP		IPPROTO_ENCAP
  IPPROTO_PIM = 103,		/* Protocol Independent Multicast	*/
#define IPPROTO_PIM		IPPROTO_PIM
  IPPROTO_COMP = 108,		/* Compression Header Protocol		*/
#define IPPROTO_COMP		IPPROTO_COMP
  IPPROTO_SCTP = 132,		/* Stream Control Transport Protocol	*/
#define IPPROTO_SCTP		IPPROTO_SCTP
  IPPROTO_UDPLITE = 136,	/* UDP-Lite (RFC 3828)			*/
#define IPPROTO_UDPLITE		IPPROTO_UDPLITE
  IPPROTO_MPLS = 137,		/* MPLS in IP (RFC 4023)		*/
#define IPPROTO_MPLS		IPPROTO_MPLS
  IPPROTO_RAW = 255,		/* Raw IP packets			*/
#define IPPROTO_RAW		IPPROTO_RAW
  IPPROTO_MAX
};
#endif

并不是上面的 type 和 protocol 可以随意组合的，如 SOCK_STREAM 不可以跟 IPPROTO_UDP 组合。当 protocol 为 0 时，会自动选择 type 类型对应的默认协议。

为了管理 family、type、protocol 这三个分类层次，内核会创建对应的数据结构。

UDP

TCP

三次握手建立连接

Socket vs Sock

内核中 socket 有两个数据结构，一个是 socket，另一个是 sock：

socket 是 general BSD socket，是应用程序和 4 层协议之间的接口，屏蔽掉了相关的 4 层协议部分
sock 是内核中保存 socket 所需要使用的相关的 4 层协议的信息
socket 和 sock 这两个结构都有保存对方的指针，因此可以很容易的存取对方

同样的，在 socket 和 sock 层分别有 proto_ops 和 proto 两个数据结构：

socket 的 ops 域保存了对于不同的 socket 类型的操作函数
sock 中有一个 sk_common 保存了一个 skc_prot 域保存的是相应的协议簇的操作函数的集合
proto 相当于对 proto_ops 的一层封装,最终会在 proto 中调用 proto_ops

内核调用路径

内核在调用相关操作都是

直接调用 socket 的 ops 域
然后在 ops 域中调用相应的 sock 结构体中的 sock_common 域的 skc_prot 的操作集中的相对应的函数

举个例子,假设现在我们使用 tcp 协议然后调用 bind 方法,内核会先调用 sys_bind 方法（属于 socket 的操作集）:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


int __sys_bind(int fd, struct sockaddr __user *umyaddr, int addrlen)
{
	struct socket *sock;

  /* ... */
				err = sock->ops->bind(sock,
						      (struct sockaddr *)
						      &address, addrlen);
  /* ... */
}

可以看到它调用的是 ops 域的 bind 方法.而这时我们的 ops 域是 inet_stream_ops来看它的 bind 方法（属于 sock 的操作集）:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


// net/ipv4/af_inet.c
int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
{
	struct sock *sk = sock->sk;
	int err;

	/* If the socket has its own bind function then use it. (RAW) */
	if (sk->sk_prot->bind) {
		return sk->sk_prot->bind(sk, uaddr, addr_len);
	}

   /* ... */

	return __inet_bind(sk, uaddr, addr_len, false, true);
}

它最终调用的是 sock 结构的 sk_prot 域(也就是 sock_common 的 skc_prot 域) 的 bind 方法，而此时我们的 skc_prot 的值是 tcp_prot，因此最终会调用 tcp_prot 的 bind 方法。对于 bind 而言，因为没有 bind，所以还是调用的 __inet_bind。

下面就是示意图：

socket

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


// struct socket - general BSD socket
struct socket {
	socket_state		state;           // socket state (%SS_CONNECTED, etc)
	short			type;                  // socket type (%SOCK_STREAM, etc)
	unsigned long		flags;           // socket flags (%SOCK_NOSPACE, etc)
	struct socket_wq	*wq;           // wait queue for several uses
	struct file		*file;             // File back pointer for gc
	struct sock		*sk;               // internal networking protocol agnostic socket representation
	const struct proto_ops	*ops;    // protocol specific socket operations
};

proto_ops

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23


struct proto_ops {
	int		family;
	struct module	*owner;
	int		(*release)   (struct socket *sock);
	int		(*bind)	     (struct socket *sock, struct sockaddr *myaddr, int sockaddr_len);
	int		(*connect)   (struct socket *sock, struct sockaddr *vaddr, int sockaddr_len, int flags);
	int		(*socketpair)(struct socket *sock1, struct socket *sock2);
	int		(*accept)    (struct socket *sock, struct socket *newsock, int flags, bool kern);
	int		(*getname)   (struct socket *sock, struct sockaddr *addr, int peer);
	__poll_t	(*poll)	     (struct file *file, struct socket *sock, struct poll_table_struct *wait);
	int		(*ioctl)     (struct socket *sock, unsigned int cmd, unsigned long arg);
	int		(*listen)    (struct socket *sock, int len);
	int		(*shutdown)  (struct socket *sock, int flags);
	int		(*setsockopt)(struct socket *sock, int level, int optname, char __user *optval, unsigned int optlen);
	int		(*getsockopt)(struct socket *sock, int level, int optname, char __user *optval, int __user *optlen);
	int		(*sendmsg)   (struct socket *sock, struct msghdr *m, size_t total_len);
	int		(*recvmsg)   (struct socket *sock, struct msghdr *m, size_t total_len, int flags);
	int		(*mmap)	     (struct file *file, struct socket *sock, struct vm_area_struct * vma);
	ssize_t		(*sendpage)  (struct socket *sock, struct page *page, int offset, size_t size, int flags);
	ssize_t 	(*splice_read)(struct socket *sock,  loff_t *ppos, struct pipe_inode_info *pipe, size_t len, unsigned int flags);
	int		(*set_peek_off)(struct sock *sk, int val);
	int		(*peek_len)(struct socket *sock);
};

socket state

1
2
3
4
5
6
7


typedef enum {
	SS_FREE = 0,			/* not allocated		*/
	SS_UNCONNECTED,			/* unconnected to any socket	*/
	SS_CONNECTING,			/* in process of connecting	*/
	SS_CONNECTED,			/* connected to socket		*/
	SS_DISCONNECTING		/* in process of disconnecting	*/
} socket_state;

sock

sock_common

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95


/**
 *	struct sock_common - minimal network layer representation of sockets
 *
 *	This is the minimal network layer representation of sockets, the header
 *	for struct sock and struct inet_timewait_sock.
 */
struct sock_common {
	/* skc_daddr and skc_rcv_saddr must be grouped on a 8 bytes aligned
	 * address on 64bit arches : cf INET_MATCH()
	 */
	union {
		__addrpair	skc_addrpair;
		struct {
			__be32	skc_daddr;            // Foreign IPv4 addr
			__be32	skc_rcv_saddr;        // Bound local IPv4 addr
		};
	};
	union  {
		unsigned int	skc_hash;         // hash value used with various protocol lookup tables
		__u16		skc_u16hashes[2];       //  two u16 hash values used by UDP lookup tables
	};
	/* skc_dport && skc_num must be grouped as well */
	union {
		__portpair	skc_portpair;
		struct {
			__be16	skc_dport;            // placeholder for inet_dport/tw_dport
			__u16	skc_num;                // placeholder for inet_num/tw_num
		};
	};

	unsigned short		skc_family;       // network address family
	volatile unsigned char	skc_state;  // Connection state
	unsigned char		skc_reuse:4;        // %SO_REUSEADDR setting
	unsigned char		skc_reuseport:1;    // %SO_REUSEPORT setting
	unsigned char		skc_ipv6only:1;
	unsigned char		skc_net_refcnt:1;
	int			skc_bound_dev_if;           // bound device index if != 0
	union {
		struct hlist_node	skc_bind_node;     // bind hash linkage for various protocol lookup tables
		struct hlist_node	skc_portaddr_node; // second hash linkage for UDP/UDP-Lite protocol
	};
	struct proto		*skc_prot;          // protocol handlers inside a network family
	possible_net_t		skc_net;          // reference to the network namespace of this socket

#if IS_ENABLED(CONFIG_IPV6)
	struct in6_addr		skc_v6_daddr;
	struct in6_addr		skc_v6_rcv_saddr;
#endif

	atomic64_t		skc_cookie;

	/* following fields are padding to force
	 * offset(struct sock, sk_refcnt) == 128 on 64bit arches
	 * assuming IPV6 is enabled. We use this padding differently
	 * for different kind of 'sockets'
	 */
	union {
		unsigned long	skc_flags;    /* place holder for sk_flags
                                 *		%SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
                                 *		%SO_OOBINLINE settings, %SO_TIMESTAMPING settings
                                 */
		struct sock	*skc_listener; /* request_sock */
		struct inet_timewait_death_row *skc_tw_dr; /* inet_timewait_sock */
	};
	/*
	 * fields between dontcopy_begin/dontcopy_end
	 * are not copied in sock_copy()
	 */
	/* private: */
	int			skc_dontcopy_begin[0];
	/* public: */
	union {
		struct hlist_node	skc_node;              //  main hash linkage for various protocol lookup tables
		struct hlist_nulls_node skc_nulls_node;  //  main hash linkage for TCP/UDP/UDP-Lite protocol
	};
	unsigned short		skc_tx_queue_mapping;    // tx queue number for this connection
#ifdef CONFIG_XPS
	unsigned short		skc_rx_queue_mapping;    // rx queue number for this connection
#endif
	union {
		int		skc_incoming_cpu;  // record/match cpu processing incoming packets
		u32		skc_rcv_wnd;
		u32		skc_tw_rcv_nxt; /* struct tcp_timewait_sock  */
	};

	refcount_t		skc_refcnt;  // reference count
	/* private: */
	int                     skc_dontcopy_end[0];
	union {
		u32		skc_rxhash;
		u32		skc_window_clamp;
		u32		skc_tw_snd_nxt; /* struct tcp_timewait_sock */
	};
	/* public: */
};

sock

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126


struct sock {
	/*
	 * Now struct inet_timewait_sock also uses sock_common, so please just
	 * don't add nothing before this first member (__sk_common) --acme
	 */
	struct sock_common	__sk_common;           // shared layout with inet_timewait_sock

	socket_lock_t		sk_lock;                   // synchronizer
	atomic_t		sk_drops;                      // raw/udp drops counter
	int			sk_rcvlowat;                       // %SO_RCVLOWAT setting
	struct sk_buff_head	sk_error_queue;        // rarely used
	struct sk_buff_head	sk_receive_queue;      // incoming packets
	/*
	 * The backlog queue is special, it is always used with
	 * the per-socket spinlock held and requires low latency
	 * access. Therefore we special case it's implementation.
	 * Note : rmem_alloc is in this structure to fill a hole
	 * on 64bit arches, not because its logically part of
	 * backlog.
	 */
	struct {
		atomic_t	rmem_alloc;
		int		len;
		struct sk_buff	*head;
		struct sk_buff	*tail;
	} sk_backlog;              // always used with the per-socket spinlock held
#define sk_rmem_alloc sk_backlog.rmem_alloc

	int			sk_forward_alloc;                // space allocated forward
#ifdef CONFIG_NET_RX_BUSY_POLL
	unsigned int		sk_ll_usec;              // usecs to busypoll when there is no data
	/* ===== mostly read cache line ===== */
	unsigned int		sk_napi_id;              // id of the last napi context to receive data for sk
#endif
	int			sk_rcvbuf;                       // size of receive buffer in bytes

	struct sk_filter __rcu	*sk_filter;      // socket filtering instructions
	union {
		struct socket_wq __rcu	*sk_wq;        // sock wait queue and async head
		struct socket_wq	*sk_wq_raw;
	};

	struct dst_entry	*sk_rx_dst;            // receive input route used by early demux
	struct dst_entry __rcu	*sk_dst_cache;   // destination cache
	atomic_t		sk_omem_alloc;               // "o" is "option" or "other"
	int			sk_sndbuf;                       // size of send buffer in bytes

	/* ===== cache line for TX ===== */
	int			sk_wmem_queued;               // persistent queue size
	refcount_t		sk_wmem_alloc;          // transmit queue bytes committed
	unsigned long		sk_tsq_flags;         // TCP Small Queues flags
	union {
		struct sk_buff	*sk_send_head;      // front of stuff to transmit
		struct rb_root	tcp_rtx_queue;
	};
	struct sk_buff_head	sk_write_queue;   // Packet sending queue
	__s32			sk_peek_off;                // current peek_offset value
	int			sk_write_pending;             // a write to stream socket waits to start
	__u32			sk_dst_pending_confirm;     // need to confirm neighbour
	u32			sk_pacing_status;             // see enum sk_pacing, Pacing status (requested, handled by sch_fq)
	long			sk_sndtimeo;                // %SO_SNDTIMEO setting
	struct timer_list	sk_timer;           // sock cleanup timer
	__u32			sk_priority;                // %SO_PRIORITY setting
	__u32			sk_mark;                    // generic packet mark
	u32			sk_pacing_rate;               // bytes per second,  Pacing rate (if supported by transport/packet scheduler)
	u32			sk_max_pacing_rate;           // Maximum pacing rate (%SO_MAX_PACING_RATE)
	struct page_frag	sk_frag;            // cached page frag
	netdev_features_t	sk_route_caps;      // route capabilities (e.g. %NETIF_F_TSO)
	netdev_features_t	sk_route_nocaps;    // forbidden route capabilities (e.g NETIF_F_GSO_MASK)
	netdev_features_t	sk_route_forced_caps;
	int			sk_gso_type;                  // GSO type (e.g. %SKB_GSO_TCPV4)
	unsigned int		sk_gso_max_size;      // Maximum GSO segment size to build
	gfp_t			sk_allocation;              // allocation mode
	__u32			sk_txhash;                  // computed flow hash for use on transmit

	/*
	 * Because of non atomicity rules, all
	 * changes are protected by socket lock.
	 */
	unsigned int		__sk_flags_offset[0]; // empty field used to determine location of bitfield

	unsigned int		sk_padding : 1,       // unused element for alignment
				sk_kern_sock : 1,               // True if sock is using kernel lock classes
				sk_no_check_tx : 1,             //  %SO_NO_CHECK setting, set checksum in TX packets
				sk_no_check_rx : 1,             // allow zero checksum in RX packets
				sk_userlocks : 4,               // %SO_SNDBUF and %SO_RCVBUF settings
				sk_protocol  : 8,               // which protocol this socket belongs in this network family
				sk_type      : 16;              // socket type (%SOCK_STREAM, etc)

	u16			sk_gso_max_segs;              // Maximum number of GSO segments
	u8			sk_pacing_shift;              // scaling factor for TCP Small Queues
	unsigned long	        sk_lingertime;  // %SO_LINGER l_linger setting
	struct proto		*sk_prot_creator;     // sk_prot of original sock creator (see ipv6_setsockopt, IPV6_ADDRFORM for instance)
	rwlock_t		sk_callback_lock;         // used with the callbacks in the end of this struct
	int			sk_err,                       // last error
				sk_err_soft;                    // errors that don't cause failure but are the cause of a persistent failure not just 'timed out'
	u32			sk_ack_backlog;               // current listen backlog
	u32			sk_max_ack_backlog;           // listen backlog set in listen()
	kuid_t			sk_uid;                   // user id of owner
	struct pid		*sk_peer_pid;           // &struct pid for this socket's peer
	const struct cred	*sk_peer_cred;      // %SO_PEERCRED setting
	long			sk_rcvtimeo;                // %SO_RCVTIMEO setting
	ktime_t			sk_stamp;                 // time stamp of last packet received
	u16			sk_tsflags;                   // SO_TIMESTAMPING socket options
	u8			sk_shutdown;                  // mask of %SEND_SHUTDOWN and/or %RCV_SHUTDOWN
	u32			sk_tskey;                     // counter to disambiguate concurrent tstamp requests
	atomic_t		sk_zckey;                 // counter to order MSG_ZEROCOPY notifications

	u8			sk_clockid;                   // clockid used by time-based scheduling (SO_TXTIME)
	u8			sk_txtime_deadline_mode : 1,  // set deadline mode for SO_TXTIME
				sk_txtime_report_errors : 1,
				sk_txtime_unused : 6;           // unused txtime flags

	struct socket		*sk_socket;           // Identd and reporting IO signals
	void			*sk_user_data;              // RPC layer private data
	struct sock_cgroup_data	sk_cgrp_data;            // cgroup data for this cgroup
	struct mem_cgroup	*sk_memcg;                     // this socket's memory cgroup association
	void			(*sk_state_change)(struct sock *sk);   // callback to indicate change in the state of the sock
	void			(*sk_data_ready)(struct sock *sk);     // callback to indicate there is data to be processed
	void			(*sk_write_space)(struct sock *sk);    // callback to indicate there is bf sending space available
	void			(*sk_error_report)(struct sock *sk);   // callback to indicate errors (e.g. %MSG_ERRQUEUE)
	int			(*sk_backlog_rcv)(struct sock *sk, struct sk_buff *skb);  // callback to process the backlog
	void                    (*sk_destruct)(struct sock *sk);          // called at sock freeing time, i.e. when all refcnt == 0
	struct sock_reuseport __rcu	*sk_reuseport_cb; // reuseport group container
	struct rcu_head		sk_rcu;                     // used during RCU grace period
};

proto

很重要的是其连接着对应的 proto 操作集，在操作集中维护着对应的 hashinfo，hashinfo 存储着 sock、端口等重要信息

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82


/* Networking protocol blocks we attach to sockets.
 * socket layer -> transport layer interface
 */
struct proto {
	void			(*close)(struct sock *sk, long timeout);
	int			  (*pre_connect)(struct sock *sk, struct sockaddr *uaddr, int addr_len);
	int			  (*connect)(struct sock *sk, struct sockaddr *uaddr,	int addr_len);
	int			  (*disconnect)(struct sock *sk, int flags);

	struct sock *		(*accept)(struct sock *sk, int flags, int *err, bool kern);

	int			  (*ioctl)(struct sock *sk, int cmd, unsigned long arg);
	int			  (*init)(struct sock *sk);
	void			(*destroy)(struct sock *sk);
	void			(*shutdown)(struct sock *sk, int how);
	int			  (*setsockopt)(struct sock *sk, int level, int optname, char __user *optval, unsigned int optlen);
	int			  (*getsockopt)(struct sock *sk, int level, int optname, char __user *optval,	int __user *option);
	void			(*keepalive)(struct sock *sk, int valbool);
	int			  (*sendmsg)(struct sock *sk, struct msghdr *msg, size_t len);
	int			  (*recvmsg)(struct sock *sk, struct msghdr *msg, size_t len, int noblock, int flags, int *addr_len);
	int			  (*sendpage)(struct sock *sk, struct page *page, int offset, size_t size, int flags);
	int			  (*bind)(struct sock *sk, struct sockaddr *uaddr, int addr_len);

	int			  (*backlog_rcv) (struct sock *sk, struct sk_buff *skb);

	void		  (*release_cb)(struct sock *sk);

	/* Keeping track of sk's, looking them up, and port selection methods. */
	int			  (*hash)(struct sock *sk);
	void			(*unhash)(struct sock *sk);
	void			(*rehash)(struct sock *sk);
	int			  (*get_port)(struct sock *sk, unsigned short snum);

	bool			(*stream_memory_free)(const struct sock *sk);
	bool			(*stream_memory_read)(const struct sock *sk);
	/* Memory pressure */
	void			(*enter_memory_pressure)(struct sock *sk);
	void			(*leave_memory_pressure)(struct sock *sk);
	atomic_long_t		*memory_allocated;	/* Current allocated memory. */
	struct percpu_counter	*sockets_allocated;	/* Current number of sockets. */
	/*
	 * Pressure flag: try to collapse.
	 * Technical note: it is used by multiple contexts non atomically.
	 * All the __sk_mem_schedule() is of this nature: accounting
	 * is strict, actions are advisory and have some latency.
	 */
	unsigned long		*memory_pressure;
	long			*sysctl_mem;

	int			*sysctl_wmem;
	int			*sysctl_rmem;
	u32			sysctl_wmem_offset;
	u32			sysctl_rmem_offset;

	int			max_header;
	bool			no_autobind;

	struct kmem_cache	*slab;
	unsigned int		obj_size;
	slab_flags_t		slab_flags;
	unsigned int		useroffset;	/* Usercopy region offset */
	unsigned int		usersize;	/* Usercopy region size */

	struct percpu_counter	*orphan_count;

	struct request_sock_ops	*rsk_prot;
	struct timewait_sock_ops *twsk_prot;

	union {
		struct inet_hashinfo	*hashinfo;
		struct udp_table	*udp_table;
		struct raw_hashinfo	*raw_hash;
		struct smc_hashinfo	*smc_hash;
	} h;

	struct module		*owner;

	char			name[32];

	struct list_head	node;
	int			(*diag_destroy)(struct sock *sk, int err);
} __randomize_layout;

inet_sock

inet_sock 是 INET 域的 socket 表示，是对 struct sock 的一个扩展，提供 INET 域的一些属性，如 TTL，组播列表，IP 地址，端口等；

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27


/** struct inet_sock - representation of INET sockets
 *
 * @sk - ancestor class
 * @pinet6 - pointer to IPv6 control block
 * @inet_daddr - Foreign IPv4 addr
 * @inet_rcv_saddr - Bound local IPv4 addr
 * @inet_dport - Destination port
 * @inet_num - Local port
 * @inet_saddr - Sending source
 * @uc_ttl - Unicast TTL
 * @inet_sport - Source port
 */
struct inet_sock {
	struct sock		sk;
#define inet_daddr		sk.__sk_common.skc_daddr
#define inet_rcv_saddr		sk.__sk_common.skc_rcv_saddr
#define inet_dport		sk.__sk_common.skc_dport
#define inet_num		sk.__sk_common.skc_num

	__be32			inet_saddr;
	__s16			uc_ttl;
	__u16			cmsg_flags;
	__be16			inet_sport;
	__u16			inet_id;

  /* ... */
};

raw_sock

它是 RAW 协议的一个 socket 表示，是对 struct inet_sock 的扩展，它要处理与 ICMP 相关的内容；

1
2
3
4
5
6


struct raw_sock {
	/* inet_sock has to be the first member */
	struct inet_sock   inet;
	struct icmp_filter filter;
	u32		   ipmr_table;
};

udp_sock

它是 UDP 协议的 socket 表示，是对 struct inet_sock 的扩展；

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34


struct udp_sock {
	/* inet_sock has to be the first member */
	struct inet_sock inet;
#define udp_port_hash		inet.sk.__sk_common.skc_u16hashes[0]
#define udp_portaddr_hash	inet.sk.__sk_common.skc_u16hashes[1]
#define udp_portaddr_node	inet.sk.__sk_common.skc_portaddr_node
	int		 pending;	/* Any pending frames ? */
	unsigned int	 corkflag;	/* Cork is required */
	__u8		 encap_type;	/* Is this an Encapsulation socket? */
	unsigned char	 no_check6_tx:1,/* Send zero UDP6 checksums on TX? */
			 no_check6_rx:1;/* Allow zero UDP6 checksums on RX? */
	/*
	 * Following member retains the information to create a UDP header
	 * when the socket is uncorked.
	 */
	__u16		 len;		/* total length of pending frames */
	__u16		 gso_size;

  /*
	 * For encapsulation sockets.
	 */
	int (*encap_rcv)(struct sock *sk, struct sk_buff *skb);
	void (*encap_destroy)(struct sock *sk);

	/* GRO functions for UDP socket */
	struct sk_buff *	(*gro_receive)(struct sock *sk, struct list_head *head, struct sk_buff *skb);
	int			(*gro_complete)(struct sock *sk,	struct sk_buff *skb,int nhoff);

	/* udp_recvmsg try to use this before splicing sk_receive_queue */
	struct sk_buff_head	reader_queue ____cacheline_aligned_in_smp;

	/* This field is dirtied by udp_recvmsg() */
	int		forward_deficit;
};

inet_connection_sock

inet_connection_sock 是所有 面向连接 的 socket 表示，是对 struct inet_sock 的扩展；它的第一个域就是 inet_sock。主要维护了与其绑定的端口结构，以及监听等状态过程中的信息。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55


// inet_connection_sock - INET connection oriented sock
struct inet_connection_sock {
	/* inet_sock has to be the first member! */
	struct inet_sock	  icsk_inet;
	struct request_sock_queue icsk_accept_queue;           // FIFO of established children
	struct inet_bind_bucket	  *icsk_bind_hash;             // Bind node
	unsigned long		  icsk_timeout;                        // Timeout
 	struct timer_list	  icsk_retransmit_timer;             // Resend (no ack)
 	struct timer_list	  icsk_delack_timer;
	__u32			  icsk_rto;                                  // Retransmit timeout
	__u32			  icsk_pmtu_cookie;                          // Last pmtu seen by socket
	const struct tcp_congestion_ops *icsk_ca_ops;          // Pluggable congestion control hook
	const struct inet_connection_sock_af_ops *icsk_af_ops; //  Operations which are AF_INET{4,6} specific
	const struct tcp_ulp_ops  *icsk_ulp_ops;               // Pluggable ULP control hook
	void			  *icsk_ulp_data;                            // ULP private data
	void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq);  //  Clean acked data hook
	struct hlist_node         icsk_listen_portaddr_node;       // hash to the portaddr listener hashtable
	unsigned int		  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
	__u8			  icsk_ca_state:6,              // Congestion control state
				  icsk_ca_setsockopt:1,
				  icsk_ca_dst_locked:1;
	__u8			  icsk_retransmits;             // Number of unrecovered [RTO] timeouts
	__u8			  icsk_pending;                 // Scheduled timer event
	__u8			  icsk_backoff;                 // Backoff
	__u8			  icsk_syn_retries;             // Number of allowed SYN (or equivalent) retries
	__u8			  icsk_probes_out;              // unanswered 0 window probes
	__u16			  icsk_ext_hdr_len;             // Network protocol overhead (IP/IPv6 options)
	struct {
		__u8		  pending;	 /* ACK is pending			   */
		__u8		  quick;	 /* Scheduled number of quick acks	   */
		__u8		  pingpong;	 /* The session is interactive		   */
		__u8		  blocked;	 /* Delayed ACK was blocked by socket lock */
		__u32		  ato;		 /* Predicted tick of soft clock	   */
		unsigned long	  timeout;	 /* Currently scheduled timeout		   */
		__u32		  lrcvtime;	 /* timestamp of last received data packet */
		__u16		  last_seg_size; /* Size of last incoming segment	   */
		__u16		  rcv_mss;	 /* MSS used for delayed ACK decisions	   */
	} icsk_ack;                               // Delayed ACK control data
	struct {
		int		  enabled;

		/* Range of MTUs to search */
		int		  search_high;
		int		  search_low;

		/* Information on the current probe. */
		int		  probe_size;

		u32		  probe_timestamp;
	} icsk_mtup;                             // MTU probing control data
	u32			  icsk_user_timeout;

	u64			  icsk_ca_priv[88 / sizeof(u64)];
#define ICSK_CA_PRIV_SIZE      (11 * sizeof(u64))
};

tcp_sock

tcp_sock 是 TCP 协议的 socket 表示，是对 struct inet_connection_sock 的扩展，主要增加滑动窗口，拥塞控制一些 TCP 专用属性；它的第一个域就是 inet_connection_sock 。

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254


struct tcp_sock {
	/* inet_connection_sock has to be the first member of tcp_sock */
	struct inet_connection_sock	inet_conn;
	u16	tcp_header_len;	/* Bytes of tcp header to send		*/
	u16	gso_segs;	/* Max number of segs per GSO packet	*/

/*
 *	Header prediction flags
 *	0x5?10 << 16 + snd_wnd in net byte order
 */
	__be32	pred_flags;

/*
 *	RFC793 variables by their proper names. This means you can
 *	read the code and the spec side by side (and laugh ...)
 *	See RFC793 and RFC1122. The RFC writes these in capitals.
 */
	u64	bytes_received;	/* RFC4898 tcpEStatsAppHCThruOctetsReceived
				 * sum(delta(rcv_nxt)), or how many bytes
				 * were acked.
				 */
	u32	segs_in;	/* RFC4898 tcpEStatsPerfSegsIn
				 * total number of segments in.
				 */
	u32	data_segs_in;	/* RFC4898 tcpEStatsPerfDataSegsIn
				 * total number of data segments in.
				 */
 	u32	rcv_nxt;	/* What we want to receive next 	*/
	u32	copied_seq;	/* Head of yet unread data		*/
	u32	rcv_wup;	/* rcv_nxt on last window update sent	*/
 	u32	snd_nxt;	/* Next sequence we send		*/
	u32	segs_out;	/* RFC4898 tcpEStatsPerfSegsOut
				 * The total number of segments sent.
				 */
	u32	data_segs_out;	/* RFC4898 tcpEStatsPerfDataSegsOut
				 * total number of data segments sent.
				 */
	u64	bytes_sent;	/* RFC4898 tcpEStatsPerfHCDataOctetsOut
				 * total number of data bytes sent.
				 */
	u64	bytes_acked;	/* RFC4898 tcpEStatsAppHCThruOctetsAcked
				 * sum(delta(snd_una)), or how many bytes
				 * were acked.
				 */
	u32	dsack_dups;	/* RFC4898 tcpEStatsStackDSACKDups
				 * total number of DSACK blocks received
				 */
 	u32	snd_una;	/* First byte we want an ack for	*/
 	u32	snd_sml;	/* Last byte of the most recently transmitted small packet */
	u32	rcv_tstamp;	/* timestamp of last received ACK (for keepalives) */
	u32	lsndtime;	/* timestamp of last sent data packet (for restart window) */
	u32	last_oow_ack_time;  /* timestamp of last out-of-window ACK */

	u32	tsoffset;	/* timestamp offset */

	struct list_head tsq_node; /* anchor in tsq_tasklet.head list */
	struct list_head tsorted_sent_queue; /* time-sorted sent but un-SACKed skbs */

	u32	snd_wl1;	/* Sequence for window update		*/
	u32	snd_wnd;	/* The window we expect to receive	*/
	u32	max_window;	/* Maximal window ever seen from peer	*/
	u32	mss_cache;	/* Cached effective mss, not including SACKS */

	u32	window_clamp;	/* Maximal window to advertise		*/
	u32	rcv_ssthresh;	/* Current window clamp			*/

	/* Information of the most recently (s)acked skb */
	struct tcp_rack {
		u64 mstamp; /* (Re)sent time of the skb */
		u32 rtt_us;  /* Associated RTT */
		u32 end_seq; /* Ending TCP sequence of the skb */
		u32 last_delivered; /* tp->delivered at last reo_wnd adj */
		u8 reo_wnd_steps;   /* Allowed reordering window */
#define TCP_RACK_RECOVERY_THRESH 16
		u8 reo_wnd_persist:5, /* No. of recovery since last adj */
		   dsack_seen:1, /* Whether DSACK seen after last adj */
		   advanced:1;	 /* mstamp advanced since last lost marking */
	} rack;
	u16	advmss;		/* Advertised MSS			*/
	u8	compressed_ack;
	u32	chrono_start;	/* Start time in jiffies of a TCP chrono */
	u32	chrono_stat[3];	/* Time in jiffies for chrono_stat stats */
	u8	chrono_type:2,	/* current chronograph type */
		rate_app_limited:1,  /* rate_{delivered,interval_us} limited? */
		fastopen_connect:1, /* FASTOPEN_CONNECT sockopt */
		fastopen_no_cookie:1, /* Allow send/recv SYN+data without a cookie */
		is_sack_reneg:1,    /* in recovery from loss with SACK reneg? */
		unused:2;
	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
		thin_lto    : 1,/* Use linear timeouts for thin streams */
		recvmsg_inq : 1,/* Indicate # of bytes in queue upon recvmsg */
		repair      : 1,
		frto        : 1;/* F-RTO (RFC5682) activated in CA_Loss */
	u8	repair_queue;
	u8	syn_data:1,	/* SYN includes data */
		syn_fastopen:1,	/* SYN includes Fast Open option */
		syn_fastopen_exp:1,/* SYN includes Fast Open exp. option */
		syn_fastopen_ch:1, /* Active TFO re-enabling probe */
		syn_data_acked:1,/* data in SYN is acked by SYN-ACK */
		save_syn:1,	/* Save headers of SYN packet */
		is_cwnd_limited:1,/* forward progress limited by snd_cwnd? */
		syn_smc:1;	/* SYN includes SMC */
	u32	tlp_high_seq;	/* snd_nxt at the time of TLP retransmit. */

/* RTT measurement */
	u64	tcp_mstamp;	/* most recent packet received/sent */
	u32	srtt_us;	/* smoothed round trip time << 3 in usecs */
	u32	mdev_us;	/* medium deviation			*/
	u32	mdev_max_us;	/* maximal mdev for the last rtt period	*/
	u32	rttvar_us;	/* smoothed mdev_max			*/
	u32	rtt_seq;	/* sequence number to update rttvar	*/
	struct  minmax rtt_min;

	u32	packets_out;	/* Packets which are "in flight"	*/
	u32	retrans_out;	/* Retransmitted packets out		*/
	u32	max_packets_out;  /* max packets_out in last window */
	u32	max_packets_seq;  /* right edge of max_packets_out flight */

	u16	urg_data;	/* Saved octet of OOB data and control flags */
	u8	ecn_flags;	/* ECN status bits.			*/
	u8	keepalive_probes; /* num of allowed keep alive probes	*/
	u32	reordering;	/* Packet reordering metric.		*/
	u32	reord_seen;	/* number of data packet reordering events */
	u32	snd_up;		/* Urgent pointer		*/

/*
 *      Options received (usually on last packet, some only on SYN packets).
 */
	struct tcp_options_received rx_opt;

/*
 *	Slow start and congestion control (see also Nagle, and Karn & Partridge)
 */
 	u32	snd_ssthresh;	/* Slow start size threshold		*/
 	u32	snd_cwnd;	/* Sending congestion window		*/
	u32	snd_cwnd_cnt;	/* Linear increase counter		*/
	u32	snd_cwnd_clamp; /* Do not allow snd_cwnd to grow above this */
	u32	snd_cwnd_used;
	u32	snd_cwnd_stamp;
	u32	prior_cwnd;	/* cwnd right before starting loss recovery */
	u32	prr_delivered;	/* Number of newly delivered packets to
				 * receiver in Recovery. */
	u32	prr_out;	/* Total number of pkts sent during Recovery. */
	u32	delivered;	/* Total data packets delivered incl. rexmits */
	u32	delivered_ce;	/* Like the above but only ECE marked packets */
	u32	lost;		/* Total data packets lost incl. rexmits */
	u32	app_limited;	/* limited until "delivered" reaches this val */
	u64	first_tx_mstamp;  /* start of window send phase */
	u64	delivered_mstamp; /* time we reached "delivered" */
	u32	rate_delivered;    /* saved rate sample: packets delivered */
	u32	rate_interval_us;  /* saved rate sample: time elapsed */

 	u32	rcv_wnd;	/* Current receiver window		*/
	u32	write_seq;	/* Tail(+1) of data held in tcp send buffer */
	u32	notsent_lowat;	/* TCP_NOTSENT_LOWAT */
	u32	pushed_seq;	/* Last pushed seq, required to talk to windows */
	u32	lost_out;	/* Lost packets			*/
	u32	sacked_out;	/* SACK'd packets			*/

	struct hrtimer	pacing_timer;
	struct hrtimer	compressed_ack_timer;

	/* from STCP, retrans queue hinting */
	struct sk_buff* lost_skb_hint;
	struct sk_buff *retransmit_skb_hint;

	/* OOO segments go in this rbtree. Socket lock must be held. */
	struct rb_root	out_of_order_queue;
	struct sk_buff	*ooo_last_skb; /* cache rb_last(out_of_order_queue) */

	/* SACKs data, these 2 need to be together (see tcp_options_write) */
	struct tcp_sack_block duplicate_sack[1]; /* D-SACK block */
	struct tcp_sack_block selective_acks[4]; /* The SACKS themselves*/

	struct tcp_sack_block recv_sack_cache[4];

	struct sk_buff *highest_sack;   /* skb just after the highest
					 * skb with SACKed bit set
					 * (validity guaranteed only if
					 * sacked_out > 0)
					 */

	int     lost_cnt_hint;

	u32	prior_ssthresh; /* ssthresh saved at recovery start	*/
	u32	high_seq;	/* snd_nxt at onset of congestion	*/

	u32	retrans_stamp;	/* Timestamp of the last retransmit,
				 * also used in SYN-SENT to remember stamp of
				 * the first SYN. */
	u32	undo_marker;	/* snd_una upon a new recovery episode. */
	int	undo_retrans;	/* number of undoable retransmissions. */
	u64	bytes_retrans;	/* RFC4898 tcpEStatsPerfOctetsRetrans
				 * Total data bytes retransmitted
				 */
	u32	total_retrans;	/* Total retransmits for entire connection */

	u32	urg_seq;	/* Seq of received urgent pointer */
	unsigned int		keepalive_time;	  /* time before keep alive takes place */
	unsigned int		keepalive_intvl;  /* time interval between keep alive probes */

	int			linger2;


/* Sock_ops bpf program related variables */
#ifdef CONFIG_BPF
	u8	bpf_sock_ops_cb_flags;  /* Control calling BPF programs
					 * values defined in uapi/linux/tcp.h
					 */
#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
#else
#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
#endif

/* Receiver side RTT estimation */
	u32 rcv_rtt_last_tsecr;
	struct {
		u32	rtt_us;
		u32	seq;
		u64	time;
	} rcv_rtt_est;

/* Receiver queue space */
	struct {
		u32	space;
		u32	seq;
		u64	time;
	} rcvq_space;

/* TCP-specific MTU probe information. */
	struct {
		u32		  probe_seq_start;
		u32		  probe_seq_end;
	} mtu_probe;
	u32	mtu_info; /* We received an ICMP_FRAG_NEEDED / ICMPV6_PKT_TOOBIG
			   * while socket was owned by user.
			   */

#ifdef CONFIG_TCP_MD5SIG
/* TCP AF-Specific parts; only used by MD5 Signature support so far */
	const struct tcp_sock_af_ops	*af_specific;

/* TCP MD5 Signature Option information */
	struct tcp_md5sig_info	__rcu *md5sig_info;
#endif

/* TCP fastopen related information */
	struct tcp_fastopen_request *fastopen_req;
	/* fastopen_rsk points to request_sock that resulted in this big
	 * socket. Used to retransmit SYNACKs etc.
	 */
	struct request_sock *fastopen_rsk;
	u32	*saved_syn;
};

Socket

socket 系统调用

我们从 socket 系统调用开始：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20


SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
{
	return __sys_socket(family, type, protocol);
}

int __sys_socket(int family, int type, int protocol)
{
  int retval;
  struct socket *sock;
  int flags;

  if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
    flags = (flags & ~SOCK_NONBLOCK) | O_NONBLOCK;

	retval = sock_create(family, type, protocol, &sock);
	if (retval < 0)
		return retval;

	return sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
}

这里面的代码比较容易看懂，socket 系统调用会调用 sock_create 创建一个 struct socket 结构，然后通过 sock_map_fd 和文件描述符对应起来。

sock_create

接下来，我们打开 sock_create 函数看一下，可以看到它调用了 __sock_create，添加了 network namespace 和是否是 kern 的参数。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59


#define current get_current()  // return current task_struct

int sock_create(int family, int type, int protocol, struct socket **res)
{
	return __sock_create(current->nsproxy->net_ns, family, type, protocol, res, 0);
}

int sock_create_kern(struct net *net, int family, int type, int protocol, struct socket **res)
{
	return __sock_create(net, family, type, protocol, res, 1);
}

int __sock_create(struct net *net, int family, int type, int protocol,
			 struct socket **res, int kern)

{
	int err;
	struct socket *sock;
	const struct net_proto_family *pf;

  err = security_socket_create(family, type, protocol, kern);

	sock = sock_alloc();
  sock->type = type;

	rcu_read_lock();
	pf = rcu_dereference(net_families[family]);
  /*
	 * We will call the ->create function, that possibly is in a loadable
	 * module, so we have to bump that loadable module refcnt first.
	 */
	if (!try_module_get(pf->owner))
		goto out_release;

  /* Now protected by module ref count */
	rcu_read_unlock();

  err = pf->create(net, sock, protocol, kern);

  /*
	 * Now to bump the refcnt of the [loadable] module that owns this
	 * socket at sock_release time we decrement its refcnt.
	 */
	if (!try_module_get(sock->ops->owner))
		goto out_module_busy;

  /*
	 * Now that we're done with the ->create function, the [loadable]
	 * module can have its refcnt decremented
	 */
	module_put(pf->owner);
	err = security_socket_post_create(sock, family, type, protocol, kern);
	if (err)
		goto out_sock_release;
	*res = sock;

  return 0;

}

这里主要做了两件事：

调用 sock_alloc 分配了一个 struct socket 结构，并创建了该 socket 对应的 inode
根据 family 参数拿到了对应的 net_proto_family，并调用 pf->create

net_proto_family 初始化

这里解释下 net_proto_family：linux 网络子系统有一个 net_families 数组，我们能够以 family 参数为下标，找到对应的 struct net_proto_family，每个 net_proto_family 都有一个 create 函数：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


// include/linux/net.h
struct net_proto_family {
	int		family;
	int		(*create)(struct net *net, struct socket *sock,
				  int protocol, int kern);
	struct module	*owner;
};

// net/socket.c
static const struct net_proto_family __rcu *net_families[NPROTO] __read_mostly;

每一种地址族都有自己的 net_proto_family，IP 地址族的 net_proto_family 定义如下，里面最重要的 create 函数指向 inet_create。

1
2
3
4
5
6


// net/ipv4/af_inet.c
static const struct net_proto_family inet_family_ops = {
  .family = PF_INET,
  .create = inet_create, //这个用于 socket 系统调用创建
  .owner	= THIS_MODULE,
}

inet_family 是在 inet_init 中注册到 net_families 中的：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17



static int __init inet_init(void)
{
  /* ... */
  (void)sock_register(&inet_family_ops);

  /* Register the socket-side information for inet_create. */
	for (r = &inetsw[0]; r < &inetsw[SOCK_MAX]; ++r)
		INIT_LIST_HEAD(r);

	for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q)
		inet_register_protosw(q);

  /* ... */
}

fs_initcall(inet_init);

sock_register

查看 sock_register 的实现，实际上就是向 net_families 这个数组插入对应的协议实现：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17


int sock_register(const struct net_proto_family *ops)
{
	int err;

	spin_lock(&net_family_lock);
	if (rcu_dereference_protected(net_families[ops->family],
				      lockdep_is_held(&net_family_lock)))
		err = -EEXIST;
	else {
		rcu_assign_pointer(net_families[ops->family], ops);
		err = 0;
	}
	spin_unlock(&net_family_lock);

	pr_info("NET: Registered protocol family %d\n", ops->family);
	return err;
}

注册 inetsw[]

在 inet_init 里面接下来初始化的是 inetsw 和 inetsw_array。

这里的 inetsw 也是一个数组，type 作为下标，里面的内容是 struct inet_protosw，是协议，也即 inetsw 数组对于每个类型有一项，这一项里面是属于这个类型的协议。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18


/* The inetsw table contains everything that inet_create needs to
 * build a new socket.
 */
static struct list_head inetsw[SOCK_MAX];
static DEFINE_SPINLOCK(inetsw_lock);

static int __init inet_init(void)
{
  /* ... */

  /* Register the socket-side information for inet_create. */
  for (r = &inetsw[0]; r < &inetsw[SOCK_MAX]; ++r)
    INIT_LIST_HEAD(r);
  for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q)
    inet_register_protosw(q);

   /* ... */
}

inetsw 数组是在系统初始化的时候初始化的，就像下面代码里面实现的一样。

首先，一个循环会将 inetsw 数组的每一项，都初始化为一个链表。咱们前面说了，一个 type 类型会包含多个 protocol，因而我们需要一个链表。接下来一个循环，是将 inetsw_array 注册到 inetsw 数组里面去。inetsw_array 的定义如下，这个数组里面的内容很重要，后面会用到它们。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36


static struct inet_protosw inetsw_array[] =
{
	{
		.type =       SOCK_STREAM,
		.protocol =   IPPROTO_TCP,
		.prot =       &tcp_prot,
		.ops =        &inet_stream_ops,
		.flags =      INET_PROTOSW_PERMANENT | INET_PROTOSW_ICSK,
	},

  {
		.type =       SOCK_DGRAM,
		.protocol =   IPPROTO_UDP,
		.prot =       &udp_prot,
		.ops =        &inet_dgram_ops,
		.flags =      INET_PROTOSW_PERMANENT,
  },

  {
		.type =       SOCK_DGRAM,
		.protocol =   IPPROTO_ICMP,
		.prot =       &ping_prot,
		.ops =        &inet_sockraw_ops,
		.flags =      INET_PROTOSW_REUSE,
  },

  {
    .type =       SOCK_RAW,
    .protocol =   IPPROTO_IP,	/* wild card */
    .prot =       &raw_prot,
    .ops =        &inet_sockraw_ops,
    .flags =      INET_PROTOSW_REUSE,
  }
};

#define INETSW_ARRAY_LEN ARRAY_SIZE(inetsw_array)

sock_alloc

初始化 socket 对象，主要与文件系统打交道：

在文件系统中分配 inode

通过 new_inode_pseudo 在 socket 文件系统中创建新的 inode
- 其中 sock_mnt 为 socks 的根节点，是在内核初始化安装 socket 文件系统时赋值的
- mnt_sb 是该文件系统安装点的超级块对象的指针

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19


struct socket *sock_alloc(void)
{
	struct inode *inode;
	struct socket *sock;

	inode = new_inode_pseudo(sock_mnt->mnt_sb);
	if (!inode)
		return NULL;

	sock = SOCKET_I(inode);

	inode->i_ino = get_next_ino();
	inode->i_mode = S_IFSOCK | S_IRWXUGO;
	inode->i_uid = current_fsuid();
	inode->i_gid = current_fsgid();
	inode->i_op = &sockfs_inode_ops;

	return sock;
}

根据 inode 取得 socket 对象

由于创建 inode 是文件系统的通用逻辑，因此其返回值是 inode 对象的指针；
但这里在创建 socket 的 inode 后，需要根据 inode 得到 socket 对象；
内联函数 SOCKET_I由此而来：

1
2
3
4
5
6
7
8
9


struct socket_alloc {
	struct socket socket;
	struct inode vfs_inode;
};

static inline struct socket *SOCKET_I(struct inode *inode)
{
	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
}

回到 sock_alloc 函数，SOCKET_I 根据 inode 取得 socket 变量后，记录当前进程的一些信息

如 fsuid, fsgid，并增加 sockets_in_use 的值(该变量表示创建 socket 的个数)；
设置 inode 的 imode 为 S_IFSOCK
设置 inode 的 i_op 为 sockfs_inode_ops

1
2
3
4


static const struct inode_operations sockfs_inode_ops = {
	.listxattr = sockfs_listxattr,
	.setattr = sockfs_setattr,
};

inet_create

我们回到函数 __sock_create。接下来，在这里面，这个 inet_create 会被调用。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75


static int inet_create(struct net *net, struct socket *sock, int protocol, int kern)
{
  struct sock *sk;
  struct inet_protosw *answer;
  struct inet_sock *inet;
  struct proto *answer_prot;
  unsigned char answer_flags;
  int try_loading_module = 0;
  int err;

  sock->state = SS_UNCONNECTED;

  /* Look for the requested type/protocol pair. */
lookup_protocol:
  list_for_each_entry_rcu(answer, &inetsw[sock->type], list) {
    err = 0;
    /* Check the non-wild match. */
    if (protocol == answer->protocol) {
      if (protocol != IPPROTO_IP)
        break;
    } else {
      /* Check for the two wild cases. */
      if (IPPROTO_IP == protocol) {
        protocol = answer->protocol;
        break;
      }
      if (IPPROTO_IP == answer->protocol)
        break;
    }
    err = -EPROTONOSUPPORT;
  }

  /* ... */
  sock->ops = answer->ops;
  answer_prot = answer->prot;
  answer_flags = answer->flags;

  /* ... */
  sk = sk_alloc(net, PF_INET, GFP_KERNEL, answer_prot, kern);

  /* ... */
  inet = inet_sk(sk);
  inet->nodefrag = 0;
  if (SOCK_RAW == sock->type) {
    inet->inet_num = protocol;
    if (IPPROTO_RAW == protocol)
      inet->hdrincl = 1;
  }
  inet->inet_id = 0;
  sock_init_data(sock, sk);

  sk->sk_destruct     = inet_sock_destruct;
  sk->sk_protocol     = protocol;
  sk->sk_backlog_rcv = sk->sk_prot->backlog_rcv;

  inet->uc_ttl  = -1;
  inet->mc_loop  = 1;
  inet->mc_ttl  = 1;
  inet->mc_all  = 1;
  inet->mc_index  = 0;
  inet->mc_list  = NULL;
  inet->rcv_tos  = 0;

  if (inet->inet_num) {
    inet->inet_sport = htons(inet->inet_num);
    /* Add to protocol hash chains. */
    err = sk->sk_prot->hash(sk);
  }

  if (sk->sk_prot->init) {
    err = sk->sk_prot->init(sk);
  }

   /* ... */
}

设置 socket 状态为 SS_UNCONNECTED

1

sock->state = SS_UNCONNECTED;

找到 type/protocol 对应的 inet_protosw

在 inet_create 中，我们先会看到一个循环 list_for_each_entry_rcu。在这里，socket 的第二个参数 type 开始起作用。因为循环查看的是 inetsw[sock->type]。

我们回到 inet_create 的 list_for_each_entry_rcu 循环中。到这里就好理解了：

这是在 inetsw 数组中，根据 type 找到属于这个类型的列表
依次比较列表中的 struct inet_protosw 的 protocol 是不是用户指定的 protocol；
- 如果是，就得到了符合用户指定的 family->type->protocol 的 struct inet_protosw *answer 对象

接下来，struct socket *sock 的 ops 成员变量，被赋值为 answer 的 ops。对于 TCP 来讲，就是 inet_stream_ops。后面任何用户对于这个 socket 的操作，都是通过 inet_stream_ops 进行的。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20


const struct proto_ops inet_stream_ops = {
	.family		   = PF_INET,
	.owner		   = THIS_MODULE,
	.release	   = inet_release,
	.bind		   = inet_bind,
	.connect	   = inet_stream_connect,
	.socketpair	   = sock_no_socketpair,
	.accept		   = inet_accept,
	.getname	   = inet_getname,
	.poll		   = tcp_poll,
	.ioctl		   = inet_ioctl,
	.listen		   = inet_listen,
	.shutdown	   = inet_shutdown,
	.setsockopt	   = sock_common_setsockopt,
	.getsockopt	   = sock_common_getsockopt,
	.sendmsg	   = inet_sendmsg,
	.recvmsg	   = inet_recvmsg,

  /* ... */
}

初始化 socket

1
2
3


  sock->ops = answer->ops;
  answer_prot = answer->prot;
  answer_flags = answer->flags;

结合例子，socket 变量的 ops 指向 inet_stream_ops 结构体变量；

分配 sock 结构体变量 sk_alloc

接下来，我们创建一个 struct sock *sk 对象。这里比较让人困惑。socket 和 sock 看起来几乎一样，容易让人混淆，这里需要说明一下，socket 是用于负责对上给用户提供接口，并且和文件系统关联。而 sock 负责向下对接内核网络协议栈。前面代码中已经看到，我们通过 sock_alloc 来初始化 socket 对象，通过 sk_alloc 来初始化 sock 对象。

1

  sk = sk_alloc(net, PF_INET, GFP_KERNEL, answer_prot, kern);

在 sk_alloc 函数中，struct inet_protosw *answer 结构的 tcp_prot 赋值给了 struct sock *sk 的 sk_prot 成员。tcp_prot 的定义如下，里面定义了很多的函数，都是 sock 之下内核协议栈的动作。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22


struct proto tcp_prot = {
  .name      = "TCP",
  .owner      = THIS_MODULE,
  .close      = tcp_close,
  .connect    = tcp_v4_connect,
  .disconnect    = tcp_disconnect,
  .accept      = inet_csk_accept,
  .ioctl      = tcp_ioctl,
  .init      = tcp_v4_init_sock,
  .destroy    = tcp_v4_destroy_sock,
  .shutdown    = tcp_shutdown,
  .setsockopt    = tcp_setsockopt,
  .getsockopt    = tcp_getsockopt,
  .keepalive    = tcp_set_keepalive,
  .recvmsg    = tcp_recvmsg,
  .sendmsg    = tcp_sendmsg,
  .sendpage    = tcp_sendpage,
  .backlog_rcv    = tcp_v4_do_rcv,
  .release_cb    = tcp_release_cb,
  .hash      = inet_hash,
  .get_port    = inet_csk_get_port,
}

这里创建了 struct sock，主要用于初始化各种网络协议栈相关的对象。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35


/*
 * sk_alloc - All socket objects are allocated here
 */
struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
		      struct proto *prot, int kern)
{
	struct sock *sk;

	sk = sk_prot_alloc(prot, priority | __GFP_ZERO, family);
	if (sk) {
		sk->sk_family = family;
		/*
		 * See comment in struct sock definition to understand
		 * why we need sk_prot_creator -acme
		 */
		sk->sk_prot = sk->sk_prot_creator = prot;
		sk->sk_kern_sock = kern;
		sock_lock_init(sk);
		sk->sk_net_refcnt = kern ? 0 : 1;
		if (likely(sk->sk_net_refcnt)) {
			get_net(net);
			sock_inuse_add(net, 1);
		}

		sock_net_set(sk, net);
		refcount_set(&sk->sk_wmem_alloc, 1);

		mem_cgroup_sk_alloc(sk);
		cgroup_sk_alloc(&sk->sk_cgrp_data);
		sock_update_classid(&sk->sk_cgrp_data);
		sock_update_netprioidx(&sk->sk_cgrp_data);
	}

	return sk;
}

建立 socket 与 sock 的关系 sock_init_data

在 inet_create 函数中，接下来创建一个 struct inet_sock 结构，这个结构一开始就是 struct sock，然后扩展了一些其他的信息，剩下的代码就填充这些信息。这一幕我们会经常看到，将一个结构放在另一个结构的开始位置，然后扩展一些成员，通过对于指针的强制类型转换，来访问这些成员。

1

inet = inet_sk(sk);

这里为什么能直接将 sock 结构体变量强制转化为 inet_sock 结构体变量呢？只有一种可能，那就是在分配 sock 结构体变量时，真正分配的是 inet_sock 或是其他结构体。

我们回到分配 sock 结构体的那块代码(net/core/sock.c)：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17


static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
		int family)
{
	struct sock *sk;
	struct kmem_cache *slab;

	slab = prot->slab;
	if (slab != NULL)
    // TCP 高速缓存
		sk = kmem_cache_alloc(slab, priority & ~__GFP_ZERO);
	else
    // 内存分配
		sk = kmalloc(prot->obj_size, priority);

	return sk;

}

上面的代码在分配 sock 结构体时，有两种途径：

一是从 tcp 专用高速缓存中分配，这种情况下在初始化高速缓存时，指定了结构体大小为 prot->obj_size
二是从内存直接分配，这种情况下也有指定大小为 prot->obj_sizee

根据这点，我们看下 tcp_prot 变量中的 obj_size：

1
2


// net/ipv4/tcp_ipv4.c
.obj_size       = sizeof(struct tcp_sock),

也就是说，分配的真实结构体是 tcp_sock；由于 tcp_sock、inet_connection_sock、inet_sock、sock 之间均为 0 处偏移量，因此可以直接将 tcp_sock 直接强制转化为 inet_sock；这几个结构体间的关系如下：

创建完 sock 变量之后，便是初始化 sock 结构体，并建立 sock 与 socket 之间的引用关系；调用函数为 sock_init_data()

该函数主要工作是：

初始化 sock 结构的缓冲区、队列等；
初始化 sock 结构的状态为 TCP_CLOSE；
建立 socket 与 sock 结构的相互引用关系；

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24


// net/core/sock.c
void sock_init_data(struct socket *sock, struct sock *sk)
{
	sk_init_common(sk);
	sk->sk_send_head	=	NULL;

	timer_setup(&sk->sk_timer, NULL, 0);

	sk->sk_allocation	=	GFP_KERNEL;
	sk->sk_rcvbuf		=	sysctl_rmem_default;
	sk->sk_sndbuf		=	sysctl_wmem_default;
	sk->sk_state		=	TCP_CLOSE;
	sk_set_socket(sk, sock);

	if (sock) {
		sk->sk_type	=	sock->type;
		sk->sk_wq	=	sock->wq;
		sock->sk	=	sk;
		sk->sk_uid	=	SOCK_INODE(sock)->i_uid;
	} else {
		sk->sk_wq	=	NULL;
		sk->sk_uid	=	make_kuid(sock_net(sk)->user_ns, 0);
	}
}

使用 tcp 协议初始化 sock

inet_create() 函数最后，通过相应的协议来初始化 sock 结构：

1
2
3


  if (sk->sk_prot->init) {
    err = sk->sk_prot->init(sk);
  }

对 TCP 协议而言，其 init 函数是 tcp_v4_init_sock，它主要是对 tcp_sock 和 inet_connection_sock 进行一些初始化。

tcp_v4_init_sock

1
2
3
4
5


static int tcp_v4_init_sock(struct sock *sk)
{
	tcp_init_sock(sk);
	return 0;
}

tcp_init_sock

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56


/* Address-family independent initialization for a tcp_sock.
 *
 * NOTE: A lot of things set to zero explicitly by call to
 *       sk_alloc() so need not be done here.
 */
void tcp_init_sock(struct sock *sk)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct tcp_sock *tp = tcp_sk(sk);

	tp->out_of_order_queue = RB_ROOT;
	sk->tcp_rtx_queue = RB_ROOT;
	tcp_init_xmit_timers(sk);
	INIT_LIST_HEAD(&tp->tsq_node);
	INIT_LIST_HEAD(&tp->tsorted_sent_queue);

	icsk->icsk_rto = TCP_TIMEOUT_INIT;
	tp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
	minmax_reset(&tp->rtt_min, tcp_jiffies32, ~0U);

	/* So many TCP implementations out there (incorrectly) count the
	 * initial SYN frame in their delayed-ACK and congestion control
	 * algorithms that we must have the following bandaid to talk
	 * efficiently to them.  -DaveM
	 */
	tp->snd_cwnd = TCP_INIT_CWND;

	/* There's a bubble in the pipe until at least the first ACK. */
	tp->app_limited = ~0U;

	/* See draft-stevens-tcpca-spec-01 for discussion of the
	 * initialization of these values.
	 */
	tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
	tp->snd_cwnd_clamp = ~0;
	tp->mss_cache = TCP_MSS_DEFAULT;

	tp->reordering = sock_net(sk)->ipv4.sysctl_tcp_reordering;
	tcp_assign_congestion_control(sk);

	tp->tsoffset = 0;
	tp->rack.reo_wnd_steps = 1;

	sk->sk_state = TCP_CLOSE;

	sk->sk_write_space = sk_stream_write_space;
	sock_set_flag(sk, SOCK_USE_WRITE_QUEUE);

	icsk->icsk_sync_mss = tcp_sync_mss;

	sk->sk_sndbuf = sock_net(sk)->ipv4.sysctl_tcp_wmem[1];
	sk->sk_rcvbuf = sock_net(sk)->ipv4.sysctl_tcp_rmem[1];

	sk_sockets_allocated_inc(sk);
	sk->sk_route_forced_caps = NETIF_F_GSO;
}

sock_map_fd

创建好与 socket 相关的结构后，需要与文件系统关联，详见 sock_map_fd() 函数：

申请文件描述符，并分配 file 结构和目录项结构；
关联 socket 相关的文件操作函数表和目录项操作函数表；
将 file->private_date 指向 socket；

关于 VFS，可以参考 VFS

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18


static int sock_map_fd(struct socket *sock, int flags)
{
	struct file *newfile;
	int fd = get_unused_fd_flags(flags);
	if (unlikely(fd < 0)) {
		sock_release(sock);
		return fd;
	}

	newfile = sock_alloc_file(sock, flags, NULL);
	if (!IS_ERR(newfile)) {
		fd_install(fd, newfile);
		return fd;
	}

	put_unused_fd(fd);
	return PTR_ERR(newfile);
}

get_unused_fd_flags

从当前进程的 files 结构分配一个 unused fd

1
2
3
4


int get_unused_fd_flags(unsigned flags)
{
	return __alloc_fd(current->files, 0, rlimit(RLIMIT_NOFILE), flags);
}

sock_alloc_file

创建一个 file 对象，内核通过把 socket 指针赋值给 file 的 private_data。这样一来，就可以通过 fd，在 fdtable 中得到 file 对象，然后得到 socket 对象：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19


struct file *sock_alloc_file(struct socket *sock, int flags, const char *dname)
{
	struct file *file;

	if (!dname)
		dname = sock->sk ? sock->sk->sk_prot_creator->name : "";

	file = alloc_file_pseudo(SOCK_INODE(sock), sock_mnt, dname,
				O_RDWR | (flags & O_NONBLOCK),
				&socket_file_ops);
	if (IS_ERR(file)) {
		sock_release(sock);
		return file;
	}

	sock->file = file;
	file->private_data = sock;
	return file;
}

这里的 socket_file_ops 定义如下：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18


// net/socket.c
static const struct file_operations socket_file_ops = {
	.owner =	THIS_MODULE,
	.llseek =	no_llseek,
	.read_iter =	sock_read_iter,
	.write_iter =	sock_write_iter,
	.poll =		sock_poll,
	.unlocked_ioctl = sock_ioctl,
#ifdef CONFIG_COMPAT
	.compat_ioctl = compat_sock_ioctl,
#endif
	.mmap =		sock_mmap,
	.release =	sock_close,
	.fasync =	sock_fasync,
	.sendpage =	sock_sendpage,
	.splice_write = generic_splice_sendpage,
	.splice_read =	sock_splice_read,
};

经过这个赋值之后，我们直接对 socket 的 fd 进行 read/write 等操作，就会进入到对应协议处理。以 read 为例，这里会调用 sock_recvmsg ，最终就调用到对应协议的 recvmsg：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21


static ssize_t sock_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	struct file *file = iocb->ki_filp;
	struct socket *sock = file->private_data;
	struct msghdr msg = {.msg_iter = *to,
			     .msg_iocb = iocb};
	ssize_t res;

	if (file->f_flags & O_NONBLOCK)
		msg.msg_flags = MSG_DONTWAIT;

	if (iocb->ki_pos != 0)
		return -ESPIPE;

	if (!iov_iter_count(to))	/* Match SYS5 behaviour */
		return 0;

	res = sock_recvmsg(sock, &msg, msg.msg_flags);
	*to = msg.msg_iter;
	return res;
}

fd_install

为当前进程构建 fd 与 file 的映射关系：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29


// fs/file.c
void fd_install(unsigned int fd, struct file *file)
{
	__fd_install(current->files, fd, file);
}

void __fd_install(struct files_struct *files, unsigned int fd,
		struct file *file)
{
	struct fdtable *fdt;

	rcu_read_lock_sched();

	if (unlikely(files->resize_in_progress)) {
		rcu_read_unlock_sched();
		spin_lock(&files->file_lock);
		fdt = files_fdtable(files);
		BUG_ON(fdt->fd[fd] != NULL);
		rcu_assign_pointer(fdt->fd[fd], file);
		spin_unlock(&files->file_lock);
		return;
	}
	/* coupled with smp_wmb() in expand_fdtable() */
	smp_rmb();
	fdt = rcu_dereference_sched(files->fdt);
	BUG_ON(fdt->fd[fd] != NULL);
	rcu_assign_pointer(fdt->fd[fd], file);
	rcu_read_unlock_sched();
}

socket 调用总结

最终来总结一下：

内核中，socket 是作为一个伪文件系统来实现的，它在初始化时注册到内核
每个进程的 files_struct 域保存了所有的句柄,包括 socket 的.
- 一般的文件操作，内核直接调用 vfs 层的方法，然后会自动调用 socket 实现的相关方法
- 内核通过 inode 结构的 imode 域就可以知道当前的句柄所关联的是不是 socket 类型
- 这时遇到 socket 独有的操作，就通过 containof 方法，来计算出 socket 的地址，从而就可以进行相关的操作

Bind

地址结构

struct sockaddr

struct sockaddr 其实相当于一个基类的地址结构，其他的结构都能够直接转到 sockaddr。举个例子比如当 sa_family 为 PF_INET 时，sa_data 就包含了端口号和 ip 地址：

1
2
3
4
5


// include/linux/socket.h
struct sockaddr {
	sa_family_t	sa_family;	/* address family, AF_xxx	*/
	char		sa_data[14];	/* 14 bytes of protocol address	*/
};

struct sockaddr_in

sockaddr_in 表示了所有的 ipv4 的地址结构，即代表 AF_INET 域的地址，可以看到它相当于 sockaddr 的一个子类。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16


// include/uapi/linux/in.h
#define __SOCK_SIZE__	16		/* sizeof(struct sockaddr)	*/
struct sockaddr_in {
  __kernel_sa_family_t	sin_family;	/* Address family		*/
  __be16		sin_port;	/* Port number			*/
  struct in_addr	sin_addr;	/* Internet address		*/

  /* Pad to size of `struct sockaddr'. */
  unsigned char		__pad[__SOCK_SIZE__ - sizeof(short int) -
			sizeof(unsigned short int) - sizeof(struct in_addr)];
};

struct in_addr
  {
    in_addr_t s_addr;
  };

struct sockaddr_storage

这里还有一个内核比较新的地址结构 sockaddr_storage，他可以容纳所有类型的套接口结构，比如 ipv4，ipv6 等。相比于 sockaddr，它是强制对齐的。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


// include/uapi/linux/socket.h
struct __kernel_sockaddr_storage {
	union {
		struct {
			__kernel_sa_family_t	ss_family; /* address family */
			/* Following field(s) are implementation specific */
			char __data[_K_SS_MAXSIZE - sizeof(unsigned short)];
				/* space to achieve desired size, */
				/* _SS_MAXSIZE value minus size of ss_family */
		};
		void *__align; /* implementation specific desired alignment */
	};
};

存储数据结构

inet_hashinfo

第一个是 inet_hashinfo，它主要用来管理 tcp 的 bind hash bucket：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23


// include/net/inet_hashtables.h
#define INET_LHTABLE_SIZE	32

struct inet_hashinfo {
	/* This is for sockets with full identity only.  Sockets here will
	 * always be without wildcards and will have the following invariant:
	 *
	 *          TCP_ESTABLISHED <= sk->sk_state < TCP_CLOSE
	 *
	 */
	struct inet_ehash_bucket	*ehash;
	spinlock_t			*ehash_locks;
	unsigned int			ehash_mask;
	unsigned int			ehash_locks_mask;

	struct kmem_cache		*bind_bucket_cachep;
	struct inet_bind_hashbucket	*bhash;
	unsigned int			bhash_size;

  /* ... */
	struct inet_listen_hashbucket	listening_hash[INET_LHTABLE_SIZE]
					____cacheline_aligned_in_smp;
};

在这个结构体中，有几个元素：

ehash：e 指的是 establish，管理除了 LISTEN 状态的所有 socket 的 hash 表
- 类似于 C++中的 unordered_map<Key, sock*>
- key 是 {source_ip, destination_ip, source_port, destination_port}
- sock* 可以指向 tcp_request_sock、tcp_sock、tcp_timewait_sock 之一
bhash: b 指的是 bind，负责端口分配
- 类似于 C++中的 map<uint16_t, list<tcp_sock*>> 其中 list 对应 inet_bind_bucket
listening_hash：负责侦听(listening) socket，bucket 数量是在编译内核时确定，通常是 32 个。

tcp_hashinfo 是 inet_hashinfo 结构体的实例：

1
2
3


// net/ipv4/tcp_ipv4.c
struct inet_hashinfo tcp_hashinfo;
EXPORT_SYMBOL(tcp_hashinfo);

tcp_hashinfo 在 tcp_init 中被初始化：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


// net/ipv4/tcp.c
void __init tcp_init(void)
{
  /* ... */

	inet_hashinfo_init(&tcp_hashinfo);
	inet_hashinfo2_init(&tcp_hashinfo, "tcp_listen_portaddr_hash",
			    thash_entries, 21,  /* one slot per 2 MB*/
			    0, 64 * 1024);

	/* ... */
}

然后 tcp_hashinfo 会被赋值给 tcp_proto 和 sock 的 sk_prot 域. 其具体的实现在 /net/ipv4/tcp.c 中。

inet_ehash_bucket

struct inet_ehash_bucket 管理所有的 tcp 状态在 TCP_ESTABLISHED 和 TCP_CLOSE 之间的 sock：

1
2
3
4


// include/net/inet_hashtables.h
struct inet_ehash_bucket {
	struct hlist_nulls_head chain;
};

这里的 bucket 里面是 hlist_nulls_head 是有可能会碰到两个 hash 值一样的 socket，以链表形式连起来（这也就是 hash table）

在 tcp_init 中，我们会初始化 ehash，这里申请了一个系统级较大的 hash 表，hash 表有 thash_entries 个 bucket，它的大小在内核启动的时候确定（有参数可以配置），不会更改（即不会 rehash）：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20


// net/ipv4/tcp.c
void __init tcp_init(void)
{
	tcp_hashinfo.ehash =
		alloc_large_system_hash("TCP established",
					sizeof(struct inet_ehash_bucket),
					thash_entries,
					17, /* one slot per 128 KB of memory */
					0,
					NULL,
					&tcp_hashinfo.ehash_mask,
					0,
					thash_entries ? 0 : 512 * 1024);
	for (i = 0; i <= tcp_hashinfo.ehash_mask; i++)
		INIT_HLIST_NULLS_HEAD(&tcp_hashinfo.ehash[i].chain, i);

	if (inet_ehash_locks_alloc(&tcp_hashinfo))
		panic("TCP: failed to alloc ehash_locks");

}

参考内核日志：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


[    1.053697] NET: Registered protocol family 2
[    1.055864] TCP established hash table entries: 1048576 (order: 12, 16777216 bytes)
[    1.059805] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
[    1.088226] hashinfo create: new net:ffffffff82013d40 reuse TCP_HASHINFO ffffffff823df040
[    1.098897] TCP: Hash tables configured (established 1048576 bind 65536)
[    1.100420] death_row create: reuse TCP_DEATH_ROW net:ffffffff82013d40
[    1.101934] TCP: reno registered
[    1.102982] UDP hash table entries: 8192 (order: 6, 262144 bytes)
[    1.104444] UDP: udp_hash3_init
[    1.105868] UDP hash3 hash table entries: 1048576 (order: 11, 8388608 bytes)
[    1.108546] UDP-Lite hash table entries: 8192 (order: 6, 262144 bytes)

参考 alloc_large_system_hash 函数声明：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


void *__init alloc_large_system_hash(const char *tablename,
				     unsigned long bucketsize,
				     unsigned long numentries,
				     int scale,
				     int flags,
				     unsigned int *_hash_shift,
				     unsigned int *_hash_mask,
				     unsigned long low_limit,
				     unsigned long high_limit)
{
  /* ... */
}

inet_bind_hashbucket

inet_bind_hashbucket 是哈希桶结构， lock 成员是用于操作时对桶进行加锁， chain 成员是相同哈希值的节点的链表。结构如下，存储了所有的端口的信息：

1
2
3
4
5
6


// include/net/inet_hashtables.h

struct inet_bind_hashbucket {
	spinlock_t		lock;
	struct hlist_head	chain;
};

在 tcp_init 中，我们会初始化 bhash，这里申请了一个系统级较大的 hash 表，它的大小在内核启动的时候确定（有参数可以配置），不会更改（即不会 rehash）：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26


// net/ipv4/tcp.c
void __init tcp_init(void)
{
  /* ... */
	tcp_hashinfo.bind_bucket_cachep =
		kmem_cache_create("tcp_bind_bucket",
				  sizeof(struct inet_bind_bucket), 0,
				  SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);

	tcp_hashinfo.bhash =
		alloc_large_system_hash("TCP bind",
					sizeof(struct inet_bind_hashbucket),
					tcp_hashinfo.ehash_mask + 1,
					17, /* one slot per 128 KB of memory */
					0,
					&tcp_hashinfo.bhash_size,
					NULL,
					0,
					64 * 1024);
	tcp_hashinfo.bhash_size = 1U << tcp_hashinfo.bhash_size;
	for (i = 0; i < tcp_hashinfo.bhash_size; i++) {
		spin_lock_init(&tcp_hashinfo.bhash[i].lock);
		INIT_HLIST_HEAD(&tcp_hashinfo.bhash[i].chain);
	}
	/* ... */
}

inet_bind_bucket

前面提到的哈希桶结构中的 chain 链表中的每个节点，其宿主结构体是 inet_bind_bucket ，该结构体通过成员 node 链入链表；

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


// include/net/inet_hashtables.h

struct inet_bind_bucket {
	possible_net_t		ib_net;
	int			l3mdev;
	unsigned short		port;
	struct hlist_node	node;   // 作为bhash中chain链表的节点
	struct hlist_head	owners; // 绑定在该端口上的sock链表

  /* ... */
};

用户编程实例

1
2
3
4
5
6


struct sockaddr_in server_address;
server_address.sin_family = AF_INET;
server_address.sin_addr.s_addr = inet_addr("0.0.0.0");
server_address.sin_port = htons(9734);
server_len = sizeof(server_address);
bind(server_sockfd, (struct sockaddr *)&server_address, server_len);

bind 实现

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22


int __sys_bind(int fd, struct sockaddr __user *umyaddr, int addrlen)
{
	struct socket *sock;
	struct sockaddr_storage address;
	int err, fput_needed;

	sock = sockfd_lookup_light(fd, &err, &fput_needed);
	if (sock) {
		err = move_addr_to_kernel(umyaddr, addrlen, &address);
		if (!err) {
			err = security_socket_bind(sock,
						   (struct sockaddr *)&address,
						   addrlen);
			if (!err)
				err = sock->ops->bind(sock,
						      (struct sockaddr *)
						      &address, addrlen);
		}
		fput_light(sock->file, fput_needed);
	}
	return err;
}

查找 socket 结构

根据 fd 取得关联的 socket 结构：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25


static struct socket *sockfd_lookup_light(int fd, int *err, int *fput_needed)
{
	struct fd f = fdget(fd);
	struct socket *sock;

	*err = -EBADF;
	if (f.file) {
		sock = sock_from_file(f.file, err);
		if (likely(sock)) {
			*fput_needed = f.flags;
			return sock;
		}
		fdput(f);
	}
	return NULL;
}

struct socket *sock_from_file(struct file *file, int *err)
{
	if (file->f_op == &socket_file_ops)
		return file->private_data;	/* set in sock_map_fd */

	*err = -ENOTSOCK;
	return NULL;
}

将用户空间地址结构复制到内核空间

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


int move_addr_to_kernel(void __user *uaddr, int ulen, struct sockaddr_storage *kaddr)
{
	if (ulen < 0 || ulen > sizeof(struct sockaddr_storage))
		return -EINVAL;
	if (ulen == 0)
		return 0;
	if (copy_from_user(kaddr, uaddr, ulen))
		return -EFAULT;
	return audit_sockaddr(ulen, kaddr);
}

调用 bind 函数 inet_bind

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22


// net/ipv4/af_inet.c
int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
{
	struct sock *sk = sock->sk;
	int err;

	/* If the socket has its own bind function then use it. (RAW) */
	if (sk->sk_prot->bind) {
		return sk->sk_prot->bind(sk, uaddr, addr_len);
	}
	if (addr_len < sizeof(struct sockaddr_in))
		return -EINVAL;

	/* BPF prog is run before any checks are done so that if the prog
	 * changes context in a wrong way it will be caught.
	 */
	err = BPF_CGROUP_RUN_PROG_INET4_BIND(sk, uaddr);
	if (err)
		return err;

	return __inet_bind(sk, uaddr, addr_len, false, true);
}

可以看到最终时调用了 __inet_bind 函数

地址类型检查

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36


// net/ipv4/af_inet.c
int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
		bool force_bind_address_no_port, bool with_lock)
{
  /* ... */

	if (addr->sin_family != AF_INET) {
		/* Compatibility games : accept AF_UNSPEC (mapped to AF_INET)
		 * only if s_addr is INADDR_ANY.
		 */
		err = -EAFNOSUPPORT;
		if (addr->sin_family != AF_UNSPEC ||
		    addr->sin_addr.s_addr != htonl(INADDR_ANY))
			goto out;
	}


	tb_id = l3mdev_fib_table_by_index(net, sk->sk_bound_dev_if) ? : tb_id;
	chk_addr_ret = inet_addr_type_table(net, addr->sin_addr.s_addr, tb_id);

	/* Not specified by any standard per-se, however it breaks too
	 * many applications when removed.  It is unfortunate since
	 * allowing applications to make a non-local bind solves
	 * several problems with systems using dynamic addressing.
	 * (ie. your servers still start up even if your ISDN link
	 *  is temporarily down)
	 */
	err = -EADDRNOTAVAIL;
	if (!inet_can_nonlocal_bind(net, inet) &&
	    addr->sin_addr.s_addr != htonl(INADDR_ANY) &&
	    chk_addr_ret != RTN_LOCAL &&
	    chk_addr_ret != RTN_MULTICAST &&
	    chk_addr_ret != RTN_BROADCAST)
		goto out;
	 /* ... */
}

这里的 inet_addr_type_table 用于检查该地址所属的类型：

广播地址
本地地址
多播地址

然后再做对应判断

端口范围的检查

如果端口号小于 inet_prot_sock，也即是 1024 必须得有 root 权限，如果没有则退出
capable 就是用来判断权限的，如果没有 CAP_NET_BIND_SERVICE 则退出

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19


// net/ipv4/af_inet.c

static inline int inet_prot_sock(struct net *net)
{
	return PROT_SOCK;
}

int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
		bool force_bind_address_no_port, bool with_lock)
{
  /* ... */
  snum = ntohs(addr->sin_port);
	err = -EACCES;
	if (snum && snum < inet_prot_sock(net) &&
	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
		goto out;

  /* ... */
}

设置发送源地址和接收源地址

这里先检查 sock 的状态，如果不是 TCP_CLOSE 或端口为 0，则出错返回（这里也映射到创建 socket 时要将 sock 结构体变量的状态设置为 TCP_CLOSE 上了）
如果地址类型是多播或广播，则源地址设置为 0，而接收地址为设置的 ip 地址

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17


// net/ipv4/af_inet.c
int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
		bool force_bind_address_no_port, bool with_lock)
{
  /* ... */

  /* Check these errors (active socket, double bind). */
	err = -EINVAL;
	if (sk->sk_state != TCP_CLOSE || inet->inet_num)
		goto out_release_sock;

	inet->inet_rcv_saddr = inet->inet_saddr = addr->sin_addr.s_addr;
	if (chk_addr_ret == RTN_MULTICAST || chk_addr_ret == RTN_BROADCAST)
		inet->inet_saddr = 0;  /* Use device */

  /* ... */
}

检查端口是否被占用 inet_csk_get_port

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21


// net/ipv4/af_inet.c
int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
		bool force_bind_address_no_port, bool with_lock)
{
  /* ... */

	/* Make sure we are allowed to bind here. */
	if (snum || !(inet->bind_address_no_port ||
		      force_bind_address_no_port)) {
		if (sk->sk_prot->get_port(sk, snum)) {
			inet->inet_saddr = inet->inet_rcv_saddr = 0;
			err = -EADDRINUSE;
			goto out_release_sock;
		}
		err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk);
		if (err) {
			inet->inet_saddr = inet->inet_rcv_saddr = 0;
			goto out_release_sock;
		}
	}
}

用于检查端口的函数为 sk->sk_prot->get_port(sk, snum)，在 tcp 中被实例化为 inet_csk_get_port，这里我先来介绍下 inet_csk_get_port 的流程：

当绑定的 port 为 0 时，也即是需要 kernel 来分配一个新的 port
1. 首先得到系统的 port 范围
2. 随机分配一个 port
3. 从 bhash 中得到当前随机分配的端口的链表(也就是 inet_bind_bucket 链表)
4. 遍历这个链表(链表为空的话,也说明这个 port 没有被使用)，如果这个端口已经被使用，则将端口号加一，继续循环，直到找到当前没有被使用的 port，也就是没有在 bhash 中存在的 port
5. 新建一个 inet_bind_bucket，并插入到 bhash 中.
当指定 port
1. 从 bhash 中根据 hash 值 (port 计算的) 取得当前指定端口对应的 inet_bind_bucket 结构
2. 如果 bhash 中存在，则说明这个端口已经在使用，因此需要判断这个端口是否允许被 reuse
3. 如果不存在，则步骤和上面的第 5 步一样

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54


// net/ipv4/inet_connection_sock.c
int inet_csk_get_port(struct sock *sk, unsigned short snum)
{
	bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN;
	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
	int ret = 1, port = snum;
	struct inet_bind_hashbucket *head;
	struct net *net = sock_net(sk);
	struct inet_bind_bucket *tb = NULL;
	kuid_t uid = sock_i_uid(sk);
	int l3mdev;

	l3mdev = inet_sk_bound_l3mdev(sk);

  // 端口为0,也就是需要内核来分配端口
	if (!port) {
		head = inet_csk_find_open_port(sk, &tb, &port);
		if (!head)
			return ret;
		if (!tb)
			goto tb_not_found;
		goto success;
	}
	head = &hinfo->bhash[inet_bhashfn(net, port,
					  hinfo->bhash_size)];
	spin_lock_bh(&head->lock);
	inet_bind_bucket_for_each(tb, &head->chain)
		if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
		    tb->port == port)
			goto tb_found;
tb_not_found:
	tb = inet_bind_bucket_create(hinfo->bind_bucket_cachep,
				     net, head, port, l3mdev);
	if (!tb)
		goto fail_unlock;
tb_found:
	if (!hlist_empty(&tb->owners)) {
		if (sk->sk_reuse == SK_FORCE_REUSE)
			goto success;

		if (inet_csk_bind_conflict(sk, tb, true, true))
			goto fail_unlock;
	}
success:
  /* ... */

	if (sk->sk_reuseport) {}

  /* ... */
	if (!inet_csk(sk)->icsk_bind_hash)
		inet_bind_hash(sk, tb, port);

	return ret;
}

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20


// net/ipv4/inet_hashtables.c
struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
						 struct net *net,
						 struct inet_bind_hashbucket *head,
						 const unsigned short snum,
						 int l3mdev)
{
	struct inet_bind_bucket *tb = kmem_cache_alloc(cachep, GFP_ATOMIC);

	if (tb) {
		write_pnet(&tb->ib_net, net);
		tb->l3mdev    = l3mdev;
		tb->port      = snum;
		tb->fastreuse = 0;
		tb->fastreuseport = 0;
		INIT_HLIST_HEAD(&tb->owners);
		hlist_add_head(&tb->node, &head->chain);
	}
	return tb;
}

初始化目标地址和目的地址

由于绑定时不知道目的地址的信息，所以置为 0

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18


// net/ipv4/af_inet.c
int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
		bool force_bind_address_no_port, bool with_lock)
{
  /* ... */

	if (inet->inet_rcv_saddr)
		sk->sk_userlocks |= SOCK_BINDADDR_LOCK;
	if (snum)
		sk->sk_userlocks |= SOCK_BINDPORT_LOCK;
	inet->inet_sport = htons(inet->inet_num);
	inet->inet_daddr = 0;
	inet->inet_dport = 0;
	sk_dst_reset(sk);
	err = 0;

  /* ... */
}

Listen

数据结构

request_sock_queue

前面提到，inet_connection_sock 是所有 面向连接 的 socket 表示，是对 struct inet_sock 的扩展；它的第一个域就是 inet_sock。主要维护了与其绑定的端口结构，以及监听等状态过程中的信息。它包含了一个 icsk_accept_queue 的域,这个域是一个 request_sock_queue 类型。

每个需要 listen 的 sock 都需要维护一个 request_sock_queue 结构，主要保存了与自己连接的半连接信息。

request_sock_queue 也就表示一个 request_sock 队列。这里我们知道，tcp 中分为

半连接队列：处于 SYN_RECVD 状态，刚接到 syn,等待三次握手完成
已完成连接队列：处于 established 状态，已经完成三次握手,等待 accept 来读取

在连接建立过程中：

当当 syn 信号到来时，即处于半连接状态时，我们都会新建一个 request_sock结构,并将它加入到 listen_sock 的 request_sock hash表中
当当 3 次握手完毕后，即处于以完成连接状态时，将它放入到 request_sock_queue 的 rskq_accept_head 和 rskq_accept_tail 队列中
当 accept 的时候，就直接从这个队列中读取，并且将 request_sock结构释放，然后在 BSD 层新建一个 socket 结构，并将它和接收端新建的子 sock 结构关联起来

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


// include/net/request_sock.h
struct request_sock_queue {
	spinlock_t		rskq_lock;
	u8			rskq_defer_accept;

	u32			synflood_warned;
	atomic_t		qlen;
	atomic_t		young;

	struct request_sock	*rskq_accept_head;
	struct request_sock	*rskq_accept_tail;
	struct fastopen_queue	fastopenq;  /* Check max_qlen != 0 to determine
					     * if TFO is enabled.
					     */
};

listen_sock

listen_sock 维护了一个 sock 的半连接队列的信息，其中的 syn_table 中是连接该 sock 的所有半连接队列的 hash 数组。

新建立的 request_sock 就存放在 listen_sock 结构的 syn_table 中。这是一个哈希数组，总共有 nr_table_entries 项。实际上在分配内存时，分配的大小是 TCP_SYNQ_HSIZE(512)项。成员 max_qlen_log 以 2 的对数的形式表示 request_sock 队列的最大值。哈希表有 512 项，但队列的最大值的取值是 1024。即 max_qlen_log 的值为 10。qlen 是队列的当前长度。hash_rnd 是一个随机数，计算哈希值用。

我们需要知道每一个 SYN 请求都会新建一个 request_sock 结构,并将它加入到 listen_sock 的 syn_table 哈希表中，然后接收端会发送一个 SYN/ACK 段给 SYN 请求端，当 SYN 请求端将 3 次握手的最后一个 ACK 发送给接收端后，并且接收端判断 ACK 正确，则将 request_sock 从 syn_table 哈希表中删除，将 request_sock 加入到 request_sock_queue 的 rskq_accept_head 和 rskq_accept_tail 队列中，最后的 accept 系统调用不过是判断 accept 队列是否存在完成 3 次请求的 request_sock，从这个队列中将 request_sock 结构释放，然后在 BSD 层新建一个 socket 结构，并将它和接收端新建的子 sock 结构关联起来。

request_sock

request_sock 保存了 tcp 双方传输所必需的一些域，比如窗口大小、对端速率、对端数据包序列号等等这些值.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22


// include/net/request_sock.h
struct request_sock {
	struct sock_common		__req_common;
#define rsk_refcnt			__req_common.skc_refcnt
#define rsk_hash			__req_common.skc_hash
#define rsk_listener			__req_common.skc_listener
#define rsk_window_clamp		__req_common.skc_window_clamp
#define rsk_rcv_wnd			__req_common.skc_rcv_wnd

	struct request_sock		*dl_next;
	u16				mss;
	u8				num_retrans; /* number of retransmits */
	u8				cookie_ts:1; /* syncookie: encode tcpopts in timestamp */
	u8				num_timeout:7; /* number of timeouts */
	u32				ts_recent;
	struct timer_list		rsk_timer;
	const struct request_sock_ops	*rsk_ops;
	struct sock			*sk;
	u32				*saved_syn;
	u32				secid;
	u32				peer_secid;
};

inet_request_sock

struct inet_request_sock 是 request_sock 的扩展，其定义如下

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30


// include/linux/inet_sock.h

struct inet_request_sock {
	struct request_sock	req;
#define ir_loc_addr		req.__req_common.skc_rcv_saddr
#define ir_rmt_addr		req.__req_common.skc_daddr
#define ir_num			req.__req_common.skc_num
#define ir_rmt_port		req.__req_common.skc_dport
#define ir_v6_rmt_addr		req.__req_common.skc_v6_daddr
#define ir_v6_loc_addr		req.__req_common.skc_v6_rcv_saddr
#define ir_iif			req.__req_common.skc_bound_dev_if
#define ir_cookie		req.__req_common.skc_cookie
#define ireq_net		req.__req_common.skc_net
#define ireq_state		req.__req_common.skc_state
#define ireq_family		req.__req_common.skc_family

	u16			snd_wscale : 4,
				rcv_wscale : 4,
				tstamp_ok  : 1,
				sack_ok	   : 1,
				wscale_ok  : 1,
				ecn_ok	   : 1,
				acked	   : 1,
				no_srccheck: 1,
				smc_ok	   : 1;
	u32                     ir_mark;
	union {
		struct ip_options_rcu __rcu	*ireq_opt;
	};
};

tcp_request_sock

增加了对于接收和发送的 ISN（初始序列号）等的维护。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16


// include/linux/tcp.h
struct tcp_request_sock {
	struct inet_request_sock 	req;
	const struct tcp_request_sock_ops *af_specific;
	u64				snt_synack; /* first SYNACK sent time */
	bool				tfo_listener;
	u32				txhash;
	u32				rcv_isn;
	u32				snt_isn;
	u32				ts_off;
	u32				last_oow_ack_time; /* last SYNACK */
	u32				rcv_nxt; /* the ack # by SYNACK. For
						  * FastOpen it's the seq#
						  * after data-in-SYN.
						  */
};

Listen 实现

Listen 库函数调用的主要工作可以分为以下几步：

根据 socket 文件描述符找到内核中对应的 socket 结构体变量；
设置 socket 的状态并初始化等待连接队列；
将 socket 放入 inet_hashinfo 的 listen 哈希表中；

这里还有一个概念那就是 backlog 是用户通过函数调用传入的。在 linux 中, backlog 的大小指的是可支持连接队列的大小，而可支持的半开连接的大小一般是和 backlog差不多，而半开连接队列的最大长度是根据 backlog 计算的。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20


int __sys_listen(int fd, int backlog)
{
	struct socket *sock;
	int err, fput_needed;
	int somaxconn;

	sock = sockfd_lookup_light(fd, &err, &fput_needed);
	if (sock) {
		somaxconn = sock_net(sock->sk)->core.sysctl_somaxconn;
		if ((unsigned int)backlog > somaxconn)
			backlog = somaxconn;

		err = security_socket_listen(sock, backlog);
		if (!err)
			err = sock->ops->listen(sock, backlog);

		fput_light(sock->file, fput_needed);
	}
	return err;
}

inet_listen

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51


// net/ipv4/af_inet.c

/*
 *	Move a socket into listening state.
 */
int inet_listen(struct socket *sock, int backlog)
{
	struct sock *sk = sock->sk;
	unsigned char old_state;
	int err, tcp_fastopen;

	lock_sock(sk);

	err = -EINVAL;
	if (sock->state != SS_UNCONNECTED || sock->type != SOCK_STREAM)
		goto out;

	old_state = sk->sk_state;
	if (!((1 << old_state) & (TCPF_CLOSE | TCPF_LISTEN)))
		goto out;

	sk->sk_max_ack_backlog = backlog;
	/* Really, if the socket is already in listen state
	 * we can only allow the backlog to be adjusted.
	 */
	if (old_state != TCP_LISTEN) {
		/* Enable TFO w/o requiring TCP_FASTOPEN socket option.
		 * Note that only TCP sockets (SOCK_STREAM) will reach here.
		 * Also fastopen backlog may already been set via the option
		 * because the socket was in TCP_LISTEN state previously but
		 * was shutdown() rather than close().
		 */
		tcp_fastopen = sock_net(sk)->ipv4.sysctl_tcp_fastopen;
		if ((tcp_fastopen & TFO_SERVER_WO_SOCKOPT1) &&
		    (tcp_fastopen & TFO_SERVER_ENABLE) &&
		    !inet_csk(sk)->icsk_accept_queue.fastopenq.max_qlen) {
			fastopen_queue_tune(sk, backlog);
			tcp_fastopen_init_key_once(sock_net(sk));
		}

		err = inet_csk_listen_start(sk, backlog);
		if (err)
			goto out;
		tcp_call_bpf(sk, BPF_SOCK_OPS_TCP_LISTEN_CB, 0, NULL);
	}
	err = 0;

out:
	release_sock(sk);
	return err;
}

上面的代码中，有点值得注意的是，当 sock 状态已经是 TCP_LISTEN 时，也可以继续调用 listen() 库函数，其作用是设置 sock 的最大并发连接请求数；

inet_csk_listen_start

它的主要工作是新分配一个 listen socket，将它加入到 inet_connection_sock 的 icsk_accept_queue域的 listen_opt 中，然后对当前使用端口进行判断，最终返回:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33


// net/ipv4/inet_connection_sock.c

int inet_csk_listen_start(struct sock *sk, int backlog)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct inet_sock *inet = inet_sk(sk);
	int err = -EADDRINUSE;

  // 初始化连接等待队列
	reqsk_queue_alloc(&icsk->icsk_accept_queue);

	sk->sk_ack_backlog = 0;   // 已经连接的个数
	inet_csk_delack_init(sk); // 将ack标记全部置0

	/* There is race window here: we announce ourselves listening,
	 * but this transition is still not validated by get_port().
	 * It is OK, because this socket enters to hash table only
	 * after validation is complete.
	 */
	inet_sk_state_store(sk, TCP_LISTEN); // 设置sock的状态为TCP_LISTEN
	if (!sk->sk_prot->get_port(sk, inet->inet_num)) {
		inet->inet_sport = htons(inet->inet_num);

		sk_dst_reset(sk);
		err = sk->sk_prot->hash(sk); // 将当前socket哈希到inet_hashinfo中

		if (likely(!err))
			return 0;
	}

	inet_sk_set_state(sk, TCP_CLOSE);
	return err;
}

request_sock_queue 的初始化 reqsk_queue_alloc

先看下 reqsk_queue_alloc() 的源代码：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


// net/core/request_sock.c
void reqsk_queue_alloc(struct request_sock_queue *queue)
{
	spin_lock_init(&queue->rskq_lock);

	spin_lock_init(&queue->fastopenq.lock);
	queue->fastopenq.rskq_rst_head = NULL;
	queue->fastopenq.rskq_rst_tail = NULL;
	queue->fastopenq.qlen = 0;

	queue->rskq_accept_head = NULL;
}

整个过程中，先计算 request_sock 的大小并申请空间，然后初始化 request_sock_queue 的相应成员的值；

端口注册

前面提到管理 socket 的哈希表结构 inet_hashinfo ，其中的成员 listening_hash 用于存放处于 TCP_LISTEN 状态的 sock ；

当 socket 通过 listen() 调用完成等待连接队列的初始化后，需要将当前 sock 放到该结构体中：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


	inet_sk_state_store(sk, TCP_LISTEN); // 设置sock的状态为TCP_LISTEN
	if (!sk->sk_prot->get_port(sk, inet->inet_num)) {
		inet->inet_sport = htons(inet->inet_num);

		sk_dst_reset(sk);
		err = sk->sk_prot->hash(sk); // 将当前socket哈希到inet_hashinfo中

		if (likely(!err))
			return 0;
	}

这里 sk_prot->hash(sk) 调用了 net/ipv4/inet_hashtables.c:inet_hash() 方法：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50


int inet_hash(struct sock *sk)
{
	int err = 0;

	if (sk->sk_state != TCP_CLOSE) {
		local_bh_disable();
		err = __inet_hash(sk, NULL);
		local_bh_enable();
	}

	return err;
}

int __inet_hash(struct sock *sk, struct sock *osk)
{
	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
	struct inet_listen_hashbucket *ilb;
	int err = 0;

  // 状态检查
	if (sk->sk_state != TCP_LISTEN) {
		inet_ehash_nolisten(sk, osk);
		return 0;
	}
	WARN_ON(!sk_unhashed(sk));

  // 计算hash值，取得链表
	ilb = &hashinfo->listening_hash[inet_sk_listen_hashfn(sk)];

	spin_lock(&ilb->lock);
	if (sk->sk_reuseport) {
		err = inet_reuseport_add_sock(sk, ilb);
		if (err)
			goto unlock;
	}
	if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
		sk->sk_family == AF_INET6)
		hlist_add_tail_rcu(&sk->sk_node, &ilb->head);
	else
		hlist_add_head_rcu(&sk->sk_node, &ilb->head);
	inet_hash2(hashinfo, sk);
	ilb->count++;
	sock_set_flag(sk, SOCK_RCU_FREE);
  // 将sock添加到listen链表中
	sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
unlock:
	spin_unlock(&ilb->lock);

	return err;
}

Accept

accept 的作用就是从 accept 队列中取出三次握手完成的 sock，并将它关联到 vfs 上(其实操作和调用 sys_socket 时新建一个 socket 类似)，然后返回。

这里还有个要注意的：

如果这个传递给 accept 的 socket 是非阻塞的话，就算 accept 队列为空，也会直接返回
如果这个传递给 accept 的 socket 是阻塞的话，会休眠掉，等待 accept 队列有数据后唤醒它

接下来我们就来看它的实现，accept 对应的系统调用是 sys_accept，而他则会调用 sys_accept4,因此我们直接来看 sys_accept4:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80


int __sys_accept4(int fd, struct sockaddr __user *upeer_sockaddr,
		  int __user *upeer_addrlen, int flags)
{
	struct socket *sock, *newsock;
	struct file *newfile;
	int err, len, newfd, fput_needed;
	struct sockaddr_storage address;

	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
		return -EINVAL;

	if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
		flags = (flags & ~SOCK_NONBLOCK) | O_NONBLOCK;

	sock = sockfd_lookup_light(fd, &err, &fput_needed);
	if (!sock)
		goto out;

	err = -ENFILE;
	newsock = sock_alloc();
	if (!newsock)
		goto out_put;

	newsock->type = sock->type;
	newsock->ops = sock->ops;

	/*
	 * We don't need try_module_get here, as the listening socket (sock)
	 * has the protocol module (sock->ops->owner) held.
	 */
	__module_get(newsock->ops->owner);

	newfd = get_unused_fd_flags(flags);
	if (unlikely(newfd < 0)) {
		err = newfd;
		sock_release(newsock);
		goto out_put;
	}
	newfile = sock_alloc_file(newsock, flags, sock->sk->sk_prot_creator->name);
	if (IS_ERR(newfile)) {
		err = PTR_ERR(newfile);
		put_unused_fd(newfd);
		goto out_put;
	}

	err = security_socket_accept(sock, newsock);
	if (err)
		goto out_fd;

	err = sock->ops->accept(sock, newsock, sock->file->f_flags, false);
	if (err < 0)
		goto out_fd;

	if (upeer_sockaddr) {
		len = newsock->ops->getname(newsock,
					(struct sockaddr *)&address, 2);
		if (len < 0) {
			err = -ECONNABORTED;
			goto out_fd;
		}
		err = move_addr_to_user(&address,
					len, upeer_sockaddr, upeer_addrlen);
		if (err < 0)
			goto out_fd;
	}

	/* File flags are not inherited via accept() unlike another OSes. */

	fd_install(newfd, newfile);
	err = newfd;

out_put:
	fput_light(sock->file, fput_needed);
out:
	return err;
out_fd:
	fput(newfile);
	put_unused_fd(newfd);
	goto out_put;
}

可以看到，这里创建了一个新的 socket 结构，然后调用了 sock->ops->accept，也就是 inet_accept。

实现

inet_accept

可以看到流程很简单,最终的实现都集中在 inet_accept 中了.而 inet_accept 主要工作如下：

调用 inet_csk_accept 来进行对 accept 队列的操作，它会返回取得的 sock.
将从 inet_csk_accept返回的 sock 链接到传递进来的 new 的 socket 中
- 这里就知道我们上面为什么只需要 new 一个 socket 而不是 sock了，因为 sock 我们是直接从 accept 队列中取得的
设置新的 socket 的状态为 SS_CONNECTED

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25


int inet_accept(struct socket *sock, struct socket *newsock, int flags,
		bool kern)
{
	struct sock *sk1 = sock->sk;
	int err = -EINVAL;
	struct sock *sk2 = sk1->sk_prot->accept(sk1, flags, &err, kern);

	if (!sk2)
		goto do_err;

	lock_sock(sk2);

	sock_rps_record_flow(sk2);
	WARN_ON(!((1 << sk2->sk_state) &
		  (TCPF_ESTABLISHED | TCPF_SYN_RECV |
		  TCPF_CLOSE_WAIT | TCPF_CLOSE)));

	sock_graft(sk2, newsock);

	newsock->state = SS_CONNECTED;
	err = 0;
	release_sock(sk2);
do_err:
	return err;
}

inet_csk_accept

inet_csk_accept 就是从 accept 队列中取出 sock 然后返回

在看他的源码之前先来看几个相关函数的实现：

1
2
3
4
5


include/net/request_sock.h
static inline bool reqsk_queue_empty(const struct request_sock_queue *queue)
{
	return READ_ONCE(queue->rskq_accept_head) == NULL;
}

reqsk_queue_remove 主要是从 accept 队列中得到一个 sock

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16


static inline struct request_sock *reqsk_queue_remove(struct request_sock_queue *queue,
						      struct sock *parent)
{
	struct request_sock *req;

	spin_lock_bh(&queue->rskq_lock);
	req = queue->rskq_accept_head;
	if (req) {
		sk_acceptq_removed(parent);
		WRITE_ONCE(queue->rskq_accept_head, req->dl_next);
		if (queue->rskq_accept_head == NULL)
			queue->rskq_accept_tail = NULL;
	}
	spin_unlock_bh(&queue->rskq_lock);
	return req;
}

inet_csk_wait_for_connect 用来在 accept 队列为空的情况下，休眠掉一段时间 (这里每个 socket 都有一个等待队列的)

这里是每个调用的进程都会声明一个 wait 队列，然后将它连接到主的 socket 的等待队列链表中，然后休眠，。等到唤醒.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30


static int inet_csk_wait_for_connect(struct sock *sk, long timeo)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
  // 定义一个waitqueue.
	DEFINE_WAIT(wait);
	int err;

	for (;;) {
		prepare_to_wait_exclusive(sk_sleep(sk), &wait,
					  TASK_INTERRUPTIBLE);
		release_sock(sk);
		if (reqsk_queue_empty(&icsk->icsk_accept_queue))
			timeo = schedule_timeout(timeo);
		sched_annotate_sleep();
		lock_sock(sk);
		err = 0;
		if (!reqsk_queue_empty(&icsk->icsk_accept_queue))
			break;
		err = -EINVAL;
		if (sk->sk_state != TCP_LISTEN)
			break;
		err = sock_intr_errno(timeo);
		if (signal_pending(current))
			break;
		err = -EAGAIN;
		if (!timeo)
			break;
	}
	finish_wait(sk_sleep(sk), &wait);
	return err;

inet_csk_accept 中有个阻塞和非阻塞的问题.非阻塞的话会直接返回的,就算 accept 队列为空.这个时侯设置 errno 为-EAGAIN.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59


struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct request_sock_queue *queue = &icsk->icsk_accept_queue;
	struct request_sock *req;
	struct sock *newsk;
	int error;

	lock_sock(sk);

	/* We need to make sure that this socket is listening,
	 * and that it has something pending.
	 */
	error = -EINVAL;
	if (sk->sk_state != TCP_LISTEN)
		goto out_err;

	/* Find already established connection */
	if (reqsk_queue_empty(queue)) {
		long timeo = sock_rcvtimeo(sk, flags & O_NONBLOCK);

		/* If this is a non blocking socket don't sleep */
		error = -EAGAIN;
		if (!timeo)
			goto out_err;

		error = inet_csk_wait_for_connect(sk, timeo);
		if (error)
			goto out_err;
	}
	req = reqsk_queue_remove(queue, sk);
	newsk = req->sk;

	if (sk->sk_protocol == IPPROTO_TCP &&
	    tcp_rsk(req)->tfo_listener) {
		spin_lock_bh(&queue->fastopenq.lock);
		if (tcp_rsk(req)->tfo_listener) {
			/* We are still waiting for the final ACK from 3WHS
			 * so can't free req now. Instead, we set req->sk to
			 * NULL to signify that the child socket is taken
			 * so reqsk_fastopen_remove() will free the req
			 * when 3WHS finishes (or is aborted).
			 */
			req->sk = NULL;
			req = NULL;
		}
		spin_unlock_bh(&queue->fastopenq.lock);
	}
out:
	release_sock(sk);
	if (req)
		reqsk_put(req);
	return newsk;
out_err:
	newsk = NULL;
	req = NULL;
	*err = error;
	goto out;
}

sock_graft

将 socket 和 sock 关联。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


static inline void sock_graft(struct sock *sk, struct socket *parent)
{
	WARN_ON(parent->sk);
	write_lock_bh(&sk->sk_callback_lock);
	rcu_assign_pointer(sk->sk_wq, &parent->wq);
	parent->sk = sk;
	sk_set_socket(sk, parent);
	sk->sk_uid = SOCK_INODE(parent)->i_uid;
	security_sock_graft(sk, parent);
	write_unlock_bh(&sk->sk_callback_lock);
}

Connect

什么情况下，icsk_accept_queue 才不为空呢？当然是三次握手结束才可以。接下来我们来分析三次握手的过程。

三次握手一般是由客户端调用 connect 发起，它的具体流程是:

由 fd 得到 socket, 并且将地址复制到内核空间
调用 inet_stream_connect 进行主要的处理

这里要注意 connect 也有个阻塞和非阻塞的区别，阻塞的话调用 inet_wait_for_connect 休眠，等待握手完成，否则直接返回。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25


int __sys_connect(int fd, struct sockaddr __user *uservaddr, int addrlen)
{
	struct socket *sock;
	struct sockaddr_storage address;
	int err, fput_needed;

	sock = sockfd_lookup_light(fd, &err, &fput_needed);
	if (!sock)
		goto out;
	err = move_addr_to_kernel(uservaddr, addrlen, &address);
	if (err < 0)
		goto out_put;

	err =
	    security_socket_connect(sock, (struct sockaddr *)&address, addrlen);
	if (err)
		goto out_put;

	err = sock->ops->connect(sock, (struct sockaddr *)&address, addrlen,
				 sock->file->f_flags);
out_put:
	fput_light(sock->file, fput_needed);
out:
	return err;
}

client 端

inet_stream_connect

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
			int addr_len, int flags)
{
	int err;

	lock_sock(sock->sk);
	err = __inet_stream_connect(sock, uaddr, addr_len, flags, 0);
	release_sock(sock->sk);
	return err;
}

看到主要调用了 inet_stream_connect，它的主要工作是:

判断 socket 的状态，只有当为 SS_UNCONNECTED 也就是非连接状态时才调用 tcp_v4_connect 来进行连接处理
判断 tcp 的状态 sk_state 只能为 TCPF_SYN_SENT 或者 TCPF_SYN_RECV 才进入相关处理
如果状态合适并且 socket 为阻塞模式则调用 inet_wait_for_connect 进入休眠等待三次握手完成来重新唤醒它。
否则直接返回，并设置错误号为 EINPROGRESS.

需要注意一下，如果为非阻塞模式，用户可以通过 select 来查看 socket 是否可读写，从而判断是否已连接。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99


int __inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
			  int addr_len, int flags, int is_sendmsg)
{
	struct sock *sk = sock->sk;
	int err;
	long timeo;

	if (uaddr) {
		if (addr_len < sizeof(uaddr->sa_family))
			return -EINVAL;

		if (uaddr->sa_family == AF_UNSPEC) {
			err = sk->sk_prot->disconnect(sk, flags);
			sock->state = err ? SS_DISCONNECTING : SS_UNCONNECTED;
			goto out;
		}
	}

	switch (sock->state) {
	default:
		err = -EINVAL;
		goto out;
	case SS_CONNECTED:
		err = -EISCONN;
		goto out;
	case SS_CONNECTING:
		if (inet_sk(sk)->defer_connect)
			err = is_sendmsg ? -EINPROGRESS : -EISCONN;
		else
			err = -EALREADY;
		/* Fall out of switch with err, set for this state */
		break;
	case SS_UNCONNECTED:
		err = -EISCONN;
		if (sk->sk_state != TCP_CLOSE)
			goto out;

		if (BPF_CGROUP_PRE_CONNECT_ENABLED(sk)) {
			err = sk->sk_prot->pre_connect(sk, uaddr, addr_len);
			if (err)
				goto out;
		}

		err = sk->sk_prot->connect(sk, uaddr, addr_len);
		if (err < 0)
			goto out;

		sock->state = SS_CONNECTING;

		if (!err && inet_sk(sk)->defer_connect)
			goto out;

		/* Just entered SS_CONNECTING state; the only
		 * difference is that return value in non-blocking
		 * case is EINPROGRESS, rather than EALREADY.
		 */
		err = -EINPROGRESS;
		break;
	}

	timeo = sock_sndtimeo(sk, flags & O_NONBLOCK);

	if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
		int writebias = (sk->sk_protocol == IPPROTO_TCP) &&
				tcp_sk(sk)->fastopen_req &&
				tcp_sk(sk)->fastopen_req->data ? 1 : 0;

		/* Error code is set above */
		if (!timeo || !inet_wait_for_connect(sk, timeo, writebias))
			goto out;

		err = sock_intr_errno(timeo);
		if (signal_pending(current))
			goto out;
	}

	/* Connection was closed by RST, timeout, ICMP error
	 * or another process disconnected us.
	 */
	if (sk->sk_state == TCP_CLOSE)
		goto sock_error;

	/* sk->sk_err may be not zero now, if RECVERR was ordered by user
	 * and error was received after socket entered established state.
	 * Hence, it is handled normally after connect() return successfully.
	 */

	sock->state = SS_CONNECTED;
	err = 0;
out:
	return err;

sock_error:
	err = sock_error(sk) ? : -ECONNABORTED;
	sock->state = SS_UNCONNECTED;
	if (sk->sk_prot->disconnect(sk, flags))
		sock->state = SS_DISCONNECTING;
	goto out;
}

tcp_v4_connect

tcp_v4_connect 的流程:

判断地址的一些合法性.
调用 ip_route_connect 来查找出去的路由(包括查找临时端口等等)
1. 为什么呢？因为三次握手马上就要发送一个 SYN 包了，这就要凑齐源地址、源端口、目标地址、目标端口。目标地址和目标端口是服务端的，已经知道源端口是客户端随机分配的，源地址应该用哪一个呢？这时候要选择一条路由，看从哪个网卡出去，就应该填写哪个网卡的 IP 地址。
设置 sock 的状态为 TCP_SYN_SENT，并调用 inet_hash_connect 来查找一个临时端口(也就是我们出去的端口)，并加入到对应的 hash 链表(具体操作和 get_port 很相似)
调用 tcp_connect 来完成最终的操作：这个函数主要用来初始化将要发送的 syn 包(包括窗口大小 isn 等等),然后将这个 sk_buffer 加入到 socket 的写队列。最终调用 tcp_transmit_skb传输到 3 层。

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134


/* This will initiate an outgoing connection. */
int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
{
	struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
	struct inet_sock *inet = inet_sk(sk);
	struct tcp_sock *tp = tcp_sk(sk);
	__be16 orig_sport, orig_dport;
	__be32 daddr, nexthop;
	struct flowi4 *fl4;
	struct rtable *rt;
	int err;
	struct ip_options_rcu *inet_opt;
	struct inet_timewait_death_row *tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;

	if (addr_len < sizeof(struct sockaddr_in))
		return -EINVAL;

	if (usin->sin_family != AF_INET)
		return -EAFNOSUPPORT;

	nexthop = daddr = usin->sin_addr.s_addr;
	inet_opt = rcu_dereference_protected(inet->inet_opt,
					     lockdep_sock_is_held(sk));
	if (inet_opt && inet_opt->opt.srr) {
		if (!daddr)
			return -EINVAL;
		nexthop = inet_opt->opt.faddr;
	}

	orig_sport = inet->inet_sport;
	orig_dport = usin->sin_port;
	fl4 = &inet->cork.fl.u.ip4;
	rt = ip_route_connect(fl4, nexthop, inet->inet_saddr,
			      RT_CONN_FLAGS(sk), sk->sk_bound_dev_if,
			      IPPROTO_TCP,
			      orig_sport, orig_dport, sk);
	if (IS_ERR(rt)) {
		err = PTR_ERR(rt);
		if (err == -ENETUNREACH)
			IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTNOROUTES);
		return err;
	}

	if (rt->rt_flags & (RTCF_MULTICAST | RTCF_BROADCAST)) {
		ip_rt_put(rt);
		return -ENETUNREACH;
	}

	if (!inet_opt || !inet_opt->opt.srr)
		daddr = fl4->daddr;

	if (!inet->inet_saddr)
		inet->inet_saddr = fl4->saddr;
	sk_rcv_saddr_set(sk, inet->inet_saddr);

	if (tp->rx_opt.ts_recent_stamp && inet->inet_daddr != daddr) {
		/* Reset inherited state */
		tp->rx_opt.ts_recent	   = 0;
		tp->rx_opt.ts_recent_stamp = 0;
		if (likely(!tp->repair))
			WRITE_ONCE(tp->write_seq, 0);
	}

	inet->inet_dport = usin->sin_port;
	sk_daddr_set(sk, daddr);

	inet_csk(sk)->icsk_ext_hdr_len = 0;
	if (inet_opt)
		inet_csk(sk)->icsk_ext_hdr_len = inet_opt->opt.optlen;

	tp->rx_opt.mss_clamp = TCP_MSS_DEFAULT;

	/* Socket identity is still unknown (sport may be zero).
	 * However we set state to SYN-SENT and not releasing socket
	 * lock select source port, enter ourselves into the hash tables and
	 * complete initialization after this.
	 */
	tcp_set_state(sk, TCP_SYN_SENT);
	err = inet_hash_connect(tcp_death_row, sk);
	if (err)
		goto failure;

	sk_set_txhash(sk);

	rt = ip_route_newports(fl4, rt, orig_sport, orig_dport,
			       inet->inet_sport, inet->inet_dport, sk);
	if (IS_ERR(rt)) {
		err = PTR_ERR(rt);
		rt = NULL;
		goto failure;
	}
	/* OK, now commit destination to socket.  */
	sk->sk_gso_type = SKB_GSO_TCPV4;
	sk_setup_caps(sk, &rt->dst);
	rt = NULL;

	if (likely(!tp->repair)) {
		if (!tp->write_seq)
			WRITE_ONCE(tp->write_seq,
				   secure_tcp_seq(inet->inet_saddr,
						  inet->inet_daddr,
						  inet->inet_sport,
						  usin->sin_port));
		tp->tsoffset = secure_tcp_ts_off(sock_net(sk),
						 inet->inet_saddr,
						 inet->inet_daddr);
	}

	inet->inet_id = prandom_u32();

	if (tcp_fastopen_defer_connect(sk, &err))
		return err;
	if (err)
		goto failure;

	err = tcp_connect(sk);

	if (err)
		goto failure;

	return 0;

failure:
	/*
	 * This unhashes the socket and releases the local port,
	 * if necessary.
	 */
	tcp_set_state(sk, TCP_CLOSE);
	ip_rt_put(rt);
	sk->sk_route_caps = 0;
	inet->inet_dport = 0;
	return err;
}
EXPORT_SYMBOL(tcp_v4_connect);

tcp_connect

该函数完成发送 SYN 的作用，其流程如下：

初始化该次连接的 sk 结构 tcp_sock
sk_stream_alloc_skb 分配一个 skb 数据包
tcp_init_nondata_skb 初始化 skb，也就是 SYN 包
skb 加入发送队列后，调用 tcp_transmit_skb 发送该数据包（SYN）
更新 snd_nxt，inet_csk_reset_xmit_timer 启动超时定时器。如果 SYN 发送不成功，则再次发送

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53


/* Build a SYN and send it off. */
int tcp_connect(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct sk_buff *buff;
	int err;

	tcp_call_bpf(sk, BPF_SOCK_OPS_TCP_CONNECT_CB, 0, NULL);

	if (inet_csk(sk)->icsk_af_ops->rebuild_header(sk))
		return -EHOSTUNREACH; /* Routing failure or similar. */

	tcp_connect_init(sk);

	if (unlikely(tp->repair)) {
		tcp_finish_connect(sk, NULL);
		return 0;
	}

	buff = sk_stream_alloc_skb(sk, 0, sk->sk_allocation, true);
	if (unlikely(!buff))
		return -ENOBUFS;

	tcp_init_nondata_skb(buff, tp->write_seq++, TCPHDR_SYN);
	tcp_mstamp_refresh(tp);
	tp->retrans_stamp = tcp_time_stamp(tp);
	tcp_connect_queue_skb(sk, buff);
	tcp_ecn_send_syn(sk, buff);
	tcp_rbtree_insert(&sk->tcp_rtx_queue, buff);

	/* Send off SYN; include data in Fast Open. */
	err = tp->fastopen_req ? tcp_send_syn_data(sk, buff) :
	      tcp_transmit_skb(sk, buff, 1, sk->sk_allocation);
	if (err == -ECONNREFUSED)
		return err;

	/* We change tp->snd_nxt after the tcp_transmit_skb() call
	 * in order to make this packet get counted in tcpOutSegs.
	 */
	WRITE_ONCE(tp->snd_nxt, tp->write_seq);
	tp->pushed_seq = tp->write_seq;
	buff = tcp_send_head(sk);
	if (unlikely(buff)) {
		WRITE_ONCE(tp->snd_nxt, TCP_SKB_CB(buff)->seq);
		tp->pushed_seq	= TCP_SKB_CB(buff)->seq;
	}
	TCP_INC_STATS(sock_net(sk), TCP_MIB_ACTIVEOPENS);

	/* Timer for repeating the SYN until an answer. */
	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
				  inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
	return 0;
}

我们回到 __inet_stream_connect 函数，在调用 sk->sk_prot->connect 之后，inet_wait_for_connect 会一直等待客户端收到服务端的 ACK。而我们知道，服务端在 accept 之后，也是在等待中。

Server 端

为了解析三次握手，我们简单的看网络包接收到 TCP 层做的部分事情。

tcp_v4_rcv

1
2
3
4
5
6
7
8
9


static struct net_protocol tcp_protocol = {
  .early_demux  =  tcp_v4_early_demux,
  .early_demux_handler =  tcp_v4_early_demux,
  .handler  =  tcp_v4_rcv,
  .err_handler  =  tcp_v4_err,
  .no_policy  =  1,
  .netns_ok  =  1,
  .icmp_strict_tag_validation = 1,
}

我们通过 struct net_protocol 结构中的 handler 进行接收，调用的函数是 tcp_v4_rcv。接下来的调用链为 tcp_v4_rcv->tcp_v4_do_rcv->tcp_rcv_state_process。

tcp_rcv_state_process

tcp_rcv_state_process，顾名思义，是用来处理接收一个网络包后引起状态变化的。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23


// net/ipv4/tcp_input.c
int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
{
  struct tcp_sock *tp = tcp_sk(sk);
  struct inet_connection_sock *icsk = inet_csk(sk);
  const struct tcphdr *th = tcp_hdr(skb);
  struct request_sock *req;
  int queued = 0;
  bool acceptable;

  switch (sk->sk_state) {
  /* ... */
  case TCP_LISTEN:
  /* ... */
    if (th->syn) {
      acceptable = icsk->icsk_af_ops->conn_request(sk, skb) >= 0;
      if (!acceptable)
        return 1;
      consume_skb(skb);
      return 0;
    }
  /* ... */
}

目前服务端是处于 TCP_LISTEN 状态的，而且发过来的包是 SYN，因而就有了上面的代码，调用 icsk->icsk_af_ops->conn_request 函数。struct inet_connection_sock 对应的操作是 inet_connection_sock_af_ops，按照下面的定义，其实调用的是 tcp_v4_conn_request。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


const struct inet_connection_sock_af_ops ipv4_specific = {
        .queue_xmit        = ip_queue_xmit,
        .send_check        = tcp_v4_send_check,
        .rebuild_header    = inet_sk_rebuild_header,
        .sk_rx_dst_set     = inet_sk_rx_dst_set,
        .conn_request      = tcp_v4_conn_request,
        .syn_recv_sock     = tcp_v4_syn_recv_sock,
        .net_header_len    = sizeof(struct iphdr),
        .setsockopt        = ip_setsockopt,
        .getsockopt        = ip_getsockopt,
        .addr2sockaddr     = inet_csk_addr2sockaddr,
        .sockaddr_len      = sizeof(struct sockaddr_in),
        .mtu_reduced       = tcp_v4_mtu_reduced,
};

tcp_v4_conn_request

tcp_v4_conn_request 会调用 tcp_conn_request，这个函数也比较长，里面调用了 send_synack，但实际调用的是 tcp_v4_send_synack。具体发送的过程我们不去管它，看注释我们能知道，这是收到了 SYN 后，回复一个 SYN-ACK，回复完毕后，服务端处于 TCP_SYN_RECV。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21



int tcp_conn_request(struct request_sock_ops *rsk_ops,
         const struct tcp_request_sock_ops *af_ops,
         struct sock *sk, struct sk_buff *skb)
{
 /* ... */
    af_ops->send_synack(sk, dst, &fl, req, &foc,
            !want_cookie ? TCP_SYNACK_NORMAL :
               TCP_SYNACK_COOKIE);
 /* ... */
}

/*
 *  Send a SYN-ACK after having received a SYN.
 */
static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst,
            struct flowi *fl,
            struct request_sock *req,
            struct tcp_fastopen_cookie *foc,
            enum tcp_synack_type synack_type)
{......}

这个时候，轮到客户端接收网络包了。都是 TCP 协议栈，所以过程和服务端没有太多区别，还是会走到 tcp_rcv_state_process 函数的，只不过由于客户端目前处于 TCP_SYN_SENT 状态，就进入了下面的代码分支。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25


int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
{
  struct tcp_sock *tp = tcp_sk(sk);
  struct inet_connection_sock *icsk = inet_csk(sk);
  const struct tcphdr *th = tcp_hdr(skb);
  struct request_sock *req;
  int queued = 0;
  bool acceptable;

  switch (sk->sk_state) {
 /* ... */
  case TCP_SYN_SENT:
    tp->rx_opt.saw_tstamp = 0;
    tcp_mstamp_refresh(tp);
    queued = tcp_rcv_synsent_state_process(sk, skb, th);
    if (queued >= 0)
      return queued;
    /* Do step6 onward by hand. */
    tcp_urg(sk, skb, th);
    __kfree_skb(skb);
    tcp_data_snd_check(sk);
    return 0;
  }
 /* ... */
}

tcp_rcv_synsent_state_process 会调用 tcp_send_ack，发送一个 ACK-ACK，发送后客户端处于 TCP_ESTABLISHED 状态。

又轮到服务端接收网络包了，我们还是归 tcp_rcv_state_process 函数处理。由于服务端目前处于状态 TCP_SYN_RECV 状态，因而又走了另外的分支。当收到这个网络包的时候，服务端也处于 TCP_ESTABLISHED 状态，三次握手结束。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36



int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
{
  struct tcp_sock *tp = tcp_sk(sk);
  struct inet_connection_sock *icsk = inet_csk(sk);
  const struct tcphdr *th = tcp_hdr(skb);
  struct request_sock *req;
  int queued = 0;
  bool acceptable;
 /* ... */
  switch (sk->sk_state) {
  case TCP_SYN_RECV:
    if (req) {
      inet_csk(sk)->icsk_retransmits = 0;
      reqsk_fastopen_remove(sk, req, false);
    } else {
      /* Make sure socket is routed, for correct metrics. */
      icsk->icsk_af_ops->rebuild_header(sk);
      tcp_call_bpf(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
      tcp_init_congestion_control(sk);

      tcp_mtup_init(sk);
      tp->copied_seq = tp->rcv_nxt;
      tcp_init_buffer_space(sk);
    }
    smp_mb();
    tcp_set_state(sk, TCP_ESTABLISHED);
    sk->sk_state_change(sk);
    if (sk->sk_socket)
      sk_wake_async(sk, SOCK_WAKE_IO, POLL_OUT);
    tp->snd_una = TCP_SKB_CB(skb)->ack_seq;
    tp->snd_wnd = ntohs(th->window) << tp->rx_opt.snd_wscale;
    tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
    break;
 /* ... */
}

Send

数据结构

msghdr

1
2
3
4
5
6
7
8
9


struct msghdr {
	void		*msg_name;	/* ptr to socket address structure */
	int		msg_namelen;	/* size of socket address structure */
	struct iov_iter	msg_iter;	/* data */
	void		*msg_control;	/* ancillary data */
	__kernel_size_t	msg_controllen;	/* ancillary data buffer length */
	unsigned int	msg_flags;	/* flags on received message */
	struct kiocb	*msg_iocb;	/* ptr to iocb for async requests */
};

kiocb

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16


struct kiocb {
	struct file		*ki_filp;

	/* The 'ki_filp' pointer is shared in a union for aio */
	randomized_struct_fields_start

	loff_t			ki_pos;
	void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);
	void			*private;
	int			ki_flags;
	u16			ki_hint;
	u16			ki_ioprio; /* See linux/ioprio.h */
	unsigned int		ki_cookie; /* for ->iopoll */

	randomized_struct_fields_end
};

iovec

1
2
3
4
5


struct iovec
  {
    void *iov_base;	/* Pointer to data.  */
    size_t iov_len;	/* Length of data.  */
  };

实现

系统调用

1
2
3
4
5
6


SYSCALL_DEFINE6(sendto, int, fd, void __user *, buff, size_t, len,
		unsigned int, flags, struct sockaddr __user *, addr,
		int, addr_len)
{
	return __sys_sendto(fd, buff, len, flags, addr, addr_len);
}

__sys_sendto

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38


int __sys_sendto(int fd, void __user *buff, size_t len, unsigned int flags,
		 struct sockaddr __user *addr,  int addr_len)
{
	struct socket *sock;
	struct sockaddr_storage address;
	int err;
	struct msghdr msg;
	struct iovec iov;
	int fput_needed;

	err = import_single_range(WRITE, buff, len, &iov, &msg.msg_iter);
	if (unlikely(err))
		return err;
	sock = sockfd_lookup_light(fd, &err, &fput_needed);
	if (!sock)
		goto out;

	msg.msg_name = NULL;
	msg.msg_control = NULL;
	msg.msg_controllen = 0;
	msg.msg_namelen = 0;
	if (addr) {
		err = move_addr_to_kernel(addr, addr_len, &address);
		if (err < 0)
			goto out_put;
		msg.msg_name = (struct sockaddr *)&address;
		msg.msg_namelen = addr_len;
	}
	if (sock->file->f_flags & O_NONBLOCK)
		flags |= MSG_DONTWAIT;
	msg.msg_flags = flags;
	err = sock_sendmsg(sock, &msg);

out_put:
	fput_light(sock->file, fput_needed);
out:
	return err;
}

sock_sendmsg

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16


int sock_sendmsg(struct socket *sock, struct msghdr *msg)
{
	int err = security_socket_sendmsg(sock, msg,
					  msg_data_left(msg));

	return err ?: sock_sendmsg_nosec(sock, msg);
}

static inline int sock_sendmsg_nosec(struct socket *sock, struct msghdr *msg)
{
	int ret = INDIRECT_CALL_INET(sock->ops->sendmsg, inet6_sendmsg,
				     inet_sendmsg, sock, msg,
				     msg_data_left(msg));
	BUG_ON(ret == -EIOCBQUEUED);
	return ret;
}

最终调用了 sock->ops->sendmsg，也就是 inet_sendmsg

inet_sendmsg

可以看到，最终调用的是 sk_prot 的 sendmsg 方法，对于 TCP 就是 tcp_sendmsg

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


// net/ipv4/af_inet.c
int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
{
	struct sock *sk = sock->sk;

	if (unlikely(inet_send_prepare(sk)))
		return -EAGAIN;

	return INDIRECT_CALL_2(sk->sk_prot->sendmsg, tcp_sendmsg, udp_sendmsg,
			       sk, msg, size);
}

tcp_sendmsg

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97


int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
{
  struct tcp_sock *tp = tcp_sk(sk);
  struct sk_buff *skb;
  int flags, err, copied = 0;
  int mss_now = 0, size_goal, copied_syn = 0;
  long timeo;
 /* ... */

  /* Ok commence sending. */
  copied = 0;
restart:
  mss_now = tcp_send_mss(sk, &size_goal, flags);

  while (msg_data_left(msg)) {
    int copy = 0;
    int max = size_goal;

    skb = tcp_write_queue_tail(sk);
    if (tcp_send_head(sk)) {
      if (skb->ip_summed == CHECKSUM_NONE)
        max = mss_now;
      copy = max - skb->len;
    }

    if (copy <= 0 || !tcp_skb_can_collapse_to(skb)) {
      bool first_skb;

new_segment:
      /* Allocate new segment. If the interface is SG,
       * allocate skb fitting to single page.
       */
      if (!sk_stream_memory_free(sk))
        goto wait_for_sndbuf;
 /* ... */
      first_skb = skb_queue_empty(&sk->sk_write_queue);
      skb = sk_stream_alloc_skb(sk,
              select_size(sk, sg, first_skb),
              sk->sk_allocation,
              first_skb);
 /* ... */
      skb_entail(sk, skb);
      copy = size_goal;
      max = size_goal;
 /* ... */
    }

    /* Try to append data to the end of skb. */
    if (copy > msg_data_left(msg))
      copy = msg_data_left(msg);

    /* Where to copy to? */
    if (skb_availroom(skb) > 0) {
      /* We have some space in skb head. Superb! */
      copy = min_t(int, copy, skb_availroom(skb));
      err = skb_add_data_nocache(sk, skb, &msg->msg_iter, copy);
 /* ... */
    } else {
      bool merge = true;
      int i = skb_shinfo(skb)->nr_frags;
      struct page_frag *pfrag = sk_page_frag(sk);
 /* ... */
      copy = min_t(int, copy, pfrag->size - pfrag->offset);
 /* ... */
      err = skb_copy_to_page_nocache(sk, &msg->msg_iter, skb,
                   pfrag->page,
                   pfrag->offset,
                   copy);
 /* ... */
      pfrag->offset += copy;
    }

 /* ... */
    tp->write_seq += copy;
    TCP_SKB_CB(skb)->end_seq += copy;
    tcp_skb_pcount_set(skb, 0);

    copied += copy;
    if (!msg_data_left(msg)) {
      if (unlikely(flags & MSG_EOR))
        TCP_SKB_CB(skb)->eor = 1;
      goto out;
    }

    if (skb->len < max || (flags & MSG_OOB) || unlikely(tp->repair))
      continue;

    if (forced_push(tp)) {
      tcp_mark_push(tp, skb);
      __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH);
    } else if (skb == tcp_send_head(sk))
      tcp_push_one(sk, mss_now);
    continue;
 /* ... */
  }
 /* ... */
}

tcp_sendmsg 的实现还是很复杂的，这里面做了这样几件事情。

msg 是用户要写入的数据，这个数据要拷贝到内核协议栈里面去发送；在内核协议栈里面，网络包的数据都是由 struct sk_buff 维护的，因而第一件事情就是找到一个空闲的内存空间，将用户要写入的数据，拷贝到 struct sk_buff 的管辖范围内。而第二件事情就是发送 struct sk_buff。
在 tcp_sendmsg 中，我们首先通过强制类型转换，将 sock 结构转换为 struct tcp_sock，这个是维护 TCP 连接状态的重要数据结构。
接下来是 tcp_sendmsg 的第一件事情，把数据拷贝到 struct sk_buff。

我们先声明一个变量 copied，初始化为 0，这表示拷贝了多少数据。紧接着是一个循环，while (msg_data_left(msg))，也即如果用户的数据没有发送完毕，就一直循环。循环里声明了一个 copy 变量，表示这次拷贝的数值，在循环的最后有 copied += copy，将每次拷贝的数量都加起来。

我们这里只需要看一次循环做了哪些事情。

tcp_write_queue_tail 从 TCP 写入队列 sk_write_queue 中拿出最后一个 struct sk_buff，在这个写入队列中排满了要发送的 struct sk_buff，为什么要拿最后一个呢？这里面只有最后一个，可能会因为上次用户给的数据太少，而没有填满。
tcp_send_mss 会计算 MSS，也即 Max Segment Size。这是什么呢？这个意思是说，我们在网络上传输的网络包的大小是有限制的，而这个限制在最底层开始就有。
如果 copy 小于 0，说明最后一个 struct sk_buff 已经没地方存放了，需要调用 sk_stream_alloc_skb，重新分配 struct sk_buff，然后调用 skb_entail，将新分配的 sk_buff 放到队列尾部。
为了减少内存拷贝的代价，有的网络设备支持分散聚合（Scatter/Gather）I/O，顾名思义，就是 IP 层没必要通过内存拷贝进行聚合，让散的数据零散的放在原处，在设备层进行聚合。在注释 /_ Where to copy to? _/ 后面有个 if-else 分支。
1. if 分支就是 skb_add_data_nocache 将数据拷贝到连续的数据区域。
2. else 分支就是 skb_copy_to_page_nocache 将数据拷贝到 struct skb_shared_info 结构指向的不需要连续的页面区域。
就是要发生网络包了。
1. 第一种情况是积累的数据报数目太多了，因而我们需要通过调用 __tcp_push_pending_frames 发送网络包。
2. 第二种情况是，这是第一个网络包，需要马上发送，调用 tcp_push_one
3. 无论 __tcp_push_pending_frames 还是 tcp_push_one，都会调用 tcp_write_xmit 发送网络包。

tcp_write_xmit

这里面主要的逻辑是一个循环，用来处理发送队列，只要队列不空，就会发送。

主要涉及 TSO 和滑动窗口拥塞控制相关的部分，此处不介绍了，最终调用了 tcp_transmit_skb。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45


static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle, int push_one, gfp_t gfp)
{
  struct tcp_sock *tp = tcp_sk(sk);
  struct sk_buff *skb;
  unsigned int tso_segs, sent_pkts;
  int cwnd_quota;
 /* ... */
  max_segs = tcp_tso_segs(sk, mss_now);
  while ((skb = tcp_send_head(sk))) {
    unsigned int limit;
 /* ... */
    tso_segs = tcp_init_tso_segs(skb, mss_now);
 /* ... */
    cwnd_quota = tcp_cwnd_test(tp, skb);
 /* ... */
    if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now))) {
      is_rwnd_limited = true;
      break;
    }
 /* ... */
    limit = mss_now;
        if (tso_segs > 1 && !tcp_urg_mode(tp))
            limit = tcp_mss_split_point(sk, skb, mss_now, min_t(unsigned int, cwnd_quota, max_segs), nonagle);

    if (skb->len > limit &&
        unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
      break;
 /* ... */
    if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
      break;

repair:
    /* Advance the send_head.  This one is sent out.
     * This call will increment packets_out.
     */
    tcp_event_new_data_sent(sk, skb);

    tcp_minshall_update(tp, mss_now, skb);
    sent_pkts += tcp_skb_pcount(skb);

    if (push_one)
      break;
  }
 /* ... */
}

tcp_transmit_skb

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39



static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
                gfp_t gfp_mask)
{
    const struct inet_connection_sock *icsk = inet_csk(sk);
    struct inet_sock *inet;
    struct tcp_sock *tp;
    struct tcp_skb_cb *tcb;
    struct tcphdr *th;
    int err;

    tp = tcp_sk(sk);

    skb->skb_mstamp = tp->tcp_mstamp;
    inet = inet_sk(sk);
    tcb = TCP_SKB_CB(skb);
    memset(&opts, 0, sizeof(opts));

    tcp_header_size = tcp_options_size + sizeof(struct tcphdr);
    skb_push(skb, tcp_header_size);

    /* Build TCP header and checksum it. */
    th = (struct tcphdr *)skb->data;
    th->source      = inet->inet_sport;
    th->dest        = inet->inet_dport;
    th->seq         = htonl(tcb->seq);
    th->ack_seq     = htonl(tp->rcv_nxt);
    *(((__be16 *)th) + 6)   = htons(((tcp_header_size >> 2) << 12) |
                    tcb->tcp_flags);

    th->check       = 0;
    th->urg_ptr     = 0;
 /* ... */
    tcp_options_write((__be32 *)(th + 1), tp, &opts);
    th->window  = htons(min(tp->rcv_wnd, 65535U));
 /* ... */
    err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl);
 /* ... */
}

tcp_transmit_skb 这个函数比较长，主要做了两件事情：

填充 TCP 头
会调用 icsk_af_ops 的 queue_xmit 方法
- icsk_af_ops 指向 ipv4_specific，也即调用的是 ip_queue_xmit 函数

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15



const struct inet_connection_sock_af_ops ipv4_specific = {
        .queue_xmit        = ip_queue_xmit,
        .send_check        = tcp_v4_send_check,
        .rebuild_header    = inet_sk_rebuild_header,
        .sk_rx_dst_set     = inet_sk_rx_dst_set,
        .conn_request      = tcp_v4_conn_request,
        .syn_recv_sock     = tcp_v4_syn_recv_sock,
        .net_header_len    = sizeof(struct iphdr),
        .setsockopt        = ip_setsockopt,
        .getsockopt        = ip_getsockopt,
        .addr2sockaddr     = inet_csk_addr2sockaddr,
        .sockaddr_len      = sizeof(struct sockaddr_in),
        .mtu_reduced       = tcp_v4_mtu_reduced,
};

Recv

数据到达

tcp_v4_rcv

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50



int tcp_v4_rcv(struct sk_buff *skb)
{
  struct net *net = dev_net(skb->dev);
  const struct iphdr *iph;
  const struct tcphdr *th;
  bool refcounted;
  struct sock *sk;
  int ret;
......
  th = (const struct tcphdr *)skb->data;
  iph = ip_hdr(skb);
......
  TCP_SKB_CB(skb)->seq = ntohl(th->seq);
  TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin + skb->len - th->doff * 4);
  TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
  TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(th);
  TCP_SKB_CB(skb)->tcp_tw_isn = 0;
  TCP_SKB_CB(skb)->ip_dsfield = ipv4_get_dsfield(iph);
  TCP_SKB_CB(skb)->sacked   = 0;

lookup:
  sk = __inet_lookup_skb(&tcp_hashinfo, skb, __tcp_hdrlen(th), th->source, th->dest, &refcounted);

process:
  if (sk->sk_state == TCP_TIME_WAIT)
    goto do_time_wait;

  if (sk->sk_state == TCP_NEW_SYN_RECV) {
......
  }
......
  th = (const struct tcphdr *)skb->data;
  iph = ip_hdr(skb);

  skb->dev = NULL;

  if (sk->sk_state == TCP_LISTEN) {
    ret = tcp_v4_do_rcv(sk, skb);
    goto put_and_return;
  }
......
  if (!sock_owned_by_user(sk)) {
    if (!tcp_prequeue(sk, skb))
      ret = tcp_v4_do_rcv(sk, skb);
  } else if (tcp_add_backlog(sk, skb)) {
    goto discard_and_relse;
  }
......
}

在 tcp_v4_rcv 中，得到 TCP 的头之后，我们可以开始处理 TCP 层的事情。因为 TCP 层是分状态的，状态被维护在数据结构 struct sock 里面，因而我们要根据 IP 地址以及 TCP 头里面的内容，在 tcp_hashinfo 中找到这个包对应的 struct sock，从而得到这个包对应的连接的状态。

接下来，我们就根据不同的状态做不同的处理，例如，上面代码中的 TCP_LISTEN、TCP_NEW_SYN_RECV 状态属于连接建立过程中。这个我们在讲三次握手的时候讲过了。再如，TCP_TIME_WAIT 状态是连接结束的时候的状态，这个我们暂时可以不用看。

接下来，我们来分析最主流的网络包的接收过程，这里面涉及三个队列：

backlog 队列
prequeue 队列
sk_receive_queue 队列

为什么接收网络包的过程，需要在这三个队列里面倒腾过来、倒腾过去呢？这是因为，同样一个网络包要在三个主体之间交接。

第一个主体是软中断的处理过程。如果你没忘记的话，我们在执行 tcp_v4_rcv 函数的时候，依然处于软中断的处理逻辑里，所以必然会占用这个软中断。
第二个主体就是用户态进程。如果用户态触发系统调用 read 读取网络包，也要从队列里面找。
第三个主体就是内核协议栈。哪怕用户进程没有调用 read，读取网络包，当网络包来的时候，也得有一个地方收着呀。

这时候，我们就能够了解上面代码中 sock_owned_by_user 的意思了，其实就是说，当前这个 sock 是不是正有一个用户态进程等着读数据呢，如果没有，内核协议栈也调用 tcp_add_backlog，暂存在 backlog 队列中，并且抓紧离开软中断的处理过程。

如果有一个用户态进程等待读取数据呢？我们先调用 tcp_prequeue，也即赶紧放入 prequeue 队列，并且离开软中断的处理过程。在这个函数里面，我们会看到对于 sysctl_tcp_low_latency 的判断，也即是不是要低时延地处理网络包。

如果把 sysctl_tcp_low_latency 设置为 0，那就要放在 prequeue 队列中暂存，这样不用等待网络包处理完毕，就可以离开软中断的处理过程，但是会造成比较长的时延。如果把 sysctl_tcp_low_latency 设置为 1，我们还是调用 tcp_v4_do_rcv。

tcp_v4_do_rcv

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18



int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
{
  struct sock *rsk;

  if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
    struct dst_entry *dst = sk->sk_rx_dst;
......
    tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len);
    return 0;
  }
......
  if (tcp_rcv_state_process(sk, skb)) {
......
  }
  return 0;
......
}

在 tcp_v4_do_rcv 中，分两种情况，一种情况是连接已经建立，处于 TCP_ESTABLISHED 状态，调用 tcp_rcv_established。另一种情况，就是其他的状态，调用 tcp_rcv_state_process。

tcp_v4_established

在连接状态下，我们会调用 tcp_rcv_established。在这个函数里面，我们会调用 tcp_data_queue，将其放入 sk_receive_queue 队列进行处理。

tcp_data_queue

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73



static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
{
  struct tcp_sock *tp = tcp_sk(sk);
  bool fragstolen = false;
......
  if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt) {
    if (tcp_receive_window(tp) == 0)
      goto out_of_window;

    /* Ok. In sequence. In window. */
    if (tp->ucopy.task == current &&
        tp->copied_seq == tp->rcv_nxt && tp->ucopy.len &&
        sock_owned_by_user(sk) && !tp->urg_data) {
      int chunk = min_t(unsigned int, skb->len,
            tp->ucopy.len);

      __set_current_state(TASK_RUNNING);

      if (!skb_copy_datagram_msg(skb, 0, tp->ucopy.msg, chunk)) {
        tp->ucopy.len -= chunk;
        tp->copied_seq += chunk;
        eaten = (chunk == skb->len);
        tcp_rcv_space_adjust(sk);
      }
    }

    if (eaten <= 0) {
queue_and_out:
......
      eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
    }
    tcp_rcv_nxt_update(tp, TCP_SKB_CB(skb)->end_seq);
......
    if (!RB_EMPTY_ROOT(&tp->out_of_order_queue)) {
      tcp_ofo_queue(sk);
......
    }
......
    return;
  }

  if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
    /* A retransmit, 2nd most common case.  Force an immediate ack. */
    tcp_dsack_set(sk, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);

out_of_window:
    tcp_enter_quickack_mode(sk);
    inet_csk_schedule_ack(sk);
drop:
    tcp_drop(sk, skb);
    return;
  }

  /* Out of window. F.e. zero window probe. */
  if (!before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt + tcp_receive_window(tp)))
    goto out_of_window;

  tcp_enter_quickack_mode(sk);

  if (before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
    /* Partial packet, seq < rcv_next < end_seq */
    tcp_dsack_set(sk, TCP_SKB_CB(skb)->seq, tp->rcv_nxt);
    /* If window is closed, drop tail of packet. But after
     * remembering D-SACK for its head made in previous line.
     */
    if (!tcp_receive_window(tp))
      goto out_of_window;
    goto queue_and_out;
  }

  tcp_data_queue_ofo(sk, skb);
}

在 tcp_data_queue 中，对于收到的网络包，我们要分情况进行处理。

第一种情况，seq == tp->rcv_nxt，说明来的网络包正是我服务端期望的下一个网络包。这个时候我们判断 sock_owned_by_user，也即用户进程也是正在等待读取，这种情况下，就直接 skb_copy_datagram_msg，将网络包拷贝给用户进程就可以了。

如果用户进程没有正在等待读取，或者因为内存原因没有能够拷贝成功，tcp_queue_rcv 里面还是将网络包放入 sk_receive_queue 队列。

接下来，tcp_rcv_nxt_update 将 tp->rcv_nxt 设置为 end_seq，也即当前的网络包接收成功后，更新下一个期待的网络包。

这个时候，我们还会判断一下另一个队列，out_of_order_queue，也看看乱序队列的情况，看看乱序队列里面的包，会不会因为这个新的网络包的到来，也能放入到 sk_receive_queue 队列中。

实现

系统调用

1
2
3
4
5


SYSCALL_DEFINE4(recv, int, fd, void __user *, ubuf, size_t, size,
		unsigned int, flags)
{
	return __sys_recvfrom(fd, ubuf, size, flags, NULL, NULL);
}

__sys_recvfrom

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40


int __sys_recvfrom(int fd, void __user *ubuf, size_t size, unsigned int flags,
		   struct sockaddr __user *addr, int __user *addr_len)
{
	struct socket *sock;
	struct iovec iov;
	struct msghdr msg;
	struct sockaddr_storage address;
	int err, err2;
	int fput_needed;

	err = import_single_range(READ, ubuf, size, &iov, &msg.msg_iter);
	if (unlikely(err))
		return err;
	sock = sockfd_lookup_light(fd, &err, &fput_needed);
	if (!sock)
		goto out;

	msg.msg_control = NULL;
	msg.msg_controllen = 0;
	/* Save some cycles and don't copy the address if not needed */
	msg.msg_name = addr ? (struct sockaddr *)&address : NULL;
	/* We assume all kernel code knows the size of sockaddr_storage */
	msg.msg_namelen = 0;
	msg.msg_iocb = NULL;
	msg.msg_flags = 0;
	if (sock->file->f_flags & O_NONBLOCK)
		flags |= MSG_DONTWAIT;
	err = sock_recvmsg(sock, &msg, flags);

	if (err >= 0 && addr != NULL) {
		err2 = move_addr_to_user(&address,
					 msg.msg_namelen, addr, addr_len);
		if (err2 < 0)
			err = err2;
	}

	fput_light(sock->file, fput_needed);
out:
	return err;
}

sock_recvmsg

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


int sock_recvmsg(struct socket *sock, struct msghdr *msg, int flags)
{
	int err = security_socket_recvmsg(sock, msg, msg_data_left(msg), flags);

	return err ?: sock_recvmsg_nosec(sock, msg, flags);
}

static inline int sock_recvmsg_nosec(struct socket *sock, struct msghdr *msg,
				     int flags)
{
	return INDIRECT_CALL_INET(sock->ops->recvmsg, inet6_recvmsg,
				  inet_recvmsg, sock, msg, msg_data_left(msg),
				  flags);
}

最终调用了 sock->ops->recvmsg，也就是 inet_recvmsg

inet_recvmsg

可以看到，最终调用的是 sk_prot 的 recvmsg 方法，对于 TCP 就是 tcp_recvmsg

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17


int inet_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
		 int flags)
{
	struct sock *sk = sock->sk;
	int addr_len = 0;
	int err;

	if (likely(!(flags & MSG_ERRQUEUE)))
		sock_rps_record_flow(sk);

	err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg,
			      sk, msg, size, flags & MSG_DONTWAIT,
			      flags & ~MSG_DONTWAIT, &addr_len);
	if (err >= 0)
		msg->msg_namelen = addr_len;
	return err;
}

tcp_recvmsg

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99


int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
    int flags, int *addr_len)
{
  struct tcp_sock *tp = tcp_sk(sk);
  int copied = 0;
  u32 peek_seq;
  u32 *seq;
  unsigned long used;
  int err;
  int target;    /* Read at least this many bytes */
  long timeo;
  struct task_struct *user_recv = NULL;
  struct sk_buff *skb, *last;
.....
  do {
    u32 offset;
......
    /* Next get a buffer. */
    last = skb_peek_tail(&sk->sk_receive_queue);
    skb_queue_walk(&sk->sk_receive_queue, skb) {
      last = skb;
      offset = *seq - TCP_SKB_CB(skb)->seq;
      if (offset < skb->len)
        goto found_ok_skb;
......
    }
......
    if (!sysctl_tcp_low_latency && tp->ucopy.task == user_recv) {
      /* Install new reader */
      if (!user_recv && !(flags & (MSG_TRUNC | MSG_PEEK))) {
        user_recv = current;
        tp->ucopy.task = user_recv;
        tp->ucopy.msg = msg;
      }

      tp->ucopy.len = len;
      /* Look: we have the following (pseudo)queues:
       *
       * 1. packets in flight
       * 2. backlog
       * 3. prequeue
       * 4. receive_queue
       *
       * Each queue can be processed only if the next ones
       * are empty.
       */
      if (!skb_queue_empty(&tp->ucopy.prequeue))
        goto do_prequeue;
    }

    if (copied >= target) {
      /* Do not sleep, just process backlog. */
      release_sock(sk);
      lock_sock(sk);
    } else {
      sk_wait_data(sk, &timeo, last);
    }

    if (user_recv) {
      int chunk;
      chunk = len - tp->ucopy.len;
      if (chunk != 0) {
        len -= chunk;
        copied += chunk;
      }

      if (tp->rcv_nxt == tp->copied_seq &&
          !skb_queue_empty(&tp->ucopy.prequeue)) {
do_prequeue:
        tcp_prequeue_process(sk);

        chunk = len - tp->ucopy.len;
        if (chunk != 0) {
          len -= chunk;
          copied += chunk;
        }
      }
    }
    continue;
  found_ok_skb:
    /* Ok so how much can we use? */
    used = skb->len - offset;
    if (len < used)
      used = len;

    if (!(flags & MSG_TRUNC)) {
      err = skb_copy_datagram_msg(skb, offset, msg, used);
......
    }

    *seq += used;
    copied += used;
    len -= used;

    tcp_rcv_space_adjust(sk);
......
  } while (len > 0);
......
}

tcp_recvmsg 这个函数比较长，里面逻辑也很复杂，好在里面有一段注释概括了这里面的逻辑。注释里面提到了三个队列，receive_queue 队列、prequeue 队列和 backlog 队列。这里面，我们需要把前一个队列处理完毕，才处理后一个队列。

tcp_recvmsg 的整个逻辑也是这样执行的：这里面有一个 while 循环，不断地读取网络包。

这里，我们会先处理 sk_receive_queue 队列。如果找到了网络包，就跳到 found_ok_skb 这里。这里会调用 skb_copy_datagram_msg，将网络包拷贝到用户进程中，然后直接进入下一层循环。

直到 sk_receive_queue 队列处理完毕，我们才到了 sysctl_tcp_low_latency 判断。如果不需要低时延，则会有 prequeue 队列。于是，我们能就跳到 do_prequeue 这里，调用 tcp_prequeue_process 进行处理。

如果 sysctl_tcp_low_latency 设置为 1，也即没有 prequeue 队列，或者 prequeue 队列为空，则需要处理 backlog 队列，在 release_sock 函数中处理。

release_sock 会调用 __release_sock，这里面会依次处理队列中的网络包。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27


void release_sock(struct sock *sk)
{
......
  if (sk->sk_backlog.tail)
    __release_sock(sk);
......
}

static void __release_sock(struct sock *sk)
  __releases(&sk->sk_lock.slock)
  __acquires(&sk->sk_lock.slock)
{
  struct sk_buff *skb, *next;

  while ((skb = sk->sk_backlog.head) != NULL) {
    sk->sk_backlog.head = sk->sk_backlog.tail = NULL;
    do {
      next = skb->next;
      prefetch(next);
      skb->next = NULL;
      sk_backlog_rcv(sk, skb);
      cond_resched();
      skb = next;
    } while (skb != NULL);
  }
......
}

最后，哪里都没有网络包，我们只好调用 sk_wait_data，继续等待在哪里，等待网络包的到来。

至此，网络包的接收过程到此结束。

参考资料

TCP/IP Architecture, Design and Implementation in Linux：Chapter 3
Linux 4.4 之后 TCP 三路握手的新流程