The Essence of Containers

With the arrival of 5G and the continuing evolution of internet and communications technology, Kubernetes, championed by Google, has become the de facto industry standard for orchestrating containers. The objects Kubernetes controls are containers — so what exactly is a container? What is its essence? Why is it reputed to be more lightweight than a virtual machine? Why is the container's foundation said to carry security risks, and where do those risks come from? Why does the Linux kernel itself have no notion of a "container" at all? Understanding these questions matters for anyone who operates or develops on compute clusters, and the answers start with creating a child process.

Creating an Isolated Process

The Linux fork system call creates a child process and takes no arguments, whereas the clone system call does take arguments — and its parameter list includes flag bits used for process isolation. In the C code below, the parent calls clone to create a child and then blocks in waitpid. The first argument to clone, the function pointer container, is the child's entry point. The call passes five namespace flags to isolate the newly created process.

Flag          Effect
CLONE_NEWNET  The child gets its own network namespace.
CLONE_NEWPID  The child becomes PID 1 in its own PID namespace, with an independent process tree.
CLONE_NEWIPC  The child gets its own inter-process communication (IPC) namespace.
CLONE_NEWUTS  The child gets its own UTS namespace, i.e. its own hostname.
CLONE_NEWNS   The child gets its own mount namespace, an independent view of the filesystem.
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/mount.h>
#include <stdio.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>

#define SIZE (1024 * 1024)
static char stack[SIZE];          /* stack for the cloned child */

int clone_flags = CLONE_NEWNET | CLONE_NEWPID |
                  CLONE_NEWIPC | CLONE_NEWUTS |
                  CLONE_NEWNS  | SIGCHLD;

char* const childprocess[] = { "/bin/bash", NULL };

/* Entry point of the cloned child: set a hostname, remount /proc so
 * tools like ps see the new PID namespace, then exec a shell. */
int container(void* arg) {
    printf("Inside the container, current PID is %d.\n", getpid());
    sethostname("container", 9);
    mount("proc", "/proc", "proc", 0, NULL);
    execv(childprocess[0], childprocess);
    return 1;                     /* reached only if execv fails */
}

int main(int argc, char** argv) {
    printf("In Parent, Container started\n");
    /* The stack grows downward on x86, so pass the high end of the buffer. */
    int childpid = clone(container, stack + SIZE, clone_flags, NULL);
    printf("Container's PID is %d.\n", childpid);
    waitpid(childpid, NULL, 0);
    printf("In Parent, Container stopped.\n");
    /* The child's /proc mount can leak to the host via shared mount
     * propagation; remount /proc here to restore the host's view. */
    mount("proc", "/proc", "proc", 0, NULL);
    return 0;
}

Before creating the child process, inspect the host's process tree, network devices, hostname, and IPC objects:

root@vbox:/share# pstree
systemd─┬─VBoxService───7*[{VBoxService}]
├─accounts-daemon───2*[{accounts-daemon}]
├─atd
├─cron
├─dbus-daemon
├─login───bash───main───bash
├─lvmetad
├─lxcfs───2*[{lxcfs}]
├─networkd-dispat───{networkd-dispat}
├─polkitd───2*[{polkitd}]
├─rsyslogd───3*[{rsyslogd}]
├─snapd───8*[{snapd}]
├─sshd───sshd───sshd───bash───su───bash───pstree
├─systemd───(sd-pam)
├─systemd-journal
├─systemd-logind
├─systemd-network
├─systemd-resolve
└─systemd-udevd

root@vbox:/share# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 08:00:27:9f:34:d0 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.108/24 brd 192.168.0.255 scope global dynamic enp0s3
valid_lft 6323sec preferred_lft 6323sec
inet6 fe80::a00:27ff:fe9f:34d0/64 scope link
valid_lft forever preferred_lft forever

root@vbox:/share# hostname
vbox

root@vbox:/share# ipcs
------ Message Queues --------
key msqid owner perms used-bytes messages
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x00002234 0 root 666 68 0
------ Semaphore Arrays --------
key semid owner perms nsems

Compile and run the program to create the child process. Note how the shell prompt changes.

root@vbox:/share# ./main
In Parent, Container started
Container's PID is 1624.
Inside the container, current PID is 1.
container:/share#

After creating the child, inspect the same things from inside it: the process tree, network devices, hostname, and IPC objects. The process believes it is PID 1; the only network device is a loopback interface that is not up; the hostname has changed to container; and the shared memory segment listed earlier is gone from the IPC output. At this point we are effectively running inside a container.

container:/share# pstree
tcsh───pstree

container:/share# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 07:23 pts/0 00:00:00 -bin/tcsh
root 4 1 0 07:23 pts/0 00:00:00 ps -ef

container:/share# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

container:/share# hostname
container

container:/share# ipcs
------ Message Queues --------
key msqid owner perms used-bytes messages
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
------ Semaphore Arrays --------
key semid owner perms nsems

Building an Independent Network for the Process

Next, let's give the child process a network connection similar to Docker's. Docker typically creates a bridge (named docker0 by default) that connects its "containers" to each other and to the "host", which is how Docker gives each container an independent network namespace that can still reach the outside. We can build a NAT-style network the same way.

# Run in the parent's network namespace

# Create the bridge totobridge0
brctl addbr totobridge0
ifconfig totobridge0 up
# Create the veth pair VethA/VethB: attach VethA to the bridge,
# give VethB an IP address
ip link add VethA type veth peer name VethB
brctl addif totobridge0 VethA
ip link set VethA up
ifconfig VethB 192.168.10.1/24 up
# Create the veth pair VethC/VethD: attach VethC to the bridge, move
# VethD into the network namespace identified by the container's PID
ip link add VethC type veth peer name VethD
brctl addif totobridge0 VethC
ip link set VethC up
ip link set dev VethD netns ${container's pid}

# Run inside the child's network namespace
ip link set VethD up
ip addr add 192.168.10.2/24 dev VethD

Once these commands complete, the topology is in place: both veth pairs hang off totobridge0, with VethB holding an address on the host side and VethD holding one inside the container. The two network namespaces now look like this:

# Network devices on the host
root@vbox:/share# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 08:00:27:9f:34:d0 brd ff:ff:ff:ff:ff:ff
inet 192.168.1.95/24 brd 192.168.1.255 scope global dynamic enp0s3
valid_lft 85829sec preferred_lft 85829sec
inet6 fe80::a00:27ff:fe9f:34d0/64 scope link
valid_lft forever preferred_lft forever
3: totobridge0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether ae:11:7c:70:05:c3 brd ff:ff:ff:ff:ff:ff
inet6 fe80::e4b7:f6ff:fe25:e2aa/64 scope link
valid_lft forever preferred_lft forever
4: VethB@VethA: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether d2:81:47:d5:7c:d4 brd ff:ff:ff:ff:ff:ff
inet 192.168.10.1/24 brd 192.168.10.255 scope global VethB
valid_lft forever preferred_lft forever
inet6 fe80::d081:47ff:fed5:7cd4/64 scope link
valid_lft forever preferred_lft forever
5: VethA@VethB: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master totobridge0 state UP group default qlen 1000
link/ether ce:06:89:24:9f:f0 brd ff:ff:ff:ff:ff:ff
inet6 fe80::cc06:89ff:fe24:9ff0/64 scope link
valid_lft forever preferred_lft forever
7: VethC@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master totobridge0 state UP group default qlen 1000
link/ether ae:11:7c:70:05:c3 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet6 fe80::ac11:7cff:fe70:5c3/64 scope link
valid_lft forever preferred_lft forever

root@vbox:/share# ping 192.168.10.2
PING 192.168.10.2 (192.168.10.2) 56(84) bytes of data.
64 bytes from 192.168.10.2: icmp_seq=1 ttl=64 time=0.027 ms
64 bytes from 192.168.10.2: icmp_seq=2 ttl=64 time=0.054 ms
64 bytes from 192.168.10.2: icmp_seq=3 ttl=64 time=0.055 ms
64 bytes from 192.168.10.2: icmp_seq=4 ttl=64 time=0.040 ms
--- 192.168.10.2 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3032ms
rtt min/avg/max/mdev = 0.027/0.044/0.055/0.011 ms

# Network devices in the container
container:/# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
6: VethD@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 42:59:e1:ca:df:21 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.10.2/24 scope global VethD
valid_lft forever preferred_lft forever
inet6 fe80::4059:e1ff:feca:df21/64 scope link
valid_lft forever preferred_lft forever

container:/# ping 192.168.10.1
PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data.
64 bytes from 192.168.10.1: icmp_seq=1 ttl=64 time=0.126 ms
64 bytes from 192.168.10.1: icmp_seq=2 ttl=64 time=0.038 ms
64 bytes from 192.168.10.1: icmp_seq=3 ttl=64 time=0.091 ms
64 bytes from 192.168.10.1: icmp_seq=4 ttl=64 time=0.056 ms
--- 192.168.10.1 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3035ms
rtt min/avg/max/mdev = 0.038/0.077/0.126/0.035 ms

As we can now see, the container child process already behaves much like a virtual machine. A container is, in essence, an isolated child process on a Linux system — which means every container runs on the same kernel as its host, and that makes its security isolation weaker than a virtual machine's. A VM emulates a complete machine in software, boots a full operating system on it, and runs applications on top of that, so applications in different VMs share no kernel or network stack; with containers, by contrast, it is easy to control exactly which resources are shared. This selective sharing is also the principle that lets a Kubernetes Pod hold multiple containers.

Reference: http://crosbymichael.com/creating-containers-part-1.html