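Previous articles gave us a basic picture of kubevirt; this article looks at how kubevirt implements virtual machine networking.
Pod networking
kubevirt is implemented as a CRD-based extension of k8s. Each kubevirt virtual machine corresponds to one vmi object and one pod object, and k8s already defines a specification for pod networking (CNI). So before looking at kubevirt VM networking, it is worth reviewing how k8s pod networking works.
pod and container
A k8s pod is a logical group of containers: a pod can contain several application containers plus one built-in sandbox container:
How kubelet creates a pod
When kubelet creates the containers of a pod, it creates the sandbox container first and then the application containers:
hostNetwork and CNI
Pod networking is set up during the sandbox-creation step. kubelet calls the CRI to create the sandbox container; on receiving the request, the CRI implementation checks whether the pod uses hostNetwork, and if not, it first invokes the CNI plugin to initialize the network (creating the network devices and allocating an ip):
flannel host-gateway
Assuming the CNI is flannel in host-gateway mode, the pod network looks like this: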
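kubevirt networking
Network-related kubevirt components
From the previous articles we know that to create a virtual machine on kubevirt, a user only needs to create a vmi (Virtual Machine Instance) object; virt-controller then creates a pod from the information in that vmi object. In this article we call this pod the vmi pod. The vmi pod contains the kubevirt component virt-launcher as well as the virtualization components libvirtd and qemu. kubevirt VM networking mainly involves virt-launcher and virt-handler, which is deployed as a daemonset:
The following is based on [email protected]
Source code analysis
Assume the following example vmi yaml: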
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  annotations:
  name: test
spec:
  domain:
    devices:
      interfaces:
      - masquerade: {}
        name: default
    # ...
  networks:
  - name: default
    pod: {}
  # ...
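Two fields of the vmi object are closely tied to networking:
- spec.domain.devices.interfaces: defines how the guest interface is connected. Supported types are bridge, slirp, masquerade, sriov and macvtap (pick one per interface); this article covers only bridge and masquerade.
- spec.networks: defines the source network that the vm connects to. Supported types are pod and multus (pick one per network); this article covers only pod.
The name fields of these two entries must match.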
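The name-matching rule can be illustrated with a small check. This is a minimal sketch, not kubevirt's own validation code: the Interface and Network types below are simplified stand-ins that carry only the name field, and unmatchedNames is a hypothetical helper.

```go
package main

import "fmt"

// Interface and Network are simplified stand-ins for the entries of
// spec.domain.devices.interfaces and spec.networks; only the name matters here.
type Interface struct{ Name string }
type Network struct{ Name string }

// unmatchedNames returns the interface names that have no network with the
// same name, followed by the network names that have no interface.
func unmatchedNames(ifaces []Interface, nets []Network) []string {
	netNames := make(map[string]bool, len(nets))
	for _, n := range nets {
		netNames[n.Name] = true
	}
	ifaceNames := make(map[string]bool, len(ifaces))
	var missing []string
	for _, i := range ifaces {
		ifaceNames[i.Name] = true
		if !netNames[i.Name] {
			missing = append(missing, i.Name)
		}
	}
	for _, n := range nets {
		if !ifaceNames[n.Name] {
			missing = append(missing, n.Name)
		}
	}
	return missing
}

func main() {
	ifaces := []Interface{{Name: "default"}}
	nets := []Network{{Name: "default"}}
	fmt.Println("unmatched:", unmatchedNames(ifaces, nets))
}
```

For the example yaml above, both lists contain only "default", so nothing is reported as unmatched.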
Based on the yaml above, we now look at how kubevirt networking is implemented in the source of two components, virt-handler and virt-launcher.
virt-handler
After a vmi object is created, virt-handler's entry point for network initialization is vmUpdateHelperDefault:
// pkg/virt-handler/vm.go
func (d *VirtualMachineController) vmUpdateHelperDefault(origVMI *v1.VirtualMachineInstance, domainExists bool) error {
/*...*/
if !vmi.IsRunning() && !vmi.IsFinal() {
/*...*/
if err := d.setupNetwork(vmi); err != nil {
return fmt.Errorf("failed to configure vmi network: %w", err)
}
/*...*/
}
/*...*/
}
// pkg/virt-handler/vm.go
func (d *VirtualMachineController) setupNetwork(vmi *v1.VirtualMachineInstance) error {
/*...*/
return d.netConf.Setup(vmi, isolationRes.Pid(), func() error {
if requiresDeviceClaim {
if err := d.claimDeviceOwnership(rootMount, "vhost-net"); err != nil {
return neterrors.CreateCriticalNetworkError(fmt.Errorf("failed to set up vhost-net device, %s", err))
}
}
return nil
})
}
// pkg/network/setup/netconf.go
func (c *NetConf) Setup(vmi *v1.VirtualMachineInstance, launcherPid int, preSetup func() error) error {
/*...*/
err := ns.Do(func() error {
// run phase 1 of the network setup
return netConfigurator.SetupPodNetworkPhase1(launcherPid)
})
if err != nil {
return fmt.Errorf("setup failed, err: %w", err)
}
/*...*/
}
// pkg/network/setup/network.go
func (n *VMNetworkConfigurator) SetupPodNetworkPhase1(pid int) error {
launcherPID := &pid
nics, err := n.getPhase1NICs(launcherPID)
if err != nil {
return err
}
for _, nic := range nics {
if err := nic.PlugPhase1(); err != nil {
return fmt.Errorf("failed plugging phase1 at nic '%s': %w", nic.podInterfaceName, err)
}
}
return nil
}
SetupPodNetworkPhase1 does two things: it first collects NIC information through getPhase1NICs, then iterates over those NICs and calls PlugPhase1 on each.
getPhase1NICs
First, getPhase1NICs:
// pkg/network/setup/network.go
func (v VMNetworkConfigurator) getPhase1NICs(launcherPID *int) ([]podNIC, error) {
/*...*/
for i, _ := range v.vmi.Spec.Networks {
nic, err := newPhase1PodNIC(v.vmi, &v.vmi.Spec.Networks[i], v.handler, v.cacheFactory, launcherPID)
if err != nil {
return nil, err
}
nics = append(nics, *nic)
}
return nics, nil
}
func newPhase1PodNIC(vmi *v1.VirtualMachineInstance, network *v1.Network, handler netdriver.NetworkHandler, cacheFactory cache.InterfaceCacheFactory, launcherPID *int) (*podNIC, error) {
// locate the pod NIC from spec.domain.devices.interfaces and spec.networks
podnic, err := newPodNIC(vmi, network, handler, cacheFactory, launcherPID)
if err != nil {
return nil, err
}
if launcherPID == nil {
return nil, fmt.Errorf("missing launcher PID to construct infra configurators")
}
// the infraConfigurator is only initialized for bridge and masquerade
if podnic.vmiSpecIface.Bridge != nil {
podnic.infraConfigurator = infraconfigurators.NewBridgePodNetworkConfigurator(
podnic.vmi,
podnic.vmiSpecIface,
generateInPodBridgeInterfaceName(podnic.podInterfaceName),
*podnic.launcherPID,
podnic.handler)
} else if podnic.vmiSpecIface.Masquerade != nil {
podnic.infraConfigurator = infraconfigurators.NewMasqueradePodNetworkConfigurator(
podnic.vmi,
podnic.vmiSpecIface,
generateInPodBridgeInterfaceName(podnic.podInterfaceName),
podnic.vmiSpecNetwork,
*podnic.launcherPID,
podnic.handler)
}
return podnic, nil
}
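newPodNIC locates the pod NIC from spec.domain.devices.interfaces and spec.networks. Note that if the spec.networks entry is of type pod, the returned pod NIC name defaults to eth0; if it is multus with multus.default=false, the returned pod NIC name is net%d, where %d is the index of the multus network within spec.networks.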
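The naming rule can be sketched as follows. This is an illustration under stated assumptions rather than the kubevirt implementation: the Network type is a simplified stand-in, podInterfaceNames is a hypothetical helper, and the sketch assumes that non-default multus networks are numbered from 1 in their order of appearance.

```go
package main

import "fmt"

// Network is a simplified stand-in for an entry of spec.networks.
type Network struct {
	Name          string
	Pod           bool // true for a `pod: {}` network
	MultusDefault bool // stands in for multus.default on a multus network
}

// podInterfaceNames maps each network name to the pod NIC name it is
// expected to use. Assumption: secondary multus networks are numbered
// net1, net2, ... in order of appearance.
func podInterfaceNames(nets []Network) map[string]string {
	names := make(map[string]string, len(nets))
	idx := 1
	for _, n := range nets {
		if n.Pod || n.MultusDefault {
			names[n.Name] = "eth0"
			continue
		}
		names[n.Name] = fmt.Sprintf("net%d", idx)
		idx++
	}
	return names
}

func main() {
	nets := []Network{
		{Name: "default", Pod: true},
		{Name: "storage"},
		{Name: "backup"},
	}
	names := podInterfaceNames(nets)
	for _, n := range nets {
		fmt.Printf("%s -> %s\n", n.Name, names[n.Name])
	}
}
```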
PlugPhase1
Next, PlugPhase1:
// pkg/network/setup/podnic.go
func (l *podNIC) PlugPhase1() error {
// nothing to do for SRIOV NICs
if l.vmiSpecIface.SRIOV != nil {
return nil
}
/*...*/
// as seen above, only bridge and masquerade initialize this field,
// so every other type returns here
if l.infraConfigurator == nil {
return nil
}
if err := l.infraConfigurator.DiscoverPodNetworkInterface(l.podInterfaceName); err != nil {
return err
}
dhcpConfig := l.infraConfigurator.GenerateNonRecoverableDHCPConfig()
if dhcpConfig != nil {
log.Log.V(4).Infof("The generated dhcpConfig: %s", dhcpConfig.String())
if err := l.cacheFactory.CacheDHCPConfigForPid(getPIDString(l.launcherPID)).Write(l.podInterfaceName, dhcpConfig); err != nil {
return fmt.Errorf("failed to save DHCP configuration: %w", err)
}
}
domainIface := l.infraConfigurator.GenerateNonRecoverableDomainIfaceSpec()
if domainIface != nil {
log.Log.V(4).Infof("The generated libvirt domain interface: %+v", *domainIface)
if err := l.storeCachedDomainIface(*domainIface); err != nil {
return fmt.Errorf("failed to save libvirt domain interface: %w", err)
}
}
/*...*/
// preparePodNetworkInterface must be called *after* the Generate
// methods since it mutates the pod interface from which those
// generator methods get their info from.
if err := l.infraConfigurator.PreparePodNetworkInterface(); err != nil {
log.Log.Reason(err).Error("failed to prepare pod networking")
return errors.CreateCriticalNetworkError(err)
}
/*...*/
}
In PlugPhase1, interfaces of type sriov, slirp and macvtap get no further processing. For bridge and masquerade, the work is delegated to the infraConfigurator interface, which is defined as follows:
// pkg/network/infraconfigurators/common.go
type PodNetworkInfraConfigurator interface {
DiscoverPodNetworkInterface(podIfaceName string) error
PreparePodNetworkInterface() error
GenerateNonRecoverableDomainIfaceSpec() *api.Interface
GenerateNonRecoverableDHCPConfig() *cache.DHCPConfig
}
Both bridge and masquerade implement PodNetworkInfraConfigurator; the implementations are as follows.
bridge
- DiscoverPodNetworkInterface
// pkg/network/infraconfigurators/bridge.go
func (b *BridgePodNetworkConfigurator) DiscoverPodNetworkInterface(podIfaceName string) error {
// look up the pod NIC link device in the vmi pod by interface name
link, err := b.handler.LinkByName(podIfaceName)
if err != nil {
log.Log.Reason(err).Errorf("failed to get a link for interface: %s", podIfaceName)
return err
}
b.podNicLink = link
// read the ipv4 addresses from the link device
addrList, err := b.handler.AddrList(b.podNicLink, netlink.FAMILY_V4)
if err != nil {
log.Log.Reason(err).Errorf("failed to get an ip address for %s", podIfaceName)
return err
}
if len(addrList) == 0 {
// no ip configured: mark ipam as disabled
b.ipamEnabled = false
} else {
// an ip was found: mark ipam as enabled
b.podIfaceIP = addrList[0]
b.ipamEnabled = true
// record the routes of the pod NIC
if err := b.learnInterfaceRoutes(); err != nil {
return err
}
}
// derive the tap device name from the vmi pod NIC name (eth0 gives tap0)
b.tapDeviceName = virtnetlink.GenerateTapDeviceName(podIfaceName)
// try to read a user-specified mac address from vmi.spec.domain.devices.interfaces;
// if none is configured (as in the vmi yaml above), fall back to the pod NIC's own mac address
b.vmMac, err = virtnetlink.RetrieveMacAddressFromVMISpecIface(b.vmiSpecIface)
if err != nil {
return err
}
if b.vmMac == nil {
b.vmMac = &b.podNicLink.Attrs().HardwareAddr
}
return nil
}
- GenerateNonRecoverableDHCPConfig
// pkg/network/infraconfigurators/bridge.go
func (b *BridgePodNetworkConfigurator) GenerateNonRecoverableDHCPConfig() *cache.DHCPConfig {
// the pod NIC has no ip: return early with ipam disabled
if !b.ipamEnabled {
return &cache.DHCPConfig{IPAMDisabled: true}
}
dhcpConfig := &cache.DHCPConfig{
MAC: *b.vmMac,
IPAMDisabled: !b.ipamEnabled,
IP: b.podIfaceIP,
}
// the pod NIC has an ip and routes: add the gateway and routes to the dhcp config
if b.ipamEnabled && len(b.podIfaceRoutes) > 0 {
log.Log.V(4).Infof("got to add %d routes to the DhcpConfig", len(b.podIfaceRoutes))
b.decorateDhcpConfigRoutes(dhcpConfig)
}
return dhcpConfig
}
// use the qualifying pod NIC routes and the gateway as dhcp configuration
func (b *BridgePodNetworkConfigurator) decorateDhcpConfigRoutes(dhcpConfig *cache.DHCPConfig) {
log.Log.V(4).Infof("the default route is: %s", b.podIfaceRoutes[0].String())
dhcpConfig.Gateway = b.podIfaceRoutes[0].Gw
if len(b.podIfaceRoutes) > 1 {
dhcpRoutes := virtnetlink.FilterPodNetworkRoutes(b.podIfaceRoutes, dhcpConfig)
dhcpConfig.Routes = &dhcpRoutes
}
}
- GenerateNonRecoverableDomainIfaceSpec
// pkg/network/infraconfigurators/bridge.go
// build an interface object from the mac address
func (b *BridgePodNetworkConfigurator) GenerateNonRecoverableDomainIfaceSpec() *api.Interface {
return &api.Interface{
MAC: &api.MAC{MAC: b.vmMac.String()},
}
}
- PreparePodNetworkInterface
The previous three functions can be seen as data preparation; PreparePodNetworkInterface is the core of the logic.
// pkg/network/infraconfigurators/bridge.go
func (b *BridgePodNetworkConfigurator) PreparePodNetworkInterface() error {
// bring the pod NIC down first
if err := b.handler.LinkSetDown(b.podNicLink); err != nil {
log.Log.Reason(err).Errorf("failed to bring link down for interface: %s", b.podNicLink.Attrs().Name)
return err
}
// if ipam is enabled (the pod NIC has an ip)
if b.ipamEnabled {
// remove the ip from the pod NIC
err := b.handler.AddrDel(b.podNicLink, &b.podIfaceIP)
if err != nil {
log.Log.Reason(err).Errorf("failed to delete address for interface: %s", b.podNicLink.Attrs().Name)
return err
}
// rename the pod NIC and create a dummy NIC carrying the original name,
// then assign the original ip to the dummy NIC
if err := b.switchPodInterfaceWithDummy(); err != nil {
log.Log.Reason(err).Error("failed to switch pod interface with a dummy")
return err
}
// Set arp_ignore=1 to avoid
// the dummy interface being seen by Duplicate Address Detection (DAD).
// Without this, some VMs will lose their ip address after a few
// minutes.
if err := b.handler.ConfigureIpv4ArpIgnore(); err != nil {
log.Log.Reason(err).Errorf("failed to set arp_ignore=1")
return err
}
}
// assign a random mac address to the pod NIC
if _, err := b.handler.SetRandomMac(b.podNicLink.Attrs().Name); err != nil {
return err
}
// create a bridge device
if err := b.createBridge(); err != nil {
return err
}
tapOwner := netdriver.LibvirtUserAndGroupId
if util.IsNonRootVMI(b.vmi) {
tapOwner = strconv.Itoa(util.NonRootUID)
}
// create a tap device with the virt-chroot command and attach it to the bridge
err := createAndBindTapToBridge(b.handler, b.tapDeviceName, b.bridgeInterfaceName, b.launcherPID, b.podNicLink.Attrs().MTU, tapOwner, b.vmi)
if err != nil {
log.Log.Reason(err).Errorf("failed to create tap device named %s", b.tapDeviceName)
return err
}
// bring the pod NIC back up
if err := b.handler.LinkSetUp(b.podNicLink); err != nil {
log.Log.Reason(err).Errorf("failed to bring link up for interface: %s", b.podNicLink.Attrs().Name)
return err
}
// disable learning on the pod NIC
if err := b.handler.LinkSetLearningOff(b.podNicLink); err != nil {
log.Log.Reason(err).Errorf("failed to disable mac learning for interface: %s", b.podNicLink.Attrs().Name)
return err
}
return nil
}
// pkg/network/infraconfigurators/bridge.go
func (b *BridgePodNetworkConfigurator) switchPodInterfaceWithDummy() error {
originalPodInterfaceName := b.podNicLink.Attrs().Name
newPodInterfaceName := virtnetlink.GenerateNewBridgedVmiInterfaceName(originalPodInterfaceName)
dummy := &netlink.Dummy{LinkAttrs: netlink.LinkAttrs{Name: originalPodInterfaceName}}
// rename the pod NIC first (e.g. eth0 becomes eth0-nic)
err := b.handler.LinkSetName(b.podNicLink, newPodInterfaceName)
if err != nil {
log.Log.Reason(err).Errorf("failed to rename interface : %s", b.podNicLink.Attrs().Name)
return err
}
// refresh the in-memory podNicLink
b.podNicLink, err = b.handler.LinkByName(newPodInterfaceName)
if err != nil {
log.Log.Reason(err).Errorf("failed to get a link for interface: %s", newPodInterfaceName)
return err
}
// create a dummy NIC named after the original NIC (e.g. eth0)
err = b.handler.LinkAdd(dummy)
if err != nil {
log.Log.Reason(err).Errorf("failed to create dummy interface : %s", originalPodInterfaceName)
return err
}
// assign the original pod NIC ip to the dummy NIC
err = b.handler.AddrReplace(dummy, &b.podIfaceIP)
if err != nil {
log.Log.Reason(err).Errorf("failed to replace original IP address to dummy interface: %s", originalPodInterfaceName)
return err
}
return nil
}
// pkg/network/infraconfigurators/bridge.go
func (b *BridgePodNetworkConfigurator) createBridge() error {
// create a bridge device
bridge := &netlink.Bridge{
LinkAttrs: netlink.LinkAttrs{
Name: b.bridgeInterfaceName,
},
}
err := b.handler.LinkAdd(bridge)
if err != nil {
log.Log.Reason(err).Errorf("failed to create a bridge")
return err
}
// attach the pod NIC to the bridge
err = b.handler.LinkSetMaster(b.podNicLink, bridge)
if err != nil {
log.Log.Reason(err).Errorf("failed to connect interface %s to bridge %s", b.podNicLink.Attrs().Name, bridge.Name)
return err
}
// bring the bridge up
err = b.handler.LinkSetUp(bridge)
if err != nil {
log.Log.Reason(err).Errorf("failed to bring link up for interface: %s", b.bridgeInterfaceName)
return err
}
// build a fake ip: 169.254.75.1%d/32,
// where %d is the index of the interface in spec.domain.devices.interfaces;
// with a single NIC this is 169.254.75.10/32
addr := virtnetlink.GetFakeBridgeIP(b.vmi.Spec.Domain.Devices.Interfaces, b.vmiSpecIface)
fakeaddr, _ := b.handler.ParseAddr(addr)
// add the fake ip to the bridge
if err := b.handler.AddrAdd(bridge, fakeaddr); err != nil {
log.Log.Reason(err).Errorf("failed to set bridge IP")
return err
}
// disable tx checksum offload on the bridge
if err = b.handler.DisableTXOffloadChecksum(b.bridgeInterfaceName); err != nil {
log.Log.Reason(err).Error("failed to disable TX offload checksum on bridge interface")
return err
}
return nil
}
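The names and the address used on the bridge path follow simple patterns, which the sketch below reproduces. It is an illustration only: tapDeviceName, renamedPodNICName and fakeBridgeIP are hypothetical helpers written from the comments above (eth0 gives tap0, eth0 becomes eth0-nic, fake ip 169.254.75.1%d/32), not the virtnetlink functions themselves, and the digit-suffix rule for the tap name is an assumption.

```go
package main

import (
	"fmt"
	"net/netip"
)

// tapDeviceName keeps the trailing digits of the pod NIC name and prefixes
// them with "tap" (assumption: eth0 -> tap0, net1 -> tap1).
func tapDeviceName(podIface string) string {
	i := len(podIface)
	for i > 0 && podIface[i-1] >= '0' && podIface[i-1] <= '9' {
		i--
	}
	return "tap" + podIface[i:]
}

// renamedPodNICName is the name the original pod NIC gets once the dummy
// NIC takes over its name (eth0 -> eth0-nic).
func renamedPodNICName(podIface string) string {
	return podIface + "-nic"
}

// fakeBridgeIP returns 169.254.75.1<index>/32, where index is the position
// of the named interface in the interface list.
func fakeBridgeIP(ifaceNames []string, name string) (netip.Prefix, error) {
	for i, n := range ifaceNames {
		if n == name {
			return netip.ParsePrefix(fmt.Sprintf("169.254.75.1%d/32", i))
		}
	}
	return netip.Prefix{}, fmt.Errorf("interface %q not found", name)
}

func main() {
	fmt.Println(tapDeviceName("eth0"), renamedPodNICName("eth0"))
	ip, err := fakeBridgeIP([]string{"default"}, "default")
	fmt.Println(ip, err)
}
```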
masquerade
- DiscoverPodNetworkInterface
// pkg/network/infraconfigurators/masquerade.go
func (b *MasqueradePodNetworkConfigurator) DiscoverPodNetworkInterface(podIfaceName string) error {
// look up the pod NIC link device
link, err := b.handler.LinkByName(podIfaceName)
if err != nil {
log.Log.Reason(err).Errorf("failed to get a link for interface: %s", podIfaceName)
return err
}
b.podNicLink = link
// compute the vm's ipv4 address and gateway address;
// the default ipv4 network is 10.0.2.0/24,
// overridden by vmi.spec.networks.pod.vmNetworkCIDR when it is set
if err := b.computeIPv4GatewayAndVmIp(); err != nil {
return err
}
// check whether ipv6 is enabled on the pod NIC
ipv6Enabled, err := b.handler.IsIpv6Enabled(podIfaceName)
if err != nil {
log.Log.Reason(err).Errorf(ipVerifyFailFmt, podIfaceName)
return err
}
if ipv6Enabled {
// compute the vm's ipv6 address and gateway address;
// the default ipv6 network is fd10:0:2::/120,
// overridden by vmi.spec.networks.pod.vmIPv6NetworkCIDR when it is set
if err := b.discoverIPv6GatewayAndVmIp(); err != nil {
return err
}
}
return nil
}
- GenerateNonRecoverableDHCPConfig
// pkg/network/infraconfigurators/masquerade.go
// masquerade caches no dhcp config in phase 1 (it is generated in phase 2)
func (b *MasqueradePodNetworkConfigurator) GenerateNonRecoverableDHCPConfig() *cache.DHCPConfig {
return nil
}
- GenerateNonRecoverableDomainIfaceSpec
// pkg/network/infraconfigurators/masquerade.go
// nothing to do for masquerade
func (b *MasqueradePodNetworkConfigurator) GenerateNonRecoverableDomainIfaceSpec() *api.Interface {
return nil
}
- PreparePodNetworkInterface
// pkg/network/infraconfigurators/masquerade.go
func (b *MasqueradePodNetworkConfigurator) PreparePodNetworkInterface() error {
// create a bridge device
if err := b.createBridge(); err != nil {
return err
}
tapOwner := netdriver.LibvirtUserAndGroupId
if util.IsNonRootVMI(b.vmi) {
tapOwner = strconv.Itoa(util.NonRootUID)
}
// create a tap device with the virt-chroot command and attach it to the bridge
tapDeviceName := virtnetlink.GenerateTapDeviceName(b.podNicLink.Attrs().Name)
err := createAndBindTapToBridge(b.handler, tapDeviceName, b.bridgeInterfaceName, b.launcherPID, b.podNicLink.Attrs().MTU, tapOwner, b.vmi)
if err != nil {
log.Log.Reason(err).Errorf("failed to create tap device named %s", tapDeviceName)
return err
}
// create the ipv4 nat rules with nft/iptables
err = b.createNatRules(iptables.ProtocolIPv4)
if err != nil {
log.Log.Reason(err).Errorf("failed to create ipv4 nat rules for vm error: %v", err)
return err
}
ipv6Enabled, err := b.handler.IsIpv6Enabled(b.podNicLink.Attrs().Name)
if err != nil {
log.Log.Reason(err).Errorf(ipVerifyFailFmt, b.podNicLink.Attrs().Name)
return err
}
if ipv6Enabled {
// create the ipv6 nat rules with nft/iptables
err = b.createNatRules(iptables.ProtocolIPv6)
if err != nil {
log.Log.Reason(err).Errorf("failed to create ipv6 nat rules for vm error: %v", err)
return err
}
}
return nil
}
// pkg/network/infraconfigurators/masquerade.go
func (b *MasqueradePodNetworkConfigurator) createBridge() error {
// the bridge gets a fixed mac address: 02:00:00:00:00:00
mac, err := net.ParseMAC(link.StaticMasqueradeBridgeMAC)
if err != nil {
return err
}
// create a bridge
bridge := &netlink.Bridge{
LinkAttrs: netlink.LinkAttrs{
Name: b.bridgeInterfaceName,
MTU: b.podNicLink.Attrs().MTU,
HardwareAddr: mac,
},
}
err = b.handler.LinkAdd(bridge)
if err != nil {
log.Log.Reason(err).Errorf("failed to create a bridge")
return err
}
// bring the bridge up
if err := b.handler.LinkSetUp(bridge); err != nil {
log.Log.Reason(err).Errorf("failed to bring link up for interface: %s", b.bridgeInterfaceName)
return err
}
// assign the previously computed vm gateway address to the bridge
if err := b.handler.AddrAdd(bridge, b.vmGatewayAddr); err != nil {
log.Log.Reason(err).Errorf("failed to set bridge IP")
return err
}
ipv6Enabled, err := b.handler.IsIpv6Enabled(b.podNicLink.Attrs().Name)
if err != nil {
log.Log.Reason(err).Errorf(ipVerifyFailFmt, b.podNicLink.Attrs().Name)
return err
}
if ipv6Enabled {
// if ipv6 is enabled, also assign the ipv6 gateway address to the bridge
if err := b.handler.AddrAdd(bridge, b.vmGatewayIpv6Addr); err != nil {
log.Log.Reason(err).Errorf("failed to set bridge IPv6")
return err
}
}
// disable tx checksum offload on the bridge
if err = b.handler.DisableTXOffloadChecksum(b.bridgeInterfaceName); err != nil {
log.Log.Reason(err).Error("failed to disable TX offload checksum on bridge interface")
return err
}
return nil
}
// pkg/network/infraconfigurators/masquerade.go
func (b *MasqueradePodNetworkConfigurator) createNatRules(protocol iptables.Protocol) error {
// enable ipv4/ipv6 forwarding inside the pod
err := b.handler.ConfigureIpForwarding(protocol)
if err != nil {
log.Log.Reason(err).Errorf("failed to configure ip forwarding")
return err
}
// set up the nat rules with nft or iptables
if b.handler.NftablesLoad(protocol) == nil {
return b.createNatRulesUsingNftables(protocol)
} else if b.handler.HasNatIptables(protocol) {
return b.createNatRulesUsingIptables(protocol)
}
return fmt.Errorf("Couldn't configure ip nat rules")
}
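The gateway and vm addresses that masquerade derives from the network CIDR can be illustrated with the standard net/netip package. This is a minimal sketch, not the virtnetlink implementation: gatewayAndVMIP is a hypothetical helper, and it assumes the gateway is the first address after the network address and the vm gets the one after that.

```go
package main

import (
	"fmt"
	"net/netip"
)

// gatewayAndVMIP derives the bridge (gateway) address and the vm address
// from a CIDR. Assumption: gateway = network address + 1, vm = gateway + 1.
func gatewayAndVMIP(cidr string) (netip.Prefix, netip.Prefix, error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return netip.Prefix{}, netip.Prefix{}, err
	}
	p = p.Masked() // normalise to the network address
	gw := p.Addr().Next()
	vm := gw.Next()
	if !gw.IsValid() || !vm.IsValid() || !p.Contains(vm) {
		return netip.Prefix{}, netip.Prefix{}, fmt.Errorf("network %s is too small for a gateway and a vm address", p)
	}
	return netip.PrefixFrom(gw, p.Bits()), netip.PrefixFrom(vm, p.Bits()), nil
}

func main() {
	// the default networks quoted in the comments above
	for _, cidr := range []string{"10.0.2.0/24", "fd10:0:2::/120"} {
		gw, vm, err := gatewayAndVMIP(cidr)
		fmt.Println(cidr, gw, vm, err)
	}
}
```

With the default ipv4 network 10.0.2.0/24, the sketch yields 10.0.2.1 for the bridge and 10.0.2.2 for the vm.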
virt-launcher
virt-launcher's entry point for network initialization is SyncVirtualMachine (virt-launcher exposes a grpc interface; the caller of that interface is still the virt-handler process):
// pkg/virt-launcher/virtwrap/cmd-server/server.go
func (l *Launcher) SyncVirtualMachine(_ context.Context, request *cmdv1.VMIRequest) (*cmdv1.Response, error) {
/*...*/
if _, err := l.domainManager.SyncVMI(vmi, l.allowEmulation, request.Options); err != nil {
log.Log.Object(vmi).Reason(err).Errorf("Failed to sync vmi")
response.Success = false
response.Message = getErrorMessage(err)
return response, nil
}
/*...*/
}
// pkg/virt-launcher/virtwrap/manager.go
func (l *LibvirtDomainManager) SyncVMI(vmi *v1.VirtualMachineInstance, allowEmulation bool, options *cmdv1.VirtualMachineOptions) (*api.DomainSpec, error) {
/*...*/
dom, err := l.virConn.LookupDomainByName(domain.Spec.Name)
if err != nil {
// We need the domain but it does not exist, so create it
if domainerrors.IsNotFound(err) {
domain, err = l.preStartHook(vmi, domain, false)
/*...*/
}
/*...*/
}
/*...*/
}
// pkg/virt-launcher/virtwrap/manager.go
func (l *LibvirtDomainManager) preStartHook(vmi *v1.VirtualMachineInstance, domain *api.Domain, generateEmptyIsos bool) (*api.Domain, error) {
/*...*/
err = netsetup.NewVMNetworkConfigurator(vmi, l.networkCacheStoreFactory).SetupPodNetworkPhase2(domain)
if err != nil {
return domain, fmt.Errorf("preparing the pod network failed: %v", err)
}
/*...*/
}
// pkg/network/setup/network.go
func (n *VMNetworkConfigurator) SetupPodNetworkPhase2(domain *api.Domain) error {
nics, err := n.getPhase2NICs(domain)
if err != nil {
return err
}
for _, nic := range nics {
if err := nic.PlugPhase2(domain); err != nil {
return fmt.Errorf("failed plugging phase2 at nic '%s': %w", nic.podInterfaceName, err)
}
}
return nil
}
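Phase 2 of the network setup in virt-launcher also has two steps: it first collects pod NIC information through getPhase2NICs, then iterates over the NICs and calls nic.PlugPhase2 on each.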
getPhase2NICs
// pkg/network/setup/network.go
func (v VMNetworkConfigurator) getPhase2NICs(domain *api.Domain) ([]podNIC, error) {
nics := []podNIC{}
if len(v.vmi.Spec.Domain.Devices.Interfaces) == 0 {
return nics, nil
}
for i, _ := range v.vmi.Spec.Networks {
nic, err := newPhase2PodNIC(v.vmi, &v.vmi.Spec.Networks[i], v.handler, v.cacheFactory, domain)
if err != nil {
return nil, err
}
nics = append(nics, *nic)
}
return nics, nil
}
// pkg/network/setup/podnic.go
func newPhase2PodNIC(vmi *v1.VirtualMachineInstance, network *v1.Network, handler netdriver.NetworkHandler, cacheFactory cache.InterfaceCacheFactory, domain *api.Domain) (*podNIC, error) {
podnic, err := newPodNIC(vmi, network, handler, cacheFactory, nil)
if err != nil {
return nil, err
}
podnic.dhcpConfigurator = podnic.newDHCPConfigurator()
podnic.domainGenerator = podnic.newLibvirtSpecGenerator(domain)
return podnic, nil
}
PlugPhase2
// pkg/network/setup/podnic.go
func (l *podNIC) PlugPhase2(domain *api.Domain) error {
precond.MustNotBeNil(domain)
// return immediately for sriov
if l.vmiSpecIface.SRIOV != nil {
return nil
}
if err := l.domainGenerator.Generate(); err != nil {
log.Log.Reason(err).Critical("failed to create libvirt configuration")
}
// only bridge and masquerade enter this branch
if l.dhcpConfigurator != nil {
dhcpConfig, err := l.dhcpConfigurator.Generate()
if err != nil {
log.Log.Reason(err).Errorf("failed to get a dhcp configuration for: %s", l.podInterfaceName)
return err
}
log.Log.V(4).Infof("The imported dhcpConfig: %s", dhcpConfig.String())
if err := l.dhcpConfigurator.EnsureDHCPServerStarted(l.podInterfaceName, *dhcpConfig, l.vmiSpecIface.DHCPOptions); err != nil {
log.Log.Reason(err).Criticalf("failed to ensure dhcp service running for: %s", l.podInterfaceName)
panic(err)
}
}
return nil
}
// pkg/network/dhcp/configurator.go
func (d *configurator) EnsureDHCPServerStarted(podInterfaceName string, dhcpConfig cache.DHCPConfig, dhcpOptions *v1.DHCPOptions) error {
if dhcpConfig.IPAMDisabled {
return nil
}
dhcpStartedFile := d.getDHCPStartedFilePath(podInterfaceName)
_, err := os.Stat(dhcpStartedFile)
if os.IsNotExist(err) {
// start the dhcp server
if err := d.handler.StartDHCP(&dhcpConfig, d.advertisingIfaceName, dhcpOptions); err != nil {
return fmt.Errorf("failed to start DHCP server for interface %s", podInterfaceName)
}
newFile, err := os.Create(dhcpStartedFile)
if err != nil {
return fmt.Errorf("failed to create dhcp started file %s: %s", dhcpStartedFile, err)
}
newFile.Close()
}
return nil
}
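EnsureDHCPServerStarted uses a marker file so that repeated sync calls start the dhcp server only once. The pattern can be sketched in isolation as below. This is an illustration, not the kubevirt code: ensureStarted and countStarts are hypothetical helpers, the marker file name is made up, and unlike the original the sketch also returns os.Stat errors other than "not exist".

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// ensureStarted runs start only if markerPath does not exist yet, then
// creates the marker so that later calls become no-ops.
func ensureStarted(markerPath string, start func() error) error {
	_, err := os.Stat(markerPath)
	if err == nil {
		return nil // marker exists: already started
	}
	if !os.IsNotExist(err) {
		return fmt.Errorf("checking marker %s: %w", markerPath, err)
	}
	if err := start(); err != nil {
		return fmt.Errorf("starting server: %w", err)
	}
	f, err := os.Create(markerPath)
	if err != nil {
		return fmt.Errorf("creating marker %s: %w", markerPath, err)
	}
	return f.Close()
}

// countStarts calls ensureStarted n times against a fresh marker in a
// temporary directory and reports how often start actually ran.
func countStarts(n int) (int, error) {
	dir, err := os.MkdirTemp("", "dhcp-marker")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	marker := filepath.Join(dir, "dhcp_started-eth0") // hypothetical file name
	starts := 0
	for i := 0; i < n; i++ {
		if err := ensureStarted(marker, func() error { starts++; return nil }); err != nil {
			return starts, err
		}
	}
	return starts, nil
}

func main() {
	starts, err := countStarts(3)
	fmt.Println("start ran", starts, "time(s), err:", err)
}
```

The marker lives on the filesystem rather than in memory, so the guard depends on the file surviving between calls, not on process state.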
// pkg/network/driver/common.go
func (h *NetworkUtilsHandler) StartDHCP(nic *cache.DHCPConfig, bridgeInterfaceName string, dhcpOptions *v1.DHCPOptions) error {
/*...*/
// start an ipv4 dhcp server in a goroutine
go func() {
if err = DHCPServer(
nic.MAC,
nic.IP.IP,
nic.IP.Mask,
bridgeInterfaceName,
nic.AdvertisingIPAddr,
nic.Gateway,
nameservers,
nic.Routes,
searchDomains,
nic.Mtu,
dhcpOptions,
); err != nil {
log.Log.Errorf("failed to run DHCP: %v", err)
panic(err)
}
}()
if nic.IPv6.IPNet != nil {
// start an ipv6 dhcp server
go func() {
if err = DHCPv6Server(
nic.IPv6.IP,
bridgeInterfaceName,
); err != nil {
log.Log.Reason(err).Error("failed to run DHCPv6")
panic(err)
}
}()
}
return nil
}
Generate is implemented differently for bridge and masquerade:
- bridge
// pkg/network/dhcp/bridge.go
func (d *BridgeConfigGenerator) Generate() (*cache.DHCPConfig, error) {
dhcpConfig, err := d.cacheFactory.CacheDHCPConfigForPid(d.launcherPID).Read(d.podInterfaceName)
if err != nil {
return nil, err
}
if dhcpConfig.IPAMDisabled {
return dhcpConfig, nil
}
dhcpConfig.Name = d.podInterfaceName
// as mentioned in the bridge logic above, the bridge gets a fake ip; fetch it here
fakeBridgeIP := virtnetlink.GetFakeBridgeIP(d.vmiSpecIfaces, d.vmiSpecIface)
fakeServerAddr, _ := netlink.ParseAddr(fakeBridgeIP)
dhcpConfig.AdvertisingIPAddr = fakeServerAddr.IP
newPodNicName := virtnetlink.GenerateNewBridgedVmiInterfaceName(d.podInterfaceName)
podNicLink, err := d.handler.LinkByName(newPodNicName)
if err != nil {
return nil, err
}
// the dhcp MTU matches the pod NIC
dhcpConfig.Mtu = uint16(podNicLink.Attrs().MTU)
dhcpConfig.Subdomain = d.subdomain
return dhcpConfig, nil
}
- masquerade
// pkg/network/dhcp/masquerade.go
func (d *MasqueradeConfigGenerator) Generate() (*cache.DHCPConfig, error) {
dhcpConfig := &cache.DHCPConfig{}
// look up the pod NIC
podNicLink, err := d.handler.LinkByName(d.podInterfaceName)
if err != nil {
return nil, err
}
dhcpConfig.Name = podNicLink.Attrs().Name
dhcpConfig.Subdomain = d.subdomain
dhcpConfig.Mtu = uint16(podNicLink.Attrs().MTU)
// get the masquerade ipv4 gateway and vm ip
ipv4Gateway, ipv4, err := virtnetlink.GenerateMasqueradeGatewayAndVmIPAddrs(d.vmiSpecNetwork, iptables.ProtocolIPv4)
if err != nil {
return nil, err
}
dhcpConfig.IP = *ipv4
dhcpConfig.AdvertisingIPAddr = ipv4Gateway.IP.To4()
dhcpConfig.Gateway = ipv4Gateway.IP.To4()
ipv6Enabled, err := d.handler.IsIpv6Enabled(d.podInterfaceName)
if err != nil {
log.Log.Reason(err).Errorf("failed to verify whether ipv6 is configured on %s", d.podInterfaceName)
return nil, err
}
if ipv6Enabled {
// get the masquerade ipv6 gateway and vm ip
ipv6Gateway, ipv6, err := virtnetlink.GenerateMasqueradeGatewayAndVmIPAddrs(d.vmiSpecNetwork, iptables.ProtocolIPv6)
if err != nil {
return nil, err
}
dhcpConfig.IPv6 = *ipv6
dhcpConfig.AdvertisingIPv6Addr = ipv6Gateway.IP.To16()
}
return dhcpConfig, nil
}
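Beyond the above, there are further steps that push the network configuration into the virtual machine through libvirtd; these are left for the reader to study.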
libvirt
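Once virt-handler and virt-launcher have prepared the bridge, tap, dhcp server and other resources, kubevirt assembles this data into libvirt xml and calls the libvirtd API to create the virtual machine, which completes the VM system.
Summary
Having walked through kubevirt networking at the source level, this section recaps the analysis with diagrams to make it easier to follow. The CNI is again flannel in host-gateway mode: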
pod+bridge
The vmi yaml is as follows:
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  annotations:
  name: test
spec:
  domain:
    devices:
      interfaces:
      - bridge: {} # note: bridge here
        name: default
    # ...
  networks:
  - name: default
    pod: {}
  # ...
virt-handler
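1. Bring eth0 down and remove its ip (the removed ip is kept in memory):
2. Rename the original eth0 to eth0-nic, create a dummy NIC named eth0, assign the removed ip to the dummy NIC, set arp_ignore, and change the mac of eth0-nic to a random one:
3. Create a bridge, attach eth0-nic to it, bring the bridge up and give it a fake ip, then disable tx checksum offload on the bridge:
4. Create a tap device and attach it to the bridge:
5. Bring eth0-nic up and disable mac learning on it, which gives the final layout: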
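The resulting device layout can be replayed with a small in-memory model. This is only a toy simulation of the five steps and makes no netlink calls: the link and netns types and bridgePhase1 are hypothetical, the bridge name k6t-eth0 and the pod ip 10.244.1.5/24 are assumed example values, and the mac randomization and arp_ignore settings are left out.

```go
package main

import "fmt"

// link is a toy model of a network device inside the pod network namespace.
type link struct {
	Kind   string // veth, dummy, bridge or tap
	Master string // bridge the device is attached to, if any
	IP     string
	Up     bool
}

type netns map[string]*link

// bridgePhase1 replays the five steps on the model, assuming the pod NIC
// has an ip (the ipam-enabled case).
func bridgePhase1(ns netns, podIface, bridge, tap, fakeIP string) error {
	nic, ok := ns[podIface]
	if !ok {
		return fmt.Errorf("pod interface %s not found", podIface)
	}
	// 1. bring the pod NIC down and take its ip away
	nic.Up = false
	ip := nic.IP
	nic.IP = ""
	// 2. rename the pod NIC, create a dummy with the old name and the old ip
	delete(ns, podIface)
	ns[podIface+"-nic"] = nic
	ns[podIface] = &link{Kind: "dummy", IP: ip}
	// 3. create the bridge with the fake ip and attach the renamed NIC
	ns[bridge] = &link{Kind: "bridge", IP: fakeIP, Up: true}
	nic.Master = bridge
	// 4. create the tap device and attach it to the bridge
	ns[tap] = &link{Kind: "tap", Master: bridge}
	// 5. bring the renamed NIC back up
	nic.Up = true
	return nil
}

func main() {
	ns := netns{"eth0": {Kind: "veth", IP: "10.244.1.5/24", Up: true}}
	if err := bridgePhase1(ns, "eth0", "k6t-eth0", "tap0", "169.254.75.10/32"); err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, name := range []string{"eth0", "eth0-nic", "k6t-eth0", "tap0"} {
		fmt.Printf("%-9s %+v\n", name, *ns[name])
	}
}
```

After the run, eth0 is a dummy that only holds the pod ip, while eth0-nic and tap0 are both ports of the bridge, so frames pass between the pod's veth and the vm at layer 2.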
virt-launcher
virt-launcher does comparatively little: it only starts a dhcp server from which the vm later obtains its ip:
libvirt
libvirt creates the vm and attaches the tap device to it:
pod+masquerade
The vmi yaml is as follows:
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  annotations:
  name: test
spec:
  domain:
    devices:
      interfaces:
      - masquerade: {} # note: masquerade here
        name: default
    # ...
  networks:
  - name: default
    pod: {}
  # ...
virt-handler
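1. Create a bridge with mac address 02:00:00:00:00:00 and bring it up; by default the gateway address is computed from 10.0.2.0/24 (i.e. 10.0.2.1) and assigned to the bridge as its ip, and finally tx checksum offload is disabled:
2. Create a tap device and attach it to the bridge:
3. Enable ip_forward, then implement nat with nftables or iptables: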
virt-launcher
virt-launcher starts a dhcp server from which the vm later obtains its ip:
libvirt
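libvirt creates the vm and attaches the tap device to it:
Other thoughts
The above gives a clearer picture of kubevirt VM networking; here are some additional personal thoughts: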
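- Why is VM network setup split across virt-handler and virt-launcher, i.e. the phase1 and phase2 stages described in the official documentation? Couldn't it all be done in virt-launcher?
Personal view: this is probably for security reasons. The official documentation contains this passage: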
The virt-launcher is an untrusted component of KubeVirt (since it wraps the libvirt process that will run third party workloads). As a result, it must be run with as little privileges as required. As of now, the only capability required by virt-launcher to configure networking is the CAP_NET_ADMIN.
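- From the analysis above, a packet travelling from the host NIC to the virtual machine passes through: host physical NIC -> CNI bridge -> host-side veth -> veth inside the pod network namespace -> bridge created by virt-handler -> tap in the pod network namespace -> NIC inside the VM. This path is clearly long and may be a challenge for workloads with high network performance requirements.
Personal view: as noted at the start of the article, CNI is only involved for pods that do not use hostNetwork. So if the vmi pod were configured with hostNetwork, the CNI bridge, the host-side veth and the pod-namespace veth would disappear, and the path would be much shorter. With hostNetwork, however, network security has to be considered first, the virt-handler and virt-launcher code would probably need changes, and vm ip allocation would also have to be solved.
This article uses flannel as the pod network CNI throughout. Modifying the CNI to implement a network model like the following might be a better fit (not validated or analysed in depth, for reference only):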
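- Judging from the vmi field definitions, SR-IOV is supported (I have not verified this in practice). In theory SR-IOV can improve VM network performance considerably, but it has limitations: it must be enabled in the BIOS and the host NIC must support it; even then the number of VFs is usually small, from a dozen to a few dozen, while a k8s node can run 110 pods by default, so with many small VMs the VFs would clearly run out. If the workload only needs large VMs, for example a single-digit number of VMs per host, SR-IOV should be a good choice.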