KubeVirt (Part 6): Networking

Earlier articles gave a brief introduction to KubeVirt. This article looks at how KubeVirt implements virtual machine networking.

Pod networking

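KubeVirt is implemented as a set of Kubernetes CRDs. Each KubeVirt virtual machine corresponds to one VMI object and one pod object, and Kubernetes already defines a specification for pod networking (CNI). It is therefore worth understanding Kubernetes pod networking before looking at KubeVirt VM networking.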

Pod and container

A Kubernetes pod is a logical group of containers. A pod can contain several application containers plus one built-in sandbox container.

How kubelet creates a pod

When kubelet creates the containers of a pod, it creates the sandbox container first and then the application containers.

hostNetwork and CNI

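The pod network is set up while the sandbox container is created. kubelet calls the CRI to create the sandbox container; the CRI implementation checks whether the pod uses hostNetwork and, if it does not, first invokes the CNI plugin to initialize the network (creating the network devices and allocating an IP).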
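
The decision described above can be sketched in a few lines of Go. This is only an illustration: PodSpec, NetworkPlugin, staticPlugin, and setupSandboxNetwork are hypothetical names invented here, not types from kubelet, any CRI implementation, or the CNI libraries.

```go
package main

import "fmt"

// PodSpec is a hypothetical, trimmed-down stand-in for the Kubernetes pod spec.
type PodSpec struct {
	Name        string
	HostNetwork bool
}

// NetworkPlugin is a hypothetical stand-in for a CNI plugin invocation.
type NetworkPlugin interface {
	SetUp(podName string) (ip string, err error)
}

// staticPlugin always hands out the same address; it exists only for this demo.
type staticPlugin struct{ ip string }

func (p staticPlugin) SetUp(string) (string, error) { return p.ip, nil }

// setupSandboxNetwork mirrors the decision described above: a hostNetwork pod
// reuses the node's network, any other pod gets its network from the plugin.
func setupSandboxNetwork(pod PodSpec, plugin NetworkPlugin) (string, error) {
	if pod.HostNetwork {
		return "host", nil
	}
	ip, err := plugin.SetUp(pod.Name)
	if err != nil {
		return "", fmt.Errorf("cni setup for pod %s: %w", pod.Name, err)
	}
	return ip, nil
}

func main() {
	plugin := staticPlugin{ip: "10.244.1.5"}
	pods := []PodSpec{{Name: "a", HostNetwork: true}, {Name: "b"}}
	for _, pod := range pods {
		res, err := setupSandboxNetwork(pod, plugin)
		fmt.Println(pod.Name, res, err)
	}
}
```

The point of the sketch is only that the CNI plugin is never consulted for a hostNetwork pod, which matters again in the closing discussion of path length.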

flannel host-gateway

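Assume the CNI is flannel in host-gateway mode. Each pod then attaches through a veth pair to a CNI bridge on its node, and this is the pod network model used in the rest of the article.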

KubeVirt networking

Components involved in KubeVirt networking

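As covered in earlier articles, creating a virtual machine on KubeVirt only requires creating a VMI (Virtual Machine Instance) object; virt-controller then creates a pod from the information in the VMI. This article calls that pod the VMI pod. The VMI pod contains the KubeVirt component virt-launcher as well as the virtualization components libvirtd and qemu. VM networking mainly involves virt-launcher and virt-handler, the latter deployed as a DaemonSet.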

The following is based on [email protected]

Source code analysis

Assume the following example VMI YAML:

apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  name: test
spec:
  domain:
    devices:
      interfaces:
      - masquerade: {}
        name: default
      # ...
  networks:
  - name: default
    pod: {}
  # ...

Two parameters in the VMI object are closely tied to networking:

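  • spec.domain.devices.interfaces: defines how the guest interface is connected. The supported types are bridge, slirp, masquerade, sriov, and macvtap (exactly one of the five). This article covers only bridge and masquerade.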

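  • spec.networks: defines the source of the network the VM connects to. The supported types are pod and multus (exactly one of the two). This article covers only pod.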

The name fields of these two parameters must match.
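
A minimal sketch of that matching rule, assuming simplified types: Iface, Network, and checkNames are hypothetical names made up for this illustration and are not KubeVirt's own validation code.

```go
package main

import "fmt"

// Iface and Network are hypothetical, trimmed-down mirrors of the entries in
// spec.domain.devices.interfaces and spec.networks.
type Iface struct{ Name, Binding string }
type Network struct{ Name, Source string }

// checkNames reports the first interface that has no network of the same name.
func checkNames(ifaces []Iface, networks []Network) error {
	known := make(map[string]bool, len(networks))
	for _, n := range networks {
		known[n.Name] = true
	}
	for _, i := range ifaces {
		if !known[i.Name] {
			return fmt.Errorf("interface %q has no matching network", i.Name)
		}
	}
	return nil
}

func main() {
	networks := []Network{{Name: "default", Source: "pod"}}
	ok := []Iface{{Name: "default", Binding: "masquerade"}}
	bad := []Iface{{Name: "other", Binding: "bridge"}}
	fmt.Println(checkNames(ok, networks))
	fmt.Println(checkNames(bad, networks))
}
```

In the example YAML above both lists use the name default, so the check passes.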

Starting from this YAML, the following sections walk through the virt-handler and virt-launcher source code to see how KubeVirt networking is implemented.

virt-handler

After a VMI object is created, virt-handler's entry point for network initialization is vmUpdateHelperDefault:

// pkg/virt-handler/vm.go
func (d *VirtualMachineController) vmUpdateHelperDefault(origVMI *v1.VirtualMachineInstance, domainExists bool) error {
    /*...*/
    if !vmi.IsRunning() && !vmi.IsFinal() {
        /*...*/
        if err := d.setupNetwork(vmi); err != nil {
          return fmt.Errorf("failed to configure vmi network: %w", err)
        }
        /*...*/
    }
    /*...*/
}

// pkg/virt-handler/vm.go
func (d *VirtualMachineController) setupNetwork(vmi *v1.VirtualMachineInstance) error {
    /*...*/
    return d.netConf.Setup(vmi, isolationRes.Pid(), func() error {
        if requiresDeviceClaim {
            if err := d.claimDeviceOwnership(rootMount, "vhost-net"); err != nil {
                return neterrors.CreateCriticalNetworkError(fmt.Errorf("failed to set up vhost-net device, %s", err))
            }
        }
        return nil
    })
}

// pkg/network/setup/netconf.go
func (c *NetConf) Setup(vmi *v1.VirtualMachineInstance, launcherPid int, preSetup func() error) error {
    /*...*/
    err := ns.Do(func() error {
        // Run phase 1 of the network setup
        return netConfigurator.SetupPodNetworkPhase1(launcherPid)
    })
    if err != nil {
        return fmt.Errorf("setup failed, err: %w", err)
    }
    /*...*/
}

// pkg/network/setup/network.go
func (n *VMNetworkConfigurator) SetupPodNetworkPhase1(pid int) error {
    launcherPID := &pid
    nics, err := n.getPhase1NICs(launcherPID)
    if err != nil {
        return err
    }
    for _, nic := range nics {
        if err := nic.PlugPhase1(); err != nil {
            return fmt.Errorf("failed plugging phase1 at nic '%s': %w", nic.podInterfaceName, err)
        }
    }
    return nil
}

SetupPodNetworkPhase1 has two steps: getPhase1NICs collects the NIC information, and then PlugPhase1 is executed for each of those NICs.

getPhase1NICs

First, getPhase1NICs:

// pkg/network/setup/network.go
func (v VMNetworkConfigurator) getPhase1NICs(launcherPID *int) ([]podNIC, error) {
    /*...*/
    for i, _ := range v.vmi.Spec.Networks {
        nic, err := newPhase1PodNIC(v.vmi, &v.vmi.Spec.Networks[i], v.handler, v.cacheFactory, launcherPID)
        if err != nil {
            return nil, err
        }
        nics = append(nics, *nic)
    }
    return nics, nil
}

func newPhase1PodNIC(vmi *v1.VirtualMachineInstance, network *v1.Network, handler netdriver.NetworkHandler, cacheFactory cache.InterfaceCacheFactory, launcherPID *int) (*podNIC, error) {
    // Look up the pod NIC information from spec.domain.devices.interfaces and spec.networks
    podnic, err := newPodNIC(vmi, network, handler, cacheFactory, launcherPID)
    if err != nil {
        return nil, err
    }

    if launcherPID == nil {
        return nil, fmt.Errorf("missing launcher PID to construct infra configurators")
    }

    // The infraConfigurator is initialized only for the bridge and masquerade bindings
    if podnic.vmiSpecIface.Bridge != nil {
        podnic.infraConfigurator = infraconfigurators.NewBridgePodNetworkConfigurator(
            podnic.vmi,
            podnic.vmiSpecIface,
            generateInPodBridgeInterfaceName(podnic.podInterfaceName),
            *podnic.launcherPID,
            podnic.handler)
    } else if podnic.vmiSpecIface.Masquerade != nil {
        podnic.infraConfigurator = infraconfigurators.NewMasqueradePodNetworkConfigurator(
            podnic.vmi,
            podnic.vmiSpecIface,
            generateInPodBridgeInterfaceName(podnic.podInterfaceName),
            podnic.vmiSpecNetwork,
            *podnic.launcherPID,
            podnic.handler)
    }
    return podnic, nil
}

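newPodNIC looks up the pod NIC information from spec.domain.devices.interfaces and spec.networks. Note that if the spec.networks entry is of type pod, the returned pod NIC name defaults to eth0; if it is multus with multus.default=false, the returned pod NIC name is net%d, where %d is the index of the multus network in spec.networks.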
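
The naming rule can be sketched as follows. The Network type and podInterfaceNames function are hypothetical, and the sketch assumes that the net%d index counts only non-default multus networks and starts at 1; that detail is an assumption, so check it against the release you are reading.

```go
package main

import "fmt"

// Network is a hypothetical, trimmed-down mirror of one spec.networks entry.
type Network struct {
	Name          string
	Multus        bool // true when the source is multus, false when it is pod
	MultusDefault bool // mirrors multus.default
}

// podInterfaceNames maps each network name to the interface name expected
// inside the pod. Assumption: secondary multus networks are numbered from 1.
func podInterfaceNames(networks []Network) map[string]string {
	names := make(map[string]string, len(networks))
	secondary := 0
	for _, n := range networks {
		if n.Multus && !n.MultusDefault {
			secondary++
			names[n.Name] = fmt.Sprintf("net%d", secondary)
			continue
		}
		// pod networks and the default multus network use the primary NIC
		names[n.Name] = "eth0"
	}
	return names
}

func main() {
	networks := []Network{
		{Name: "default"},
		{Name: "blue", Multus: true},
		{Name: "red", Multus: true},
	}
	fmt.Println(podInterfaceNames(networks))
}
```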

PlugPhase1

Next, PlugPhase1:

// pkg/network/setup/podnic.go
func (l *podNIC) PlugPhase1() error {
    // SR-IOV NICs need no handling here
    if l.vmiSpecIface.SRIOV != nil {
        return nil
    }
    
    /*...*/
    
    // As seen above, only bridge and masquerade initialize this field,
    // so every other binding type returns here
    if l.infraConfigurator == nil {
        return nil
    }

    if err := l.infraConfigurator.DiscoverPodNetworkInterface(l.podInterfaceName); err != nil {
        return err
    }
    
    dhcpConfig := l.infraConfigurator.GenerateNonRecoverableDHCPConfig()
    if dhcpConfig != nil {
        log.Log.V(4).Infof("The generated dhcpConfig: %s", dhcpConfig.String())
        if err := l.cacheFactory.CacheDHCPConfigForPid(getPIDString(l.launcherPID)).Write(l.podInterfaceName, dhcpConfig); err != nil {
            return fmt.Errorf("failed to save DHCP configuration: %w", err)
        }
    }

    domainIface := l.infraConfigurator.GenerateNonRecoverableDomainIfaceSpec()
    if domainIface != nil {
        log.Log.V(4).Infof("The generated libvirt domain interface: %+v", *domainIface)
        if err := l.storeCachedDomainIface(*domainIface); err != nil {
            return fmt.Errorf("failed to save libvirt domain interface: %w", err)
        }
    }
    
    /*...*/
    // preparePodNetworkInterface must be called *after* the Generate
    // methods since it mutates the pod interface from which those
    // generator methods get their info from.
    if err := l.infraConfigurator.PreparePodNetworkInterface(); err != nil {
        log.Log.Reason(err).Error("failed to prepare pod networking")
        return errors.CreateCriticalNetworkError(err)
    }
    
    /*...*/
}

PlugPhase1 does very little when the entry in spec.domain.devices.interfaces is of type sriov, slirp, or macvtap. For bridge and masquerade, the work goes through the infraConfigurator interface, defined as follows:

// pkg/network/infraconfigurators/common.go
type PodNetworkInfraConfigurator interface {
    DiscoverPodNetworkInterface(podIfaceName string) error
    PreparePodNetworkInterface() error
    GenerateNonRecoverableDomainIfaceSpec() *api.Interface
    GenerateNonRecoverableDHCPConfig() *cache.DHCPConfig
}

Both bridge and masquerade implement PodNetworkInfraConfigurator; the implementations follow.

bridge
  • DiscoverPodNetworkInterface
// pkg/network/infraconfigurators/bridge.go
func (b *BridgePodNetworkConfigurator) DiscoverPodNetworkInterface(podIfaceName string) error {
    // Find the pod's NIC link device in the VMI pod by its interface name
    link, err := b.handler.LinkByName(podIfaceName)
    if err != nil {
        log.Log.Reason(err).Errorf("failed to get a link for interface: %s", podIfaceName)
        return err
    }
    b.podNicLink = link

    // Read the IP addresses from the link device
    addrList, err := b.handler.AddrList(b.podNicLink, netlink.FAMILY_V4)
    if err != nil {
        log.Log.Reason(err).Errorf("failed to get an ip address for %s", podIfaceName)
        return err
    }
    if len(addrList) == 0 {
        // No IP is configured: mark IPAM as disabled
        b.ipamEnabled = false
    } else {
        // An IP was found: mark IPAM as enabled
        b.podIfaceIP = addrList[0]
        b.ipamEnabled = true
        // Record the routes of the pod NIC
        if err := b.learnInterfaceRoutes(); err != nil {
            return err
        }
    }

    // Derive the tap device name from the VMI pod NIC name (eth0 gives tap0)
    b.tapDeviceName = virtnetlink.GenerateTapDeviceName(podIfaceName)

    // Try to read a user-specified MAC address from vmi.spec.domain.devices.interfaces.
    // If none is set (as in the VMI YAML above), fall back to the pod NIC's own MAC address
    b.vmMac, err = virtnetlink.RetrieveMacAddressFromVMISpecIface(b.vmiSpecIface)
    if err != nil {
        return err
    }
    if b.vmMac == nil {
        b.vmMac = &b.podNicLink.Attrs().HardwareAddr
    }

    return nil
}
  • GenerateNonRecoverableDHCPConfig
// pkg/network/infraconfigurators/bridge.go
func (b *BridgePodNetworkConfigurator) GenerateNonRecoverableDHCPConfig() *cache.DHCPConfig {
    // The pod NIC has no IP: return a config with IPAM disabled
    if !b.ipamEnabled {
        return &cache.DHCPConfig{IPAMDisabled: true}
    }

    dhcpConfig := &cache.DHCPConfig{
        MAC:          *b.vmMac,
        IPAMDisabled: !b.ipamEnabled,
        IP:           b.podIfaceIP,
    }

    // The pod NIC has an IP and routes: add the routes to the DHCP config
    if b.ipamEnabled && len(b.podIfaceRoutes) > 0 {
        log.Log.V(4).Infof("got to add %d routes to the DhcpConfig", len(b.podIfaceRoutes))
        b.decorateDhcpConfigRoutes(dhcpConfig)
    }
    return dhcpConfig
}

// Use the qualifying pod NIC routes and the gateway as DHCP configuration
func (b *BridgePodNetworkConfigurator) decorateDhcpConfigRoutes(dhcpConfig *cache.DHCPConfig) {
    log.Log.V(4).Infof("the default route is: %s", b.podIfaceRoutes[0].String())
    dhcpConfig.Gateway = b.podIfaceRoutes[0].Gw
    if len(b.podIfaceRoutes) > 1 {
        dhcpRoutes := virtnetlink.FilterPodNetworkRoutes(b.podIfaceRoutes, dhcpConfig)
        dhcpConfig.Routes = &dhcpRoutes
    }
}
  • GenerateNonRecoverableDomainIfaceSpec
// pkg/network/infraconfigurators/bridge.go
// Build an interface object from the MAC address
func (b *BridgePodNetworkConfigurator) GenerateNonRecoverableDomainIfaceSpec() *api.Interface {
    return &api.Interface{
        MAC: &api.MAC{MAC: b.vmMac.String()},
    }
}
  • PreparePodNetworkInterface

The three functions above only prepare data; PreparePodNetworkInterface is the core of the logic.

// pkg/network/infraconfigurators/bridge.go
func (b *BridgePodNetworkConfigurator) PreparePodNetworkInterface() error {
    // Bring the pod NIC down first
    if err := b.handler.LinkSetDown(b.podNicLink); err != nil {
        log.Log.Reason(err).Errorf("failed to bring link down for interface: %s", b.podNicLink.Attrs().Name)
        return err
    }

    // If IPAM is enabled (the pod NIC has an IP)
    if b.ipamEnabled {
        // Delete the IP from the pod NIC
        err := b.handler.AddrDel(b.podNicLink, &b.podIfaceIP)
        if err != nil {
            log.Log.Reason(err).Errorf("failed to delete address for interface: %s", b.podNicLink.Attrs().Name)
            return err
        }

        // Rename the pod NIC, create a dummy NIC with the original name,
        // and assign the original IP to the dummy NIC
        if err := b.switchPodInterfaceWithDummy(); err != nil {
            log.Log.Reason(err).Error("failed to switch pod interface with a dummy")
            return err
        }

        // Set arp_ignore=1 to avoid
        // the dummy interface being seen by Duplicate Address Detection (DAD).
        // Without this, some VMs will lose their ip address after a few
        // minutes.
        if err := b.handler.ConfigureIpv4ArpIgnore(); err != nil {
            log.Log.Reason(err).Errorf("failed to set arp_ignore=1")
            return err
        }
    }

    // Set a random MAC address on the pod NIC
    if _, err := b.handler.SetRandomMac(b.podNicLink.Attrs().Name); err != nil {
        return err
    }

    // Create a bridge device
    if err := b.createBridge(); err != nil {
        return err
    }

    tapOwner := netdriver.LibvirtUserAndGroupId
    if util.IsNonRootVMI(b.vmi) {
        tapOwner = strconv.Itoa(util.NonRootUID)
    }
    
    // Create a tap device with the virt-chroot command and attach it to the bridge
    err := createAndBindTapToBridge(b.handler, b.tapDeviceName, b.bridgeInterfaceName, b.launcherPID, b.podNicLink.Attrs().MTU, tapOwner, b.vmi)
    if err != nil {
        log.Log.Reason(err).Errorf("failed to create tap device named %s", b.tapDeviceName)
        return err
    }

    // Bring the pod NIC back up
    if err := b.handler.LinkSetUp(b.podNicLink); err != nil {
        log.Log.Reason(err).Errorf("failed to bring link up for interface: %s", b.podNicLink.Attrs().Name)
        return err
    }

    // Disable MAC learning on the pod NIC
    if err := b.handler.LinkSetLearningOff(b.podNicLink); err != nil {
        log.Log.Reason(err).Errorf("failed to disable mac learning for interface: %s", b.podNicLink.Attrs().Name)
        return err
    }

    return nil
}

// pkg/network/infraconfigurators/bridge.go
func (b *BridgePodNetworkConfigurator) switchPodInterfaceWithDummy() error {
    originalPodInterfaceName := b.podNicLink.Attrs().Name
    newPodInterfaceName := virtnetlink.GenerateNewBridgedVmiInterfaceName(originalPodInterfaceName)
    dummy := &netlink.Dummy{LinkAttrs: netlink.LinkAttrs{Name: originalPodInterfaceName}}

    // Rename the pod NIC first (for example, eth0 becomes eth0-nic)
    err := b.handler.LinkSetName(b.podNicLink, newPodInterfaceName)
    if err != nil {
        log.Log.Reason(err).Errorf("failed to rename interface : %s", b.podNicLink.Attrs().Name)
        return err
    }

    // Refresh the in-memory podNicLink
    b.podNicLink, err = b.handler.LinkByName(newPodInterfaceName)
    if err != nil {
        log.Log.Reason(err).Errorf("failed to get a link for interface: %s", newPodInterfaceName)
        return err
    }

    // Create a dummy NIC named after the original NIC (for example, eth0)
    err = b.handler.LinkAdd(dummy)
    if err != nil {
        log.Log.Reason(err).Errorf("failed to create dummy interface : %s", originalPodInterfaceName)
        return err
    }

    // Assign the original pod NIC IP to the dummy NIC
    err = b.handler.AddrReplace(dummy, &b.podIfaceIP)
    if err != nil {
        log.Log.Reason(err).Errorf("failed to replace original IP address to dummy interface: %s", originalPodInterfaceName)
        return err
    }

    return nil
}

// pkg/network/infraconfigurators/bridge.go
func (b *BridgePodNetworkConfigurator) createBridge() error {
    // Create a bridge device
    bridge := &netlink.Bridge{
        LinkAttrs: netlink.LinkAttrs{
            Name: b.bridgeInterfaceName,
        },
    }
    err := b.handler.LinkAdd(bridge)
    if err != nil {
        log.Log.Reason(err).Errorf("failed to create a bridge")
        return err
    }

    // Attach the pod NIC to the bridge
    err = b.handler.LinkSetMaster(b.podNicLink, bridge)
    if err != nil {
        log.Log.Reason(err).Errorf("failed to connect interface %s to bridge %s", b.podNicLink.Attrs().Name, bridge.Name)
        return err
    }

    // Bring the bridge up
    err = b.handler.LinkSetUp(bridge)
    if err != nil {
        log.Log.Reason(err).Errorf("failed to bring link up for interface: %s", b.bridgeInterfaceName)
        return err
    }

    // Build a fake IP: 169.254.75.1%d/32,
    // where %d is the index in spec.domain.devices.interfaces.
    // With a single NIC this is 169.254.75.10/32
    addr := virtnetlink.GetFakeBridgeIP(b.vmi.Spec.Domain.Devices.Interfaces, b.vmiSpecIface)
    fakeaddr, _ := b.handler.ParseAddr(addr)

    // Add the fake IP to the bridge
    if err := b.handler.AddrAdd(bridge, fakeaddr); err != nil {
        log.Log.Reason(err).Errorf("failed to set bridge IP")
        return err
    }

    // Disable TX checksum offload on the bridge
    if err = b.handler.DisableTXOffloadChecksum(b.bridgeInterfaceName); err != nil {
        log.Log.Reason(err).Error("failed to disable TX offload checksum on bridge interface")
        return err
    }

    return nil
}
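
The bridge binding relies on three naming conventions that appear in the comments above: eth0 maps to tap0, eth0 is renamed to eth0-nic, and the bridge gets the fake IP 169.254.75.1%d/32. The sketch below reproduces them with hypothetical helpers (tapName, renamedNIC, fakeBridgeIP). They are not the virtnetlink functions themselves, and the prefix handling in tapName is an assumption.

```go
package main

import "fmt"

// tapName follows the example given above: eth0 -> tap0.
// Assumption: the pod interface name starts with a three-letter prefix
// (eth, net); shorter names are simply prefixed with "tap".
func tapName(podIface string) string {
	if len(podIface) <= 3 {
		return "tap" + podIface
	}
	return "tap" + podIface[3:]
}

// renamedNIC follows the example given above: eth0 -> eth0-nic.
func renamedNIC(podIface string) string {
	return podIface + "-nic"
}

// fakeBridgeIP follows the 169.254.75.1%d/32 pattern, where index is the
// position of the interface in spec.domain.devices.interfaces.
func fakeBridgeIP(index int) string {
	return fmt.Sprintf("169.254.75.1%d/32", index)
}

func main() {
	fmt.Println(tapName("eth0"), renamedNIC("eth0"), fakeBridgeIP(0))
}
```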
masquerade
  • DiscoverPodNetworkInterface
// pkg/network/infraconfigurators/masquerade.go
func (b *MasqueradePodNetworkConfigurator) DiscoverPodNetworkInterface(podIfaceName string) error {
    // Look up the pod NIC device
    link, err := b.handler.LinkByName(podIfaceName)
    if err != nil {
        log.Log.Reason(err).Errorf("failed to get a link for interface: %s", podIfaceName)
        return err
    }
    b.podNicLink = link

    // Compute the VM IPv4 address and the gateway address.
    // The default IPv4 network is 10.0.2.0/24;
    // vmi.spec.networks.pod.vmNetworkCIDR overrides it when set
    if err := b.computeIPv4GatewayAndVmIp(); err != nil {
        return err
    }

    // Check whether IPv6 is enabled on the pod NIC
    ipv6Enabled, err := b.handler.IsIpv6Enabled(podIfaceName)
    if err != nil {
        log.Log.Reason(err).Errorf(ipVerifyFailFmt, podIfaceName)
        return err
    }
    if ipv6Enabled {
        // Compute the VM IPv6 address and the gateway address.
        // The default IPv6 network is fd10:0:2::/120;
        // vmi.spec.networks.pod.vmIPv6NetworkCIDR overrides it when set
        if err := b.discoverIPv6GatewayAndVmIp(); err != nil {
            return err
        }
    }

    return nil
}
  • GenerateNonRecoverableDHCPConfig
// pkg/network/infraconfigurators/masquerade.go
// masquerade caches no DHCP config in phase 1
func (b *MasqueradePodNetworkConfigurator) GenerateNonRecoverableDHCPConfig() *cache.DHCPConfig {
    return nil
}
  • GenerateNonRecoverableDomainIfaceSpec
// pkg/network/infraconfigurators/masquerade.go
// Nothing to do for masquerade
func (b *MasqueradePodNetworkConfigurator) GenerateNonRecoverableDomainIfaceSpec() *api.Interface {
    return nil
}
  • PreparePodNetworkInterface
// pkg/network/infraconfigurators/masquerade.go
func (b *MasqueradePodNetworkConfigurator) PreparePodNetworkInterface() error {
    // Create a bridge device
    if err := b.createBridge(); err != nil {
        return err
    }

    tapOwner := netdriver.LibvirtUserAndGroupId
    if util.IsNonRootVMI(b.vmi) {
        tapOwner = strconv.Itoa(util.NonRootUID)
    }
    
    // Create a tap device with the virt-chroot command and attach it to the bridge
    tapDeviceName := virtnetlink.GenerateTapDeviceName(b.podNicLink.Attrs().Name)
    err := createAndBindTapToBridge(b.handler, tapDeviceName, b.bridgeInterfaceName, b.launcherPID, b.podNicLink.Attrs().MTU, tapOwner, b.vmi)
    if err != nil {
        log.Log.Reason(err).Errorf("failed to create tap device named %s", tapDeviceName)
        return err
    }

    // Create the IPv4 NAT rules with nft or iptables
    err = b.createNatRules(iptables.ProtocolIPv4)
    if err != nil {
        log.Log.Reason(err).Errorf("failed to create ipv4 nat rules for vm error: %v", err)
        return err
    }

    ipv6Enabled, err := b.handler.IsIpv6Enabled(b.podNicLink.Attrs().Name)
    if err != nil {
        log.Log.Reason(err).Errorf(ipVerifyFailFmt, b.podNicLink.Attrs().Name)
        return err
    }
    if ipv6Enabled {
        // Create the IPv6 NAT rules with nft or iptables
        err = b.createNatRules(iptables.ProtocolIPv6)
        if err != nil {
            log.Log.Reason(err).Errorf("failed to create ipv6 nat rules for vm error: %v", err)
            return err
        }
    }

    return nil
}

// pkg/network/infraconfigurators/masquerade.go
func (b *MasqueradePodNetworkConfigurator) createBridge() error {
    // The bridge gets a fixed MAC address: 02:00:00:00:00:00
    mac, err := net.ParseMAC(link.StaticMasqueradeBridgeMAC)
    if err != nil {
        return err
    }
    
    // Create a bridge
    bridge := &netlink.Bridge{
        LinkAttrs: netlink.LinkAttrs{
            Name:         b.bridgeInterfaceName,
            MTU:          b.podNicLink.Attrs().MTU,
            HardwareAddr: mac,
        },
    }
    err = b.handler.LinkAdd(bridge)
    if err != nil {
        log.Log.Reason(err).Errorf("failed to create a bridge")
        return err
    }

    // Bring the bridge up
    if err := b.handler.LinkSetUp(bridge); err != nil {
        log.Log.Reason(err).Errorf("failed to bring link up for interface: %s", b.bridgeInterfaceName)
        return err
    }

    // Assign the previously computed VM gateway address to the bridge
    if err := b.handler.AddrAdd(bridge, b.vmGatewayAddr); err != nil {
        log.Log.Reason(err).Errorf("failed to set bridge IP")
        return err
    }
    ipv6Enabled, err := b.handler.IsIpv6Enabled(b.podNicLink.Attrs().Name)
    if err != nil {
        log.Log.Reason(err).Errorf(ipVerifyFailFmt, b.podNicLink.Attrs().Name)
        return err
    }
    if ipv6Enabled {
        // If IPv6 is enabled, also assign the IPv6 gateway address to the bridge
        if err := b.handler.AddrAdd(bridge, b.vmGatewayIpv6Addr); err != nil {
            log.Log.Reason(err).Errorf("failed to set bridge IPv6")
            return err
        }
    }
    
    // Disable TX checksum offload on the bridge
    if err = b.handler.DisableTXOffloadChecksum(b.bridgeInterfaceName); err != nil {
        log.Log.Reason(err).Error("failed to disable TX offload checksum on bridge interface")
        return err
    }

    return nil
}

// pkg/network/infraconfigurators/masquerade.go
func (b *MasqueradePodNetworkConfigurator) createNatRules(protocol iptables.Protocol) error {
    // Enable IPv4/IPv6 forwarding inside the pod
    err := b.handler.ConfigureIpForwarding(protocol)
    if err != nil {
        log.Log.Reason(err).Errorf("failed to configure ip forwarding")
        return err
    }

    // Set up the NAT rules with nft or iptables
    if b.handler.NftablesLoad(protocol) == nil {
        return b.createNatRulesUsingNftables(protocol)
    } else if b.handler.HasNatIptables(protocol) {
        return b.createNatRulesUsingIptables(protocol)
    }
    return fmt.Errorf("Couldn't configure ip nat rules")
}

virt-launcher

virt-launcher's entry point for network initialization is SyncVirtualMachine (virt-launcher exposes a gRPC interface; the actual caller is the virt-handler process):

// pkg/virt-launcher/virtwrap/cmd-server/server.go
func (l *Launcher) SyncVirtualMachine(_ context.Context, request *cmdv1.VMIRequest) (*cmdv1.Response, error) {
    /*...*/
    if _, err := l.domainManager.SyncVMI(vmi, l.allowEmulation, request.Options); err != nil {
        log.Log.Object(vmi).Reason(err).Errorf("Failed to sync vmi")
        response.Success = false
        response.Message = getErrorMessage(err)
        return response, nil
    }
    /*...*/
}

// pkg/virt-launcher/virtwrap/manager.go
func (l *LibvirtDomainManager) SyncVMI(vmi *v1.VirtualMachineInstance, allowEmulation bool, options *cmdv1.VirtualMachineOptions) (*api.DomainSpec, error) {
    /*...*/
    dom, err := l.virConn.LookupDomainByName(domain.Spec.Name)
    if err != nil {
        // We need the domain but it does not exist, so create it
        if domainerrors.IsNotFound(err) {
            domain, err = l.preStartHook(vmi, domain, false)
           /*...*/
        }
        /*...*/
    }
    /*...*/
}

// pkg/virt-launcher/virtwrap/manager.go
func (l *LibvirtDomainManager) preStartHook(vmi *v1.VirtualMachineInstance, domain *api.Domain, generateEmptyIsos bool) (*api.Domain, error) {
    /*...*/
    err = netsetup.NewVMNetworkConfigurator(vmi, l.networkCacheStoreFactory).SetupPodNetworkPhase2(domain)
    if err != nil {
        return domain, fmt.Errorf("preparing the pod network failed: %v", err)
    }
    /*...*/
}

// pkg/network/setup/network.go
func (n *VMNetworkConfigurator) SetupPodNetworkPhase2(domain *api.Domain) error {
    nics, err := n.getPhase2NICs(domain)
    if err != nil {
        return err
    }
    for _, nic := range nics {
        if err := nic.PlugPhase2(domain); err != nil {
            return fmt.Errorf("failed plugging phase2 at nic '%s': %w", nic.podInterfaceName, err)
        }
    }
    return nil
}

Phase 2 in virt-launcher also has two steps: getPhase2NICs collects the pod NIC information, and then nic.PlugPhase2 is executed for each NIC.

getPhase2NICs
// pkg/network/setup/network.go
func (v VMNetworkConfigurator) getPhase2NICs(domain *api.Domain) ([]podNIC, error) {
    nics := []podNIC{}

    if len(v.vmi.Spec.Domain.Devices.Interfaces) == 0 {
        return nics, nil
    }

    for i, _ := range v.vmi.Spec.Networks {
        nic, err := newPhase2PodNIC(v.vmi, &v.vmi.Spec.Networks[i], v.handler, v.cacheFactory, domain)
        if err != nil {
            return nil, err
        }
        nics = append(nics, *nic)
    }
    return nics, nil

}

// pkg/network/setup/podnic.go
func newPhase2PodNIC(vmi *v1.VirtualMachineInstance, network *v1.Network, handler netdriver.NetworkHandler, cacheFactory cache.InterfaceCacheFactory, domain *api.Domain) (*podNIC, error) {
    podnic, err := newPodNIC(vmi, network, handler, cacheFactory, nil)
    if err != nil {
        return nil, err
    }

    podnic.dhcpConfigurator = podnic.newDHCPConfigurator()
    podnic.domainGenerator = podnic.newLibvirtSpecGenerator(domain)

    return podnic, nil
}
PlugPhase2
// pkg/network/setup/podnic.go
func (l *podNIC) PlugPhase2(domain *api.Domain) error {
    precond.MustNotBeNil(domain)

    // SR-IOV: return immediately
    if l.vmiSpecIface.SRIOV != nil {
        return nil
    }

    if err := l.domainGenerator.Generate(); err != nil {
        log.Log.Reason(err).Critical("failed to create libvirt configuration")
    }

    // Only bridge and masquerade enter this branch
    if l.dhcpConfigurator != nil {
        dhcpConfig, err := l.dhcpConfigurator.Generate()
        if err != nil {
            log.Log.Reason(err).Errorf("failed to get a dhcp configuration for: %s", l.podInterfaceName)
            return err
        }
        log.Log.V(4).Infof("The imported dhcpConfig: %s", dhcpConfig.String())
        if err := l.dhcpConfigurator.EnsureDHCPServerStarted(l.podInterfaceName, *dhcpConfig, l.vmiSpecIface.DHCPOptions); err != nil {
            log.Log.Reason(err).Criticalf("failed to ensure dhcp service running for: %s", l.podInterfaceName)
            panic(err)
        }
    }

    return nil
}

// pkg/network/dhcp/configurator.go
func (d *configurator) EnsureDHCPServerStarted(podInterfaceName string, dhcpConfig cache.DHCPConfig, dhcpOptions *v1.DHCPOptions) error {
    if dhcpConfig.IPAMDisabled {
        return nil
    }
    dhcpStartedFile := d.getDHCPStartedFilePath(podInterfaceName)
    _, err := os.Stat(dhcpStartedFile)
    if os.IsNotExist(err) {
        // Start the DHCP server
        if err := d.handler.StartDHCP(&dhcpConfig, d.advertisingIfaceName, dhcpOptions); err != nil {
            return fmt.Errorf("failed to start DHCP server for interface %s", podInterfaceName)
        }
        newFile, err := os.Create(dhcpStartedFile)
        if err != nil {
            return fmt.Errorf("failed to create dhcp started file %s: %s", dhcpStartedFile, err)
        }
        newFile.Close()
    }
    return nil
}

// pkg/network/driver/common.go
func (h *NetworkUtilsHandler) StartDHCP(nic *cache.DHCPConfig, bridgeInterfaceName string, dhcpOptions *v1.DHCPOptions) error {
    /*...*/
    // Start an IPv4 DHCP server in a goroutine
    go func() {
        if err = DHCPServer(
            nic.MAC,
            nic.IP.IP,
            nic.IP.Mask,
            bridgeInterfaceName,
            nic.AdvertisingIPAddr,
            nic.Gateway,
            nameservers,
            nic.Routes,
            searchDomains,
            nic.Mtu,
            dhcpOptions,
        ); err != nil {
            log.Log.Errorf("failed to run DHCP: %v", err)
            panic(err)
        }
    }()
    
    if nic.IPv6.IPNet != nil {
        // Start an IPv6 DHCP server
        go func() {
            if err = DHCPv6Server(
                nic.IPv6.IP,
                bridgeInterfaceName,
            ); err != nil {
                log.Log.Reason(err).Error("failed to run DHCPv6")
                panic(err)
            }
        }()
    }

    return nil
}

Generate is implemented differently for bridge and masquerade:

  • bridge
// pkg/network/dhcp/bridge.go
func (d *BridgeConfigGenerator) Generate() (*cache.DHCPConfig, error) {
    dhcpConfig, err := d.cacheFactory.CacheDHCPConfigForPid(d.launcherPID).Read(d.podInterfaceName)
    if err != nil {
        return nil, err
    }

    if dhcpConfig.IPAMDisabled {
        return dhcpConfig, nil
    }

    dhcpConfig.Name = d.podInterfaceName

    // As noted in the bridge logic above, the bridge is given a fake IP; fetch it here
    fakeBridgeIP := virtnetlink.GetFakeBridgeIP(d.vmiSpecIfaces, d.vmiSpecIface)
    fakeServerAddr, _ := netlink.ParseAddr(fakeBridgeIP)
    dhcpConfig.AdvertisingIPAddr = fakeServerAddr.IP

    newPodNicName := virtnetlink.GenerateNewBridgedVmiInterfaceName(d.podInterfaceName)
    podNicLink, err := d.handler.LinkByName(newPodNicName)
    if err != nil {
        return nil, err
    }
    
    // The DHCP MTU matches the pod NIC MTU
    dhcpConfig.Mtu = uint16(podNicLink.Attrs().MTU)
    dhcpConfig.Subdomain = d.subdomain

    return dhcpConfig, nil
}
  • masquerade
// pkg/network/dhcp/masquerade.go
func (d *MasqueradeConfigGenerator) Generate() (*cache.DHCPConfig, error) {
    dhcpConfig := &cache.DHCPConfig{}
    // Look up the pod NIC
    podNicLink, err := d.handler.LinkByName(d.podInterfaceName)
    if err != nil {
        return nil, err
    }

    dhcpConfig.Name = podNicLink.Attrs().Name
    dhcpConfig.Subdomain = d.subdomain
    dhcpConfig.Mtu = uint16(podNicLink.Attrs().MTU)

    // Get the masquerade IPv4 gateway and VM IP
    ipv4Gateway, ipv4, err := virtnetlink.GenerateMasqueradeGatewayAndVmIPAddrs(d.vmiSpecNetwork, iptables.ProtocolIPv4)
    if err != nil {
        return nil, err
    }
    dhcpConfig.IP = *ipv4
    dhcpConfig.AdvertisingIPAddr = ipv4Gateway.IP.To4()
    dhcpConfig.Gateway = ipv4Gateway.IP.To4()

    ipv6Enabled, err := d.handler.IsIpv6Enabled(d.podInterfaceName)
    if err != nil {
        log.Log.Reason(err).Errorf("failed to verify whether ipv6 is configured on %s", d.podInterfaceName)
        return nil, err
    }

    if ipv6Enabled {
        // Get the masquerade IPv6 gateway and VM IP
        ipv6Gateway, ipv6, err := virtnetlink.GenerateMasqueradeGatewayAndVmIPAddrs(d.vmiSpecNetwork, iptables.ProtocolIPv6)
        if err != nil {
            return nil, err
        }
        dhcpConfig.IPv6 = *ipv6
        dhcpConfig.AdvertisingIPv6Addr = ipv6Gateway.IP.To16()
    }

    return dhcpConfig, nil
}

Beyond the above, the steps that pass the network information to the VM through libvirtd are not covered here and are left for the reader to study.

libvirt

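Once virt-handler and virt-launcher have prepared the bridge, the tap device, the DHCP server, and the other resources, KubeVirt assembles this data into libvirt XML and calls the libvirtd API to create the VM, which completes the virtual machine.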

Summary

The sections above walked through KubeVirt networking at the source level. This section recaps the same flow step by step to make it easier to follow. The CNI is again flannel in host-gateway mode.

pod+bridge

The VMI YAML is as follows:

apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  name: test
spec:
  domain:
    devices:
      interfaces:
      - bridge: {} # note: bridge binding
        name: default
      # ...
  networks:
  - name: default
    pod: {}
  # ...

virt-handler

1. Bring the eth0 NIC down and delete its IP (the deleted IP is kept in memory).

2. Rename the original eth0 NIC to eth0-nic, create a dummy NIC named eth0, assign the deleted IP to the dummy NIC, set arp_ignore, and change the MAC address of eth0-nic to a random one.

3. Create a bridge, attach eth0-nic to it, bring the bridge up, give it a fake IP, and disable TX checksum offload on the bridge.

4. Create a tap device and attach it to the bridge.

5. Bring eth0-nic up and disable MAC address learning on it.

virt-launcher

virt-launcher does comparatively little: it only starts a DHCP server from which the VM later obtains its IP.

libvirt

libvirt creates the VM and attaches the tap device to it.

pod+masquerade

The VMI YAML is as follows:

apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstance
metadata:
  name: test
spec:
  domain:
    devices:
      interfaces:
      - masquerade: {} # note: masquerade binding
        name: default
      # ...
  networks:
  - name: default
    pod: {}
  # ...

virt-handler

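1. Create a bridge with MAC address 02:00:00:00:00:00 and bring it up. By default the gateway address is computed from 10.0.2.0/24 (that is, 10.0.2.1) and assigned to the bridge as its IP address. Finally, disable TX checksum offload.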

2. Create a tap device and attach it to the bridge.

3. Enable ip_forward and implement NAT with nftables or iptables.
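
The address computation behind step 1 can be sketched with the standard library. gatewayAndVMIP is a hypothetical helper, not the virtnetlink function. It takes the first host address of the CIDR as the gateway, as described above, and assumes that the VM receives the next address after the gateway; that second part is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"net/netip"
)

// gatewayAndVMIP returns the first host address of cidr as the gateway and,
// by assumption, the following address as the VM IP.
func gatewayAndVMIP(cidr string) (gw, vm netip.Addr, err error) {
	prefix, err := netip.ParsePrefix(cidr)
	if err != nil {
		return netip.Addr{}, netip.Addr{}, err
	}
	network := prefix.Masked().Addr()
	gw = network.Next()
	vm = gw.Next()
	// Next returns the zero Addr on overflow, which Contains rejects as well.
	if !prefix.Contains(gw) || !prefix.Contains(vm) {
		return netip.Addr{}, netip.Addr{}, fmt.Errorf("prefix %s is too small", cidr)
	}
	return gw, vm, nil
}

func main() {
	for _, cidr := range []string{"10.0.2.0/24", "fd10:0:2::/120"} {
		gw, vm, err := gatewayAndVMIP(cidr)
		fmt.Println(cidr, gw, vm, err)
	}
}
```

The two CIDRs in main are the IPv4 and IPv6 defaults mentioned in the source comments; a custom vmNetworkCIDR would go through the same computation.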

virt-launcher

virt-launcher starts a DHCP server from which the VM later obtains its IP.

libvirt

libvirt creates the VM and attaches the tap device to it.

Other thoughts

The material above gives a clearer picture of KubeVirt VM networking. Here are a few additional personal thoughts:

  1. Why is VM network setup split between virt-handler and virt-launcher, that is, the phase 1 and phase 2 described in the official documentation? Could it not all be done in virt-launcher?

My view: this is most likely a network security decision. The official documentation contains this passage:

The virt-launcher is an untrusted component of KubeVirt (since it wraps the libvirt process that will run third party workloads). As a result, it must be run with as little privileges as required. As of now, the only capability required by virt-launcher to configure networking is the CAP_NET_ADMIN.

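  2. As the analysis shows, a packet travelling from the host NIC to the VM passes through: host physical NIC -> CNI bridge -> host-side veth -> pod-namespace veth -> bridge created by virt-handler -> tap device in the pod network namespace -> NIC inside the VM. This path is clearly long and may be a problem for workloads with demanding network performance requirements.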

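My view: as noted at the beginning, CNI is involved only for pods that do not use hostNetwork. If the VMI pod were configured with hostNetwork, the CNI bridge and both ends of the veth pair would disappear and the path would be much shorter. With hostNetwork, however, network security has to be considered first, the virt-handler and virt-launcher code may need changes, and VM IP allocation also has to be solved.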

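This article uses flannel as the pod network CNI throughout. Modifying the CNI itself to implement a network model with a shorter data path might be closer to ideal (this has not been verified or analyzed in depth and is for reference only).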

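  3. Judging by the VMI field definitions, SR-IOV is supported (not verified in practice), and in theory SR-IOV can improve VM network performance considerably. It has limitations, though: it must be enabled in the BIOS and the host NIC must support it; even then the number of VFs is usually small, roughly from a dozen to a few dozen, while a Kubernetes node can run 110 pods by default. If the VMs are all small, the VFs clearly will not be enough. If the workload only needs large VMs, for example a single-digit number of VMs per host, SR-IOV should be a good choice.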

This article is also published on the WeChat official account 卡巴斯.