- Ethernet switch device driver model (switchdev)
- ===============================================
- Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
- Copyright (c) 2014-2015 Scott Feldman <sfeldma@gmail.com>
- The Ethernet switch device driver model (switchdev) is an in-kernel driver
- model for switch devices which offload the forwarding (data) plane from the
- kernel.
- Figure 1 is a block diagram showing the components of the switchdev model for
- an example setup using a data-center-class switch ASIC chip. Other setups
- with SR-IOV or soft switches, such as OVS, are possible.
-                              User-space tools
-
-        user space                   |
-       +-------------------------------------------------------------------+
-        kernel                       | Netlink
-                                     |
-                      +--------------+-------------------------------+
-                      |         Network stack                        |
-                      |           (Linux)                            |
-                      |                                              |
-                      +----------------------------------------------+
-
-                            sw1p2     sw1p4     sw1p6
-                       sw1p1  +  sw1p3  +  sw1p5  +          eth1
-                         +    |    +    |    +    |            +
-                         |    |    |    |    |    |            |
-                      +--+----+----+----+-+--+----+---+  +-----+-----+
-                      |         Switch driver         |  |    mgmt   |
-                      |        (this document)        |  |   driver  |
-                      |                               |  |           |
-                      +--------------+----------------+  +-----------+
-                                     |
-        kernel                       | HW bus (eg PCI)
-       +-------------------------------------------------------------------+
-        hardware                     |
-                      +--------------+---+------------+
-                      |         Switch device (sw1)   |
-                      |  +----+                       +--------+
-                      |  |    v offloaded data path   | mgmt port
-                      |  |    |                       |
-                      +--|----|----+----+----+----+---+
-                         |    |    |    |    |    |
-                         +    +    +    +    +    +
-                        p1   p2   p3   p4   p5   p6
-
-                              front-panel ports
-
-                                     Fig 1.
- Include Files
- -------------
- #include <linux/netdevice.h>
- #include <net/switchdev.h>
- Configuration
- -------------
- Use "depends on NET_SWITCHDEV" in the driver's Kconfig to ensure switchdev
- model support is built for the driver.
- Switch Ports
- ------------
- On switchdev driver initialization, the driver will allocate and register a
- struct net_device (using register_netdev()) for each enumerated physical switch
- port, called the port netdev. A port netdev is the software representation of
- the physical port and provides a conduit for control traffic to/from the
- controller (the kernel) and the network, as well as an anchor point for higher
- level constructs such as bridges, bonds, VLANs, tunnels, and L3 routers. Using
- standard netdev tools (iproute2, ethtool, etc), the port netdev can also
- provide to the user access to the physical properties of the switch port such
- as PHY link state and I/O statistics.
- There is (currently) no higher-level kernel object for the switch beyond the
- port netdevs. All of the switchdev driver ops are netdev ops or switchdev ops.
- A switch management port is outside the scope of the switchdev driver model.
- Typically, the management port does not participate in the offloaded data
- plane and is driven by a different driver, such as a NIC driver, bound to the
- management port device.
- Port Netdev Naming
- ^^^^^^^^^^^^^^^^^^
- Udev rules should be used for port netdev naming, using some unique attribute
- of the port as a key, for example the port MAC address or the port PHYS name.
- Hard-coding of kernel netdev names within the driver is discouraged; let the
- kernel pick the default netdev name, and let udev set the final name based on a
- port attribute.
- Using port PHYS name (ndo_get_phys_port_name) for the key is particularly
- useful for dynamically-named ports where the device names its ports based on
- external configuration. For example, if a physical 40G port is split logically
- into 4 10G ports, resulting in 4 port netdevs, the device can give a unique
- name for each port using port PHYS name. The udev rule would be:
- SUBSYSTEM=="net", ACTION=="add", DRIVERS=="<driver>", ATTR{phys_port_name}!="", \
- NAME="$attr{phys_port_name}"
- Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y
- is the port name or ID, and Z is the sub-port name or ID. For example, sw1p1s0
- would be sub-port 0 on port 1 on switch 1.
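The suggested convention can be illustrated with a small user-space sketch. The helper format_port_name() is hypothetical and not part of any kernel API; leaving out the "sZ" suffix for ports that are not split is an assumption of this sketch, not something the convention above mandates.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build a "swXpYsZ" style name.
 * A negative sub_port means "port is not split"; the "sZ" suffix is then
 * omitted (an assumption of this sketch).
 * Returns 0 on success, -1 if the name does not fit in buf.
 */
int format_port_name(char *buf, size_t len, unsigned int sw,
		     unsigned int port, int sub_port)
{
	int n;

	assert(buf != NULL);
	if (sub_port < 0)
		n = snprintf(buf, len, "sw%up%u", sw, port);
	else
		n = snprintf(buf, len, "sw%up%us%d", sw, port, sub_port);

	/* snprintf reports the length it wanted; detect truncation. */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

In practice the driver would only report the port-specific part through ndo_get_phys_port_name and leave the final name to udev, as described above.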
- Switch ID
- ^^^^^^^^^
- The switchdev driver must implement the switchdev op switchdev_port_attr_get
- for SWITCHDEV_ATTR_ID_PORT_PARENT_ID for each port netdev, returning the same
- physical ID for each port of a switch. The ID must be unique between switches
- on the same system. The ID does not need to be unique between switches on
- different systems.
- The switch ID is used to locate ports on a switch and to know if aggregated
- ports belong to the same switch.
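As a rough illustration of how a switch ID might be used, the following user-space sketch compares opaque IDs byte by byte. struct phys_id, PHYS_ID_MAX_LEN, same_switch() and lag_on_one_switch() are hypothetical names invented for this example; they only model the idea and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PHYS_ID_MAX_LEN 32	/* illustrative bound chosen for this sketch */

/* Hypothetical stand-in for the opaque ID a driver reports per port. */
struct phys_id {
	unsigned char id[PHYS_ID_MAX_LEN];
	unsigned char id_len;
};

/* Two ports are on the same switch if their IDs are byte-identical. */
bool same_switch(const struct phys_id *a, const struct phys_id *b)
{
	assert(a != NULL && b != NULL);
	return a->id_len == b->id_len &&
	       memcmp(a->id, b->id, a->id_len) == 0;
}

/* True if every member of an aggregate reports the same switch ID. */
bool lag_on_one_switch(const struct phys_id *members, size_t n)
{
	for (size_t i = 1; i < n; i++)
		if (!same_switch(&members[0], &members[i]))
			return false;
	return n > 0;
}
```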
- Port Features
- ^^^^^^^^^^^^^
- NETIF_F_NETNS_LOCAL
- If the switchdev driver (and device) only supports offloading of the default
- network namespace (netns), the driver should set this feature flag to prevent
- the port netdev from being moved out of the default netns. A netns-aware
- driver/device would not set this flag and would be responsible for
- partitioning hardware to preserve netns containment. This means hardware cannot forward
- traffic from a port in one namespace to another port in another namespace.
- Port Topology
- ^^^^^^^^^^^^^
- The port netdevs representing the physical switch ports can be organized into
- higher-level switching constructs. The default construct is a standalone
- router port, used to offload L3 forwarding. Two or more ports can be bonded
- together to form a LAG. Two or more ports (or LAGs) can be bridged to bridge
- L2 networks. VLANs can be applied to sub-divide L2 networks. L2-over-L3
- tunnels can be built on ports. These constructs are built using standard Linux
- tools such as the bridge driver, the bonding/team drivers, and netlink-based
- tools such as iproute2.
- The switchdev driver can know a particular port's position in the topology by
- monitoring NETDEV_CHANGEUPPER notifications. For example, a port moved into a
- bond will see its upper master change. If that bond is moved into a bridge,
- the bond's upper master will change. And so on. The driver will track such
- movements to know a port's position in the overall topology by registering
- for netdevice events and acting on NETDEV_CHANGEUPPER.
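The bookkeeping described above can be modelled in user space as a chain of "upper" pointers. This is only a sketch: struct dev, changeupper() and find_bridge() are hypothetical names, and a real driver would react to NETDEV_CHANGEUPPER from a netdevice notifier rather than call such helpers directly.

```c
#include <assert.h>
#include <stddef.h>

enum dev_kind { DEV_PORT, DEV_BOND, DEV_BRIDGE };

/* Hypothetical model of a netdev and its current upper master. */
struct dev {
	enum dev_kind kind;
	struct dev *upper;	/* NULL when the device has no master */
};

/* What a driver might record when it sees an upper-change event. */
void changeupper(struct dev *d, struct dev *new_upper)
{
	assert(d != NULL);
	d->upper = new_upper;
}

/* Walk the chain of uppers; return the bridge above d, or NULL. */
struct dev *find_bridge(struct dev *d)
{
	assert(d != NULL);
	for (d = d->upper; d != NULL; d = d->upper)
		if (d->kind == DEV_BRIDGE)
			return d;
	return NULL;
}
```

With this model, a port enslaved to a bond only becomes a bridged port once the bond itself is moved into a bridge, which matches the two-step example in the text.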
- L2 Forwarding Offload
- ---------------------
- The idea is to offload the L2 data forwarding (switching) path from the kernel
- to the switchdev device by mirroring bridge FDB entries down to the device. An
- FDB entry is the {port, MAC, VLAN} tuple forwarding destination.
- To offload L2 bridging, the switchdev driver/device should support:
- - Static FDB entries installed on a bridge port
- - Notification of learned/forgotten src mac/vlans from device
- - STP state changes on the port
- - VLAN flooding of multicast/broadcast and unknown unicast packets
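To make the {port, MAC, VLAN} tuple concrete, here is a minimal user-space model of an FDB. All names (struct fdb_table, fdb_add(), fdb_del(), fdb_lookup()) and the tiny FDB_SIZE capacity are assumptions of this sketch; it does not reflect how the bridge or any device actually stores entries.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6
#define FDB_SIZE 8	/* deliberately tiny, illustrative capacity */

/* One entry: frames for {mac, vid} are forwarded out of port. */
struct fdb_entry {
	uint8_t mac[MAC_LEN];
	uint16_t vid;
	int port;
	int used;
};

struct fdb_table {
	struct fdb_entry e[FDB_SIZE];
};

struct fdb_entry *fdb_find(struct fdb_table *t, const uint8_t *mac,
			   uint16_t vid)
{
	for (int i = 0; i < FDB_SIZE; i++)
		if (t->e[i].used && t->e[i].vid == vid &&
		    memcmp(t->e[i].mac, mac, MAC_LEN) == 0)
			return &t->e[i];
	return NULL;
}

/* Install or move an entry. Returns 0, or -1 when the table is full. */
int fdb_add(struct fdb_table *t, const uint8_t *mac, uint16_t vid, int port)
{
	struct fdb_entry *e = fdb_find(t, mac, vid);

	assert(port >= 0);
	for (int i = 0; !e && i < FDB_SIZE; i++)
		if (!t->e[i].used)
			e = &t->e[i];
	if (!e)
		return -1;
	memcpy(e->mac, mac, MAC_LEN);
	e->vid = vid;
	e->port = port;
	e->used = 1;
	return 0;
}

/* Remove an entry. Returns 0, or -1 if it was not present. */
int fdb_del(struct fdb_table *t, const uint8_t *mac, uint16_t vid)
{
	struct fdb_entry *e = fdb_find(t, mac, vid);

	if (!e)
		return -1;
	e->used = 0;
	return 0;
}

/* Egress port for a frame, or -1 for unknown unicast (flood the VLAN). */
int fdb_lookup(struct fdb_table *t, const uint8_t *dst, uint16_t vid)
{
	struct fdb_entry *e = fdb_find(t, dst, vid);

	return e ? e->port : -1;
}
```

The -1 result of fdb_lookup() corresponds to the unknown-unicast case in the list above, where the frame is flooded within its VLAN.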
- Static FDB Entries
- ^^^^^^^^^^^^^^^^^^
- The switchdev driver should implement ndo_fdb_add, ndo_fdb_del and ndo_fdb_dump
- to support static FDB entries installed to the device. Static bridge FDB
- entries are installed, for example, using iproute2 bridge cmd:
- bridge fdb add ADDR dev DEV [vlan VID] [self]
- The driver should use the helper switchdev_port_fdb_xxx ops for ndo_fdb_xxx
- ops, and handle add/delete/dump of SWITCHDEV_OBJ_ID_PORT_FDB object using
- switchdev_port_obj_xxx ops.
- XXX: what should be done if offloading this rule to hardware fails (for
- example, due to full capacity in hardware tables) ?
- Note: by default, the bridge does not filter on VLAN and only bridges untagged
- traffic. To enable VLAN support, turn on VLAN filtering:
- echo 1 >/sys/class/net/<bridge>/bridge/vlan_filtering
- Notification of Learned/Forgotten Source MAC/VLANs
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- The switch device will learn/forget source MAC address/VLAN on ingress packets
- and notify the switch driver of the mac/vlan/port tuples. The switch driver,
- in turn, will notify the bridge driver using the switchdev notifier call:
- err = call_switchdev_notifiers(val, dev, info);
- Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when
- forgetting, and info points to a struct switchdev_notifier_fdb_info. On
- SWITCHDEV_FDB_ADD, the bridge driver will install the FDB entry into the
- bridge's FDB and mark the entry as NTF_EXT_LEARNED. The iproute2 bridge
- command will label these entries "offload":
- $ bridge fdb
- 52:54:00:12:35:01 dev sw1p1 master br0 permanent
- 00:02:00:00:02:00 dev sw1p1 master br0 offload
- 00:02:00:00:02:00 dev sw1p1 self
- 52:54:00:12:35:02 dev sw1p2 master br0 permanent
- 00:02:00:00:03:00 dev sw1p2 master br0 offload
- 00:02:00:00:03:00 dev sw1p2 self
- 33:33:00:00:00:01 dev eth0 self permanent
- 01:00:5e:00:00:01 dev eth0 self permanent
- 33:33:ff:00:00:00 dev eth0 self permanent
- 01:80:c2:00:00:0e dev eth0 self permanent
- 33:33:00:00:00:01 dev br0 self permanent
- 01:00:5e:00:00:01 dev br0 self permanent
- 33:33:ff:12:35:01 dev br0 self permanent
- Learning on the port should be disabled on the bridge using the bridge command:
- bridge link set dev DEV learning off
- Learning on the device port should be enabled, as well as learning_sync:
- bridge link set dev DEV learning on self
- bridge link set dev DEV learning_sync on self
- The learning_sync attribute enables syncing of the learned/forgotten FDB entry to
- the bridge's FDB. It's possible, but not optimal, to enable learning on the
- device port and on the bridge port, and disable learning_sync.
- To support learning and learning_sync port attributes, the driver implements
- switchdev op switchdev_port_attr_get/set for
- SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS. The driver should initialize the attributes
- to the hardware defaults.
- FDB Ageing
- ^^^^^^^^^^
- The bridge will skip ageing FDB entries marked with NTF_EXT_LEARNED and it is
- the responsibility of the port driver/device to age out these entries. If the
- port device supports ageing, when the FDB entry expires, it will notify the
- driver which in turn will notify the bridge with SWITCHDEV_FDB_DEL. If the
- device does not support ageing, the driver can simulate ageing using a
- garbage collection timer to monitor FDB entries. Expired entries will be
- notified to the bridge using SWITCHDEV_FDB_DEL. See rocker driver for
- example of driver running ageing timer.
- To keep an NTF_EXT_LEARNED entry "alive", the driver should refresh the FDB
- entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...). The
- notification will reset the FDB entry's last-used time to now. The driver
- should rate limit refresh notifications, for example, no more than once a
- second. (The last-used time is visible using the bridge -s fdb option).
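The ageing and refresh rules above can be sketched in user space with explicit timestamps instead of a kernel timer. struct aged_entry, fdb_touch() and fdb_gc() are hypothetical names; in a real driver each expired entry would result in a SWITCHDEV_FDB_DEL notification and each permitted refresh in a SWITCHDEV_FDB_ADD notification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-entry ageing state; times are in seconds. */
struct aged_entry {
	bool in_use;
	unsigned long last_seen;	/* last traffic seen for this entry */
	unsigned long last_notified;	/* last refresh sent to the bridge */
};

/*
 * Record a hit on an entry. Returns true if a refresh notification
 * should be sent now, rate limited to one per min_interval.
 */
bool fdb_touch(struct aged_entry *e, unsigned long now,
	       unsigned long min_interval)
{
	assert(e != NULL && e->in_use);
	e->last_seen = now;
	if (now - e->last_notified < min_interval)
		return false;
	e->last_notified = now;
	return true;
}

/*
 * One garbage-collection pass: expire entries idle for ageing_time or
 * longer. Returns the number of entries expired.
 */
size_t fdb_gc(struct aged_entry *tbl, size_t n, unsigned long now,
	      unsigned long ageing_time)
{
	size_t expired = 0;

	for (size_t i = 0; i < n; i++) {
		if (tbl[i].in_use && now - tbl[i].last_seen >= ageing_time) {
			tbl[i].in_use = false;
			expired++;
		}
	}
	return expired;
}
```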
- STP State Change on Port
- ^^^^^^^^^^^^^^^^^^^^^^^^
- Internally or with a third-party STP protocol implementation (e.g. mstpd), the
- bridge driver maintains the STP state for ports, and will notify the switch
- driver of STP state change on a port using the switchdev op
- switchdev_port_attr_set for SWITCHDEV_ATTR_ID_PORT_STP_STATE.
- State is one of BR_STATE_*. The switch driver can use STP state updates to
- update ingress packet filter list for the port. For example, if port is
- DISABLED, no packets should pass, but if port moves to BLOCKED, then STP BPDUs
- and other IEEE 01:80:c2:xx:xx:xx link-local multicast packets can pass.
- Note that STP BPDUs are untagged and STP state applies to all VLANs on the port
- so packet filters should be applied consistently across untagged and tagged
- VLANs on the port.
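A possible shape for such an ingress filter is sketched below in user space. The enum stp_state values are local stand-ins for BR_STATE_*, and the functions are hypothetical. The text above only specifies the disabled and blocked cases; treating listening and learning like blocking, and letting everything through when forwarding, are assumptions of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the bridge's BR_STATE_* values. */
enum stp_state {
	STP_DISABLED,
	STP_LISTENING,
	STP_LEARNING,
	STP_FORWARDING,
	STP_BLOCKING,
};

/* True for destinations in the IEEE 01:80:c2:xx:xx:xx range. */
bool is_ieee_link_local(const uint8_t *mac)
{
	return mac[0] == 0x01 && mac[1] == 0x80 && mac[2] == 0xc2;
}

/*
 * Decide whether an ingress frame may pass. The VLAN is intentionally
 * not an input: STP state applies to all VLANs on the port.
 */
bool stp_ingress_allowed(enum stp_state state, const uint8_t *dst_mac)
{
	assert(dst_mac != NULL);
	switch (state) {
	case STP_DISABLED:
		return false;
	case STP_FORWARDING:
		return true;
	default:
		/* Blocking per the text; listening/learning by assumption. */
		return is_ieee_link_local(dst_mac);
	}
}
```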
- Flooding L2 domain
- ^^^^^^^^^^^^^^^^^^
- For a given L2 VLAN domain, the switch device should flood multicast/broadcast
- and unknown unicast packets to all ports in domain, if allowed by port's
- current STP state. The switch driver, knowing which ports are within which
- vlan L2 domain, can program the switch device for flooding. The packet may
- be sent to the port netdev for processing by the bridge driver. The
- bridge should not reflood the packet to the same ports the device flooded,
- otherwise there will be duplicate packets on the wire.
- To avoid duplicate packets, the device/driver should mark a packet as already
- forwarded using skb->offload_fwd_mark. The same mark is set on the device
- ports in the domain using dev->offload_fwd_mark. If the skb->offload_fwd_mark
- is non-zero and matches the forwarding egress port's dev->offload_fwd_mark, the kernel
- will drop the skb right before transmit on the egress port, with the
- understanding that the device already forwarded the packet on same egress port.
- The driver can use switchdev_port_fwd_mark_set() to set a globally unique mark
- for port's dev->offload_fwd_mark, based on the port's parent ID (switch ID) and
- a group ifindex.
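The duplicate-suppression check can be summarised with a small user-space model. struct pkt and struct port are hypothetical reductions of the skb and the port netdev to just their marks; already_forwarded() and software_egress_count() are names invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reductions of an skb and a port netdev to their marks. */
struct pkt {
	uint32_t offload_fwd_mark;
};

struct port {
	uint32_t offload_fwd_mark;
};

/* True if software must not transmit p on egress: hardware already did. */
bool already_forwarded(const struct pkt *p, const struct port *egress)
{
	return p->offload_fwd_mark != 0 &&
	       p->offload_fwd_mark == egress->offload_fwd_mark;
}

/* Number of ports a software flood would still transmit on. */
size_t software_egress_count(const struct pkt *p, const struct port *ports,
			     size_t n)
{
	size_t count = 0;

	assert(p != NULL);
	for (size_t i = 0; i < n; i++)
		if (!already_forwarded(p, &ports[i]))
			count++;
	return count;
}
```

A mark of zero never matches, so packets that the device did not forward are flooded in software to every port as usual.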
- It is possible for the switch device to not handle flooding and push the
- packets up to the bridge driver for flooding. This is not ideal as the number
- of ports in the L2 domain scales, because the device is much more efficient at
- flooding packets than software.
- If supported by the device, flood control can be offloaded to it, preventing
- certain netdevs from flooding unicast traffic for which there is no FDB entry.
- IGMP Snooping
- ^^^^^^^^^^^^^
- XXX: complete this section
- L3 Routing Offload
- ------------------
- Offloading L3 routing requires that device be programmed with FIB entries from
- the kernel, with the device doing the FIB lookup and forwarding. The device
- does a longest prefix match (LPM) on FIB entries matching route prefix and
- forwards the packet to the matching FIB entry's nexthop(s) egress ports.
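For readers unfamiliar with LPM, the lookup a device performs can be approximated by the linear scan below. This is a user-space sketch only: struct route, the IP4() macro and lpm_lookup() are hypothetical, addresses are assumed to be in host byte order, and each dst is assumed to be already masked to its prefix length. Real devices typically use TCAMs or tries rather than a scan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build a host-byte-order IPv4 address from its four octets. */
#define IP4(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
	 ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Hypothetical FIB entry reduced to prefix and a single egress port. */
struct route {
	uint32_t dst;		/* prefix, already masked to dst_len bits */
	int dst_len;		/* prefix length, 0..32 */
	int egress_port;
};

uint32_t prefix_mask(int len)
{
	assert(len >= 0 && len <= 32);
	/* A shift by 32 is undefined in C, so handle /0 separately. */
	return len == 0 ? 0 : (uint32_t)(0xffffffffu << (32 - len));
}

/* Return the matching route with the longest prefix, or NULL. */
const struct route *lpm_lookup(const struct route *tbl, size_t n,
			       uint32_t addr)
{
	const struct route *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if ((addr & prefix_mask(tbl[i].dst_len)) != tbl[i].dst)
			continue;
		if (!best || tbl[i].dst_len > best->dst_len)
			best = &tbl[i];
	}
	return best;
}
```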
- To program the device, the driver implements support for
- SWITCHDEV_OBJ_ID_IPV[4|6]_FIB object using switchdev_port_obj_xxx ops.
- switchdev_port_obj_add is used both for adding a new FIB entry to the device
- and for modifying an existing entry on the device.
- XXX: Currently, only SWITCHDEV_OBJ_ID_IPV4_FIB objects are supported.
- SWITCHDEV_OBJ_ID_IPV4_FIB object passes:
- 	struct switchdev_obj_ipv4_fib {		/* IPV4_FIB */
- 		u32 dst;
- 		int dst_len;
- 		struct fib_info *fi;
- 		u8 tos;
- 		u8 type;
- 		u32 nlflags;
- 		u32 tb_id;
- 	} ipv4_fib;
- to add/modify/delete IPv4 dst/dst_len prefix on table tb_id. The *fi
- structure holds details on the route and route's nexthops. *dev is one of the
- port netdevs mentioned in the route's nexthop list. If the output port netdevs
- referenced in the route's nexthop list don't all have the same switch ID, the
- driver is not called to add/modify/delete the FIB entry.
- Routes offloaded to the device are labeled with "offload" in the ip route
- listing:
- $ ip route show
- default via 192.168.0.2 dev eth0
- 11.0.0.0/30 dev sw1p1 proto kernel scope link src 11.0.0.2 offload
- 11.0.0.4/30 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload
- 11.0.0.8/30 dev sw1p2 proto kernel scope link src 11.0.0.10 offload
- 11.0.0.12/30 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload
- 12.0.0.2 proto zebra metric 30 offload
- nexthop via 11.0.0.1 dev sw1p1 weight 1
- nexthop via 11.0.0.9 dev sw1p2 weight 1
- 12.0.0.3 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload
- 12.0.0.4 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload
- 192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.15
- XXX: add/mod/del IPv6 FIB API
- Nexthop Resolution
- ^^^^^^^^^^^^^^^^^^
- The FIB entry's nexthop list contains the nexthop tuple (gateway, dev), but for
- the switch device to forward the packet with the correct dst mac address, the
- nexthop gateways must be resolved to the neighbor's mac address. Neighbor mac
- address discovery comes via the ARP (or ND) process and is available via the
- arp_tbl neighbor table. To resolve the route's nexthop gateways, the driver
- should trigger the kernel's neighbor resolution process. See the rocker
- driver's rocker_port_ipv4_resolve() for an example.
- The driver can monitor for updates to arp_tbl using the netevent notifier
- NETEVENT_NEIGH_UPDATE. The device can be programmed with resolved nexthops
- for the routes as arp_tbl updates. The driver implements ndo_neigh_destroy
- to know when arp_tbl neighbor entries are purged from the port.
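The relationship between nexthops and neighbor entries can be sketched in user space as follows. struct neigh, struct nexthop, neigh_mac() and resolved_nexthops() are hypothetical; the sketch only shows that a nexthop becomes usable by the device once its gateway has a known MAC address, and leaves out how resolution is triggered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical neighbor entry: gateway IP and its MAC, once known. */
struct neigh {
	uint32_t ip;
	uint8_t mac[6];
	bool resolved;
};

/* Hypothetical nexthop tuple (gateway, egress port). */
struct nexthop {
	uint32_t gw;
	int port;
};

/* MAC address for ip, or NULL while the neighbor is unresolved. */
const uint8_t *neigh_mac(const struct neigh *tbl, size_t n, uint32_t ip)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].resolved && tbl[i].ip == ip)
			return tbl[i].mac;
	return NULL;
}

/* Count the nexthops that could be programmed into the device now. */
size_t resolved_nexthops(const struct nexthop *nh, size_t n_nh,
			 const struct neigh *tbl, size_t n_tbl)
{
	size_t count = 0;

	assert(nh != NULL || n_nh == 0);
	for (size_t i = 0; i < n_nh; i++)
		if (neigh_mac(tbl, n_tbl, nh[i].gw) != NULL)
			count++;
	return count;
}
```

A driver following this pattern would re-evaluate the affected routes whenever a neighbor update arrives and reprogram the device accordingly.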
- Transaction item queue
- ^^^^^^^^^^^^^^^^^^^^^^
- For switchdev ops attr_set and obj_add, a two-phase transaction model is
- used. The first phase is to "prepare" anything needed, including various
- checks, memory allocation, etc. The goal is to handle anything that could
- fail in this phase. The second phase is to "commit" the actual changes.
- Switchdev provides an infrastructure for sharing items (for example memory
- allocations) between the two phases.
- The object is created by a driver in the "prepare" phase and is queued up by:
- switchdev_trans_item_enqueue()
- During the "commit" phase, the driver gets the object by:
- switchdev_trans_item_dequeue()
- If a transaction is aborted during "prepare" phase, switchdev code will handle
- cleanup of the queued-up objects.
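The prepare/commit split with a shared item queue can be illustrated with the user-space model below. Everything here is hypothetical (struct trans, trans_enqueue(), trans_dequeue(), trans_abort() and the vlan_add_* example); it mimics the idea of the switchdev item queue but is not its implementation, and the real function signatures are not reproduced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical miniature of a transaction's FIFO item queue. */
struct trans_item {
	void *data;
	struct trans_item *next;
};

struct trans {
	struct trans_item *head, *tail;
};

/* "prepare" side: queue an item for the commit phase. 0 or -1. */
int trans_enqueue(struct trans *t, void *data)
{
	struct trans_item *it = malloc(sizeof(*it));

	assert(data != NULL);
	if (!it)
		return -1;
	it->data = data;
	it->next = NULL;
	if (t->tail)
		t->tail->next = it;
	else
		t->head = it;
	t->tail = it;
	return 0;
}

/* "commit" side: take the oldest queued item, or NULL if none. */
void *trans_dequeue(struct trans *t)
{
	struct trans_item *it = t->head;
	void *data;

	if (!it)
		return NULL;
	t->head = it->next;
	if (!t->head)
		t->tail = NULL;
	data = it->data;
	free(it);
	return data;
}

/* Abort: release everything that "prepare" queued up. */
void trans_abort(struct trans *t)
{
	void *data;

	while ((data = trans_dequeue(t)) != NULL)
		free(data);
}

/* Example op, prepare phase: all checks and allocations happen here. */
int vlan_add_prepare(struct trans *t, int vid)
{
	int *entry;

	if (vid < 1 || vid > 4094)
		return -1;
	entry = malloc(sizeof(*entry));
	if (!entry)
		return -1;
	*entry = vid;
	if (trans_enqueue(t, entry)) {
		free(entry);
		return -1;
	}
	return 0;
}

/* Example op, commit phase: only consumes what prepare set aside. */
int *vlan_add_commit(struct trans *t)
{
	return trans_dequeue(t);
}
```

Because every step that can fail lives in vlan_add_prepare(), the commit step in this model has nothing left that can go wrong, which is the point of the two-phase design.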