- Distributed Switch Architecture
- ===============================
- Introduction
- ============
- This document describes the Distributed Switch Architecture (DSA) subsystem
- design principles, limitations, interactions with other subsystems, and how to
- develop drivers for this subsystem as well as a TODO for developers interested
- in joining the effort.
- Design principles
- =================
- The Distributed Switch Architecture is a subsystem which was primarily designed
- to support Marvell Ethernet switches (MV88E6xxx, a.k.a Linkstreet product line)
- using Linux, but has since evolved to support other vendors as well.
- The original philosophy behind this design was to let unmodified Linux tools
- such as bridge, iproute2 and ifconfig work transparently, whether they
- configure or query a switch port network device or a regular network device.
- An Ethernet switch typically comprises multiple front-panel ports and one or
- more CPU or management ports. The DSA subsystem currently relies on the
- presence of a management port connected to an Ethernet controller capable of
- receiving Ethernet frames from the switch. This is a very common setup for all
- kinds of Ethernet switches found in small office and home office products:
- routers, gateways, or even top-of-the-rack switches. This host Ethernet
- controller will be later referred to as "master" and "cpu" in DSA terminology
- and code.
- The D in DSA stands for Distributed, because the subsystem has been designed
- with the ability to configure and manage cascaded switches on top of each other
- using upstream and downstream Ethernet links between switches. These specific
- ports are referred to as "dsa" ports in DSA terminology and code. A collection
- of multiple switches connected to each other is called a "switch tree".
- For each front-panel port, DSA will create specialized network devices which are
- used as controlling and data-flowing endpoints for use by the Linux networking
- stack. These specialized network interfaces are referred to as "slave" network
- interfaces in DSA terminology and code.
- The ideal case for using DSA is when an Ethernet switch supports a "switch tag",
- a hardware feature which makes the switch insert a specific tag into each
- Ethernet frame it receives from or sends to specific ports, to help the
- management interface figure out:
- - what port is this frame coming from
- - what was the reason why this frame got forwarded
- - how to send CPU originated traffic to specific ports
- The subsystem does support switches not capable of inserting/stripping tags, but
- the features might be slightly limited in that case (traffic separation relies
- on Port-based VLAN IDs).
- Note that DSA does not currently create network interfaces for the "cpu" and
- "dsa" ports because:
- - the "cpu" port is the Ethernet switch facing side of the management
- controller, and as such, would create a duplication of feature, since you
- would get two interfaces for the same conduit: master netdev, and "cpu" netdev
- - the "dsa" port(s) are just conduits between two or more switches, and as such
- cannot really be used as proper network interfaces either, only the
- downstream, or the top-most upstream interface makes sense with that model
- Switch tagging protocols
- ------------------------
- DSA currently supports 4 different tagging protocols, and a tag-less mode as
- well. The different protocols are implemented in:
- net/dsa/tag_trailer.c: Marvell's 4-byte trailer tag mode (legacy)
- net/dsa/tag_dsa.c: Marvell's original DSA tag
- net/dsa/tag_edsa.c: Marvell's enhanced DSA tag
- net/dsa/tag_brcm.c: Broadcom's 4-byte tag
- The exact format of the tag protocol is vendor specific, but in general, they
- all contain something which:
- - identifies which port the Ethernet frame came from/should be sent to
- - provides a reason why this frame was forwarded to the management interface
- Master network devices
- ----------------------
- Master network devices are regular, unmodified Linux network device drivers for
- the CPU/management Ethernet interface. Such a driver might occasionally need to
- know whether DSA is enabled (e.g.: to enable/disable specific offload features),
- but the DSA subsystem has been proven to work with industry standard drivers:
- e1000e, mv643xx_eth etc. without having to introduce modifications to these
- drivers. Such network devices are also often referred to as conduit network
- devices since they act as a pipe between the host processor and the hardware
- Ethernet switch.
- Networking stack hooks
- ----------------------
- When a master netdev is used with DSA, a small hook is placed in the
- networking stack in order to have the DSA subsystem process the Ethernet
- switch specific tagging protocol. DSA accomplishes this by registering a
- specific (and fake) Ethernet type (later becoming skb->protocol) with the
- networking stack; this is also known as a ptype or packet_type. A typical
- Ethernet frame receive sequence looks like this:
- Master network device (e.g.: e1000e):
- Receive interrupt fires:
- - receive function is invoked
- - basic packet processing is done: getting length, status etc.
- - packet is prepared to be processed by the Ethernet layer by calling
- eth_type_trans
- net/ethernet/eth.c:
- eth_type_trans(skb, dev)
- if (dev->dsa_ptr != NULL)
- -> skb->protocol = ETH_P_XDSA
- drivers/net/ethernet/*:
- netif_receive_skb(skb)
- -> iterate over registered packet_type
- -> invoke handler for ETH_P_XDSA, calls dsa_switch_rcv()
- net/dsa/dsa.c:
- -> dsa_switch_rcv()
- -> invoke switch tag specific protocol handler in
- net/dsa/tag_*.c
- net/dsa/tag_*.c:
- -> inspect and strip switch tag protocol to determine originating port
- -> locate per-port network device
- -> invoke eth_type_trans() with the DSA slave network device
- -> invoke netif_receive_skb()
- Past this point, the DSA slave network devices get delivered regular Ethernet
- frames that can be processed by the networking stack.
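The receive sequence above can be modeled in user space with a minimal sketch. All demo_* names and structures are hypothetical stand-ins, not kernel definitions; the only value borrowed from Linux is 0x00F8 for ETH_P_XDSA. The source port is passed in directly, on the assumption that a tag handler has already extracted it from the frame.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_ETH_P_IP   0x0800
#define DEMO_ETH_P_XDSA 0x00F8	/* same value as ETH_P_XDSA in Linux */

/* Hypothetical per-port (slave) device: just a name and an RX counter. */
struct demo_slave {
	const char *name;
	unsigned long rx_packets;
};

/* Hypothetical master device; dsa_ptr stands in for dev->dsa_ptr. */
struct demo_master {
	struct demo_slave *dsa_ptr;	/* NULL when DSA is not used */
	size_t num_slaves;
};

/* Model of the eth_type_trans() special case for DSA masters. */
uint16_t demo_eth_type_trans(const struct demo_master *dev, uint16_t wire_proto)
{
	return dev->dsa_ptr != NULL ? DEMO_ETH_P_XDSA : wire_proto;
}

/*
 * Model of the tag handler: map the source port found in the tag to the
 * per-port device and account the frame there. Returns NULL when the
 * frame cannot be attributed to a port.
 */
struct demo_slave *demo_switch_rcv(struct demo_master *dev, unsigned int src_port)
{
	if (dev->dsa_ptr == NULL || src_port >= dev->num_slaves)
		return NULL;

	dev->dsa_ptr[src_port].rx_packets++;
	return &dev->dsa_ptr[src_port];
}
```

In this model a frame arriving on a DSA master is always classified as DEMO_ETH_P_XDSA, whatever EtherType is on the wire, and is then handed to the per-port device selected by the tag.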
- Slave network devices
- ---------------------
- Slave network devices created by DSA are stacked on top of their master network
- device. Each of these network interfaces is responsible for being a
- controlling and data-flowing end-point for one front-panel port of the switch.
- These interfaces are specialized in order to:
- - insert/remove the switch tag protocol (if it exists) when sending traffic
- to/from specific switch ports
- - query the switch for ethtool operations: statistics, link state,
- Wake-on-LAN, register dumps...
- - external/internal PHY management: link, auto-negotiation etc.
- These slave network devices have custom net_device_ops and ethtool_ops function
- pointers which allow DSA to introduce a level of layering between the networking
- stack/ethtool, and the switch driver implementation.
- Upon frame transmission from these slave network devices, DSA will look up which
- switch tagging protocol is currently registered with these network devices, and
- invoke a specific transmit routine which takes care of adding the relevant
- switch tag in the Ethernet frames.
- These frames are then queued for transmission using the master network device
- ndo_start_xmit() function. Since they contain the appropriate switch tag, the
- Ethernet switch will be able to process these incoming frames from the
- management interface and deliver them to the physical switch port.
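As an illustration of the transmit side, the sketch below inserts a tag into a raw frame buffer before it would be handed to the master device. The 4-byte layout (three zero bytes followed by the destination port, placed right after the two MAC addresses) is invented for this example and does not match any vendor's format; demo_tag_xmit is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ETH_ALEN 6		/* length of one MAC address */
#define DEMO_TAG_LEN  4		/* hypothetical tag size */

/*
 * Insert a hypothetical 4-byte tag after the destination and source MAC
 * addresses so that the switch can steer the frame to `port`.
 * `cap` is the capacity of the buffer holding the frame.
 * Returns the new frame length, or 0 if the frame is too short, the
 * tagged frame would not fit, or the port does not fit in the tag.
 */
size_t demo_tag_xmit(uint8_t *frame, size_t len, size_t cap, unsigned int port)
{
	uint8_t *tag = frame + 2 * DEMO_ETH_ALEN;

	if (len < 2 * DEMO_ETH_ALEN + 2 || len + DEMO_TAG_LEN > cap || port > 0x0f)
		return 0;

	/* Shift EtherType and payload to make room for the tag. */
	memmove(tag + DEMO_TAG_LEN, tag, len - 2 * DEMO_ETH_ALEN);
	tag[0] = 0;
	tag[1] = 0;
	tag[2] = 0;
	tag[3] = (uint8_t)port;	/* assumed: destination port in the last byte */

	return len + DEMO_TAG_LEN;
}
```

A real tagging protocol would typically operate on an skb and reserve headroom instead of moving the payload; the buffer shuffle here only keeps the example self-contained.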
- Graphical representation
- ------------------------
- Summarized, this is basically what DSA looks like from a network device
- perspective:
-         |---------------------------|
-         | CPU network device (eth0) |
-         |---------------------------|
-         | <tag added by switch      |
-         |                           |
-         |                           |
-         |         tag added by CPU> |
- |--------------------------------------------|
- | Switch driver                              |
- |--------------------------------------------|
-    ||           ||           ||
- |-------|    |-------|    |-------|
- | sw0p0 |    | sw0p1 |    | sw0p2 |
- |-------|    |-------|    |-------|
- Slave MDIO bus
- --------------
- In order to be able to read from/write to a switch PHY built into it, DSA
- creates a slave MDIO bus which allows a specific switch driver to divert and
- intercept MDIO reads/writes towards specific PHY addresses. In most
- MDIO-connected switches, these functions would utilize direct or indirect PHY
- addressing mode to return standard MII registers from the switch builtin PHYs,
- allowing the PHY library to report link status, link partner pages,
- auto-negotiation results etc.
- For Ethernet switches which have both external and internal MDIO buses, the
- slave MII bus can be utilized to mux/demux MDIO reads and writes towards either
- internal or external MDIO devices this switch might be connected to: internal
- PHYs, external PHYs, or even external switches.
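The mux/demux role of the slave MDIO bus can be sketched as follows. This is a user-space model under stated assumptions: the demo_* names are hypothetical, addresses 0 to 4 are assumed to be the built-in PHYs, and the two backend functions merely return recognizable values instead of touching hardware.

```c
#include <stdint.h>

#define DEMO_NUM_INT_PHYS 5	/* assumed: addresses 0..4 are built-in PHYs */

typedef int (*demo_mdio_read_fn)(int addr, int reg);

/* Hypothetical switch with separate internal and external MDIO accessors. */
struct demo_switch {
	demo_mdio_read_fn int_read;	/* PHYs built into the switch */
	demo_mdio_read_fn ext_read;	/* devices on the external MDIO bus */
};

/* Demo backends: encode which bus answered in the upper bits. */
int demo_int_read(int addr, int reg) { (void)addr; return 0x1000 + reg; }
int demo_ext_read(int addr, int reg) { (void)addr; return 0x2000 + reg; }

/*
 * Slave MDIO bus read: divert built-in PHY addresses to the switch
 * driver's internal accessor, pass everything else to the external bus.
 * When nothing can answer, return 0xffff like an idle MDIO bus would.
 */
int demo_slave_mii_read(const struct demo_switch *sw, int addr, int reg)
{
	if (addr >= 0 && addr < DEMO_NUM_INT_PHYS && sw->int_read)
		return sw->int_read(addr, reg);
	if (sw->ext_read)
		return sw->ext_read(addr, reg);
	return 0xffff;
}
```

The address split is the only policy in this sketch; an actual driver would decide per port, based on how its hardware exposes internal and external PHYs.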
- Data structures
- ---------------
- DSA data structures are defined in include/net/dsa.h as well as
- net/dsa/dsa_priv.h.
- dsa_chip_data: platform data configuration for a given switch device. This
- structure describes a switch device's parent device, its address, as well as
- various properties of its ports: names/labels, and finally a routing table
- indication (when cascading switches)
- dsa_platform_data: platform device configuration data which can reference a
- collection of dsa_chip_data structures if multiple switches are cascaded. The
- master network device this switch tree is attached to needs to be referenced
- dsa_switch_tree: structure assigned to the master network device under
- "dsa_ptr". This structure references a dsa_platform_data structure as well as
- the tagging protocol supported by the switch tree, and which receive/transmit
- function hooks should be invoked. Information about the directly attached
- switch is also provided: CPU port. Finally, a collection of dsa_switch
- structures is referenced to address individual switches in the tree.
- dsa_switch: structure describing a switch device in the tree, referencing a
- dsa_switch_tree as a backpointer, slave network devices, master network device,
- and a reference to the backing dsa_switch_driver
- dsa_switch_driver: structure referencing function pointers, see below for a full
- description.
- Design limitations
- ==================
- DSA is a platform device driver
- -------------------------------
- DSA is implemented as a DSA platform device driver, which is convenient because
- it will register the entire DSA switch tree attached to a master network device
- in one shot, facilitating the device creation and simplifying the device driver
- model a bit. This comes, however, with a number of limitations:
- - building DSA and its switch drivers as modules is currently not working
- - the device driver parenting does not necessarily reflect the original
- bus/device the switch can be created from
- - supporting non-MDIO and non-MMIO (platform) switches is not possible
- Limits on the number of devices and ports
- -----------------------------------------
- DSA currently limits the maximum number of switches within a tree to 4
- (DSA_MAX_SWITCHES), and the number of ports per switch to 12 (DSA_MAX_PORTS).
- These limits could be extended to support larger configurations should the need
- arise.
- Lack of CPU/DSA network devices
- -------------------------------
- DSA does not currently create slave network devices for the CPU or DSA ports, as
- described before. This might be an issue in the following cases:
- - inability to fetch switch CPU port statistics counters using ethtool, which
- can make it harder to debug MDIO switches connected using xMII interfaces
- - inability to configure the CPU port link parameters based on the Ethernet
- controller capabilities attached to it: http://patchwork.ozlabs.org/patch/509806/
- - inability to configure specific VLAN IDs / trunking VLANs between switches
- when using a cascaded setup
- Common pitfalls using DSA setups
- --------------------------------
- Once a master network device is configured to use DSA (dev->dsa_ptr becomes
- non-NULL), and the switch behind it expects a tagging protocol, this network
- interface can only be used as a conduit interface. Sending packets directly
- through this interface (e.g.: opening a socket using this interface) bypasses
- the switch tagging protocol transmit function, so the Ethernet switch on the
- other end, expecting a tag, will typically drop the frame.
- Slave network devices check that the master network device is UP before allowing
- you to administratively bring UP these slave network devices. A common
- configuration mistake is forgetting to bring UP the master network device first.
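The ordering constraint between master and slave devices can be pictured with a small model. The structure and function names are hypothetical, and the use of -ENETDOWN as the error code is an assumption made for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical network device: only the state needed for this check. */
struct demo_netdev {
	bool up;
	struct demo_netdev *master;	/* NULL for a regular device */
};

/*
 * Model of bringing a slave device UP: refuse while the master it is
 * stacked on is still administratively down.
 */
int demo_slave_open(struct demo_netdev *slave)
{
	if (slave->master != NULL && !slave->master->up)
		return -ENETDOWN;

	slave->up = true;
	return 0;
}
```

In practice this means bringing the master interface UP first and the per-port interfaces afterwards.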
- Interactions with other subsystems
- ==================================
- DSA currently leverages the following subsystems:
- - MDIO/PHY library: drivers/net/phy/phy.c, mdio_bus.c
- - Switchdev: net/switchdev/*
- - Device Tree for various of_* functions
- - HWMON: drivers/hwmon/*
- MDIO/PHY library
- ----------------
- Slave network devices exposed by DSA may or may not be interfacing with PHY
- devices (struct phy_device as defined in include/linux/phy.h), but the DSA
- subsystem deals with all possible combinations:
- - internal PHY devices, built into the Ethernet switch hardware
- - external PHY devices, connected via an internal or external MDIO bus
- - internal PHY devices, connected via an internal MDIO bus
- - special, non-autonegotiated or non MDIO-managed PHY devices: SFPs, MoCA; a.k.a
- fixed PHYs
- The PHY configuration is done by the dsa_slave_phy_setup() function and the
- logic basically looks like this:
- - if Device Tree is used, the PHY device is looked up using the standard
- "phy-handle" property, if found, this PHY device is created and registered
- using of_phy_connect()
- - if Device Tree is used, and the PHY device is "fixed", that is, conforms to
- the definition of a non-MDIO managed PHY as defined in
- Documentation/devicetree/bindings/net/fixed-link.txt, the PHY is registered
- and connected transparently using the special fixed MDIO bus driver
- - finally, if the PHY is built into the switch, as is very common with
- standalone switch packages, the PHY is probed using the slave MII bus created
- by DSA
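The three cases listed above amount to a small decision function. The sketch below is a model only: the types are hypothetical, and giving "phy-handle" precedence over a fixed link simply follows the order of the list, which is an assumption about how the cases combine.

```c
#include <stdbool.h>

/* Where the PHY for a port would come from in this model. */
enum demo_phy_source {
	DEMO_PHY_OF_HANDLE,	/* looked up through "phy-handle" */
	DEMO_PHY_FIXED_LINK,	/* fixed, non-MDIO managed PHY */
	DEMO_PHY_SLAVE_MII,	/* built-in PHY probed on the slave MII bus */
};

/* Hypothetical summary of what the port's Device Tree node provides. */
struct demo_port_desc {
	bool has_of_node;
	bool has_phy_handle;
	bool has_fixed_link;
};

enum demo_phy_source demo_phy_setup(const struct demo_port_desc *p)
{
	if (p->has_of_node && p->has_phy_handle)
		return DEMO_PHY_OF_HANDLE;
	if (p->has_of_node && p->has_fixed_link)
		return DEMO_PHY_FIXED_LINK;
	return DEMO_PHY_SLAVE_MII;	/* fall back to the switch's own PHY */
}
```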
- SWITCHDEV
- ---------
- DSA directly utilizes SWITCHDEV when interfacing with the bridge layer, and
- more specifically with its VLAN filtering portion when configuring VLANs on top
- of per-port slave network devices. Since DSA primarily deals with
- MDIO-connected switches, although not exclusively, SWITCHDEV's
- prepare/abort/commit phases are often simplified into a prepare phase which
- checks whether the operation is supported by the DSA switch driver, and a commit
- phase which applies the changes.
- As of today, the only SWITCHDEV objects supported by DSA are the FDB and VLAN
- objects.
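The simplified two-phase model described in this section can be sketched for a VLAN object. Everything below is a hypothetical user-space model (an 8-port switch with a per-VLAN membership bitmap), not the SWITCHDEV API: prepare only validates, and commit applies a change that has already been validated.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_VID 4095

/* Hypothetical switch state: per-VID bitmask of member ports 0..7. */
struct demo_switch {
	bool vlan_supported;
	uint8_t vlan_members[DEMO_MAX_VID + 1];
};

/* Prepare phase: check that the request can be honoured, change nothing. */
int demo_vlan_prepare(const struct demo_switch *sw, int port, int vid)
{
	if (!sw->vlan_supported)
		return -EOPNOTSUPP;
	if (port < 0 || port > 7 || vid < 1 || vid >= DEMO_MAX_VID)
		return -EINVAL;
	return 0;
}

/* Commit phase: apply a request that prepare accepted; cannot fail. */
void demo_vlan_commit(struct demo_switch *sw, int port, int vid)
{
	sw->vlan_members[vid] |= (uint8_t)(1u << port);
}

/* Run both phases back to back, as a caller of this model would. */
int demo_vlan_add(struct demo_switch *sw, int port, int vid)
{
	int err = demo_vlan_prepare(sw, port, vid);

	if (err)
		return err;
	demo_vlan_commit(sw, port, vid);
	return 0;
}
```

Keeping all failure paths in the prepare phase is what lets the commit phase be unconditional in this sketch.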
- Device Tree
- -----------
- DSA features a standardized binding which is documented in
- Documentation/devicetree/bindings/net/dsa/dsa.txt. PHY/MDIO library helper
- functions such as of_get_phy_mode(), of_phy_connect() are also used to query
- per-port PHY specific details: interface connection, MDIO bus location etc.
- HWMON
- -----
- Some switch drivers feature internal temperature sensors which are exposed as
- regular HWMON devices in /sys/class/hwmon/.
- Driver development
- ==================
- DSA switch drivers need to implement a dsa_switch_driver structure which will
- contain the various members described below.
- register_switch_driver() registers this dsa_switch_driver in its internal list
- of drivers to probe for. unregister_switch_driver() does the exact opposite.
- Unless requested differently by setting the priv_size member accordingly, DSA
- does not allocate any driver private context space.
- Switch configuration
- --------------------
- - priv_size: additional size needed by the switch driver for its private context
- - tag_protocol: this is to indicate what kind of tagging protocol is supported,
- should be a valid value from the dsa_tag_protocol enum
- - probe: probe routine which will be invoked by the DSA platform device upon
- registration to test for the presence/absence of a switch device. For MDIO
- devices, it is recommended to issue a read towards internal registers using
- the switch pseudo-PHY and return whether this is a supported device. For other
- buses, return a non-NULL string
- - setup: setup function for the switch, this function is responsible for setting
- up the dsa_switch_driver private structure with all it needs: register maps,
- interrupts, mutexes, locks etc. This function is also expected to properly
- configure the switch to separate all network interfaces from each other, that
- is, they should be isolated by the switch hardware itself, typically by creating
- a Port-based VLAN ID for each port and allowing only the CPU port and the
- specific port to be in the forwarding vector. Ports that are unused by the
- platform should be disabled. Past this function, the switch is expected to be
- fully configured and ready to serve any kind of request. It is recommended
- to issue a software reset of the switch during this setup function in order to
- avoid relying on what a previous software agent such as a bootloader/firmware
- may have previously configured.
- - set_addr: Some switches require the programming of the management interface's
- Ethernet MAC address, switch drivers can also disable ageing of MAC addresses
- on the management interface and "hardcode"/"force" this MAC address for the
- CPU/management interface as an optimization
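The port isolation expected from the setup callback can be illustrated by computing one forwarding vector per port. This is a sketch with hypothetical names: DEMO_MAX_PORTS mirrors the 12-port limit mentioned in this document, and real hardware will express the vectors through its own registers.

```c
#include <stdint.h>

#define DEMO_MAX_PORTS 12	/* mirrors the per-switch port limit */

/*
 * Compute one forwarding vector per port:
 *  - an enabled user port may only reach itself and the CPU port,
 *  - the CPU port may reach itself and every enabled user port,
 *  - a port unused by the platform gets an empty vector (disabled).
 * Returns 0 on success, -1 if cpu_port is out of range.
 */
int demo_isolate_ports(uint16_t enabled_mask, int cpu_port,
		       uint16_t fwd[DEMO_MAX_PORTS])
{
	uint16_t cpu_bit;
	int port;

	if (cpu_port < 0 || cpu_port >= DEMO_MAX_PORTS)
		return -1;
	cpu_bit = (uint16_t)(1u << cpu_port);

	for (port = 0; port < DEMO_MAX_PORTS; port++) {
		uint16_t bit = (uint16_t)(1u << port);

		if (port == cpu_port)
			fwd[port] = (uint16_t)(enabled_mask | bit);
		else if (enabled_mask & bit)
			fwd[port] = (uint16_t)(bit | cpu_bit);
		else
			fwd[port] = 0;
	}
	return 0;
}
```

With user ports 0 to 2 enabled and the CPU on port 5, each user port ends up with a two-bit vector, so no frame can be switched between front-panel ports until a bridge is configured.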
- PHY devices and link management
- -------------------------------
- - get_phy_flags: Some switches are interfaced to various kinds of Ethernet PHYs,
- if the PHY library PHY driver needs to know about information it cannot obtain
- on its own (e.g.: coming from switch memory mapped registers), this function
- should return a 32-bit bitmask of "flags", that is private between the switch
- driver and the Ethernet PHY driver in drivers/net/phy/*.
- - phy_read: Function invoked by the DSA slave MDIO bus when attempting to read
- the switch port MDIO registers. If unavailable, return 0xffff for each read.
- For builtin switch Ethernet PHYs, this function should allow reading the link
- status, auto-negotiation results, link partner pages etc.
- - phy_write: Function invoked by the DSA slave MDIO bus when attempting to write
- to the switch port MDIO registers. If unavailable return a negative error
- code.
- - poll_link: Function invoked by DSA to query the link state of the switch
- builtin Ethernet PHYs, per port. This function is responsible for calling
- netif_carrier_{on,off} when appropriate, and can be used to poll all ports in a
- single call. Executes from workqueue context.
- - adjust_link: Function invoked by the PHY library when a slave network device
- is attached to a PHY device. This function is responsible for appropriately
- configuring the switch port link parameters: speed, duplex, pause based on
- what the phy_device is providing.
- - fixed_link_update: Function invoked by the PHY library, and specifically by
- the fixed PHY driver asking the switch driver for link parameters that could
- not be auto-negotiated, or obtained by reading the PHY registers through MDIO.
- This is particularly useful for specific kinds of hardware such as QSGMII,
- MoCA or other kinds of non-MDIO managed PHYs where out of band link
- information is obtained
- Ethtool operations
- ------------------
- - get_strings: ethtool function used to query the driver's strings, will
- typically return statistics strings, private flags strings etc.
- - get_ethtool_stats: ethtool function used to query per-port statistics and
- return their values. DSA overlays slave network devices general statistics:
- RX/TX counters from the network device, with switch driver specific statistics
- per port
- - get_sset_count: ethtool function used to query the number of statistics items
- - get_wol: ethtool function used to obtain Wake-on-LAN settings per-port, this
- function may, for certain implementations also query the master network device
- Wake-on-LAN settings if this interface needs to participate in Wake-on-LAN
- set_wol: ethtool function used to configure Wake-on-LAN settings per-port,
- direct counterpart to get_wol with similar restrictions
- - set_eee: ethtool function which is used to configure a switch port EEE (Green
- Ethernet) settings, can optionally invoke the PHY library to enable EEE at the
- PHY level if relevant. This function should enable EEE at the switch port MAC
- controller and data-processing logic
- - get_eee: ethtool function which is used to query a switch port EEE settings,
- this function should return the EEE state of the switch port MAC controller
- and data-processing logic as well as query the PHY for its currently configured
- EEE settings
- - get_eeprom_len: ethtool function returning for a given switch the EEPROM
- length/size in bytes
- - get_eeprom: ethtool function returning for a given switch the EEPROM contents
- - set_eeprom: ethtool function writing specified data to a given switch EEPROM
- - get_regs_len: ethtool function returning the register length for a given
- switch
- - get_regs: ethtool function returning the Ethernet switch internal register
- contents. This function might require user-land code in ethtool to
- pretty-print register values and registers
- Power management
- ----------------
- - suspend: function invoked by the DSA platform device when the system goes to
- suspend, should quiesce all Ethernet switch activities, but keep ports
- participating in Wake-on-LAN active as well as additional wake-up logic if
- supported
- - resume: function invoked by the DSA platform device when the system resumes,
- should resume all Ethernet switch activities and re-configure the switch to be
- in a fully active state
- port_enable: function invoked by the DSA slave network device ndo_open
- function when a port is administratively brought up. This function should
- fully enable a given switch port. DSA takes care of marking the port with
- BR_STATE_BLOCKING if the port is a bridge member, or BR_STATE_FORWARDING if it
- is not, and propagating these changes down to the hardware
- port_disable: function invoked by the DSA slave network device ndo_close
- function when a port is administratively brought down. This function should
- fully disable a given switch port. DSA takes care of marking the port with
- BR_STATE_DISABLED and propagating changes to the hardware if this port is
- disabled while being a bridge member
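The STP state chosen around port_enable and port_disable can be summarized in a few lines. The sketch uses a local enum with the same names and ordering as the Linux BR_STATE_* values; the demo_port structure is hypothetical, and marking every closed port as disabled (bridge member or not) is a simplification of this model.

```c
#include <stdbool.h>

/* Local copy of the bridge port states, same order as BR_STATE_* in Linux. */
enum demo_br_state {
	DEMO_BR_STATE_DISABLED,
	DEMO_BR_STATE_LISTENING,
	DEMO_BR_STATE_LEARNING,
	DEMO_BR_STATE_FORWARDING,
	DEMO_BR_STATE_BLOCKING,
};

/* Hypothetical per-port state tracked by this model. */
struct demo_port {
	bool enabled;
	bool bridged;			/* true when the port is a bridge member */
	enum demo_br_state stp_state;
};

/* Port brought up: bridge members start blocked, standalone ports forward. */
void demo_port_open(struct demo_port *p)
{
	p->enabled = true;
	p->stp_state = p->bridged ? DEMO_BR_STATE_BLOCKING
				  : DEMO_BR_STATE_FORWARDING;
}

/* Port brought down: mark it disabled so the hardware stops forwarding. */
void demo_port_close(struct demo_port *p)
{
	p->enabled = false;
	p->stp_state = DEMO_BR_STATE_DISABLED;
}
```

Starting bridge members in the blocking state leaves it to the bridge layer to move them to forwarding once STP allows it.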
- Hardware monitoring
- -------------------
- These callbacks are only available if CONFIG_NET_DSA_HWMON is enabled:
- - get_temp: this function queries the given switch for its temperature
- - get_temp_limit: this function returns the switch current maximum temperature
- limit
- - set_temp_limit: this function configures the maximum temperature limit allowed
- - get_temp_alarm: this function returns the critical temperature threshold
- returning an alarm notification
- See Documentation/hwmon/sysfs-interface for details.
- Bridge layer
- ------------
- port_join_bridge: bridge layer function invoked when a given switch port is
- added to a bridge. This function should do what is necessary at the switch
- level to permit the joining port to be added to the relevant logical
- domain for it to ingress/egress traffic with other members of the bridge. DSA
- does nothing but calculate a bitmask of switch ports currently members of the
- bridge being joined
- port_leave_bridge: bridge layer function invoked when a given switch port is
- removed from a bridge. This function should do what is necessary at the
- switch level to prevent the leaving port from exchanging ingress/egress
- traffic with the remaining bridge members. When the port leaves the bridge, it
- should be aged out at the switch hardware for the switch to (re)learn MAC
- addresses behind this port. DSA calculates the bitmask of ports still members
- of the bridge being left
- port_stp_update: bridge layer function invoked when a given switch port STP
- state is computed by the bridge layer and should be propagated to switch
- hardware to forward/block/learn traffic. The switch driver is responsible for
- computing an STP state change based on current and asked parameters and
- performing the relevant ageing based on the intersection results
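The bitmask handed to port_join_bridge and port_leave_bridge can be pictured with the sketch below. The demo_port structure and the use of an opaque pointer to identify a bridge are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_PORTS 12

/* Hypothetical per-port state: which bridge, if any, the port belongs to. */
struct demo_port {
	const void *bridge;	/* NULL when the port is standalone */
};

/* Return a bitmask with one bit set per port that is a member of `bridge`. */
uint32_t demo_bridge_members(const struct demo_port ports[DEMO_MAX_PORTS],
			     const void *bridge)
{
	uint32_t mask = 0;
	int i;

	for (i = 0; i < DEMO_MAX_PORTS; i++)
		if (bridge != NULL && ports[i].bridge == bridge)
			mask |= 1u << i;

	return mask;
}
```

A driver receiving such a mask on join can open forwarding between exactly those ports; on leave, the recomputed mask no longer contains the departing port.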
- Bridge VLAN filtering
- ---------------------
- - port_pvid_get: bridge layer function invoked when a Port-based VLAN ID is
- queried for the given switch port
- - port_pvid_set: bridge layer function invoked when a Port-based VLAN ID needs
- to be configured on the given switch port
- - port_vlan_add: bridge layer function invoked when a VLAN is configured
- (tagged or untagged) for the given switch port
- - port_vlan_del: bridge layer function invoked when a VLAN is removed from the
- given switch port
- - vlan_getnext: bridge layer function invoked to query the next configured VLAN
- in the switch, i.e. returns the bitmaps of members and untagged ports
- port_fdb_add: bridge layer function invoked when the bridge wants to install a
- Forwarding Database entry. The switch hardware should be programmed with the
- specified address in the specified VLAN ID in the forwarding database
- associated with this VLAN ID
- Note: VLAN ID 0 corresponds to the port private database, which, in the context
- of DSA, would be its port-based VLAN, used by the associated bridge device.
- - port_fdb_del: bridge layer function invoked when the bridge wants to remove a
- Forwarding Database entry, the switch hardware should be programmed to delete
- the specified MAC address from the specified VLAN ID if it was mapped into
- this port forwarding database
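To make the port_fdb_add and port_fdb_del semantics concrete, here is a minimal software model of a forwarding database keyed by MAC address and VLAN ID. The table size, the structure layout and the error codes are choices made for this sketch; actual hardware tables are programmed through vendor-specific registers.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define DEMO_FDB_SIZE 8		/* arbitrary table size for the sketch */

struct demo_fdb_entry {
	uint8_t mac[6];
	uint16_t vid;
	int port;
	int used;
};

struct demo_fdb {
	struct demo_fdb_entry e[DEMO_FDB_SIZE];
};

/* Look up an entry by (MAC address, VLAN ID); NULL when absent. */
struct demo_fdb_entry *demo_fdb_find(struct demo_fdb *fdb,
				     const uint8_t mac[6], uint16_t vid)
{
	int i;

	for (i = 0; i < DEMO_FDB_SIZE; i++) {
		struct demo_fdb_entry *ent = &fdb->e[i];

		if (ent->used && ent->vid == vid && !memcmp(ent->mac, mac, 6))
			return ent;
	}
	return NULL;
}

/* Install the address for `port` in the database of `vid`, or move it there. */
int demo_port_fdb_add(struct demo_fdb *fdb, int port,
		      const uint8_t mac[6], uint16_t vid)
{
	struct demo_fdb_entry *ent = demo_fdb_find(fdb, mac, vid);
	int i;

	/* No existing entry: take the first free slot. */
	for (i = 0; !ent && i < DEMO_FDB_SIZE; i++)
		if (!fdb->e[i].used)
			ent = &fdb->e[i];
	if (!ent)
		return -ENOSPC;

	memcpy(ent->mac, mac, 6);
	ent->vid = vid;
	ent->port = port;
	ent->used = 1;
	return 0;
}

/* Remove the address only if it is currently mapped to `port`. */
int demo_port_fdb_del(struct demo_fdb *fdb, int port,
		      const uint8_t mac[6], uint16_t vid)
{
	struct demo_fdb_entry *ent = demo_fdb_find(fdb, mac, vid);

	if (!ent || ent->port != port)
		return -ENOENT;
	ent->used = 0;
	return 0;
}
```

Because the key includes the VLAN ID, the same MAC address can be installed independently in several VLANs, and a delete on one port never removes an entry that points to another.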
- TODO
- ====
- The platform device problem
- ---------------------------
- DSA is currently implemented as a platform device driver which is far from ideal
- as was discussed in this thread:
- http://permalink.gmane.org/gmane.linux.network/329848
- This basically prevents the device driver model from being properly used and
- applied, and prevents support for non-MDIO, non-MMIO Ethernet connected
- switches.
- Another problem with the platform device driver approach is that it prevents
- building the switch drivers as modules due to a circular dependency,
- illustrated here:
- http://comments.gmane.org/gmane.linux.network/345803
- An attempt at reworking this has been made here:
- https://lwn.net/Articles/643149/
- Making SWITCHDEV and DSA converge towards a unified codebase
- ------------------------------------------------------------
- SWITCHDEV properly takes care of abstracting the networking stack with offload
- capable hardware, but does not enforce a strict switch device driver model. On
- the other hand, DSA enforces a fairly strict device driver model, and deals with
- most of the switch specifics. At some point we should envision a merger between
- these two subsystems and get the best of both worlds.
- Other low-hanging fruit
- -----------------------
- - making the number of ports fully dynamic and not dependent on DSA_MAX_PORTS
- - allowing more than one CPU/management interface:
- http://comments.gmane.org/gmane.linux.network/365657
- - porting more drivers from other vendors:
- http://comments.gmane.org/gmane.linux.network/365510