
Alternative transport protocols


Pavel Kirienko

This post is the result of my limited research into alternative transports conducted over the last two weeks. It is assumed that the reader is familiar with the latest draft of the v1.0 specification.

(Aug 2019 edit: the ideas outlined here are being implemented either as-is or in a refined form. The current Specification draft may have diverged from this post, which is not being actively updated. In the case of any contradiction, information provided in the Specification draft and related materials takes precedence over this post.)

Motivation

In the very early days of UAVCAN, an automotive electronics engineer who worked for Mentor Graphics at the time helped me with some of the major design decisions. He was of the opinion (paraphrasing) that a new intravehicular communication protocol would never see widespread adoption unless it was designed to be portable across different transports. I generally agree with this sentiment, although back then I mostly ignored it in order to stay focused on a particular, narrow application domain, so that I could work out and validate the main design decisions and ideas. The protocol as it was back then, before the first (version zero) specification was even released, was very different from what we have today. In particular, the model of communication and the transport layer design were strongly tied to the CAN bus, which is no longer the case in the version one draft that we have worked out so far. It is therefore important to evaluate the current draft specification from the standpoint of cross-transport portability, to ensure that nothing in the first stable release harms such flexibility.

It would be a mistake to think that the purpose of the protocol itself is defined by the CAN bus or any other particular transport. Rather, it is intended to be a purely application-layer system. We have attempted to reflect this in the current draft specification by separating the core logic from the CAN bus transport definition. The name “UAVCAN” is therefore confusing. It is used for purely historical reasons and carries no semantic weight (which is why the specifications, both v0 and v1, do not expand the acronym). I suggest that with the addition of a new transport layer to the specification (which is likely to take place in UAVCAN v1.x, where x > 0), the document should specify the following meaning for the name UAVCAN: Uncomplicated Application-level Vehicular Communication And Networking.

I have attempted to identify the current trends in relevant industries, drawing on the literature and on my own experience in related projects, in order to define the objectives and requirements, which I then used to construct two new transport protocols for UAVCAN.

Background

For the purposes of this evaluation, the following applications of interest have been chosen (in order of preference): light manned electric aircraft and heavy unmanned drones (which share the same technological base, according to an industry expert who shall remain unnamed in this public post), medium and light drones (which are generally well-served by CAN), micro-satellites, other types of aircraft, and autonomous driving systems.

The demands of modern onboard intelligence systems tend to exceed the capabilities of legacy intravehicular communication standards developed specifically for vehicular applications. Because of these unmet demands, purely vehicular solutions (such as CAN bus, FlexRay, ARINC 429, MIL-STD-1553, etc.) are being replaced by (or augmented with) alternatives built upon higher-performance conventional technologies originally developed for consumer or industrial applications. Notable examples of such technology transfer are AFDX for modern avionics (ARINC 664; widely used in modern airliners and spacecraft) and the new AUTOSAR 17.03 extensions for high-performance automotive onboard networks; both standards are built on top of standard Ethernet (with some adjustments and limitations) with support for the conventional IP stack.

The effects of this transition can be seen in the market: high-reliability vehicular Ethernet switches are readily available off-the-shelf from numerous suppliers. Here is one such example, and here is another. Also, AFDX vendors recommend the Catalyst 2900 (a regular COTS L2 switch) for testing and evaluation.

Despite being based on the same underlying technology (commodity Ethernet over copper cabling or fiber optics), these standards tend to introduce various design trade-offs to better suit their target applications.

For example, the AUTOSAR extension mentioned earlier is designed to be compatible with common POSIX computing platforms with very minor restrictions. Its software design specifications permit dynamic reconfiguration, arbitrary multi-threading (including dynamic thread spawning and termination), and even purely dynamic memory allocation. Overall, there is some movement from hard determinism towards more flexible designs. This could probably be explained by the rapid increase in the complexity of onboard software, but this is a separate topic. The SOME/IP protocol used in AUTOSAR is very stateful and relies on fundamentally non-deterministic technologies such as TCP/IP, reminiscent of the Internet technology stack. There is no explicit support for high-reliability features such as interface redundancy.

AFDX, on the other hand, leans in the opposite direction, providing native support for redundant transports and strict timing and delivery guarantees. It can be viewed as a drop-in transport replacement for the (nearly) obsolete ARINC 429, replacing large point-to-point wiring harnesses with a single compact switched network. One of its interesting properties is that the routing and bandwidth allocation information is configured statically in the switches, which, along with its unusual treatment of MAC and IP addresses, constitutes a significant departure from conventional Ethernet deployments. The higher layers are, however, built upon conventional UDP. The choice of UDP seems straightforward given its simplicity and time determinism (unlike TCP); its lack of guaranteed delivery is irrelevant here because a robust transport is already provided by the underlying L2/L3 network.

There is also a variety of industrial communication standards that are designed to provide higher availability or predictability but that, for various reasons, are suboptimal or unfit for use in hi-rel vehicular applications (“The Evolution of Avionics Networks From ARINC 429 to AFDX”, 5.2.1). Their failure modes are of particular concern: industrial applications tend to be fail-safe rather than fail-operational, so their design objectives are quite different.

In the scope of hard real-time deterministic applications, the big problem of Ethernet and switched networks in general is the difficulty of predicting the worst-case packet propagation latency. As reviewed in detail in “The Evolution of Avionics Networks From ARINC 429 to AFDX” and “Communications for Integrated Modular Avionics”, star network topologies have inherent contention points at the output ports of the network switch hardware, since this is where traffic originating asynchronously from different sources has to be serialized and pushed out to the destination (or the next hop) sequentially (we are talking about full-duplex links here; half-duplex is unsuitable due to its nondeterministic collision resolution policy). For well-behaved network hardware, it can be proved that, given a limited network load, the packet propagation latency is always bounded. This is one of the cornerstone principles of AFDX: the ability of the network to meet real-time requirements hinges on the under-utilization of its bandwidth. According to the above-mentioned expert from Mentor Graphics, there are parallels to be drawn with the early days of CAN, when automotive engineers neglected to take advantage of the built-in CAN ID arbitration features and resorted to limiting the maximum bandwidth utilization in order to keep data propagation latencies predictable (one can still find traces of such archaic thinking in some older documents, which recommend never exceeding 50% to 70% of the total CAN bus bandwidth).
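A back-of-the-envelope sketch of why a limited load bounds the latency. It uses a deliberately simplified model (an assumption for illustration, not the analysis AFDX actually relies on): each of `n_flows` contending sources may have at most one maximum-size frame queued at a given output port, all of them arrive at the same instant, and the frame of interest is transmitted last. The 20-byte per-frame wire overhead (preamble, start-of-frame delimiter, and inter-frame gap) is included; switch fabric and store-and-forward delays are ignored. The function names are hypothetical.

```python
# Simplified single-output-port model; all names are illustrative only.
ETHERNET_WIRE_OVERHEAD_BYTES = 20  # 7-byte preamble + 1-byte SFD + 12-byte inter-frame gap


def frame_time_us(frame_bytes: int, link_bps: float) -> float:
    """Time to serialize one Ethernet frame (header to FCS) plus wire overhead, in microseconds."""
    bits = (frame_bytes + ETHERNET_WIRE_OVERHEAD_BYTES) * 8
    return bits / link_bps * 1e6


def worst_case_port_delay_us(n_flows: int, max_frame_bytes: int, link_bps: float) -> float:
    """Upper bound on queuing plus transmission delay at one output port, assuming every
    contending flow has at most one max-size frame queued and ours is served last."""
    return n_flows * frame_time_us(max_frame_bytes, link_bps)


if __name__ == "__main__":
    # 10 flows converging on one 1 Gbps output port, 1518-byte (maximum untagged) frames
    print(f"{worst_case_port_delay_us(10, 1518, 1e9):.2f} us")
```

Under this model the bound for the example is about 123 µs and grows linearly with the number of contending flows per port, which is why keeping the load limited keeps the latency predictable. More rigorous methods, such as network calculus, also account for traffic burstiness and multi-hop paths.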

There are efforts to alleviate somewhat the negative effects of output port contention in Ethernet networks in order to improve the performance of real-time applications. AFDX, in particular, prioritizes its data paths (“virtual links”, which can be viewed as a tunneling feature for the legacy ARINC 429; I won’t go into detail here) according to statically pre-configured routing settings in the switches (special AFDX switches; commodity network hardware may not be directly applicable without sacrificing latency).

Regular COTS networking hardware supports VLAN QoS and configurable classes of service. In fact, modern COTS L2 switches offer hardware support for traffic policing and prioritization based on arbitrary user-configurable rules; for example, it is technically possible to prioritize or drop L2 frames based on the value of some arbitrary bit field in a packet (see Juniper filters and classifiers; also, Cisco FlexMatch is usable for CoS assignment). While we are on the subject of networking hardware, it should also be mentioned (although it is probably well known) that even the most basic single-chip Ethernet controllers support rather sophisticated hardware traffic filtering policies; e.g., the ubiquitous Microchip ENC28J60 supports packet filtering based on simple payload pattern matching (section 8.2). At the risk of getting ahead of myself, I should say that these advanced features can be further exploited to create a powerful and flexible real-time network architecture using only COTS hardware, which we will discuss later.
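To make the idea of matching on an arbitrary bit field concrete, here is a minimal, vendor-neutral sketch of a rule table that classifies raw L2 frames by comparing a masked byte range at a fixed offset. The `Rule` structure, the first-match semantics, and the example rules are assumptions for illustration; they do not model the Juniper, Cisco, or ENC28J60 filter syntax.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

DROP = None  # action value meaning "discard the frame"


@dataclass(frozen=True)
class Rule:
    offset: int            # byte offset of the field from the start of the L2 frame
    mask: bytes            # which bits of the field to compare
    value: bytes           # expected bits (same length as mask)
    action: Optional[int]  # class of service to assign, or DROP

    def matches(self, frame: bytes) -> bool:
        window = frame[self.offset:self.offset + len(self.mask)]
        if len(window) < len(self.mask):
            return False  # frame too short to contain the field
        return all((b & m) == (v & m) for b, m, v in zip(window, self.mask, self.value))


def classify(frame: bytes, rules: Sequence[Rule], default_cos: int = 0) -> Optional[int]:
    """First matching rule wins; unmatched frames get the default class of service."""
    for rule in rules:
        if rule.matches(frame):
            return rule.action
    return default_cos


# Hypothetical rules for untagged Ethernet II frames (EtherType at bytes 12..13).
# 0x88B5 is an IEEE "local experimental" EtherType; 0x86DD is IPv6.
EXAMPLE_RULES = [
    Rule(offset=12, mask=b"\xff\xff\x80", value=b"\x88\xb5\x80", action=7),  # top payload bit set
    Rule(offset=12, mask=b"\xff\xff", value=b"\x88\xb5", action=5),
    Rule(offset=12, mask=b"\xff\xff", value=b"\x86\xdd", action=DROP),
]

if __name__ == "__main__":
    frame = bytes(12) + b"\x88\xb5\x80" + bytes(45)  # zeroed MAC addresses, 60-byte frame
    print(classify(frame, EXAMPLE_RULES))
```

Rule order matters here: the more specific rule (EtherType plus one payload bit) is listed before the broader EtherType-only rule, as is typical for first-match filter tables.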

I perceive a slow shift from strict determinism and rigorous models towards more flexible, less deterministic systems with somewhat relaxed requirements for time predictability. If my perception is correct, the change could be explained by a steady increase in the complexity of onboard intelligence (both hardware and software) and by the gradual transfer of responsibilities from human operators to automated systems. For automotive systems, one could see the precursors in the addition of Ethernet networks alongside the strictly deterministic CAN/FlexRay and in the new software development standards that permit dynamic threads and dynamic memory allocation. For avionics, the addition of Ethernet alongside very robust signaling links like ARINC 429 and the new software virtualization features outlined in ARINC 653 could be interpreted as pointing in the same direction.

The above overview focused exclusively on wired and optical networks. At first glance, this seems an obvious choice given the currently available technologies; however, there is one new and very interesting undertaking to consider: Wireless Avionics Intra-Communications (WAIC). Networks that rely on physical rigging such as cables have common failure modes (e.g., a cable may be torn, wires may be affected by EMI) that do not affect wireless links. The failure modes of the latter are drastically different, which may theoretically permit one to construct a very robust network by exploiting the dissimilarity between the failure modes of wired and wireless links.

Lastly, I would like to mention two small independent undertakings to develop a brokerless message bus over UDP (a logical bus over a physical star topology). Both sources are in Russian; GitHub links are available in the linked articles.

  • MQTT/UDP: a brokerless reimplementation of MQTT using the UDP broadcast transport.
  • Mutalk: a very compact pub/sub protocol, also based on UDP broadcast.

The projects target completely different applications, where a high degree of determinism is not required. However, they are worth mentioning because they rely on similar principles of communication.


Objectives

Drawing on my recent experience with a certain related application and on the above assessment, I would like to perform a porting exercise to ensure that the concepts and principles going into the first stable release of the specification will not prevent us from efficiently supporting new transports in the future. The propositions made below are very far from being spec-ready or even production-ready; rather, they should be considered a set of basic ideas that we can build upon. Some of them may be implemented in software as experimental extensions of the protocol.

Since I have mentioned software, I should clarify that although the protocol supports different transports, this does not mean that every implementation is required to support all of them. Obviously, some implementations will focus on a particular type of transport (e.g., libuavcan is built for CAN), while others may support many types of transports concurrently (e.g., pyuavcan would be trivial to extend for any transport).

So, the high-level objective of this exercise is to make UAVCAN usable over a more capable wired transport than CAN. The requirements are:

  • The maximum supported throughput is at least 1 Gbps, preferably up to 10 Gbps (for future extensibility).
  • Latency is bounded and predictable; for a typical deployment, it should be in the range of hundreds of microseconds per frame.
  • The transport is extensible to large deployments of at least 1000 nodes and at least 1 km of total wiring length.
  • The transport should support efficient broadcasting since this is the primary method of data exchange for UAVCAN. There must be facilities for efficient traffic filtering both in the end nodes and in the auxiliary network equipment (e.g., packet switches).

Wireless transports should also be considered, either standalone or as part of heterogeneous redundant configurations. The initial set of requirements stemmed from the properties of our target vehicular applications and included:

  • Same latency requirements.
  • Support for direct broadcasting between nodes.
  • Operating range up to 100 meters.

UAVCAN is designed as a logical bus (where “logical” refers to the high-level communication model and not the physical network topology; for example, CAN is a physical bus, a gigabit Ethernet network is a physical star or tree, whereas low-speed Ethernet can be either). This choice of logical topology has some significant advantages; it should therefore not be altered. The challenges that it creates for the transport layer shall be managed by the transport layer itself, not at the expense of the upper layers of the stack. For example, CAN bus offers hardware acceptance filtering for subscription opt-in and service transfer addressing; similar mechanisms must be available in the new transports. While this topic is too complex to cover extensively here, the logical bus topology can be considered superior in its simplicity and flexibility, as data sources and consumers can be completely logically decoupled from each other. By contrast, topologies based on explicit routing (e.g., SpaceWire) require a logical binding between agents, whereas subscription-based topologies (e.g., SOME/IP, DDS/RTPS) tend to be very stateful and thus fragile. In order to avoid traffic duplication, the transport must support efficient multicasting natively. Again, this is not meant to be an exhaustive overview of network architectures; such a discussion is beyond the scope of this article.

CAN bus guarantees that the data propagation latency is equal across the whole network, meaning that every station receives a transport frame at the same time. This property is leveraged by UAVCAN for the precise time synchronization feature only. Therefore, provided that an alternative method of time synchronization is available, the new transport is relieved from guaranteeing uniform propagation latency.

As different applications may favor different transports, it is expected that different subsystems within one vehicle may employ different transports while still needing to exchange data with each other. This pattern can be observed in common avionic systems, where, for example, CAN-based ARINC 825 subnets (e.g., wingtip avionics) may interconnect with the backbone AFDX network via gateway nodes. This use case should be well supported.

The new transports must support heterogeneous redundant configurations to enable dissimilar transport redundancy. The types of involved transports and their properties should be hidden from the application.

The new transports should minimize restrictions and special requirements imposed on the lower layers. For example, redundant Ethernet deployments in avionics require that a disconnected port continue transmitting data, even though the data will never be delivered to the other end of the link. This is done to prevent stale data from backing up in the transmit queue: if the connection were restored, the stale data would be released onto the network and could disrupt the operation of the system.

Transport-agnostic model refinement

It is easy to mistakenly believe that the current draft specification is closely tied to the CAN bus and is therefore not portable. To demonstrate that this is not true, we need to update the communication model definition to make it transport-agnostic.

The following diagram introduces several new terms.

“Specifier” is a collection of identifiers that together define a category of entities. Specifiers are auxiliary ephemeral constructs needed only for the completeness of the model and for reasoning about the protocol; implementations need not deal with them explicitly.

“Route specifier” is either of:

  • a pair of source node ID and destination node ID;
  • a source node ID only; in this case, it is implied that the destination is the whole network (i.e., broadcast).

“Data specifier” is either of:

  • subject ID;
  • service ID and a selector indicating whether the transfer is a service request or a service response.

A data specifier describes what data structure is contained in the transfer and what it means (i.e., how it should be interpreted).

“Session specifier” contains a data specifier and a route specifier. Its purpose is to uniquely identify not only the meaning of the data but also the agents participating in its exchange. The term may remind one of layer 5 (the session layer) of the OSI model, but such a mapping may not be entirely correct.
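To make the model concrete, the three specifiers can be written down as plain immutable records. The class and field names below are hypothetical illustrations, not part of any UAVCAN implementation or of the specification.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


@dataclass(frozen=True)
class RouteSpecifier:
    source_node_id: int
    destination_node_id: Optional[int] = None  # None: the whole network (broadcast)

    @property
    def is_broadcast(self) -> bool:
        return self.destination_node_id is None


@dataclass(frozen=True)
class MessageDataSpecifier:
    subject_id: int


class ServiceRole(Enum):
    REQUEST = "request"
    RESPONSE = "response"


@dataclass(frozen=True)
class ServiceDataSpecifier:
    service_id: int
    role: ServiceRole  # selects request vs. response


DataSpecifier = Union[MessageDataSpecifier, ServiceDataSpecifier]


@dataclass(frozen=True)
class SessionSpecifier:
    data_specifier: DataSpecifier    # what the data is and how to interpret it
    route_specifier: RouteSpecifier  # who exchanges it
```

Because the records are frozen and therefore hashable, a session specifier can serve directly as a dictionary key for per-session reception state; the transfer ID and the priority deliberately live outside it, in line with the model.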

It is well known that one diagram is worth 1024 words:

[Diagram: Communication link model]

Applying this model to CAN, one will see that the CAN frame identifier contains the session specifier and the transfer priority. The transfer ID is placed in the CAN payload because it is not a member of the session specifier and is therefore useless for routing and filtering. The fact that the entire session specifier is managed by the same feature of the protocol is a manifestation of the fact that CAN spans several widely separated ISO/OSI layers.

The proposed model can be easily applied to various transport protocol stacks that adhere more strictly to the ISO/OSI model, as demonstrated in the next section. The model could also be applied to less well-layered transports, such as FlexRay, although this may be much more difficult.

Protocol extensions and modifications

Before we define the new transport-specific implementations, a few other questions must be resolved.

Subject ID range problem

(The proposal has been accepted and implemented in Specification v1.0. At the time of writing, the subject-ID range was [0, 65535].)

A well-layered transport like UDP or IEEE 802.15.4 takes care of datagram delivery between hosts without any help from the higher layers (unlike CAN). Therefore, the route specifier will not be used above the transport layer.

The data specifier, which communicates how the data contained in the transport datagram should be handled, needs to be carried in the transport frame next to the payload. As we are dealing with relatively high-level transports, reliance on bit-level data field segmentation (as in the CAN ID) or on integers of non-standard bit width (other than 8, 16, 32, or 64) may be impractical. Therefore, the wire representation of a data specifier should fit into a standard-size integer value.

Per the definition provided earlier, a data specifier consists of:

  • A kind selector, i.e., message or service (2 cases, 1 bit).
  • If kind selector is “message”:
    • Subject ID (65536 cases, 16 bits)
  • If kind selector is “service”:
    • Service ID (512 cases, 9 bits)
    • Request/response selector (2 cases, 1 bit)

If the kind selector is set to “service”, the required number of bits is 1 + 9 + 1 = 11 bits, which fits into a standard-size 16-bit integer field, leaving 5 bits reserved for future needs.

If the kind selector is set to “message”, the required number of bits is 1 + 16 = 17 bits. The next standard-size integer field is 32 bits wide, which would leave 15 bits unused. A more practical solution is to reduce the number of subject ID cases to 32768, thereby freeing up one bit.

If the above change is implemented, the data specifier will be able to fit into a standard-size 16-bit integer field. The exact mapping can be defined for each transport layer individually; for example, the lower 32768 values can be used to represent the subject ID directly (since this is the most commonly used form of communication), the next 512 values can be reserved for service ID requests, and then 512 values for service responses. The unused 31744 values will be reserved for future use.
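The example mapping above can be sketched as a pair of encode/decode functions over a single 16-bit value. This follows the illustrative layout from the text (subjects first, then 512 request values, then 512 response values); it is not a normative encoding, and the function names are hypothetical.

```python
from typing import Tuple

# One possible 16-bit mapping, following the example in the text (not normative):
SUBJECT_ID_COUNT = 32768                          # 0..32767: subject ID, stored as-is
SERVICE_ID_COUNT = 512
REQUEST_BASE = SUBJECT_ID_COUNT                   # 32768..33279: service requests
RESPONSE_BASE = REQUEST_BASE + SERVICE_ID_COUNT   # 33280..33791: service responses
RESERVED_BASE = RESPONSE_BASE + SERVICE_ID_COUNT  # 33792..65535: reserved (31744 values)


def encode_message(subject_id: int) -> int:
    if not 0 <= subject_id < SUBJECT_ID_COUNT:
        raise ValueError(f"subject ID out of range: {subject_id}")
    return subject_id


def encode_service(service_id: int, is_request: bool) -> int:
    if not 0 <= service_id < SERVICE_ID_COUNT:
        raise ValueError(f"service ID out of range: {service_id}")
    return (REQUEST_BASE if is_request else RESPONSE_BASE) + service_id


def decode(value: int) -> Tuple[str, int]:
    """Map a 16-bit data specifier back to ("message" | "request" | "response", ID)."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError(f"not a 16-bit value: {value}")
    if value < REQUEST_BASE:
        return "message", value
    if value < RESPONSE_BASE:
        return "request", value - REQUEST_BASE
    if value < RESERVED_BASE:
        return "response", value - RESPONSE_BASE
    raise ValueError(f"reserved data specifier value: {value}")
```

On the wire, the resulting integer could be written as a 16-bit field, e.g. with `struct.pack("<H", encode_service(430, True))`; the byte order would be up to each transport definition.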

Due to the reduced range, the subject ID range segmentation should be altered as follows:

| From | To | Capacity | Purpose |
|---|---|---|---|
| 0 | 24575 | 24576 | Unregulated identifiers |
| 28672 | 29695 | 1024 | Non-standard (vendor-specific) regulated identifiers |
| 31744 | 32767 | 1024 | Standard regulated identifiers |
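A small lookup sketch that uses only the three ranges listed in the table. Values in the gaps between them (24576 to 28671 and 29696 to 31743) are not assigned by the table, so the function returns `None` for them; the function name is hypothetical.

```python
from typing import Optional

# (first, last, purpose) exactly as listed in the table; gaps are left unassigned
SUBJECT_ID_SEGMENTS = [
    (0, 24575, "unregulated"),
    (28672, 29695, "non-standard regulated"),
    (31744, 32767, "standard regulated"),
]


def subject_id_segment(subject_id: int) -> Optional[str]:
    """Return the purpose of the segment containing subject_id, or None for a gap."""
    if not 0 <= subject_id <= 32767:
        raise ValueError(f"subject ID out of the reduced 15-bit range: {subject_id}")
    for first, last, purpose in SUBJECT_ID_SEGMENTS:
        if first <= subject_id <= last:
            return purpose
    return None
```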

Node ID range problem

As discussed earlier, the limit of 128 nodes per network is unacceptable for larger deployments (which will become available with new transports); therefore, the limit needs to be increased for non-CAN based transports.

Using existing networks as a reference, particularly the upcoming WAIC standard discussed earlier, one will see that there exist sensible vehicular applications requiring thousands of (simple) nodes per logical network.

Following the principle of adherence to standard-size integers, the next appropriate threshold for the node ID size is 16 bits, or 65536 nodes per network. This value would map well to some existing protocols, such as IEEE 802.15.4 (where valid node addresses range from 0 to 65534, inclusive, with 65535 reserved for broadcasting), the last hextet of an IPv6 address, or the last two octets of a class-B IPv4 address.

However, considering the limited set of available subject IDs (24576 values; we are not including regulated identifiers since they are fixed for all nodes), the set of practical usage scenarios where a network might sensibly utilize more than 24576 nodes is limited.

The specifics of highly deterministic nodes need to be considered as well, for example, nodes that perform mission-critical and/or hard real-time tasks requiring highly predictable behavior. Upon careful evaluation of the UAVCAN stack, one can see that the amount of resources (time and/or memory, with a possibility of trade-off) necessary for deterministic handling of a given transport frame is a function of the highest possible (worst-case) number of nodes in the network (among other things, possibly depending on the implementation). I will skip a detailed analysis here, but the main reason for this dependency is that the protocol requires receiving nodes to maintain individual state per transmitting node.

One might argue that this is a design issue of the protocol, but so far no better solutions that meet the core design goals have been found, so the question of optimal design will not be raised in this discussion.

Additionally, some of the suitable transport protocols may require the transport protocol itself (i.e., besides the higher layers of the stack) to keep some state per node. For example, IP-based transports must allocate space for ARP tables. Complex deterministic nodes that are expected to initiate unicast (service) transfers (n.b.: most resource-constrained end nodes will not need ARP by virtue of only needing broadcast transfers) or protocol bridge nodes will have to allocate copious amounts of memory for static ARP tables. To demonstrate the extent of the problem: if we were to use a 16-bit node ID and limit the worst-case (maximum) node capacity to 65536 nodes per network, the worst-case size of a highly deterministic ARP table might be as high as 384 KiB (6 bytes per MAC address × 2^16 nodes ÷ 1024 bytes per KiB) per redundant interface (assuming one MAC address per interface). As noted earlier, however, this consideration does not apply to most nodes (especially simple ones), because they can limit themselves to broadcasting, requiring unicast transfers only for responding to service requests, in which case a single-entry ARP cache (6 bytes) will suffice.
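The arithmetic above generalizes to any node ID width. The sketch below assumes, as the text does, one 6-byte MAC address per possible node and one table per redundant interface, with the table indexed directly by node ID (an O(1) container); the function name is hypothetical.

```python
MAC_ADDRESS_BYTES = 6


def static_arp_table_bytes(node_id_bits: int, interfaces: int = 1) -> int:
    """Worst-case size of a statically allocated ARP table indexed directly by node ID:
    one MAC address slot per possible node ID, one table per redundant interface."""
    return MAC_ADDRESS_BYTES * (1 << node_id_bits) * interfaces


if __name__ == "__main__":
    # 7 bits: CAN-sized network; 12 bits: a 4096-node limit; 16 bits: full 16-bit node ID
    for bits in (7, 12, 16):
        size = static_arp_table_bytes(bits)
        print(f"{bits:2d}-bit node ID, {1 << bits:5d} nodes: {size / 1024:g} KiB per interface")
```

For a 4096-node network, the same table shrinks to 24 KiB per interface, and to 768 bytes for a 128-node CAN-sized network.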

Certainly, one could design a well-behaved deterministic node without O(1) containers, but the dependency on the network node capacity would remain; only its manifestation would change. Besides the memory footprint, it would also affect the frame processing time, although this would still be bounded. The point of the above is to demonstrate that the maximum network capacity must be sized properly to find the optimum satisfying all requirements:

  • The maximum number of supported nodes per network should be sufficient for any sensible application.
  • Deterministic nodes should not be burdened unnecessarily.

Given the limit of ~25k subject IDs and the determinism considerations, we could arbitrarily draw the line at 2^12 (4096) nodes per network, with a possibility of future extension.

There will remain an odd duality between different types of transports: some will be limited to at most 128 nodes per (sub)network (CAN 2.0, CAN FD), while others will be able to reach the logical limit defined above.

Transfer ID range problem

The limited dynamic range of the transfer ID, and the resulting very short overflow period, is a serious limitation of the CAN transport. This problem affects nodes with redundant interfaces, requiring them to receive transfers from only one of the available redundant interfaces. Simultaneous reception from multiple interfaces is not possible because of the very short wraparound period of transfer ID (every 32 transfers). This is unfortunate because:

  • Simultaneous reception through all of the available interfaces (like in AFDX) reduces median latency and jitter (although it generally cannot improve the worst case). The CAN transport cannot benefit from these advantages.
  • In the case of an interface failure, the receiving node may lose some of the incoming data before switching over to one of the redundant interfaces. If simultaneous reception is used, failure of an interface does not affect the operation of the node as long as at least one of the available interfaces continues to function.
  • Intermittent failures of all of the available interfaces (e.g., due to a faulty connection or another common-mode connectivity failure) may render the node unable to receive data from the bus due to the switch-over delay.
  • Transmitting nodes must handle disconnected interfaces (including intentionally disconnected interfaces and those experiencing failures) in a special way, ensuring that their transmission queues do not contain stale transport frames. Otherwise, if the connection is restored, the obsolete frames will be transmitted onto the network, possibly disrupting application-level processes on the nodes that consume the published data. This is because the receiving nodes are unable to reliably compare the age of data due to frequently overflowing transfer ID values.
  • It is not possible to reliably determine the number of lost/undelivered transfers (unless the application layer is involved) because of the overflowing nature of the transfer ID. This makes certain use cases, such as re-requesting lost or missing data, more complex than they could be.

Unfortunately, due to the limited capabilities of CAN, trade-offs had to be made. Carrying the same trade-offs to more capable transports would be unwise. Hence, the dynamic range of the transfer ID should be increased.

There exist certain failure modes, such as the case of a temporarily disconnected interface, where an overflowing transfer ID is likely to cause problems regardless of how long the overflow period is. At the same time, payload space in highly capable transports is cheap enough to accommodate a transfer ID wide enough that it will never overflow in any sensible scenario.

It is therefore proposed to equip all transports that are more capable than CAN with a very wide transfer ID. For transports with a maximum throughput under 10^5 transfers per second, the transfer ID field should be at least 48 bits wide (overflow period at that transfer rate: ~90 years). For higher-throughput transports, the transfer ID field should be at least 56 bits wide. The nearest standard-size integer field is 64 bits wide. As a theoretical worst-case reference (unattainable in practice), a COTS 10 GbE adapter is capable of handling up to 14.9 million frames per second.
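The overflow periods quoted above follow from dividing the number of distinct transfer ID values by the transfer rate. A quick check, assuming a constant rate and a 365.25-day year; the 1000 transfers per second used for the 5-bit CAN case is an arbitrary illustrative rate, and the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year


def overflow_period_s(tid_bits: int, transfers_per_second: float) -> float:
    """Time until a transfer ID of the given width wraps around at a constant transfer rate."""
    return (1 << tid_bits) / transfers_per_second


if __name__ == "__main__":
    cases = [
        (5, 1e3, "5-bit (CAN), 1000 transfers/s"),
        (48, 1e5, "48-bit, 10^5 transfers/s"),
        (56, 14.9e6, "56-bit, 14.9 M frames/s (10 GbE)"),
        (64, 14.9e6, "64-bit, 14.9 M frames/s (10 GbE)"),
    ]
    for bits, rate, label in cases:
        period = overflow_period_s(bits, rate)
        print(f"{label}: {period:.3g} s, {period / SECONDS_PER_YEAR:.3g} years")
```

The 48-bit case gives about 89 years (the ~90 years above), the 56-bit case about 150 years even at the 10 GbE worst case, and a 64-bit field pushes the period into the tens of thousands of years.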

This change will greatly simplify transfer reception handling, increase the resilience of the protocol to interface failure, and solve the other problems listed above, at the expense of several bytes of overhead per frame.
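A sketch of how a never-overflowing transfer ID could simplify reception over redundant interfaces: the first copy of each transfer arriving on any interface is accepted, later copies and stale frames (e.g., flushed by a reconnected port) are rejected, and gaps reveal the number of lost transfers. It assumes a 64-bit transfer ID that increments by one per transfer within a session; the class and method names are hypothetical, multi-frame reassembly (which stays per-interface) is out of scope, and handling of a transmitter that restarts and resets its transfer ID (e.g., via a timeout) is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, Optional


@dataclass
class SessionState:
    last_transfer_id: Optional[int] = None  # nothing accepted yet
    lost: int = 0                           # transfers never received on any interface


@dataclass
class RedundantReceiver:
    """Hypothetical per-session deduplicator fed by all redundant interfaces at once."""
    sessions: Dict[Hashable, SessionState] = field(default_factory=dict)

    def accept(self, session: Hashable, transfer_id: int) -> bool:
        """True if the transfer is new; False for a duplicate delivered by another
        interface or for a stale transfer with an older transfer ID."""
        state = self.sessions.setdefault(session, SessionState())
        if state.last_transfer_id is not None:
            if transfer_id <= state.last_transfer_id:
                return False
            state.lost += transfer_id - state.last_transfer_id - 1
        state.last_transfer_id = transfer_id
        return True


if __name__ == "__main__":
    rx = RedundantReceiver()
    session = ("subject", 1234, 42)  # e.g., (kind, subject ID, source node ID)
    for iface, tid in [("A", 0), ("B", 0), ("B", 1), ("A", 1), ("A", 4), ("B", 2)]:
        print(iface, tid, rx.accept(session, tid))
    print("lost:", rx.sessions[session].lost)
```

With a 5-bit CAN transfer ID, the `<=` comparison would become meaningless after every wraparound, which is why the CAN transport has to receive from one interface at a time.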

Lastly, it should be noted that reassembly of multi-frame transfers can be done only on a per-interface level, meaning that frames belonging to the same transfer cannot be sourced from different interfaces. This is because the MTU is not guaranteed to be the same for all of the available redundant transports, especially if they are heterogeneous.

Compact data type identifier

(Aug 2019 edit: implemented in PyDSDL and PyUAVCAN with a slightly different structure under the revised name “data type hash”.)

Currently, a data type is unambiguously identified by its full name (e.g., uavcan.node.Heartbeat). We have eliminated numerical data type identifiers in v1.0 and introduced subjects and services instead. This introduced a minor issue: data type compatibility cannot be easily and robustly validated at runtime.

To work around this problem, I propose a new concept for use with transports more capable than CAN: a compact data type identifier, or CDTID for brevity. A CDTID is defined as a function of the data type name and version. The fact that it is a function of existing properties rather than an entirely new user-level entity is important, as it relieves the user of maintaining yet another numerical identifier (UAVCAN has plenty of them as it is).

Unlike the old data type signature used in UAVCAN v0, CDTID does not vary with the actual definition of the type. Instead, it depends purely on the name and version, leaving the compatibility-related matters to static analysis and DSDL processing tools.

Besides type safety, a CDTID can be used for filtering UAVCAN traffic by data type. In Ethernet networks, such filtering can be performed by COTS L2 switches and by network hardware on end nodes. As discussed earlier, many COTS L2 switches are known to support hardware traffic prioritization and filtering by matching packets against user-defined masks. These commonly available features could be employed to prevent irrelevant broadcast traffic (chosen by data type alongside other properties, e.g., subject ID) from propagating into certain ports where it is not needed, thereby decreasing the output port contention, latency, and jitter. This is conceptually similar to Virtual Link ID routing implemented in AFDX.
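Because the CDTID fields run from the most general (root namespace) to the most specific (major version), a single mask/value pair can select a whole namespace, a sub-namespace, one type, or one type version. A minimal sketch, assuming the bit layout specified later in this section and assuming the filter sees the CDTID as a 64-bit integer (where it sits in the frame is not defined here); the function names and sample CDTID values are hypothetical.

```python
from typing import Tuple

# Field masks for the CDTID layout (bit 0 = LSB)
ROOT_MASK = 0xFFFFFFFF_00000000     # bits 32..63: root namespace hash
SUBROOT_MASK = 0x00000000_FFF00000  # bits 20..31: sub-root namespace hash
TAIL_MASK = 0x00000000_000FFF00     # bits 8..19: remainder of the name
VERSION_MASK = 0x00000000_000000FF  # bits 0..7: major version

# Cumulative masks: each level refines the previous one
LEVEL_MASKS = {
    "root": ROOT_MASK,
    "subroot": ROOT_MASK | SUBROOT_MASK,
    "type": ROOT_MASK | SUBROOT_MASK | TAIL_MASK,
    "version": ROOT_MASK | SUBROOT_MASK | TAIL_MASK | VERSION_MASK,
}


def make_filter(reference_cdtid: int, level: str) -> Tuple[int, int]:
    """Return (value, mask) accepting every CDTID sharing reference_cdtid's prefix up to level."""
    mask = LEVEL_MASKS[level]
    return reference_cdtid & mask, mask


def matches(cdtid: int, value: int, mask: int) -> bool:
    return (cdtid & mask) == value


if __name__ == "__main__":
    some_type_v1 = 0x66666666_123_456_01  # made-up CDTID for illustration
    value, mask = make_filter(some_type_v1, "subroot")
    print(hex(value), hex(mask))
```

A switch or NIC that can match arbitrary payload bytes could apply the same value/mask pair, provided the CDTID sits at a fixed offset in the frame; whether a particular device supports a 64-bit mask at that offset is hardware-specific.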

Additionally, CDTID simplifies postmortem log analysis: since every frame carries its data type information, the data can be analyzed without any prior knowledge of the network configuration.

A CDTID is constructed as a 64-bit unsigned integer. The value has a particular structure to facilitate filtering and routing:

  • The 32 most significant bits are a CRC32C hash of the root namespace suffixed with the fixed salt svo0 (in ASCII: 115, 118, 111, 48). The salt is chosen empirically to produce a recognizable hexadecimal/binary pattern for the standard namespace uavcan (0x66666666).
  • The following 12 bits (i.e., 20…31 counted from LSB) contain the twelve least significant bits of a CRC32C hash of the sub-root namespace. For example, node for uavcan.node.Heartbeat, or primitive for uavcan.primitive.array.Integer8. If there is no sub-root namespace, the hash will be applied to an empty string, producing zero.
  • The following 12 bits (i.e., 8…19 counted from LSB) contain the twelve least significant bits of a CRC32C hash of the remaining part of the full data type name. For example, Heartbeat for uavcan.node.Heartbeat, or array.Integer8 for uavcan.primitive.array.Integer8.
  • The last 8 bits (the least significant byte) contain the major version number of the data type.

The following is a demo in Python provided for reference (based on PyDSDL):

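# t is a PyDSDL composite type; compute_crc32c() is assumed to return the CRC-32C of a bytes object.
ns_without_root = t.name_components[1:]
if len(ns_without_root) > 1:
    # E.g., uavcan.node.Heartbeat: sub-root "node", tail "Heartbeat".
    subroot_ns, name_tail = ns_without_root[0], '.'.join(ns_without_root[1:])
else:
    # E.g., uavcan.Test: no sub-root namespace, so its hash is CRC32C(b'') = 0.
    subroot_ns, name_tail = '', ns_without_root[0]
# Salted root namespace hash in bits 63..32, major version in bits 7..0.
cdtid = (compute_crc32c((t.root_namespace + 'svo0').encode()) << 32) | t.version.major
cdtid |= (compute_crc32c(subroot_ns.encode()) & 0xFFF) << 20  # bits 31..20
cdtid |= (compute_crc32c(name_tail.encode()) & 0xFFF) << 8    # bits 19..8
print(hex(cdtid))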
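The demo above depends on PyDSDL and on a compute_crc32c() helper that is not shown. Below is a dependency-free sketch of the same construction that takes the full type name as a plain string and the major version as an integer. The helper names crc32c() and cdtid() are hypothetical; crc32c() is a slow, bitwise implementation of CRC-32C (Castagnoli, reflected polynomial 0x82F63B78). Its output can be compared against the examples table.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78); slow but dependency-free."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF


def cdtid(full_name: str, major_version: int) -> int:
    """Compose a 64-bit CDTID from a full name such as 'uavcan.node.Heartbeat'."""
    root, *rest = full_name.split(".")
    if len(rest) > 1:
        subroot, tail = rest[0], ".".join(rest[1:])
    else:
        subroot, tail = "", rest[0]  # no sub-root namespace: CRC32C of b"" is 0
    value = crc32c((root + "svo0").encode()) << 32     # bits 63..32: salted root namespace hash
    value |= (crc32c(subroot.encode()) & 0xFFF) << 20  # bits 31..20: sub-root namespace hash
    value |= (crc32c(tail.encode()) & 0xFFF) << 8      # bits 19..8: remaining name hash
    value |= major_version & 0xFF                      # bits 7..0: major version
    return value


print(f"{cdtid('uavcan.node.Version', 1):016x}")
```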

Examples (the underscores separating the CDTID segments are added for clarity: (root namespace)_(sub-root namespace)_(tail)_(major version number)):

Full name                                      64-bit CDTID (hex)
uavcan.Test.255.1                              66666666_000_64b_ff
uavcan.internet.udp.OutgoingPacket.0.1         66666666_1b3_936_00
uavcan.internet.udp.HandleIncomingPacket.0.1   66666666_1b3_c2f_00
uavcan.node.Version.1.0                        66666666_3fa_c2a_01
uavcan.node.GetInfo.0.1                        66666666_3fa_2d8_00
uavcan.node.GetTransportStatistics.0.1         66666666_3fa_63b_00

The segmented nature of CDTID enables sophisticated hardware filtering not only by data type but also by type name alone (i.e., ignoring the version number, e.g., if the destination is assumed to support all versions), by root namespace, and by sub-root namespace (e.g., a modem node may wish to receive only uavcan.internet.* and some vendor.custom_telemetry.* from the whole network). As explained above, such filtering can be implemented by masking away the irrelevant segments of the CDTID. Again, additional filtering can also be performed by subject ID if necessary, similar to VLID routing in AFDX.
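The masking described above can be sketched as a mask-and-compare test of the kind a switch or NIC filter might apply. The mask constants and helper names are hypothetical; the CDTID values used in the usage example are taken from the examples table.

```python
from typing import Tuple

# Segment masks for a 64-bit CDTID, following the layout described above.
ROOT_MASK = 0xFFFFFFFF_00000000     # bits 63..32: root namespace hash
SUBROOT_MASK = 0x00000000_FFF00000  # bits 31..20: sub-root namespace hash
TAIL_MASK = 0x00000000_000FFF00     # bits 19..8: remaining name hash
VERSION_MASK = 0x00000000_000000FF  # bits 7..0: major version


def split_cdtid(cdtid: int) -> Tuple[int, int, int, int]:
    """Return (root hash, sub-root hash, tail hash, major version), e.g. for log analysis."""
    return cdtid >> 32, (cdtid >> 20) & 0xFFF, (cdtid >> 8) & 0xFFF, cdtid & 0xFF


def matches(cdtid: int, pattern: int, mask: int) -> bool:
    """Accept a frame if its CDTID equals the pattern in every bit selected by the mask."""
    return (cdtid & mask) == (pattern & mask)


# Accept everything under uavcan.internet.* regardless of type name and version.
internet = (0x666666661B300000, ROOT_MASK | SUBROOT_MASK)
print(matches(0x666666661B393600, *internet))  # uavcan.internet.udp.OutgoingPacket.0.1
print(matches(0x666666663FA2D800, *internet))  # uavcan.node.GetInfo.0.1
```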

The large space reserved for the root namespace hash is necessary to minimize the probability of collisions between different vendors or other namespace owners. There may be no easy way of ensuring that any two namespaces are collision-free unless there is some global repository of them (which is undesirable to have); hence the large hash. For reference, the collision probability for a perfect 32-bit hash depends on the total number of root namespaces as follows:

  • 10k namespaces — 1%
  • 20k namespaces — 5%
  • 30k namespaces — 10%

The remaining two hashes are made small because collisions within a namespace can be detected immediately and are therefore cheap to resolve manually. A 12-bit hash offers 4096 possible values, which limits both the number of sub-root namespaces and the number of entries per sub-root namespace. The collision probability, assuming a perfect hash, is as follows:

  • 10 items — 1%
  • 20 items — 5%
  • 30 items — 10%
  • 50 items — 25%
  • 75 items — 50%
  • 4k items — 100% (capacity limit)
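Both lists above follow from the birthday problem. A short sketch that reproduces the percentages, assuming a perfect (uniformly distributed) hash; the function name is illustrative:

```python
import math


def collision_probability(items: int, space: int) -> float:
    """Exact birthday-problem probability that at least two of `items` uniformly
    random hash values drawn from `space` possible values coincide."""
    if items > space:
        return 1.0  # pigeonhole principle: a collision is guaranteed
    log_no_collision = sum(math.log1p(-k / space) for k in range(items))
    return 1.0 - math.exp(log_no_collision)


for n in (10_000, 20_000, 30_000):
    print(f"32-bit hash, {n} root namespaces: {collision_probability(n, 2**32):.1%}")
for n in (10, 20, 30, 50, 75):
    print(f"12-bit hash, {n} items: {collision_probability(n, 2**12):.1%}")
```

The percentages in the lists are rounded; for instance, 50 items in a 12-bit space give roughly 26%.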

Considering the rapidly increasing probability of collision, having more than 75 sub-root namespaces per root namespace and more than 75 data types per sub-root namespace (which yields 75² = 5625 data types per root namespace) may be impractical without some form of manual control over the hash function (technically, it is always possible to find a set of 4096 names that produce distinct, non-conflicting hashes, but such names are likely to be meaningless or clumsy, defeating their purpose). It is possible to work around this by offering users optional DSDL directives that override the auto-computed hash values with manually provided ones, which would decouple the hash from the name and allow the user to pick both freely. This carries some serious disadvantages; for example, the hash would no longer be a function of the type name alone but also of its DSDL definition.

I would like to avoid detailed discussion and stop here because CDTID is not meant to be a finalized proposal; rather, it should be considered as an abstract idea of a compact representation of type information for safety and filtering purposes.

Time synchronization

As mentioned earlier, the currently defined time synchronization algorithm hinges on the assumption that the frame propagation latency throughout the whole bus is much less than a single bit period. This is true for CAN and similar physical bus topologies but does not hold for star or tree networks.

In the case of Ethernet-based networks, the problem of precise time synchronization is addressed well by IEEE 1588. Nearly every modern Ethernet-enabled microcontroller supports IEEE 1588 in hardware (all modern MCUs from NXP, STM, and Microchip seem to support it, according to my quick look-up), and the theoretical performance of this protocol exceeds that of the time synchronization algorithm currently defined in UAVCAN.

Other transports, however, may not have such well-defined and well-supported standard solutions. In these cases, the algorithm defined in UAVCAN can still be used if augmented with the Olson latency recovery algorithm. The resulting solution will be less accurate than the native CAN-based one or IEEE 1588 but it is likely to still be sufficient for most distributed control needs.

The core assumption of the Olson algorithm is that the message propagation medium adds an unknown and variable latency to the message, but that occasionally the medium will exhibit the minimal latency. The Olson algorithm can identify such low-latency packets and use them to establish synchronization with minimal clock skew. The algorithm is implemented entirely on the receiving side and requires no slave-to-master communication. Therefore, unlike IEEE 1588, it scales very well to large networks. The short-term attainable accuracy equals the best-case (minimal) packet propagation delay from the master to the slave (the long-term accuracy also depends on the drift rate of the slave’s local clock); the worst-case error is bounded by the worst-case propagation delay.
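The following is a simplified sketch of the receiver-side idea, not a faithful implementation of the published Olson algorithm. It assumes (hypothetically) that every sync message provides the master's transmission timestamp alongside the slave's local reception timestamp, and it accounts for clock drift only through an optional, hypothetical max_drift bound. Because each sample equals the true offset plus a non-negative latency, the smallest sample seen so far is the best estimate, and its residual error equals the smallest latency observed.

```python
class MinOffsetEstimator:
    """Receiver-side clock offset estimator in the spirit of the Olson algorithm (simplified)."""

    def __init__(self, max_drift: float = 0.0):
        self.max_drift = max_drift  # hypothetical bound on relative clock drift, in s/s
        self.offset = None          # current estimate of (local time - master time)
        self.last_local = None

    def update(self, master_tx: float, local_rx: float) -> None:
        measured = local_rx - master_tx  # = true offset + latency, latency >= 0
        if self.offset is None:
            self.offset = measured
        else:
            # Allow the previous estimate to grow by the worst-case drift so that a slave clock
            # running fast cannot leave the estimate stuck at an outdated minimum.
            allowed = self.offset + self.max_drift * (local_rx - self.last_local)
            self.offset = min(allowed, measured)
        self.last_local = local_rx

    def to_master_time(self, local: float) -> float:
        return local - self.offset
```

With a true offset of 1000 s and latencies of 50, 7, 30, 3, and 90 ms, the estimate settles at 1000.003 s, i.e., the residual error is the 3 ms best-case latency, and no message is sent back to the master.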

The described algorithm can be implemented without any modifications to the synchronization protocol; the changes will be limited to slave-side logic only.

Proposed transport-specific implementations

The two new transport protocols proposed for this evaluation are the standard UDP/IP stack (UDP being an OSI layer 4 protocol) and the simple wireless PAN protocol IEEE 802.15.4.

The UDP/IP stack is chosen primarily because of its native compatibility with the Internet protocol suite and Ethernet, which, as we established earlier, is finding widespread use in safety-critical vehicular systems. Another equally important reason is the widespread support for the Internet protocol suite across all sorts of commodity and industrial equipment and systems, and the huge variety of available physical layers (e.g., regular copper cables, high-speed fiber optics, wireless, power line communications, etc.). Given proper design, a UDP-based protocol can take advantage of the flexibility of its transport and thus become equally flexible itself. Unlike other L4 Internet Protocol Suite protocols (e.g., TCP), UDP is well suited for high-reliability real-time applications, provided that the underlying layers offer adequate guarantees (such as robust equipment, limited port contention on the switching hardware, bounded latency, etc.; refer to the earlier sections for the background).

While IEEE 802.15.4 may be unusable in some of the targeted applications due to its limited bandwidth (250 kbps), it is representative of low-level simple wireless network protocols, and thus works as a baseline for this exercise. One could also imagine a sensible subset of this protocol that would be usable in hard real-time environments (e.g., the standard supports deterministic TDMA out of the box), but this discussion would be out of place here. Such particulars of the transport lie way below the level of abstraction we’re currently dealing with.

UDP/IP

One UDP datagram represents one UAVCAN transport frame. The data specifier is encoded in the destination port number at the UDP level, which allows us to take advantage of the datagram processing capabilities of the standard UDP/IP stack: the UDP stack will deliver UAVCAN frames to the appropriate handlers based on the port number. The port number mapping will be as follows (the specified ranges are inclusive):

  • 16384…49151 — subject ID, offset by 16384.
  • 15872…16383 — service ID for request transfers, offset by 15872.
  • 15360…15871 — service ID for response transfers, offset by 15360.
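The mapping above can be expressed as a pair of encode/decode helpers. This is a sketch with hypothetical names; the ranges are exactly those listed, which imply 15-bit subject IDs (0…32767) and 9-bit service IDs (0…511).

```python
SUBJECT_BASE = 16384   # 16384…49151: subject ID 0…32767
REQUEST_BASE = 15872   # 15872…16383: service ID 0…511, request transfers
RESPONSE_BASE = 15360  # 15360…15871: service ID 0…511, response transfers


def port_for_subject(subject_id: int) -> int:
    if not 0 <= subject_id <= 32767:
        raise ValueError("subject ID must fit in 15 bits")
    return SUBJECT_BASE + subject_id


def port_for_service(service_id: int, is_request: bool) -> int:
    if not 0 <= service_id <= 511:
        raise ValueError("service ID must fit in 9 bits")
    return (REQUEST_BASE if is_request else RESPONSE_BASE) + service_id


def decode_port(port: int):
    """Return ('subject' | 'request' | 'response', ID), or None for a non-UAVCAN port."""
    if SUBJECT_BASE <= port <= 49151:
        return "subject", port - SUBJECT_BASE
    if REQUEST_BASE <= port < SUBJECT_BASE:
        return "request", port - REQUEST_BASE
    if RESPONSE_BASE <= port < REQUEST_BASE:
        return "response", port - RESPONSE_BASE
    return None
```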

The remaining values are free for other (non-UAVCAN-related) uses. In particular, the well-known ports (0…1023) and the ephemeral port range (49152…65535) are left untouched.

The port distribution can be visualized as follows, where each symbol represents 1024 ports: w = well-known ports, S = services, M = subjects, e = ephemeral ports, and - = free/unused:

w--------------SMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMeeeeeeeeeeeeeeee

The source port number is not used and can be arbitrary (ephemeral).

The node ID is encoded in the least significant bits of the IP address, which can be either IPv4 or IPv6. For example, 192.168.1.123 corresponds to node ID 379, since 1 × 256 + 123 = 379. Broadcast transfers will be sent to the local subnet broadcast address.
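A minimal sketch of this mapping using Python's standard ipaddress module. The node ID width is not fixed in this post, so NODE_ID_BITS = 16 below is a hypothetical choice; any width of at least 9 bits reproduces the 192.168.1.123 → 379 example.

```python
import ipaddress

NODE_ID_BITS = 16  # hypothetical: the node ID width for UDP is not specified here


def node_id_from_ip(address: str) -> int:
    """Extract the node ID from the least significant bits of an IPv4 or IPv6 address."""
    return int(ipaddress.ip_address(address)) & ((1 << NODE_ID_BITS) - 1)


def ip_from_node_id(network: str, node_id: int) -> str:
    """Place the node ID into the host part of the given subnet."""
    net = ipaddress.ip_network(network)
    if node_id >= net.num_addresses:
        raise ValueError("node ID does not fit into the host part of this subnet")
    return str(net.network_address + node_id)
```

The same subnet object also provides the destination for broadcast transfers via its broadcast_address attribute.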

The remaining information – transfer ID, CDTID, priority, and multi-frame segmentation metadata – is encoded in the header. There will be two header formats: one for single-frame transfers, and the other for multi-frame transfers. The latter is a superset of the former, adding the multi-frame transfer reconstruction metadata.

Below is the header format for single-frame transfers. Note that the header is 16 bytes long, which is important for ensuring proper data alignment (n.b. some implementations may choose to alias data structures directly onto the frame payload). No additional integrity check is added since Ethernet and UDP provide a sufficiently low probability of undetected errors.

The field marked Fl contains frame flags; it is located in the most significant byte of the transfer ID (leaving 56 bits for the actual transfer ID value). The flags are as follows, starting from the most significant bit:

  • 7…5 — priority, 8 levels.
  • 4…1 — reserved/unused.
  • 0 — multi-frame transfer indicator (zero for this header format).
                       ┌ hardware filtering block ┐
    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
--+--------------------+--+-----------------------+
 0|     Transfer ID    |Fl|         CDTID         |
--+--------------------+--+-----------------------+
16|                    Payload...                 |
--+-----------------------------------------------+
# DSDL notation:
uint56 transfer_id  # Monotonic, non-overflowing
uint8 flags  # bits 7..5 - priority, bit 0 - multiframe transfer
uint64 compact_data_type_id

The bytes 7 to 15, inclusive, contain information that can be leveraged by Ethernet switches and other network hardware to filter and prioritize packets. This information must be located near the beginning of the frame (which is the case here) because some hardware may be unable to inspect the payload deep inside the frame.
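A sketch of packing and parsing the 16-byte single-frame header with Python's struct module. Little-endian byte order is an assumption here; it is what places the flags byte at offset 7, as drawn in the diagram. The function names are hypothetical.

```python
import struct

# Two little-endian uint64 words: (flags << 56 | transfer ID) and the CDTID -> 16 bytes.
HEADER = struct.Struct("<QQ")


def pack_single_frame(transfer_id: int, priority: int, cdtid: int, payload: bytes) -> bytes:
    if not 0 <= transfer_id < (1 << 56):
        raise ValueError("transfer ID must fit in 56 bits")
    if not 0 <= priority <= 7:
        raise ValueError("priority must be in the range 0…7")
    flags = priority << 5  # bits 7..5: priority; bit 0 (multi-frame) left cleared
    return HEADER.pack((flags << 56) | transfer_id, cdtid) + payload


def unpack_single_frame(datagram: bytes) -> dict:
    word, cdtid = HEADER.unpack_from(datagram)
    flags = word >> 56
    return {
        "transfer_id": word & ((1 << 56) - 1),
        "priority": flags >> 5,
        "multi_frame": bool(flags & 1),
        "cdtid": cdtid,
        "payload": datagram[HEADER.size:],
    }
```

Because the flags byte and the CDTID occupy bytes 7 to 15, a filter only needs to inspect a fixed 9-byte window at the start of the UDP payload.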

Many UDP-based networks will be able to avoid relying on multi-frame transfers thanks to the large payload capacity of UDP. Most modern network devices support jumbo frames of up to 9 KiB. The trade-off is that large frames increase the latency and jitter of high-priority transfers, since a high-priority frame may have to wait until a large frame that is already being transmitted has left the port.
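To put a number on this trade-off, the following sketch computes how long a frame occupies a link (its serialization delay), which is roughly how long a high-priority frame may have to wait behind it at each hop without frame preemption. Preamble and inter-frame gap are ignored for simplicity.

```python
def serialization_delay_us(frame_bytes: int, link_bps: float) -> float:
    """Time a frame occupies the wire, in microseconds (preamble and inter-frame gap ignored)."""
    return frame_bytes * 8 / link_bps * 1e6


for size in (1500, 9 * 1024):
    for rate in (100e6, 1e9):
        print(f"{size:5d} B @ {rate / 1e6:5.0f} Mbit/s: {serialization_delay_us(size, rate):7.1f} us")
```

A 9 KiB frame occupies a 100 Mbit/s link for about 737 µs, versus 120 µs for a standard 1500-byte frame.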

If a multi-frame transfer is needed, the appropriate flag will be set, in which case the header will be constructed as follows:

                       ┌ hardware filtering block ┐
    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
--+--------------------+--+-----------------------+-----------+
 0|     Transfer ID    |Fl|         CDTID         |Fr.idx. EOT|
--+--------------------+--+-----------------------+-----------+
20|                           Payload...                      |
--+-----------------------------------------------------------+
# DSDL notation:
uint56 transfer_id  # Monotonic, non-overflowing
uint8 flags  # bits 7..5 - priority, bit 0 - multiframe transfer
uint64 compact_data_type_id
uint32 frame_index_eot  # MSB set in the last frame, cleared otherwise

The last field of the header is the frame index within the current transfer. We are using the frame index instead of a toggle bit to make the protocol resilient to UDP frame reordering (although it cannot occur in a well-constructed static network, this makes the protocol compatible with non-deterministic networks as well). The end of the transfer is indicated by setting the most significant bit of the frame index.

Notice that the payload is only 4-byte aligned here, which is suboptimal but acceptable, since a multi-frame transfer payload cannot be aliased directly anyway.

A CRC32C of the transfer payload is appended to it, as in CAN, except that the CRC function is stronger because larger data blocks are involved.
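The following sketch splits a transfer into multi-frame UDP datagrams and reassembles it in any arrival order, relying on the frame index rather than a toggle bit. It assumes little-endian byte order, a caller-chosen chunk size, and that the little-endian CRC32C is appended to the whole transfer payload before it is cut into frames; the function names are hypothetical.

```python
import struct
from typing import List, Optional

# Multi-frame header: (flags << 56 | transfer ID), CDTID, frame index with EOT in the MSB.
HEADER = struct.Struct("<QQI")  # 8 + 8 + 4 = 20 bytes
EOT = 1 << 31                   # end-of-transfer bit of the 32-bit frame index field


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF


def split_transfer(transfer_id: int, priority: int, cdtid: int,
                   payload: bytes, chunk_size: int) -> List[bytes]:
    """Append the transfer CRC, cut the result into chunks, and prefix each with a header."""
    data = payload + struct.pack("<I", crc32c(payload))
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    word = ((priority << 5 | 1) << 56) | transfer_id  # flags bit 0 set: multi-frame transfer
    frames = []
    for index, chunk in enumerate(chunks):
        index_eot = index | (EOT if index == len(chunks) - 1 else 0)
        frames.append(HEADER.pack(word, cdtid, index_eot) + chunk)
    return frames


def reassemble(frames: List[bytes]) -> Optional[bytes]:
    """Rebuild one transfer from its frames in any order; None if incomplete or corrupted."""
    chunks, last = {}, None
    for frame in frames:
        _, _, index_eot = HEADER.unpack_from(frame)
        index = index_eot & (EOT - 1)
        chunks[index] = frame[HEADER.size:]
        if index_eot & EOT:
            last = index
    if last is None or any(i not in chunks for i in range(last + 1)):
        return None  # the last frame or an intermediate frame is missing
    data = b"".join(chunks[i] for i in range(last + 1))
    if len(data) < 4:
        return None
    payload, (crc,) = data[:-4], struct.unpack("<I", data[-4:])
    return payload if crc32c(payload) == crc else None
```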

IEEE 802.15.4

In the case of this simple wireless protocol, all of the transfer metadata except for the destination node ID has to be contained in the header preceding the transfer payload. The standard defines its own 16-bit node ID (the short address), which can be mapped directly to the UAVCAN node ID as long as the valid range of UAVCAN node ID values is not exceeded.

The source node ID needs to be attached to every frame because it is expected that wireless networks will use transport-layer encryption. Per the IEEE 802.15.4 standard, encrypted frames do not contain the short 16-bit address of the origin, replacing it with the long 64-bit MAC address, which can’t be easily mapped to the short address (i.e., node ID). Hence, the source node ID is always reported in the header. If encryption is not used and the source node ID is available in the transport frame metadata, receivers should ignore it anyway in order to avoid ambiguities.

It is assumed that multi-frame transfers will be common because the payload capacity of a single IEEE 802.15.4 frame may be as low as 95 bytes. Additionally, a shorter header cannot be defined without sacrificing data alignment. Hence, only one frame format is defined:

    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
--+--------------------+--+-----------+-----+-----+
 0|    Transfer ID     |Pr|Fr.idx. EOT|S.NID|DtSpc|
--+--------------------+--+-----------+-----+-----+
16|                    Payload...                 |
--+-----------------------------------------------+
# DSDL notation:
uint56 transfer_id      # Monotonic, non-overflowing
uint8 priority          # Only the three most significant bits used
uint32 frame_index_eot  # MSB set in the last frame, cleared otherwise
uint16 source_node_id
uint16 data_specifier

The rules for handling the frame index are the same as for UDP: the most significant bit is set in the last frame of the transfer. For single-frame transfers, therefore, the value of this 32-bit field will always be 2147483648 (0x80000000).

The data specifier values are arranged as follows (the specified ranges are inclusive):

  • 0…32767 — subject ID.
  • 32768…33279 — service ID for request transfers, offset by 32768.
  • 33280…33791 — service ID for response transfers, offset by 33280.
  • 33792…65535 — unused/reserved.
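A sketch of the data specifier mapping and of packing the 16-byte IEEE 802.15.4 header with Python's struct module. Little-endian byte order is assumed, and the priority is placed in the three most significant bits of its byte, as stated in the DSDL comment; the function names are hypothetical.

```python
import struct

# (priority byte << 56 | transfer ID), frame index with EOT, source node ID, data specifier.
HEADER = struct.Struct("<QIHH")  # 8 + 4 + 2 + 2 = 16 bytes
EOT = 1 << 31                    # MSB of the 32-bit frame index field


def data_specifier(kind: str, port_id: int) -> int:
    """Map a subject ID or a service ID (request/response) onto the 16-bit data specifier."""
    if kind == "subject" and 0 <= port_id <= 32767:
        return port_id
    if kind == "request" and 0 <= port_id <= 511:
        return 32768 + port_id
    if kind == "response" and 0 <= port_id <= 511:
        return 33280 + port_id
    raise ValueError(f"invalid {kind} ID {port_id}")


def pack_header(transfer_id: int, priority: int, frame_index: int, last: bool,
                source_node_id: int, spec: int) -> bytes:
    if not 0 <= transfer_id < (1 << 56):
        raise ValueError("transfer ID must fit in 56 bits")
    # Priority occupies bits 7..5 of the priority byte, i.e. bits 63..61 of the first word.
    word = (priority & 0b111) << 61 | transfer_id
    return HEADER.pack(word, frame_index | (EOT if last else 0), source_node_id, spec)
```

For a single-frame transfer, frame_index is 0 and last is True, so the frame index field carries only the end-of-transfer bit.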

As with UDP, a CRC32C of the transfer payload is appended to it.

It is debatable whether such bandwidth-limited media should also carry CDTID together with the data. On one hand, it makes sense to trade off type safety for bandwidth and latency, especially considering that wireless protocols cannot easily use CDTID for routing or filtering. On the other hand, wireless environments may be viewed as prone to misconfiguration due to the shared nature of the medium (although the shared-medium argument is easily negated by encrypting each network’s communication, thereby reducing the likelihood of conflicts caused by misconfiguration).

Conclusion

It has been demonstrated that UAVCAN can be used with transport protocols other than CAN, which is a necessary step toward meeting the needs of current and future applications. The current specification draft does not limit the protocol’s compatibility with other transports, provided that the subject ID range is reduced by one bit.

There are no plans to announce support for any transport protocol other than CAN 2.0 and CAN FD in the first release of the specification. There is at least one relevant project where we may be able to employ a UDP/Ethernet-based extension of UAVCAN that is not part of the specification and assess its performance; after that, this discussion will probably be resurrected.

Interested parties and early adopters are welcome to share feedback.


(Pavel Kirienko) #2

Paging @scottdixon and @kjetilkjeka.


(Scott Dixon) #3

gah! This is a lot. I am reading it though. Will discuss tomorrow.


(Scott Dixon) #4

My high-level feedback is that there is merit to the ideas here and that this proposal is well researched and reasoned.

I think it’s really important to summarize that this only requires two changes to the current v1 draft: a 1-bit reduction in the subject identifier and a change to the time synchronization function. Is this correct? If so we should pursue these two changes to v1 and label all other research in this area as “v-next”.

Part of my trepidation here is resourcing. We need to dig in to the reference implementations for v1, and this direction, while exciting and relevant, is ultimately a very deep hole that would distract us from delivering an iteration in a timely manner.


(Pavel Kirienko) #5

Only one change is needed: 1-bit reduction in the subject ID. The time synchronization related changes apply only to non-physical-bus networks, so they will not affect CAN-based networks, and therefore they can be introduced later without affecting compatibility with existing CAN deployments.


(Pavel Kirienko) #6

(Ahmed Khalaf) #7

I think this is going in the right direction in general.
However, I find it quite outdated to talk about different transports: “new intravehicular communication protocol is to never see widespread adoption unless it is designed to be portable across different transports”

Intravehicular compute/communication infrastructure is pushed to modernize and converge with data-center technology by high bandwidth, scalability, re-configurability and connectivity demands.
Applying concepts such as quality-of-service and containerization/SoA will make it difficult to sustain a bus like CAN or FlexRay in the long term.

Looking into ROS2, DDS and alliances like CCIX, GEN-Z and even CXL, there are key game changers already in play.
The whole compute model is going further away from “message passing” and “Transports” to data-centric inter-connects.


(Pavel Kirienko) #8

Hi Ahmed,

Thanks for the feedback. I am not sure I quite understand the point about transports, could you elaborate perhaps? It seems like we’re speaking different languages here which is exciting because it usually implies that we approach the problem from very different perspectives.

A transport is something that is always at the foundation of any communication protocol (including, say, the link between your brain and your fingers). No matter what kind of data you exchange and how you model it, you need a transport to get it from point A to point B. Hence, as long as we’re stuck with passing data in any form (a concept that is unlikely to go away anytime soon since it seems to be pretty fundamental for our universe), we will use transports. As we are interested in keeping the protocol generic and repurposable, we will need to support different transports. Would you agree?

Reliable delivery and quality of service are generally irrelevant for UAVCAN, as you (and everyone interested) will soon learn from our write-up which is currently undergoing some minor edits. The write-up will be published here on this forum when finished; briefly, QoS is outside of the scope of UAVCAN since it is of low relevance for real-time vehicular networks we’re targeting.


(Scott Dixon) #9

I’d also add that, while DDS is proven and powerful and while ROS2 promises to become a very compelling technology for many robotics systems, there are drawbacks to these technologies when it comes to determinism and efficiency. The expanded vision of UAVCAN Pavel puts forward here provides an interesting balance between abstraction and efficiency that may be appropriate for certain systems where DDS is considered too unwieldy or where the excellent tooling and rich ecosystem of ROS isn’t quite as important. I’m specifically interested in what a “medium-level” protocol that can still be fully determined statically and is capable of hard-realtime interactions would be like. I’m imagining that UAVCAN, and a well-designed set of frameworks and tools, could be optimal for things like satellites and small robotic systems where the compute is distributed and limited. I’m also interested in defining a common gateway between higher-level data interchange networks like DDS and UAVCAN sub-systems where UAVCAN can provide an appropriate abstraction for complex sub-systems that integrate with a vehicle through a single interface contract. But that is just one way of looking at this evolution. Others might argue that we should think more about DDS over UAVCAN to focus on the latter as a true transport protocol instead of a micro-application layer protocol. The biggest difference between the two approaches would be where simulation is inserted into a system. If a simulated system always omits UAVCAN then the DDS-over-UAVCAN approach is appropriate. If UAVCAN can also be used by simulated systems then my DDS-to-UAVCAN bridge becomes appropriate.

Note that this is not a well-researched or carefully considered post; it’s just some thoughts I decided to scrawl out while on a bus (like a physical bus…with wheels…and humans). Take it for what it’s worth.