Alternative transport protocols in UAVCAN

This post is the result of my limited research into alternative transports conducted over the last two weeks. It is assumed that the reader is familiar with the latest draft of the specification v1.0.

(Aug 2019 edit: the ideas outlined here are being implemented either as-is or in a refined form. The current Specification draft may have diverged from this post which is not being actively updated. In the case of any contradiction, the information provided in the Specification draft and related materials takes precedence over this post. The main testing grounds for the outlined ideas is PyUAVCAN; please find the latest MVP of the proposed approaches described in the PyUAVCAN documentation.)

Motivation

In the very early days of UAVCAN, an automotive electronics engineer, who worked for Mentor Graphics at the time, helped me with some of the major design decisions. He was of the opinion (paraphrasing) that a new intravehicular communication protocol would never see widespread adoption unless it is designed to be portable across different transports. I generally agree with this sentiment, although back then, I mostly ignored it in order to stay focused on a particular, narrow application domain so that I could work out and validate the main design decisions and ideas. The protocol, as it was back then before the first (version zero) specification was even released, was very different from what we have today. In particular, the model of communication and the transport layer design were strongly tied to the CAN bus which is no longer the case in the version one draft that we have worked out so far. As such, it is important to evaluate the current draft specification from the standpoint of cross-transport portability so as to ensure that no features may harm such flexibility in the first stable release.

It would be a mistake to think that the purpose of the protocol itself is defined by the CAN bus or any other particular transport. Rather, it is intended to be a purely application-layer system. We have attempted to reflect this in the current draft spec by separating the core logic from the CAN bus transport definition. The name “UAVCAN” is therefore confusing. The reason it is used is purely historical and has no semantic weight (which is the reason why the specs, both v0 and v1, do not provide an explanation for the acronym). I suggest that with the addition of a new transport layer to the specification (which is likely to take place in UAVCAN v1.x, where x > 0), the document will specify the following meaning for the name UAVCAN: Uncomplicated Application-level Vehicular Communication And Networking.

I have attempted to identify the current trends in relevant industries from my own experience in related projects and literature in order to define the objectives and requirements, which I have then used to construct two new transport protocols for UAVCAN.

Background

For the purposes of this evaluation, the following applications of interest have been chosen (in order of preference): light manned electric aircraft and heavy unmanned drones (which share the same technological base, according to an industry expert who shall remain unnamed in this public post), medium and light drones (which are generally well-served by CAN), micro-satellites, other types of aircraft, and autonomous driving systems.

The demands of modern onboard intelligence systems tend to exceed the capabilities of legacy intravehicular communication standards developed specifically for vehicular applications. Due to these unmet demands, the purely vehicular solutions (such as CAN bus, FlexRay, ARINC 429, MIL-STD-1553, etc) are being replaced by (or augmented with) alternatives built upon higher-performance conventional technologies originally developed for consumer or industrial applications. Notable examples of such technology transfer are the AFDX bus for modern avionics (ARINC 664; widely used in modern airliners and spacecraft) and the new AUTOSAR 17.03 extensions for high-performance automotive onboard networks; both of the standards are built on top of standard Ethernet (with some adjustments and limitations) with support for the conventional IP stack.

The effects of this transition can be seen in the market: high-reliability vehicular Ethernet switches are freely available from numerous suppliers off-the-shelf. Here is one such example, and here is another. Also, AFDX vendors recommend Catalyst 2900 (a regular COTS L2 switch) for testing and evaluation.

Despite being based on the same underlying technology (commodity Ethernet, copper cabling or fiber optics), standards tend to introduce various design trade-offs in an attempt to better suit their target applications.

For example, the AUTOSAR extension mentioned earlier is designed to be compatible with common POSIX computing platforms with very minor restrictions. Its software design specs permit dynamic reconfiguration, arbitrary multi-threading (including dynamic thread spawning and termination), and even purely dynamic memory allocation. Overall, there is some movement from hard determinism towards more flexible designs. This could probably be explained by the rapid increase in the complexity of onboard software, but that is a separate topic. The SOME/IP protocol is very stateful and relies on fundamentally non-deterministic technologies such as TCP/IP, reminiscent of the Internet technology stack. There is no explicit support for high-reliability features such as interface redundancy.

AFDX, on the other hand, is leaning in the opposite direction, providing native support for redundant transports and strict timing and delivery guarantees. It can be viewed as a drop-in transport replacement for (nearly) obsolete ARINC 429, replacing large point-to-point wiring harnesses with a single compact switched network. One of its interesting properties is that the routing and bandwidth allocation information is configured statically in the switches, which alongside with its unusual treatment of MAC and IP addresses constitutes a significant departure from conventional Ethernet deployments. Higher-level layers are, however, built upon conventional UDP; the choice of UDP seems pretty straightforward due to its simplicity and time determinism (unlike TCP), the lack of guaranteed delivery here is irrelevant because a robust transport is already provided by the underlying L2/L3 network.

There is also a variety of industrial communication standards which are designed to provide higher availability or predictability but for different reasons, they are suboptimal or unfit for use in hi-rel vehicular applications (“The Evolution of Avionics Networks From ARINC 429 to AFDX” 5.2.1). Their failure modes are of particular concern. Industrial applications tend to be fail-safe rather than fail-operational so their design objectives are quite different.

In the scope of hard real-time deterministic applications, the big problem of Ethernet and switched networks in general is the difficulty predicting the worst case packet propagation latency. As reviewed in detail in “The Evolution of Avionics Networks From ARINC 429 to AFDX” and “Communications for Integrated Modular Avionics”, star network topologies have inherent contention points at the output ports of the network switch hardware, since this is where the traffic originating asynchronously from different sources has to be serialized and pushed out to the destination (or the next hop) sequentially (we’re talking about full-duplex links here; half-duplex is unsuitable due to its nondeterministic collision resolution policy). For well-behaved network hardware, it can be proved that given a limited network load, the packet propagation latency is always bounded. This is one of the cornerstone principles of AFDX; the ability of the network to meet the real-time requirements hinges on the under-utilization of its bandwidth. According to the information supplied by the above-mentioned expert from Mentor Graphics, there are parallels to be made with the early days of CAN, when automotive engineers neglected to take advantage of the built-in CAN ID arbitration features, and resorted to limiting the maximum bandwidth utilization in order to keep the data propagation latencies predictable (one can still find traces of such archaic thinking in some older documents which recommend to never exceed 50%-70% of the total CAN bus bandwidth).

There are efforts to somewhat alleviate the negative effects of output port contention in Ethernet networks to improve the performance of real-time applications. AFDX, in particular, prioritizes its data paths (“virtual links”, I won’t go into detail here; this can be viewed as a tunneling feature for the legacy ARINC 429) according to statically pre-configured routing settings in the switches (special AFDX switches, commodity network hardware may not be directly applicable without sacrificing latency).

Regular COTS networking hardware supports VLAN QoS and configurable classes of service. In fact, modern COTS L2 switches offer hardware support for traffic policing and prioritization based on arbitrary user-configurable rules; for example, it is technically possible to prioritize or drop L2 frames based on the value of some arbitrary bit field in a packet (see Juniper filters and classifiers; also, Cisco FlexMatch is usable for CoS assignment). While we are talking about networking hardware, it should also be mentioned (although it is probably a well-known fact anyway) that even the most basic single-chip Ethernet controllers support rather sophisticated hardware traffic filtering policies; e.g., the ubiquitous Microchip ENC28J60 supports packet filtering based on simple payload pattern matching (section 8.2). At the risk of getting ahead of myself, I should say that these advanced features can be further exploited to create a powerful and flexible real-time network architecture using only COTS hardware, of which we will talk later.

I perceive a slow shift from strict determinism and rigorous models towards more flexible, less deterministic systems with somewhat relaxed requirements on time predictability. If my perception is correct, the change could be explained by a steady increase in the complexity of the onboard intelligence (both in hardware and software) and by a slow relocation of responsibilities from human operators to automated systems. For automotive systems, one could see the precursors in the addition of Ethernet networks alongside the strictly deterministic CAN/FlexRay and in the new software development standards which permit dynamic threads and dynamic memory allocation. For avionics, the addition of Ethernet alongside very robust signaling links like ARINC 429 and the new software virtualization features outlined in ARINC 653 could be interpreted as pointing in the same direction.

The above overview was focused exclusively on wired and optical networks. At first glance, this seems an obvious choice given the current set of available technologies; however, there is one new and very interesting undertaking to consider: Wireless Avionics Intra-Communications (WAIC). Networks that rely on physical rigging such as cables have common failure modes (e.g., a cable may be torn, wires may be affected by EMI) which do not affect wireless links. The failure modes of the latter are drastically different, which may theoretically permit one to construct a very robust network by effectively exploiting the dissimilarity between the failure modes of wired and wireless links.

Lastly, I would like to mention two small independent undertakings to develop a brokerless message bus over UDP (a logical bus over a physical star topology) (both sources are in Russian; Github links are available in the linked articles): MQTT/UDP – a brokerless reimplementation of MQTT using the UDP broadcast transport; Mutalk – a very compact pub/sub protocol, also based on UDP broadcast. The projects are targeting completely different applications; ones where a high degree of determinism is not required. However, they are worth mentioning because of their similar principles of communication.

Objectives

Looking at my recent experiences with a certain related application and considering the above assessment, I would like to perform a porting exercise to ensure that the concepts and principles that go into the first stable release of the specification will not prevent us from efficiently supporting new transports in the future. The propositions made below are very far from being spec-ready or even production-ready; rather, they should be considered as a set of basic ideas that we can build upon in the future. Some of them may be implemented in software as experimental extensions of the protocol.

Since I have mentioned the software, I should clarify that although the protocol supports different transports, this does not mean that every implementation is required to do the same. Obviously, some implementations will be focused on one particular type of transport (e.g., libuavcan is built for CAN), while others may support many types of transports concurrently (e.g., pyuavcan would be trivial to extend for any transport).

So, the high-level objective of this exercise is to make UAVCAN usable over a more capable wired transport than CAN. The requirements are:

  • The maximum supported throughput is at least 1 Gbps, preferably up to 10 Gbps (for future extensibility).
  • Latency is bounded and predictable; for a typical deployment, it should be in the range of hundreds of microseconds per frame.
  • The transport scales to large deployments: at least 1000 nodes per network and at least 1 km of total wiring length.
  • The transport should support efficient broadcasting since this is the primary method of data exchange for UAVCAN. There must be facilities for efficient traffic filtering both in the end nodes and in the auxiliary network equipment (e.g., packet switches).

Wireless transports should also be considered, either as standalone transports or as members of a heterogeneous redundant group. The initial set of requirements stemmed from the properties of our target vehicular applications and included:

  • Same latency requirements.
  • Support for direct broadcasting between nodes.
  • Operating range up to 100 meters.

UAVCAN is designed as a logical bus (where “logical” means the high-level communication model and not the physical network topology; for example, CAN is a physical bus, a gigabit Ethernet network is a physical star/tree whereas low-speed Ethernet can be either). This choice of logical topology has some significant advantages; it should therefore not be altered. The challenges that it creates for the transport layer shall be managed by the transport layer itself, not at the expense of the upper layers of the stack. For example, CAN bus offers hardware acceptance filtering for subscription opt-in and service transfer addressing; similar mechanisms must be available in the new transports. While this topic is too complex to cover here extensively, the logical bus topology can be considered superior in its simplicity and flexibility as the data sources and consumers can be completely logically decoupled from each other. By contrast, topologies based on explicit routing (e.g., SpaceWire) require a logical binding between agents, whereas subscription-based topologies (e.g., SOME/IP, DDS/RTPS) tend to be very stateful and thus fragile. In order to avoid traffic duplication, the transport must support efficient multicasting natively. Again, this is not supposed to be an exhaustive overview of network architectures; such discussion is beyond the scope of this article.

CAN bus guarantees that the data propagation latency is equal across the whole network, meaning that every station receives a transport frame at the same time. This property is leveraged by UAVCAN for the precise time synchronization feature only. Therefore, provided that an alternative method of time synchronization is available, the new transport is relieved from guaranteeing uniform propagation latency.

As different applications may favor different transports, it is expected that different subsystems within one vehicle may choose to employ different transports while still needing to exchange data with each other. This pattern can be observed in common avionic systems, where, for example, CAN-based ARINC-825 subnets (e.g., wingtip avionics) may interconnect with the backbone AFDX network via gateway nodes. This use case should be well-supported.

The new transports must support heterogeneous redundant configurations to enable dissimilar transport redundancy. The types of involved transports and their properties should be hidden from the application.

The new transports should minimize restrictions or special requirements for the lower layers. For example, redundant Ethernet deployments in avionics require that a disconnected port shall continue transmitting data, despite the expectation that the data will never be delivered to the other end of the link; this is done to prevent stale data from backing up in the transmit queue because if a connection is restored, the stale data would be released on to the network and possibly disrupt the operation of the system.

Transport-agnostic model refinement

One might easily be led to believe that the current draft specification is closely tied to the CAN bus and is therefore not portable. To demonstrate that this is not the case, we need to update the communication model definition to make it transport-agnostic.

The following diagram introduces several new terms.

“Specifier” is a collection of identifiers that together define a category of entities. Specifiers are auxiliary ephemeral constructs which are needed only for completeness of the model and for reasoning about the protocol; implementations need not be involved with them.

“Route specifier” is either of:

  • a pair of source node ID and destination node ID;
  • a source node ID only; in this case, it is implied that the destination is the whole network (i.e., broadcast).

“Data specifier” is either of:

  • subject ID;
  • service ID and a selector indicating whether the transfer is a service request or a service response.

A data specifier describes what data structure is contained in the transfer and what it means (i.e., how it should be interpreted).

“Session specifier” contains a data specifier and a route specifier. Its purpose is to uniquely identify not only the meaning of data but also the agents participating in its exchange. The term may remind one of the layer 5 of the standard OSI model, but such mapping may not be entirely correct.
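The definitions above can be sketched as plain data structures. The following minimal illustration is my own; the class and field names are not drawn from the Specification:

```python
import dataclasses
import typing

@dataclasses.dataclass(frozen=True)
class MessageDataSpecifier:
    subject_id: int                 # What the data is and how to interpret it.

@dataclasses.dataclass(frozen=True)
class ServiceDataSpecifier:
    service_id: int
    is_request: bool                # Request vs. response selector.

DataSpecifier = typing.Union[MessageDataSpecifier, ServiceDataSpecifier]

@dataclasses.dataclass(frozen=True)
class RouteSpecifier:
    source_node_id: int
    destination_node_id: typing.Optional[int]   # None implies broadcast.

@dataclasses.dataclass(frozen=True)
class SessionSpecifier:
    data_specifier: DataSpecifier
    route_specifier: RouteSpecifier

# A broadcast transfer on subject 1234 published by node 42:
s = SessionSpecifier(MessageDataSpecifier(1234), RouteSpecifier(42, None))
```

Note that the transfer ID is deliberately absent from the session specifier, which is consistent with the treatment of CAN below.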

It is well known that one diagram is worth 1024 words:

(Diagram: the communication link model.)

Applying this model to CAN, one will see that the CAN frame identifier contains the session specifier and the transfer priority. The transfer ID is moved to the CAN payload since it is not a member of the session specifier and therefore it is useless for routing and filtering. The fact that the entirety of the session specifier is managed by the same feature of the protocol is the manifestation of the fact that the CAN bus covers several far-separated ISO/OSI layers.

The proposed model can be easily applied to various transport protocol stacks that implement stricter adherence to the ISO/OSI model. This is demonstrated in the next section. The model could also be applied to less well-layered transports, such as, for example, FlexRay, but it may be much more difficult.

Protocol extensions and modifications

Before we define the new transport-specific implementations, a few other questions must be resolved first.

Subject ID range problem

(The proposal has been accepted and implemented in Specification v1.0. At the time of writing, the subject-ID range was [0, 65535])

A well-layered transport like UDP or IEEE 802.15.4 will take care of datagram delivery between hosts without any help from the higher layers (which is unlike CAN). Therefore, the route specifier will not be used above the transport layer.

The data specifier, which is intended to communicate how the data contained in the transport datagram should be handled, needs to be carried in the transport frame next to the payload. As we are dealing with relatively high-level transports, reliance on bit-level data field segmentation (like in CAN ID) or integers of non-standard bit width (not 8, 16, 32, or 64) may be impractical. Therefore, the wire representation of a data specifier should fit into a standard-size integral value.

Per the definition provided earlier, a data specifier consists of:

  • A kind selector, i.e., message or service (2 cases, 1 bit).
  • If kind selector is “message”:
    • Subject ID (65536 cases, 16 bits)
  • If kind selector is “service”:
    • Service ID (512 cases, 9 bits)
    • Request/response selector (2 cases, 1 bit)

If the kind selector is set to “service”, the required number of bits is 1 + 9 + 1 = 11 bits, which fits into a standard-size 16-bit integer field, leaving 5 bits reserved for future needs.

If the kind selector is set to “message”, the required number of bits is 1 + 16 = 17 bits. The next standard-size integer field is 32 bits wide, which would leave 15 bits unused. A more practical solution is to reduce the number of subject ID cases to 32768, thereby freeing up one bit.

If the above change is implemented, the data specifier will be able to fit into a standard-size 16-bit integer field. The exact mapping can be defined for each transport layer individually; for example, the lower 32768 values can be used to represent the subject ID directly (since this is the most commonly used form of communication), the next 512 values can be reserved for service ID requests, and then 512 values for service responses. The unused 31744 values will be reserved for future use.
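The example mapping above can be written out as a pair of conversion functions. The following is a sketch under the layout just described; the function names are mine:

```python
# Example mapping of the 16-bit data specifier field:
#   [0, 32767]     -- subject ID (message transfers);
#   [32768, 33279] -- service ID, request transfers;
#   [33280, 33791] -- service ID, response transfers;
#   [33792, 65535] -- reserved for future use.
SUBJECT_COUNT = 32768
SERVICE_COUNT = 512

def encode_data_specifier(subject_id=None, service_id=None, is_request=True):
    """Pack a data specifier into a 16-bit unsigned integer."""
    if subject_id is not None:
        assert 0 <= subject_id < SUBJECT_COUNT
        return subject_id
    assert 0 <= service_id < SERVICE_COUNT
    return SUBJECT_COUNT + (0 if is_request else SERVICE_COUNT) + service_id

def decode_data_specifier(value):
    """Return ('message', subject_id) or ('service', service_id, is_request)."""
    assert 0 <= value < 2 ** 16
    if value < SUBJECT_COUNT:
        return 'message', value
    if value < SUBJECT_COUNT + 2 * SERVICE_COUNT:
        service_id = (value - SUBJECT_COUNT) % SERVICE_COUNT
        is_request = value < SUBJECT_COUNT + SERVICE_COUNT
        return 'service', service_id, is_request
    raise ValueError('Value is reserved: %r' % value)
```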

Due to the reduced range, the subject ID range segmentation should be altered as follows:

| From | To | Capacity | Purpose |
|---|---|---|---|
| 0 | 24575 | 24576 | Unregulated identifiers |
| 28672 | 29695 | 1024 | Non-standard (vendor-specific) regulated identifiers |
| 31744 | 32767 | 1024 | Standard regulated identifiers |

Node ID range problem

As discussed earlier, the limit of 128 nodes per network is unacceptable for larger deployments (which will become available with new transports); therefore, the limit needs to be increased for non-CAN based transports.

Using existing networks as a reference, particularly the upcoming WAIC standard discussed earlier, one will see that there exist sensible vehicular applications requiring thousands of (simple) nodes per logical network.

Following the principle of adherence to standard-size integers, the next appropriate threshold for the node ID size is 16 bits, or 65536 nodes per network. This value maps well onto some existing protocols, such as IEEE 802.15.4 (where valid node addresses range from 0 to 65534, inclusive, with 65535 reserved for broadcasting), the last hextet of an IPv6 address, or the last two octets of a class-B IPv4 address.

However, considering the limited set of available subject IDs (24576 unregulated values; regulated identifiers are excluded since they are fixed for all nodes), the set of practical usage scenarios where a network might sensibly utilize more than 24576 nodes is limited.

The specifics of highly deterministic nodes need to be considered as well; for example, nodes that perform mission-critical and/or hard real-time tasks that require a highly predictable behavior. Upon careful evaluation of the UAVCAN stack, one can generally see that the amount of resources (time and/or memory, with a possibility of trade-off) necessary for deterministic handling of a given transport frame is a function of the highest possible (worst case) number of nodes in the network (among other things, possibly, depending on the implementation). I will skip a detailed analysis here but the main reason for this dependency is that the protocol requires receiving nodes to maintain individual state per transmitting node.

One might argue that this is a design issue of the protocol but so far no better solutions that meet the core design goals have been found so the question of optimal design should not be raised in this discussion.

Additionally, some of the suitable transport protocols may require the transport layer itself (i.e., in addition to the higher layers of the stack) to keep some state per node. For example, IP-based transports must allocate space for ARP tables. Complex deterministic nodes that are expected to initiate unicast (service) transfers (n.b.: most resource-constrained end nodes will not need ARP by virtue of only needing broadcast transfers) or protocol bridge nodes will have to allocate copious amounts of memory for static ARP tables. To demonstrate the extent of the problem, if we were to use a 16-bit node ID and limited the worst-case (maximum) node capacity to 65536 nodes per network, the worst-case size of the highly deterministic ARP table might be as high as 384 KiB (6 bytes per MAC address × 2^16 nodes / 1024 bytes per KiB) per redundant interface (assuming one MAC address per interface). Although, as was said earlier, this consideration does not apply to most nodes (especially simple ones) because they can limit themselves to broadcasting only, requiring unicast transfers only for responding to service requests, in which case a single-entry ARP cache (6 bytes) will suffice.
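The sizing argument above is simple arithmetic; a quick sanity check using the figures from the text:

```python
MAC_ADDRESS_SIZE_BYTES = 6  # One table entry per node; one MAC address per interface.

def static_arp_table_size_kib(node_id_bits):
    """Worst-case static ARP table size per redundant interface, in KiB."""
    return MAC_ADDRESS_SIZE_BYTES * (2 ** node_id_bits) / 1024

print(static_arp_table_size_kib(16))  # 384.0 KiB for a 16-bit node ID space
print(static_arp_table_size_kib(12))  # 24.0 KiB for a 12-bit node ID space (4096 nodes)
```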

Certainly, one could design a well-behaved deterministic node without O(1) containers but the dependency on the network node capacity would remain. Only its manifestation would change. Besides the memory footprint, it would also affect its frame processing time, although this would still be bounded. The point of the above was to demonstrate that the maximum network capacity must be sized properly to find the optimum satisfying all requirements:

  • The maximum number of supported nodes per network should be sufficient for any sensible application.
  • Deterministic nodes should not be burdened unnecessarily.

Given the limit of ~25k subject IDs and the determinism considerations, we could arbitrarily draw the line at 2^12 (4096) nodes per network, with a possibility of future extension.

There will remain an odd duality between different types of transports: some of them will be limited to 128 nodes per (sub-) network max (CAN 2.0, CAN FD), others will be able to reach the logical limit defined above.

Transfer ID range problem

The limited dynamic range of the transfer ID, and the resulting very short overflow period, is a serious limitation of the CAN transport. This problem affects nodes with redundant interfaces, requiring them to receive transfers from only one of the available redundant interfaces. Simultaneous reception from multiple interfaces is not possible because of the very short wraparound period of transfer ID (every 32 transfers). This is unfortunate because:

  • Simultaneous reception through all of the available interfaces (like in AFDX) reduces median latency and jitter (although it generally cannot improve the worst case). The CAN transport cannot benefit from these advantages.
  • In the case of an interface failure, the receiving node may lose some of the incoming data before switching over to one of the redundant interfaces. If simultaneous reception is used, failure of an interface does not affect the operation of the node as long as at least one of the available interfaces continues to function.
  • Intermittent failures of all of the available interfaces (e.g., due to a faulty connection or another common-mode connectivity failure) may render the node unable to receive data from the bus due to the switch-over delay.
  • Transmitting nodes must handle disconnected interfaces (this includes intentionally disconnected interfaces and those experiencing failures) in a special way, ensuring that their transmission queues do not contain stale transport frames. Otherwise, if the connection is restored, the obsolete frames will be transmitted on to the network, possibly disrupting application-level processes on the nodes that consume the published data. This is because the receiving nodes are unable to reliably compare the age of data due to frequently overflowing transfer ID values.
  • It is not possible to reliably determine the number of lost/undelivered transfers (unless the application layer is involved) because of the overflowing nature of the transfer ID. This makes certain use cases, such as re-requesting lost or missing data, more complex than they could be.

Unfortunately, due to the limited capabilities of CAN, trade-offs had to be made. Carrying the same trade-offs to more capable transports would be unwise. Hence, the dynamic range of the transfer ID should be increased.

There exist certain failure modes, such as, for example, the case of the temporarily disconnected interface, where an overflowing transfer ID is likely to cause problems regardless of how large the overflow period is. At the same time, the payload of highly capable transports is relatively cheap to accommodate a sufficiently large transfer ID to ensure that it will never overflow in a sensible scenario.

It is therefore proposed to equip all transports that are more capable than CAN with a very wide transfer ID parameter. For transports with the maximum throughput under 10^5 transfers per second, the transfer ID field should be at least 48 bits wide (overflow period at the specified transfer exchange rate: ~90 years). For higher-throughput transports, the transfer ID field should be at least 56 bits wide. The nearest standard-size integer field is 64 bits wide. As a theoretical worst case reference (unattainable in practice), a COTS 10 GbE adapter is capable of handling up to 14.9 million frames per second.
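The overflow-period figures quoted above are easy to reproduce:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def transfer_id_overflow_period_years(transfer_id_bits, transfers_per_second):
    """Time until a monotonically increasing transfer ID wraps around."""
    return 2 ** transfer_id_bits / transfers_per_second / SECONDS_PER_YEAR

# 48 bits at 10^5 transfers per second: ~89 years.
print(round(transfer_id_overflow_period_years(48, 1e5)))
# 56 bits at the theoretical 10 GbE worst case of 14.9 million frames per second:
print(round(transfer_id_overflow_period_years(56, 14.9e6)))
```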

This change will greatly simplify transfer reception handling, increase the resilience of the protocol to interface failure, and solve the other problems listed above, at the expense of several bytes of overhead per frame.

Lastly, it should be noted that reassembly of multi-frame transfers can be done only on a per-interface level, meaning that frames belonging to the same transfer cannot be sourced from different interfaces. This is because the MTU is not guaranteed to be the same for all of the available redundant transports, especially if they are heterogeneous.

Compact data type identifier

(Aug 2019 edit: implemented in PyDSDL and PyUAVCAN with a slightly different structure under the revised name “data type hash”.)

(Dec 2019 edit: the below proposal has been identified to be incompatible with data type extensibility measures; pending rework.)

Currently, a data type is unambiguously identified by its full name (e.g., uavcan.node.Heartbeat). We have eliminated numerical identifiers in v1.0 and introduced subjects and services instead. This resulted in a minor issue that data type compatibility cannot be easily and robustly validated at runtime.

To work around this problem, I propose a new concept for use with more capable transports than CAN: a compact data type identifier, or CDTID for brevity. A CDTID is defined as a function of the data type name and version. The fact that it is a function of an existing property rather than an entirely new user-level entity is important, as it relieves the user from maintaining yet another numerical identifier (UAVCAN has plenty of them as it is).

Unlike the old data type signature used in UAVCAN v0, CDTID does not vary with the actual definition of the type. Instead, it depends purely on the name and version, leaving the compatibility-related matters to static analysis and DSDL processing tools.

Besides type safety, a CDTID can be used for filtering UAVCAN traffic by data type. In Ethernet networks, such filtering can be performed by COTS L2 switches and by network hardware on end nodes. As discussed earlier, many COTS L2 switches are known to support hardware traffic prioritization and filtering by matching packets against user-defined masks. These commonly available features could be employed to prevent irrelevant broadcast traffic (chosen by data type alongside other properties, e.g., subject ID) from propagating into certain ports where it is not needed, thereby decreasing the output port contention, latency, and jitter. This is conceptually similar to Virtual Link ID routing implemented in AFDX.

Additionally, CDTID simplifies postmortem log analysis: since every frame carries its data type information, the data can be analyzed without any prior knowledge of the network configuration.

A CDTID is constructed as a 64-bit unsigned integer. The value has a particular structure to facilitate filtering and routing:

  • The 32 most significant bits are a CRC32C hash of the root namespace suffixed with the fixed salt svo0 (in ASCII: 115, 118, 111, 48). The salt is chosen empirically to produce a recognizable hexadecimal/binary pattern for the standard namespace uavcan (0x66666666).
  • The following 12 bits (i.e., 20…31 counted from LSB) contain the twelve least significant bits of a CRC32C hash of the sub-root namespace. For example, node for uavcan.node.Heartbeat, or primitive for uavcan.primitive.array.Integer8. If there is no sub-root namespace, the hash will be applied to an empty string, producing zero.
  • The following 12 bits (i.e., 8…19 counted from LSB) contain the twelve least significant bits of a CRC32C hash of the remaining part of the full data type name. For example, Heartbeat for uavcan.node.Heartbeat, or array.Integer8 for uavcan.primitive.array.Integer8.
  • The last 8 bits (the least significant byte) contain the major version number of the data type.

The following is a demo in Python provided for reference (based on PyDSDL; t is a PyDSDL type object and compute_crc32c implements CRC-32C):

ns_without_root = t.name_components[1:]
if len(ns_without_root) > 1:
    subroot_ns, name_tail = ns_without_root[0], '.'.join(ns_without_root[1:])
else:
    subroot_ns, name_tail = '', ns_without_root[0]
cdtid = (compute_crc32c((t.root_namespace + 'svo0').encode()) << 32) | t.version.major
cdtid |= (compute_crc32c(subroot_ns.encode()) & 0xFFF) << 20
cdtid |= (compute_crc32c(name_tail.encode()) & 0xFFF) << 8
print(hex(cdtid))

Examples (the underscores separating the CDTID segments are added for clarity: (root namespace)_(sub-root namespace)_(tail)_(major version number)):

Full name 64-bit CDTID as hex
uavcan.Test.255.1 66666666_000_64b_ff
uavcan.internet.udp.OutgoingPacket.0.1 66666666_1b3_936_00
uavcan.internet.udp.HandleIncomingPacket.0.1 66666666_1b3_c2f_00
uavcan.node.Version.1.0 66666666_3fa_c2a_01
uavcan.node.GetInfo.0.1 66666666_3fa_2d8_00
uavcan.node.GetTransportStatistics.0.1 66666666_3fa_63b_00
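For experimentation outside of PyDSDL, the construction can be sketched in self-contained pure Python. The bitwise CRC-32C routine and the helper names below are illustrative; the salt and the segment layout follow the description above:

```python
def crc32c(data: bytes) -> int:
    # Bitwise CRC-32C (Castagnoli); reflected polynomial 0x82F63B78.
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def compute_cdtid(full_name: str, major_version: int) -> int:
    # Split "root.subroot.tail" per the segmentation rules described above.
    root, *rest = full_name.split('.')
    if len(rest) > 1:
        subroot, tail = rest[0], '.'.join(rest[1:])
    else:
        subroot, tail = '', rest[0]  # no sub-root namespace: empty string hashes to zero
    cdtid = crc32c((root + 'svo0').encode()) << 32     # 32-bit root namespace hash
    cdtid |= (crc32c(subroot.encode()) & 0xFFF) << 20  # 12-bit sub-root namespace hash
    cdtid |= (crc32c(tail.encode()) & 0xFFF) << 8      # 12-bit name tail hash
    cdtid |= major_version & 0xFF                      # 8-bit major version
    return cdtid
```

Note that hashing an empty sub-root namespace yields zero with this CRC construction (the initial and final XOR values cancel out), consistent with the rule stated above.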

The segmented nature of CDTID enables sophisticated hardware filtering not only by data type but also by its name (i.e., ignoring the version number; such as if an assumption was made that the destination supports all versions), the root namespace, and the sub-root namespace (e.g., a modem node may wish to receive only uavcan.internet.* and some vendor.custom_telemetry.* from the whole network). As explained above, such filtering can be implemented by masking away irrelevant segments of the CDTID. Again, additional filtering can be also performed by subject ID, if necessary, similar to VLID routing in AFDX.
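Using the example CDTIDs from the table above, segment-wise filtering reduces to a mask-and-compare operation. The mask constants are derived directly from the segment layout; the helper name is illustrative:

```python
MASK_ROOT    = 0xFFFFFFFF00000000  # bits 63..32: root namespace hash
MASK_SUBROOT = 0x00000000FFF00000  # bits 31..20: sub-root namespace hash
MASK_TAIL    = 0x00000000000FFF00  # bits 19..8:  name tail hash
MASK_VERSION = 0x00000000000000FF  # bits 7..0:   major version number

def matches(cdtid: int, reference: int, mask: int) -> bool:
    # A frame passes the filter if the selected segments are equal.
    return (cdtid & mask) == (reference & mask)

# Example CDTIDs taken from the table above:
VERSION_1_0  = 0x666666663FAC2A01  # uavcan.node.Version.1.0
GET_INFO_0_1 = 0x666666663FA2D800  # uavcan.node.GetInfo.0.1
OUTGOING_0_1 = 0x666666661B393600  # uavcan.internet.udp.OutgoingPacket.0.1

# Accept anything under uavcan.node, regardless of type or version:
node_filter = MASK_ROOT | MASK_SUBROOT
```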

The large space reserved for the root namespace hash is necessary to minimize the probability of collisions between different vendors or other namespace owners. There may be no easy way of ensuring that any two namespaces are collision-free unless there is some global repository of them (which is undesirable to have); hence the large hash. For reference, the collision probability for a perfect 32-bit hash is dependent on the total number of root namespaces as follows:

  • 10k namespaces — 1%
  • 20k namespaces — 5%
  • 30k namespaces — 10%

The remaining two hashes are made small because the conflicts within a namespace can be detected immediately and therefore are cheap to resolve manually. A 12-bit hash offers 4096 possible values, thereby limiting the total number of sub-root namespaces and the number of sub-root namespace entries. The collision probability assessment looks as follows, assuming perfect hash:

  • 10 items — 1%
  • 20 items — 5%
  • 30 items — 10%
  • 50 items — 25%
  • 75 items — 50%
  • 4k items — 100% (capacity limit)
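The figures quoted above follow from the standard birthday approximation P ≈ 1 − exp(−n(n−1)/(2m)), where n is the number of items and m is the size of the hash space:

```python
import math

def collision_probability(n_items: int, hash_space: int) -> float:
    # Birthday approximation: P ~ 1 - exp(-n(n-1) / (2m)).
    return 1.0 - math.exp(-n_items * (n_items - 1) / (2.0 * hash_space))

# Root namespaces under a 32-bit hash, and items under a 12-bit hash:
print(collision_probability(30_000, 2**32))  # roughly 10%
print(collision_probability(75, 2**12))      # roughly 50%
```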

Considering the rapidly increasing probability of collision, having more than 75 sub-root namespaces per root namespace and more than 75 data types per sub-root namespace (which yields: 752 = 5625 data types per root namespace) may be impractical without some form of manual control over the hash function (technically, it is always possible to find a set of 4096 names that will produce distinct non-conflicting hashes, but such names are likely to be meaningless or clumsy, defeating their purpose). It is possible to work around this by offering users some optional DSDL directives overriding the auto-computed hash values with manually provided values, which would decouple the hash from the name, allowing the user to pick both freely. This carries some serious disadvantages, such as the hash is no longer a function of mere type name, but also of its DSDL definition.

I would like to avoid detailed discussion and stop here because CDTID is not meant to be a finalized proposal; rather, it should be considered as an abstract idea of a compact representation of type information for safety and filtering purposes.

Time synchronization

As mentioned earlier, the currently defined time synchronization algorithm hinges on the assumption that the frame propagation latency throughout the whole bus is much less than a single bit period. This is true for CAN and similar physical bus topologies but does not hold for star or tree networks.

In the case of Ethernet-based networks, the problem of precise time synchronization is addressed well by IEEE 1588. Nearly every modern Ethernet-enabled microcontroller supports IEEE 1588 in hardware (all modern MCUs from NXP, STM, and Microchip seem to support it, according to my quick look-up), and the theoretical performance of this protocol exceeds that of UAVCAN.

Other transports, however, may not have such well-defined and well-supported standard solutions. In these cases, the algorithm defined in UAVCAN can still be used if augmented with the Olson latency recovery algorithm. The resulting solution will be less accurate than either the native CAN-based one or IEEE 1588, but it is likely to still be sufficient for most distributed control needs.

The core assumption of the Olson algorithm is that the message propagation medium adds an unknown and variable latency to the message, but it is assumed that occasionally the medium will exhibit the minimal latency. The Olson algorithm can identify such low-latency packets and use them to establish synchronization with minimal clock skew. The algorithm is implemented entirely on the receiving side and requires no slave-to-master communication. Therefore, unlike IEEE 1588, it scales very well for large networks. The short-term attainable accuracy equals the best-case (minimal) packet propagation delay from the master to the slave (the long-term accuracy is also dependent on the drift rate of the slave’s local clock); the worst case error is bounded by the worst case propagation delay.

The described algorithm can be implemented without any modifications to the synchronization protocol; the changes will be limited to slave-side logic only.
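A heavily simplified slave-side sketch of the minimum-latency sample selection idea follows. All names are illustrative, and a practical implementation would also need to age the estimate over time to track the drift of the local clock:

```python
class OlsonSlaveClock:
    """Tracks the offset between the local clock and the master clock.

    Each received sync message yields an offset sample contaminated by a
    non-negative, variable propagation latency. The smallest sample seen so
    far corresponds to the lowest-latency message and is therefore the best
    available estimate of the true clock offset.
    """
    def __init__(self) -> None:
        self._offset = None  # local_time - master_time, latency included

    def update(self, master_timestamp: float, local_receipt_time: float) -> None:
        sample = local_receipt_time - master_timestamp
        if self._offset is None or sample < self._offset:
            self._offset = sample  # lower latency -> better estimate

    def to_master_time(self, local_time: float) -> float:
        assert self._offset is not None, "no sync messages received yet"
        return local_time - self._offset
```

Notice that the logic is purely receive-side, matching the claim above that no slave-to-master communication is required.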

Proposed transport-specific implementations

The two new transport protocols proposed for this evaluation are the standard OSI layer 4 UDP/IP stack and a simple wireless PAN protocol, IEEE 802.15.4.

The UDP/IP stack is chosen primarily because of its native compatibility with the Internet protocol suite and Ethernet, which, as we established earlier, is finding widespread use in safety-critical vehicular systems. Another equally important reason is the widespread support of the Internet protocol suite by all sorts of commodity and industrial equipment and the huge variety of available physical layers (e.g., regular copper cables, high-speed fiber optics, wireless, power line communications, etc.). With proper design, a UDP-based protocol can take advantage of the flexibility of its transport and thus become equally flexible itself. Unlike other L4 Internet protocol suite protocols (e.g., TCP), UDP is well suited for high-reliability real-time applications, provided that the underlying layers offer adequate guarantees (such as robust equipment, limited port contention on the switching hardware, bounded latency, etc.; refer to the earlier sections for the background).

While IEEE 802.15.4 may be unusable in some of the targeted applications due to its limited bandwidth (250 kbps), it is representative of low-level simple wireless network protocols, and thus works as a baseline for this exercise. One could also imagine a sensible subset of this protocol that would be usable in hard real-time environments (e.g., the standard supports deterministic TDMA out of the box), but this discussion would be out of place here. Such particulars of the transport lie way below the level of abstraction we’re currently dealing with.

UDP/IP

One UDP datagram represents one UAVCAN transport frame. The data specifier is encoded in the destination port number at the UDP level, which allows us to take advantage of the datagram processing capabilities of the standard UDP/IP stack: the UDP stack will deliver UAVCAN frames to the appropriate handlers based on the port number. The port number mapping will be as follows (the specified ranges are inclusive):

  • 16384…49151 — subject ID, offset by 16384.
  • 15872…16383 — service ID for request transfers, offset by 15872.
  • 15360…15871 — service ID for response transfers, offset by 15360.

The remaining values are free for other uses (non-UAVCAN-related). Particularly:

  • 0…1023 — free for the well-known UDP ports.
  • 49152…65535 — free for ephemeral ports.

The port distribution can be visualized as follows, where w - well-known ports, S - services, M - subjects, e - ephemeral ports, - - free/unused; 1024 ports per symbol:

w--------------SMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMeeeeeeeeeeeeeeee

The source port number is not used and can be arbitrary (ephemeral).
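The port mapping above can be captured in a few lines; the helper names are illustrative, while the ranges are taken verbatim from the list:

```python
SUBJECT_PORT_BASE          = 16384  # 16384..49151 -> subject ID 0..32767
SERVICE_REQUEST_PORT_BASE  = 15872  # 15872..16383 -> service ID 0..511 (request)
SERVICE_RESPONSE_PORT_BASE = 15360  # 15360..15871 -> service ID 0..511 (response)

def port_for_subject(subject_id: int) -> int:
    assert 0 <= subject_id <= 32767
    return SUBJECT_PORT_BASE + subject_id

def parse_port(port: int):
    # Returns (kind, identifier) for UAVCAN ports, or None for non-UAVCAN ports.
    if SUBJECT_PORT_BASE <= port <= 49151:
        return 'subject', port - SUBJECT_PORT_BASE
    if SERVICE_REQUEST_PORT_BASE <= port <= 16383:
        return 'request', port - SERVICE_REQUEST_PORT_BASE
    if SERVICE_RESPONSE_PORT_BASE <= port <= 15871:
        return 'response', port - SERVICE_RESPONSE_PORT_BASE
    return None  # well-known, ephemeral, or otherwise free port
```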

The node ID is encoded in the least significant bits of the IP address, which can be either IPv4 or IPv6. For example, 192.168.1.123 corresponds to node ID 379. Broadcast transfers will be sent to the local subnet broadcast address.
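The exact number of node ID bits is not fixed above; the example (192.168.1.123 corresponding to node ID 379) implies nine least significant bits, so that width is assumed in this sketch:

```python
import ipaddress

def node_id_from_ip(address: str, node_id_bits: int = 9) -> int:
    # Extract the node ID from the least significant bits of the IP address.
    # The 9-bit default is an assumption inferred from the example above.
    return int(ipaddress.ip_address(address)) & ((1 << node_id_bits) - 1)
```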

The remaining information – transfer ID, CDTID, priority, and multi-frame segmentation metadata – is encoded in the header. There will be two header formats: one for single-frame transfers, and the other for multi-frame transfers. The latter is a superset of the former, adding the multi-frame transfer reconstruction metadata.

Below is the header format for single-frame transfers. Note that the header is 16 bytes large, which is important for ensuring proper data alignment (n.b. some implementations may choose to alias data structures directly onto the frame payload). No additional integrity check is added since Ethernet and UDP provide a sufficiently low probability of undetected errors.

The field marked Fl contains frame flags; it is located in the most significant byte of the transfer ID (leaving 56 bits for the actual transfer ID value). The flags are as follows, starting from the most significant bit:

  • 7…5 — priority, 8 levels.
  • 4…1 — reserved/unused.
  • 0 — multi-frame transfer indicator (zero for this header format).
                       ┌ hardware filtering block ┐
    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
--+--------------------+--+-----------------------+
 0|     Transfer ID    |Fl|         CDTID         |
--+--------------------+--+-----------------------+
16|                    Payload...                 |
--+-----------------------------------------------+
# DSDL notation:
uint56 transfer_id  # Monotonic, non-overflowing
uint8 flags  # bits 7..5 - priority, bit 0 - multiframe transfer
uint64 compact_data_type_id

The bytes 7 to 15, inclusive, contain information that can be leveraged by Ethernet switches and other network hardware to filter and prioritize packets. This information must be located near the beginning of the frame (which is the case here) because some hardware may be unable to inspect the payload deep inside the frame.
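A possible serialization of this header is sketched below. Little-endian byte order is an assumption of this sketch (the text above does not fix the byte order), and the function name is illustrative:

```python
def pack_single_frame_header(transfer_id: int, priority: int, cdtid: int) -> bytes:
    assert transfer_id < 2**56 and 0 <= priority <= 7
    flags = (priority & 0x7) << 5               # bit 0 (multi-frame) left cleared
    return (transfer_id.to_bytes(7, 'little')   # bytes 0..6: transfer ID
            + bytes([flags])                    # byte 7: flags
            + cdtid.to_bytes(8, 'little'))      # bytes 8..15: CDTID

header = pack_single_frame_header(transfer_id=1, priority=4, cdtid=0x666666663FAC2A01)
assert len(header) == 16  # matches the 16-byte alignment requirement
```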

Many UDP-based networks will be able to avoid reliance on multi-frame transfers due to the large payload carrying capability of UDP. Most modern network devices support jumbo frames up to 9 KiB large. The trade-off of using large frames is that they have adverse effects on jitter and latency of high-priority transfers.

If a multi-frame transfer is needed, the appropriate flag will be set, in which case the header will be constructed as follows:

                       ┌ hardware filtering block ┐
    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
--+--------------------+--+-----------------------+-----------+
 0|     Transfer ID    |Fl|         CDTID         |Fr.idx. EOT|
--+--------------------+--+-----------------------+-----------+
20|                           Payload...                      |
--+-----------------------------------------------------------+
# DSDL notation:
uint56 transfer_id  # Monotonic, non-overflowing
uint8 flags  # bits 7..5 - priority, bit 0 - multiframe transfer
uint64 compact_data_type_id
uint32 frame_index_eot  # MSB set in the last frame, cleared otherwise

The last field of the header is the frame index within the current transfer. We use the frame index instead of a toggle bit to make the protocol resilient to UDP frame reordering (although reordering cannot occur in a well-constructed static network, this choice makes the protocol compatible with non-deterministic networks as well). The end of the transfer is indicated by setting the most significant bit of the frame index.

Notice that the data is aligned at 4 bytes here, which is suboptimal, but acceptable since a multi-frame transfer payload cannot be aliased directly anyway.

The payload is appended with CRC32C of itself, which is similar to CAN except that the CRC function is stronger due to larger data blocks involved.
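The multi-frame emission logic can be sketched as follows. The CRC-32C helper is a bitwise reference implementation, and all names are illustrative:

```python
def crc32c(data: bytes) -> int:
    # Bitwise CRC-32C (Castagnoli); reflected polynomial 0x82F63B78.
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

END_OF_TRANSFER = 0x80000000  # MSB of the 32-bit frame index field

def split_transfer(payload: bytes, mtu: int):
    # Append the transfer CRC, then cut into frames of at most `mtu` bytes.
    # Returns a list of (frame_index_eot, frame_payload) pairs.
    data = payload + crc32c(payload).to_bytes(4, 'little')
    chunks = [data[i:i + mtu] for i in range(0, len(data), mtu)]
    last = len(chunks) - 1
    return [(i | (END_OF_TRANSFER if i == last else 0), c)
            for i, c in enumerate(chunks)]
```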

IEEE 802.15.4

In the case of this simple wireless protocol, all of the transfer metadata, except for the destination node ID, has to be contained in the header before the transfer payload. The standard defines its own 16-bit node ID, which can be mapped directly to the UAVCAN node ID, provided that the valid range of UAVCAN node ID values is not exceeded.

The source node ID needs to be attached to every frame because it is expected that wireless networks will use transport-layer encryption. Per the IEEE 802.15.4 standard, encrypted frames do not contain the short 16-bit address of the origin, replacing it with the long 64-bit MAC address, which can’t be easily mapped to the short address (i.e., node ID). Hence, the source node ID is always reported in the header. If encryption is not used and the source node ID is available in the transport frame metadata, receivers should ignore it anyway in order to avoid ambiguities.

It is assumed that multi-frame transfers will be common because the payload capacity of a single IEEE 802.15.4 frame may be as low as 95 bytes. Additionally, a shorter header cannot be defined without sacrificing data alignment. Hence, there is only one frame format defined:

    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
--+--------------------+--+-----------+-----+-----+
 0|    Transfer ID     |Pr|Fr.idx. EOT|S.NID|DtSpc|
--+--------------------+--+-----------+-----+-----+
16|                    Payload...                 |
--+-----------------------------------------------+
# DSDL notation:
uint56 transfer_id      # Monotonic, non-overflowing
uint8 priority          # Only the three most significant bits used
uint32 frame_index_eot  # MSB set in the last frame, cleared otherwise
uint16 source_node_id
uint16 data_specifier

The rules for handling the frame index are the same as for UDP: the most significant bit will be set in the last frame of the transfer. For single-frame transfers, therefore, the value of this 32-bit field will always be 2147483648 (0x80000000).

The data specifier values are arranged as follows (the specified ranges are inclusive):

  • 0…32767 — subject ID.
  • 32768…33279 — service ID for request transfers, offset by 32768.
  • 33280…33791 — service ID for response transfers, offset by 33280.
  • 33792…65535 — unused/reserved.
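In code, this encoding is symmetric to the UDP port mapping; the function name is illustrative, while the ranges are taken verbatim from the list above:

```python
def encode_data_specifier(kind: str, identifier: int) -> int:
    # Map a (kind, identifier) pair onto the 16-bit data specifier field.
    if kind == 'subject':
        assert 0 <= identifier <= 32767
        return identifier
    if kind == 'request':
        assert 0 <= identifier <= 511
        return 32768 + identifier
    if kind == 'response':
        assert 0 <= identifier <= 511
        return 33280 + identifier
    raise ValueError(f'unknown kind: {kind}')
```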

The payload is appended with CRC32C of itself, the same as UDP.

It is debatable whether such bandwidth-limited mediums should also carry the CDTID together with the data. On one hand, it makes sense to trade off type safety for bandwidth and latency, especially considering that wireless protocols cannot easily use the CDTID for routing or filtering. On the other hand, wireless environments may be viewed as prone to misconfiguration due to the shared nature of the medium (although the shared-medium argument is easily negated by encrypting each network’s communication, thereby reducing the likelihood of conflicts caused by misconfiguration).

Conclusion

It has been demonstrated that UAVCAN can be leveraged with other transport protocols besides CAN, which is a necessary step toward meeting the needs of current and future applications. The current specification draft does not limit the protocol’s compatibility with other transports, provided that the range of subject ID is reduced.

There are no plans to announce support for any transport protocol other than CAN 2.0 and CAN FD in the first release of the specification. There exists at least one relevant project where we may be able to employ a non-specified UDP/Ethernet-based extension of UAVCAN and assess its performance; after that, this discussion will probably be resurrected.

Interested parties and early adopters are welcome to share feedback.


Paging @scottdixon and @kjetilkjeka.

gah! This is a lot. I am reading it though. Will discuss tomorrow.

My high-level feedback is that there is merit to the ideas here and that this proposal is well researched and reasoned.

I think it’s really important to summarize that this only requires two changes to the current v1 draft: a 1-bit reduction in the subject identifier and a change to the time synchronization function. Is this correct? If so we should pursue these two changes to v1 and label all other research in this area as “v-next”.

Part of my trepidation here is resourcing. We need to dig-in to the reference implementations for v1 and this direction, while exciting and relevant; is ultimately a very deep hole that would distract us from delivering an iteration in a timely manner.

Only one change is needed: 1-bit reduction in the subject ID. The time synchronization related changes apply only to non-physical-bus networks, so they will not affect CAN-based networks, and therefore they can be introduced later without affecting compatibility with existing CAN deployments.

I think this is going in the right direction in general.
However, I find it quite out-dated to talk about different transports: “a new intravehicular communication protocol would never see widespread adoption unless it is designed to be portable across different transports”.

Intravehicular compute/communication infrastructure is being pushed to modernize and converge with data-center technology by high bandwidth, scalability, re-configurability, and connectivity demands.
Applying concepts such as quality-of-service and containerization/SoA will make it difficult to sustain a bus like CAN or FlexRay in the long term.

Looking into ROS2, DDS and alliances like CCIX, GEN-Z and even CXL, there are key game changers already in play.
The whole compute model is going further away from “message passing” and “Transports” to data-centric inter-connects.

Hi Ahmed,

Thanks for the feedback. I am not sure I quite understand the point about transports, could you elaborate perhaps? It seems like we’re speaking different languages here which is exciting because it usually implies that we approach the problem from very different perspectives.

A transport is something that is always at the foundation of any communication protocol (including, say, the link between your brain and your fingers). No matter what kind of data you exchange and how you model it, you need a transport to get it from point A to point B. Hence, as long as we’re stuck with passing data in any form (a concept that is unlikely to go away anytime soon since it seems to be pretty fundamental for our universe), we will use transports. As we are interested to keep the protocol generic and repurposable, we will need to support different transports. Would you agree?

Reliable delivery and quality of service are generally irrelevant for UAVCAN, as you (and everyone interested) will soon learn from our write-up which is currently undergoing some minor edits. The write-up will be published here on this forum when finished; briefly, QoS is outside of the scope of UAVCAN since it is of low relevance for real-time vehicular networks we’re targeting.

I’d also add that, while DDS is proven and powerful and while ROS2 promises to become a very compelling technology for many robotics systems, there are drawbacks to these technologies when it comes to determinism and efficiency. The expanded vision of UAVCAN Pavel puts forward here provides an interesting balance between abstraction and efficiency that may be appropriate for certain systems where DDS is considered too unwieldy or where the excellent tooling and rich ecosystem of ROS isn’t quite as important.

I’m specifically interested in what a “medium-level” protocol that can still be fully determined statically and is capable of hard-realtime interactions would be like. I’m imagining that UAVCAN, and a well-designed set of frameworks and tools, could be optimal for things like satellites and small robotic systems where the compute is distributed and limited.

I’m also interested in defining a common gateway between higher-level data interchange networks like DDS and UAVCAN sub-systems where UAVCAN can provide an appropriate abstraction for complex sub-systems that integrate with a vehicle through a single interface contract. But that is just one way of looking at this evolution. Others might argue that we should think more about DDS over UAVCAN to focus on the latter as a true transport protocol instead of a micro-application layer protocol. The biggest difference between the two approaches would be where simulation is inserted into a system. If a simulated system always omits UAVCAN then the DDS-over-UAVCAN approach is appropriate. If UAVCAN can also be used by simulated systems then my DDS-to-UAVCAN bridge becomes appropriate.

Note that this is not a well-researched or carefully considered post; it’s just some thoughts I decided to scrawl out while on a bus (like a physical bus…with wheels…and humans). Take it for what it’s worth.


About two weeks ago, the IEEE 802.3 working group published a new standard 802.3cg. This is a new variant of Ethernet optimized for fieldbus and low-level vehicle bus applications where the data rates are relatively low but the reliability and immunity requirements are high. The new standard directly competes with traditional fieldbuses such as CAN, EtherCAT, or Modbus, and is intended to displace them from the market. The key idea is that high data rate segments of an industrial or a vehicular network (SOME/IP, DDS) already leverage Ethernet, and it is beneficial from the systems design perspective to converge on the same networking technology across the entire stack.

The new standard currently defines a simple 10 Mbps single-pair architecture, with a possibility of future extension upwards. Both half-duplex and full-duplex modes are defined.

A supremely interesting addition compared to prior Ethernet standards is a deterministic multi-drop topology based on a new media access control policy called PLCA RS, which is similar to token ring. It enables the construction of time-deterministic bus topologies where media access is controlled by a coordinator (instead of the old non-deterministic decentralized CSMA/CD), although I suspect it might be problematic for some real-time networks because its inherently fair access policy interferes with prioritization. For reference, existing real-time high-integrity Ethernet-based systems typically rely on switched networks.

A modification of PoE called PoDL (power over data lines) is provided; it enables power delivery over the twisted data pair (not available for multi-drop topologies though):


The standard also includes a connector specification, including heavy-duty IP67 connectors for high-reliability systems operating in adverse environments.



Hi Pavel. I’m well into my attempts to implement UAVCAN on the Espressif ESP8266 / ESP32 microcontrollers and I think I’m at the point where I can ask some semi-sensible questions about certain details regarding the serial and UDP transports, which I intend to use.

First, I thought I’d give an overview of my use case so you can see why I’m making certain decisions: I’m working on “Maker-grade” laboratory equipment for citizen-science level home/small labs. Some of these may turn into vendor products in time, but they’re intended to be DIY kits published for anyone to build and modify. A typical lab can be considered to be a “room scale” robot with sensors and actuators that must work together to run an experiment. I’ve built prototype examples of:

  • Scales (microgram sensitivity)
  • Power meters & data loggers.
  • Mini Centrifuges (50,000 RPM or >1000G)
  • Robotic (“Digital”) Pipettes for dispensing millilitre to microlitre quantities.
  • Robotic Microscope (with digital camera, lighting calibration, CNC slide stage/focus)
  • Temperature controllers for hotplates
  • Magnetic stirrers
  • Liquid & Air pumps
  • Lights & Lasers
  • Pick & Place robot (to hold the pipette for automated tasks)

The devices can work independently in ‘manual mode’, but there are good reasons to network them:

  • Calibrating/Configuring devices through a UI that’s easier to use than attempting complex setups with a knob, two buttons and a half-inch OLED screen. (If that)

  • Central management/recording of an experimental protocol from a computer running Labview (or similar) or bespoke Python code.
    eg: When starting a new experiment, configuring the pipette to a particular dispense mode, setting the centrifuge to an appropriate speed/duration, weighing a centrifuge sample and then preparing a ‘counterweight’ sample, starting data recorders.

  • Remote monitoring of long-running experiments from outside the lab.
    eg: overnight cell culturing, plant growth. That implies multiple ‘management’ computers temporarily connecting to devices over the network, probably via TCP/IP. (since WiFi/UDP broadcasts are not usually routed outside the subnet)

  • Remote control of the protocol to avoid contamination.
    eg: If the experimenter is already holding their favorite electronic pipette, some of the buttons could activate the centrifuge / microscope / tray loaders to avoid touching other control knobs.

  • Emergency stop / safe mode buttons. In the case of a lab accident (eg, a sample exploding in the centrifuge, unbalanced centrifuge ‘going for a walk’, beaker boiling over on the hotplate, motor jamming, liquid spill, or even a fire) a big red button that shuts down all the equipment is preferable to running around the lab to turn off individual devices. Several E-Stop buttons might be distributed around the room (of different severity) and these must function even (especially!) if the management computer crashes or the WiFi access point fails - hence a desire for redundant or multiple transports.
    eg: If the building fire evacuation alarm goes off, a “Safe Mode” button at the door should put all the robots into standby / turn off the hotplates and centrifuges until the user returns.

Some pieces of equipment are able to be tethered, but others must be wireless (like the pipette) using network protocols like WiFi, Zigbee, LoRa or the “ESP NOW” mesh API which Espressif modules can use to send WiFi packets directly to each other without an Access Point. Other modules (like stepper motor controllers) can benefit from hard real-time links over CAN within a single device, which the ESP32 supports. Both Espressif chips have integrated WiFi and the ‘Arduino’ boards have USB serial interfaces.

For these reasons I’m especially interested in the UDP protocol over WiFi, and the serial protocol both over TCP/IP and USB serial connection. A typical use would be plugging a device into the USB port of the management computer for initial configuration so that it can then connect to a nominated WiFi access point / E-Stop button controller. Then it can be unplugged, moved to the bench, and be remotely monitored/operated over WiFi.

OK so given all of that, here are my specific questions about the alternative transports:

  1. Is it appropriate for the different network interfaces to act as independent “nodes”? If I plug in a USB serial cable it makes sense to use PNP to allocate the node ID for the ‘temporary’ serial transport. (I can’t predict the host ID / what other serial devices might be connected in advance, especially on out-of-the-box first use) Once the device’s WiFi interface is configured and brought up it will be allocated an address by DHCP and there’s no guarantee that will match the node ID already obtained for the USB serial connection. (And changing node ID’s is explicitly prohibited by the spec.)

  2. It would seem that it’s not possible to treat the USB serial and WiFi links (either UDP or TCP) as redundant transports for the above reasons, since the USB serial should always be available first. Does that sound right?

  3. Will pyuavcan have the ability to automatically detect new USB serial connections? Or is it left to the application (eg Yukon) to detect port changes and configure pyuavcan with the new transports? Are multiple hardware serial ports considered to be part of the same UAVCAN ‘network’? (And if so, will pyuavcan re-transmit messages from one serial port to the others?)

  4. Is it appropriate for multiple TCP/IP tunneled serial connections to a single device to act as connections to the same node, or should they also be instantiated as new nodes? Basically, should the transfer IDs be shared across multiple connections, or should a new TCP/IP session see counters starting from zero? (and potentially different node IDs)

  5. Am I correct in thinking that a separate transfer ID counter needs to be maintained for each ‘session specifier’, even on transports which have monotonic IDs? This seems to cause a proliferation of large counters that need to be kept indefinitely for an arbitrary matrix of transports/sessions, even though a single monotonic ID per transport (or node) is enough to guarantee uniqueness.
    eg: Would the Heartbeat function need to maintain transfer IDs for every serial/wireless transport, potentially including multiple TCP/IP serial connections, even though they may have equal-sized monotonic IDs? Or can they all be classified as redundant transports and share a single ID per function? Or can I just keep one ID per transport? Or node?
    The ESP chips are fairly roomy as microcontrollers go (80K to 200K of RAM) but memory is still tight.

  6. It makes sense that each Subject message be broadcast over all available transports in parallel, but should Service Response messages also be sent over all transports, or only the transport on which the Service Request originated? (I can’t find a clear rule in the spec. but I might have missed it.)

I’ve also got a few questions regarding the serial protocol when being used on ‘noisy’ hardware ports and software (XON/XOFF) flow control, but I might save those for now.

Hi Jeremy,

Thank you for the detailed description of your case. I think UAVCAN should suit it well even though this use case is somewhat unconventional.

Yes, it is acceptable and natural for a unit to expose independent UAVCAN nodes. Redundant transports only make sense if they are used in a similar configuration at runtime (regardless of whether the redundancy is homogeneous or heterogeneous).

Seeing as in your case USB and IP serve different physical networks, they should expose independent UAVCAN nodes, so yes, your assumption is correct.

PyUAVCAN does not involve itself with managing OS resources such as serial ports or sockets, it is the responsibility of the higher-level logic implemented in the application. How the OS resources are mapped to UAVCAN nodes is determined entirely by the application. It is possible to set up a redundant transport that leverages multiple ports concurrently, or several independent transports where each works with a dedicated (or multiple) serial port. I perceive you’ve already read this but I will post the link anyway for the benefit of other readers: https://pyuavcan.readthedocs.io/en/latest/.apidoc_generated/pyuavcan.transport.redundant.html.

Both approaches are viable. If your TCP/IP connections leverage the same underlying L2 network, then it doesn’t seem to make sense to have several of them. If they operate on top of completely different networks (e.g., one is lab-local, another is used to interface with a remote client, which is a very made-up example), you should use independent nodes per transport.

In general, the following rule applies: if the objective of an additional transport is to increase the reliability of the system, use the same node in a redundant transport configuration. If the objective of the transport is to expand connectivity to other (sub-)systems, use a dedicated node.

The objective of transfer-IDs is not only uniqueness but also detection of missing data, so they shall be sequential. Concerning your question, however, you seem to have misunderstood the fact that transfer-IDs are computed at the presentation layer and then shared among all available transports. Under your example, the correct option is “or node”.
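To make the “computed at the presentation layer, shared among transports” point concrete, here is a minimal sketch (invented names, not the PyUAVCAN API): the counter lives in one place per session, is incremented once per transfer, and every redundant transport emits the same transfer-ID.

```python
class OutgoingSession:
    """One transfer-ID counter per session specifier, shared by all redundant transports."""

    def __init__(self, transports):
        self._transports = transports
        self._transfer_id = 0  # a single shared counter; NOT one per transport

    def send(self, payload):
        tid = self._transfer_id
        self._transfer_id += 1  # incremented once per transfer (modulo 32 on CAN)
        for tr in self._transports:  # the same transfer goes out on every redundant transport
            tr.send_frame(tid, payload)
        return tid
```

Under this model the answer to the “per transport? or node?” question is indeed “per node”: the counter count scales with the number of sessions, not with the number of redundant interfaces.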

Section 4.1.3.4 Transmission over redundant transports states that all transports shall be utilized concurrently for any outgoing transfer. Service response transfers are not exempted.

Thanks for the reply! That helped a lot, though I might need to clarify some of my questions a little, especially about the Transfer ID.

Thank you for the detailed description of your case. I think UAVCAN should suit it well even though this use case is somewhat unconventional.

I agree… I’ve gone through almost every industrial automation protocol that exists, and MODBUS(+extensions) is the only other one that even comes close, but UAVCAN beats it by having the SI datatypes as part of the spec… pretty essential with scientific equipment. Every other spec. usually fails by having an onerous ‘vendor’ process (to get an ID that allows you to transmit on the bus) that essentially excludes hobbyist-level makers.

The objective of transfer-IDs is not only uniqueness but also detection of missing data, so they shall be sequential. Concerning your question, however, you seem to have misunderstood the fact that transfer-IDs are computed at the presentation layer and then shared among all available transports. Under your example, the correct option is “or node”.

Right, this is the part I should clarify… so what that means is that in the case where a device has multiple interfaces which act as separate nodes (USB, UDP, Zigbee, etc.) and might be running dozens of application-level functions (say 100 for a fairly complex app), then each ‘node interface’ has to keep a fairly large table (800 bytes for 64-bit monotonic IDs) for the Subjects.

If the unit wishes to invoke Services on other nodes (eg. as part of a discovery process that lists available devices for some app service) the spec seems to say that a counter also needs to be kept per local node interface + remote node + service port ‘session’. If there’s 100 devices on a network that appear over time, that’s another 800 bytes per service port. If I’m invoking a couple of services (even if they don’t respond) then I’m quickly using more memory to store transfer ID’s than actual network buffers, and I can’t ever deallocate them.

And that’s just the ID storage. Add overhead for the keys and table management and that’s likely to double. It could easily consume the majority of a microcontroller’s memory, especially if arbitrary numbers of virtual transports are allowed.
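For concreteness, the back-of-envelope arithmetic from the paragraphs above works out like this (all numbers are the illustrative figures from the discussion, not spec requirements):

```python
# Illustrative memory estimate for transfer-ID state on a multi-interface node.
TID_BYTES = 8        # one 64-bit monotonic transfer-ID
SUBJECTS = 100       # application-level functions publishing subjects
REMOTE_NODES = 100   # peers that may appear over time
SERVICE_PORTS = 2    # services this node invokes on remote peers
INTERFACES = 4       # e.g., USB, UDP, Zigbee, one TCP tunnel (assumed)

per_interface_subjects = SUBJECTS * TID_BYTES    # the 800 bytes mentioned above
per_service_sessions = REMOTE_NODES * TID_BYTES  # another 800 bytes per service port
total = INTERFACES * (per_interface_subjects + SERVICE_PORTS * per_service_sessions)
print(per_interface_subjects, per_service_sessions, total)  # raw counter storage only
```

That total is before any table/key overhead, which, as noted, can roughly double it on a real implementation.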

Now let’s say I violate the spec and keep ONE global Transfer ID counter per node interface (i.e., one for USB, one for UDP, etc.), shared between all Subjects and Service Requests. Since missing data is a typical event, it’s not going to affect much besides incrementing a ‘missing data’ counter on other nodes, if they even care about that. Right?

For apps that don’t have a hard reliability requirement, it won’t make any practical difference. It’s not nice, but it’s basically optional. My unit just gets flagged as a ‘low reliability device’.

For serial transports where each transfer must be decoded regardless (no filtering hardware) a single monotonic counter that increments per transfer satisfies the uniqueness requirement, and allows the transport to detect missing data (at the transport level) although it will not know the port that data relates to. However if the transport is over a ‘reliable’ link such as TCP/IP tunneled serial, PPP, xmodem etc. there’s basically no chance of that happening anyway.

It seems that for some of the alternate transports, maintaining a large table of Transfer ID’s has no benefit over a single global Transfer ID shared by all subjects, and is a detriment on memory-constrained devices.

Similarly, the requirement to maintain a Transfer ID counter per Service Request session specifier seems excessive. A single global ID per node (or even per unit) would suffice. The responding node simply copies the Transfer ID regardless… it doesn’t seem to care if the ID’s are non-sequential. There is no requirement in the spec. for the node running the Service to maintain a list of ‘last request Transfer ID’ that I can see. If services are idempotent, it shouldn’t even matter. The spec. doesn’t even seem to say what should happen if the requests are out-of-order. (I assume because Cyclic IDs on CAN would be problematic if enough frames were lost)

So if the Service Responder doesn’t really care about sequential Transfer ID’s per port and the Requester also doesn’t care (it can detect ‘lost data’ anyway because the response never comes) why not use a single global Transfer ID counter for all Requests for monotonic ID transports? (Of course the situation remains very different for Cyclic IDs)

I understand this is probably giving you a really bad feeling and violates some of your design goals, but I’d encourage you to have a think about whether sequential monotonic ID’s per session specifier is really an absolute hard requirement for all transports, or can be relaxed to ‘optional’ given how much of an overhead it can be. Especially for Service Requests.

Section 4.1.3.4 Transmission over redundant transports states that all transports shall be utilized concurrently for any outgoing transfer. Service response transfers are not exempted.

Sorry, I should have clarified… consider the case where multiple serial TCP/IP connections and a UDP transport all route to a single node instance, which you indicated was a viable approach. (It would certainly save memory) They’re not really ‘redundant’, so it’s a bit of a grey area. I’m asking if a request that comes in over a TCP/IP serial tunnel should cause the response to be sent back only to that tunnel, or should it be sent to ALL the TCP tunnels connected to that node and also via UDP.

At first glance that would seem to be wasting bandwidth on interfaces that aren’t involved in the request, but perhaps there’s a good reason for it.

The resource utilization issues you are describing seem to be rooted in the fact that your networks are highly dynamic. A typical vehicular bus is unlikely to encounter nodes or transports that are added or removed at runtime so the specification requires that once a transfer-ID counter is allocated, it should not be removed. This is described in section 4.1.1.7 Transfer-ID:

The initial value of a transfer-ID counter shall be zero. Once a new transfer-ID counter is created, it shall be kept at least as long as the node remains connected to the transport network; destruction of transfer-ID counter states is prohibited.

Footnote: The number of unique session specifiers is bounded and can be determined statically per application, so this requirement does not introduce non-deterministic features into the application even if it leverages aperiodic/ad-hoc transfers.

In v0 we had a provision for dynamically reconfigurable networks where we allowed transfer-ID counters to be dropped by timeout to reclaim the memory back. If you consider this measure sufficient to support your case, we could consider re-introducing that provision back into v1 by lowering the requirement level of the above text to “destruction of transfer-ID counter states is not recommended”.

Note, however, that such optimization requires the node to make assumptions about the maximum transfer-ID timeout setting on the remote (receiving) nodes.

Yes. However, observe that a sequence counting mechanism that enables detection of missing data is required by safety-critical vehicular databus design guidelines (e.g., FAA CAST-16). This capability is provided per subject/service. I understand that it may not be relevant for your specific case but we should explore other solutions before introducing special provisions for uncommon use cases in the specification.

This is specified explicitly but the wording may be suboptimal. Observe, section 4.1.4 Transfer reception:

An ordered transfer sequence is a sequence of transfers whose temporal order is covariant with their transfer-ID values.

Reassembled transfers shall form an ordered transfer sequence.

Therefore, an out-of-order transfer-ID indicates that the transfer shall be discarded unless the previous transfer under this session specifier was received more than transfer-ID timeout units of time ago:

For a given session specifier, a successfully reassembled transfer that is temporally separated from any other successfully reassembled transfer under the same session specifier by more than the transfer-ID timeout is considered unique regardless of its transfer-ID value.
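Putting the two quoted rules together, the reception check for a monotonic-transfer-ID transport can be sketched roughly like this (a simplification with invented names, not PyUAVCAN code):

```python
TRANSFER_ID_TIMEOUT = 2.0  # seconds; the default, which implementations may redefine


class IncomingSession:
    """Per-session-specifier reception state for monotonic transfer-IDs."""

    def __init__(self):
        self._last_tid = None
        self._last_ts = None

    def accept(self, tid, now):
        # A transfer is accepted if it continues the ordered sequence, or if this
        # session has been silent for longer than the transfer-ID timeout (e.g.,
        # the sender restarted and its counter reset to zero).
        ok = (self._last_tid is None
              or tid > self._last_tid
              or (now - self._last_ts) > TRANSFER_ID_TIMEOUT)
        if ok:
            self._last_tid = tid
            self._last_ts = now
        return ok
```

Note that “ordered” does not mean “contiguous”: `tid > last_tid` accepts gaps (lost transfers), while duplicates and reorderings inside the timeout window are dropped.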


If the interfaces are managed by a single logical node instance, then by definition they form a redundant group, hence it is required to emit every outgoing transfer once per interface. This behavior does not incur undesirable side effects because (section 4.1 Abstract concepts):

Redundant transports are designed for increased fault tolerance, not for load sharing.

The objective […] is to guarantee that a redundant transport remains fully functional as long as at least one transport in the redundant group is functional.

I understood that in the case you described you should run a dedicated logical node per transport. As a similar example, I am currently working on a UAVCAN bootloader for deeply embedded systems that supports UAVCAN/CAN alongside with UAVCAN/serial over UART or USB CDC ACM. The bootloader exposes an independent logical node instance per transport interface, so they do not form a redundant group. Hence, a request received over USB is responded to using USB only, on the assumption that the interfaces interconnect completely different networks (e.g., the CAN may be connected to the vehicular bus while the USB may interconnect only the local node and the technician’s laptop).

The resource utilization issues you are describing seem to be rooted in the fact that your networks are highly dynamic.

Oh yes, definitely. I’m very aware I’m porting this to an application domain which is a little bit ‘next door’ to what UAVCAN was originally intended for, but I’m hoping my experiences can be helpful to make the spec. more widely used in those domains. I see a lot of potential there, and the work on alternative transport protocols is an indication that UAVCAN is trying to expand in those directions.

I’ve had extensive experience writing microcontroller firmware over the years for PICs, Atmels (Arduinos), and now ESP, and a constant issue has been finding ways to link them together. Protocols are either so heavy-weight that they won’t fit the memory constraints, or so light-weight that they fail to provide enough capability. (eg. classic MODBUS doesn’t do strings or floating point numbers) IoT is all the rage now, and while RESTful services are easier to implement on modern chips, their Achilles heel is the need for centralized servers.

In v0 we had a provision for dynamically reconfigurable networks where we allowed transfer-ID counters to be dropped by timeout to reclaim the memory back. If you consider this measure sufficient to support your case, we could consider re-introducing that provision back into v1 by lowering the requirement level of the above text to “destruction of transfer-ID counter states is not recommended”.

The last thing I want to do is cause changes to the spec. which either make it more complicated, or which degrade the original focus on high-reliability intra-vehicle networks. What I suspect is happening here is a conflict between two core design goals of UAVCAN: high reliability for hard real-time systems, and minimal shared context.

The Transfer ID’s are the ‘battleground’ between those two goals - they are the minimum shared state required to create high reliability, and in static networks with fixed numbers of nodes and transports they do that with minimal overhead. All good.

But yes, in more dynamic implementations with arbitrary numbers of peers and alternate transports, that minimum shared state starts getting quite large. One design goal starts losing the conflict with the other.

Worse, the moment something becomes optional in the spec. (or “not recommended”) it begins causing problems for implementors. They’ll ask why it’s optional, and in what cases. That’s bad, confusion must be avoided at all costs.

So how do we resolve those conflicting goals? How do we keep FAA CAST-16 reliability, while potentially enabling low-context alternate transports for other domains (especially configuration and ‘debug’ monitoring) while preventing confusion?

Other specs resolve this by having ‘Profiles’. E.g., the MPEG/H.264 standards specify a set of features, but also define which features should be enabled in certain circumstances: limits on block sizes allow ‘hardware’ ASIC decoders in Blu-ray players to guarantee they will always be able to decode a baseline stream (since once sold, those players last for decades in people’s homes and cannot be upgraded), while other profiles, intended for modern professional-grade camera gear which only needs to communicate with equally modern editing software, are more flexible and advanced.

So they have a “Baseline Profile” used in videoconferencing hardware, a “High Profile” for high-definition television broadcasts, and “High 4:4:4 Predictive Profile” for pro camera gear. (as well as others) In effect the same algorithms are used in each profile, but with changes to datatype sizes and limits on the amount of processing power and memory available to the codec.

I would advise thinking along the same lines, and defining which features of UAVCAN form a “High Reliability Profile” (basically the entire current feature set) which gives the safety-critical guarantees you’ve worked so hard for, but also a “Low State Profile” which lists what parts of the protocol become optional when optimizing for that domain.

Any place in the spec where you say must or shall or should potentially gets marked with the profile that it’s for.

Ideally the profiles are interoperable in at least one direction… the same way an “Extended Profile” H.264 decoder can parse “Constrained Baseline Profile” streams but not vice-versa. But it’s clear to implementors that if they don’t conform to some optional part of the spec, then their implementation sits in a different class, even if they can exchange compatible frames.

This also gives you options down the track if you wish to extend the spec further, such as adding extra timing constraints that might differentiate a “Hard Real-Time” profile (perhaps implemented in an FPGA) from libraries such as pyuavcan which will always be limited by OS delays.

The issue I’m having is that I have to keep a large amount of state to satisfy a Profile that I’m never going to reach regardless. There’s no way a UDP transport over WiFi is ever going to meet FAA CAST-16 reliability standards.

So if I can’t reach that bar anyway, but still see huge advantages in using UAVCAN for its app-level features like SI datatypes and decentralized node discovery, then what other parts of the spec also become optional? Making my own choices on a feature-by-feature basis seems… unwise. And has the potential to fragment the standard into confetti.

I am currently working on a UAVCAN bootloader for deeply embedded systems that supports UAVCAN/CAN alongside with UAVCAN/serial over UART or USB CDC ACM. The bootloader exposes an independent logical node instance per transport interface, so they do not form a redundant group. Hence, a request received over USB is responded to using USB only, on the assumption that the interfaces interconnect completely different networks (e.g., the CAN may be connected to the vehicular bus while the USB may interconnect only the local node and the technician’s laptop).

Yes, that is exactly the kind of use case I’m also looking at! (The TCP serial case is mostly a wireless version of the same) That means you’ve also got a situation where you need to allocate lots of state to provide the alternate transport, so much it might potentially interfere with the CAN functions.

If I understand the bootloader concept, you may have a situation where the USB interface is connected to a host network with an unknown set of logical nodes, has to emit a broad range of subjects, and invokes services (if you’re using uavcan.file to retrieve the new firmware) on arbitrary remote units. Possibly while sharing some state/services between the ‘logical interface nodes’ within the unit (such as statistics and registers). I have this problem multiplied by a potentially arbitrary number of TCP/IP connections (practically limited to a dozen or so) if I choose to implement that transport.

Anything that can be done to reduce the overhead of the serial transport implicitly improves the reliability of the CAN transport. Paradoxically, degrading the reliability of one can improve the other. In this case, “The Perfect is the enemy of the Good” quite literally.

This is specified explicitly but the wording may be suboptimal. Observe, section 4.1.4 Transfer reception:
For a given session specifier, a successfully reassembled transfer that is temporally separated from any other successfully reassembled transfer under the same session specifier by more than the transfer-ID timeout is considered unique regardless of its transfer-ID value.

I’ll admit I didn’t get the full implication of that at first, but I did read the bit which said:

4.1.4.1:
Transfer-ID timeout is a time interval that is 2 (two) seconds long. The semantics of this entity are explained below. Implementations are allowed to redefine this value provided that such redefinition is explicitly documented.

And said to myself “Well, in that case I’ll simply set it to Zero for my implementation and I’ll document that and then I won’t have to worry about it anymore.” - again, not nice and probably not what you wanted, but if you’re going to make things optional then some of us are going to take the easy way out. :crazy_face:

Although in my defense I was thinking about the WiFi UDP and TCP/IP serial protocols where the order is fairly strictly determined by the laws of physics (for the radio link) and TCP protocol. Plus I’m a big fan of idempotency.

The alternative is storing even more monotonic Transfer ID’s per session for up to 2 seconds on what could be high-traffic links (10Mbit/s over WiFi) and that could easily overflow my microcontrollers’ limited RAM.

hmm… re-reading it again I still don’t see if the protocol explicitly specifies what should happen if the transfers (not frames, but complete transfers) arrive out-of-sequence. The Transfer-ID timeout seems to de-duplicate identical transfer ID’s within that interval, but non-sequential transfers seem to be totally allowed.

I suppose that implicitly means that out-of-order transfers are allowed and should be responded to in the order they arrive? And if they’re service requests it’s up to the requester to sort it out when the replies get back?

I guess that’s unavoidable. If a unit starts rebooting several times a second (eg from power brownouts) you have to accept transfers where the counter’s gone back to zero, although if it’s within the 2 second window they will be discarded.

Oh, on the topic of reception timestamps:

4.1.4.1:
Transport frame reception timestamp specifies the moment of time when the frame is received by a node. Transfer reception timestamp is the reception timestamp of the earliest received frame of the transfer.

In cases where a frame arrives in small fragments (say, over serial links), would you prefer that timestamp to be sampled at the start of the frame or the end? I could set the timestamp to the start frame delimiter, the first actual frame byte, the last byte, or the end delimiter in cases where they’re measurably different. I’d probably pick the first frame byte, since frame delimiters can be ambiguous as to which one you’re getting.

ps: when I’m re/quoting you, what’s the bbcode to include your name header thingy? That’s quite nice.

The idea of profiles was poked by Scott a few months ago: Future Proposal: Featherweight Profile. Overall I think it’s probably a sensible direction to move towards but we have not yet accumulated the critical mass of alterations to justify breaking off a profile. Scott’s proposal is actually on the opposite side of the determinism/flexibility spectrum but the idea is the same.

That’s not quite true. The way you described it sounds like the resource utilization of the protocol stack is a function of the network configuration which is not something that can be robustly controlled or predicted by a given node. The protocol is intentionally designed to ensure that a properly constructed implementation (stack) can demonstrate predictable behaviors regardless of the network configuration. It is also a featured property of Libcanard (which is optimized for real-time systems).

The bootloader allocates a well-known set of transfer-ID counters statically and its memory footprint is not dependent on which interfaces are active or which of the tasks are being performed. Now, your case is different because certain base assumptions that the Specification makes about the underlying communication system are not met in your design (it’s too dynamic).

But setting the transfer-ID timeout to zero (which is related to transfer reception) does not automatically relieve you from the requirement to keep transfer-ID states for outgoing transfers. If you just used zero TID for outgoing transfers you would run into compatibility issues with third-party software and hardware. But the following approach is viable from the protocol design standpoint (the RAM issue notwithstanding):

I think you might consider removing least-recently-used TID counters automatically when the RAM resources are exhausted. It’s probably the solution that minimally departs from the Specification.
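That LRU policy could be sketched like this, using an `OrderedDict` as the counter table (illustrative only; the capacity would be sized to the device’s memory budget):

```python
from collections import OrderedDict


class TidCounterTable:
    """Transfer-ID counters keyed by session specifier, with least-recently-used
    eviction when the table would exceed a fixed capacity."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._counters = OrderedDict()  # session specifier -> next transfer-ID

    def next_tid(self, session):
        tid = self._counters.pop(session, 0)  # new or evicted sessions restart at zero
        self._counters[session] = tid + 1     # re-insert as most recently used
        if len(self._counters) > self._capacity:
            self._counters.popitem(last=False)  # drop the least recently used counter
        return tid
```

The key property is that active sessions keep their sequential counters intact; only a session that has gone cold (and whose peer is therefore likely past its transfer-ID timeout anyway) risks a reset to zero.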

In the longer term, we should consider extending the Specification or introducing a profile (the latter is much more convoluted) based on your experiments here. We are watching you, Jeremy.

Out-of-order transfers shall be dropped. This is explicitly required by “Reassembled transfers shall form an ordered transfer sequence.” If you accept a transfer with an out-of-order TID, the set of transfers will not form an ordered sequence, hence a violation. Would you like to volunteer to add a footnote to clarify this, perhaps?

You don’t have to. Normally, in our applications, a node works non-stop until the system is shut down (if ever). See, the transfer-ID timeout has to be sized properly to suit the trade-off (section 4.1.2.4 Behaviors, non-normative blue box):

Low transfer-ID timeout values increase the risk of undetected transfer duplication when such transfers are significantly delayed due to network congestion, which is possible with very low-priority transfers when the network load is high.
High transfer-ID timeout values increase the risk of an undetected transfer loss when a remote node suffers a loss of state (e.g., due to a software reset).

I expect this to be specified once UAVCAN/serial makes it into the spec document. The existing experimental implementation in PyUAVCAN timestamps by the first delimiter of the frame (the ambiguity is resolved retroactively) and I think it’s probably optimal because the delimiter is the first element of the frame. Timestamping at the end is undesirable because the payload transmission time (and its escaping, if any) would skew the timestamp.

You can select the quoted text and then click “Quote”.

Yup, I totally get that. My intention is to implement the spec as fully as I can, and I would never intentionally violate the monotonic order by going backwards or to zero. The issue is maintaining strict sequence per session… that’s what potentially has 64K maximum IDs per port (one for each 16-bit serial protocol node ID)

At the moment I’m implementing the spec as-is, but with only 80K of RAM on the ESP8266 I know that there are edge cases (eg: a remote node - real or virtual - keeps rebooting and getting a new node ID over a long enough time, or an ‘intentional DDoS’ attack over TCP/IP serial) where my code will have to either throw away state intentionally, or crash. (and reboot, potentially adding to the problem!)

I do also plan to use UAVCAN (with libcanard) ‘as intended’ for some of my robots, which may have several ESP32 motor controller / sensor nodes within a single device connected by a CAN bus. That’s a big part of the attraction… using the same protocol for both the ‘internal facing’ and ‘external facing’ interfaces.

If you’re not familiar with the ESP32 I recommend having a look… it’s a very capable and popular chip. It will probably become my main platform, although I’d also like the stack to run on the older ESP8266. There’s a lot of those still out there.

https://en.wikipedia.org/wiki/ESP32

I’ll probably never get the stack to fit in an Atmel/Arduino without some serious compromises.

That’s why I was asking about violating the strict sequential order… I figured that skipping ID’s (by using a more ‘global’ counter) for outgoing service requests would at least prevent counters going backwards, and is an expected case. I could keep a ‘maximum ID I’ve ever used’ and restart from that.

Outgoing subject ID’s are at least a static size table, and while it would be nice to shrink that for very memory constrained devices, I can always constrain that by limiting the virtual node interfaces I create. (eg: by rejecting TCP/IP connection requests)

Subscribing to subjects requires a session table, but I can always limit the number of my own subscriptions if memory is tight, or dump old state on a timer (and ‘resync’ later) if I have to listen to an arbitrary number of nodes. I lose reliability, but that’s my problem. If the network is small and stable everything will be fine, and if not it’s my choice what’s most important to preserve.

It’s the service request session index that is the worry. Throwing away outgoing service request transfer IDs on a timed basis (or having a fixed-limit table and discarding the overflow) is certainly possible, but would seem to be a greater violation of the spec than skipping ahead? I can (mostly) predict the consequences of ID skipping, but how remote nodes will respond to completely lost state (ID resets to zero) seems more unknown. Especially if things get stressed and I have to do it repeatedly.

Throwing away incoming service request transfer ID state at least only affects the local node. I still think it’s way easier to simply respond to any monotonic request that arrives, regardless of apparent session transfer order or duplication. I get a request, respond, and then forget about it. Totally stateless. It makes little difference for “read” operations, and “write” conflicts can be handled at the app level through idempotency, which solves other conflict modes too. (like multiple node access)

Are there any ports/services which allow nodes to force the dumping of session state on remote nodes? Aka “I’m rebooting/shutting down, forget everything about me?” Is that behavior implied by the uavcan.node.Heartbeat MODE_INITIALIZATION and MODE_OFFLINE messages? Can I dump state if I don’t see a heartbeat for a while, or should I be assuming the node might come back?

I suppose that once I subscribe to a node for any subject (or make a request) I could also create a subscription to its heartbeat and use that to decide when to dump the session state. That’s not too hard.
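In code, that heartbeat-driven cleanup might look like the following sketch (invented names; the 3-second window is an assumption for illustration, not taken from the spec):

```python
OFFLINE_TIMEOUT = 3.0  # seconds without a heartbeat before state is reclaimed (assumed)


class SessionJanitor:
    """Drops per-remote-node session state once the node's heartbeat goes silent."""

    def __init__(self):
        self._last_heartbeat = {}  # remote node-ID -> timestamp of the last heartbeat
        self.sessions = {}         # remote node-ID -> session state (e.g., TID counters)

    def on_heartbeat(self, node_id, now):
        self._last_heartbeat[node_id] = now

    def collect(self, now):
        for node_id, ts in list(self._last_heartbeat.items()):
            if now - ts > OFFLINE_TIMEOUT:  # node presumed offline: reclaim its state
                del self._last_heartbeat[node_id]
                self.sessions.pop(node_id, None)
```

`collect()` would be called periodically from the main loop; a node that returns later is simply treated as new, with counters starting from zero.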

Ah! I get it now. I was reading that as “Reassembled (frames for a) transfer shall form an ordered transfer sequence (of frames)”, basically making explicit that all frames must be received for a transfer buffer to be complete. Sorry, my mistake.

Oh not yet. Let me at least implement the spec. first before I go changing anything? At least if my confusions on the way are blogged, I’ll be able to remember where I had troubles.

But yeah, some clarification of what implementors should do if it doesn’t arrive in strict sequence (because of transmission errors, weird packet re-ordering by routers, or node reboots) would help. It’s easy to specify you should transmit in order, but reception can’t be controlled.

A literal interpretation of that would imply that a lost transfer could leave the receiving node in a state that would reject all future transfers (a broken sequence) or if a node rebooted then all transfers would be rejected until the counter reached the previous known value. Both of which seem undesirable.

It could also be taken to imply that if the transfer order skips an ID, then the receiver should wait some time period in the hopes that the missing transfer will eventually turn up (by redundant transport or repetition or because the router re-ordered it) and that would mean delaying all the later frames in some buffer until that is resolved. Which also seems undesirable.

With Cyclic transfer ID’s everything would resolve quickly (once it cycles around to the same ID) but monotonic ID’s never would.

That would certainly be ideal. I also build and fly drones (mostly tricopters) and I know that bits of the drone are very susceptible to reboots (especially when the battery is getting low and you punch the motors hard) and a fast motor controller resync time is the difference between crashing and recovery! :fearful:

The CAN case has this covered, but if an arm takes a hit and you’re limping home with a broken wire and a rebooting speed controller and have a serial-over-Bluetooth redundant link, two seconds is a long, long time.

I’m thinking of the rare case when the frame start delimiter gets corrupted in transit… the frame should parse anyway (since one separator delimiter is enough, according to the implementation docs) but if the previous frame ended a while ago the timing could appear to be very skewed if the ‘most recent delimiter’ timestamp is used. (could actually appear to be ‘backwards in time’ long before the frame was sent!) That’s why I suggested the first header byte - it can’t go missing.

Under the model adopted by the spec, skipping ahead occurs in the event of a transient failure in the transport network (e.g., frame loss), whereas a reset to zero occurs if the remote node is restarted. In both cases, the behavior of both parties is well-specified.

Assuming that the loss of state (i.e., a transfer-ID reset) is unlikely to occur often (from your description I infer that it is so), removal of the least-recently-used transfer-ID on memory exhaustion is a lesser departure from the spec than the alternative. The alternative involves a continuous, per-transfer violation of the specification (reuse of a transfer-ID across different ports), whereas LRU TID removal occurs only under special circumstances such as network reconfiguration.
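For illustration, the LRU eviction policy described above can be sketched in a few lines (this is a hypothetical sketch, not PyUAVCAN code; the capacity and the shape of the session key are assumptions):

```python
from collections import OrderedDict

class TransferIDMap:
    """Bounded per-session transfer-ID state with least-recently-used
    eviction: when memory is exhausted, the session touched longest ago
    is discarded, and its next transfer restarts from TID zero."""

    def __init__(self, capacity: int = 4) -> None:
        self._capacity = capacity
        self._map: OrderedDict = OrderedDict()

    def next_transfer_id(self, session_key) -> int:
        tid = self._map.pop(session_key, 0)   # 0 if the state was evicted/new
        self._map[session_key] = tid + 1      # re-insert as most recently used
        if len(self._map) > self._capacity:
            self._map.popitem(last=False)     # evict the least recently used
        return tid
```

The spec violation here is confined to the rare eviction event (the evicted session restarts at zero, which looks like a node restart to its peers), rather than occurring on every transfer.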

The problem in this reasoning is that it introduces a leaky abstraction into the transport layer, requiring the application level to resolve issues pertaining to ordering and idempotency. If the transfer ordering constraints are upheld at the transport layer, the relevant context at the application layer is reduced.

There are no dedicated interfaces for state manipulation, and that’s not exactly the purpose of MODE_OFFLINE (it is intended as a way to let a node signal its departure explicitly instead of timing out).

That seems sensible, but does it add value if you have the LRU TID removal policy in place?

No no no. See, if a sequence is ordered, that does not imply that it is also contiguous.

Indeed, this could be a valid interpretation, although it’s borderline malicious. I suppose we should add clarifications around this.

Good catch. This failure case is not considered and it’s a bug. Will file a ticket against PyUAVCAN.

Yes indeed, it’s the duty of the implementer to identify the optimal trade-off between:

In v0 we had an explicit provision for an auto-tuned TID timeout, where the optimal value is computed by the implementation at runtime based on the messaging frequency; it is implemented in libuavcan v0. In v1 this possibility is not explicitly mentioned, but it’s not prohibited either. I am not yet sure whether there are any hidden complications or interesting edge cases arising from such behavior, so endorsing the approach in the spec would probably be unwise, but the possibility is nevertheless still there.
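One plausible shape for such auto-tuning (a sketch under stated assumptions, not the libuavcan v0 algorithm: the smoothing constant, multiplier, and clamping bounds are all illustrative) is to track a smoothed estimate of the inter-transfer interval and derive the timeout from it:

```python
class AutoTidTimeout:
    """Sketch of an auto-tuned transfer-ID timeout: the timeout follows an
    exponentially smoothed estimate of the observed transfer interval,
    scaled by a safety multiplier and clamped to sane bounds."""

    def __init__(self, initial: float = 2.0, multiplier: float = 3.0,
                 lo: float = 0.01, hi: float = 2.0, alpha: float = 0.1) -> None:
        self._interval = initial / multiplier   # start at the default timeout
        self._mult, self._lo, self._hi, self._alpha = multiplier, lo, hi, alpha
        self._last_ts = None

    def update(self, now: float) -> float:
        """Feed the arrival time of a transfer; returns the current timeout."""
        if self._last_ts is not None:
            sample = now - self._last_ts
            self._interval += self._alpha * (sample - self._interval)
        self._last_ts = now
        return self.timeout

    @property
    def timeout(self) -> float:
        return min(max(self._interval * self._mult, self._lo), self._hi)
```

For a fast periodic stream (say, 100 Hz ESC setpoints) the timeout quickly shrinks to a few tens of milliseconds, which is exactly the fast-resync behavior wanted in the broken-arm scenario above; the open question is how such adaptive behavior interacts with redundant transports of very different latencies.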

Well over a year has passed since the beginning of this thread. What does the specification roadmap look like today? Like @JediJeremy, I am interested in using UAVCAN over UDP/IP and would like to push the specification process forward.

Would there be a way to accelerate the specification (or at least a formal draft) of UAVCAN/UDP?