
UAVCAN v1.0 and ArduPilot


(Andrew Tridgell) #1

I thought I should put together some of the thoughts we’re having in the ArduPilot dev community regarding UAVCAN v1.0 versus the existing v0 protocol. I’m one of the people leading the effort for UAVCAN in ArduPilot, and so far I’m not convinced that embracing 1.0 is the right thing for us to do, as it would be highly disruptive just at a time when we are finally getting wider usage of UAVCAN.
The key pain points of the existing protocol from my point of view are:

  • lack of message extension
  • lack of dialect negotiation

I’ll give some examples of these pain points to illustrate.
Many of the issues with v0 uavcan can be illustrated by the gnss::Fix and Fix2 messages. The issues are:

  • the switch from Fix to Fix2 was done to add RTK status bits (mode and sub-mode), plus changes to 32 bit float for velocity and changing covariance representation.
  • we should have been able to extend Fix by adding the extra status bits without creating a whole new message. As uavcan didn’t support that, we ended up with both messages, and so GPS UAVCAN modules need to send both and flight controllers receive both, with no obvious path to ever get rid of the first message. That is not great.
  • for mavlink2 I added in the concept of message extensions, allowing for additional fields to be added to messages without breaking existing recipients. If a sender doesn’t know the extra fields then the recipient sees a zero value in those fields (the semantics of zero values needs to be chosen carefully to cope with this)
  • these message extensions have allowed us to evolve mavlink2 to add additional fields while maintaining compatibility. Just look for ‘extensions’ in the xml for examples.
  • we also have a capabilities message, which allows a flight controller to announce what new elements of the protocol it supports. This allows a GCS to know if it should use a newer protocol element instead of an older one. That helps a lot.
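
To make the extension idea above concrete, here is a minimal Python sketch of zero-filled extension fields. The message layout, the field names (`altitude_mm`, `status`, `rtk_mode`), and the helper functions are hypothetical and chosen only for illustration; this is not the actual MAVLink 2 or UAVCAN wire format.

```python
import struct

# Hypothetical layout: two core fields, plus one extension field added later.
CORE_FORMAT = "<iH"        # int32 altitude_mm, uint16 status
EXTENDED_FORMAT = "<iHB"   # core fields followed by uint8 rtk_mode (extension)

def encode_core(altitude_mm, status):
    """Encode as an old sender would: core fields only."""
    return struct.pack(CORE_FORMAT, altitude_mm, status)

def encode_extended(altitude_mm, status, rtk_mode):
    """Encode as a new sender would: core fields plus the extension."""
    return struct.pack(EXTENDED_FORMAT, altitude_mm, status, rtk_mode)

def decode_core(payload):
    """Decode as an old receiver would: ignore any bytes after the core."""
    core_size = struct.calcsize(CORE_FORMAT)
    altitude_mm, status = struct.unpack(CORE_FORMAT, payload[:core_size])
    return {"altitude_mm": altitude_mm, "status": status}

def decode_extended(payload):
    """Decode as a new receiver would: zero-fill missing extension bytes."""
    full_size = struct.calcsize(EXTENDED_FORMAT)
    padded = payload.ljust(full_size, b"\x00")
    altitude_mm, status, rtk_mode = struct.unpack(EXTENDED_FORMAT, padded[:full_size])
    return {"altitude_mm": altitude_mm, "status": status, "rtk_mode": rtk_mode}
```

An old sender paired with a new receiver yields `rtk_mode == 0`, which is why the zero value of every extension field has to mean something like "not provided".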

These concepts should be baked into uavcan. This is especially important for traditional 1MBit CAN, where the bandwidth is often very constrained, particularly given the huge encoding overheads (effectively under 0.5MBit of usable throughput). Wasting bandwidth by sending multiple almost identical messages is a very non-scalable way to handle communication on a limited bandwidth medium.

We’ve also found it painful to deal with the v0 messages due to the way they overcomplicate things. The Fix2 message with its numerous ways of representing the accuracies is a good example. Does anyone really want a 6x6 matrix? I know you can use the 6 element form, but it was really silly to go with such a general representation on a transport with such limited bandwidth when the real world use cases are just 3 numbers (horizontal accuracy, vertical accuracy and speed accuracy).

Similarly with RawAirData for airspeed. We only need 2 numbers, a pressure and a temperature. It has two pressures and four temperatures, plus a 16 element covariance matrix, none of which is clearly specified. It’s like a parody of a normal limited bandwidth comms protocol. What should have fitted into one CAN frame instead takes a pile of frames and leaves developers scratching their heads as to how to fill or interpret the fields.

I suspect the lack of message extensions is what led to the over-specification of messages like this. The person who makes up the message needs to think of all possible corner cases because they know they will have no sensible opportunity to fix it without having to create AirData2, AirData3 etc. If we had extensions then we could have started with the very simple message, and if there really was a need for the extension once the message is in use then add it without breaking the existing usage.

I know this is complicated by the seeding of the crc with the message structure signature. Mavlink has that as well, and I got around that by limiting the crc seeding to be based on the part of the message structure that is in the ‘core’ message (ie. with no extensions). That leaves the extension vulnerable to two devs adding different extensions, but it is a practical method that covers most of the things we care about while still giving us the ability to add extensions.
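
As a rough illustration of seeding from the core structure only, the sketch below computes a 16-bit seed from the message name and the core fields and deliberately ignores extension fields. The function name, the text layout that gets hashed, and the use of `binascii.crc_hqx` are assumptions made for this example; they are a stand-in, not the exact MAVLink algorithm.

```python
import binascii

def structure_seed(message_name, core_fields, extension_fields=()):
    """Return a 16-bit seed computed from the core fields only.

    `extension_fields` is accepted but deliberately ignored, so appending
    extensions never changes the seed.
    """
    text = message_name + " "
    for field_type, field_name in core_fields:
        text += field_type + " " + field_name + " "
    # crc_hqx is a CRC-CCITT variant from the standard library, used here as
    # a stand-in for whatever checksum the real protocol uses.
    return binascii.crc_hqx(text.encode("ascii"), 0xFFFF)

core = [("int32", "altitude_mm"), ("uint16", "status")]
seed_before = structure_seed("Fix", core)
seed_after = structure_seed("Fix", core, extension_fields=[("uint8", "rtk_mode")])
```

Here `seed_before` and `seed_after` are equal, so old and new nodes still accept each other's frames. The price is the one noted above: two developers adding different extensions to the same message would not be detected by the seed.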

I’m the first to admit that the mavlink2 extension system was not ideal. It was designed to fit within the constraints created from the history of the mavlink 0.9 and 1.0 protocols, while maintaining API compatibility as closely as possible and allowing for mavlink2 and mavlink1 to co-exist on the same transport. For end users it was a huge win, as it “just works”, and now they find that extra info turns up on their GCS displays that wasn’t available before.

At a lower level we may have been better off using something like DER encoding with ASN.1 structure. While lots of people dislike ASN.1, it does have a really nice mechanism for extensions while also being able to create reasonably efficiently packed messages.

Anyway, now that I’ve explained the pain points with v0 I’ll give my perspective on v1. I have yet to see anything that suggests that v1 actually addresses the above pain points. I also don’t yet see how we are going to give a smooth migration path for our existing users onto v1, especially if they have UAVCAN devices with low amounts of flash. Our uavcan bootloader is currently around 18k (including DNA etc), and is based around libcanard. The available flash space for the bootloader on existing devices is 23k. So we have 5k to add support for dual-stack with v1. Is that possible? If it isn’t then moving to v1 is a non-starter as we can’t get users to set up debuggers to change the bootloader, and we need dual stack anyway so they have a smooth path to try a new v1 capable firmware but can move the device back to the old firmware if things don’t go as planned.

Doing dual stack in the flight controller will be possible for the boards that have 2M of flash. For boards with 1M of flash it will be tight, but may be possible (the old Pixhawk1 with the 1M flash bug does still matter for us as a use case). Dual stack in a stm32f103 can node is a much harder proposition, as we have only a few k of spare flash, and very little free ram. We really don’t want to force hardware change on our users yet again.

For those who haven’t seen it, this is one of the key pieces of our UAVCAN ecosystem push for ArduPilot:


It is basically ArduPilot running on CAN nodes, using all the standard ArduPilot sensor libraries, but on small footprint MCUs like F1 and F3. The aim is to make it really easy for vendors to create new UAVCAN peripherals. They just need a hwdef.dat to give the pinout, and they need ArduPilot to support the sensor they want on UAVCAN. Then creating the firmware is trivial.
We’re making a big push at the moment to get lots of vendors to make peripherals based on this and we’re just starting to get momentum. A shift to v1 would likely stall the progress we’re making, which could kill off UAVCAN as a viable system for wider adoption for a long time. That isn’t attractive.

Anyway, that’s probably enough for now. I hope it was helpful in showing our perspective.

One final note, I am hugely grateful for the effort you and others have put into UAVCAN, and I realise my contributions have been trivial by comparison. None of the above criticism of v0 and v1 should reduce the fact that your efforts over many years have made what we’re doing now possible. We just need to make engineering decisions based on what we see as being the key factors for us now.

Cheers, Tridge


(Philip) #2

Our old devices are 64K, our new devices have no such limitations, but we are not in the business of deprecating support for devices that just because…


(jani.hirvinen) #3

All jDrones devices are 128k minimum. Such as Gen.Node, OLED Display, AirSpeed, Compass, Baro, Servo drivers and so on.


(Andrew Tridgell) #4

Following up my own post with another pain point of v0 that I forgot to mention.
We’ve found it quite common that a message gets added by a vendor, let’s call them COMPANYX, and then later we want to adopt that message into a different namespace, say ardupilot namespace or uavcan namespace.
Right now if we renamed a DSDL from org/COMPANYX/equipment/foo/2000X.FooBar.uavcan to org/ardupilot/equipment/foo/200YY.FooBar.uavcan then the signature would change, which means the original vendor’s equipment would no longer be compatible. This makes it really painful to do the natural migration of new messages from vendor namespaces into more widely used namespaces. The original vendor needs to carry patches against the upstream code that has adopted their message in order to use it with their equipment.
I don’t see how this is addressed in v1, although I could be wrong.
There are many ways this could be addressed. Perhaps the simplest would be to allow DSDL directives to specify an override for the signature, or allow an override for the namespace to be used when computing the signature.
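
A minimal sketch of the override idea, assuming a signature that is derived from the full type name plus the definition text. The `signature` function, its `signature_name` override argument, the dotted type names, and the use of truncated SHA-256 are all hypothetical stand-ins for illustration; the real v0 signature is computed differently.

```python
import hashlib

def signature(full_name, definition, signature_name=None):
    """Return a 64-bit stand-in signature for a data type.

    `signature_name` models a hypothetical directive that overrides the
    name (and therefore the namespace) used when computing the signature.
    """
    name = signature_name if signature_name is not None else full_name
    digest = hashlib.sha256((name + "\n" + definition).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

DEFINITION = "uint8 mode\nfloat16 value"
OLD_NAME = "org.companyx.equipment.foo.FooBar"   # hypothetical vendor name
NEW_NAME = "org.ardupilot.equipment.foo.FooBar"  # hypothetical adopted name

sig_vendor = signature(OLD_NAME, DEFINITION)
sig_renamed = signature(NEW_NAME, DEFINITION)
sig_override = signature(NEW_NAME, DEFINITION, signature_name=OLD_NAME)
```

A plain rename gives `sig_renamed != sig_vendor`, which is the incompatibility described above, while the override keeps `sig_override == sig_vendor` so the original vendor's equipment would keep working.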


(Scott Dixon) #5

We have not considered this but it is a very interesting idea that is very much in line with a key goal of v1, which is to provide greater flexibility for vendors to define types. @pavel.kirienko, what do you think of defining a formal mechanism for migrating a message definition to a different namespace while maintaining compatibility with the existing one? Ultimately names shouldn’t matter on the wire so the only thing that prevents this is a lookup that can tolerate duplicate definitions when the duplicates are compatible. Actually… didn’t I just do that for pyuavcan (https://github.com/UAVCAN/pyuavcan/pull/68)?


(Pavel Kirienko) #6

Tridge, thank you for the extensive post. The importance of such direct feedback from industry leaders is hard to overstate.

I think that we are well-aligned on the subject of v0’s deficiencies. The release of v1 is the direct result of our attempts to resolve them, and we are certainly aware that the breaking transition will somewhat damage the protocol and the surrounding ecosystem. We expect, however, that the short-term damage is far outweighed by the long-term benefits of the new v1 specification because its design is based not only on theoretical assumptions but also on our practical experience with v0. I think we covered it reasonably well in the recently published roadmap, in July’s article, and in the Stockholm Summit recap, so those who are looking for more details will know where to find them.

I understand that perhaps our approaches might seem non-obvious to someone with deep experience in the domain of small unmanned vehicles, but that should be attributed to the fact that the popularity of UAVCAN is growing in other domains, such as space vehicles and manned electric aircraft. I use the term “software-defined vehicles” to describe the meta-domain, I think it reflects the core principles well, as well as the role of UAVCAN in it, which is to serve as a medium-level protocol that is both deterministic and abstract. Extension of the protocol to the new domains should not reduce its utility for small unmanned systems; quite the contrary – considering the existing development practices and the growing regulatory pressure, UAV systems could benefit from adopting practices from the other fields.

I think that transferring existing methods and approaches from protocols designed to address different problems based on incompatible models and assumptions, such as MAVLink (2), is a serious mistake. One needs to keep in mind that the core design goals of UAVCAN include statelessness (low-context communication) and decentralization (no super-node, module-serves-the-network). Here is a relevant excerpt from one of the lengthy online discussions that shaped v1:

In order to steer this conversation away from dead-end paths, let me say now that any design decisions that focus on bus masters, centralized activities of any kind, or stateful/context-dependent communication go directly against the core design principles of the protocol. As such, things like protocol version negotiation at the time of dynamic node ID allocation, or centralized data type compatibility checking are not going to happen.

[…]

The reason why statefulness and context-dependency […] are evil and are to be avoided is that they introduce significant complexity and make node behaviors harder to design, validate, and predict. Each independent interaction between agents shall have as few dependencies on the past states as possible. This simplifies the analysis, makes the overall system more robust, and makes it tolerant to a sudden loss of state (e.g., unexpected restart/reconnection of a node). Additionally, in a decentralized setting, maintenance of a synchronized shared state information can be a severe challenge. Decentralization by itself is extremely important as it allows the network to implement complex behaviors while avoiding excessive concentration of decision-making logic in a single node, thus contributing to overall robustness and ease of system analysis.

The above should be sufficient motivation for complete avoidance of network initialization procedures of any kind. There will be no mandatory data types besides the already existing NodeStatus. Any node shall be able to immediately receive and interpret any transfer from the bus without any preparatory stages or special network initialization routines. This implies that the protocol will be redundant since all of the information necessary to interpret a given transfer must be directly attached to the transfer.

Some of the issues in v0 were caused by wrong base assumptions about how the ecosystem is going to operate. Lack of built-in means of advancing data type definitions was the direct result of the assumption that one who is defining a data type is able to model its usage in great detail, relieving the implementers and users from compatibility-related issues completely. Further, it was also assumed that the protocol maintainers will be able to foresee the most common application-level use cases and provide an adequate set of standard data types to address them.

As you are absolutely correct to point out, the result was unsatisfactory: we ended up with dozens of poorly specified, overly generic definitions that were impossible to advance. Support for vendor-specific types was lacking as well, so one looking to avoid dealing with the poor set of standard types could not do that easily, especially considering the fact that the application-level types were co-existing in the same type library (i.e., namespace) with the types that are essential for the operation of the protocol, such as NodeStatus.

We solved the problem by shifting the responsibility from the author of the data type to the integrator, requiring the latter to ensure that the equipment is configured to use the correct data type versions. As a result, data type authors now have the ability to introduce changes into data types, both breaking and backward-compatible. The v1 specification provides a set of strict, well-defined rules that allows one to reason constructively about breaking changes. Backward-compatible changes, such as the addition of new fields or some minor modifications of existing fields, are possible as well, provided that the memory footprint of the object is not affected. Specifically, when defining a new data type, the original author leaves some space unused, which can be utilized for additional fields in newer versions of the data type later. We opted out of supporting arbitrary extensions at the end because they complicate the use of the protocol in deterministic hard real-time systems, where the worst-case memory footprint of the object must be known statically.
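
The reserved-space approach described above can be sketched with Python's `struct` module. The two layouts and their field names are hypothetical and only approximate the idea of leaving unused space in a definition; this is not DSDL and not the actual v1 serialization.

```python
import struct

# Hypothetical version 1.0: int16 temperature, uint8 status, 3 reserved bytes.
V1_0 = "<hB3x"
# Hypothetical version 1.1: the reserved bytes now carry uint8 mode, uint16 flags.
V1_1 = "<hBBH"

# The serialized footprint is identical, so the worst-case size stays static.
SIZE = struct.calcsize(V1_0)

def old_sender(temperature, status):
    """Version 1.0 node: reserved bytes are written as zeros."""
    return struct.pack(V1_0, temperature, status)

def new_sender(temperature, status, mode, flags):
    """Version 1.1 node: fills the formerly reserved bytes."""
    return struct.pack(V1_1, temperature, status, mode, flags)

def old_receiver(payload):
    """Version 1.0 node: skips the reserved bytes entirely."""
    return struct.unpack(V1_0, payload)

def new_receiver(payload):
    """Version 1.1 node: sees zeros in the new fields from old senders."""
    return struct.unpack(V1_1, payload)
```

The sketch also shows the limit of the scheme: once the three reserved bytes are used, any further field changes the footprint and therefore requires a breaking change.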

The next obvious step that we took was to remove all application-specific types from the standard namespace. You will not find anything about GNSS receivers, IMUs, or ESCs there, and they are never coming back. Instead, we are delegating the task of defining and maintaining domain-specific data types to vendors, who are presumed to be far more qualified for that than UAVCAN maintainers are. Such definitions are still stored in the same repository but under a different namespace (not uavcan).

The seeding of the transfer CRC with a data type signature was a mistake. This behavior was dropped in v1 (along with several other simplifications of the protocol such as removal of tail array optimizations and data type identifiers); now, the CRC is invariant to the kind of data contained in the transfer. This change eliminated a leaky abstraction, providing a much cleaner design and layering, and removed the serious logical inconsistency that multi-frame transfers were protected against a data type mismatch while single-frame transfers were not. Further, it makes this case a non-issue:

We’ve found it quite common that a message gets added by a vendor, let’s call them COMPANYX, and then later we want to adopt that message into a different namespace, say ardupilot namespace or uavcan namespace. Right now if we renamed a DSDL from org/COMPANYX/equipment/foo/2000X.FooBar.uavcan to org/ardupilot/equipment/foo/200YY.FooBar.uavcan then the signature would change, which means the original vendor’s equipment would no longer be compatible. This makes it really painful to do the natural migration of new messages from vendor namespaces into more widely used namespaces. The original vendor needs to carry patches against the upstream code that has adopted their message in order to use it with their equipment.

The described case warrants special attention for a different reason: it seems to show that you might be managing the ecosystem in a suboptimal way. If you are supporting a piece of equipment from a third-party vendor, you are supposed to use the message definitions provided by that vendor. I don’t think you have valid reasons for creating verbatim copies of those definitions in your namespace; the utility of that action is negative because you add new entities to replicate existing functionality with no added value, increasing the burden of maintenance and confusing the adopters. I understand that you might have been induced to implement this approach by the fact that supporting vendor-specific namespaces in v0 was hard, but in v1 it should no longer be the case and so ArduPilot should avoid practicing that in the future.

If you are interested in the future of the standard, particularly in the sense of supporting new protocols (Ethernet, serial, wireless), then you will probably want to know that we are resurrecting the runtime compatibility enforcement in more capable transports that can tolerate the resulting overhead (Ethernet and serial). The new mechanism is not going to affect the CAN (FD) transport, it is purely a matter of future development. We call it “data type hash”, and it is roughly outlined here: Alternative transport protocols. I am not yet entirely happy with that proposal, because it moves the responsibility of ensuring compatibility from the integrator back to the data type developer, which could be seen as undoing some of the progress we have made in v1, and also it is not entirely compatible with the vague idea of polymorphic types that I was thinking about lately. At any rate, this is a matter of ongoing research, and it should be crystal clear that it does not, and will not, affect any existing or future v1-over-CAN deployments, because it simply does not map onto the CAN transport at all.

I think the best way to see how the issues you have outlined are addressed in v1 is to read the specification. I would also be delighted to have a call with you if you are willing to tolerate my bad English. :)


I am entirely sympathetic to the ROM footprint issues. I can’t say it with confidence, but I think that it might be possible to squeeze a dual-stack implementation into your bootloaders within the 5K budget, especially so if you are comfortable with optimizing libcanard heavily for your specific use case. We say that v0 and v1 are different, but for a bootloader, the difference amounts to slightly different bit layouts here and there, which should be easy to generalize over.

There exists an alternative which is not great but you may want to consider it anyway: update the bootloader together with the application when the version is changed. This is done trivially by embedding the bootloader image into the application binary. The major drawback of this approach is that there exists a brick-prone window between the point where the original bootloader is erased and the new one is installed.

Yet another alternative is to rely on the default bootloader supplied by STMicroelectronics. Those come with serious limitations, but they work as a last resort if the original bootloader is unusable and a JTAG/SWD probe is not available.

The PX4 project has a BSD-licensed UAVCAN v0 bootloader for STM32 targets that fits into 8K of ROM (made by David Sidrane and Ben Dyer), perhaps you could benefit from that also.

(Opinion: not sure if pertinent here, but I would say that designing a new part for a generic UAVCAN node with less than 512K of ROM is a mistake. Flash memory is cheap and saving pennies on it is unlikely to pay off considering the great difficulties of managing ROM-constrained environments in the long term. Software is getting more complex, and the hardware should evolve to suit the growing demands.)

The v1 version of Libcanard still requires work, and we could use all the help we can get. Kjetil and Åsmund did a great job already, but it’s not quite done yet. If anybody from the ArduPilot team would like to help, let’s coordinate here on the forum or on the dev call.


(Pavel Kirienko) #7

As I just posted above, nothing prevents one from migrating a data type from one namespace into another (unless you are using an experimental transport that uses the data type hash), but I don’t think the use cases where this would be actually needed are common.

The pull request you linked is about non-atomic namespaces, I don’t think it’s related to the current discussion.


(Andrew Tridgell) #8

Hi Pavel, thanks for taking the time to reply.

as long as you are aware of the different aims and contexts of the two protocols then I think that learning from the experience of other protocols is not a mistake at all. Far from it, it is essential to prevent big mistakes being made that are known to be mistakes from past experience.

but the solution you’re adopting (adding reserved fields), means two things:

  • the original message maker has to be prescient enough to know how many extension bytes may be needed in the future, something they have no way of knowing
  • as most messages won’t ever be extended, all users end up paying for those “might be needed someday” bytes.

If I understand your deterministic real-time argument then you couldn’t even zero-trim the tail of the msg (something we do in mavlink2 to great effect) otherwise the timing would change when extensions are used
so you’re ending up with a significantly less optimal system which is going to push simple messages into multi-frame when they don’t need it and which doesn’t actually solve the problem. That really doesn’t seem like a good design, especially on such a low bandwidth bus.
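
For readers who have not seen tail trimming, here is a minimal sketch of the idea: trailing zero bytes are dropped by the sender and restored by the receiver, so unused extension bytes at the end of a message cost nothing on the wire. The function names are hypothetical, and keeping at least one byte is an assumption of this sketch.

```python
def trim_trailing_zeros(payload: bytes) -> bytes:
    """Drop zero bytes from the end of the payload, keeping at least one byte."""
    trimmed = payload.rstrip(b"\x00")
    return trimmed if trimmed else payload[:1]

def restore(payload: bytes, full_size: int) -> bytes:
    """Receiver side: pad the payload back to the full message size with zeros."""
    return payload.ljust(full_size, b"\x00")

# A hypothetical 12-byte message whose last 6 bytes are unused extension space.
message = bytes([0x10, 0x27, 0x00, 0x00, 0x03, 0x01]) + bytes(6)
on_wire = trim_trailing_zeros(message)
```

In this example `on_wire` is 6 bytes long, short enough for a single classic CAN frame, while the untrimmed 12-byte form would need a multi-frame transfer.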

I think that is a very poor assumption, and I think it is going to lead to chaos for the consumers of these devices, at least in the consumer UAV space.
What tends to happen is that some new idea, say “RTK GPS with yaw” will suddenly get a lot of interest from many vendors at once (eg. when ublox F9P comes out). They all are rushing to get a product to market. Each will be working in secret and not talking to the others. They will end up each inventing a new “GPS with RTK yaw” message, which will all be a bit different. Then the flight controller developers will be faced with having to support and parse a dozen different messages and know how to test all of them. They won’t ever be able to drop them as products from all vendors will have shipped and will need to be supported for many years.
This issue will be reduced somewhat as I convince more vendors to adopt AP_Periph, which relieves them of the burden of having to develop their own UAVCAN firmware, and which ensures that development and testing can happen against a common platform, but it is hardly ideal for something we’d like to see widely adopted.

please correct me if I’m wrong, but I think this means we would then be reliant on only the low bit count IDs to determine how to parse msgs. We don’t have the luxury of a UUID system (with large bit counts) as the overhead is too high, so we are totally reliant on somehow coordinating these competing, secretive vendors in selection of IDs for their messages. What will stop ID conflicts leading to mis-parsing of data?

it is cleaner, yes, but it is cleaner in the way that a hospital is cleaner if it has no patients :)
I don’t think it addresses the practical reality of the way vendors work together (or more importantly, don’t work together)

the reason for moving it is social/political. The original messages are in the namespaces of startups that don’t necessarily want their name to be ensconced forever in the message used across the industry. When they develop the message to solve a problem they are not thinking ahead to when they want it to get into an upstream implementation, they are just trying to hit a deadline for a shipping product. Later when they start caring about working with others we want to move responsibility for the message into a longer term organisation namespace.

I had read it, but I didn’t see how the issues I raised are addressed. Perhaps an example of how data type evolution would work in practice would be good? Exactly how would the Fix/Fix2 problem be prevented (apart from just punting on the whole thing and saying “not our problem”)?

we already do that. We have the bootloader in ROMFS and a CAN parameter ‘FLASH_BOOTLOADER’. The user can set that to 1 using the usual parameter UIs and the app will check if the bootloader is up to date and flash it if not, then report what it has done using debug text msgs. It then resets FLASH_BOOTLOADER back to zero. Not really elegant, but it is functional.
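
The parameter-driven update flow described above can be modelled roughly as follows. Only the `FLASH_BOOTLOADER` parameter name comes from the post; the function, its arguments, and the log strings are hypothetical, and the real firmware is of course not written in Python.

```python
def handle_flash_bootloader(params, installed_image, bundled_image, flash_write, log):
    """Model of the update check triggered by FLASH_BOOTLOADER=1.

    Returns the bootloader image that is installed after the call.
    """
    if params.get("FLASH_BOOTLOADER", 0) != 1:
        return installed_image          # nothing requested
    if installed_image == bundled_image:
        log("bootloader up to date")
    else:
        flash_write(bundled_image)      # the brick-prone step
        installed_image = bundled_image
        log("bootloader updated")
    params["FLASH_BOOTLOADER"] = 0      # reset the request flag
    return installed_image

# Example run with in-memory stand-ins for flash and debug text output.
messages, written = [], []
params = {"FLASH_BOOTLOADER": 1}
result = handle_flash_bootloader(params, b"old", b"new", written.append, messages.append)
```

After this run the bundled image has been written once, the parameter is back at zero, and one debug message has been reported.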

my memory of reading up on that is that the restrictions are pretty bad. Not really attractive.

low-end new nodes are going to F303cc, which is 256k. We did that mostly so we could have a “universal” can node that can support all of the sensors that AP_Periph can do, with config via parameters, but it also gives us breathing space for dual stack. I don’t want to drop the existing 128k f103 support though.

ok, that’s good, although I’d like to be absolutely sure that is the case
The other big thing I’d like to clarify with you is if with dual-stack could we mix v0 and v1 nodes on the same bus at the same time with a flight controller receiving and acting on messages from both v0 and v1 nodes. From my reading of the spec and understanding of CAN I’m pretty sure the answer is no, but really that should be spelled out absolutely clearly in the v1 spec as it is the single most important thing about whether v1 is going to actually succeed and if someone should consider switching.
If I’m wrong about that and v1 can properly co-exist then that totally changes the conversation, and putting effort into v1 starts to make sense. If they can’t properly co-exist then I’d have to ask why on earth it wasn’t designed to allow for co-existence? I can’t believe it’s not possible, and the destruction of the nascent user and vendor community that is finally getting traction with v0 is truly disturbing.
Maybe in my sleep deprived state I just missed a clear statement on that in the spec and in the FAQ, but it really should have been in huge font and be right up at the front when explaining the basics. A search for “v0” doesn’t even get a hit in the v1 spec.
Cheers, Tridge


(Pavel Kirienko) #9

Regardless of how many fields one can append to an existing message, at some point the technical debt accumulating from such compatibility-preserving extensions would make evolution of the type difficult. From that assumption it follows that occasional breaking changes are unavoidable; given that, it should be possible to find a sensible balance between the probability of premature breakage and the amount of reserved space. Generalizing from my experience, I see that lack of particular fields is rarely a problem; generally, when an existing interface requires change, its structure ends up being affected significantly enough to make preservation of bit compatibility impossible or at least hard to ensure. I realize that this might be different in MAVLink considering that it is designed to address different objectives.

We have the major version number update policy to deal with such breaking changes on a per-type basis.

The problem with real-time systems and message trimming is that a real-time network has to guarantee that a specific performance goal is met under all operating conditions. If a message is changing length depending on its contents, the network would have to be designed for the worst case, that is, untrimmed message size. The outcome is that the bandwidth released by trimming would not be usable for real-time processes anyway.
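
The worst-case argument can be put into numbers with a small sketch. It assumes classic CAN frames with 8 data bytes, one tail byte per frame, and a 2-byte transfer CRC added to multi-frame transfers; treat these framing constants and both function names as assumptions of the example rather than a statement of the specification.

```python
import math

BYTES_PER_FRAME = 7   # 8 data bytes minus 1 tail byte (assumed framing)
CRC_BYTES = 2         # transfer CRC on multi-frame transfers (assumed)

def frames_needed(payload_len):
    """Number of classic CAN frames for one transfer of `payload_len` bytes."""
    if payload_len <= BYTES_PER_FRAME:
        return 1
    return math.ceil((payload_len + CRC_BYTES) / BYTES_PER_FRAME)

def frames_per_second(payload_len, rate_hz):
    """Bus load in frames per second for a periodic message."""
    return frames_needed(payload_len) * rate_hz

# A hypothetical 16-byte message at 50 Hz that usually trims down to 6 bytes.
typical_load = frames_per_second(6, 50)
worst_case_load = frames_per_second(16, 50)
```

With these assumptions the typical load is 50 frames per second but the worst case is 150, and a hard real-time bus budget has to reserve the 150 regardless of how often trimming succeeds.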

The above is not to say that trimming and other non-deterministic methods are useless. After all, we have variable-length data structures in UAVCAN – arrays and tagged unions. I perceive that given the real-time considerations, the utility of trimming compared to the resulting complication of the transport is insignificant.

If you think this is misguided, we could discuss this in-depth and see if it gets us to any interesting decisions.

The fact that the Specification is withdrawing from managing high-level application-specific data types such as GNSS messages doesn’t mean that they are not going to be managed at all. We have the dedicated repository for public regulated data types at https://github.com/UAVCAN/public_regulated_data_types, where any vendor is free to propose a new data type. Each vendor has a dedicated namespace. Users and implementers of UAVCAN are advised to rely on existing data type definitions published in that repo instead of reinventing the wheel. UAVCAN maintainers still decide which types should be accepted into the repository and which should not be, which sets some minimal quality bar, allowing us to combine the expertise of UAVCAN maintainers in designing UAVCAN interfaces with the expertise of vendors in their respective application domain. I think this approach is optimal or close to it because it allows us to combine the best of the two.

You understood the part about the low bit count correctly.

UAVCAN v0 used to have the concept of “data type ID”, it is gone from v1. In v1, a data type is identified by its name only. When one needs to establish communication over UAVCAN by publishing a stream of messages of a particular type, the stream is assigned a numerical identifier called “subject-ID” (as in, the subject of a message). Subscribers to the stream expect it to be of a particular type; it is up to the integrator to enforce the correct type matching.

There exists a very rare exception of data types where the subject-ID is defined by the data type designer; it is called “fixed subject-ID”. Normally, only the low-level types defined by the specification itself can benefit from a fixed subject-ID, for example, the Heartbeat message, the time synchronization message, log message, and most of the standard services.

Data types defined by vendors for use in closed ecosystems, such as within proprietary vehicular systems, also can benefit from fixed subject identifiers, because the scope and usage scenarios of data types defined for a closed ecosystem can be reasonably foreseen by the type designer.

Data types published in the public regulated data type repository are also allowed to use fixed subject identifiers; the ID conflicts are avoided simply because all of the definitions are managed in a centralized manner.

That’s it. A standalone vendor is simply banned from using fixed identifiers for their types if it is planned to be made public. Here, UAVCAN follows the same approach as DDS, CAN Kingdom, CANopen (assuming dynamic PDO mapping), or ROS: the syntax of a data type is entirely decoupled from its semantics.

A vendor releasing, say, an RTK unit, will have the following options:

  • Use data types with fixed subject identifiers from the public regulated data type repository. Soon there will be application-specific data types for RTK, but they are unlikely to ever get a fixed subject-ID because it breaks the architecture of the protocol, so see the next option.
  • Provide a configuration parameter, allowing the integrator to choose the subject-ID at the device integration time. This is the correct approach in this specific case.
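
The second option could look roughly like this from the integrator's side. This is only a sketch: `SubjectTable`, the parameter name `gnss.fix_subject_id`, the type name `vendor.gnss.Fix.1.0` and the subject-ID value 1234 are all made up for illustration, and comparing type names is a simplification of the type matching the integrator is responsible for.

```python
class SubjectTable:
    """Integration-time map from subject-ID to the data type carried on it."""

    def __init__(self):
        self._types = {}  # subject_id -> data type name

    def bind(self, subject_id: int, type_name: str) -> None:
        """Record that a publisher or subscriber uses type_name on subject_id."""
        known = self._types.setdefault(subject_id, type_name)
        if known != type_name:
            raise ValueError(
                f"subject {subject_id} already carries {known}, not {type_name}"
            )


# Hypothetical device configuration chosen by the integrator.
device_params = {"gnss.fix_subject_id": 1234}

table = SubjectTable()
# The RTK unit publishes on the configured subject...
table.bind(device_params["gnss.fix_subject_id"], "vendor.gnss.Fix.1.0")
# ...and the autopilot subscribes to the same subject with the same type.
table.bind(1234, "vendor.gnss.Fix.1.0")
```

Binding a different type to subject 1234 would raise `ValueError`, which is the kind of check the integrator has to perform in one form or another.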

You can find the discussion that led to the above decisions here if you are interested in the context: On standards and regulation

Understood. As I said, currently one can rename a type arbitrarily without affecting the compatibility. It’s important that the concept of compatibility is concerned only with the syntax of the data, that is, with the arrangement of its primitives, not with its purpose. This would break for the new experimental transports relying on the data type hash, but as I wrote already that is a matter of ongoing research and experimentation. I would like to explore the possibility of supporting polymorphism through the data type hash; it is not yet clear to me whether valid approaches exist.

A practical example is provided in section 3.8.3 of the specification, in the blue box at the end.

Speaking about the Fix types specifically, once Fix2 is released, the original Fix would be marked @deprecated. For a long while (the period depending on the typical design lifespan in the target industry), systems will be supporting both Fix and Fix2, then Fix will be removed from the data type set and vendors will cease to support it.

It’s complicated.

The short answer:

  • If the toggle bit in the first frame of a transfer is zero, you are dealing with v0.
  • If the toggle bit in the first frame of a transfer is one, you are dealing with v1.

The slightly longer answer requires a truth table. Each UAVCAN-over-CAN frame has a tail byte. The tail byte contains the metadata necessary for multi-frame transfer reassembly and deduplication (because the CAN bus may spuriously duplicate frames, but this is outside of the scope right now).

[image: tail byte layout]

The figure is taken from the v0 specification; in v1 it is identical except that the number of payload bytes can be up to 63.

  • Start of transfer – this flag is set if the current frame is the first frame of a transfer.
  • End of transfer – this flag is set if the current frame is the last frame of a transfer. For single-frame transfers, both start and end are set.
  • Toggle – this flag alternates between the frames in the same transfer. Its original state, that is, when the start flag is set, encodes the version of UAVCAN.

Start of transfer | End of transfer | Toggle | Protocol version
1                 | x               | 0      | v0
1                 | x               | 1      | v1
0                 | x               | x      | ?

See? To support multi-frame transfers in a dual-stack application, you will have to maintain an additional state on a per-transfer basis. It’s not difficult, just a minor inconvenience. Frames belonging to the same transfer share the same CAN ID value, so that helps (this is a hard requirement in both versions).
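
The per-transfer state mentioned above could be kept roughly as follows. This is a minimal sketch, not code from any UAVCAN library: the class name `VersionTracker` is made up, and the bit positions assume the usual tail byte layout (start of transfer in bit 7, end of transfer in bit 6, toggle in bit 5). State is keyed by CAN ID, relying on the rule that all frames of one transfer share the same CAN ID.

```python
from typing import Dict, Optional

START_OF_TRANSFER = 0x80  # assumed tail byte bit 7
END_OF_TRANSFER = 0x40    # assumed tail byte bit 6
TOGGLE = 0x20             # assumed tail byte bit 5


class VersionTracker:
    """Remembers the protocol version of each in-progress multi-frame transfer."""

    def __init__(self) -> None:
        self._active: Dict[int, str] = {}  # CAN ID -> "v0" or "v1"

    def classify(self, can_id: int, payload: bytes) -> Optional[str]:
        """Return "v0", "v1", or None if the version cannot be determined."""
        if not payload:
            return None  # no tail byte, so not a UAVCAN frame
        tail = payload[-1]
        if tail & START_OF_TRANSFER:
            # First frame: the initial toggle state encodes the version.
            version = "v1" if tail & TOGGLE else "v0"
            if tail & END_OF_TRANSFER:
                self._active.pop(can_id, None)  # single-frame transfer
            else:
                self._active[can_id] = version  # remember for later frames
            return version
        # Later frame: the version is only known from the remembered state.
        version = self._active.get(can_id)
        if tail & END_OF_TRANSFER:
            self._active.pop(can_id, None)
        return version
```

A frame that arrives without a remembered start frame is classified as `None`, which corresponds to the "?" row of the truth table.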

The full answer requires a reminder that in a CAN bus, nodes are not allowed to publish CAN frames with different payloads under the same CAN ID simultaneously. Within the same UAVCAN version, this is not a problem because the CAN ID differentiation is enforced by the node-ID embedded into it. When we share the same bus between v0 and v1, we arrive at the complication. Here is the ID structure defined for v0:

And here is v1:

It is easy to see that due to the source node-ID being shifted one bit to the left (which was unavoidable because we needed to reserve the least significant bit for protocol version, to avoid having this issue in the future), the sets of emitted CAN ID for nodes under different node-ID and different protocol versions intersect. The intersection is small and it does not make a difference for a typical light UAV where UAVCAN networks are simple, but it might be a problem for a safety-critical deployment unless special measures are taken to avoid collisions.

A robust solution is to ensure that v0 nodes are assigned even node-ID values, not odd – this works because a v1 node always sets the version bit, thus ensuring that CAN ID sets do not intersect. Alternatively, the differentiation could be enforced via the first three priority bits, for example all even priority values could be banned for v1 nodes (leaving only 1, 3, 5, 7), and for v0 nodes the following would be prohibited: 0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27.

However, observe that for the problem to actually have an effect on the application, all other bits of the CAN ID will have to be identical, which is highly unlikely. I expect that ArduPilot deployments may simply ignore the problem.


(Andrew Tridgell) #10

ok, so vendor1 proposes a msg there, then vendor2 sees it and decides to be a good member of the community and use that existing msg. So far so good. Then several things could happen:

  • vendor1 decides to rev the major on the msg. They don’t care about impact on others as this is “their” message.
  • vendor2 wants something extra in the message, now they have to create a new msg in vendor2 namespace

Why won’t this lead over time to an ever growing list of incompatible messages that every flight controller will have to deal with?

given the variable length arrays are all over the messages this conclusion seems unjustified. We could do zero tail trimming but a device could choose not to use it if the real-time consideration is important for that device. So make zero tail trimming optional. I suspect the vast majority of use cases will want it to be enabled.

many thanks for your detailed explanation of this compatibility! Maybe this should be added to the v1 FAQ or spec?
To actually make use of this I think we’d need DSDL compiler support for it. Maybe the v1 compiler could take a list of what v0 data types we need support for and automatically check if there are issues. If there are then it could enforce the priority values for conflicting v1 msgs (eg. rounding up), along with a compiler warning. That means the v1 compiler will need to parse v0 dsdl, or we need some variant of the v0 compiler which generates enough data for the v1 to be able to decide what the conflicts are.
If this is rare enough then it could be acceptable. I don’t like the idea of just rolling the dice and assuming no conflicts without something that checks it is a valid assumption, as debugging this would be nasty and it could easily be safety critical, especially if the conflict is with a message that is only generated under special circumstances (eg. “parachute release”).
If we can prove that this really works then v1 becomes a much more viable option with dual-stack.

I do think the lack of extension fields for messages in v1 is a huge mistake. In my opinion the real-time argument against them is very weak. If you do want to cover that case then we could add support for a “NOEXTENSIONS” flag in a DSDL definition to prevent extensions for that particular message. I don’t think many messages would need this.

That is a self-fulfilling argument. If there is no method to add message extensions with compatibility then the developer may as well change the structure in many ways at the same time. If there is a way to make it compatible then they will put the effort in to extend it in a compatible way. Just look at the numerous extensions we’ve added to the mavlink2 xml to see how this has worked in practice. It has been a massive win. Just look for “extensions” in this file.
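
To make the idea concrete, here is a rough sketch of how zero tail trimming and zero-filled extension fields fit together. It is not MAVLink or UAVCAN code: the two layouts and the helper names `trim_zero_tail` and `fit_to` are invented for illustration, and the serialization uses Python's `struct` module rather than either protocol's real encoding.

```python
import struct

BASE = struct.Struct("<iiH")       # hypothetical original fields: lat, lon, flags
EXTENDED = struct.Struct("<iiHh")  # same fields plus one extension field (yaw)


def trim_zero_tail(payload: bytes) -> bytes:
    """Sender side: drop trailing zero bytes, keeping at least one byte."""
    trimmed = payload.rstrip(b"\x00")
    return trimmed or payload[:1]


def fit_to(payload: bytes, size: int) -> bytes:
    """Receiver side: ignore unknown trailing bytes, zero-fill missing ones."""
    return payload[:size].ljust(size, b"\x00")


# An old sender talking to a new receiver: the extension field reads as zero.
old_wire = trim_zero_tail(BASE.pack(100, 200, 3))
print(EXTENDED.unpack(fit_to(old_wire, EXTENDED.size)))

# A new sender talking to an old receiver: the extension is silently ignored.
new_wire = trim_zero_tail(EXTENDED.pack(100, 200, 3, 45))
print(BASE.unpack(fit_to(new_wire, BASE.size)))
```

The receiver cannot tell a field that was sent as zero from a field that was not sent at all, which is why the meaning of zero in an extension field has to be chosen carefully.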


Cheers, Tridge


(Pavel Kirienko) #11

Because we expect people to collaborate sensibly. I don’t think there is a bulletproof solution to this problem that doesn’t make assumptions about human behavior, or at least I can’t see one at the moment, but I expect that the solution that we have is viable.

As good moderators, we will be on a lookout to prevent your scenario from developing in the public regulated repository.

It is important to shed the v0-focused mindset and internalize that v1 is far more forgiving and flexible when it comes to building generic pub/sub interfaces because it completely decouples syntax from semantics. In your example of a multi-antenna GNSS unit, a vendor could take an existing orientation-invariant GNSS type and publish the yaw on a separate subject using the standard type uavcan.si.sample.angle.Scalar.1.0. Another sensible alternative is to consult with the community on what the optimal interface should look like, and then standardize around that. Plowing through with an ad-hoc type is unlikely to be beneficial for anybody.

Vendors will always be interested in reducing the adoption barriers for their hardware; hence, collaboration and standardization are in their interest – to some extent, the market will self-regulate. Using an example from a different industry, would you not agree that a vendor that releases a very special camera that requires a dedicated PCIe extension card to operate would have a harder time crossing the adoption barriers than the one who relies on conventional Ethernet/FireWire/USB, even if the latter impose undesirable overheads? The appearance of new technology does have a tendency to cause various incompatible standards to appear, but such turmoil is relatively short-lived, followed by a steady state in which clear leaders have emerged and are kept stable by the market forces.

Maybe your view is a little too pessimistic?

Variable-length structures are not in the same category as zero-trimming. Real-time streams rarely contain variable-size arrays, and when they do, their size is fixed at the run time (for example, a covariance matrix). The size of a real-time message is defined by the external constraints of the process, such as the dimensionality of a process state vector, instead of the current state of the process. The addition of zero-trimming would shift the size dependency onto the latter.

Would it be beneficial for most applications? Of course. Would it be a zero-cost feature? Of course not. It does add complexity and my current assessment is that the utility of the feature does not outweigh the cost of its introduction.

The real-time argument was about arbitrary zero-trimming, not extensions, sorry for mixing that. They are tightly related but they are not identical.

I wonder if we could somehow express the concept of extensions through polymorphism that I mentioned in this thread already? Have you considered derived types in MAVLink?

The rough idea that I have is that the current hard requirement that messages exchanged over the same subject shall be of the same type is unnecessarily strict. I would like to allow type variance so that subscribers could use a more generic type than the subject type, and publishers could use a more specific type than the subject type. This trivially extends to allow backward-compatible extensions as long as the publisher-subscriber relation is covariant with their type relation.

The possibilities enabled by this are huge, but building upon the above discussion of compatibility I would like to provide the following specific example: suppose there is a standard equipment-specific type X. A vendor releases their special piece of equipment that needs extra data to be added to X, so they define their own type Y that derives from X. An autopilot that seeks maximum compatibility subscribes to X and can process messages of type Y. Another autopilot that seeks maximum feature utilization subscribes to Y and that renders it compatible only with the device of that particular vendor.

Extension fields can be trivially expressed through polymorphism by defining the extensions in derived types. This is not the same as zero-by-default fields in MAVLink, but it is a more powerful mechanism.
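
One possible reading of the X/Y example, purely as an illustration: assume the derived type serializes the fields of its base type first and appends its own fields after them, so that a subscriber to the base type can decode the prefix and ignore the rest. Nothing below comes from the specification; the field names, layouts and the prefix rule itself are assumptions.

```python
import struct
from dataclasses import dataclass

X_LAYOUT = struct.Struct("<ii")   # hypothetical base type X: two int32 fields
Y_LAYOUT = struct.Struct("<iih")  # hypothetical derived type Y: X's fields + int16


@dataclass
class X:
    lat: int
    lon: int

    @classmethod
    def decode(cls, data: bytes) -> "X":
        # Reads only the leading fields; any trailing bytes are ignored.
        return cls(*X_LAYOUT.unpack_from(data))


@dataclass
class Y(X):
    yaw: int = 0

    def encode(self) -> bytes:
        return Y_LAYOUT.pack(self.lat, self.lon, self.yaw)

    @classmethod
    def decode(cls, data: bytes) -> "Y":
        return cls(*Y_LAYOUT.unpack_from(data))


wire = Y(lat=100, lon=200, yaw=45).encode()
print(X.decode(wire))  # a subscriber to X still understands the vendor's message
print(Y.decode(wire))  # a subscriber to Y sees the extra field as well
```

In this sketch the opposite direction fails: `Y.decode` on a message that contains only X's fields raises `struct.error`, because the extra bytes are simply not there.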

@scottdixon Do we want to dig deeper?

We track it here: https://github.com/UAVCAN/specification/issues/20.

The facilitation of dual-stacking is way above the compiler’s pay grade.

The CAN ID conflict issue is much easier to resolve: just ask your users to use only even node-ID values for v0 nodes, and modify the v0 PnP allocator accordingly. This is a robust and sufficient solution. No need to do anything else, no need to touch the compiler or adjust the transfer priorities.
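
A modified allocator could be as simple as the sketch below. The function names are made up, and the code assumes the v0 node-ID range of 1 to 127 with 0 reserved for anonymous nodes. The second helper is an optional audit for nodes whose IDs are configured statically rather than allocated.

```python
from typing import Iterable, List, Optional

V0_MAX_NODE_ID = 127  # assumed: v0 node-IDs are 7 bits wide, 0 is anonymous


def allocate_even_node_id(taken: Iterable[int]) -> Optional[int]:
    """Return the lowest free even node-ID, or None if none is left."""
    used = set(taken)
    for candidate in range(2, V0_MAX_NODE_ID + 1, 2):  # 2, 4, ..., 126
        if candidate not in used:
            return candidate
    return None


def odd_v0_node_ids(node_ids: Iterable[int]) -> List[int]:
    """List the v0 node-IDs that violate the even-only rule."""
    return sorted(n for n in set(node_ids) if n % 2)


print(allocate_even_node_id([2, 4, 5]))
print(odd_v0_node_ids([10, 11, 42, 125]))
```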

In ArduPilot, I highly recommend running two nodes side-by-side if you want dual-stack: one v0 and one v1; they need not be aware of each other. Route the received CAN frames in the shared driver based on the protocol version bit in the CAN ID: zero goes to v0, one goes to v1; in this case, you won’t need to apply the stateful toggle bit detection logic. Make v0 optional, so that users who have migrated to v1 would not incur the unnecessary overhead.
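
The routing in the shared driver could be sketched like this. `DualStackRouter` and the handler callbacks are hypothetical, and the sketch assumes what is described in this thread: the version bit is the least significant bit of the CAN ID, and all v0 nodes use even node-IDs so that bit is always zero for them.

```python
VERSION_BIT = 0x01  # assumed: least significant bit of the 29-bit CAN ID


class DualStackRouter:
    """Hands each received frame to the v0 or the v1 node, never to both."""

    def __init__(self, v1_handler, v0_handler=None):
        self._v1 = v1_handler
        self._v0 = v0_handler  # None when v0 support is disabled

    def dispatch(self, can_id: int, payload: bytes) -> str:
        if can_id & VERSION_BIT:
            self._v1(can_id, payload)
            return "v1"
        if self._v0 is not None:
            self._v0(can_id, payload)
            return "v0"
        return "dropped"  # v0 frame received while v0 support is disabled
```

Leaving `v0_handler` unset corresponds to the case where v0 has been made optional and switched off.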

I think it’s pretty straightforward. Page @TSC21 FYI.


(Andrew Tridgell) #12

no, because it is conceptually harder, and more difficult to express across a range of language bindings. Not everything is C++.

how would that work for the language bindings? If the consumer wanted the extra capabilities of the derived type but the sender is sending the base type, how does that present in the API?
What happens when we want to add another extension to an already extended type? We end up with two layers of derived type? Then three?
This really seems to be twisting things pretty badly just to avoid the very simple concept of extensions in a message.

That is horrible. The autopilot will want to support multiple sensors at the same time, some of which will have the extensions and some won’t. Using the approach you’ve suggested would make life very hard for the autopilot developer.

No it isn’t! I’m not sure if you understand how v0 has been deployed in practice. Vendors have v0 UAVCAN nodes with firmware developed in-house where the developer has left the company. Some have hard coded node IDs. Some have the node ID wired into configuration files and monitoring code. These nodes may have FAA certification which requires no further code changes without re-certifying.
The UAVCAN developer community needs to make the transition to support v1 as painless as possible. What you propose is very fragile, it is very easy to make a mistake with a node with bad consequences and no way to detect the issue.
You chose to make v1 a breaking change. Now it is the responsibility of the v1 toolchain to make that change as painless as possible. The approach I’ve suggested where the v1 compiler knows about the incompatibility with v0 is a lot more robust. Users who don’t need v0 compatibility can just not list any v0 messages that they need to coexist with and the impact on their usage would be zero.
I think you need to re-evaluate your position with regard to v0. You need to show that you are a trustworthy steward of this ecosystem and really think about how the change to v1 will impact the existing users.
Right now I don’t actually see a lot to attract me to work on v1. It seems to have a very academic flavour to it, with very little practical thought about how the protocol is actually used. It doesn’t address the key pain points of v0 and the plan for the transition is badly thought out.
Cheers, Tridge


(Andrew Tridgell) #13

It has always been the curse of CANBUS that running different protocols on the same bus doesn’t work. Users find this extremely frustrating, especially with devices talking so many different CAN protocols eg. we support KDECAN, ToshibaCAN and UAVCAN in ArduPilot with more on the way. With transports that support IP protocols this is solved, and you can have hundreds of protocols on the same bus at the same time.
I know we can’t completely solve this for CAN, but we should try to not exacerbate the problem by making v0 and v1 hard to combine on the one bus. We need this to be a priority or we are letting down the UAVCAN community.
Whatever method is used must be robust. It is common that UAVCAN devices are used in high value vehicles doing things where mistakes really matter.


(Philip) #14

Thank you Tridge!

Now we have 0.9 out there, it’s time to be expanding.

All I ask for is no breaking changes. 1.0 is time to expand capabilities, add extensibility, and expand the ecosystem.


(Andrew Tridgell) #15

Instead of a “NOEXTENSION” tag, a “MAXEXTENSION xxx” flag in the DSDL may be better to satisfy the realtime constraints. This would specify the maximum number of bytes that extensions may use for this message. The author of the message could calculate what maximum message size would guarantee the constraints needed for the message and set it accordingly. The default would be no limit, so as not to constrain messages where realtime is not a consideration. If the author of a message was sure that adding any extensions would be detrimental then they could specify a maximum of zero.
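
As a rough illustration of the calculation an author might do, here is a sketch that checks an extension against such a limit and reports the resulting frame count. “MAXEXTENSION” is only a proposal, so the function names and the check itself are hypothetical. The frame count assumes classic CAN framing as I understand it from v0: 7 payload bytes plus a tail byte per frame, and a 2-byte transfer CRC on multi-frame transfers.

```python
import math
from typing import Optional


def frames_needed(payload_bytes: int) -> int:
    """Classic CAN frames needed for one transfer (assumed v0-style framing)."""
    if payload_bytes <= 7:
        return 1  # single-frame transfer, no CRC
    return math.ceil((payload_bytes + 2) / 7)  # 2-byte CRC, 7 bytes per frame


def check_extensions(base_bytes: int, extension_bytes: int,
                     max_extension: Optional[int] = None) -> int:
    """Reject extensions over the limit, else return the worst-case frame count."""
    if max_extension is not None and extension_bytes > max_extension:
        raise ValueError(
            f"{extension_bytes} extension bytes exceed the limit of {max_extension}"
        )
    return frames_needed(base_bytes + extension_bytes)


# A 6-byte message with a 1-byte limit stays a single-frame transfer.
print(check_extensions(6, 1, max_extension=1))
# With no limit, a 20-byte extension turns it into a multi-frame transfer.
print(check_extensions(6, 20))
```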


(Pavel Kirienko) #16

What I am talking about is an abstract concept defined at the protocol level. It is not related to any programming language, although I described it using the terminology that is widely used in OOP. Obviously it can be expressed in any conventional programming language from C to Erlang, but that is irrelevant. I didn’t mean to distract the discussion from the core topic though, sorry.

Per my description above, this is not allowed. Polymorphism is not a functionally equivalent replacement for extension fields, but potentially it can be used to attain the same goals.

Anyway, I think this discussion is slightly out of place here. Our design process is built around somewhat more formalized proposals than what I scribbled there in my previous post; the topic requires much more thought and analysis from my end before it can be discussed constructively. I mentioned it to indicate that we are not blind to the message extensibility problem and that it is on the long-term roadmap. I understand the value of extension fields, but I think we could search for a solution that offers a more rigorous data model and greater type safety than simply pasting optional values at the end. The solution is missing from v1 currently, but it certainly does not mean that it won’t be introduced in a compatible manner in a subsequent minor revision.

From re-reading the OP post, I understand that the lack of unconstrained message extensibility (where “unconstrained” means that the extension capability is not dependent on the availability of padding fields, as we discussed) and the breakage are the only perceived issues in v1. Saying that “v1 does not address the key pain points” is at least unfair and it takes ignoring the fact that v1 provides viable solutions to the critical issues in v0, many of which are not mentioned in the OP post.

UAVCAN v1 is being deployed in highly complex vehicular applications where v0 could not be used due to its inherent limitations, only some of which were reviewed here. The most critical issue in v0 is the syntax-semantics entanglement, which by itself was an underlying cause for some other, more apparent issues, such as the over-specification of the standard data types. UAVCAN v0 is a great protocol for trivial UAV applications with unsophisticated hardware setups and straightforward design requirements, such as those that can be found in various basic industrial applications or hobby machines. The problem of v0 is that it breaks outside of that domain, and it is not possible to take v1 out of there without breaking backward compatibility. While the transition is painful, it is beneficial for everyone, especially the existing adopters of v0, because the much-improved architecture of v1 will increase the reach and coverage of its ecosystem, effectively increasing the available product options for integrators and at the same time increasing the reachable market for product manufacturers.

The improved ecosystem management policies and the new technical capabilities enable a very long design lifespan for v1, potentially over 10 years, although the exact amount will be defined by the feedback we receive from adopters. That bar might be too high for v0 due to its monolithic architecture.

Considering the state of the industry at large, one can see that v1 is a fundamental improvement over v0.

This is new information for me. Do they ship hardware for integration into 3rd-party systems with hard-coded node-IDs? Could you clarify, please, what is their motivation here? Clearly it’s an obstacle to integration.

There is another approach to segregating the CAN ID space so that v0 and v1 do not intersect based on the subject/service identifiers, but I will need to describe it later in a separate post. The core idea is that v0 utilizes the lower part of the range, whereas v1 utilizes the upper part of it. The compiler-based approach will not work because it cannot account for the integration-time identifier assignment.


(Pavel Kirienko) #17

I opened two new issues in the Specification repository for visibility:


(Scott Dixon) #18

Totally agree with you Tridge. My take on what we decided with v1 was not so much to focus on “vendors” defining types (although this is something we could allow) but to allow other groups to manage ecosystems themselves. My assumption has always been that px4 would simply port the v0 types over to a v1 domain as-is and continue to evolve them from there. I also assumed that there would be no promise or expectation of compatibility across such domains. This allows market forces to select the healthiest ecosystem when they compete and allows for improvements to the underlying technology to be consumed by all domains.

For example, imagine this world:

Here we see px4 along with a fictitious competing drone ecosystem. Each domain can define and evolve types independently of each other. Manufacturers decide which ecosystem to support but do not dictate specific messages unless they are participating in that domain’s governing body.

At the same time we see two other orthogonal domains maintained by different groups. These are not competing with the drone domains and they are free to borrow types and maintain compatibility from other domains where that makes sense.

So, let’s say px4 is a robust domain supported by a large install base. Manufacturer PowerSys introduces a great new BMS that supports the Px4 power messages which quickly becomes a huge success. SatCo, which builds space vehicles using UAVCAN-space messages, sees the PowerSys BMS as a great product to use for their cubesats. They can either approach the manufacturer to add a version of their BMS that uses the UAVCAN-space Power messages or SatCo can work with the UAVCAN-space community to add support for the Px4 Power messages.

Anyway, this is how I hope it works. I do not want to see a world where vendors are defining arbitrary types and flight controllers are working to support a smorgasbord of devices.


(Andrew Tridgell) #19

That is fine as far as it goes, but as it is there is no scope for any of those vendors to evolve the message over time as new capabilities are needed. History tells us it will be needed quite frequently. If we can’t work out a sane method to extend messages then I think we will see a breakdown of the system you describe.


(Andrew Tridgell) #20

How would it be expressed in C? That is one of the tests of a good protocol, that it can be expressed clearly in a baseline language like C.

that means it doesn’t really help. Being able to talk to two different versions of the same class of sensor is really essential.

ok, but I hope you have now got the message that I think message extensibility is essential :slight_smile:

That is because I was listing the pain points that we experience with UAVCAN. I know you perceive different pain points, and that is fine, but as a project we need to decide if/when we embrace v1. At the moment it looks like v1 will bring a lot of new migration pain to our user community and it doesn’t actually address the existing pain points that our community has experienced. That makes it not an attractive option to spend time on.

only 10 years? I’d hope that a new protocol would aim for longer than that. Protocol code that I wrote more than 25 years ago is still in very active use. It has evolved over time (in a compatible fashion!) but it is still used.

I’m talking about devices that are developed in-house, and now form a critical part of their production platforms. Companies like that would like to be able to use new devices as they come out, but they expect that they can continue to use their existing devices. The expectation would be that as long as the old devices use a different node ID than the new devices that they can co-exist on the same CAN bus. I don’t think that is an unreasonable expectation.

as long as it allows for existing v0 devices to continue to operate on the same bus as v1 without changing their config then it would be workable. We need a flight controller to be able to take advantage of v1 devices while still communicating properly with legacy v0 devices.