UAVCANv1 libcanard & nunavut templates memory usage concerns

PetervdPerk · February 10, 2021, 2:31pm

While implementing UAVCANv1 in C using libcanard and the nunavut C templates, I’ve noticed that the RAM requirements for such an application can be quite high. I also see some potential memory savings if we make some changes to the implementation.

A good example would be making a simple publisher using the reg/drone/service/battery/status_0.2.uavcan message.

First we need to allocate a heap that libcanard can allocate memory from. We use O1Heap as our memory allocator; to calculate the correct heap size we use the formula H(M,n) = 2 M (1 + ⌈log2 n⌉), where M = (reg_drone_service_battery_Status_0_2_EXTENT_BYTES_ 600UL) and n = 1. Thus we need 600 bytes of preallocated heap in this single-publisher example.
Then in our publish function we need to allocate the generated reg_drone_service_battery_Status_0_2 C data structure on our stack, which is 1044 bytes.
To serialize the C data structure we have to allocate a payload buffer array with a size of (reg_drone_service_battery_Status_0_2_SERIALIZATION_BUFFER_SIZE_BYTES_ 534UL) bytes.
Then we allocate a CanardTransfer, put the payload in it, and the data is copied again when the transfer is handed to libcanard.

In total we need at least 600 + 1044 + 534 = 2178 bytes of RAM to make this simple example work. And this is just a single thread with a single publisher and no subscribers; if we add multiple publishers and subscribe to several subjects, this will grow quickly.
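
The total above can be written down as a compile-time check. This is only a restatement of the figures quoted in this post; the macro and function names are made up for illustration and do not come from libcanard or nunavut.

```c
#include <assert.h>
#include <stddef.h>

/* Figures quoted in the post for reg.drone.service.battery.Status.0.2.
 * The macro names are hypothetical; only the numbers come from the post. */
#define HEAP_BYTES            600U  /* heap estimate from the post */
#define MESSAGE_STRUCT_BYTES  1044U /* reported size of the generated C struct */
#define SERIALIZATION_BYTES   534U  /* ..._SERIALIZATION_BUFFER_SIZE_BYTES_ */

/* Fails the build if the three contributions no longer add up to the quoted total. */
static_assert(HEAP_BYTES + MESSAGE_STRUCT_BYTES + SERIALIZATION_BYTES == 2178U,
              "single-publisher RAM estimate changed");

/* Returns the estimated RAM needed by the single-publisher example, in bytes. */
static size_t publisher_ram_estimate(void)
{
    return (size_t) HEAP_BYTES + MESSAGE_STRUCT_BYTES + SERIALIZATION_BYTES;
}
```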

As for potential memory savings, I see 2 options that might mitigate this.

Currently nunavut generates all entries for an array. In the case of the cell voltages we generate a Float32[255], whereas we might only want to report, let’s say, 6 voltages, so we need to allocate 1020 bytes on our stack instead of 24 bytes. Yet we always know we’re not going to exceed this limit. Could there be a possibility to limit this at compile time, e.g. some kind of define we can override?
To publish a message you have to allocate 3 times:
1. Allocate the C struct
2. Allocate the payload array
3. Let libcanard allocate the CanardTransfer in the preallocated O1Heap array
Yet in theory we could avoid allocating the payload array and use the preallocated memory in the O1Heap instead. Of course this would require some architectural changes in libcanard.
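
To make the first option concrete, here is a minimal sketch of what an overridable capacity could look like. CELL_VOLTAGES_CAPACITY, CellVoltages and cell_voltages_push are hypothetical names; nunavut does not provide such a define as far as this thread establishes. The sketch assumes a 4-byte float, so 6 elements take 24 bytes instead of 1020.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical override knob (not an existing nunavut option):
 * define it on the compiler command line to change the capacity. */
#ifndef CELL_VOLTAGES_CAPACITY
#define CELL_VOLTAGES_CAPACITY 6U
#endif

/* The message definition allows at most 255 elements, so a larger cap makes no sense. */
static_assert(CELL_VOLTAGES_CAPACITY <= 255U, "capacity exceeds the limit of the data type");

/* Reduced-capacity stand-in for the generated variable-length array. */
typedef struct
{
    float  elements[CELL_VOLTAGES_CAPACITY];
    size_t count;
} CellVoltages;

/* Appends one value; returns false if the reduced capacity would be exceeded. */
static bool cell_voltages_push(CellVoltages* const self, const float volts)
{
    if ((self == NULL) || (self->count >= CELL_VOLTAGES_CAPACITY))
    {
        return false;
    }
    self->elements[self->count] = volts;
    self->count++;
    return true;
}
```

A cap like this only helps on the publishing side. A subscriber using the same reduced type would have to reject or truncate longer arrays, which is not shown here.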

scottdixon · February 10, 2021, 4:21pm

I can’t answer for libcanard but for Nunavut I’d love to define a couple of generic container types that the user can implement:

sparse array
vector

The PODs would generate whatever types the user specified and the serialization logic would use a Nunavut-defined interface to these types.
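
One way to picture such an interface in C is a small table of accessors that the serialization code calls instead of touching a fixed-size array. Everything below is a hypothetical sketch: the names Float32VectorView and serialize_float32_vector are invented, the one-byte length prefix is chosen for illustration, and the code assumes a little-endian host with 4-byte IEEE 754 floats. It is not Nunavut's actual design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical user-implemented container interface. */
typedef struct
{
    const void* user;                               /* opaque pointer to the user's storage */
    size_t (*size)(const void* user);               /* number of stored elements */
    float  (*get)(const void* user, size_t index);  /* element access */
} Float32VectorView;

/* Writes a one-byte length prefix followed by the elements.
 * Returns the number of bytes written, or 0 if the data does not fit. */
static size_t serialize_float32_vector(const Float32VectorView* const v,
                                       uint8_t* const out,
                                       const size_t capacity)
{
    assert((v != NULL) && (out != NULL));
    const size_t n = v->size(v->user);
    if ((n > 255U) || (capacity < (1U + (n * sizeof(float)))))
    {
        return 0U;
    }
    out[0] = (uint8_t) n;
    for (size_t i = 0U; i < n; i++)
    {
        const float x = v->get(v->user, i);
        memcpy(&out[1U + (i * sizeof(float))], &x, sizeof(float));
    }
    return 1U + (n * sizeof(float));
}

/* Example user-side storage: room for six cells only. */
typedef struct
{
    float  cells[6];
    size_t used;
} SixCells;

static size_t six_cells_size(const void* const user)
{
    return ((const SixCells*) user)->used;
}

static float six_cells_get(const void* const user, const size_t index)
{
    return ((const SixCells*) user)->cells[index];
}
```

With this shape the user decides how much storage backs the array, and the serializer never needs the full 255-element buffer.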

If you are interested in pursuing this direction please open an issue in the Nunavut repo.

david.lenfesty · February 10, 2021, 6:18pm

What might be an interesting topic to explore is using iterators instead of pushing all the frames onto a heap.

I’m trying that out in my Rust implementation, and it seems pretty ergonomic to me at least. Although there are two caveats: iterators are first-class concepts in Rust and as such have some really nice interfaces, and Rust has the borrow checker which prohibits some of the very easy bugs I could see coming out of a C implementation.

To me it seems quite useful because it consumes far less RAM at minimal expense per frame, and it’s a bit more flexible: you can still collect the frames into a buffer if you need that functionality. All you need to do is maintain a reference to the transfer payload and a few bits and bobs for housekeeping.
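
For readers who think in C, a rough sketch of the same idea follows. It is not taken from the Rust implementation mentioned above, and the names FrameIterator and FrameChunk are invented. The iterator only borrows the payload and hands out one chunk per call. Real UAVCAN/CAN framing (tail byte, transfer CRC, padding) is deliberately left out, and an empty payload yields no chunks here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical iterator state: a borrowed payload plus a cursor. */
typedef struct
{
    const uint8_t* payload;      /* borrowed; must outlive the iterator */
    size_t         payload_size;
    size_t         offset;       /* bytes already handed out */
    size_t         chunk_size;   /* payload bytes carried per frame */
} FrameIterator;

/* One frame's worth of payload, pointing into the borrowed buffer. */
typedef struct
{
    const uint8_t* data;
    size_t         size;
} FrameChunk;

static FrameIterator frame_iterator_init(const uint8_t* const payload,
                                         const size_t         payload_size,
                                         const size_t         chunk_size)
{
    assert(chunk_size > 0U);
    assert((payload != NULL) || (payload_size == 0U));
    const FrameIterator it = {payload, payload_size, 0U, chunk_size};
    return it;
}

/* Produces the next chunk; returns false once the payload is exhausted. */
static bool frame_iterator_next(FrameIterator* const it, FrameChunk* const out)
{
    assert((it != NULL) && (out != NULL));
    if (it->offset >= it->payload_size)
    {
        return false;
    }
    const size_t remaining = it->payload_size - it->offset;
    out->size = (remaining < it->chunk_size) ? remaining : it->chunk_size;
    out->data = it->payload + it->offset;
    it->offset += out->size;
    return true;
}
```

No per-frame storage is allocated: the driver would call frame_iterator_next() whenever it has room for another frame. The C compiler will not stop the caller from freeing the payload too early, which is the kind of bug the borrow checker rules out.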

(I don’t necessarily think it’s the solution but it’s what I’ve been using and I feel it may add to the discussion).

Here’s the impl for CAN

pavel.kirienko · February 10, 2021, 6:21pm

Speaking broadly, I wouldn’t worry too much about RAM usage because one can always either upgrade the MCU or (where not possible) get low-level and serialize things manually. That said, your memory estimates are incorrect.

You already understand this but for the benefit of other readers let me clarify that the heap size formula is not specific to O1Heap but is applicable to any first-fit allocator (many allocators implement the first-fit strategy; your standard malloc() probably does; other implementations have a higher worst-case bound). If you use less memory then you are risking releasing a product that works on the table but breaks in the field if the customer is unlucky enough to hit a bad allocation/deallocation sequence that leads to catastrophic heap fragmentation.
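
As a worked illustration, the bound H(M,n) = 2 M (1 + ⌈log2 n⌉) quoted earlier in the thread can be evaluated with a few lines of C. The function names are made up, M and n are taken exactly as they appear in the formula, and overflow of the result is not checked.

```c
#include <assert.h>
#include <stddef.h>

/* Ceiling of log2(n) for n >= 1, computed as the bit length of (n - 1). */
static unsigned ceil_log2(const size_t n)
{
    assert(n >= 1U);
    unsigned result = 0U;
    size_t   v      = n - 1U;
    while (v > 0U)
    {
        v >>= 1U;
        result++;
    }
    return result;
}

/* Evaluates H(M, n) = 2 * M * (1 + ceil(log2(n))). */
static size_t heap_bound(const size_t m, const size_t n)
{
    return 2U * m * (1U + (size_t) ceil_log2(n));
}
```

With the arbitrary values M = 1024 and n = 64 this gives 2 × 1024 × 7 = 14336. For n = 1 the logarithm term is zero, so the bound reduces to 2M.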

Since the required heap space is always greater than the peak memory usage, storing things in the heap is expensive, so one should prefer static/stack allocation where reasonable. Libcanard uses heap only for three kinds of state:

Prioritized TX queue. When you call canardTxPush, your data is segmented into CAN frames and they are enqueued into this heap-allocated queue. Each item is of size (MTU + ~32), so for CAN FD, it is about ~96 bytes per allocation.
RX payload buffers. When the first frame of a transfer is received, libcanard looks up its extent and allocates EXTENT bytes on the heap to store the reassembled transfer payload. This is obviously only for those transfers for which you have active subscriptions, otherwise nothing happens.
RX sessions. When we see a new node start sending transfers to a particular port that we are subscribed to, we allocate a bit of data for bookkeeping purposes (typically ca. 48 bytes per publishing node).

(the exact sizes depend on sizeof(void*) and sizeof(size_t))

Observe that Libcanard does not allocate CanardTransfer on the heap. Since you are interested in publishing data, your memory use looks roughly as follows:

Allocate your message object on the stack. For the battery status that sets you back by about 1 KiB as you wrote earlier, which shouldn’t be a problem for a modern MCU.
Allocate the serialization buffer on the stack. For the battery status that is up to 534 bytes as you wrote. (Here is an optimization prospect that one might consider: if your compiler is configured with strict-aliasing disabled then you can let the serialization buffer and the message object overlap. This works because our serialization routines are strictly single-pass; once a specific field is serialized it is never accessed again. This is of course inadmissible in high-integrity systems though.)
Invoke canardTxPush(), which will make up to ((serialization buffer size) / MTU) allocations, each up to (MTU + ~32) bytes large. If your message contains 6 cell voltages (36 bytes total) and you are using CAN FD (63 bytes per frame), then you get one allocation of ~68 bytes.
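
The TX queue arithmetic from the last step can be sketched as follows. This is a rough upper-bound estimate under stated assumptions, not libcanard code: the function names are invented, the per-item overhead of 32 bytes is the approximate figure given above, one byte per frame is assumed to be reserved so that CAN FD carries 63 payload bytes per frame, and the transfer CRC and padding of multi-frame transfers are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Number of frames needed for a payload; assumes an empty transfer still takes one frame. */
static size_t tx_frame_count(const size_t payload_bytes, const size_t payload_per_frame)
{
    assert(payload_per_frame > 0U);
    if (payload_bytes == 0U)
    {
        return 1U;
    }
    return (payload_bytes + payload_per_frame - 1U) / payload_per_frame;
}

/* Upper bound on the heap taken by one transfer in the TX queue:
 * one allocation of at most (mtu + overhead) bytes per frame. */
static size_t tx_queue_bytes_upper_bound(const size_t payload_bytes,
                                         const size_t mtu,
                                         const size_t overhead_per_item)
{
    assert(mtu > 1U);
    return tx_frame_count(payload_bytes, mtu - 1U) * (mtu + overhead_per_item);
}
```

For the full 534-byte serialization buffer on CAN FD this gives 9 frames and at most 9 × 96 = 864 bytes. The 36-byte message from the example fits into a single frame.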

So in this specific example, assuming also that you need to publish heartbeat and respond to some standard services, you should allocate about 2~4 KiB for the stack and ~8 KiB for the heap, which is very reasonable and hard to improve upon.

Generally, if you allocate message objects on the stack, then unused array items should not be a problem since they don’t contribute to the total memory use significantly (assuming that the lifetime of your object ends after it is serialized and pushed to the TX queue). If this is still a problem, then you can just edit the auto-generated code manually or even write the serialization routine by hand – such hard resource constraints call for drastic measures.

pavel.kirienko · February 19, 2021, 8:29pm

@JacobCrabill @dagar @PetervdPerk I would like to add one extra note on this as a follow-up to recent conversations on GitHub and Slack. While I don’t expect the memory footprint to become a problem, it is conceivable that one might desire to reduce it at some point for some reason. Should such a request arise, the first optimization prospect worth exploring is the migration of the UAVCAN driver from its dedicated O1Heap to the shared system heap (that is, the standard malloc/free). A large shared heap is likely to be substantially more space-efficient than small segregated heaps.

This change may introduce variable latency if the default system heap does not implement a deterministic allocation strategy (in NuttX it doesn’t), but whether the resulting latency fluctuations are acceptable or not is to be decided on a per-application basis (it is outside of the scope of the standard or its implementation libraries). If at some point you get a chance to test this approach and share your findings, I think it would be valuable.
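
A minimal sketch of what the system-heap variant could look like is given below. The callback shapes are modeled on allocate/free hooks that receive the library instance, but the types here (Instance, MemoryAllocate, MemoryFree) are hypothetical stand-ins, so check the actual libcanard header before wiring anything up. The live-allocation counter is an extra added for experiments and is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the library's instance type. */
typedef struct Instance Instance;

/* Hypothetical callback types: allocate and free hooks taking the instance. */
typedef void* (*MemoryAllocate)(Instance* ins, size_t amount);
typedef void  (*MemoryFree)(Instance* ins, void* pointer);

/* Number of blocks currently allocated through the hooks (for measurements only). */
static size_t live_allocations = 0U;

/* Serves the request from the shared system heap instead of a dedicated O1Heap arena. */
static void* system_heap_allocate(Instance* const ins, const size_t amount)
{
    (void) ins;
    void* const p = malloc(amount);
    if (p != NULL)
    {
        live_allocations++;
    }
    return p;
}

/* Returns the block to the system heap; a NULL pointer is ignored. */
static void system_heap_free(Instance* const ins, void* const pointer)
{
    (void) ins;
    if (pointer != NULL)
    {
        assert(live_allocations > 0U);
        live_allocations--;
    }
    free(pointer);
}
```

Logging live_allocations at its peak while the node runs is one cheap way to collect the kind of findings asked for above.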

Ultimately it would be interesting to discuss why NuttX is not using a deterministic allocator by default, but I suspect it is not something that can be looked at in the foreseeable future.