
SocketCAN API on an RTOS

This is a follow up to the SocketCAN proposal discussion for the NuttX RTOS.

SocketCAN allows CAN controllers to be addressed through the network socket layer (see the Wikipedia picture below - https://en.wikipedia.org/wiki/SocketCAN). The NuttX architecture already supports the socket interface for networking, so a SocketCAN driver should fit in nicely. The UAVCAN library already supports the SocketCAN interface, which is why it can easily be used on Linux systems. So an architecturally clean solution for adding UAVCAN to the S32K14x chips would be to:

  1. Create a CAN controller driver for the S32K14x.
  2. Create a SocketCAN driver, which would be beneficial for all NuttX targets.
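For reference, this is roughly what the application-facing side of SocketCAN looks like on Linux today; my assumption is that a NuttX SocketCAN driver would expose the same socket()/bind()/read() calls, with only the header locations and interface naming differing, so treat this as a sketch (error checks omitted):

/* Minimal SocketCAN receive sketch using the Linux headers; the assumption
 * is that a NuttX implementation would offer the same calls. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/can.h>
#include <linux/can/raw.h>

int main(void)
{
    int s = socket(PF_CAN, SOCK_RAW, CAN_RAW);   /* raw CAN socket */

    struct ifreq ifr;                            /* resolve "can0" to an interface index */
    strcpy(ifr.ifr_name, "can0");
    ioctl(s, SIOCGIFINDEX, &ifr);

    struct sockaddr_can addr = {
        .can_family  = AF_CAN,
        .can_ifindex = ifr.ifr_ifindex,
    };
    bind(s, (struct sockaddr *)&addr, sizeof(addr));

    struct can_frame frame;                      /* blocking read of one classic CAN frame */
    read(s, &frame, sizeof(frame));
    printf("ID 0x%03X, DLC %u\n", (unsigned)frame.can_id, (unsigned)frame.can_dlc);

    close(s);
    return 0;
}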

We’ve received feedback from Pavel:

I concur with the general idea that SocketCAN in Linux may be suboptimal for time-deterministic systems. At the same time, as far as I understand SocketCAN, the limitations are not due to its fundamental design but rather due to its Linux-specific implementation details. I admit that I did not dig deep but my initial understanding is that it should be possible to design a sensible API in the spirit of SocketCAN that would meet the design objectives I linked in this thread earlier.
https://forum.uavcan.org/t/queue-disciplines-and-linux-socketcan/548

Based on all the feedback, I decided to create a testbed running SocketCAN on a microcontroller. That’s when I found the Zephyr RTOS, which is partly POSIX-compatible and provides a SocketCAN implementation.

The testbed consists of:

  • NXP FRDM-K64F board
  • NXP DEVKIT-COMM CAN transceiver
  • Zephyr RTOS
  • CAN bus bitrate of 1 Mbit/s
  • PC sending an 8-byte CAN frame every 100 ms

When the interrupt occurs the GPIO pin is pulled up, and when the userspace application receives the CAN frame the GPIO is pulled down. This is shown by the yellow line; the pink line is the CAN frame.
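To make the probe placement concrete, here is a rough sketch of where the pin is toggled; probe_pin_set() is only a placeholder for the GPIO call used on the board, not an actual Zephyr API:

/* Latency probe sketch; probe_pin_set() is a placeholder, not a real Zephyr API. */
#include <stdint.h>
#include <unistd.h>

extern void probe_pin_set(int level);        /* drives the GPIO on the yellow trace */

/* Called from the CAN RX interrupt handler: marks the moment the frame arrives. */
void can_rx_isr_hook(void)
{
    probe_pin_set(1);                        /* yellow trace goes high */
    /* ... normal driver RX handling continues here ... */
}

/* Userspace receive loop: marks the moment the frame reaches the application. */
void rx_task(int sock)
{
    uint8_t frame[16];                       /* large enough for a classic CAN frame */
    for (;;) {
        read(sock, frame, sizeof(frame));    /* native CAN receive or SocketCAN read() */
        probe_pin_set(0);                    /* yellow trace goes low */
    }
}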

1st Measurement: Zephyr native CAN implementation
Time from frame to interrupt is 12.8us
Time from interrupt to user-space copy is 10.85us
On my oscilloscope I don’t see any variance in the “interrupt to user-space copy” time, therefore jitter < 0.001us

2nd Measurement: Zephyr SocketCAN implementation
Time from frame to interrupt is 12.8us
Time from interrupt to user-space copy is 65.32us
On my oscilloscope I don’t see any variance in the “interrupt to user-space copy” time, therefore jitter < 0.001us

Testbed conclusion

  1. Zephyr SocketCAN does increase the latency from interrupt to userspace by 54.47us. Is this acceptable? I don’t know; furthermore, I didn’t look into the specifics of the Zephyr network stack, so fine-tuning can still be achieved.
  2. Zephyr SocketCAN seems to be deterministic and doesn’t increase jitter, which is good real-time behavior.

Most of the points discussed in RTOS CAN API requirements can be covered by the SocketCAN API.

  1. Received/Transmitted frames shall be accurately timestamped. This can be achieved by adding a socket option that enables this behavior (see the sketch after this list).
  2. Transmit frames that could not be transmitted by the specified deadline shall be automatically discarded by the driver. This can also be achieved by a socket option, or an ioctl if libuavcan wants full control.
  3. Avoidance of the inner priority inversion problem: I didn’t look into the specifics of the Zephyr CAN implementation, but I assume a correct implementation will avoid this problem.
  4. SocketCAN allows easy porting of POSIX applications to an RTOS. I have an internal port of libuavcan (master branch) to Zephyr which was fairly simple. It’s working on the FRDM-K64F board, but I have to look for better tooling to measure the real-time performance of the full stack.
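Regarding point 1, this is how per-frame timestamping is enabled on Linux SocketCAN today; my assumption is that a NuttX implementation could expose the same SO_TIMESTAMP option (error handling omitted):

/* Per-frame RX timestamps via SO_TIMESTAMP, as done on Linux SocketCAN;
 * assumed here to be equally applicable to a NuttX implementation. */
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/uio.h>
#include <linux/can.h>

void read_frame_with_timestamp(int s)
{
    const int on = 1;                                /* ask for a kernel RX timestamp per frame */
    setsockopt(s, SOL_SOCKET, SO_TIMESTAMP, &on, sizeof(on));

    struct can_frame frame;
    struct iovec iov = { .iov_base = &frame, .iov_len = sizeof(frame) };
    char ctrl[CMSG_SPACE(sizeof(struct timeval))];
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
    };
    recvmsg(s, &msg, 0);                             /* frame plus ancillary timestamp */

    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c != NULL; c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_TIMESTAMP) {
            struct timeval tv;                       /* kernel RX timestamp of this frame */
            memcpy(&tv, CMSG_DATA(c), sizeof(tv));
        }
    }
}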

If the UAVCAN team can provide feedback on these results that would be great.


It is hard to judge without having specific application requirements at hand but my educated guess is that 54 microseconds on a 120 MHz ARM Cortex-M4 as a starting point (i.e., design worst case) might be acceptable, keeping in mind also that NuttX-specific opportunities for optimization may become available.

Regarding the question raised in the email about whether it makes sense to keep the old API alongside SocketCAN: I can’t say for other applications, but we found the old interface to be unsuitable for the needs of distributed real-time control systems, which was the reason we ended up writing baremetal drivers instead of relying on the RTOS-native APIs. Given that, I don’t see much value in keeping the old API around other than to support legacy applications, if there are any.


Thanks for all the due diligence here. It’s strong work but I’m not clear on the parameters of your testing. What interrupt are you measuring here? It looks like the receive interrupt? I’m actually more concerned with the latency and jitter of the transmitted messages because of the additional complexity introduced by intermediate tx buffers. I’ll explain this in a bit, but first: we also need to look at the jitter experienced when more than one user-space process is sending messages and with other load on the system that could interrupt the kernel (e.g. Ethernet traffic). This is where SocketCAN can lose the ability to provide the expected realtime guarantees that on-metal firmware enjoys.

Buffer Bloat and Virtual Arbitration

In this (very rough, sorry, didn’t have much time for this) diagram I’m showing what a Linux device using libuavcan v0 will have for tx queues and the path an enqueued message would traverse to get on-bus (I am omitting the peripheral and possible DMA queues to simplify our discussion). The revision cloud labeled (1) shows an unintended intermediate step that, for some queue disciplines, is completely incorrect (e.g. codel) and for others sub-optimal (e.g. pfifo_fast).

In addition to the extra latency introduced by the kernel buffer, we have priority inversions: the media layer’s expectation is that hardware-managed arbitration is taking place as soon as the next message in its priority buffer is sent to the system. Instead, the message sent to the system may get stuck behind a lower-priority message sent by another process. One solution is to change the queueing discipline to replicate CAN arbitration, but we already do this in libuavcan’s media layer, so now we would be doing this twice.

Even more, the software-managed arbitration makes the tx buffer act like a virtual CAN bus on top of the real CAN bus. This changes the topology of the bus, creating a star network on top of one (or more, if there are more POSIX systems on this bus) of the real CAN nodes. While I haven’t fully thought through the ramifications of this, it does seem like modeling tools may fail to account for latencies or bandwidth utilization if we treat these POSIX nodes as if they were regular CAN nodes. Trying to model the virtual topology could help, but it would require the UAVCAN standard to provide guidance for how routing between two networks should be handled.
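To make the “arbitrating twice” point concrete: on Linux the usual way a process influences the egress qdisc is by tagging its socket with SO_PRIORITY, which duplicates the priority ordering libuavcan’s media layer already performs in userspace. This is only a sketch of the existing Linux mechanism, not a proposal, and the mapping from the value to a qdisc band is host configuration:

/* Existing Linux mechanism, shown only to illustrate the double-arbitration
 * issue: the qdisc sorts by skb priority, which the process sets per socket. */
#include <sys/socket.h>

void hint_tx_priority(int can_socket, int priority)
{
    /* With e.g. the 'prio' qdisc on the CAN interface, this value selects the
     * band the socket's frames are queued into - a second, software-managed
     * arbitration on top of what libuavcan's media layer already does. */
    setsockopt(can_socket, SOL_SOCKET, SO_PRIORITY, &priority, sizeof(priority));
}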

A proposal

What if we avoided the problems of buffer bloat and multiple virtual arbitration by moving the libuavcan media layer into the kernel and having each user-space transport layer interface with it directly? This would provide the exact same characteristics for two or more applications in an RTOS as for two or more threads in a process that had direct and exclusive access to a CAN peripheral. It gives us a single set of problems to solve in libuavcan and puts logic optimized for the UAVCAN protocol in the kernel. SocketCAN would still be the API but the socket would get set up differently; for example:

s = socket(PF_CAN, SOCK_RAW, CAN_UAVCAN);

The revision cloud (2) in this diagram still denotes the use of software-managed arbitration, but a single tx buffer means we have far more deterministic behaviour. Furthermore, we can optimize timestamping within the kernel layer, reducing the amount of userspace CPU time each application has to dedicate to this task. Finally, we are able to handle redundant interfaces in this kernel module, which would otherwise get awkward if each application had to manage redundancy independently.
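To sketch what an application might see under this proposal (everything below apart from PF_CAN and SOCK_RAW is hypothetical; CAN_UAVCAN and the option shown do not exist today and the numeric values are placeholders):

/* Purely hypothetical sketch of the proposed in-kernel UAVCAN transport;
 * CAN_UAVCAN, SOL_UAVCAN and UAVCAN_OPT_NODE_ID are placeholders, not existing APIs. */
#include <sys/socket.h>

#define CAN_UAVCAN          11    /* placeholder protocol number                */
#define SOL_UAVCAN          111   /* placeholder socket-option level            */
#define UAVCAN_OPT_NODE_ID  1     /* placeholder option: set the local node-ID  */

int open_uavcan_socket(void)
{
    int s = socket(PF_CAN, SOCK_RAW, CAN_UAVCAN);

    /* The kernel module would own node-ID, redundant interfaces and
     * timestamping; applications would only configure them via options. */
    const int node_id = 42;
    setsockopt(s, SOL_UAVCAN, UAVCAN_OPT_NODE_ID, &node_id, sizeof(node_id));

    return s;
}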


Hi Pavel & Scott,

Thank you for the valuable feedback

@pavel.kirienko, I do agree that 54 microseconds is “not great, not terrible”. If we take latency and determinism into account when implementing SocketCAN, we can aim to get this number down. About the NuttX mailing list discussion, I agree that I see no use for the old CAN interface and don’t want its weight to bear down on the new implementation.

@scottdixon I’m measuring the receive interrupt, and then I’m measuring the time it takes to get the data to the userspace application. I didn’t measure the transmitted messages, but based on your feedback I’ve decided to measure them as well.

The same setup as in my initial post is used, but now the MCU is sending a CAN message every 100ms. Before sending, it pulls the GPIO up; that’s how we can measure the latency from the send call to the CAN frame appearing on the bus.
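As before, probe_pin_set() in the sketch below is only a placeholder for the board’s GPIO call; it just shows where the pin is toggled relative to the send:

/* TX latency probe sketch; probe_pin_set() is a placeholder, not a real API. */
#include <stddef.h>
#include <unistd.h>

extern void probe_pin_set(int level);     /* drives the scope probe GPIO */

void send_probe_frame(int sock, const void *frame, size_t frame_len)
{
    probe_pin_set(1);                     /* scope trace goes high */
    write(sock, frame, frame_len);        /* frame enters the stack here */
    /* The scope measures from the GPIO edge to the frame's start-of-frame on the bus. */
}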

(Unfortunately I’m not allowed to upload images here, therefore I’ve uploaded them to imgur.)

Zephyr Native CAN send
Has a latency of ~4us with a jitter of ~0.5us

Zephyr SocketCAN send
Has a latency of ~55us with a jitter of ~0.5us

It’s true that these numbers are measured without the influence of other processes/tasks on the system, and the question is how we can still provide realtime guarantees in those cases.

I think your proposal of moving the libuavcan transport/media layer into the kernel by adding a network family CAN_UAVCAN, just like TCP/UDP sits on top of IP, is a great idea (it even opens the possibility of adding this network family to Linux). However, this approach also has some caveats:

  • libuavcan is C++; moving this transport layer into the kernel would mean a rewrite in C (we could use libcanard, though I’m not sure if it is feature-complete)
  • I’m not sure how flexible the libuavcan design is when it comes to moving the transport layer; it might require some kind of fork
  • I’m not sure how big the libuavcan transport layer is, but I would suppose that a kernel maintainer doesn’t want to maintain a big & complex network family
  • We’re still not sure whether this approach would improve the realtime performance (high effort vs. low reward)

However, we do have an alternative: we can implement libuavcan on NuttX using SocketCAN and use some kind of zero-copy queue discipline to avoid the buffer bloat you’ve explained in your picture. This would avoid the unnecessary buffers, and if the implementation turns out to be good enough, we can always decide later to move the libuavcan transport layer into the kernel.

I like where this is going. Moving the transport layer into the kernel does sound like an interesting idea. I would also like to see it implemented in the Linux version of SocketCAN. With the transport layer implemented in the kernel, and the DSDL serialization implemented in Nunavut in an implementation-agnostic way, there will be very little logic left in Libuavcan itself, so I imagine that some applications would choose to work on the raw socket layer without any additional library logic on top.

I am working on Libcanard (slowly) and there’s not much left to be done. The transport layer is under 1k lines of C99 (in essence, Libcanard is just the transport layer itself with some very minimal serialization logic on top which is irrelevant for the kernel) so I imagine porting that into the kernel space should be fairly trivial.