
file.Read timeout even though all frames on bus

Hi!

I am trying out the example class FirmwareLoader for handling the background SW loading process.
During my SW update process (file.Read is fetching the file from the service every 50 ms), I sometimes notice that status_ = ErrorTimeout in the service response callback, even though all frames were actually sent on the bus (confirmed by sniffing with the GUI tool and a separate CAN dongle), hence completing both the request and the response of file.Read. The file.Read timeout is set to 5 s, but all frames are sent on the bus within a much shorter time.

I am running the FirmwareLoader in the main Node thread, so not much else is going on in that thread. My SubNode thread is injecting TX frames at the same time, and I can see a pattern when the problem occurs: it happens when my node is broadcasting and invoking other nodes' services at the same time.

I am thinking that there is a buffer issue somewhere, but I have not figured out where. I am running on a Debian machine.

Adjusting the SocketCAN txqueuelen does not make a difference.

Both threads are calling spin(50ms).

Any ideas on what is causing the file.Read timeout?
I am currently doing some trial and error to pinpoint the cause…

Kind regards

Your RX queue is getting overrun, probably. Since you are using SocketCAN, you should make sure that all your frames get delivered to the application. Launch the UAVCAN GUI tool on your SocketCAN interface and make sure that all your frames are still there. If they are, consider dumping all incoming frames from your application to make sure none are lost in its internal buffers.

I am now running candump on the target that has the timeout issue, comparing it with the GUI tool that runs the file service and the bus monitor. I will compare the two to see if frames are lost.

Hi!

I discovered that the response frames are reordered on the target side, where file.Read was issued. In the attached image, the left side shows the GUI tool bus monitor results for the file.Read response (the service runs on the PC). The right side shows the candump result on the target side.

I have marked the rows with x, y, z, w to indicate that the order is wrong on target side.

NOTE: PC is using 8devices CAN dongle.

I guess the reordering might happen on both sides? Nevertheless, it does not follow the spec:
"All frames of a multi-frame transfer should be pushed to the bus at once, in the proper order from the first frame to the last frame."
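For reference, the ordering can be verified from the tail byte of each captured frame. Assuming the UAVCAN v0 tail-byte layout (bit 7 = start of transfer, bit 6 = end of transfer, bit 5 = toggle starting at 0, bits 0..4 = transfer ID), a small hypothetical helper like this (not part of libuavcan) can flag out-of-order frames in a dump:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Tail byte layout per the UAVCAN v0 transport spec (assumption stated above).
struct TailByte
{
    bool start_of_transfer;
    bool end_of_transfer;
    bool toggle;
    std::uint8_t transfer_id;
};

TailByte decodeTailByte(std::uint8_t tail)
{
    return TailByte{
        (tail & 0x80U) != 0,
        (tail & 0x40U) != 0,
        (tail & 0x20U) != 0,
        static_cast<std::uint8_t>(tail & 0x1FU)
    };
}

// Returns true if the sequence of tail bytes forms a correctly ordered
// multi-frame transfer: SOT on the first frame only, EOT on the last frame
// only, and the toggle bit alternating starting from 0.
bool transferInOrder(const std::vector<std::uint8_t>& tails)
{
    bool expected_toggle = false;  // v0: toggle starts at 0 on the first frame
    for (std::size_t i = 0; i < tails.size(); ++i)
    {
        const TailByte t = decodeTailByte(tails[i]);
        if (t.start_of_transfer != (i == 0))               { return false; }
        if (t.end_of_transfer != (i + 1 == tails.size()))  { return false; }
        if (t.toggle != expected_toggle)                   { return false; }
        expected_toggle = !expected_toggle;
    }
    return true;
}
```

Running the candump output of a suspect transfer through a check like this is how I confirm whether the reordering really happens on a given capture.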

I will now experiment with txqueuelen on PC side.

As a reference test, I set up a passive node between the file server and the target.
So a PC with can dongle is listening on bus with GUI tool and monitors bus traffic between a file server node (on different hardware) and file read client (target).

It can be seen on the passive monitoring PC that the frames are in the correct order, by looking at the toggle bit, and that the file.Read response can be decoded properly. But the target still times out sometimes.

This shows that the frames are re-ordered on the response-receiving side (the target). And since candump shows exactly that, as seen above, the reordering takes place below the UAVCAN stack…

What can be the cause of this?
Perhaps multiple RX buffers?

Further investigation on target shows:

root@arm:/home/debian# ip -det -stat link show can1
4: can1: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 10
    link/can  promiscuity 0 
    can state ERROR-PASSIVE (berr-counter tx 0 rx 127) restart-ms 0 
	  bitrate 500000 sample-point 0.875 
	  tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
	  c_can: tseg1 2..16 tseg2 1..8 sjw 1..4 brp 1..1024 brp-inc 1
	  clock 24000000
	  re-started bus-errors arbit-lost error-warn error-pass bus-off
	  0          0          0          1          52         0         numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
    RX: bytes  packets  errors  dropped overrun mcast   
    88828500   11343328 20      45      20      0       
    TX: bytes  packets  errors  dropped carrier collsns 
    18396508   2543787  0       0       0       0

RX overrun does not feel good…

What hardware are you using on the target system?

At the moment I have this info:

processor	: 0
model name	: ARMv7 Processor rev 2 (v7l)
BogoMIPS	: 795.44
Features	: half thumb fastmult vfp edsp neon vfpv3 tls vfpd32 
CPU implementer	: 0x41
CPU architecture: 7
CPU variant	: 0x3
CPU part	: 0xc08
CPU revision	: 2

Hardware	: Generic AM33XX (Flattened Device Tree)
Revision	: 0000
Serial		: 0000000000000000

But I will have to come back with the actual MCU vendor and model…

Sorry, I was referring to the CAN hardware. I suspect that its low-level driver is misbehaving, hence the reordering. From your answer I expect that it’s some kind of embedded system with a low-level interface between the CPU and the CAN controller, like SPI?

I am not sure… I have to investigate and come back to you. My guess is that lowering the bitrate might solve the issue. But that is not my primary solution at this point.

Regards

The transceiver is an SN65HVD251P.
The main MCU is an AM3352BZCZA80.
The co-MCU is an STM32F105.

Reminds me of the BeagleBone Black… The platform has two CAN interfaces, probably connected directly to the main MCU.

A wild guess is that the D_CAN drivers are used, but I can't verify that yet…
https://processors.wiki.ti.com/index.php/AM335X_DCAN_Driver_Guide

Further findings:

My FirmwareLoader is running in the main thread while other lower-priority broadcasts and services are managed in the SubNode thread. I noticed that if I throttle the SubNode flushing, i.e. tx_injector.injectTxFramesInto(...) in the main thread, I can work around this issue.

I modified the flushTxQueueTo(...) function to only inject one SubNode frame in each call, with no while() loop flushing everything. The main thread now only blocks in spin() for 1-2 ms instead, to ensure injected frames do not stay in the queue for too long. This is of course a workaround until the root cause is figured out…
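To illustrate the pattern independently of libuavcan (all names here are hypothetical stand-ins for this sketch), the throttled flush boils down to popping at most one queued frame per call instead of draining the whole queue in one burst:

```cpp
#include <cstdint>
#include <mutex>
#include <queue>

// Minimal stand-in for the SubNode TX queue; 'Frame', 'ThrottledTxQueue'
// and 'flush' are invented names used only for this illustration.
struct Frame { std::uint32_t id; };

class ThrottledTxQueue
{
public:
    void push(const Frame& f)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push(f);
    }

    // Injects queued frames into the main node. When single_frame is true,
    // at most one frame is injected per call, spreading the TX load across
    // successive short spin cycles instead of bursting everything at once.
    // Returns the number of frames injected.
    unsigned flush(bool single_frame)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        unsigned injected = 0;
        while (!queue_.empty())
        {
            injectIntoMainNode(queue_.front());  // real code would call injectTxFrame()
            queue_.pop();
            ++injected;
            if (single_frame)
            {
                break;  // throttled mode: one frame per call
            }
        }
        return injected;
    }

private:
    void injectIntoMainNode(const Frame&) { /* placeholder for the real injection */ }

    std::mutex mutex_;
    std::queue<Frame> queue_;
};
```

With single_frame = true, the main loop has to spin for only 1-2 ms per iteration so that the queue still drains quickly enough.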

I do not know if the original flushing puts a heavy load on the lower drivers (perhaps because of loopback?) and in turn causes misbehaviour when draining the RX buffers…

Happy to receive thoughts around this issue…

Regards

Eh, I guess you can’t avoid digging into the driver implementation to fix this properly. I don’t know the specifics of your system, perhaps you could consider using a dedicated USB-CAN adapter that isn’t broken?

I believe my USB-CAN adapter is working properly…
And yes, it smells like a driver issue. While digging into that, I am working on a solution to get around it.
Do you see any problems with modifying the flushTxQueueTo(...) as:

    void flushTxQueueTo(uavcan::INode& main_node, std::uint8_t iface_index, const bool inject_single_frame)
    {
        std::lock_guard<std::mutex> lock(mutex_);

        const std::uint8_t iface_mask = static_cast<std::uint8_t>(1U << iface_index);

        while (auto e = prioritized_tx_queue_.peek())
        {
            UAVCAN_TRACE("VirtualCanIface", "TX injection [iface=0x%02x]: %s",
                         unsigned(iface_mask), e->toString().c_str());

            const int res = main_node.injectTxFrame(e->frame, e->deadline, iface_mask,
                                                    uavcan::CanTxQueue::Qos(e->qos), e->flags);
            prioritized_tx_queue_.remove(e);
            if (res <= 0)
            {
                cout << "________________ TX INJECT FAIL __________: " << res << "\r\n";
                break;
            }
            else if (inject_single_frame)
            {
                break;  // throttled mode: inject at most one frame per call
            }
        }
    }

I.e. I added a boolean to control whether only a single frame shall be flushed or not?

I suppose it should work as a tentative solution but it is very fragile and I advise against deploying it in production, whatever it might be.