Author Topic: Any SocketCAN gurus around here?  (Read 11584 times)
woj
Hero Member
*****

Karma: +41/-3
Offline Offline

Posts: 500


« on: June 07, 2021, 10:57:16 AM »

I am suffering from a weird communication instability problem when flashing my car with my own home-cooked flasher on Linux / SocketCAN. Although my flasher pushes the limits when it comes to speed / bus load, it worked flawlessly for almost 2 years. With a recent Ubuntu upgrade (or so I assume) from 18.04 to 20.04 it became unstable, most probably due to lost frames, out-of-sequence frames, or bus overload. The result is that flashing gets interrupted.

Now for the weird part - it is stable on my bench ECU with no other CAN devices on the network, while on the car, with all the other ECUs spitting out CAN traffic, it crashes. It becomes stable again on the car when I run candump in a parallel terminal. It still fails when I run an equivalent frame dumping process of my own in parallel (essentially the same thing, probably, but it does not print to the terminal). So, clearly the extra bus load on the car contributes to the problem, while having a parallel traffic scanning process alleviates it. Finally, changing the way the frames are sent / received in the flasher to give them more time to breathe (delays, extra delivery checking) does not help.

Now, I know that internal Linux socket queues are a total nightmare and a million problems can sit in there. I do not want to start dissecting the Linux kernel, and I do have a working (but ugly) solution; I am just looking for something obvious I might have missed that someone else could possibly have seen before and can give me a hint about ;)

Logged
d3irb
Full Member
***

Karma: +131/-1
Offline Offline

Posts: 185


« Reply #1 on: June 07, 2021, 02:13:54 PM »

There are a few queues at play here:

* Depending on the CAN device, the CAN device's receive buffer needs to be drained by the kernel before it overflows. In the case of register-based (on-chip) CAN devices this will always happen, because the IRQ will cause the kernel to service the register immediately and copy the data into a socket structure. In the case of SPI-based devices like MCP2515, you can have problems at this layer already, because the SPI request to drain the buffer can end up in a kernel workqueue which can get preempted and cause the CAN device's buffer to overflow. Make sure the SPI tasks have realtime priority. And in the case of USB or other devices, it's at the mercy of the driver and hardware - most USB <-> CAN devices have their own sizeable internal memory buffer in firmware, but some don't and who knows if the firmware for each is implemented properly... Anyway, this doesn't seem to be your issue. If it were, candump would probably make things worse (more preemption/contention for draining kernel-level queues) rather than better.

* OK, now the CAN data lives in a Linux packet structure. It then goes into the socket routing code. The socket routing code has a few queues in it which can fill and are subject to all kinds of wild and wacky Linux socket priority "disciplines." Try replacing those with the simplest one, pfifo_fast: `sudo tc qdisc replace dev can0 root pfifo_fast` (you can also try `noqueue`, but then your application had better be really fast...). I suspect this could be your problem, as having another application (candump) draining the kernel queues seems to be helping you out.

* Next, your socket itself has a recv buffer/queue, called rmem. If you want to max it out, you can try (as root) `echo "26214400" > /proc/sys/net/core/rmem_max && echo "26214400" > /proc/sys/net/core/rmem_default`. This makes up for application software that can't poll its open socket object fast enough to receive all data.

Hope this helps. For me, the last two (setting the qdisc to pfifo_fast and increasing rmem) made even full-speed CAN communication with Python (slowest runtime ever) pretty reasonable. It's funny that we can't handle 500kbit data transfer, which should be peanuts, but such is the price of adding too many layers...
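For the rmem point above, the application can also ask for a larger buffer on its own socket and read back what the kernel actually granted. A minimal sketch, assuming Python on Linux; the helper name `request_rcvbuf` is made up for illustration, and a UDP socket stands in for the CAN socket so it runs without hardware:

```python
import socket


def request_rcvbuf(sock: socket.socket, wanted_bytes: int) -> int:
    """Ask for a bigger receive buffer and return the size the kernel reports."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, wanted_bytes)
    # On Linux the request is capped by net.core.rmem_max, and the value
    # read back includes kernel bookkeeping overhead, so it will usually
    # not equal wanted_bytes exactly.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    # On a real setup this would be a raw CAN socket, e.g.:
    #   s = socket.socket(socket.AF_CAN, socket.SOCK_RAW, socket.CAN_RAW)
    #   s.bind(("can0",))
    # A UDP socket is used here only as a stand-in (assumption).
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        print("granted receive buffer:", request_rcvbuf(s, 1 << 20))
```

If the value read back is much smaller than what was requested, rmem_max is most likely the limit that is being hit.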
« Last Edit: June 07, 2021, 02:15:49 PM by d3irb » Logged
woj

« Reply #2 on: June 08, 2021, 12:32:08 AM »

Good info, thanks! I also now found this: https://rtime.felk.cvut.cz/can/socketcan-qdisc-final.pdf. Seems there is some experimentation ahead of me. My interface already has pfifo_fast by default, so I guess it is either the rmem limits, or actually a different queue is needed, pfifo perhaps. At first pfifo_head_drop sounded like it could do the trick, but thinking about it, it is almost guaranteed to make things much worse (network traffic overwriting the frames I need / wait for).

I will report back when there is something to report.

As for the USB-CAN device - this is actually also something of my own and I initially suspected the problem could be there, but then it worked before, even with a Raspberry Pi 3 as host, so I doubt it: https://github.com/woj76/gs_usb_leonardo
Logged
woj

« Reply #3 on: June 08, 2021, 02:43:20 AM »

No results so far. noqueue actually does work on the bench. Changing the rmem parameters (I checked from within my application that they really take effect) does not change anything, and neither does playing with queue disciplines (actually, using noqueue on the car makes the whole thing lock up immediately, but that was to be expected). And running candump in parallel keeps helping when using increased rmem and / or a different queue...

EDIT: Not only did some other minor tweaks and attempts fail to solve the problem, I have now also had one instance of failed flashing where running candump did not help...
« Last Edit: June 08, 2021, 07:40:58 AM by woj » Logged
d3irb

« Reply #4 on: June 08, 2021, 09:12:14 AM »

Hmm - very odd. I never quite managed to get the right kernel counters to figure out what was actually going on, I just kind of shotgunned networking stuff and hardware at my problems until my code started working. My Tricore BSL's "read" functionality just spams compressed data as quickly as the CAN node thinks the bus is ready to take it, so it's a good test for CAN client software.

As an example of how annoying the levels of complexity are - my code absolutely won't work with MCP2515 over SPI on RPi and friends, because the kernel fails to service SPI workqueues fast enough to drain the 2 Rx slots, even at 500kbaud. But, MCP2517 works perfectly since it has a deeper 32-slot system and the driver is a little bit smarter.

I've since started to move away from SocketCAN - because while I like the simplicity from the application's standpoint, the level of complexity outside the application has been quite painful to deal with.
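On the kernel counters: one cheap thing to look at is the per-interface statistics under sysfs, snapshotted before and after a flash attempt. A rough sketch, assuming the interface is called can0 (hypothetical here) and with helper names invented for illustration. It only covers drops counted at the network-device level, so losses in the USB layer or in the socket receive queue would not show up in it:

```python
from pathlib import Path

# Standard per-netdev counters exposed under /sys/class/net/<iface>/statistics
COUNTERS = ("rx_packets", "rx_dropped", "rx_errors", "rx_over_errors",
            "tx_packets", "tx_dropped", "tx_errors")


def read_counters(iface: str, root: str = "/sys/class/net") -> dict:
    """Return the counters that exist for iface as a name -> int mapping."""
    stats_dir = Path(root) / iface / "statistics"
    result = {}
    for name in COUNTERS:
        counter_file = stats_dir / name
        if counter_file.exists():
            result[name] = int(counter_file.read_text().strip())
    return result


def counter_delta(before: dict, after: dict) -> dict:
    """Difference between two snapshots, for counters present in both."""
    return {name: after[name] - before[name] for name in after if name in before}
```

Usage would be along the lines of `before = read_counters("can0")`, run the flash, then `print(counter_delta(before, read_counters("can0")))`. A non-zero rx_dropped or rx_over_errors delta would point at the device or driver layer rather than at the application.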

Logged
woj

« Reply #5 on: June 08, 2021, 01:13:13 PM »

I have been debugging / looking inside the traffic a bit, and my head now hurts. First of all, I got two single messages in my system logs over the last two days about unexpected echo id, which means that my CAN-USB firmware is sending a USB frame that has a wrong echo field, and that can, as far as I can see for now, only happen if I have a memory leak in my firmware, but I do not see it.

The other thing is, I now see that the flashing freezes after 10 consecutive write CAN frames from the PC to the ECU. I have seen this before when developing my firmware, and it almost certainly means that the PC / host does not receive the corresponding echo USB frames. Now, knowing that this is triggered by the other CAN traffic from the car, it would indicate that either the USB is not keeping up (my firmware decides to skip a USB frame after some timeout if it cannot be pushed to the host, but the timeout is large) or something in the kernel decides to drop the USB frames. Not sure how running candump helps with that; it probably drains the USB queue quicker.

Since both observations mention echo frames, they could be related, but knowing my luck it is coincidental and I have two problems at hand.
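To make the echo reasoning concrete, here is a toy model of the bookkeeping, written under the assumption that the host side keeps a fixed pool of 10 in-flight transmit slots, each tagged with an echo id and only released when the device echoes that id back. The pool size of 10 is inferred from the stall described above, and the class and method names are invented for illustration; this is not the actual driver code.

```python
class EchoSlots:
    """Toy model of a fixed pool of in-flight transmit slots keyed by echo id."""

    def __init__(self, size: int = 10):
        self.free = set(range(size))
        self.in_flight = {}

    def send(self, frame: bytes):
        """Claim a slot for a frame; return its echo id, or None if stalled."""
        if not self.free:
            return None  # every slot is waiting for an echo: transmit stalls
        echo_id = min(self.free)
        self.free.remove(echo_id)
        self.in_flight[echo_id] = frame
        return echo_id

    def on_echo(self, echo_id: int) -> None:
        """Release the slot when the device echoes the frame back."""
        if echo_id not in self.in_flight:
            # corresponds to an "unexpected echo id" style complaint
            raise ValueError(f"unexpected echo id {echo_id}")
        del self.in_flight[echo_id]
        self.free.add(echo_id)


if __name__ == "__main__":
    slots = EchoSlots()
    sent = [slots.send(b"write") for _ in range(10)]
    print("ids handed out:", sent)
    print("11th send without any echo:", slots.send(b"write"))
```

In this model both symptoms come from the same place: lost echoes exhaust the pool after exactly 10 sends, and a corrupted or duplicated echo shows up as an id that is not in flight.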
Logged
d3irb

« Reply #6 on: June 09, 2021, 11:38:25 AM »

The USB queue aspect makes sense. 10 frames is interesting - txqueuelen on the interface isn't playing a role here, right?

I wrote a userspace 8devices Korlan driver using libusb and I found that I needed to poll the USB bulk transfer endpoint in a free-running thread and then queue the data at my driver/app layer, or everything would go haywire very quickly - I couldn't rely on USB/kernel/libusb level buffering for anything. And - the behavior differed substantially based on host platform and even host hardware, which was a nightmare! When I tried to debug this I too ended up in a special layer of hell. It turns out that USB host driver implementation is _wildly_ inconsistent as well - with some drivers polling and buffering in different ways from others.

My final conclusion with SocketCAN, drivers, and Linux in general is that it is most reliable to run a background thread in your application software which drains data from the socket as quickly as possible. This seems to limit the amount of ways the data will get buffered and subsequently destroyed somewhere in the kernel Rube Goldberg apparatus.
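A minimal sketch of that background drain thread, assuming Python and classic CAN frames (16 bytes per `struct can_frame` on a raw socket). The function name is invented for illustration, and the demo uses a socketpair as a stand-in so it runs without a CAN interface:

```python
import queue
import socket
import threading


def start_drain_thread(sock: socket.socket, frames: queue.Queue,
                       frame_size: int = 16) -> threading.Thread:
    """Read from sock as fast as possible and park the data in frames."""

    def run():
        while True:
            try:
                data = sock.recv(frame_size)
            except OSError:
                break  # socket was closed underneath us
            if not data:
                break  # peer closed (only relevant for the stand-in socket)
            frames.put(data)

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


if __name__ == "__main__":
    # Stand-in for a bound raw CAN socket (assumption, no hardware needed).
    tx, rx = socket.socketpair()
    frames = queue.Queue()
    start_drain_thread(rx, frames)
    tx.send(bytes(16))
    print(frames.get(timeout=2).hex())
    tx.close()
```

The rest of the application then only ever reads from `frames`, with a timeout, and never touches the socket for receiving. The unbounded `queue.Queue` is a deliberate choice here: memory in the application is cheap, and a lost frame is not.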
Logged
nyet
Administrator
Hero Member
*****

Karma: +604/-166
Offline Offline

Posts: 12232


WWW
« Reply #7 on: June 09, 2021, 01:46:18 PM »

Quote from: d3irb on June 09, 2021, 11:38:25 AM
My final conclusion with SocketCAN, drivers, and Linux in general is that it is most reliable to run a background thread in your application software which drains data from the socket as quickly as possible. This seems to limit the amount of ways the data will get buffered and subsequently destroyed somewhere in the kernel Rube Goldberg apparatus.

I agree, or a separate process entirely that runs with high prio and you talk to via IPC... a local unix socket or even domain socket for higher level "commands" will do better than the ridiculous libusb stack... as long as your IPC "API" implementation you design is reasonable and relatively timing insensitive.
Logged

ME7.1 tuning guide (READ FIRST)
ECUx Plot
ME7Sum checksum checker/corrrector for ME7.x

Please do not ask me for tunes. I'm here to help people make their own.

Do not PM me technical questions! Please, ask all questions on the forums! Doing so will ensure the next person with the same issue gets the opportunity to learn from your experience.
nyet
« Reply #8 on: June 09, 2021, 01:48:12 PM »

BTW most linux kernel USB drivers are written by interns that have no fucking clue what they are doing.

If you have any real understanding of kernel driver design, you might consider writing your own.
Logged

jcsbanks
Full Member
***

Karma: +15/-3
Offline Offline

Posts: 125


« Reply #9 on: June 09, 2021, 01:58:32 PM »

Can you set some hardware receive filters in your CAN controller to avoid overflowing its buffers if the CAN to SPI to software path is delayed?

Can you try to shut up the other modules during the flash so they don't give you unwanted traffic?
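On the filtering idea: whether the adapter firmware supports hardware filters is not known from this thread, but SocketCAN does offer per-socket filters in the kernel via CAN_RAW_FILTER. A sketch assuming Python on Linux; the IDs 0x7E0/0x7E8 in the comments are purely hypothetical examples, and the helper names are made up. Note that this filter sits after the USB link and the driver, so it can only relieve the socket queue, not the adapter or the USB path.

```python
import socket
import struct


def pack_filters(filters) -> bytes:
    """Pack (can_id, can_mask) pairs the way CAN_RAW_FILTER expects them."""
    return b"".join(struct.pack("=II", can_id, mask) for can_id, mask in filters)


def matches(received_id: int, can_id: int, mask: int) -> bool:
    """The match rule the kernel applies to each received frame."""
    return (received_id & mask) == (can_id & mask)


def apply_filters(sock: socket.socket, filters) -> None:
    """Install the filters on an already created raw CAN socket (Linux only)."""
    sock.setsockopt(socket.SOL_CAN_RAW, socket.CAN_RAW_FILTER,
                    pack_filters(filters))


# Example (hypothetical IDs): only let the ECU's reply ID through.
#   s = socket.socket(socket.AF_CAN, socket.SOCK_RAW, socket.CAN_RAW)
#   apply_filters(s, [(0x7E8, 0x7FF)])
#   s.bind(("can0",))
```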
Logged
woj

« Reply #10 on: June 09, 2021, 02:13:11 PM »

No, txqueuelen has nothing to do with it, that's for host to USB-CAN direction, I have problems the other way (or so I am convinced).

Today I tried making my firmware more persistent in sending USB frames to the host, no luck. What you proposed - a background thread of my own draining the queues - did not help either, but there is more than one way to do it, so I have to review what I have done there. In the meantime I prepped a USB stick with an older Ubuntu and will try it on both my laptops tomorrow.

The really sad part of this is that I have always been a Unix / Linux guy, essentially from the moment it was decently runnable on a PC, which for me was the mid / late 90s. Having to patch almost everything in the early days to make my hardware work never put me off. Now I have gotten too old and too busy to easily tolerate shit like this, yet it keeps popping up all the time; this is the point where one starts to question the faith...
Logged
nyet
« Reply #11 on: June 09, 2021, 03:08:23 PM »

The sad thing is with 4GHz processors and 256G of memory, the already enormous latency of the USB stack has just completely exploded.
Logged

woj

« Reply #12 on: June 10, 2021, 04:53:05 AM »

So nyet was the closest to the truth with his statements, no final solution yet, but I nailed it down to where the problem is.

The older Ubuntu (not really older in fact, the kernel is 5.4.0-XX on both of them as it turns out) gave the following result - it still crashes on my current / newer laptop, but it worked flawlessly (I mean the exact same OS running from the same media) on my older laptop (both Dells from the same Latitude line, the older is from 2015, the newer from 2020). So it is either the USB controller hardware, or its Linux driver (kudos nyet!), or some of the settings for the said driver (in the past I remember some aggressive energy saving modes on USB controllers that were doing wonders to the stability of my USB devices).

So, in a positive mood, I went back to my current laptop with the permanent Ubuntu 20.04, grabbed the USB-C to USB3.0/Ethernet dongle, and connected my interface through that. And of course it just worked! Not only that, the flashing that normally took 1:04 ran for 1:02 now; could be a coincidence though, as it does tend to have +/- 1s differences between runs.

Come to think of it, I am not entirely sure now if I ever flashed the car on this newer laptop before, I know I was using it with the car to read CAN, but perhaps not the flashing. 

So now knowing I can safely flash from my laptop as is I can go for the hunt on the USB settings in the kernel that crap this out...
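One candidate on the settings side is USB runtime power management (autosuspend), which each device exposes under sysfs as `power/control` with the value `auto` or `on`. Nothing in this thread confirms it is the culprit; the sketch below, with an invented helper name, only lists the current settings so the two laptops can be compared:

```python
from pathlib import Path


def usb_power_settings(root: str = "/sys/bus/usb/devices") -> dict:
    """Map each USB device directory name to (product string, power/control)."""
    settings = {}
    for control in sorted(Path(root).glob("*/power/control")):
        device_dir = control.parent.parent
        product_file = device_dir / "product"
        # Not every entry has a product string (interfaces, some hubs).
        product = product_file.read_text().strip() if product_file.exists() else "?"
        settings[device_dir.name] = (product, control.read_text().strip())
    return settings


if __name__ == "__main__":
    for name, (product, control) in usb_power_settings().items():
        print(f"{name:12} {control:5} {product}")
```

A device showing `auto` is allowed to autosuspend; writing `on` to the same file as root keeps it awake, which makes for a quick A/B test.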
Logged
prj
Hero Member
*****

Karma: +905/-420
Offline Offline

Posts: 5790


« Reply #13 on: June 10, 2021, 07:29:56 AM »

You could just use J2534...

*ducks for cover*
Logged

PM's will not be answered, so don't even try.
Log your car properly.
nyet
« Reply #14 on: June 10, 2021, 09:02:34 AM »

Quote from: woj on June 10, 2021, 04:53:05 AM
So it is either the USB controller hardware

A bit of a digression here... in my previous job we used a lot of SPI/I2C with embedded systems. Pretty much every single ARM-based CPU we used had built-in SPI/I2C peripheral hardware, and a driver (written by an intern no doubt) to go with it. Not only was the SPI/I2C driver total garbage, the SPI/I2C core itself was poorly designed and simply could not be made to work properly because of the way it handled both buffering and even the most basic timing of signaling, a lot of which had to be written in software (and/or done badly in the driver).

Moving to different CPUs (or even microcontrollers) proved to be worthless since this same core design was shared by pretty much everyone, most likely because everybody just copy/pastes the same Verilog model.

So we put our own SPI/I2C hardware in an FPGA, which took care of buffering and all the sensitive timing. Done. Problem solved.
Logged
