Title: Any SocketCAN gurus around here?
Post by: woj on June 07, 2021, 10:57:16 AM

I am suffering from a weird communication instability problem when flashing my car with my own home-cooked flasher on Linux / SocketCAN. Though my flasher is pushed to the limits when it comes to speed / bus load, it worked flawlessly for almost 2 years. With a recent Ubuntu upgrade (or so I assume) from 18.04 to 20.04 it became unstable, most probably because of lost frames, out-of-sequence frames, or bus overload. The result is an interrupted flash.
Now for the weird part - it works stably on my bench ECU with no other CAN devices on the network, while the car, with all the other ECUs spitting CAN traffic, makes it crash. It becomes stable again on the car when I run candump in a parallel terminal. It still fails when I run an equivalent frame dumping process of my own in parallel (essentially the same thing, probably, but it does not print to the terminal). So, clearly the extra bus load on the car contributes to the problem, while having a parallel traffic scanning process alleviates it. Finally, changing the way the frames are sent / received in the flasher to give them more time to breathe (delays, extra delivery checking) does not help.

Now I know that internal Linux socket queues are a total nightmare and a million problems can sit in there, I do not want to start dissecting the Linux kernel, and I do have a working solution (but ugly), I am just looking for something obvious I might have missed that someone else could possibly have seen before and can give me a hint ;)

Title: Re: Any SocketCAN gurus around here?
Post by: d3irb on June 07, 2021, 02:13:54 PM

There are a few queues at play here:
* Depending on the CAN device, the CAN device's receive buffer needs to be drained by the kernel before it overflows. In the case of register-based (on-chip) CAN devices this will always happen, because the IRQ will cause the kernel to service the register immediately and copy the data into a socket structure. In the case of SPI based devices like MCP2515, you can have problems at this layer already, because the SPI request to drain the buffer can end up in a kernel workqueue which can get preempted and cause the CAN device's buffer to overflow. Make sure the spi tasks have realtime priority. And in the case of USB or other devices, it's at the mercy of the driver and hardware - most USB <-> CAN devices have their own sizeable internal memory buffer in firmware, but some don't and who knows if the firmware for each is implemented properly... anyway, it doesn't seem like this is your issue though. If it were, candump would probably make things worse (more preemption/contention for draining kernel-level queues) rather than better.

* OK, now the CAN data lives in a Linux packet structure. It then goes into the socket routing code. The socket routing code has a few queues in it which can fill and are subject to all kinds of wild and wacky Linux socket priority "disciplines." Try replacing those with the simplest one, pfifo_fast: `sudo tc qdisc replace dev can0 root pfifo_fast` (you can also try `noqueue` but then your application better be really fast...). I suspect this could be your problem as having another application (candump) draining the kernel queues seems to be helping you out.

* Next, your socket itself has a recv buffer/queue, called rmem. If you want to max it out, you can try `echo "26214400" > /proc/sys/net/core/rmem_max && echo "26214400" > /proc/sys/net/core/rmem_default` (as root). This makes up for application software that can't poll its open socket object fast enough to receive all data.

Hope this helps.
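A minimal sketch of the per-socket side of the rmem tuning above, assuming Python on Linux. It requests a larger receive buffer with `SO_RCVBUF` and reads back what the kernel reports; on Linux the reported value is typically double the request, and an unprivileged request is capped by `rmem_max`. The helper name `grow_rcvbuf` is made up for illustration, and a plain UDP socket stands in here for a CAN_RAW socket bound to can0.

```python
import socket


def grow_rcvbuf(sock, requested):
    """Ask for a bigger receive buffer and return the size the kernel reports."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


def demo():
    # Stand-in socket; a real flasher would use its own AF_CAN socket instead.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        before = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
        after = grow_rcvbuf(s, 1 << 20)
        return before, after
    finally:
        s.close()
```

Comparing the two values returned by `demo()` is one way to confirm from inside the application whether a changed `rmem_max` actually took effect for the socket.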
For me, the last two (setting the qdisc to pfifo_fast and increasing rmem) made even full-speed CAN communication with Python (slowest runtime ever) pretty reasonable. It's funny that we can't handle 500kbit data transfer, which should be peanuts, but such is the price of adding too many layers...

Title: Re: Any SocketCAN gurus around here?
Post by: woj on June 08, 2021, 12:32:08 AM

Good info, thanks! I also now found this: https://rtime.felk.cvut.cz/can/socketcan-qdisc-final.pdf. Seems there is some experimentation in front of me. My interface already has pfifo_fast by default, so I guess it is either the rmem limits, or I actually need a different queue, pfifo perhaps. At first pfifo_head_drop sounded like it could do the trick, but thinking about it, it is almost guaranteed to make things much worse (network traffic overwriting the frames I need / wait for).
I will report back when there is something to report. As for the USB-CAN device - this is actually also something of my own and I initially suspected the problem could be there, but then it worked before, even with a Raspberry Pi 3 as host, so I doubt it: https://github.com/woj76/gs_usb_leonardo

Title: Re: Any SocketCAN gurus around here?
Post by: woj on June 08, 2021, 02:43:20 AM

No results so far, noqueue actually does work on the bench. Changing rmem parameters (I checked from within my application that the change actually took effect) does not change anything, and neither does playing with queue disciplines (actually, doing noqueue when on the car makes the whole thing lock up immediately, but that was to be expected). And running candump in parallel keeps helping out when using increased rmem and / or a different queue...
EDIT: Not only did some other minor tweaks and attempts fail to solve the problem, now I also got one instance of failed flashing where running candump did not help...

Title: Re: Any SocketCAN gurus around here?
Post by: d3irb on June 08, 2021, 09:12:14 AM

Hmm - very odd. I never quite managed to get the right kernel counters to figure out what was actually going on, I just kind of shotgunned networking stuff and hardware at my problems until my code started working. My Tricore BSL's "read" functionality just spams compressed data as quickly as the CAN node thinks the bus is ready to take it, so it's a good test for CAN client software.
As an example of how annoying the levels of complexity are - my code absolutely won't work with MCP2515 over SPI on RPi and friends, because the kernel fails to service SPI workqueues fast enough to drain the 2 Rx slots, even at 500kbaud. But, MCP2517 works perfectly since it has a deeper 32-slot system and the driver is a little bit smarter. I've since started to move away from SocketCAN - because while I like the simplicity from the application's standpoint, the level of complexity outside the application has been quite painful to deal with.

Title: Re: Any SocketCAN gurus around here?
Post by: woj on June 08, 2021, 01:13:13 PM

I have been debugging / looking inside the traffic a bit, and my head now hurts. First of all, I got two single messages in my system logs over the last two days about unexpected echo id, which means that my CAN-USB firmware is sending a USB frame that has a wrong echo field, and that can, as far as I can see for now, only happen if I have a memory leak in my firmware, but I do not see it.
The other thing is, I now see that the flashing freezes after 10 consecutive write CAN frames from PC to the ECU. This I have seen before when developing my firmware, and it almost certainly means that the PC / host does not receive a corresponding echo USB frame. Now, knowing that this is triggered by the other CAN traffic from the car would indicate that the USB is not catching up (my firmware decides to skip the USB frame after some timeout if it cannot be pushed to the host, but that timeout is large) or something in the kernel decides to drop the USB frames. Not sure how running candump helps that, probably it drains the USB queue quicker. Since both observations mention echo frames, they could be related, but knowing my luck it is coincidental and I have two problems at hand.

Title: Re: Any SocketCAN gurus around here?
Post by: d3irb on June 09, 2021, 11:38:25 AM

The USB queue aspect makes sense. 10 frames is interesting - txqueuelen on the interface isn't playing a role here, right?
I wrote a userspace 8devices Korlan driver using libusb and I found that I needed to poll the USB bulk transfer endpoint in a free-running thread and then queue the data at my driver/app layer, or everything would go haywire very quickly - I couldn't rely on USB/kernel/libusb level buffering for anything. And - the behavior differed substantially based on host platform and even host hardware, which was a nightmare! When I tried to debug this I too ended up in a special layer of hell. It turns out that USB host driver implementation is _wildly_ inconsistent as well - with some drivers polling and buffering in different ways from others.

My final conclusion with SocketCAN, drivers, and Linux in general is that it is most reliable to run a background thread in your application software which drains data from the socket as quickly as possible. This seems to limit the number of ways the data will get buffered and subsequently destroyed somewhere in the kernel Rube Goldberg apparatus.

Title: Re: Any SocketCAN gurus around here?
Post by: nyet on June 09, 2021, 01:46:18 PM

My final conclusion with SocketCAN, drivers, and Linux in general is that it is most reliable to run a background thread in your application software which drains data from the socket as quickly as possible. This seems to limit the number of ways the data will get buffered and subsequently destroyed somewhere in the kernel Rube Goldberg apparatus.

I agree, or a separate process entirely that runs with high prio and you talk to via IPC... a local unix socket or even domain socket for higher level "commands" will do better than the ridiculous libusb stack... as long as the IPC "API" implementation you design is reasonable and relatively timing insensitive.

Title: Re: Any SocketCAN gurus around here?
Post by: nyet on June 09, 2021, 01:48:12 PM

BTW most linux kernel USB drivers are written by interns that have no fucking clue what they are doing.
If you have any real understanding of kernel driver design, you might consider writing your own.

Title: Re: Any SocketCAN gurus around here?
Post by: jcsbanks on June 09, 2021, 01:58:32 PM

Can you set some hardware receive filters in your CAN controller to avoid overflowing its buffers if the CAN to SPI to software path is delayed?
Can you try to shut up the other modules during the flash so they don't give you unwanted traffic?

Title: Re: Any SocketCAN gurus around here?
Post by: woj on June 09, 2021, 02:13:11 PM

No, txqueuelen has nothing to do with it, that's for host to USB-CAN direction, I have problems the other way (or so I am convinced).
Today I tried making my firmware more persistent in sending USB frames to the host, no luck. What you proposed - the background thread of my own draining the queues - that did not help either, but there is more than one way to do it, I have to review what I have done there. In the meantime I prepped a USB stick with older Ubuntu and will try it on both my laptops tomorrow.

The really sad part of this is that I always have been a Unix / Linux guy, essentially from the moment it was decently runnable on a PC, which for me was the mid / late 90s. Having to patch almost everything in the early days to make my hardware work never put me off. Now I have gotten too old and too busy to easily tolerate shit like this, yet it keeps popping up all the time, this is the point where one starts to question the faith...

Title: Re: Any SocketCAN gurus around here?
Post by: nyet on June 09, 2021, 03:08:23 PM

The sad thing is with 4GHz processors and 256G of memory, the already enormous latency of the USB stack has just completely exploded.
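For reference, the background drain thread suggested earlier in the thread could look roughly like the sketch below. This is only one of the "more than one way to do it", not what either poster actually wrote. The function name `start_drain_thread` is hypothetical; the 16-byte layout `=IB3x8s` is the classic `can_frame` layout used in the Python `socket` documentation. On a real system `sock` would be a `socket.socket(socket.AF_CAN, socket.SOCK_RAW, socket.CAN_RAW)` bound to `("can0",)`.

```python
import queue
import socket
import struct
import threading

# Classic CAN frame as delivered by a CAN_RAW socket: id, dlc, 3 pad bytes, 8 data bytes.
CAN_FRAME = struct.Struct("=IB3x8s")


def start_drain_thread(sock, frames, stop):
    """Pull frames off the socket as fast as possible and park them in a queue."""

    def run():
        sock.settimeout(0.1)  # wake up regularly so the stop flag is noticed
        while not stop.is_set():
            try:
                raw = sock.recv(CAN_FRAME.size)
            except socket.timeout:
                continue
            except OSError:
                break  # socket was closed underneath us
            if len(raw) != CAN_FRAME.size:
                continue  # ignore anything that is not a full classic frame
            can_id, dlc, data = CAN_FRAME.unpack(raw)
            frames.put((can_id, data[:dlc]))

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread
```

The flasher logic then reads from the `queue.Queue` with whatever timeouts it needs, so a slow step in the application no longer decides how long frames sit in the kernel socket buffer.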
Title: Re: Any SocketCAN gurus around here?
Post by: woj on June 10, 2021, 04:53:05 AM

So nyet was the closest to the truth with his statements, no final solution yet, but I nailed it down to where the problem is.
The older Ubuntu (not really older in fact, the kernel is 5.4.0-XX on both of them as it turns out) gave the following result - it still crashes on my current / newer laptop, but it worked flawlessly (I mean the exact same OS running from the same media) on my older laptop (both Dells from the same Latitude line, the older is from 2015, the newer from 2020). So it is either the USB controller hardware, or its Linux driver (kudos nyet!), or some of the settings for the said driver (in the past I remember some aggressive energy saving modes on USB controllers that were doing wonders to the stability of my USB devices).

So, in a positive mood, I went back to my current laptop with the permanent Ubuntu 20.04, grabbed the USB-C to USB3.0/Ethernet dongle, and connected my interface through that. And of course it just worked! Not only that, the flashing that normally took 1:04 ran for 1:02 now, could be a coincidence though, it does tend to have +/- 1s differences between runs. Come to think of it, I am not entirely sure now if I ever flashed the car on this newer laptop before, I know I was using it with the car to read CAN, but perhaps not for flashing.

So now, knowing I can safely flash from my laptop as is, I can go for the hunt on the USB settings in the kernel that crap this out...

Title: Re: Any SocketCAN gurus around here?
Post by: prj on June 10, 2021, 07:29:56 AM

You could just use J2534...
*ducks for cover*

Title: Re: Any SocketCAN gurus around here?
Post by: nyet on June 10, 2021, 09:02:34 AM

So it is either the USB controller hardware

A bit of a digression here... in my previous job we used a lot of SPI/I2C with embedded systems. Pretty much every single ARM based CPU we used had built in SPI/I2C peripheral hardware, and a driver (written by an intern no doubt) to go with it. Not only was the SPI/I2C driver total garbage, the SPI/I2C core itself was poorly designed and simply could not be made to work properly because of the way it handled both buffering and even the most basic timing of signaling, a lot of which had to be written in software (and/or done badly in the driver). Moving to different CPUs (or even microcontrollers) proved to be worthless since this same core design was shared by pretty much everyone, most likely because everybody just copy/pastes the same Verilog model.

So we put our own SPI/I2C hardware in an FPGA, which took care of buffering and all the sensitive timing. Done. Problem solved.

Title: Re: Any SocketCAN gurus around here?
Post by: woj on June 11, 2021, 12:42:34 AM

Can you set some hardware receive filters in your CAN controller to avoid overflowing its buffers if the CAN to SPI to software path is delayed? Can you try to shut up the other modules during the flash so they don't give you unwanted traffic?

I missed that one. Yes, I could and it was one of my ideas. But since I want to have full SocketCAN compatibility and universal firmware and gs_usb does not support hardware filters, that was no option. As for shutting the other modules - that's precisely why I have developed all this software, so that I can go to the car, plug in the cable, press ENTER and have the ECU updated for the next experiment in 1 minute, not to bother with unplugging things, taking fuses out, and what not.

Title: Re: Any SocketCAN gurus around here?
Post by: woj on June 11, 2021, 12:45:37 AM

You could just use J2534...
*ducks for cover*

You are not the kind of a person that ducks for cover, you won't fool us... ;P J2534 was on my to-investigate list, I even found some open source software for it at some point, only I could not get good info on which of the cheap clones would perform well enough for a safe buy, got this gs_usb stuff working in the meantime, so it was forgotten.

Title: Re: Any SocketCAN gurus around here?
Post by: Basano on June 11, 2021, 08:18:41 AM

I missed that one. Yes, I could and it was one of my ideas. But since I want to have full SocketCAN compatibility and universal firmware and gs_usb does not support hardware filters, that was no option.

I don’t know your particular vehicle setup (I don’t think it was VAG) but typically if you are plugging into the external OBD connector, then that hangs off the CAN gateway which is already filtering out all the internal broadcasts and only allowing diagnostic traffic to traverse? If there indeed is a vehicle CAN gateway in your scenario minimising broadcasts and securing the inside comms, then the candump workaround is quite strange indeed.

As for shutting the other modules - that's precisely why I have developed all this software, so that I can go to the car, plug in the cable, press ENTER and have the ECU updated for the next experiment in 1 minute, not to bother with unplugging things, taking fuses out, and what not.

I think he may have meant UDS command $28 CommunicationControl “Mute”, to politely ask modules to stop broadcasting. Not actually pulling out fuses or physical interventions.

*ducks for cover*

Hahaha, you are not fooling us ;D

Title: Re: Any SocketCAN gurus around here?
Post by: woj on June 11, 2021, 08:59:04 AM

It's a Fiat, ME7.9.10. The mute command is something new to me, I have to check it out, that would be ideal. Technically it is still KWP2000, not yet UDS, though I know they are very similar. I knew about reset / restart, which I also find very useful.
BTW, thanks to all contributing, I again learned a lot just from chatting about a silly problem I got.

Title: Re: Any SocketCAN gurus around here?
Post by: woj on June 11, 2021, 12:00:42 PM

So, you opened a small can of worms :D Surely there is a mute KWP2000 command, and it works for the Body, ABS, EPS, and engine ECUs. There is one more module on the high speed bus to which I cannot open a successful diag session (I still have to check a couple of ECU ids, perhaps I missed it), according to my service manual that would be the tilt/yaw module, and the messages (also quite dense) from that one remain on the bus. The mute also works for only 5 seconds, so I will have to repeat it periodically when flashing, along with a periodic tester present to all involved ECUs. Not an immediate thing to try though, need to do a little bit of coding for this.
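The periodic re-muting described above amounts to a simple schedule. The sketch below only plans which request goes to which ECU at which time; it does not touch the bus. Everything in it is an assumption for illustration: the helper name `keepalive_plan`, the 2-second period (chosen to stay well inside the observed 5-second mute), the ECU labels, and the bare service bytes. `0x28` is the "$28" mentioned in the thread, `0x3E` is assumed to be TesterPresent, and the real requests will need ECU-specific parameter bytes and addressing.

```python
import math

MUTE = bytes([0x28])            # "$28" from the thread; parameter bytes left out on purpose
TESTER_PRESENT = bytes([0x3E])  # assumed TesterPresent service ID


def keepalive_plan(ecus, duration_s, period_s):
    """Return (time_s, ecu, request) tuples covering the whole flash duration."""
    ticks = math.ceil(duration_s / period_s)
    plan = []
    for i in range(ticks):
        t = i * period_s
        for ecu in ecus:
            plan.append((t, ecu, MUTE))
            plan.append((t, ecu, TESTER_PRESENT))
    return plan


# Example: a roughly 64 s flash, re-muting three modules every 2 s.
example = keepalive_plan(["body", "abs", "eps"], 64, 2)
```

A sender thread could walk such a plan alongside the flashing loop, which keeps the timing of the mute traffic independent of the transfer itself.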
The downside of this is that I get an even larger and more beautiful Xmas tree of DTCs in all modules, 3 on average in each, and also the Body/Dash ECU is very aggressive electrically on opening the diagnostic session, including turning on the headlights for a second or two. That is not a show-stopper of course, I have had full DTC cleaning implemented in my flasher for some time now, it is just annoying.

Title: Re: Any SocketCAN gurus around here?
Post by: prj on June 12, 2021, 12:41:15 AM

You are not the kind of a person that ducks for cover, you won't fool us... ;P J2534 was on my to investigate list, I even found some open source software for it at some point, only I could not get good info on which of the cheap clones would perform good for a safe buy, got this gs_usb stuff working in the meantime, so it was forgotten.

J2534 is a combination of hardware, driver and Windows DLL provided by the author of the tool. So in many cases it works very well because the manufacturer has control of the entire process. It just stores the DLL location under a certain registry key, which you enumerate in your tool to get the device list, and after that you use a standardized API to access it. Windows only though of course.

Title: Re: Any SocketCAN gurus around here?
Post by: Basano on June 12, 2021, 04:21:12 AM

I’m glad you liked the small can of worms ;)
But this got me thinking a bit and the discussion is quite interesting anyway. I looked at a handful of candump logs, old and new, of various tools when they do a flash. Yes, they send the mute command $28, but only to the target and NOT to everything else? So how does that help? What’s the point of muting the target if it’s going to be quiet anyway as soon as you descend into the bootloader? And all those other nodes will keep chattering away on the bus. So this is a bigger can of worms ;D

Or will they? And then I remembered my OSEK Network Management. You have your physical ECUs strung out on the bus, but logically there is a Network Management Ring Topology with a token being passed around the circle. The token contains various entries like previous station ID, next station ID, bus status (e.g. starting up, shutting down) and there are various operations like join the bus, leave the bus, elect a new master. A lot of these CAN nodes have permanent power (terminal 30 & 30a) and rely upon this OSEK network management to know when to switch on and when to go dormant again.

So I’m speculating now that muting only the engine ECU in turn has a knock-on effect via OSEK Network Management on the other ECUs, and they may become dormant? This is purely speculation, I don’t have logs to be more conclusive. But it could explain why there are not multiple simultaneous diagnostic sessions open to each module all at the same time and all broadcasting a repetitive mute…

Here are some more details on the Network Management, albeit VAG specific. I took it off the internet many years ago, but sadly I can’t find the original page again now to give the author full credit. Whoever you are, thank you.

Code:
Title: Re: Any SocketCAN gurus around here?
Post by: d3irb on June 12, 2021, 01:04:21 PM

J2534 is a combination of hardware, driver and Windows DLL provided by the author of the tool. So in many cases it works very well because the manufacturer has control of the entire process. It just stores the DLL location under a certain registry key, which you enumerate in your tool to get the device list, and after that you use a standardized API to access it. Windows only though of course.

The other reason why J2534 works really well is that ISO15765 / ISO-TP is part of the API layer standard. Although it is not _strictly_ required to do so by the standard, most tools implement the timing sensitive ISO15765 flowcontrol messages and framing inside of the firmware at the end of the USB dumpster fire. So, the USB part becomes much less timing sensitive and much less noisy in terms of back-and-forth, which paints over bad USB drivers pretty nicely.

That is, instead of sending hundreds of framed packets (and needing to wait between to see if a timing change is in the mix), with most J2534 devices, to send a 4k TransferData command, the DLL is just sending 4k down a bulk endpoint and the device does the rest. At least OpenPort and Panda work this way and I would be shocked if other vendors don't implement with the same strategy.

This makes OpenPort hardware a good choice even with massive perversions like re-implementing the J2534 interface on Linux: https://github.com/NikolaKozina/j2534/blob/master/j2534.c .

Title: Re: Any SocketCAN gurus around here?
Post by: woj on June 13, 2021, 12:34:33 PM

This makes OpenPort hardware a good choice even with massive perversions like re-implementing the J2534 interface on Linux: https://github.com/NikolaKozina/j2534/blob/master/j2534.c .

That would be the piece of software I mentioned I found at some point when considering J2534. Something came up to be done on the house, so any car dev stuff will have to wait for now.
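To put a number on the "hundreds of framed packets" mentioned above, here is a rough sketch of classic ISO-TP (ISO 15765-2) segmentation with 8-byte CAN frames and normal addressing. The function name `isotp_segment` is hypothetical, padding and the receiver's flow control frames are deliberately left out, and 4095 bytes is used because that is the largest length a classic 12-bit first frame can announce.

```python
def isotp_segment(payload):
    """Split a payload into classic ISO-TP frames (no padding, no flow control)."""
    n = len(payload)
    if n == 0:
        raise ValueError("empty payload")
    if n <= 7:
        # Single frame: PCI nibble 0, low nibble carries the length.
        return [bytes([n]) + payload]
    if n > 0xFFF:
        raise ValueError("classic first frame length is limited to 4095 bytes")
    # First frame: PCI nibble 1, 12-bit length, then the first 6 payload bytes.
    frames = [bytes([0x10 | (n >> 8), n & 0xFF]) + payload[:6]]
    seq = 1
    for off in range(6, n, 7):
        # Consecutive frame: PCI nibble 2, 4-bit sequence number that wraps at 16.
        frames.append(bytes([0x20 | seq]) + payload[off:off + 7])
        seq = (seq + 1) & 0x0F
    return frames


frame_count = len(isotp_segment(bytes(4095)))  # 1 first frame + 585 consecutive frames
```

With SocketCAN in raw mode each of those 586 frames (plus the echo of each) crosses the USB link individually, whereas a device that does the framing in firmware receives the block in one bulk transfer.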
Title: Re: Any SocketCAN gurus around here?
Post by: prj on June 14, 2021, 08:29:05 AM

At least OpenPort and Panda work this way and I would be shocked if other vendors don't implement with the same strategy.

Say hello to PEAK PCAN.

Title: Re: Any SocketCAN gurus around here?
Post by: jcsbanks on June 14, 2021, 10:20:16 AM

Although they are a bit physically fragile on the USB connector according to my colleague who uses them a lot in his other business, OpenPort 2.0 works well for the price. Unless you have a need for your own hardware, it is a good choice for most and will save you much time in debugging boring stuff and allow you to get onto the interesting stuff.

I got mine after a Kvaser J2534 tool failed to adjust STMIN when asked and could not complete a flash through the frame dropping VAG gateways, even though the ECU advertises STMIN of 0 and does it on the bench. Bug reported to Kvaser and several months later they said they'd fixed it. But the Kvaser is just so convenient and easy for CAN bus analysis.

Some criticise the J2534 spec for being too vague on many details and causing compatibility issues. For its age, J2534 04.04 is not bad as long as you are not allergic to Windows, and many alternatives don't perform like a proper J2534 device should.