Embedded Systems: Last Breath Before Reboot

When things go badly wrong in an embedded system, I usually write the error to a special log file in flash and then reboot (there are not many options if, for example, you run out of memory).

I realize that even this can go wrong, so I try to minimize the risk (by not allocating any memory during the final write and by raising the priority of the writing process).

But that relies on someone retrieving the log file. Now I am thinking about sending a message over the network to report the error before rebooting.

On second thought, it would of course be better to send this message after the reboot, but it got me thinking ...

What should I do when I detect an unrecoverable error, and how can I do it as safely as possible in a system that is in an unstable state?

+4
7 answers

One strategy is to use a section of RAM that is not initialized by the power-on/reboot code. It can hold data that survives a reboot; when your application restarts, it can check this memory early in its start-up code and see whether it contains any useful data. If so, write it to a log or send it over a communication channel.

How to reserve a section of RAM that is not initialized depends on the platform, and on whether you are running a full-blown OS (e.g. Linux) that manages RAM initialization. If you are on a small system where RAM is initialized by the C start-up code, your compiler probably has a way to place data (a file-scope variable) in a different section (besides the usual ones, e.g. .bss) that the C start-up code does not initialize.

If the data is not initialized, it will probably contain random values at power-on. To tell random data from valid data, protect it with a checksum or hash, e.g. CRC-32. If your processor can tell you whether the reset was a reboot or a power-on reset, use that as well: after a power-on reset, treat the data as invalid.
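The approach above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `.noinit` section name, the `CRASH_USE_NOINIT` switch, the `crash_record_*` names and the magic value are all hypothetical, and the `power_on_reset` flag is assumed to come from whatever reset-cause register your processor provides. On a real target the linker script must also define the section and the start-up code must leave it alone.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: a GCC-style toolchain whose linker script has a ".noinit"
   section that the C start-up code neither zeroes nor copies. */
#if defined(__GNUC__) && defined(CRASH_USE_NOINIT)
#define CRASH_NOINIT __attribute__((section(".noinit")))
#else
#define CRASH_NOINIT /* host build: ordinary static storage */
#endif

#define CRASH_MAGIC   0x43525348u /* arbitrary marker, hypothetical */
#define CRASH_MSG_LEN 64

typedef struct {
    uint32_t magic;
    uint32_t code;
    char     msg[CRASH_MSG_LEN];
    uint32_t crc; /* CRC-32 over all fields before this one */
} crash_record_t;

static CRASH_NOINIT crash_record_t crash_record;

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320); slow but table-free. */
uint32_t crc32_calc(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Called from the fatal-error path; uses no heap. */
void crash_record_save(uint32_t code, const char *msg)
{
    crash_record.magic = CRASH_MAGIC;
    crash_record.code = code;
    memset(crash_record.msg, 0, sizeof crash_record.msg);
    strncpy(crash_record.msg, msg, sizeof crash_record.msg - 1);
    crash_record.crc = crc32_calc(&crash_record, offsetof(crash_record_t, crc));
}

/* Called early after reset. Returns 1 and copies the record to *out if it
   is valid; the stored record is invalidated either way. */
int crash_record_take(int power_on_reset, crash_record_t *out)
{
    int valid = !power_on_reset
        && crash_record.magic == CRASH_MAGIC
        && crash_record.crc ==
               crc32_calc(&crash_record, offsetof(crash_record_t, crc));
    if (valid)
        *out = crash_record;
    crash_record.magic = 0; /* do not report the same crash twice */
    return valid;
}
```

After a reboot, the application would call `crash_record_take()` before anything else touches that memory, and log or transmit the record only if it returns 1.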

+7

There is no single answer to this question. I would start with a watchdog timer. It reboots the system if everything goes horribly wrong.

Something else to consider: what is not in the log file is also important. If various tasks/actions log regular updates, you can tell which of them went missing.

Finally, if everything goes wrong and you are still running: enter a critical section, disable as much of the OS as possible, shut down peripherals, log as much state information as you can, and then reboot!
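As a rough sketch of that last sequence, assuming a set of platform hooks that you would supply yourself (every name here, `fatal_hooks_t`, `fatal_error` and the `demo_*` stand-ins, is hypothetical and not taken from any real SDK):

```c
#include <stddef.h>

/* Hypothetical platform hooks; supply real implementations per target. */
typedef struct {
    void (*enter_critical)(void);       /* mask interrupts */
    void (*stop_os)(void);              /* suspend scheduler and timers */
    void (*shutdown_peripherals)(void); /* put outputs in a safe state */
    void (*log_state)(const char *why); /* must not allocate memory */
    void (*reboot)(void);               /* does not return on hardware */
} fatal_hooks_t;

static int fatal_depth; /* detects a fault raised inside the handler itself */

void fatal_error(const fatal_hooks_t *h, const char *why)
{
    h->enter_critical();
    if (fatal_depth++ == 0) { /* first entry: run the full sequence */
        h->stop_os();
        h->shutdown_peripherals();
        h->log_state(why);
    }                         /* nested entry: go straight to reboot */
    h->reboot();
    /* On a target, loop forever here so the watchdog fires if reboot()
       ever returns. */
}

/* Host-side stand-ins that only record the call order, for illustration. */
static char trace[8];
static size_t trace_len;
static void note(char c)
{
    if (trace_len < sizeof trace - 1)
        trace[trace_len++] = c;
}
static void demo_critical(void)       { note('C'); }
static void demo_stop_os(void)        { note('O'); }
static void demo_shutdown(void)       { note('P'); }
static void demo_log(const char *why) { (void)why; note('L'); }
static void demo_reboot(void)         { note('R'); }

static const fatal_hooks_t demo_hooks = {
    demo_critical, demo_stop_os, demo_shutdown, demo_log, demo_reboot
};
```

The re-entry counter is a design choice of this sketch: if logging itself faults and the handler is entered again, it skips the logging step and only reboots, so a broken logger cannot keep the system from restarting.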

+5

The one thing you want to be sure of is that you do not corrupt data that may legitimately be in flash. If you try to write information in a failure situation, do it carefully and in the knowledge that the system may be in a very bad state, so that nothing you do makes matters worse.

Typically, when I detect a failure condition, I try to spit the information out of a serial port. A UART driver that is usable from a crashed state is usually quite simple: a polling driver that writes a character to the transmit data register whenever the busy bit is clear. The fault handler generally does not need to play well with multitasking, so polling is fine. And you do not have to worry about incoming data at all, or at least not in any way that polling cannot handle. In fact, a fault handler generally cannot expect multitasking and interrupt handling to work, because the system is in a confused state.
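A polling transmit routine of the kind described above might look like this. The register layout (`uart_regs_t`), the busy-bit position and the `panic_*` names are assumptions made for illustration; a real UART will have its own register map, and the pointer would be the peripheral's base address.

```c
#include <stdint.h>

/* Hypothetical register block; real UARTs differ in layout and bit names. */
typedef struct {
    volatile uint32_t status; /* bit 0 set while the transmitter is busy */
    volatile uint32_t txdata; /* write a byte here to send it */
} uart_regs_t;

#define UART_STATUS_TX_BUSY (1u << 0)

/* Busy-wait until the transmitter is free, then send one character.
   No interrupts, no buffers, no OS calls. */
void panic_putc(uart_regs_t *uart, char c)
{
    while (uart->status & UART_STATUS_TX_BUSY) {
        /* spin; the watchdog bounds how long this can take */
    }
    uart->txdata = (uint8_t)c;
}

void panic_puts(uart_regs_t *uart, const char *s)
{
    while (*s)
        panic_putc(uart, *s++);
}

/* Print a 32-bit value as 8 hex digits without using printf. */
void panic_put_hex32(uart_regs_t *uart, uint32_t v)
{
    static const char digits[] = "0123456789ABCDEF";
    for (int shift = 28; shift >= 0; shift -= 4)
        panic_putc(uart, digits[(v >> shift) & 0xFu]);
}
```

Avoiding `printf` here is deliberate: formatted-output libraries may use the heap, locks or a large amount of stack, none of which can be trusted in a crashed system.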

I try to dump the register file, part of the stack, and any important OS data structures (the current task control block or whatever) that are available and interesting. A watchdog timer is usually responsible for resetting the system in this state, so the fault handler may not get the chance to write everything; dump the most important material first. Do not have the fault handler kick the watchdog: you do not want some bug to prevent the watchdog from resetting the system.

Of course, this is most useful in a development setup, because once the device is released it may have nothing attached to the serial port. If you want to capture crash dumps of this kind after release, they have to be written somewhere suitable, for example a reserved section of flash. Make sure it is not part of the normal data/file system area, or be certain that writing to it cannot corrupt that data. You will also need something that examines this area at boot, so that a dump can be detected and sent somewhere useful. Otherwise there is no point, unless you can get units back from the field and hook them up to a debug setup that can read the data.
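One way to sketch such a reserved area is a single write-once slot that is filled at crash time and collected at the next boot. In this illustration a RAM array stands in for the reserved flash sector, and the `crash_slot_*` names and the slot size are hypothetical; on real hardware the `memset`/`memcpy` calls would be replaced by the flash controller's erase and program operations.

```c
#include <stdint.h>
#include <string.h>

#define SLOT_SIZE 32u
#define ERASED    0xFFu /* erased flash typically reads back as 0xFF */

/* Stand-in for a reserved flash sector outside the file system area. */
static uint8_t crash_slot[SLOT_SIZE];

void crash_slot_erase(void)
{
    memset(crash_slot, ERASED, SLOT_SIZE);
}

int crash_slot_is_empty(void)
{
    for (size_t i = 0; i < SLOT_SIZE; i++)
        if (crash_slot[i] != ERASED)
            return 0;
    return 1;
}

/* Crash time: write only into an empty slot. This never overwrites a dump
   that has not been collected, and a device stuck in a reboot loop does
   not keep wearing the flash. Returns 1 if the dump was stored. */
int crash_slot_write(const uint8_t *data, size_t len)
{
    if (len > SLOT_SIZE || !crash_slot_is_empty())
        return 0;
    memcpy(crash_slot, data, len);
    return 1;
}

/* Boot time: if a dump is present, copy it to out (SLOT_SIZE bytes) and
   erase the slot so the next crash can be recorded. */
int crash_slot_collect(uint8_t *out)
{
    if (crash_slot_is_empty())
        return 0;
    memcpy(out, crash_slot, SLOT_SIZE);
    crash_slot_erase();
    return 1;
}
```

A limitation of this sketch is that a dump consisting entirely of 0xFF bytes is indistinguishable from an empty slot; a header with a magic value or checksum would avoid that.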

+3

I think the best-known example of proper exception handling is a rocket's self-destruction. The exception was caused by an arithmetic overflow in software. Obviously a lot of tracing/recording was involved, because the root cause is known. It was debugged.

So every embedded design should include two features: a recording facility, such as a log file, and a graceful halt, for example disabling all timers/interrupts, closing all ports and sitting in an infinite loop or, in the case of a rocket, self-destruction.

+1

Writing messages to flash before rebooting is often a bad idea on embedded systems. As you pointed out, it may be that nobody ever reads the message, and if the problem is not transient, you wear out the flash.

When the system is in an inconsistent state, there is almost nothing you can do reliably, and the best course is to restart as quickly as possible so that you recover from transient failures (timing, unusual external events, etc.). On some systems I have written a trap handler that uses some reserved memory so that it can set up the serial port and then emit a dump of the stack and register contents without needing additional stack space or clobbering registers.

A simple restart with a dump like that is reasonable: if the problem is transient, restarting resolves it, and you want to keep things simple and let the device carry on. If the problem is not transient, you are not going to make progress anyway, and someone can come and connect a diagnostic device.

A very interesting paper on failures and recovery: "Why Do Computers Stop and What Can Be Done About It?"

+1

For a very simple system, do you have a pin you can toggle? For example, at start-up configure it as an output and drive it high; if things go south (i.e. a watchdog reset is pending), drive it low.

+1

Have you ever considered using a garbage collector?

And I'm not joking.

If you do dynamic allocation at run time in an embedded system, why not reserve a mark buffer and run a mark-and-sweep pass when the excrement hits the rotating fan?

Presumably you have the source for your malloc implementation (or whatever you use), right?

If you do not have the library sources for your embedded system, forget I ever suggested this, but tell the rest of us which hardware it ships in so that we can avoid using it. Yikes (how do you debug without library sources?).

If the system is already dead anyway, who cares how long it takes? Obviously it is not critical that it finish this very instant; if it were, you could not afford to "die" in the first place, could you?

0

Source: https://habr.com/ru/post/1300693/

