Preserving debugging breadcrumbs across reboots in Cortex-M

Debugging embedded systems can be hard during development, even with the best tools. A good debug probe certainly makes life easier, but what do you do after the product has shipped? What if a customer complains that something strange happens from time to time, or a bug makes the device reboot, but only once a week? You make the firmware gather diagnostic information for you. This is the first post in a series.

Some catastrophic faults are probably impossible for the MCU to handle by itself, but many kinds of faults can be caught by the firmware, which can then leave enough information to fix them afterwards. These include bugs like:

  1. Hard faults (on ARM)
  2. Invalid memory writes (on ARM)
  3. Runtime assertion failures
  4. Stack overflows (not all)

Intentional reboots (such as configuration changes or bootloader requests) can also be logged by the firmware. This post is written with Cortex-M in mind, but parts of it also apply to other MCUs.

How to store information across reboots?

To diagnose a crash or reboot, some information has to be made available to the firmware that runs after the crash. Very little data is actually needed to diagnose most faults, so storing debugging breadcrumbs in RAM is an easy option. Microcontroller RAM keeps its contents across a reset as long as power is maintained (only the values at power-on are undefined). However, all “regular” variables are initialized by the startup code, either to zero or to their specified values. So how can you “cheat” the C startup code into not initializing some data?

The basic options are:

  1. make a special section in the linker file and place your debugging breadcrumbs there
  2. tell the linker that the RAM is smaller and starts somewhere else 🙂

I chose the second option because of its simplicity. In my MCU I reserved 64 bytes at the beginning of RAM. To do this, the RAM entry in the MEMORY block of the linker script has to be modified. With the GNU linker, ORIGIN must be increased by 64 bytes (0x40, since it is usually written in hex) and LENGTH must be decreased by 64 bytes (usually written in decimal).

EFM32GG example

STM32L151 example

Changing the start of RAM and its size makes the linker, and therefore the compiled C code, ignore this part of memory completely. The memory is of course still there, but no code touches it unless you access it explicitly. Accessing this block is as simple as assigning the start-of-RAM address to a pointer. For example: uint32_t *block = (uint32_t*)0x20000000;

It is important that the remaining RAM stays word-aligned (i.e. its start address and size are multiples of 4). Otherwise the linker may place variables at unaligned addresses, which can lead to strange problems.
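The ORIGIN and LENGTH arithmetic and the alignment requirement can be double-checked at compile time. The following is only a sketch: the RAM base address and size are assumed values for illustration (take the real ones from your datasheet), and all macro names are made up for this example.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* Assumed values for illustration only, not taken from a specific part. */
#define RAM_BASE        0x20000000u      /* physical start of SRAM */
#define RAM_SIZE        (128u * 1024u)   /* physical SRAM size in bytes */
#define RESERVED_BYTES  64u              /* block hidden from the linker */

/* Values that go into the RAM entry of the linker script. */
#define LINKER_RAM_ORIGIN  (RAM_BASE + RESERVED_BYTES)
#define LINKER_RAM_LENGTH  (RAM_SIZE - RESERVED_BYTES)

/* The reserved block and the remaining RAM must both stay word-aligned. */
static_assert(RESERVED_BYTES % 4u == 0u, "reserved block must be a multiple of 4");
static_assert(LINKER_RAM_ORIGIN % 4u == 0u, "RAM origin must be word-aligned");
static_assert(LINKER_RAM_LENGTH % 4u == 0u, "RAM length must be a multiple of 4");
```

With these assumed numbers the RAM entry would use ORIGIN = 0x20000040 and LENGTH = 131008.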

Diagnostic module

The diagnostic module is responsible for storing and restoring debugging breadcrumbs across reboots. It also handles intentional reboots and requests to start the bootloader.

Header

The central part is the crash_info_t struct, which holds a magic value, a checksum and values that depend on the kind of fault. reg_CFSR and the other register fields are specific to Cortex-M. The magic value determines which of the union members is valid.

At runtime any code can call crash_handler_get_info to get the debugging struct and store it permanently (e.g. in flash) or send it somewhere for processing (telemetry!).
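To make the description of the header more concrete, here is a minimal sketch of what such a struct could look like. Only crash_info_t, reg_CFSR and crash_handler_get_info are names from this post. The magic constants, the other field names, the union layout and the function's return type are assumptions made for illustration.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* Hypothetical magic values, one per kind of record. */
#define CRASH_MAGIC_RESET       0xC0DE0001u
#define CRASH_MAGIC_BOOTLOADER  0xC0DE0002u
#define CRASH_MAGIC_STACK       0xC0DE0003u
#define CRASH_MAGIC_FAULT       0xC0DE0004u

typedef struct {
    uint32_t magic;      /* tells which union member is valid */
    uint32_t checksum;   /* XOR checksum over the rest of the record */
    union {
        struct { uint32_t line; } reset;                /* caller line number */
        struct { char task_name[16]; } stack_overflow;  /* FreeRTOS task name */
        struct {                                        /* Cortex-M fault status */
            uint32_t reg_CFSR;
            uint32_t reg_HFSR;
            uint32_t reg_MMFAR;
            uint32_t reg_BFAR;
        } fault;
    } u;
} crash_info_t;

/* The record has to fit into the 64 reserved bytes and stay word-sized. */
static_assert(sizeof(crash_info_t) <= 64, "record does not fit the reserved block");
static_assert(sizeof(crash_info_t) % 4 == 0, "record must be a whole number of words");

/* Assumed signature: returns the saved record, or NULL if there is none. */
const crash_info_t *crash_handler_get_info(void);
```

Keeping every member a multiple of 4 bytes makes a word-wise XOR checksum straightforward.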

The code

This module supports the following scenarios:

  1. Firmware requests reboot and bootloader entry
  2. Bootloader checks if the application requested its start
  3. Firmware requests reboot
  4. Stack overflow reported by FreeRTOS

More features will be described in future posts. The code comes from an EFM32 project, hence the em_dbg.h and em_device.h includes.

The most important piece of data is the _crash_info_ram pointer to the uninitialized part of RAM (the part carved out by the linker script changes). To distinguish junk from proper debugging information, the init code first checks the RAM address for one of the magic values and then computes an XOR checksum. If both checks pass, the data is copied to a regular variable and made available to the rest of the code. The original data is then erased so that it is not counted twice (for example after a brief power cut, when the RAM can retain its data).
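The init sequence described above (check the magic value, verify the XOR checksum, copy, erase) could be sketched as follows. This is not the original module: the record layout, the magic constants and crash_handler_init are assumptions, and a static buffer stands in for the reserved RAM block so the sketch also runs on a PC. On the target, _crash_info_ram would point at the start of RAM instead.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CRASH_WORDS        16u          /* 64 reserved bytes as 32-bit words */
#define CRASH_MAGIC_RESET  0xC0DE0001u  /* hypothetical magic values */
#define CRASH_MAGIC_STACK  0xC0DE0003u

typedef struct {
    uint32_t magic;
    uint32_t checksum;
    uint32_t payload[CRASH_WORDS - 2u];
} crash_block_t;

static_assert(sizeof(crash_block_t) == 64, "record must fill the reserved block");

/* Stand-in for the reserved RAM; on target: (volatile crash_block_t *)0x20000000 */
static crash_block_t fake_ram;
static volatile crash_block_t *const _crash_info_ram = &fake_ram;

static crash_block_t crash_info;   /* regular copy for the rest of the code */
static bool crash_info_valid;

/* XOR of every word except the checksum field itself. */
static uint32_t crash_checksum(const volatile crash_block_t *b)
{
    uint32_t sum = b->magic;
    for (size_t i = 0; i < CRASH_WORDS - 2u; i++) {
        sum ^= b->payload[i];
    }
    return sum;
}

void crash_handler_init(void)
{
    const uint32_t magic = _crash_info_ram->magic;

    crash_info_valid = false;
    if (magic != CRASH_MAGIC_RESET && magic != CRASH_MAGIC_STACK) {
        return;   /* unknown magic: junk, e.g. after power-on */
    }
    if (crash_checksum(_crash_info_ram) != _crash_info_ram->checksum) {
        return;   /* known magic but corrupted contents */
    }
    /* Copy word by word and erase the original so it is reported only once. */
    crash_info.magic = magic;
    crash_info.checksum = _crash_info_ram->checksum;
    for (size_t i = 0; i < CRASH_WORDS - 2u; i++) {
        crash_info.payload[i] = _crash_info_ram->payload[i];
        _crash_info_ram->payload[i] = 0u;
    }
    _crash_info_ram->magic = 0u;
    _crash_info_ram->checksum = 0u;
    crash_info_valid = true;
}

const crash_block_t *crash_handler_get_info(void)
{
    return crash_info_valid ? &crash_info : NULL;
}
```

The copy goes through the volatile pointer one word at a time, so the compiler cannot optimize away the reads or the erase.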

Reset request

The application may want to perform an intentional reset, for example when applying new settings. In this case it uses the SYSTEM_RESET() macro, which passes the caller's line number to the system_reset_func function. The line number is a very easy way to distinguish reset sources, as it is very unlikely that resets are requested from exactly the same line number in different source files. The function sets the appropriate magic value, stores the caller's line number, calculates the checksum and requests a system reset.

The DSB instruction makes the CPU wait until all outstanding memory writes have completed. Depending on the exact Cortex-M core and the final chip configuration, there may be some caching or buffering involved. Without the barrier, even though the code was executed and the data checksummed, the data might not reach RAM before the whole MCU is reset.
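A reset request along these lines could look like the sketch below. SYSTEM_RESET() and system_reset_func are the names used in this post. The record layout, the magic constant and platform_reset are assumptions, and the actual reset is replaced by a counter so the sketch runs on a PC.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

#define CRASH_MAGIC_RESET 0xC0DE0001u   /* hypothetical magic value */

typedef struct {
    uint32_t magic;
    uint32_t checksum;
    uint32_t line;
} reset_record_t;

static_assert(sizeof(reset_record_t) <= 64, "record does not fit the reserved block");

/* Stand-in for the reserved RAM block so the sketch runs on a PC. */
static reset_record_t fake_ram;
static volatile reset_record_t *const _crash_info_ram = &fake_ram;

static unsigned reset_requests;   /* host-side replacement for a real reset */

/* Hypothetical hook for the target-specific part. On Cortex-M with CMSIS
 * this would be __DSB() followed by NVIC_SystemReset(), which never returns. */
static void platform_reset(void)
{
    reset_requests++;
}

void system_reset_func(uint32_t line)
{
    _crash_info_ram->magic = CRASH_MAGIC_RESET;
    _crash_info_ram->line = line;
    /* XOR checksum over all words except the checksum itself. */
    _crash_info_ram->checksum = CRASH_MAGIC_RESET ^ line;
    platform_reset();
}

/* Captures the line number of the call site. */
#define SYSTEM_RESET() system_reset_func((uint32_t)__LINE__)
```

Because SYSTEM_RESET() is a macro, __LINE__ expands at the call site rather than inside system_reset_func.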

Bootloader entry request and check

The bootloader's linker script must have its RAM entry modified in the same way as the main application's. Bootloaders often do entry checks at power-on (e.g. press and hold a button while powering on to enter the bootloader). Being able to reset into the bootloader from the application makes things easier for the user and also allows fully remote-controlled firmware updates.

At startup the bootloader checks the memory for the bootloader-request magic value and a valid checksum. If they do not match, the memory is left alone so that the application can read it. This way a reboot after a crash (which always passes through the bootloader) will not destroy the debugging breadcrumbs.
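The bootloader-side check could be sketched like this. The function name, the record layout and the magic constants are assumptions. Clearing the request once it has been accepted is also my assumption, made so that the next reset starts the application again. A static buffer again stands in for the reserved RAM block.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdbool.h>
#include <stdint.h>

#define CRASH_MAGIC_RESET       0xC0DE0001u   /* hypothetical magic values */
#define CRASH_MAGIC_BOOTLOADER  0xC0DE0002u

typedef struct {
    uint32_t magic;
    uint32_t checksum;
    uint32_t line;
} boot_record_t;

static_assert(sizeof(boot_record_t) <= 64, "record does not fit the reserved block");

/* Stand-in for the reserved RAM block so the sketch runs on a PC. */
static boot_record_t fake_ram;
static volatile boot_record_t *const _crash_info_ram = &fake_ram;

/* Returns true only if the application left a valid bootloader request. */
bool bootloader_entry_requested(void)
{
    const uint32_t magic = _crash_info_ram->magic;
    const uint32_t line = _crash_info_ram->line;

    if (magic != CRASH_MAGIC_BOOTLOADER ||
        (magic ^ line) != _crash_info_ram->checksum) {
        /* Not for us: leave the block untouched for the application. */
        return false;
    }
    /* Consume the request so it is acted on only once. */
    _crash_info_ram->magic = 0u;
    _crash_info_ram->checksum = 0u;
    _crash_info_ram->line = 0u;
    return true;
}
```

Any other record, such as an intentional reset or a crash, passes through the bootloader unchanged.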

Stack overflow detection

This feature uses the standard FreeRTOS task stack overflow hook. The hook stores the task name, calculates the checksum and issues a software reset.
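A hook of this kind could be sketched as follows. vApplicationStackOverflowHook is the hook name FreeRTOS expects. The record layout, the magic constant and platform_reset are assumptions. TaskHandle_t is typedef'd here only so the sketch compiles without the FreeRTOS headers, and the reset is replaced by a counter so it runs on a PC.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void *TaskHandle_t;   /* stand-in; on target this comes from FreeRTOS */

#define CRASH_MAGIC_STACK  0xC0DE0003u   /* hypothetical magic value */
#define TASK_NAME_WORDS    4u            /* 16 bytes reserved for the name */

typedef struct {
    uint32_t magic;
    uint32_t checksum;
    uint32_t task_name[TASK_NAME_WORDS];   /* stored as words for easy XOR */
} stack_record_t;

static_assert(sizeof(stack_record_t) <= 64, "record does not fit the reserved block");

/* Stand-in for the reserved RAM block so the sketch runs on a PC. */
static stack_record_t fake_ram;
static volatile stack_record_t *const _crash_info_ram = &fake_ram;

static unsigned reset_requests;   /* host-side replacement for a real reset */
static void platform_reset(void) { reset_requests++; }

void vApplicationStackOverflowHook(TaskHandle_t xTask, char *pcTaskName)
{
    uint32_t name[TASK_NAME_WORDS] = {0};
    uint32_t sum = CRASH_MAGIC_STACK;

    (void)xTask;
    /* Copy at most 15 characters so the stored name stays NUL-terminated. */
    strncpy((char *)name, pcTaskName, sizeof name - 1u);

    for (size_t i = 0; i < TASK_NAME_WORDS; i++) {
        _crash_info_ram->task_name[i] = name[i];
        sum ^= name[i];
    }
    _crash_info_ram->magic = CRASH_MAGIC_STACK;
    _crash_info_ram->checksum = sum;
    platform_reset();   /* on target: barrier, then software reset */
}
```

The name is first assembled in a local buffer, so the hook writes whole words to the reserved block.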

Summary

Most of this code has to be customized for your type of application, but the principle of keeping a small piece of RAM uninitialized at startup stays the same. In particular, you have to implement the part that reads the crash info when it becomes available, and this depends a lot on the kind of device. Some ideas:

  • Device with a display and operator panel – make a menu called “Diagnostics” and display data from the struct. Whenever problems arise, ask the end user for a photo of the screen.
  • Device with lots of flash memory or an SD card – write data from the struct (as an ASCII text file) to the SD card.
  • Device with a GSM module or another wireless link – send the crash info (along with the device serial number) to a server.
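For the text file and telemetry ideas above, the record first has to be turned into something readable. A minimal sketch, assuming a simplified record that holds only a magic value and a line number (the type, the function name and the output format are all made up for this example):

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified, hypothetical record: an intentional reset with its line number. */
typedef struct {
    uint32_t magic;
    uint32_t line;
} reset_info_t;

/* Formats the record as one ASCII line. Returns the number of characters
 * that the full line needs (as snprintf does), excluding the terminator. */
int crash_info_format(const reset_info_t *info, char *buf, size_t len)
{
    assert(info != NULL && buf != NULL);
    return snprintf(buf, len, "magic=0x%08" PRIX32 " line=%" PRIu32 "\n",
                    info->magic, info->line);
}
```

The resulting line can be appended to a log file or sent to a server together with the device serial number.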

In the next posts I will show how to diagnose hard faults and memory corruption.