Practical FFT on microcontrollers using CMSIS DSP

Fourier transform is a vast domain of knowledge with many practical applications within signal processing. Virtually all communications protocols use Fourier transform at one step or another (including LTE, GPS and WiFi). Another popular example are the “jumping bars” in music players showing levels of low and high tones in real time. In this post I show the basics of obtaining spectrum of an audio signal on an EFM32 Cortex-M3 microcontroller. No scary math!

Fourier Transform basics

Fourier transform itself can be treated like a mathematical black box that takes any function or combination of functions as the input and outputs a set of sine and cosine functions with different amplitudes, frequencies and phases. Fourier transform “works” on mathematical formulas.

Real-world signals don’t come packaged as mathematical formulas but rather as series of samples from an analog-to-digital converter, so their processing requires a slightly different tool called a discrete Fourier transform. This transform “works” on numbers. Fast Fourier transform (FFT) is a practical implementation of the discrete Fourier transform with reduced computational complexity.

FFT takes a series of numbers (signal samples). Length of the series has to be a power of two. This dimension is commonly called FFT length or FFT size. FFT output is also a series of numbers of the same length. Each of the output numbers is called an FFT bin. A bin can be thought as a group of frequencies (or as a single vertical jumping bar in a music player).

Example

Let’s have a look at a simple system suitable for DTMF detection.

An ADC works with a specific sampling frequency. I will assume Fs = 8 kHz in this example. Per the sampling theorem the maximum input frequency (without aliasing) of the system can be 4 kHz – this is enough for voice communication and DTMF. FFT size is 256. This leads to the following numbers:

  • There are 256 frequency bins.
  • Each bin is 4000 Hz / 256 = 15.625 Hz wide (theoretically!).
  • 256 samples are required to compute the FFT.
  • Acquisition time is 0.032 seconds.

First important implication: close frequencies can not be distinguished (only groups of frequencies 15 Hz wide). Second important implication: within the acquisition time of 0.032 seconds it is not possible when each frequency started and ended. This is the fundamental limit of Fourier transform – the more bins (and better frequency resolution), the more samples have to be put it which leads to poor temporal resolution. On the other hand the smaller the size of the FFT – the better temporal resolution and worse frequency resolution. This is called the Gabor limit.

(Over)simplifications

For brevity I omitted phase information. FFT returns not only the amplitude per frequency bin but also the phase (a complex number), however for simple audio applications phase information is not needed. CMSIS DSP library has functions for both complex (with phase) and real (without phase) FFT. Real FFT is slightly faster.

Second simplification is that the frequency bin does not only “pick” adjacent frequencies but also frequencies further away (with diminishing amplitude). This is called spectral leakage and, if needed, can be dealt with by applying a window function to the incoming samples. In my example no window function is used, so the effective window is rectangular.

ARM CMSIS DSP library

ARM provides a free and open source library optimized for all Cortex cores with most commonly used math and DSP functions. It is called CMSIS DSP and is available on Github.

CMSIS DSP is often present in blank projects of vendor-provided IDEs, but is also easy to integrate with a Makefile project. The only tweak (for EFM32) I had to make was to add those two lines to arm_math.h:

The library provides functions in two flavors: floating-point and fixed-point. Without diving deep into the theory, floating-point functions use types like float or double. Using floats require a floating-point unit (coprocessor) or very costly software emulation using libraries. On the other hand – the fixed-point counterparts use only integer types and interpret them as floats. For example instead of storing and processing $1.23 as float (when it comes to money – bad idea anyway) it can be stored as (integer) 123 and interpreted as the original value, just scaled by a factor of 100. Scaling by powers of 10 is easy for humans, but scaling by powers of 2 is easier on CPUs. A typical fixed-point format is the Q number format.

My MCU is a Cortex-M3 and has no floating-point unit, so I obviously chose the fixed-point implementation. CMSIS DSP has an internal type called q15_t, that is simply an int16_t but interpreted as a real number between -1 and +1.

Fixed-point also has the benefit of avoiding conversion from int to float when data to be processed is acquired by the built-in ADC.

Making test data

Math functions can take any data in and output some numbers. But how am I going to figure out if they make sense? I start with test data, that I know what to expect in the end. In the case of FFT I started with pure sine waves to verify that I can actually see a proper spectrum after FFT.

CMSIS DSP library is optimized to work only on ARM processors, so it would be hard to test it on a PC. To keep the testing simple I decided to hardcode the test data in C arrays and return the spectrum to the PC via the debugger.

How to generate tones in Audacity:

  1. Select project rate 8000 (bottom left).
  2. Use Generate->Tone with the particular frequency.
  3. Length has to be 256 samples (does not have to be longer).
  4. File->Export Audio
  5. Select header “RAW”.
  6. Select encoding “Signed 16-bit PCM”

Exported .raw files contain binary data. I used xxd command (part of vim) to convert them to C code. Example:
xxd -i 697hz.raw > 697hz.inc

The output contains an unsigned char array and its length:

Test waveform has now type unsigned char (ie. 8-bits), the original export format was 16-bit signed and CMSIS DSP wants q15_t. Why this will work? Pointer to this array can be simply cast to q15_t (so as to treat two bytes as a single int16_t value). This only works because the raw format on x86 is little-endian and Cortex-M3 is also little endian (to be exact, the particular Cortex-M3 I am using is little-endian, just like 99% on the market).

FFT test code

The code uses SEGGER RTT for output. It processes all test signals in sequence, outputs (absolute value of) the spectrum and stops on a breakpoint.

All magic is done by a single call to arm_rfft_q15. The output is a vector (one dimensional array). Each of the elements represents a frequency bin. A positive value means that a particular bin is closer to a sine function, while a negative value, to cosine. Since I don’t care about phase in this example – an absolute value is taken (using arm_abs_q15). The vector is then sent to the terminal.

To measure performance I use the cycle counter of Cortex-M3 Data Watchpoint and Trace Unit. It is a register that… simply counts how many clock cycles have elapsed. That of course counts interrupt traffic, so the results are not always identical. This gives hard data on the execution time (better and easier than counting assembly output instructions). The cycle count register is read before doing the FFT and after. The difference is execution time in clock cycles.

FFT results

Raw output is not easy to follow. First column is waveform name, second column is execution time and all else are the frequency bins. After some gnuplot trickery everything looks clear (click to enlarge):

Wow! Exactly what I would expect:

  • Single tones have a clear prominent peak (or two).
  • The peaks move up as the frequency increases.
  • White noise looks like it has all possible frequencies (yay!).
  • DTMF tone has two peaks.

Performance

The average of all cycle counters is 31919. Remember that this function has to be eventually executed in real time on incoming ADC samples, all the time. Is that too much or is that acceptable? Let’s do the math:

  • Sampling frequency: 8 kHz (ie. 8000 samples per second)
  • FFT size: 256
  • CPU cycles to do the FFT: 31919
  • CPU frequency: 48 MHz (EFM32GG332)
  • Time to acquire 256 samples: 0.032 seconds (formula: 256/8000)
  • Time to process 256 samples: 0.000664979 seconds (formula: 31919/48’000’000)
  • Fraction of processing time to acquisition time: 2% (formula: 0.000664979/0.032)

What does the 2% figure mean in practice? While data is coming from the ADC to a memory buffer there is nothing yet to process (so the CPU can do something else or sleep). When a complete 256 sample chunk is ready it takes just 2% of the acquisition time to calculate the FFT (and another acquisition cycle is immediately started again). This is a very low CPU load for a non-trivial computation. All processing can be done in real-time and there is plenty of CPU power left. In my opinion, FFT on a fast Cortex-M3 is a justifiable solution, even for trivial tasks like tone detection. Tone detection could also be done by simpler algorithms (for example: banks of filters) that may be more power-efficient, however FFT gives the ultimate flexibility when it comes to analyzing the signal and drastically shortens development time (just two function calls!).

Rate this post