XMEGA high-performance SPI with DMA

I developed a universal SPI driver for the XMEGA line of MCUs for a battery-powered device, where power efficiency was important. To get things running on new hardware I started with simpler code that uses interrupts, and then began looking at the XMEGA’s DMA controller (which was totally new to me) to improve speed and let the MCU sleep longer. This is a complete driver that can work with any kind of SPI peripheral.

It is also a nice practical introduction to DMA, because the XMEGA DMA controller is one of the simplest you can find in microcontrollers (compared with, say, Kinetis Cortex-M), yet it has all the necessary features.

Scroll all the way down for the complete code.


Let’s begin with the high level interface…

The first thing to configure is the USART name. This driver uses the XMEGA’s USART instead of the “regular” SPI peripheral, because the latter cannot use DMA. Names of vectors and DMA triggers also have to be customized (just change the USARTC1 part everywhere).

Before spi_init() the pins have to be configured, for example (ATXmega32a4u):

I also found that the SPI clock line has to be inverted to talk to an M25-series flash memory (SPI mode 1); try disabling the inversion if your chip does not work.

The driver sets up the hardware to do the transmission and returns immediately, so the main application does not block while the transmission is in progress. External buses run much slower than the MCU, so waiting for the end of a transmission would block execution for too long (busy-waiting is an anti-pattern in embedded programming). The complication is that application code somehow has to be notified when a transfer is complete so it can process incoming data – this is implemented as a callback. One of the spi_transfer arguments is a pointer to a function that the driver calls when the hardware has completed its job. This is a typical embedded pattern – you set up the hardware to do something, start it, let it run (via interrupts or DMA) and, when the job is done, call the function requested at the beginning.

The callback function has to match the spi_transfer_complete_callback_t type (example: void my_callback(uint8_t *buffer, uint16_t length)).
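A minimal sketch of what the callback type and a matching callback could look like. The typedef is inferred from the example signature above and may differ in detail from the one in the driver's header; rx_count and last_byte are illustrative names, not part of the driver.

```c
#include <stdint.h>

/* Callback type, inferred from the example signature in the text. */
typedef void (*spi_transfer_complete_callback_t)(uint8_t *buffer, uint16_t length);

/* Illustrative application state, updated by the callback. */
static volatile uint16_t rx_count;
static volatile uint8_t last_byte;

/* Example callback: the buffer now holds the bytes received from the slave. */
void my_callback(uint8_t *buffer, uint16_t length)
{
    rx_count = length;
    last_byte = (length > 0) ? buffer[length - 1] : 0;
}
```

Keeping the callback this small matters most when it is delivered from interrupt context, since it then runs inside the ISR.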

To initiate a transfer the spi_transfer function has to be called. Its arguments are:

  • pointer to the data buffer that bytes will be sent from and received into (because each SPI transfer is an exchange between a master and a slave) – it must be statically allocated (i.e. defined outside of a function, not on the stack), because the hardware keeps using the buffer after spi_transfer has returned; a stack buffer would be gone by then and the driver would crash the whole program
  • length of the transfer (buffer can be larger than the length)
  • pointer to function that should be called at the end of the transfer
  • option to deliver the callback either from main context or from interrupt context (usually it is easier to handle callbacks in the main context, but a small and simple callback will give lower latency when called directly from interrupt context)

To deliver callbacks in main application context spi_task has to be called in the main loop periodically.
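The interplay between spi_transfer, the end-of-transfer interrupt and spi_task can be sketched with a hardware-free model. This is not the driver's code: the argument order, the bool type of the fourth argument, the spi struct and fake_transfer_complete_isr are all assumptions made so the example runs on a PC.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*spi_transfer_complete_callback_t)(uint8_t *buffer, uint16_t length);

/* Driver state shared between spi_transfer(), the "ISR" and spi_task(). */
static struct {
    uint8_t *buffer;
    uint16_t length;
    spi_transfer_complete_callback_t callback;
    bool callback_from_isr;      /* assumed meaning of the fourth argument */
    volatile bool complete;      /* set by the ISR, cleared by spi_task() */
} spi;

void spi_transfer(uint8_t *buffer, uint16_t length,
                  spi_transfer_complete_callback_t callback, bool callback_from_isr)
{
    spi.buffer = buffer;
    spi.length = length;
    spi.callback = callback;
    spi.callback_from_isr = callback_from_isr;
    spi.complete = false;
    /* The real driver would start the hardware here and return immediately. */
}

/* Stand-in for the end-of-transfer interrupt. */
void fake_transfer_complete_isr(void)
{
    if (spi.callback_from_isr && spi.callback != NULL)
        spi.callback(spi.buffer, spi.length);   /* low latency, ISR context */
    else
        spi.complete = true;                    /* defer to the main loop */
}

/* Called periodically from the main loop. */
void spi_task(void)
{
    if (spi.complete) {
        spi.complete = false;
        if (spi.callback != NULL)
            spi.callback(spi.buffer, spi.length);
    }
}

/* Illustrative application code. */
static uint8_t app_buffer[4] = {0x01, 0x02, 0x03, 0x04};  /* static, not on the stack */
static int app_callbacks;

static void app_done(uint8_t *buffer, uint16_t length)
{
    (void)buffer;
    (void)length;
    app_callbacks++;
}
```

In this model a transfer started with the last argument false only reaches app_done on the next spi_task call after the interrupt, which is why spi_task has to sit in the main loop.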

The driver does not handle the chip select line in any way – the application has to handle it. It is more convenient to leave that piece to higher level application. For example SPI flash memory operations require a command and address to be sent before data. A high level flash driver can provide functions like read_sector that will take a pointer to a buffer that should be filled with data (only data), but first has to send command and address bytes. If it can control the CS line directly then it can be done using two SPI transfers. If the SPI driver controlled the CS line it would have to be done in a single transfer – so the flash driver would have to implement another buffer and much copying. Memory is always a scarce resource in embedded systems.
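A rough sketch of the two-transfer flash read described above, with the flash driver owning the CS line. Everything here is hypothetical: cs_assert, cs_release, flash_read and the callbacks are invented names, spi_transfer is simplified to three arguments and replaced by a stub that completes immediately, and whether the real driver allows starting a transfer from inside a callback is an assumption. The 0x03 opcode is the usual READ command of M25-style flash.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef void (*spi_transfer_complete_callback_t)(uint8_t *buffer, uint16_t length);

/* --- Hypothetical stand-ins so the sketch runs without hardware --- */
static bool cs_active;              /* true while CS is driven low */
static int transfers_while_cs;      /* transfers started with CS active */

static void cs_assert(void)  { cs_active = true; }
static void cs_release(void) { cs_active = false; }

/* Stub: the real driver returns immediately and calls back later. */
static void spi_transfer(uint8_t *buffer, uint16_t length,
                         spi_transfer_complete_callback_t callback)
{
    if (cs_active)
        transfers_while_cs++;
    memset(buffer, 0xA5, length);   /* pretend the slave answered 0xA5 */
    callback(buffer, length);
}

/* --- Flash driver sketch --- */
static uint8_t header[4];           /* command + 24-bit address, static */
static uint8_t *read_dest;
static uint16_t read_length;
static volatile bool read_done;

static void data_received(uint8_t *buffer, uint16_t length)
{
    (void)buffer;
    (void)length;
    cs_release();                   /* end of the flash command */
    read_done = true;
}

static void header_sent(uint8_t *buffer, uint16_t length)
{
    (void)buffer;
    (void)length;
    /* CS is still low: the data lands directly in the caller's buffer. */
    spi_transfer(read_dest, read_length, data_received);
}

void flash_read(uint32_t address, uint8_t *dest, uint16_t length)
{
    header[0] = 0x03;               /* READ command of M25-style flash */
    header[1] = (uint8_t)(address >> 16);
    header[2] = (uint8_t)(address >> 8);
    header[3] = (uint8_t)address;
    read_dest = dest;
    read_length = length;
    read_done = false;
    cs_assert();
    spi_transfer(header, sizeof header, header_sent);
}
```

The only extra memory is the 4-byte header; the payload is never copied, which is the point of leaving CS to the higher layer.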

Last of the public functions is spi_override_callback – it can be used to change the callback before a transmission has finished, for example to abort an operation after the transmission. There is no way to stop the transfer once it has begun.


I obtained the traces from an ATXmega32a4u running at 32MHz. They show SPI clock signal. Clock frequency is not relevant.



The striking difference between the interrupt-based driver and DMA is that the former has long delays between consecutive bytes. Time from the beginning of one byte to the beginning of the next is around 6.1 µs. The “silence” in between is around 2.1 µs, so the “wasted” time is around 34% (2.1 / 6.1)!

For about a third of the time the CPU is busy reading a byte from the USART and writing the next one. No matter how tight the ISR code is, it still takes time to enter the interrupt and return. The higher the SPI clock frequency, the bigger the “relative waste”, because interrupt execution time stays roughly the same while the SPI byte transmission takes less time.
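The scaling argument can be made concrete with a small calculation. This is only a sketch: it assumes the gap between bytes is a fixed cost per byte and that one byte takes 8 SPI clock periods; the function names are illustrative.

```c
/* Time to shift one byte out, in microseconds, for an SPI clock in MHz. */
double byte_time_us(double spi_clock_mhz)
{
    return 8.0 / spi_clock_mhz;
}

/* Fraction of the total transfer time lost to the gap between bytes,
   assuming the gap does not depend on the SPI clock. */
double wasted_fraction(double gap_us, double spi_clock_mhz)
{
    return gap_us / (gap_us + byte_time_us(spi_clock_mhz));
}
```

For instance, with a 2.1 µs gap and an 8 MHz SPI clock the byte itself takes only 1 µs, so under these assumptions the bus would sit idle for longer than it is active.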

The driver using DMA, on the other hand, transmits data continuously, without any delays AND without any CPU usage (apart from setting up the transmission in the first place)! The only odd thing I can see in the trace is the slightly irregular high clock state every fifth bit, but that may be an artifact of my logic analyzer. The overall benefits are obvious – with DMA I get only a single interrupt at the end of a transfer, while the ISR driver would deliver as many interrupts as there were bytes in the transmission.

How it works

Let’s have a look at the internals (complete code is at the very end) piece by piece.

Includes and module private data

Some general includes. Macros check that exactly one of the modes has been enabled in the .h file. The likely() macro is used to tell the compiler that a particular condition (e.g. in a loop or if conditional) is …more likely than others. I use it within the interrupt handler to optimize one path.
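A sketch of how such a mode check and a likely() macro are commonly written with GCC. The macro names SPI_USE_DMA and SPI_USE_ISR are made up for this example (the real names are in the driver's header), and bytes_left_after only illustrates where the hint would go.

```c
/* Hypothetical mode switches; the real names live in the driver's .h file. */
#define SPI_USE_DMA 1
/* #define SPI_USE_ISR 1 */

/* Refuse to build unless exactly one mode is selected. */
#if defined(SPI_USE_DMA) == defined(SPI_USE_ISR)
#error "Enable exactly one of SPI_USE_DMA or SPI_USE_ISR"
#endif

/* GCC/Clang branch hint: tells the compiler the condition is usually true. */
#define likely(x) __builtin_expect(!!(x), 1)

/* Example: "more bytes to send" is the common case inside the ISR. */
int bytes_left_after(int remaining)
{
    if (likely(remaining > 1))
        return remaining - 1;   /* hot path: keep transmitting */
    return 0;                   /* rare path: transfer finished */
}
```

The hint never changes the result, only how the compiler lays out the branches.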

Private data holds the state of the driver between function calls and interrupts.


The spi_init function basically configures the USART for SPI mode, enables transmitter, receiver and, depending on the mode, enables either the DMA controller or RXC interrupt.

The spi_override_callback function can just change the callback function at any moment.

Transfer function

This is the most complex part. Apart from obvious setting of private data for further use this function configures the DMA controller.

DMA version

XMEGA DMA controller has 4 separate channels that can be configured at the same time. There is of course a single memory bus, so if more than one channel is enabled, the DMA controller will switch between channels (it can be configured for fixed priorities or round-robin).

Each channel basically does a bunch of very simple operations:

  • Wait for trigger condition
  • When this condition happens – copy a byte (or word) from source address to destination address (or multiple bytes at once in more advanced operation)
  • Increment, decrement or do nothing to the source address
  • Increment, decrement or do nothing to the destination address
  • Count the number of operations and if it matches the length – set a flag or generate an interrupt
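The steps above can be modelled in a few lines of plain C. This is a software illustration of what one channel does per trigger, not the XMEGA register interface; dma_channel_t, addr_mode_t and dma_on_trigger are names invented for the sketch, and only single-byte transfers are modelled.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { ADDR_FIXED, ADDR_INC, ADDR_DEC } addr_mode_t;

/* Software model of one DMA channel (illustrative, not the real registers). */
typedef struct {
    const uint8_t *src;
    uint8_t *dst;
    addr_mode_t src_mode;
    addr_mode_t dst_mode;
    uint16_t remaining;   /* transfers still to do */
    bool complete;        /* set when the count reaches zero */
} dma_channel_t;

/* Call once per trigger event (e.g. "data register empty"). */
void dma_on_trigger(dma_channel_t *ch)
{
    if (ch->remaining == 0)
        return;                      /* channel idle */

    *ch->dst = *ch->src;             /* copy one byte */

    if (ch->src_mode == ADDR_INC)
        ch->src++;
    else if (ch->src_mode == ADDR_DEC)
        ch->src--;

    if (ch->dst_mode == ADDR_INC)
        ch->dst++;
    else if (ch->dst_mode == ADDR_DEC)
        ch->dst--;

    if (--ch->remaining == 0)
        ch->complete = true;         /* hardware would raise a flag/interrupt */
}
```

In terms of this model, an SPI transmit channel would use an incrementing source (the buffer) with a fixed destination (the data register), and a receive channel the reverse.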

SPI operation requires both writing and reading data from the USART, so two DMA channels are required.

For the TX side the destination address is fixed (USART DATA register), the source address begins at the pointer to the data buffer, and the source address is incremented. The trigger condition corresponds to the USART DRE (data register empty) flag/interrupt (I explain USART flags/interrupts further below), so a byte is copied whenever the USART can accept the next one.

For the RX side the destination address starts with pointer to data buffer, the address is to be incremented, source address remains fixed and points to USART DATA register. Length is the same as of the TX channel.

Register/buffer 16-bit addresses have to be split into individual bytes. After the channels are set up they are enabled and the magic happens.
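Splitting an address into bytes could look roughly like this. The XMEGA DMA channel has three byte-wide registers per address; here they are modelled as a plain struct so the sketch runs anywhere, and dma_addr_regs_t and dma_set_address are illustrative names rather than anything from the device headers.

```c
#include <stdint.h>

/* Three byte-wide address registers, modelled as a plain struct. */
typedef struct {
    uint8_t addr0;   /* bits 7..0   */
    uint8_t addr1;   /* bits 15..8  */
    uint8_t addr2;   /* bits 23..16 */
} dma_addr_regs_t;

/* Write a 16-bit data-space address into the three byte registers. */
void dma_set_address(dma_addr_regs_t *regs, uint16_t address)
{
    regs->addr0 = (uint8_t)(address & 0xFF);
    regs->addr1 = (uint8_t)(address >> 8);
    regs->addr2 = 0;   /* a 16-bit address never reaches the top byte */
}
```

The same helper would serve both the buffer pointer and the address of the USART DATA register, each cast to a 16-bit integer first.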

ISR version

Interrupt version is much simpler – it just transmits the very first byte and then everything is handled by USART receive interrupt.

Main loop task

The task function checks if the interrupt has set the complete flag and executes the callback function. As simple as that.

Interrupt service routines – DMA version

Each channel generates its own interrupt. I could not get a channel to work without an interrupt at all (it would crash the application), so to make it work reliably each channel is simply disabled in its interrupt handler. The RX channel handler also checks whether the callback should be delivered from within the ISR (otherwise it will be delivered by the main loop task).

Interrupt service routines – non-DMA version

XMEGA USART (all other AVRs too) has three flags/interrupts (each flag can generate an interrupt):

  • TXC – transmission complete – it is set when the last bit has actually been put on the wire
  • DRE – data register empty – the data register is buffered, so it can be written with the next byte before the previous one has been completely transmitted out
  • RXC – receive complete – a byte has been received

When running as a UART the meaning of the flags is obvious: sending and reception work totally independently of each other, so the application can predict when TXC and DRE happen, but not when RXC happens. In SPI mode everything is tightly coupled, because you receive at the same time as you send. This simplifies the driver, because I can use the RXC interrupt only. When a byte has been received it also means that the previous one has been transmitted, so the interrupt service routine reads the received byte, stores it in the buffer, and then either transmits the next byte if there are more or finishes the transmission by setting a flag or delivering the callback. Using a single ISR for sending and reception saves CPU time (separate handlers would require two interrupts per byte, and AVR does not have back-to-back interrupt chaining as ARM cores do, so it would just thrash the CPU).
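The single-interrupt flow can be tried out without hardware. In this sketch the USART is replaced by a fake that answers each written byte with its bitwise inverse; usart_write, usart_read, start_transfer and rxc_isr are illustrative names, and the real handler would be attached to the USART receive-complete vector instead of being called by hand.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fake USART in SPI mode: every byte written comes back inverted,
   standing in for a slave's reply. Not the real data register. */
static uint8_t usart_data;
static void usart_write(uint8_t b) { usart_data = (uint8_t)~b; }
static uint8_t usart_read(void)    { return usart_data; }

static uint8_t *buf;
static uint16_t len, pos;
static volatile bool transfer_complete;

/* Start: send only the first byte; the "ISR" does the rest.
   Assumes length is at least 1. */
void start_transfer(uint8_t *buffer, uint16_t length)
{
    buf = buffer;
    len = length;
    pos = 0;
    transfer_complete = false;
    usart_write(buf[0]);
}

/* Body of the receive-complete (RXC) interrupt handler. */
void rxc_isr(void)
{
    buf[pos] = usart_read();          /* store reply over the byte just sent */
    pos++;
    if (pos < len)
        usart_write(buf[pos]);        /* common path: send the next byte */
    else
        transfer_complete = true;     /* last byte: set flag or call back */
}
```

One call to rxc_isr per byte is all it takes, which matches the one-interrupt-per-byte cost discussed in the measurements.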

That’s all 🙂

Complete code

Header file is at the beginning of the post.