M0AGX / LB9MG

Amateur radio and embedded systems

XMEGA high-performance SPI with DMA

I developed a universal SPI driver for the XMEGA line of MCUs for a battery-powered device where power efficiency was important. To get the new hardware going I started with simpler, interrupt-driven code, and then began looking at the XMEGA's DMA controller (which was totally new to me) to improve speed and let the MCU sleep longer. This is a complete driver that can work with any kind of SPI peripheral.

It is also a nice practical introduction to DMA, because the XMEGA DMA controller is one of the simplest you can find in microcontrollers (compared to, let's say, a Kinetis Cortex-M), yet it has all the necessary features.

Scroll all the way down for the complete code.

Interface

Let's begin with the high level interface...

#ifndef SPI_H_
#define SPI_H_
#include <stdint.h>
#include <stdbool.h>

/* ----------- configuration ------------ */
/* Remember to configure USART pins as outputs before calling spi_init.
 *
 * I had to invert the clock to communicate with an M25 flash memory.
 * This is done by calling:
 * PORTx.PINnCTRL = PORT_INVEN_bm;
 * x - port A-D
 * n - pin 0-7
 */

//#define SPI_USE_ISR
#define SPI_USE_DMA

#define SPIUSART USARTC1
#define SPIUSART_RXC_vect USARTC1_RXC_vect

#define DMA_SPI_TX_TRIGGER_SOURCE DMA_CH_TRIGSRC_USARTC1_DRE_gc
#define DMA_SPI_RX_TRIGGER_SOURCE DMA_CH_TRIGSRC_USARTC1_RXC_gc
#define DMA_SPI_TX DMA.CH2 //channels must not be used
#define DMA_SPI_RX DMA.CH3 //by other pieces of code
#define DMA_SPI_TX_vect DMA_CH2_vect
#define DMA_SPI_RX_vect DMA_CH3_vect
/* -------- end of configuration -------- */

typedef void (*spi_transfer_complete_callback_t)(uint8_t *buffer, uint16_t length);

void spi_init(void);
void spi_task(void);
void spi_override_callback(spi_transfer_complete_callback_t cb);
void spi_transfer(volatile uint8_t *data,
                  uint16_t length,
                  spi_transfer_complete_callback_t cb,
                  bool callback_from_isr);

#define spi_transfer_from_ISR spi_transfer

#define CB_FROM_ISR true   //use as callback_from_isr argument for spi_transfer
#define CB_FROM_MAIN false

#endif

The first thing to configure is the USART name. This driver uses the XMEGA's USART instead of the "regular" SPI peripheral because the latter cannot use DMA. The names of the vectors and DMA triggers also have to be customized (just change the USARTC1 part everywhere).

Before spi_init() the pins have to be configured, for example (ATXmega32a4u):

PORTC.DIRSET = PIN4_bm /*CS*/| PIN5_bm/*SCK*/ | PIN7_bm/*MOSI*/;
PORTC.DIRCLR = PIN6_bm; /*MISO*/
PORTC.OUTSET = PIN4_bm; /*CS high*/
PORTC.PIN5CTRL = PORT_INVEN_bm; /*CLOCK HAS TO BE INVERTED!!*/

I also found out that the SPI clock line has to be inverted to talk to an M25-series flash memory (SPI mode 1). Try disabling the inversion if your chip does not work.

The driver sets up the hardware to do the transmission and returns immediately, so the main application does not block while the transmission is in progress. External buses run much slower than the MCU, so waiting for the end of a transmission would block execution for too long (busy-waiting is an anti-pattern in embedded programming). The complication is that the application code somehow has to be notified when a transfer is complete so that it can process the incoming data. This is implemented as a callback: one of the spi_transfer arguments is a pointer to a function that the driver calls when the hardware has completed its job. This is a typical embedded pattern - you set up the hardware to do something, start it, let it run (via interrupts or DMA) and, when the job is done, call the function requested at the beginning.

A callback function has to match the spi_transfer_complete_callback_t type (example: void my_callback(uint8_t *buffer, uint16_t length)).

To initiate a transfer the spi_transfer function has to be called. Its arguments are:

  • pointer to the data buffer that bytes will be sent from and received into (each SPI transfer is an exchange between a master and a slave) - it must be statically allocated (i.e. defined outside of a function, not on the stack), otherwise the driver will crash the whole program
  • length of the transfer (the buffer can be larger than the length)
  • pointer to the function that should be called at the end of the transfer
  • option to deliver the callback either from the main context or from the interrupt - it is usually easier to handle callbacks in the main context, but a small and simple callback gives lower latency when called directly from interrupt context

To deliver callbacks in the main application context, spi_task has to be called periodically in the main loop.
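The interface described above can be tried out with a minimal usage sketch. So that it compiles off-target, the driver is replaced here by a tiny host-side stand-in that pretends the transfer finishes instantly. That stand-in, the buffer contents and the names on_transfer_done and run_demo are my assumptions for illustration and are not part of the real driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*spi_transfer_complete_callback_t)(uint8_t *buffer, uint16_t length);

/* ---- host-side stand-in for the driver (assumption, illustration only) ---- */
static spi_transfer_complete_callback_t pending_cb;
static volatile uint8_t *pending_buf;
static uint16_t pending_len;
static bool pending_flag;

static void spi_transfer(volatile uint8_t *data, uint16_t length,
                         spi_transfer_complete_callback_t cb, bool callback_from_isr){
    assert(data != NULL && length > 0);
    pending_cb = cb;
    pending_buf = data;
    pending_len = length;
    if (callback_from_isr){
        if (cb) cb((uint8_t *)data, length); /* real driver: called from the ISR */
    } else {
        pending_flag = true;                 /* real driver: flag set by the ISR */
    }
}

static void spi_task(void){
    if (pending_flag){
        pending_flag = false;
        if (pending_cb) pending_cb((uint8_t *)pending_buf, pending_len);
    }
}
/* --------------------------------------------------------------------------- */

/* Statically allocated: it must stay valid until the callback has run. */
static volatile uint8_t spi_buffer[4] = {0x01, 0x02, 0x03, 0x04}; /* example bytes */
static volatile uint16_t done_length;   /* set by the callback */

static void on_transfer_done(uint8_t *buffer, uint16_t length){
    (void)buffer;          /* received bytes would be processed here */
    done_length = length;
}

/* Starts one transfer and polls spi_task() like a main loop would.
 * Returns the length reported to the callback. */
static uint16_t run_demo(void){
    done_length = 0;
    spi_transfer(spi_buffer, 4, on_transfer_done, false /* CB_FROM_MAIN */);
    while (done_length == 0){
        spi_task();        /* the callback is delivered from here */
    }
    return done_length;
}
```

On the target the loop in run_demo would be the application's main loop, which does other work between the calls to spi_task.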

The driver does not handle the chip select line in any way; the application has to do it. It is more convenient to leave that to higher-level code. For example, SPI flash memory operations require a command and an address to be sent before the data. A high-level flash driver can provide functions like read_sector that take a pointer to a buffer to be filled with data (only data), but it first has to send the command and address bytes. If it controls the CS line directly, this can be done using two SPI transfers. If the SPI driver controlled the CS line, it would have to be done in a single transfer, so the flash driver would need another buffer and a lot of copying. Memory is always a scarce resource in embedded systems.
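The two-transfer read described above could look roughly like the sketch below. It is only an illustration under assumptions: flash_read, cs_low, cs_high and the callbacks are hypothetical names, 0x03 is used as the READ opcode commonly found on SPI NOR flash, and the driver is replaced by a host-side stand-in that completes every transfer instantly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*spi_transfer_complete_callback_t)(uint8_t *buffer, uint16_t length);

/* ---- host-side stand-ins (assumptions, illustration only) ---- */
static bool cs_is_low;            /* models the CS pin state */
static unsigned transfers_in_cs;  /* transfers started while CS was low */

static void cs_low(void){ cs_is_low = true; }    /* target: PORTC.OUTCLR = PIN4_bm; */
static void cs_high(void){ cs_is_low = false; }  /* target: PORTC.OUTSET = PIN4_bm; */

/* Pretends the transfer completes instantly and calls back right away. */
static void spi_transfer(volatile uint8_t *data, uint16_t length,
                         spi_transfer_complete_callback_t cb, bool callback_from_isr){
    (void)callback_from_isr;
    assert(data != NULL && length > 0);
    if (cs_is_low) transfers_in_cs++;
    if (cb) cb((uint8_t *)data, length);
}
/* -------------------------------------------------------------- */

static volatile uint8_t header[4];   /* command + 24-bit address, static */
static uint8_t *user_buffer;         /* caller's buffer, filled with data only */
static uint16_t user_length;
static bool read_done;

static void data_done(uint8_t *buffer, uint16_t length){
    (void)buffer; (void)length;
    cs_high();                       /* second transfer finished: release the chip */
    read_done = true;
}

static void header_done(uint8_t *buffer, uint16_t length){
    (void)buffer; (void)length;
    /* CS is still low, so the flash treats this as the same operation.
     * Data goes straight into the caller's buffer, no copying. */
    spi_transfer(user_buffer, user_length, data_done, true);
}

/* Hypothetical high-level read built from two SPI transfers. */
static void flash_read(uint32_t address, uint8_t *dest, uint16_t length){
    header[0] = 0x03;                /* READ opcode (assumption, check the datasheet) */
    header[1] = (uint8_t)(address >> 16);
    header[2] = (uint8_t)(address >> 8);
    header[3] = (uint8_t)address;
    user_buffer = dest;
    user_length = length;
    read_done = false;
    transfers_in_cs = 0;
    cs_low();
    spi_transfer(header, sizeof header, header_done, true);
}
```

The only extra memory the flash driver needs is the 4-byte header. The destination buffer passed to flash_read has to stay valid until read_done is set, just like any other buffer handed to the SPI driver.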

The last of the public functions is spi_override_callback. It can be used to change the callback before a transmission has finished, for example to abort an operation after the transmission. There is no way to stop a transfer once it has begun.

Performance

I obtained the traces from an ATXmega32a4u running at 32MHz. They show the SPI clock signal; the clock frequency itself is not relevant.

Interrupt-driven

timing diagram using interrupts

DMA

timing diagram using DMA

The striking difference between the interrupt-based driver and DMA is that the former has long delays between consecutive bytes. The time from the beginning of one byte to the beginning of the next is around 6.1µs. The "silence" in between is around 2.1µs, so the "wasted" time is around 34% (2.1/6.1)!

For about a third of the time the CPU is busy reading a byte from the USART and writing the next one. No matter how tight the ISR code is, it still takes time to enter the interrupt and return. The higher the SPI clock frequency, the bigger the "relative waste", because the interrupt execution time stays roughly the same while an SPI byte takes less time to transmit.
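The effect of the SPI clock on the "relative waste" can be estimated with a few lines of arithmetic. This is a rough sketch that assumes the gap between bytes stays constant regardless of the clock; wasted_fraction is a hypothetical helper, and the 1 MHz and 8 MHz clocks used below are arbitrary example values.

```c
#include <assert.h>

/* Fraction of the byte-to-byte period that is spent in the gap between bytes.
 * gap_us:        idle time between two bytes (interrupt overhead)
 * spi_clock_mhz: SPI clock; one byte needs 8 clock cycles */
static double wasted_fraction(double gap_us, double spi_clock_mhz){
    assert(gap_us >= 0.0 && spi_clock_mhz > 0.0);
    double byte_time_us = 8.0 / spi_clock_mhz;
    return gap_us / (byte_time_us + gap_us);
}
```

With a 2.1µs gap, wasted_fraction(2.1, 1.0) comes out at about 0.21, while wasted_fraction(2.1, 8.0) is about 0.68, so at the faster clock most of the bus time is silence.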

The DMA driver, on the other hand, transmits data continuously, without any delays AND without any CPU usage (apart from setting up the transmission in the first place)! The only odd thing I can see in the trace is the slightly irregular high clock state every fifth bit, but that may be an artifact of my logic analyzer. The overall benefits are obvious - with DMA I get only one interrupt per channel at the end of a transfer, while the ISR driver delivers as many interrupts as there are bytes in the transmission.

How it works

Let's have a look at the internals (complete code is at the very end) piece by piece.

Includes and module private data

Some general includes. The macros check that exactly one of the modes has been enabled in the .h file. The likely() macro tells the compiler that a particular condition (e.g. in a loop or an if statement) is more likely than the alternative. I use it within the interrupt to optimize one path.

Private data holds the state of the driver between function calls and interrupts.

#include "spi.h"
#include <avr/io.h>
#include <avr/interrupt.h>
#include <stdio.h> //for NULL definition

#if !defined(SPI_USE_ISR) && !defined(SPI_USE_DMA)
    #error "SPI type not defined!"
#endif

#if defined(SPI_USE_ISR) && defined(SPI_USE_DMA)
    #error "Define only one type of SPI operation! DMA or ISR!"
#endif

#define likely(x)       __builtin_expect(!!(x),1)

/* --------- private data --------------- */
/* All variables are volatile, because a callback function executed
 * from SPI RXC interrupt may request another transfer that will
 * modify them all. */
static volatile spi_transfer_complete_callback_t _transfer_complete_callback;
static volatile uint16_t _transfer_length;
static volatile uint8_t *_transfer_buffer;
static volatile uint16_t _transfer_index;
static volatile uint16_t _transfer_rx_index;
static volatile bool _transfer_complete_flag = false;
static volatile bool _callback_from_isr = false;

Initialization

The spi_init function basically configures the USART for SPI mode, enables the transmitter and receiver and, depending on the mode, enables either the DMA controller or the RXC interrupt.

The spi_override_callback function can just change the callback function at any moment.

/* ----------- implementation ----------- */
void spi_init(void){
    cli();
    //use hardware USART for SPI
    SPIUSART.CTRLC = USART_CMODE_MSPI_gc | USART_CHSIZE_8BIT_gc; //SPI mode, MSB first
    SPIUSART.CTRLB = USART_TXEN_bm | USART_RXEN_bm; //enable receiver and transmitter

    SPIUSART.BAUDCTRLA = 0; //SPI clock is 2MHz
    SPIUSART.BAUDCTRLB = 0; //when F_CPU = 32MHz
    _transfer_complete_flag = false;

    #ifdef SPI_USE_DMA
    DMA.CTRL  = DMA_ENABLE_bm; //globally enable DMA, no double buffering, round-robin scheduling
    #endif

    #ifdef SPI_USE_ISR
    SPIUSART.CTRLA = USART_RXCINTLVL_LO_gc; //RX interrupt has low priority
    #endif

    sei();
}

void spi_override_callback(spi_transfer_complete_callback_t cb){
    _transfer_complete_callback = cb;
}

Transfer function

This is the most complex part. Apart from the obvious storing of private data for later use, this function configures the DMA controller.

DMA version

The XMEGA DMA controller has 4 separate channels that can be configured at the same time. There is of course a single memory bus, so if more than one channel is enabled the DMA controller will switch between channels (it can be configured for fixed priorities or round-robin).

Each channel basically does a bunch of very simple operations:

  • Wait for trigger condition
  • When this condition happens - copy a byte (or word) from source address to destination address (or multiple bytes at once in more advanced operation)
  • Increment, decrement or do nothing to the source address
  • Increment, decrement or do nothing to the destination address
  • Count the number of operations and if it matches the length - set a flag or generate an interrupt
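The steps above can be sketched as a small software model of one channel. It is only an illustration of the principle: the struct layout and the names dma_channel_model_t and dma_on_trigger are my assumptions and have nothing to do with the real register layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { ADDR_FIXED = 0, ADDR_INC = 1, ADDR_DEC = -1 } addr_mode_t;

/* Software model of one DMA channel (illustration, not the real register layout). */
typedef struct {
    const volatile uint8_t *src;
    volatile uint8_t *dst;
    addr_mode_t src_mode;
    addr_mode_t dst_mode;
    uint16_t remaining;     /* like TRFCNT: bytes still to be copied */
    bool complete;          /* set when the count reaches zero */
} dma_channel_model_t;

/* Called once per trigger: copies one byte, updates both addresses
 * and the counter. Returns true when the whole transfer is complete. */
static bool dma_on_trigger(dma_channel_model_t *ch){
    assert(ch != NULL);
    if (ch->remaining == 0){
        return true;        /* nothing left to do, ignore the trigger */
    }
    *ch->dst = *ch->src;
    ch->src += (int)ch->src_mode;
    ch->dst += (int)ch->dst_mode;
    ch->remaining--;
    if (ch->remaining == 0){
        ch->complete = true; /* real hardware would raise the interrupt here */
    }
    return ch->complete;
}
```

A transmit channel in this model would use ADDR_INC for the source (the buffer) and ADDR_FIXED for the destination (the data register); a receive channel is the mirror image.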

SPI operation requires both writing and reading data from the USART so two DMA channels are required.

For the TX side the destination address is fixed (the USART DATA register), while the source address starts at the data buffer pointer and is incremented after each byte. The trigger condition is the USART DRE (data register empty) flag/interrupt (I explain the USART flags/interrupts further below), so a byte is copied whenever the USART can accept the next one.

For the RX side the destination address starts at the data buffer pointer and is incremented, while the source address remains fixed and points to the USART DATA register. The trigger is the USART RXC (receive complete) flag. The length is the same as for the TX channel.

The 16-bit register/buffer addresses have to be split into individual bytes. After the channels are set up they are enabled and the magic happens.
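The address splitting is plain shifting and masking. Here it is as a standalone sketch that can be checked on a host; dma_addr_bytes_t and split_address are hypothetical names, and the address in the usage note is an arbitrary example value.

```c
#include <assert.h>
#include <stdint.h>

/* The three address bytes of a DMA channel, lowest byte first
 * (ADDR0, ADDR1, ADDR2 in the register names). */
typedef struct {
    uint8_t addr0;
    uint8_t addr1;
    uint8_t addr2;
} dma_addr_bytes_t;

static dma_addr_bytes_t split_address(uint16_t address){
    dma_addr_bytes_t bytes;
    bytes.addr0 = (uint8_t)((address >> 0) & 0xFF);
    bytes.addr1 = (uint8_t)((address >> 8) & 0xFF);
    bytes.addr2 = 0x00; /* upper byte stays zero for a 16-bit address */
    /* sanity check: the two low bytes reassemble to the original address */
    assert((uint16_t)(((uint16_t)bytes.addr1 << 8) | bytes.addr0) == address);
    return bytes;
}
```

For example, split_address(0x2ABC) yields addr0 = 0xBC, addr1 = 0x2A and addr2 = 0x00.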

ISR version

The interrupt version is much simpler - it just transmits the very first byte and then everything else is handled by the USART receive interrupt.

//can also be called from ISR context
void spi_transfer(volatile uint8_t *data, uint16_t length, spi_transfer_complete_callback_t cb, bool callback_from_isr){
    _transfer_complete_callback = cb;
    _transfer_buffer = data;
    _transfer_length = length;
    _transfer_complete_flag = false;
    _callback_from_isr = callback_from_isr;
    #ifdef SPI_USE_DMA
    /* ------ TX DMA channel setup ------ */
    DMA_SPI_TX.CTRLA = DMA_CH_RESET_bm;
    DMA_SPI_TX.ADDRCTRL = DMA_CH_SRCDIR_INC_gc | DMA_CH_DESTDIR_FIXED_gc | DMA_CH_DESTRELOAD_TRANSACTION_gc; //source - increment, destination (USARTxx.DATA) fixed
    DMA_SPI_TX.TRIGSRC = DMA_SPI_TX_TRIGGER_SOURCE;
    DMA_SPI_TX.DESTADDR0 = (((uint16_t)&SPIUSART.DATA) >> 0) & 0xFF;
    DMA_SPI_TX.DESTADDR1 = (((uint16_t)&SPIUSART.DATA) >> 8) & 0xFF;
    DMA_SPI_TX.DESTADDR2 = 0x00;
    DMA_SPI_TX.SRCADDR0 = (((uint16_t)data) >> 0) & 0xFF;
    DMA_SPI_TX.SRCADDR1 = (((uint16_t)data) >> 8) & 0xFF;
    DMA_SPI_TX.SRCADDR2 = 0x00; //internal SRAM 
    DMA_SPI_TX.TRFCNT = length; //transfer length

    /* ------ RX DMA channel setup ------ */
    DMA_SPI_RX.CTRLA = DMA_CH_RESET_bm;
    DMA_SPI_RX.ADDRCTRL = DMA_CH_SRCDIR_FIXED_gc | DMA_CH_DESTDIR_INC_gc | DMA_CH_SRCRELOAD_TRANSACTION_gc; //source (USARTxx.DATA) fixed, destination - increment
    DMA_SPI_RX.TRIGSRC = DMA_SPI_RX_TRIGGER_SOURCE;
    DMA_SPI_RX.SRCADDR0 = (((uint16_t)&SPIUSART.DATA) >> 0) & 0xFF;
    DMA_SPI_RX.SRCADDR1 = (((uint16_t)&SPIUSART.DATA) >> 8) & 0xFF;
    DMA_SPI_RX.SRCADDR2 = 0x00;
    DMA_SPI_RX.DESTADDR0 = (((uint16_t)data) >> 0) & 0xFF;
    DMA_SPI_RX.DESTADDR1 = (((uint16_t)data) >> 8) & 0xFF;
    DMA_SPI_RX.DESTADDR2 = 0x00; //internal SRAM 
    DMA_SPI_RX.TRFCNT = length; //transfer length

    /* -------- trigger both channels at once ---------- */
    DMA_SPI_TX.CTRLA = DMA_CH_ENABLE_bm | DMA_CH_SINGLE_bm | DMA_CH_BURSTLEN_1BYTE_gc; //enable channel, USART DRE will trigger the transmission, single shot mode - one byte per trigger
    DMA_SPI_RX.CTRLA = DMA_CH_ENABLE_bm | DMA_CH_SINGLE_bm | DMA_CH_BURSTLEN_1BYTE_gc; //enable channel, USART RXC will trigger the reception, single shot mode - one byte per trigger

    DMA_SPI_TX.CTRLB = DMA_CH_TRNINTLVL_LO_gc; //transfer complete interrupt enable
    DMA_SPI_RX.CTRLB = DMA_CH_TRNINTLVL_LO_gc; //transfer complete interrupt enable
    /* ------------------------------------------------- */
    #endif //#ifdef SPI_USE_DMA

    #ifdef SPI_USE_ISR
    cli();
    _transfer_index = 1;
    _transfer_rx_index = 0;
    SPIUSART.DATA = *data; //transfer first char, rest is handled by RXC ISR
    sei();
    #endif //#ifdef SPI_USE_ISR
}

Main loop task

The task function checks if the interrupt has set the complete flag and executes the callback function. As simple as that.

void spi_task(void){
    if (_transfer_complete_flag){
        _transfer_complete_flag = false;
        if (_transfer_complete_callback){
            _transfer_complete_callback((uint8_t*)_transfer_buffer, _transfer_length);
        } else {
            //no callback
        }
    }
}

Interrupt service routines - DMA version

Each channel generates its own interrupt. I could not get a channel to work without an interrupt at all (it would crash the application), so to make it work reliably each channel is simply disabled in its interrupt handler. The RX channel handler also checks whether the callback should be delivered from within the ISR (otherwise it will be delivered by the main loop task).

#ifdef SPI_USE_DMA
ISR(DMA_SPI_TX_vect){
    DMA_SPI_TX.CTRLA = 0; //disable channel
    DMA_SPI_TX.CTRLB = 0; //disable channel interrupts
}

ISR(DMA_SPI_RX_vect){
    DMA_SPI_RX.CTRLA = 0; //disable channel
    DMA_SPI_RX.CTRLB = 0; //disable channel interrupts
    if (_callback_from_isr){
        if (_transfer_complete_callback){ //callback may be NULL
            _transfer_complete_callback((uint8_t*)_transfer_buffer, _transfer_length);
        }
    } else {
        _transfer_complete_flag = true;
    }
}
#endif

Interrupt service routines - non-DMA version

The XMEGA USART (like the USARTs of all other AVRs) has three flags, each of which can generate an interrupt:

  • TXC - transmission complete - it is set when the last bit has actually been put on the wire
  • DRE - data register empty - the data register is buffered so it can be written with the next byte before the previous one has been completely transmitted out
  • RXC - receive complete - a byte has been received

When running as a UART the meaning of the flags is obvious: sending and reception work totally independently of each other, so the application knows when to expect TXC and DRE, but not RXC. In SPI mode everything is tightly coupled, because you receive at the same time as you send. This simplified the driver, because I can use the RXC interrupt only. When a byte has been received, it also means that the previous one has been transmitted, so the interrupt service routine reads the received byte, stores it in the buffer, and then either transmits the next byte if there are more or finishes the transmission by setting a flag or delivering the callback. Using a single ISR for sending and reception saves CPU time (separate handlers would require two interrupts per byte, and AVR does not have back-to-back interrupt chaining as ARM cores do, so it would just thrash the CPU).

#ifdef SPI_USE_ISR
//RXC interrupt is fired when an SPI byte transfer is complete
ISR(SPIUSART_RXC_vect){
    _transfer_buffer[_transfer_rx_index] = SPIUSART.DATA;
    _transfer_rx_index++;

    if (likely(_transfer_index < _transfer_length)){
        SPIUSART.DATA = _transfer_buffer[_transfer_index];
        _transfer_index++;
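    } else if (_callback_from_isr){
        if (_transfer_complete_callback){ //deliver directly from the ISR
            _transfer_complete_callback((uint8_t*)_transfer_buffer, _transfer_length);
        }
    } else {
        _transfer_complete_flag = true;
    }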
}
#endif
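The flow of the single-interrupt exchange can also be tried on a host with a small software model. Everything hardware-related here is an assumption made for illustration: the "USART" is a pair of variables, the fake slave answers every byte with its bitwise complement, and the "interrupt" is simply called in a loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ---- fake USART + slave (assumptions, illustration only) ---- */
static uint8_t fake_rx_data;     /* what a read of DATA would return */
static bool fake_rxc_pending;    /* models the RXC flag */
static unsigned fake_irq_count;  /* number of simulated RXC interrupts */

/* Writing DATA shifts a byte out; the fake slave answers with its complement. */
static void fake_usart_write(uint8_t byte){
    fake_rx_data = (uint8_t)~byte;
    fake_rxc_pending = true;
}
/* -------------------------------------------------------------- */

static uint8_t *xfer_buffer;
static uint16_t xfer_length, xfer_tx_index, xfer_rx_index;
static bool xfer_complete;

/* Same steps as the RXC handler: store the received byte,
 * then send the next one or finish. */
static void rxc_handler_model(void){
    fake_rxc_pending = false;
    fake_irq_count++;
    xfer_buffer[xfer_rx_index++] = fake_rx_data;
    if (xfer_tx_index < xfer_length){
        fake_usart_write(xfer_buffer[xfer_tx_index++]);
    } else {
        xfer_complete = true;
    }
}

/* Starts the exchange with the first byte and runs "interrupts" until done. */
static void exchange_model(uint8_t *buffer, uint16_t length){
    assert(buffer != 0 && length > 0);
    xfer_buffer = buffer;
    xfer_length = length;
    xfer_tx_index = 1;
    xfer_rx_index = 0;
    xfer_complete = false;
    fake_irq_count = 0;
    fake_usart_write(buffer[0]);
    while (fake_rxc_pending){
        rxc_handler_model();
    }
}
```

The model also shows why one buffer can serve both directions: the receive index always lags behind the transmit index, so a received byte only overwrites a byte that has already been sent.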

That's all :)

Complete code

Header file is at the beginning of the post.

#include "spi.h"
#include <avr/io.h>
#include <avr/interrupt.h>
#include <stdio.h> //for NULL definition

#if !defined(SPI_USE_ISR) && !defined(SPI_USE_DMA)
    #error "SPI type not defined!"
#endif

#if defined(SPI_USE_ISR) && defined(SPI_USE_DMA)
    #error "Define only one type of SPI operation! DMA or ISR!"
#endif

#define likely(x)       __builtin_expect(!!(x),1)

/* --------- private data --------------- */
/* All variables are volatile, because a callback function executed
 * from SPI RXC interrupt may request another transfer that will
 * modify them all. */
static volatile spi_transfer_complete_callback_t _transfer_complete_callback;
static volatile uint16_t _transfer_length;
static volatile uint8_t *_transfer_buffer;
static volatile uint16_t _transfer_index;
static volatile uint16_t _transfer_rx_index;
static volatile bool _transfer_complete_flag = false;
static volatile bool _callback_from_isr = false;

/* ----------- implementation ----------- */
void spi_init(void){
    cli();
    //use hardware USART for SPI
    SPIUSART.CTRLC = USART_CMODE_MSPI_gc | USART_CHSIZE_8BIT_gc; //SPI mode, MSB first
    SPIUSART.CTRLB = USART_TXEN_bm | USART_RXEN_bm; //enable receiver and transmitter

    SPIUSART.BAUDCTRLA = 0; //SPI clock is 2MHz
    SPIUSART.BAUDCTRLB = 0; //when F_CPU = 32MHz
    _transfer_complete_flag = false;

    #ifdef SPI_USE_DMA
    DMA.CTRL  = DMA_ENABLE_bm; //globally enable DMA, no double buffering, round-robin scheduling
    #endif

    #ifdef SPI_USE_ISR
    SPIUSART.CTRLA = USART_RXCINTLVL_LO_gc; //RX interrupt has low priority
    #endif

    sei();
}

void spi_override_callback(spi_transfer_complete_callback_t cb){
    _transfer_complete_callback = cb;
}

//can also be called from ISR context
void spi_transfer(volatile uint8_t *data, uint16_t length, spi_transfer_complete_callback_t cb, bool callback_from_isr){
    _transfer_complete_callback = cb;
    _transfer_buffer = data;
    _transfer_length = length;
    _transfer_complete_flag = false;
    _callback_from_isr = callback_from_isr;
    #ifdef SPI_USE_DMA
    /* ------ TX DMA channel setup ------ */
    DMA_SPI_TX.CTRLA = DMA_CH_RESET_bm;
    DMA_SPI_TX.ADDRCTRL = DMA_CH_SRCDIR_INC_gc | DMA_CH_DESTDIR_FIXED_gc | DMA_CH_DESTRELOAD_TRANSACTION_gc; //source - increment, destination (USARTxx.DATA) fixed
    DMA_SPI_TX.TRIGSRC = DMA_SPI_TX_TRIGGER_SOURCE;
    DMA_SPI_TX.DESTADDR0 = (((uint16_t)&SPIUSART.DATA) >> 0) & 0xFF;
    DMA_SPI_TX.DESTADDR1 = (((uint16_t)&SPIUSART.DATA) >> 8) & 0xFF;
    DMA_SPI_TX.DESTADDR2 = 0x00;
    DMA_SPI_TX.SRCADDR0 = (((uint16_t)data) >> 0) & 0xFF;
    DMA_SPI_TX.SRCADDR1 = (((uint16_t)data) >> 8) & 0xFF;
    DMA_SPI_TX.SRCADDR2 = 0x00; //internal SRAM 
    DMA_SPI_TX.TRFCNT = length; //transfer length

    /* ------ RX DMA channel setup ------ */
    DMA_SPI_RX.CTRLA = DMA_CH_RESET_bm;
    DMA_SPI_RX.ADDRCTRL = DMA_CH_SRCDIR_FIXED_gc | DMA_CH_DESTDIR_INC_gc | DMA_CH_SRCRELOAD_TRANSACTION_gc; //source (USARTxx.DATA) fixed, destination - increment
    DMA_SPI_RX.TRIGSRC = DMA_SPI_RX_TRIGGER_SOURCE;
    DMA_SPI_RX.SRCADDR0 = (((uint16_t)&SPIUSART.DATA) >> 0) & 0xFF;
    DMA_SPI_RX.SRCADDR1 = (((uint16_t)&SPIUSART.DATA) >> 8) & 0xFF;
    DMA_SPI_RX.SRCADDR2 = 0x00;
    DMA_SPI_RX.DESTADDR0 = (((uint16_t)data) >> 0) & 0xFF;
    DMA_SPI_RX.DESTADDR1 = (((uint16_t)data) >> 8) & 0xFF;
    DMA_SPI_RX.DESTADDR2 = 0x00; //internal SRAM 
    DMA_SPI_RX.TRFCNT = length; //transfer length

    /* -------- trigger both channels at once ---------- */
    DMA_SPI_TX.CTRLA = DMA_CH_ENABLE_bm | DMA_CH_SINGLE_bm | DMA_CH_BURSTLEN_1BYTE_gc; //enable channel, USART DRE will trigger the transmission, single shot mode - one byte per trigger
    DMA_SPI_RX.CTRLA = DMA_CH_ENABLE_bm | DMA_CH_SINGLE_bm | DMA_CH_BURSTLEN_1BYTE_gc; //enable channel, USART RXC will trigger the reception, single shot mode - one byte per trigger

    DMA_SPI_TX.CTRLB = DMA_CH_TRNINTLVL_LO_gc; //transfer complete interrupt enable
    DMA_SPI_RX.CTRLB = DMA_CH_TRNINTLVL_LO_gc; //transfer complete interrupt enable
    /* ------------------------------------------------- */
    #endif //#ifdef SPI_USE_DMA

    #ifdef SPI_USE_ISR
    cli();
    _transfer_index = 1;
    _transfer_rx_index = 0;
    SPIUSART.DATA = *data; //transfer first char, rest is handled by RXC ISR
    sei();
    #endif //#ifdef SPI_USE_ISR
}

void spi_task(void){
    if (_transfer_complete_flag){
        _transfer_complete_flag = false;
        if (_transfer_complete_callback){
            _transfer_complete_callback((uint8_t*)_transfer_buffer, _transfer_length);
        } else {
            //no callback
        }
    }
}

#ifdef SPI_USE_DMA
ISR(DMA_SPI_TX_vect){
    DMA_SPI_TX.CTRLA = 0; //disable channel
    DMA_SPI_TX.CTRLB = 0; //disable channel interrupts
}

ISR(DMA_SPI_RX_vect){
    DMA_SPI_RX.CTRLA = 0; //disable channel
    DMA_SPI_RX.CTRLB = 0; //disable channel interrupts
    if (_callback_from_isr){
        if (_transfer_complete_callback){ //callback may be NULL
            _transfer_complete_callback((uint8_t*)_transfer_buffer, _transfer_length);
        }
    } else {
        _transfer_complete_flag = true;
    }
}
#endif

#ifdef SPI_USE_ISR
//RXC interrupt is fired when an SPI byte transfer is complete
ISR(SPIUSART_RXC_vect){
    _transfer_buffer[_transfer_rx_index] = SPIUSART.DATA;
    _transfer_rx_index++;

    if (likely(_transfer_index < _transfer_length)){
        SPIUSART.DATA = _transfer_buffer[_transfer_index];
        _transfer_index++;
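    } else if (_callback_from_isr){
        if (_transfer_complete_callback){ //deliver directly from the ISR
            _transfer_complete_callback((uint8_t*)_transfer_buffer, _transfer_length);
        }
    } else {
        _transfer_complete_flag = true;
    }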
}
#endif

I release the code into the public domain.