ARM atomic operations

Edit: I’ve fixed parts of the atomic addition example following feedback from Nicolas Pitre. There were a couple of things that were just sloppy 4am stuff on my part, and a couple from just trying to be too much like the rest of the OpenMPI code. I’ve cleaned up the example and re-tested.

Modern 32-bit ARM processors are considerably more complex than previous generations of the architecture. Processors implementing version 5 of the architecture do not support Symmetric Multi-Processing (SMP), while newer architecture versions add not only full capability for SMP (and its related supporting architectural extensions), but even go as far as to enable designs based upon AMP (Asymmetric Multi-Processing[0]), commonly known as big.LITTLE. These newer features require changes to the underlying architecture to support memory ordering operations, cache-coherent access to shared memory, atomic operations, and so on. This article summarizes some of these changes, with a view toward Fedora developers needing to modify ARM code, especially to implement support for atomic operations. Far more extensive documentation is available online, especially on ARM’s own website.

Whenever multiple processor cores share physical access to system memory, unintended interactions between those processors become a concern. Fundamental Computer Science principles teach us about the importance of atomicity (and we will come to that), but there are deeper and more complex interactions that must be considered when introducing SMP, as ARMv6 did. Not only is the atomicity of operations important, but so is the ordering of their visibility to other processors. Modern CPUs execute instructions speculatively (meaning they may or may not actually take effect, based upon predicted-taken branches or the interaction of exceptions, and so on) and out-of-order (meaning operations may not always occur strictly in program order; the visible impact of this upon program execution depends upon various other factors), and they may not provide the same guarantees concerning visibility of these operations that programmers familiar with older architectures have come to expect.

In essence, ARM makes a number of tradeoffs in the interest of optimal performance (and lower energy). As a weakly ordered memory architecture, a local ARM processor core guarantees only self-consistent ordering of program operations. That is, the executing program will appear to manipulate external memory in strict program order only from the point of view of the local processor. External viewers (processors or devices) that independently observe the actions of the processor may see the effects of load and store re-ordering or other microarchitectural optimizations that are normally hidden in the non-SMP case. Thus one drawback of being weakly ordered is that the processor must be explicitly informed of memory dependencies that it cannot automatically determine. The outcome of a memory operation may otherwise be sitting in a CPU drain buffer waiting to be written back to memory (and thus the altered content is only visible to code running on the local processor node), may be sitting in a local cache within a CPU or CPU cluster (and thus visible only to a subset of processors), or may be sitting in an L2 cache that has yet to hit physical memory, among many other possibilities.

To ensure that a specific memory operation has completed in program order prior to the execution of subsequent operations, memory barriers are inserted. These instruct the CPU microarchitecture to ensure that the results of certain prior memory operations (in terms of program order) have been fully committed back to system memory, or at least as far as the system cache configuration will guarantee for the specific operation. Memory barriers don’t ensure that another CPU cannot interfere with the value in memory (they are not an atomicity guarantee), but they do serve to ensure that actions occur in a defined, known order. For example, writing to device memory configuration registers requires a memory barrier in order to ensure that the write has fully completed.
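The ordering problem described above can be sketched in portable C11, where one side writes a payload and then sets a flag, and the other side checks the flag before reading the payload. This is only an illustration: the names payload, ready, publish and try_consume are hypothetical, and the claim that these fences become a dmb is an assumption about how compilers typically target ARMv7, not something taken from the OpenMPI code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared state: a payload plus a flag announcing it. */
static int payload;
static atomic_bool ready;

/* Producer side: write the data, then publish the flag. */
static void publish(int value)
{
    payload = value;
    /* Release fence: the payload store is ordered before the flag store. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/* Consumer side: returns true and fills *out once the flag is observed. */
static bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;
    /* Acquire fence: the payload load is ordered after the flag load. */
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return true;
}
```

Without the two fences, a weakly ordered processor would be free to make the flag visible before the payload, and the consumer could read stale data even though each side executed its own code in program order.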

ARMv7 provides three forms of memory barrier:

  • DMB (Data Memory Barrier)
  • DSB (Data Synchronization Barrier)
  • ISB (Instruction Synchronization Barrier)

The first, DMB, ensures that all explicit memory operations have completed. It can be used to ensure that a memory store has taken effect. By default, it performs a “system” DMB operation – a full memory barrier. This can also be specified through the optional parameter “SY” to the DMB instruction, but it is assumed as the default if omitted. DMB supports several other options, including “ST” (wait only for stores), and various others pertaining to the extent of shareability domains that will be omitted here for simplification (extensive documentation exists). If in doubt, a “DMB” operation without a parameter is the strongest and safest.

The second and third memory barrier operations are related. DSB ensures that all instructions before the DSB have completed (with optional parameters similar to the DMB operation), while ISB flushes the CPU pipeline so that all instructions following it are fetched again after the ISB completes. ISB does not by itself imply a DSB, so self-modifying code sequences typically issue a DSB (after any required cache maintenance) followed by an ISB to ensure that the new code is executed. ISB technically has an option, but that can only be “SY”.

Memory barriers of this form were first introduced in ARMv6, where they are actually implemented as control register writes into the cp15 (co-processor number 15, the VMSA memory management control co-processor in the 32-bit ARM architecture, not present in the 64-bit ARMv8 architecture). ARM have now deprecated this approach in favor of specific ISA extensions in ARMv7, but for reasons of backward compatibility, these cp15 writes are still supported:

  • DMB: MCR p15, 0, r0, c7, c10, 5
  • DSB: MCR p15, 0, r0, c7, c10, 4
  • ISB: MCR p15, 0, r0, c7, c5, 4

ARMv7 simply requires use of the “DMB”, “DSB”, or “ISB” instructions for memory barrier insertion. An example of their use can be found in the OpenMPI Fedora package, within the opal/include/opal/sys/arm/atomic.h C header file, which contains the following on ARMv7:

#define MB()  __asm__ __volatile__ ("dmb" : : : "memory")
#define RMB() __asm__ __volatile__ ("dmb" : : : "memory")
#define WMB() __asm__ __volatile__ ("dmb" : : : "memory")

And the following usage on ARMv6 systems:

#define MB()  __asm__ __volatile__ ("mcr p15, 0, r0, c7, c10, 5" : : : "memory")
#define RMB() MB()
#define WMB() MB()
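The two variants can be selected at compile time. The following is a minimal sketch, not the OpenMPI source: it assumes a GCC or Clang toolchain that defines the __arm__ and __ARM_ARCH macros, it passes an explicit zero operand to the cp15 write rather than relying on whatever r0 happens to hold, and it falls back to the __sync_synchronize() builtin on any other target so that the code still builds. The store_then_barrier function is a hypothetical usage example.

```c
/* Choose a full memory barrier for the build target (illustrative only). */
#if defined(__arm__) && defined(__ARM_ARCH) && __ARM_ARCH >= 7
/* ARMv7: dedicated barrier instruction. */
#define MB()  __asm__ __volatile__ ("dmb" : : : "memory")
#elif defined(__arm__) && defined(__ARM_ARCH) && __ARM_ARCH == 6
/* ARMv6: cp15 write, with a zeroed source register. */
#define MB()  __asm__ __volatile__ ("mcr p15, 0, %0, c7, c10, 5" \
                                    : : "r" (0) : "memory")
#else
/* Anything else: let the compiler emit its own full barrier. */
#define MB()  __sync_synchronize()
#endif
#define RMB() MB()
#define WMB() MB()

/* Hypothetical usage: make a store visible before carrying on. */
static int store_then_barrier(volatile int *p, int v)
{
    *p = v;
    MB();
    return *p;
}
```

The "memory" clobber matters as much as the instruction itself: it stops the compiler from moving memory accesses across the barrier at compile time.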

The reader may wonder whether there is a need for memory barriers on ARMv5 systems. The answer is that there is, because the kernel-provided emulated atomic operations (explained further below) assume that barrier calls are being made wherever they are architecturally required following certain other operations (the helper may also perform its own context synchronizing operations). The ARM Linux kernel provides __kuser_helpers for emulation of missing features on older CPUs. They behave as if they were VDSOs (Virtual Dynamic Shared Objects), which are effectively shared libraries provided automatically by the kernel for every running process, but they are not full VDSOs on 32-bit ARM systems.

To use the __kuser_memory_barrier helper operation, which will do the right thing no matter the version of the ARM architecture, one need only have a kernel that provides __kuser_helper_version 3 or later (2.6.15 onward) and then define a macro as follows (an OpenMPI example):

#define MB() (*((void (*)(void))(0xffff0fa0)))()

Calls to MB() will result in a call to a void function provided directly by the kernel within the process memory map, as if it were a library.
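A more defensive variant checks the helper version before making the call. The sketch below is an illustration under stated assumptions: the version word address (0xffff0ffc) comes from the kernel's user helper documentation, the macro and function names are hypothetical, and the non-ARM branch exists only so that the example builds and runs on other machines.

```c
#include <stdint.h>

#if defined(__arm__) && defined(__linux__)
/* Fixed addresses published by the ARM Linux kernel user helpers. */
#define KUSER_HELPER_VERSION (*(volatile int32_t *)0xffff0ffc)
typedef void (kuser_dmb_t)(void);
#define KUSER_DMB (*(kuser_dmb_t *)0xffff0fa0)
#endif

/* Issue a barrier through the kernel helper.
 * Returns 0 on success, -1 if the running kernel lacks the helper
 * (in which case no barrier has been issued). */
static int kuser_memory_barrier(void)
{
#if defined(__arm__) && defined(__linux__)
    if (KUSER_HELPER_VERSION < 3)
        return -1;
    KUSER_DMB();
    return 0;
#else
    /* Stand-in so the sketch builds away from 32-bit ARM Linux. */
    __sync_synchronize();
    return 0;
#endif
}
```

The return value leaves the policy decision with the caller, which on a very old kernel may prefer to abort rather than run without ordering guarantees.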

Barriers are useful for program correctness in weakly ordered systems, but they are not sufficient to provide atomicity guarantees of the kind needed to implement atomic operations. To provide for atomic access to a given memory location, ARM processors implement a reservation engine model. A given memory location is first loaded using a special “load exclusive” instruction that has the side-effect of setting up a reservation against that given address in the CPU-local reservation engine. When the modified value is later written back into memory, using the corresponding “store exclusive” processor instruction, the reservation engine verifies that it has an outstanding reservation against that given address, and furthermore confirms that no external agents have interfered with the memory commit. A register returns success or failure.

The CPU can typically have only one outstanding reservation at a given time (so a load-exclusive must typically be followed shortly by a corresponding store-exclusive, as in the lock and atomic variable implementations for which the mechanism was created), and it will flush outstanding reservations at context switch time. This means that a load-exclusive/store-exclusive pair may fail and require repeated attempts until it is successful. Thus, a small loop is typically used, with the “store exclusive” operation returning a success or fail code in a register, followed by a comparison and a branch back to the start of the loop on failure.
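The same retry pattern appears in portable C11 as a compare-and-exchange loop. The sketch below is an analogy rather than ARM-specific code: add_with_retry is a hypothetical name, and the mapping of the “weak” compare-exchange onto a single load-exclusive/store-exclusive attempt is an assumption about typical compilers, though the C standard does permit the weak form to fail spuriously for exactly this kind of hardware.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically add inc to *v and return the new value. */
static int32_t add_with_retry(_Atomic int32_t *v, int32_t inc)
{
    int32_t old = atomic_load_explicit(v, memory_order_relaxed);

    /* On failure the call refreshes `old` with the current contents of *v,
     * so each retry recomputes old + inc from an up-to-date value. */
    while (!atomic_compare_exchange_weak_explicit(
               v, &old, old + inc,
               memory_order_seq_cst, memory_order_relaxed)) {
        /* retry: the reservation was lost or the value changed */
    }
    return old + inc;
}
```

A failed attempt costs only another trip around the loop, which is why the sequence between the load and the store should stay as short as possible.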

ARMv6 and later processors implement:

  • ldrex (Load Exclusive)
  • strex (Store Exclusive)

Ldrex is used similarly to “ldr”, with the following loading a memory location:

  • ldrex r0, [address]

Strex takes an additional parameter over a usual store operation:

  • strex r0, r1, [address]

With r0 containing the return result, and r1 containing the value to write back.

These operations are typically used in a tight loop to implement either an atomic variable manipulation routine (such as an increment or decrement operation), or a pair of lock acquisition and relinquishment routines. In the case of OpenMPI, an atomic 32-bit addition is performed as follows on ARMv6+:

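static inline int32_t opal_atomic_add_32(volatile int32_t* v, int inc)
{
   int32_t t;   /* value loaded from *v, then the sum */
   int tmp;     /* strex status: 0 on success */

   __asm__ __volatile__(
                         "1:  ldrex   %0, [%2]        \n"  /* t = *v, take reservation */
                         "    add     %0, %0, %3      \n"  /* t += inc */
                         "    strex   %1, %0, [%2]    \n"  /* try *v = t, status in tmp */
                         "    cmp     %1, #0          \n"  /* did the store succeed? */
                         "    bne     1b              \n"  /* no: start again */

                         : "=&r" (t), "=&r" (tmp)
                         : "r" (v), "r" (inc)
                         : "cc", "memory");

   return t;
}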
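The other typical use mentioned before the addition example, a pair of lock routines, can be sketched with the C11 atomic_flag type. This is an illustration and not OpenMPI code: the demo_lock_t type and the demo_* function names are hypothetical, and it is an assumption that an ARMv6+ compiler lowers the test-and-set to a load-exclusive/store-exclusive loop with the appropriate barriers.

```c
#include <stdatomic.h>

/* Hypothetical lock type wrapping a single atomic flag. */
typedef struct { atomic_flag held; } demo_lock_t;
#define DEMO_LOCK_INIT { ATOMIC_FLAG_INIT }

/* One attempt to take the lock: returns 1 if taken, 0 if already held. */
static int demo_try_acquire(demo_lock_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Spin until the lock is ours. */
static void demo_acquire(demo_lock_t *l)
{
    while (!demo_try_acquire(l)) {
        /* busy-wait; a real lock would back off or yield here */
    }
}

/* Give the lock up; release ordering publishes the critical section's writes. */
static void demo_release(demo_lock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

The acquire and release orderings take the place of the explicit barriers that a hand-written assembly lock would need after taking the lock and before dropping it.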

Notice how the ldrex operation is repeated until the corresponding strex reports that the addition was performed before another processor (or another task, since a context switch causes a flush of all reservations) interfered. As is the case for memory barriers, the ARM Linux kernel provides emulation of such atomic operations in the case that they are not supported by the hardware (on ARMv5 and earlier processors). The kernel-provided __kuser_cmpxchg atomic compare-and-exchange “helper” can be used to implement an atomic addition as in the following example, taken from the Documentation/arm/kernel_user_helpers.txt file in the kernel:

typedef int (__kuser_cmpxchg_t)(int oldval, int newval, volatile int *ptr);
#define __kuser_cmpxchg (*(__kuser_cmpxchg_t *)0xffff0fc0)

int atomic_add(volatile int *ptr, int val)
{
        int old, new;

        do {
                old = *ptr;
                new = old + val;
        } while(__kuser_cmpxchg(old, new, ptr));

        return new;
}
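Other read-modify-write operations follow the same shape. The sketch below builds a hypothetical atomic_max on a compare-and-exchange with the helper's convention (zero means the exchange happened). The demo_cmpxchg name is an assumption for illustration; on anything other than 32-bit ARM Linux it is backed by the compiler's __sync_bool_compare_and_swap builtin so that the example can be built and run elsewhere.

```c
#if defined(__arm__) && defined(__linux__)
/* Kernel helper: returns zero if *ptr was changed. */
typedef int (kuser_cmpxchg_t)(int oldval, int newval, volatile int *ptr);
#define demo_cmpxchg (*(kuser_cmpxchg_t *)0xffff0fc0)
#else
/* Stand-in with the same convention: zero means the swap happened. */
static int demo_cmpxchg(int oldval, int newval, volatile int *ptr)
{
    return !__sync_bool_compare_and_swap(ptr, oldval, newval);
}
#endif

/* Raise *ptr to at least val; returns the value held afterwards. */
static int atomic_max(volatile int *ptr, int val)
{
    int old;

    do {
        old = *ptr;
        if (old >= val)
            return old;   /* already large enough, nothing to store */
    } while (demo_cmpxchg(old, val, ptr));

    return val;
}
```

Only the computation of the new value differs from the addition example; the read, compute, and compare-and-exchange retry structure is unchanged.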

In the case of OpenMPI, the actual implementation has to be written in assembly code due to the design of the OpenMPI codebase. This is not typical, but is included here in the interest of completeness. Therefore, the following contrived call to the kernel user helper (equivalent to the much more readable and generally suitable example above) is actually performed:

START_FUNC(opal_atomic_add_32)
        push    {r4, lr}
        mov     r4, r1
        mov     r2, r0
        LSYM(4)
        ldr     r0, [r2]
        ldr     r3, REFLSYM(5)
        add     r1, r0, r4
        blx     r3
        bcc     REFLSYM(4)
        pop     {r4, lr}
        bx      lr
        .align  2
        LSYM(5)
        .word   0xffff0fc0
END_FUNC(opal_atomic_add_32)

The “START_FUNC”, “END_FUNC”, “REFLSYM”, and “LSYM” macros are pre-processed (by generate-asm.pl, using data provided by asm-data.txt and other files) during the build to produce code resembling the following:

opal_atomic_add_32:
        push    {r4, lr}
        mov     r4, r1
        mov     r2, r0
.L4:
        ldr     r0, [r2]
        ldr     r3, .L5
        add     r1, r0, r4
        blx     r3
        bcc     .L4
        pop     {r4, lr}
        bx      lr
        .align  2
.L5:
        .word   0xffff0fc0
        .align  2
        .global opal_atomic_add_32
        .type   opal_atomic_add_32, %function

All this code actually does is save the r4 and lr registers at function entry (as required by the Procedure Call Standard for the ARM Architecture, or AAPCS) before re-organizing the parameters passed in (which are in the ordering specific to OpenMPI’s internal implementation) into the order needed to call the kernel’s __kuser_cmpxchg function, which performs an atomic swap of the old and newly incremented values. The address of this kernel function (0xffff0fc0) is stored at the “.L5” symbol, and the function is used to atomically swap the “old” value read into r0 with the incremented value passed in r1. As is the case for most cmpxchg routines, the kernel will return a failure whenever the “old” value passed does not match the actual value of the memory prior to manipulation. The kernel is able to use special means to perform this compare atomically that non-kernel code on ARMv5 systems cannot use alone, guarding against interference. A result is returned in r0, with zero indicating success. The comparison to zero is actually done by the kernel helper, which sets the “C” (carry) condition code accordingly (set if the result is zero, clear otherwise) prior to returning, saving a subsequent compare operation. If the result is non-zero (carry is clear), the operation is repeated with a conditional (bcc) jump back to the “.L4” label, which re-reads the current value.

Modifying software to cope simultaneously with older ARM architectures as well as more modern ones is vastly simplified through the use of compatibility features, such as the __kuser helpers in the ARM Linux kernel. For more information, see the Documentation/arm/kernel_user_helpers.txt file within the Linux kernel source. For examples of use, refer to packages such as OpenMPI, which have already been modified for ARM systems.

[0] Asymmetric Multi-Processing (AMP), aka big.LITTLE, uses cores of differing performance, all ISA compatible, to provide different performance/power characteristics depending upon the workload.
