The BeagleBoard [part 1]

So I’ve been dissecting the BeagleBoard-xM OMAP3 port for a book project I’m working on in my spare time. This has also seen me brush up on my ARM assembler (PowerPC, POWER, and so forth are my first loves), in particular stuff that happens in “SVC” (Supervisor) mode. Here are some notes that might be of interest to others getting into Beagle or ARM porting (part 1).

ARM essentials

First, you should know that ARM (as it stands, in v7 of the architecture; we’ve yet to see the new 64-bit ISA released in v8 or whatever that will be) is a 32-bit processor IP core that uses fixed-width 32-bit instructions with various forms of instruction encoding. The ISA is designed to be simple, but it does include some very flexible instructions. For example, almost every instruction can be made conditional upon 4 bits of condition state (and that state is typically not updated automatically within the ALU unless an update is specifically encoded into the affecting instruction), and some instructions (additions, etc.) can include a shift of one of the operands (giving you DSP-ish instructions for free). Using conditional instructions keeps the pipeline full as opposed to having a lot of branching, and can render very small and tight sequences. Modern ARM cores implement a modified Harvard architecture, as indeed do most “Harvard architecture” processors these days.
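To make the condition state a little more concrete, here is a minimal Python sketch of how a compare sets the four flags (N, Z, C, V) and how a condition suffix is then checked against them. The function names are my own for illustration, and only some of the 16 condition encodings are modelled.

```python
# Toy model of ARM condition evaluation against the N, Z, C, V flags.
# Function names are illustrative; only a subset of conditions is modelled.

def flags_after_cmp(a, b):
    """Flags produced by `cmp a, b` (computes a - b) on 32-bit values."""
    a &= 0xFFFFFFFF
    b &= 0xFFFFFFFF
    result = (a - b) & 0xFFFFFFFF
    n = bool(result >> 31)            # result is negative
    z = result == 0                   # result is zero
    c = a >= b                        # ARM sets C when there is no borrow
    # Signed overflow: operands differ in sign and the result's sign differs from a
    v = bool(((a ^ b) & (a ^ result)) >> 31)
    return n, z, c, v


def condition_passes(cond, n, z, c, v):
    """Return True if an instruction carrying suffix `cond` would execute."""
    checks = {
        "eq": z,                      # equal
        "ne": not z,                  # not equal
        "cs": c,                      # carry set (unsigned higher or same)
        "cc": not c,                  # carry clear (unsigned lower)
        "mi": n,                      # negative
        "pl": not n,                  # positive or zero
        "ge": n == v,                 # signed greater than or equal
        "lt": n != v,                 # signed less than
        "gt": (not z) and n == v,     # signed greater than
        "le": z or n != v,            # signed less than or equal
        "al": True,                   # always
    }
    return bool(checks[cond])
```

So after a `cmp r0, r1`, something like `addlt r2, r2, #1` executes only when `condition_passes("lt", ...)` would be true for the resulting flags, with no branch involved.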

There are 30 general purpose registers, but not all of these are available at any one time. In fact, it is easier to think of ARM as having 15 “general purpose” registers with some banking being used in certain modes (such as FIQ or IRQ – Fast or regular IRQs) to replace a subset of the registers with a context-specific special set. Even better, think of it as 13 general purpose registers (r0-r12), with r13 used as the stack pointer (sp), r14 as the link register (lr), and r15 used as the program counter (pc). Even even better, think of it according to the ABI calling convention, with r0-r3 used for argument passing, r4-r11 used for local variables, r12 as a function call scratch register, r13 as sp, r14 as lr, and r15 as pc. Some instructions assume the use of sp, lr, and pc (write something into r15 and that will be taken as a branch address), but generally register use is flexible. Of the 4 different stack modes supported, ARM conventionally uses a “Full Descending” stack in which the sp points to the last value written, but you can use any of the “store multiple” or “load multiple” instructions either with their “stack” friendly names (e.g. stmfd), or directly (stmdb). Processor state is encoded in the CPSR (Current Processor Status Register), and saved in the SPSR (Saved Processor Status Register) when handling various exceptions (traps).
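As a rough illustration of what “Full Descending” means, the following Python sketch models a push with stmdb (stmfd) and a pop with ldmia (ldmfd). The class and its dictionary-backed memory are purely my own toy model, not anything from the kernel.

```python
class FullDescendingStack:
    """Toy model of a Full Descending stack: sp points at the last word written."""

    def __init__(self, top):
        self.sp = top        # initial sp; nothing is stored at this address
        self.mem = {}        # address -> 32-bit word

    def stmdb(self, regs):
        """Model `stmdb sp!, {...}` (same as stmfd): decrement, then store.

        `regs` is a list of (register_number, value) pairs. The lowest-numbered
        register ends up at the lowest address.
        """
        for _, value in sorted(regs, reverse=True):
            self.sp -= 4
            self.mem[self.sp] = value & 0xFFFFFFFF

    def ldmia(self, reg_numbers):
        """Model `ldmia sp!, {...}` (same as ldmfd): load, then increment."""
        loaded = {}
        for reg in sorted(reg_numbers):
            loaded[reg] = self.mem[self.sp]
            self.sp += 4
        return loaded
```

Pushing r4 and lr from an sp of 0x1000 leaves sp at 0xFF8 pointing at the saved r4, and popping the same register list restores sp to 0x1000.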

The ARM ISA is intentionally extensible. Either ARM, certain third parties (in limited situations), or co-processors may extend it. Co-processors are a means for the ARM core to remain simple while offloading certain functions to additional cores (that might actually be part of the same die). For example, the original FPA (Floating Point Accelerator) was implemented as a co-processor, and has long since been superseded by the VFP (Vector Floating Point) co-processor (currently at version 3 thereof). Another co-processor you might care about is cp15, the MMU or VMSA (Virtual Memory Systems Architecture). When the ARM sees an instruction it doesn’t recognize, it will try to find one of 16 possible co-processors that can implement it (each has a special pre-determined number, such as cp15 for the MMU) before hitting the illegal instruction handler. The latter allows some instructions to be implemented in software, via traps, such as the older floating point emulation in the Linux kernel. There are special instructions for copying data to/from co-processors, and each has its own self-contained/self-defined register set.

Modern ARM cores implement a very long pipeline internally, but the original design was three stage. And this leaks out into the visible value of, for example, the pc (which is generally 8 bytes away from where you think it should be – different pipeline stage). This is the reason why PC relative addressing can be confusing (though assemblers help to deal with this). Another issue is that the ARM fixed-width instructions didn’t traditionally have a direct parallel to the PPC “li” “ori” address loading sequence. Instead, a 4K pc-relative address could be easily loaded in one instruction, a 64K pc-relative in two, and certain other addresses could be loaded by means of the built-in barrel shifter used in (e.g.) the addition instruction or by using a “literal pool”, depending upon the assembler. This means that traditionally, only certain addresses could be loaded with a “LDR” “meta” assembly instruction, and that the assembler would try to replace your LDR with appropriate shifts (in possibly several instructions), etc. If it were unable to do so, it would generate a warning. You could also load addresses directly from memory and jump to them by loading them into the pc. ARM “bl” relative branches (which use a direct label address) have a 32-MB limit, traditionally requiring trampolines to handle larger jumps[0]. Since ARMv6T2, there is a movw instruction (load a 16-bit value into the lower half of a register and zero the upper part) and a movt instruction for loading the top 16 bits of a register (leaving the lower alone), which now allows a movw/movt combination akin to the li/ori combination on PowerPC. A movw/movt combo is used by, e.g. GCC when generating longer jumps (in which it loads the pc directly).
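Three of the points above are easy to sketch in Python: the classic data-processing immediate (an 8-bit value rotated right by an even amount, which is why only “certain” constants load in one instruction), the split of a 32-bit constant into movw/movt halves, and the pc reading 8 bytes ahead in ARM state. The helper names are my own.

```python
def is_arm_immediate(value):
    """True if `value` fits the classic ARM data-processing immediate:
    an 8-bit constant rotated right by an even amount (0, 2, ..., 30)."""
    value &= 0xFFFFFFFF
    for rot in range(0, 32, 2):
        # Rotating left by `rot` undoes a rotate-right by `rot`.
        undone = ((value << rot) | (value >> (32 - rot))) & 0xFFFFFFFF
        if undone < 0x100:
            return True
    return False


def movw_movt_halves(value):
    """Split a 32-bit constant into the (movw, movt) 16-bit immediates."""
    value &= 0xFFFFFFFF
    return value & 0xFFFF, value >> 16


def pc_as_read(instruction_address):
    """Value seen when an ARM-state instruction reads pc (r15)."""
    return instruction_address + 8
```

For instance, 0xFF000000 is encodable as a single immediate, 0x102 is not (it would need an odd rotation), and an arbitrary address such as 0xC0008000 splits into a movw of 0x8000 followed by a movt of 0xC000.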

The ABI for ARM was changed a few years ago from the OABI (old ABI) to the EABI (not to be confused with the PowerPC ABI of the same name). In particular, the system call number is no longer passed in the “comment” field of the system call instruction, but is instead passed in a register (faster than having the kernel poke at the memory to find what number was contained within that instruction). Also, 64-bit values are passed using sequential registers, beginning on an evenly aligned register number. The ARM implements a limited number of trap vectors, traditionally located at 0x0000_0000 but today possibly relocated high at 0xffff_0000 (this is the preference of the kernel, if it is possible, since it avoids having the NULL physical page over-used).
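The even-register rule for 64-bit values is easiest to see with a small model. This Python sketch assigns r0-r3 to a list of argument sizes; it is a simplification that handles only 32-bit and 64-bit integer arguments, and the function name is made up for the example.

```python
def assign_argument_registers(arg_sizes):
    """Assign r0-r3 to arguments given as sizes in bits (32 or 64).

    Returns one entry per argument: a tuple of register names, or the
    string "stack" once the argument no longer fits in r0-r3.
    """
    next_reg = 0
    placements = []
    for bits in arg_sizes:
        if bits == 64:
            next_reg += next_reg % 2      # round up to an even register number
            needed = 2
        else:
            needed = 1
        if next_reg + needed <= 4:
            placements.append(tuple(f"r{next_reg + i}" for i in range(needed)))
            next_reg += needed
        else:
            placements.append("stack")
            next_reg = 4                  # later arguments also go on the stack
    return placements
```

Passing a 32-bit value followed by a 64-bit value therefore uses r0 and then r2/r3, leaving r1 unused.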

Memory alignment is extremely important to ARM. Not only did it not have things like floating point for the longest time, but to keep the design simple, unaligned memory accesses weren’t really supported historically. Modern processors do have some support for handling unaligned access, but you generally want to avoid that. ARM deals in Bytes, Half-Words (16-bit), and Words (32-bit), none of this IA32 historical nonsense of “double words”. 64-bit values can be handled by means of multiple word instructions, and also manipulated in user code on some systems by use of the VFP, and so forth.

ARM don’t make processors directly. They define the ISA (currently v7, though processors still implement v4 and v5, etc.) and provide reference designs in the form of soft and hard (netlists, gates, etc.) cores to third parties. Depending upon the licensing (whether it is a foundry or merely a regular licensee), that third party might only have the right to implement what they are given. But if they have special foundry access, they can actually change the design. In any case, certain CPU functionality is optional. In the past, these options were designated by letters after the CPU (e.g. “TDMI” for older cores implementing the then-optional Thumb instruction set, debug support, an enhanced multiplier, and the EmbeddedICE in-circuit debug interface), but today profiles are generally used instead. Cortex (the latest generation) provides for A, R, and M profiles, of which the Application profile (A) is what we care about.

ARM kernel

Execution of the ARM kernel begins in the inferred standard location of arch/arm/kernel/head.S, at the place very obviously labeled with “Kernel startup entry point”. At this point, the MMU must be off, the D-cache must be off, the I-cache can be on or off, r0 must contain 0, r1 must contain the “machine number” (an ARM Linux standard assigned number, one per machine port, passed from the bootloader code), and r2 must contain the “ATAGS” pointer (a flexible data structure precursor to things like fdt and device trees that allows a bootloader to pass parameters). First, the processor mode is quickly set to ensure interrupts (FIQ and IRQ) are off, and that the processor is properly in Supervisor (SVC) mode. Then, MMU co-processor register c0 is copied into ARM register r9 to obtain the processor ID. This is followed by a call to __lookup_processor_type (contained within head-common.S, the common file for both MMU-enabled and non-MMU enabled ARM kernels – the latter are not covered by this document).

__lookup_processor_type behaves like a number of other functions in the early kernel setup. First, you should note that it has a C-ABI companion (used later, in higher level code) called lookup_processor_type that preserves stack and calling semantics for use from C code, then calls into __lookup_processor_type. Like so many others of these functions, __lookup_processor_type has a __lookup_processor_type_data friend that contains three data items:

The address of the structure
A pointer to the beginning of an array of “processor info” structures
A pointer to the end of an array of “processor info” structures

The function loads the address of __lookup_processor_type_data into r3, then uses an ldmia instruction to load the members of that structure into r4-r6. Since the first member of the structure contains the linked (relocated, virtual) address of the structure itself, and r3 contains the (currently) physical address, a simple subtraction is used to store into r3 the offset between the linked address of __lookup_processor_type_data and the actual address. This offset is then applied to registers r5 and r6 (containing the second and third members of that structure – __proc_info_begin and __proc_info_end). A loop is then entered to iterate over the proc info structure array and find a matching known processor. If a matching processor is found, its physical address is preserved in r5 on return; otherwise #0 (NULL) is written into r5 so that the calling code can determine whether a processor was found.
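The address fix-up and the search loop can be modelled in a few lines of Python. Note the assumptions: the text above does not say how an entry is matched, so the value/mask comparison here reflects my understanding of the proc info layout, and the ProcInfo tuple, the dictionary standing in for memory, and the entry stride of 4 are all invented for the sketch.

```python
from collections import namedtuple

# Hypothetical, simplified stand-in for one "processor info" entry.
ProcInfo = namedtuple("ProcInfo", "cpu_val cpu_mask name")

ENTRY_SIZE = 4  # illustrative stride between entries, not the real struct size


def lookup_processor_type(cpu_id, data_phys, data, memory):
    """Model of the lookup described above.

    cpu_id    -- processor ID read from the co-processor (r9)
    data_phys -- address the data structure is actually loaded at (r3)
    data      -- (linked_self_addr, linked_begin, linked_end), i.e. r4-r6
    memory    -- dict mapping physical addresses to ProcInfo entries
    Returns the physical address of the matching entry, or 0 if none matches.
    """
    linked_self, begin, end = data
    offset = data_phys - linked_self   # actual address minus linked address
    begin += offset                    # fix up r5 (__proc_info_begin)
    end += offset                      # fix up r6 (__proc_info_end)
    addr = begin
    while addr < end:
        info = memory[addr]
        if (cpu_id & info.cpu_mask) == info.cpu_val:
            return addr                # found: physical address of the entry
        addr += ENTRY_SIZE
    return 0                           # unknown processor
```

The point of the subtraction is that the same code works wherever the kernel image happens to be loaded, since everything is expressed relative to where the data structure actually sits.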

Once the processor type has been determined, a very similar function called __lookup_machine_type is called to find the machine type passed by the bootloader, and get the “machinfo” structure describing the machine in question. If either of these functions fails, a call is made to __error_p, which can be compiled (in debug mode) to output an ASCII error message to the UART (byte by byte) but at least gives a point to pick up in a hardware debugger. Next, a call is made to __vet_atags (which checks that the ATAGS in r2 begin with the ATAG_CORE magic marker[1], and so forth), then sometime later we call __create_page_tables to fudge up initial page table entries (for the MMU controlled through cp15) that cover just the kernel code and data. After that, we call through a few functions beginning with __enable_mmu that actually cause the MMU to get enabled and for the return to be to __mmap_switched (also in the head-common.S file). __mmap_switched also has an __mmap_switched_data, which is used to store various global values (such as processor_id, __machine_arch_type, __atags_pointer, and so forth). After zeroing out the BSS, __mmap_switched causes the high-level kernel start_kernel function to be entered as usual.
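For a feel of what is being vetted, here is a small Python sketch that builds and walks a tag list. It assumes the usual ATAGS layout of a two-word header (size in 32-bit words, then tag ID) followed by a payload, with the list ending in ATAG_NONE; the helper names are mine, and the real __vet_atags performs additional checks not shown here.

```python
import struct

ATAG_NONE = 0x00000000
ATAG_CORE = 0x54410001
ATAG_MEM = 0x54410002


def make_tag(tag, payload_words=()):
    """Pack one tag: the header is (size in 32-bit words, tag id)."""
    size = 2 + len(payload_words)
    return struct.pack(f"<{size}I", size, tag, *payload_words)


def walk_atags(blob):
    """Yield (tag, payload_words) up to ATAG_NONE.

    Raises ValueError if the list does not begin with ATAG_CORE.
    """
    offset = 0
    first = True
    while True:
        size, tag = struct.unpack_from("<2I", blob, offset)
        if first and tag != ATAG_CORE:
            raise ValueError("tag list does not start with ATAG_CORE")
        first = False
        if tag == ATAG_NONE:
            return
        payload = struct.unpack_from(f"<{size - 2}I", blob, offset + 8)
        yield tag, payload
        offset += size * 4             # size counts words, header included
```

A bootloader-style list could then be built as `make_tag(ATAG_CORE) + make_tag(ATAG_MEM, (size, start)) + struct.pack("<2I", 0, ATAG_NONE)` and fed to `walk_atags`.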

start_kernel (init/main.c) does a lot of things on every architecture/platform. These include calls to setup_arch, which on ARM calls setup_machine. This latter function returns an mdesc (machine descriptor) which is used in setting up any ATAGS, and for paging_init (which calls devicemaps_init that uses the mdesc to see if any special device IO maps are needed early on, for example for debugging and so forth). If ATAGS are not specified, a default init_tags set is used that defaults the machine to e.g. MEMSIZE of 16MB RAM, and other limited (but sane) defaults. After running through other generic early startup code, start_kernel causes the kernel_init thread to run, which calls do_basic_setup. do_basic_setup calls various initcalls, including the special customize_machine initcall. customize_machine calls init_machine (e.g. omap3_beagle_init), which adds various platform devices and does other board specific setup.
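The chain from machine number to board init can be summarized with a toy Python model. Everything here is a stand-in: the MachineDesc class, the machine number 1234, and the device names are invented, and the real kernel does far more between these steps.

```python
class MachineDesc:
    """Hypothetical stand-in for the kernel's machine descriptor (mdesc)."""

    def __init__(self, nr, name, init_machine):
        self.nr = nr
        self.name = name
        self.init_machine = init_machine


def setup_machine(nr, descriptors):
    """Return the descriptor matching the number passed by the bootloader."""
    for mdesc in descriptors:
        if mdesc.nr == nr:
            return mdesc
    raise LookupError(f"unknown machine number {nr}")


def boot(nr, descriptors):
    """Run the modelled sequence and return a log of what happened."""
    log = []
    mdesc = setup_machine(nr, descriptors)       # done from setup_arch
    log.append(f"setup_arch: {mdesc.name}")

    def customize_machine():                     # registered as an initcall
        mdesc.init_machine(log)

    for initcall in [customize_machine]:         # run by do_basic_setup
        initcall()
    return log


def example_board_init(log):
    """Hypothetical board init that registers two made-up platform devices."""
    for device in ("example-ehci", "example-mmc"):
        log.append(f"platform_device_register: {device}")
```

The useful thing to notice is the indirection: generic code never names the board, it only follows the function pointer stored in the descriptor that matched the machine number.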

When the board specific init function is run, it will typically call platform_device_register for each known platform device (you can see these in /sys/devices/platform for example). Some of these platform entries include a driver_name field, which will cause the actual device driver to be loaded, but others contain more limited data. For example, in the case of EHCI (USB) on the Beagle, an ehci platform device is registered, but what causes the actual USB driver to load (if not built-in) is that the OMAP EHCI driver contains a MODULE_ALIAS(“platform:omap-ehci”) that udev and modprobe will cause to load on boot. In the case of OMAP, it might seem like a lot of hardware addresses and so forth are missing from the platform registration, but this is because the hardware is standardized (EHCI is always in the same place for a given OMAP, and hard-coded into the USB driver). One thing that does differ is GPIO MUX assignment, and these pins/values are poked in the init_machine code for the Beagle.
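The alias-based loading can be pictured as simple pattern matching. In this Python sketch, the "platform:" prefix on the device name follows the MODULE_ALIAS string quoted above, while the module table, the module names, and the use of shell-style wildcards via fnmatch are my own assumptions for illustration rather than a description of how modprobe is implemented.

```python
from fnmatch import fnmatchcase

# Hypothetical module table: module name -> alias patterns it declares.
MODULE_ALIASES = {
    "ehci-hcd": ["platform:omap-ehci"],
    "example-codec": ["platform:example-codec*"],
}


def platform_modalias(device_name):
    """Build the alias string advertised for a platform device."""
    return "platform:" + device_name


def modules_for(device_name, table=MODULE_ALIASES):
    """Return the modules whose alias patterns match the device."""
    modalias = platform_modalias(device_name)
    return sorted(
        module
        for module, aliases in table.items()
        if any(fnmatchcase(modalias, pattern) for pattern in aliases)
    )
```

Registering a platform device named omap-ehci would thus be enough for userspace to find the driver module, without the board code ever referring to the driver directly.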

That’s all for now. More later.

Jon.

[0] Modern ARM also implements the Thumb instruction “BLX” (Branch with Link and optional eXchange) that can be used in combination with a series of assembler-generated mov/orr sequences to load an address into a register and branch to it. The older “bl” requires a pc-relative label address to jump to. In fact, when you use a sequence like “ldr reg, address” and “blx reg” the assembler may in fact generate 5 or more actual instructions using mov, orr, and bx or blx to perform the actual load and branch. Rather than switching to Thumb mode, the modern movw/movt combination is preferable for loading arbitrary addresses.

[1] I do wonder what significance this particular number has to rmk. If anyone knows why it was chosen – a birthday? or other date? etc. please do let me know.

This entry was posted on Friday, December 31st, 2010 at 4:44 pm and is filed under BeagleBoard, Embedded Linux, Fedora, General, Linux Kernel.

3 Responses to “The BeagleBoard [part 1]”

Urgentissimo says:

2011/01/03 at 11:04 am

Interesting article!
It would be nice to have an ARMv7 instruction table (and a clean list of execution privileges/limits) without having to follow the thousands and thousands of pages of the Cortex-A8 documentation. Can you point me to something? Thanks!

notzed says:

2011/01/03 at 9:44 pm

You may have seen this, but I was last year working on a project that did some booting and hardware banging. I haven’t had time to work on it for a while though & I hit some roadblocks to what I wanted to do.

Link in ‘website’ box.

Loc says:

2011/01/05 at 11:35 am

Russell King might know the history of the magic number. I believe he did the original port to ARM. Wonder if it is a permutation of his birthday.

jcm's blog
