Archive for the ‘Embedded Linux’ Category

The BeagleBoard [part 1]

Friday, December 31st, 2010

So I’ve been dissecting the BeagleBoard-xM OMAP3 port for a book project I’m working on in my spare time. This has also seen me brush up on my ARM assembler (PowerPC, POWER, and so forth are my first loves), in particular stuff that happens in “SVC” (Supervisor) mode. Here are some notes that might be of interest to others getting into Beagle or ARM porting (part 1).

ARM essentials

First, you should know that ARM (as it stands, in v7 of the architecture, we’ve yet to see the new 64-bit ISA released in v8 or whatever that will be) is a 32-bit processor IP core that uses fixed-width 32-bit instructions with various forms of instruction encoding. The ISA is designed to be simple, but it does include some very flexible (simple) instructions. For example, almost every instruction can be conditionalized based upon 4 bits of condition state (and that state is typically not updated automatically within the ALU unless it is specifically encoded into the affecting instruction) and some instructions (additions, etc.) can include shifts of one of the operands (giving you DSP-ish instructions for free). Using conditionalized instructions keeps the pipeline full as opposed to having a lot of branching, and can render very small and tight sequences. Modern ARM cores implement a modified Harvard architecture, as indeed are most “Harvard architecture” processors these days.

There are 30 general purpose registers, but not all of these are available at any one time. In fact, it is easier to think of ARM as having 15 “general purpose” registers with some banking being used in certain modes (such as FIQ or IRQ – Fast or regular IRQs) to replace a subset of the registers with a context-specific special set. Even better, think of it as 12 general purpose registers, with r13 used as the stack pointer (sp), r14 as the link register (lr), and r15 used as the program counter (pc). Even even better, think of according to the ABI calling convention, with r0-r3 used for argument passing, r4-r11 used for local variables, r12 as a function call scratch register, r13 as sp, r14 as lr, and r15 as pc. Some instructions assume the use of sp, lr, and pc (write something into r15 and that will be taken as a branch address), but generally register use is flexible. Of the 4 different stack modes supported, ARM conventionally uses a “Full Descending” stack in which the sp points to the last value written, but you can use any of the “store multiple” or “load multiple” instructions either with their “stack” friendly names (e.g. stmfd), or directly (stmdb). Processor state is encoded in the CPSR (Current Processor Status Register), and saved in the SPSR (Saved Processor Status Register) when handling various exceptions (traps).

The ARM ISA is intentionally extensible. Either ARM, certain third parties (in limited situations), or co-processors may extend it. Co-processors are a means for the ARM core to remain simple while offloading certain functions to additional cores (that might actually be part of the same die). For example, the original FPA (Floating Point Accelerator) was implemented as a co-processor, and has long since been deprecated by the VFP (Vector Floating Point) co-processor (currently at version 3 thereof). Another co-processor you might care about is cp15, the MMU or VMSA (Virtual Memory Systems Architecture). When an instruction the ARM doesn’t recognize is seen, it will try to find one of 16 possible co-processors that can implement it (each has a special pre-determined number, such as cp15 for the MMU) before hitting the illegal instruction handler. The latter allows some instructions to be implemented in software, via traps, such as the older floating point emulation in the Linux kernel. There are special instructions for copying data to/from co-processors, and each has its own self-contained/self-defined register set.

Modern ARM cores implement a very long pipeline internally, but the original design was three stage. And this leaks out into the visible value of, for example, the pc (which is generally 8 bytes away from where you think it should be – different pipeline stage). This is the reason why PC relative addressing can be confusing (though assemblers help to deal with this). Another issue is that the ARM fixed-width instructions didn’t traditionally have a direct parallel to the PPC “li” “ori” address loading sequence. Instead, a 4K pc-relative address could be easily loaded in one instruction, a 64K pc-relative in two, and certain other addresses could be loaded by means of the built-in barrel shifter used in (e.g.) the addition instruction or by using a “literal pool”, depending upon the assembler. This means that traditionally, only certain addresses could be loaded with a “LDR” “meta” assembly instruction, and that the assembler would try to replace your LDR with appropriate shifts (in possibly several instructions), etc. If it were unable to do so, it would generate a warning. You could also load addresses directly from memory and jump to those that way by means of loading them in to the pc. ARM “bl” relative branches (which use a direct label address) have a 32-MB limit, traditionally requiring trampolines to handle larger jumps[0]. Since ARMv6T2, there is a movt instruction for loading the top 16 bits of a register (leaving the lower alone), and since ARMv7 there is even a movw (load a value into lower 16 bits and zero the upper part), which now allows a movw/movt combination akin to the li/ori combination on PowerPC. A mov/movt combo is used by, e.g. GCC when generating longer jumps (in which it loads the pc directly).

The ABI for ARM was changed a few years ago from the OABI (old ABI) to the EABI (not to be confused with the PowerPC ABI of the same name). In particular, system calls are no longer implemented by passing the system call number in the “comment” field of the system call instruction, but are instead passed by means of register (faster than having the kernel poke at the memory to find what number was contained within that instruction). Also, 64-bit values are passed using sequential registers, beginning on an evenly aligned register number. The ARM implements a limited number of trap vectors, traditionally located at 0×0000_0000 but today possibly relocated high at 0xffff_0000 (this is the preference of the kernel, if it is possible, since it avoids having the NULL physical page over-used).

Memory alignment is extremely important to ARM. Not only did it not have things like floating point for the longest time, but to keep the design simple, unaligned memory accesses weren’t really supported historically. Modern processors do have some support for handling unaligned access, but you generally want to avoid that. ARM deals in Bytes, Half-Words (16-bit), and Words (32-bit), none of this IA32 historical nonsense of “double words”. 64-bit values can be handled by means of multiple word instructions, and also manipulated in user code on some systems by use of the VFP, and so forth.

ARM don’t make processors directly. They define the ISA (currently v7, though processors still implement v4 and v5, etc.) and provide reference designs in the form of soft and hard (netlists, gates, etc.) cores to third parties. Depending upon the licensing (whether it is a foundry or merely a regular licensee), that third party might only have the right to implement what they are given. But if they have special foundry access, they can actually change the design. In any case, certain CPU functionality is optional. In the past, these were designated by letters after the CPU (e.g. “TDMI” for older cores implementing the then-optional Thumb instruction set, supporting virtual memory, debug, etc.), but today profiles are generally used instead. Cortex (the latest generation) provides for A, R, and M profiles, of which the Application profile (A) is what we care about.

ARM kernel

Execution of the ARM kernel begins in the inferred standard location of arch/arm/kernel/head.S, at the place very obvious labeled with “Kernel startup entry point”. At this point, the MMU must be off, the D-cache must be off, I-cache can be on or off, r0 must contain 0, r1 must contain the “machine number” (an ARM Linux standard assigned number, one per machine port, passed from the bootloader code), and r2 must contain the “ATAGS” pointer (a flexible data structure precursor to things like fdt and device trees that allows a bootloader to pass parameters). First, the processor mode is quickly set to ensure interrupts (FIQ and IRQ) are off, and that the processor is properly in Supervisor (SVC) mode. Then, MMU co-processor register c0 is copied into ARM register r9 to obtain the processor ID. This is followed by a call to __lookup_processor_type (contained within head-common.S, the common file for both MMU-enabled and non-MMU enabled ARM kernels – the latter are not covered by this document).

__lookup_processor_type behaves like a number of other functions in the early kernel setup. First, you should note that it has a C-ABI companion (used later, in higher level code) called lookup_processor_type that also preserves stack and calling semantics for use from C-code, then calls into __lookup_processor_type. Like so many other of these functions, __lookup_processor_type has a __lookup_processor_type_data friend that contains three data items:

  • The address of the structure
  • A pointer to the beginning of an array of “processor info” structures
  • A pointer to the end of an array of “processor info” structures

The function loads the address of __lookup_processor_type_data into r3, then uses an ldmia instruction to load the members of that structure into r4-r6. Since the first member of the structure contains the (relocated) virtual address of the kernel, and r3 contains the (currently) physical address, a simple subtraction is used to store into r3 the offset between the linked address of __lookup_processor_type_data and the actual address. This offset is then applied to registers r5 and r6 (containing the second and third members of that structure – __proc_info_begin and __proc_info_end). A loop is then entered to iterate over the proc info structure array and find a matching known-processor. If a matching processor is found, the physical address is preserved in r5 on return, otherwise #0 (NULL) is written into it so that the calling code can determine whether a processor was found.

Once the processor type has been determined, a very similar function called __lookup_machine_type is called to find the machine type passed by the bootloader, and get the “machinfo” structure describing the machine in question. If either of these functions fails, a call is made to __error_p, which can be compiled (in debug mode) to output an error ASCII message to the UART (byte by byte) but at least gives a point to pick up in a hardware debugger. Next, a call is made to __vet_atags (which checks that the ATAGS in r2 begin with the ATAG_CORE magic marker[1], and so forth), then sometime later we call __create_page _tables to fudge up entries in cp15 that cover just the kernel code and data. After that, we call through a few functions beginning with __enable_mmu that actually cause the MMU to get enabled and for the return to be to __mmap_switched (also in the head-common.S file). __mmap_switched also has an __mmap_switched_data, which is used to store various global values (such as processor_id, __machine_arch_type, __atags_pointer, and so forth). After zeroing out the BSS, __mmap_switched causes the high-level kernel start_kernel function to be entered as usual.

start_kernel (init/main.c) does a lot of things on every architecture/platform. These include calls to setup_arch, which on ARM calls setup_machine. This latter function returns an mdesc (machine descriptor) which is used in setting up any ATAGS, and for paging_init (which calls devicemaps_init that uses the mdesc to see if any special device IO maps are needed early on, for example for debugging and so forth). If ATAGS are not specified, a default init_tags set is used that defaults the machine to e.g. MEMSIZE of 16MB RAM, and other limited (but sane) defaults. After running through other generic early startup code, start_kernel causes the kernel_init thread to run, which calls do_basic_setup. do_basic_setup calls various initcalls, including the special customize_machine initcall. Customize machine calls init_machine (e.g. omap3_beagle_init), which adds various platform devices and does other board specific setup.

When the board specific init function is run, it will typically call platform_device_register for each known platform device (you can see these in /sys/devices/platform for example). Some of these platform entries include a driver_name field, which will cause the actual device driver to be loaded, but others contain more limited data. For example, in the case of EHCI (USB) on the Beagle, an ehci platform device is registered, but what causes the actual USB driver to load (if not built-in) is that the OMAP EHCI driver contains a MODULE_ALIAS(“platform:omap-ehci”) that udev and modprobe will cause to load on boot. In the case of OMAP, it might seem like a lot of hardware addresses and so forth are missing from the platform registration, but this is because the hardware is standardized (EHCI is always in the same place for a given OMAP, and hard-coded into the USB driver). One thing that does differ is GPIO MUX assignment, and these pins/values are poked in the init_machine code for the Beagle.

That’s all for now. More later.

Jon.

[0] Modern ARM also implements the Thumb instruction “BLX” (Branch with Link and optional eXchange) that can be used in combination with a series of assembler-generated mov/orr sequences to load an address into a register and branch to it. The older “bl” requires a pc-relative label address to jump to. In fact, when you use a sequence like “ldr reg, address” and “blx reg” the assembler may in fact generate 5 of more actual instructions using mov, orr, and bx or blx to perform the actual load and branch. Rather than switching to Thumb mode, the modern mov/movt combination is preferable for loading arbitrary addresses.

[1] I do wonder what significance this particular number has to rmk. If anyone knows why it was chosen – a birthday? or other date? etc. please do let me know.

Using OpenOCD with BeagleBoard-xM

Monday, December 13th, 2010

NOTE: This is an extremely technical blog post intended only for very serious embedded Linux developers. It was written mostly for the benefit of Google. I wish that this posting had existed over the weekend when I was looking at this.

So I now have several Texas Instruments BeagleBoard-xM boards based upon the TI DM37x series Cortex-A8 based SoC. The “xM” is an improvement on the original BeagleBoard that features the improved DM3730 in place of the original OMAP3530. Like its predecessor, the DM37x series includes Texas Instruments’ on-chip “ICEpick” JTAG TAP controller that can be used to selectively expose various additional JTAG TAPs provided by other on-chip devices. This facilities selective exposure of these devices (which may not be present in the case of the “AM” version of the chip, or otherwise might be in a low-power or otherwise disabled state), because it is nice to be able to ignore devices we don’t want to play with today. The ICEpick itself can handle up to 16 devices, although the majority of those are “reserved” for the scarily more complex days of tomorrow. If you want some of the details, refer to chapter 27 of the DaVinci Digital Media Processor Technical Reference Manual (TRM) (under “User Guides”).

The full possible DM37x JTAG chain contains the following devices (in OpenOCD ordering, from TDO to TDI – the inverse of the physical ordering of the bus, which starts with the ICEpick at 0 and ends with the DAP at 3): DAP (Debug Access Port), Sequencer (ARM968), DSP, D2D, ICEpick. It’s the last device in the chain that we really care about because it is through the Debug Access Port that we will poke at the system memory address space and registers within the Cortex-A8 processor. At Power ON, the default JTAG chain configuration will depend upon the strapping of the EMU0 and EMU1 lines (which are exposed on the Flyswatter as jumpers adjacent to the JTAG ribbon cable). If both of these lines are high (pins 1-2 not 0-1 on the Rev-B Flyswatter, i.e. nearest to the ribbon cable) – which they had better be if you want an “out of the box experience” – then the only device to appear in the chain will be the ICEpick. OpenOCD is expecting this, because it contains special macro functions that will instruct the ICEpick through its control command registers to expose any additional TAPs after initialization (we actually only want the DAP in the default case, which is how OpenOCD will behave for you if you don’t change it). Each time a new TAP is exposed, the OpenOCD JTAG logic will do the necessary state transition for it to take effect (for the chain to lengthen or contract according to devices appearing and disappearing in the chain).

The default upstream version of OpenOCD does not work with BeagleBoard-xM at the moment (as of v0.4.0-651-gc6e0705 – last modification on December 8th) because some logic was recently added to master that attempts to handle busticated DAP ROM addresses on a Freescale (not Texas Instruments) IMX51 processor, which is very similar to the DM37x. Due to this “fixup”, the DM37x’s correctly provided ROM table information will be “fixed” to use the wrong DAP address, which won’t work. For the moment, find the following loop in src/target/arm_adi_v5.c:

        /* Some CPUs are messed up, so fixup if needed. */
        for (i = 0; i < sizeof(broken_cpus)/sizeof(struct broken_cpu); i++)
                if (broken_cpus[i].dbgbase == dbgbase &&
                        broken_cpus[i].apid == apid) {
                        LOG_WARNING("Found broken CPU (%s), trying to fixup "
                                "ROM Table location from 0x%08x to 0x%08x",
                                broken_cpus[i].model, dbgbase,
                                broken_cpus[i].correct_dbgbase);
                        dbgbase = broken_cpus[i].correct_dbgbase;
                        break;
                }

Wrap it with a #if 0 (and don't forget to also comment or define away the definition of "i") such that it won't take effect. Then, if you have a Rev-B xM you will also need to edit the file tcl/target/amdm37x.cfg to add a new TAPID for the undocumented revision recently made to the ICEpick on these TI parts (this is backward compatible with the Rev-A because it will now look for both):

   ...
   switch $CHIPTYPE {
      dm37x {
         # Primary TAP: ICEpick-C (JTAG route controller) and boundary scan
         set _JRC_TAPID  0x0b89102f
         set _JRC_TAPID1 0x1b89102f
      }
   ...
...
# Primary TAP: ICEpick - it is closest to TDI so last in the chain
jtag newtap $_CHIPNAME jrc -irlen 6 -ircapture 0x1 -irmask 0x3f \
   -expected-id $_JRC_TAPID -expected-id $_JRC_TAPID1

Then make and install the OpenOCD binaries (ensure you have the free FTDI libraries installed, and configure with “configure –enable-maintainer-mode –enable-ft2232_libftdi”). You can now use OpenOCD with the xM revA or B.

Jon.