ARM assembler examples

Sections

Numeric examples

Games

List examples

I/O examples

Numeric examples

The following examples illustrate the use of the assembler with integer parameters. Many of them are based on the uLisp Benchmarks.

16th March 2020: Added the example Sum of integers in a tree.

19th June 2020: Added the example Population count.

Fibonacci sequence

The Fibonacci sequence is:

1, 1, 2, 3, 5, 8, 13, 21 ...

where the first two terms are 1, and each subsequent term is the sum of the two previous terms. The following recursive function finds the nth term, counting from 0:

(defun fib (n)
  (if (< n 3) 1
    (+ (fib (- n 1)) (fib (- n 2)))))

Running the Lisp version on an Adafruit Metro M4:

> (time (fib 27))
196418
Time: 24.6 s

Here's the assembler version:

; Fibonacci sequence
(defcode fib (x)
  fib
  ($push '(r4 r5 r6 lr))
  ($mov 'r5 'r0)
  ($mov 'r4 0)
  mainloop
  ($cmp 'r5 2)
  ($ble add)
  ($sub 'r0 'r5 1)
  ($bl fib)
  ($sub 'r5 2)
  ($add 'r4 'r4 'r0)
  ($b mainloop)
  add
  ($add 'r0 'r4 1)
  ($pop '(r4 r5 r6 pc)))

Running the assembler version on an Adafruit Metro M4:

> (time (fib 27))
196418
Time: 61 ms

The assembler version is approximately a factor of 400 faster.

Takeuchi function

The Takeuchi function is a classic benchmark for comparing implementations of Lisp, originally used by Ikuo Takeuchi of Japan. Here's the Lisp version:

(defun tak (x y z)
  (if (not (< y x))
      z
    (tak
     (tak (1- x) y z)
     (tak (1- y) z x)
     (tak (1- z) x y))))

For example:

> (time (tak 18 12 6))
7 
Time: 5.2 s

Here's the assembler version:

(defcode tak (x y z)
  tak
  ($push '(lr))
  ($sub 'r3 'r1 'r0)
  ($bcc less)
  ($mov 'r0 'r2)
  ($pop '(pc))
  less
  ($push '(r2 r1 r0))
  ($sub 'r0 1)
  ($bl tak)
  ($mov 'r3 'r0)
  ($pop '(r2))
  ($pop '(r0 r1))
  ($push '(r3 r2 r1 r0))
  ($sub 'r0 1)
  ($bl tak)
  ($mov 'r3 'r0)
  ($pop '(r2))
  ($pop '(r0 r1))
  ($sub 'r0 1)
  ($push '(r3))
  ($bl tak)
  ($mov 'r2 'r0)
  ($pop '(r1))
  ($pop '(r0))
  ($bl tak)
  ($pop '(pc)))

Run it as follows:

> (time (tak 18 12 6))
7 
Time: 13 ms

On a Adafruit Metro M4 the assembler version is a factor of 400 faster.

Hofstadter Q sequence

This is one of several recursive sequences described in Douglas Hofstadter's book "Gödel, Escher, Bach: an Eternal Golden Braid". It is related to the Fibonacci sequence, except that in this case the two preceding terms specify how far to go back in the sequence to find the two terms to be summed:

(defun q (n)
  (if (<= n 2) 1
    (+
     (q (- n (q (- n 1))))
     (q (- n (q (- n 2)))))))

Running the Lisp version:

> (time (q 21))
12 
Time: 6.8 s

Here's the assembler version:

; Hofstadter Q sequence
(defcode q (x)
  q
  ($push '(r4 r5 r6 lr))
  ($mov 'r4 'r0)
  ($mov 'r5 0)
  compare
  ($cmp 'r4 2)
  ($ble add)
  ($sub 'r0 'r4 1)
  ($bl q)
  ($sub 'r0 'r4 'r0)
  ($bl q)
  ($mov 'r6 'r0)
  ($sub 'r0 'r4 2)
  ($bl q)
  ($add 'r5 'r5 'r6)
  ($sub 'r4 'r4 'r0)
  ($b compare)
  add
  ($add 'r0 'r5 1)
  ($pop '(r4 r5 r6 pc)))

Running the assembler version:

> (time (q 21))
12 
Time: 24 ms

In other words, about 270 times faster.

Factor

This function takes a simple approach to finding the least prime factor of a number:

(defun factor (n)
  (let ((d 2) (i 1))
    (loop
     (when (> (* d d) n) (return n))
     (when (zerop (mod n d)) (return d))
     (incf d i) (setq i 2))))

If the number is prime, factor will print the number itself. To find the least prime factor of 2146654199 (46327 x 46337):

> (time (factor 2146654199))
46327 
Time: 8.2 s

Here's an equivalent assembler version:

; Least prime factor
(defcode factor (x)
  #| Main program |#
  factor
  ($push '(r4 r5 lr))
  ($mov 'r5 'r0)
  ($mov 'r4 2)
  test
  ($mov 'r3 'r4)
  ($mul 'r3 'r4)
  ($cmp 'r3 'r5)
  ($bgt returnn)
  ($mov 'r1 'r4)
  ($mov 'r0 'r5)
  ($bl rem)
  ($cmp 'r0 0)
  ($beq returnd)
  ($add 'r4 1)
  ($b test)
  returnn
  ($mov 'r0 'r5)
  ($pop '(r4 r5 pc))
  returnd
  ($mov 'r0 'r4)
  ($pop '(r4 r5 pc))
  #| Remainder subroutine |#
  rem
  ($push '(r4 lr))
  ($mov 'r3 0)
  ($mov 'r2 32)
  loop
  ($lsl 'r0 1)
  ($adc 'r3 'r3)
  ($cmp 'r3 'r1)
  ($blt skip)
  ($sub 'r3 'r1)
  ($add 'r0 1)
  skip
  ($sub 'r2 1)
  ($bne loop)
  ($mov 'r0 'r3)
  ($pop '(r4 pc)))

It uses a subroutine, rem, to calculate the remainder when an unsigned 32-bit number is divided by a 32-bit divisor. On an ARM Cortex M4 board you could replace this with a udiv instruction (and a mul).

Doing the same test to find the least prime factor of 2146654199 (46327 x 46337):

> (time (factor 2146654199)
46327 
Time: 145 ms

In other words, approximately 60 times faster.

Factorize

You can use the above function as the basis for a simple recursive routine to factorize a number into a list of its prime factors:

(defun factorize (n)
  (let ((f (factor n)))
    (if (= n f) (list n) (cons f (factorize (/ n f))))))

For example:

> (factorize 731731731)
(3 17 43 333667)

Population count

This function counts the number of '1' bits in a 32-bit integer. There are several efficient ways to calculate this; the following routine is efficient when there are relatively few '1' bits in the number [1]

Here's a Lisp version:

(defun popcount (n)
  (let ((count 0))
    (when (minusp n) (setq n (logand n #x7FFFFFFF)) (incf count))
    (loop
     (when (zerop n) (return count))
     (incf count)
     (setq n (logand n (1- n))))))

Because uLisp works with signed integers we need to handle the case of n negative separately.

For example:

> (popcount #x87654321)
13
Here's the assembler version:
; Population count
(defcode popcount (n)
  ($mov 'r1 0)
  ($cmp 'r0 0)
  ($beq ret)
  loop
  ($add 'r1 1)
  ($sub 'r2 'r0 1)
  ($and 'r0 'r2)
  ($bne loop)
  ret
  ($mov 'r0 'r1)
  ($bx 'lr))

Reversebits

This is a function to efficiently reverse the order of bits in a 32-bit number [2]. Here's the uLisp version:

(defun reversebits (x)
  (setq x (logior (ash (logand x #x55555555) 1)
                  (logand (ash x -1) #x55555555)))
  (setq x (logior (ash (logand x #x33333333) 2)
                  (logand (ash x -2) #x33333333)))
  (setq x (logior (ash (logand x #x0f0f0f0f) 4)
                  (logand (ash x -4) #x0f0f0f0f)))
  (setq x (logior (ash x 24) (ash (logand x #xff00) 8)
                  (logand (ash x -8) #xff00) (ash x -24))))

For example:

> (reversebits #x12345678)
510274632

where 510274632 is #x1E6A2C48, which you'll see is the right answer.

Here's the assembler version, which demonstrates the use of in-line 32-bit constants:

; Reverse bits in a 32-bit word
(defcode reversebits (n)
  rev
  ($push '(r4 lr))
  ; Swap odd and even
  ($ldr 'r2 fives)
  ($lsr 'r1 'r0 1)
  ($lsl 'r0 1)
  ($and 'r1 'r2)
  ($bic 'r0 'r2)
  ($orr 'r0 'r1)
  ; Swap adjacent pairs
  ($ldr 'r1 threes)
  ($lsr 'r2 'r0 2)
  ($lsl 'r0 2)
  ($and 'r2 'r1)
  ($bic 'r0 'r1)
  ($orr 'r2 'r0)
  ; Swap adjacent nibbles
  ($ldr 'r0 nibbles)
  ($lsr 'r1 'r2 4)
  ($lsl 'r2 4)
  ($and 'r1 'r0)
  ($bic 'r2 'r0)
  ($orr 'r1 'r2)
  ; Reverse all bytes
  ($rev 'r0 'r1)
  ($pop '(r4 pc))
  fives
  ($word #x55555555)
  threes
  ($word #x33333333)
  nibbles
  ($word #x0f0f0f0f))

Testing it:

> (reversebits #x12345678)
510274632

Modulo 10

The following neat routine returns its argument modulo 10:

(defcode mod10 (n)
  ($ldr 'r1 magic)
  ($mul 'r1 'r0)
  ($lsr 'r1 19)
  ($mov 'r2 10) 
  ($mul 'r1 'r2)
  ($sub 'r0 'r1)
  ($bx 'lr)
  magic
  ($word 52429))

It is valid for arguments up to 81919. It avoids the need for division by multiplying by 52429/2^19, which is a very close approximation to 0.1.

Games

Bulls & Cows

The following bullcow function calculates the score between a guess and a code, represented as hexadecimal numbers with a specified number of digits. The score is returned as a two-digit hexadecimal number, bulls followed by cows, where bulls are digits correct and in the correct position (bull's-eyes), and cows are digits correct but not in the right position. Here's the Lisp version:

(defvar *spectrum* (let (lst) (dotimes (i 16 lst) (push 0 lst))))

(defun bullcow (digits guess code)
  (let ((score 0))
    (dotimes (i 16) (setf (nth i *spectrum*) 0))
    (dotimes (d digits)
      (let ((da (mod guess 16))
            (db (mod code 16)))
        (cond
         ((= da db) (incf score 16))
         (t 
          (when (<= (incf (nth da *spectrum*)) 0) (incf score))
          (when (>= (decf (nth db *spectrum*)) 0) (incf score))))
        (setq guess (truncate guess 16))
        (setq code (truncate code 16))))
    score))

For example:

> (bullcow 6 #x123456 #x456430)

gives the result #x13 (19) because there is one bull (the 4), and three cows (the 3, 5, and 6).

Here's the assembler version:

; Bulls & Cows
(defcode bullcow (d guess code)
  ($push '(r4 r5 r6 r7 lr))
  ($sub 'sp 16)
  ($mov 'r7 'sp)
  ($mov 'r4 15)
  ($mov 'r3 0)
  clear
  ($strb 'r3 '(r7 r4))
  ($sub 'r4 1)
  ($bcs clear)
  digits
  ($mov 'r4 #xf)
  ($and 'r4 'r1)
  ($mov 'r5 #xf) 
  ($and 'r5 'r2)
  ($cmp 'r4 'r5)
  ($bne cows)
  bulls
  ($add 'r3 16)
  ($b no2)
  cows
  ($ldrsb 'r6 '(r7 r4))
  ($add 'r6 1)
  ($strb 'r6 '(r7 r4))
  ($bgt no1)
  ($add 'r3 1)
  no1
  ($ldrsb 'r6 '(r7 r5))
  ($sub 'r6 1)
  ($strb 'r6 '(r7 r5))
  ($blt no2)
  ($add 'r3 1)
  no2
  ($lsr 'r1 4)
  ($lsr 'r2 4)
  ($sub 'r0 1)
  ($bne digits)
  ($mov 'r0 'r3)
  ($add 'sp 16)
  ($pop '(r4 r5 r6 r7 pc)))

For more information about the game Bulls & Cows, and a uLisp program to play the game, see Bulls & Cows game.

List examples

Any of the arguments to a machine-code function can be a list, in which case the address of the list is passed to the routine in the corresponding register r0 to r3.

For example, if the list is the first parameter its address will be in r0, and you can then load the car of the list into r1 with:

($ldr 'r1 '(r0 0))

and the cdr of the list into r1 with:

($ldr 'r1 '(r0 4))

Product of list elements

The following example returns the product of all the numbers in a list:

; Product of list elements
(defcode product (x)
  ($mov 'r2 1)
  repeat
  ($cmp 'r0 0)
  ($beq finished)
  ($ldr 'r1 '(r0 0))
  ($ldr 'r1 '(r1 4))
  ($mul 'r2 'r1)
  ($ldr 'r0 '(r0 4))
  ($b repeat)
  finished
  ($mov 'r0 'r2)
  ($bx 'lr))

For example:

> (product '(1 2 3 4 5 6))
720

It works by multiplying the number pointed to by the car of the list by the running product (in r2), and then taking the cdr of the list, until the list is nil (zero), indicating the end of the list. To understand how this works see Objects.

Sum of integers in a tree

The following recursive Lisp program returns the sum of all the integers in a tree:

(defun sum (tree)
  (cond
   ((null tree) 0)
   ((numberp tree) tree)
   (t (+ (sum (car tree)) (sum (cdr tree))))))

For example:

> (sum '(1 2 (3 4 (5 6 (7 8)) 9 10) 11 12))
78

Here's the equivalent assembler version:

; Sum of tree
(defcode sum (tree)
  sum
  ($push '(r4 r5 lr))
  ($mov 'r4 'r0)
  ($cmp 'r0 0)
  ($beq return)
  ($ldr 'r0 '(r4 0))
  ($cmp 'r0 6)
  ($bne notnumber)
  ($ldr 'r0 '(r4 4))
  ($b return)
  notnumber
  ($bl sum) ; car
  ($mov 'r5 'r0)
  ($ldr 'r0 '(r4 4)) ;cdr
  ($bl sum)
  ($add 'r0 'r5)
  return
  ($pop '(r4 r5 pc)))

I/O examples

LED

The following examples show how to turn on and off the LED provided on most ATSAMD21 boards. The LED is usually on PA17.

The LED is controlled by writing to bit 17 in a register with the appropriate offset from the port base address, #x41004400. The following offsets are available to clear or set the direction of the pin, and clear, set, or toggle the output state:

  Offset Action
DIRCLR #x04 Pin as input
DIRSET #x08 Pin as output
OUTCLR #x14 Set pin low
OUTSET #x18 Set pin high
OUTTGL #x1C Toggle pin

Here are the routines:

(defvar dirset #x08)
(defvar outset #x18)
(defvar outclr #x14)

(defcode on ()
  ($ldr 'r2 base)
  ($mov 'r1 1)
  ($lsl 'r3 'r1 17)
  ($str 'r3 '(r2 dirset))
  ($str 'r3 '(r2 outset))
  ($bx 'lr)
  base
  ($word #x41004400))

(defcode off ()
  ($ldr 'r2 base)
  ($mov 'r1 1)
  ($lsl 'r3 'r1 17)
  ($str 'r3 '(r2 outclr))
  ($bx 'lr)
  base
  ($word #x41004400))

The (on) routine defines the pin as an output and takes it low to turn the LED on. The (off) routine takes the pin high.

Blink

The following blink routine generates a delay using a loop. The innermost loop takes 3 cycles, so the delay is 3 x 8000000 cycles, or approximately half a second with a 48MHz clock. The parameter specifies the number of toggles:

(defvar dirset #x08)
(defvar outtgl #x1c)

(defcode blink (n)
  ($ldr 'r2 base)
  ($mov 'r1 1)
  ($lsl 'r3 'r1 17)
  ($str 'r3 '(r2 dirset))
  count
  ($ldr 'r1 delay)
  loop
  ($sub 'r1 1)
  ($bne loop)
  ($str 'r3 '(r2 outtgl))
  ($sub 'r0 1)
  ($bne count)
  ($bx 'lr)
  base
  ($word #x41004400)
  delay
  ($word 8000000))

  1. ^ See Hamming weight on Wikipedia.
  2. ^ From "Hacker's Delight" by Henry S. Warren, Jr., page 129.