OpenBSD Journal

Developer blog: marco

Contributed by marco from the hppa-is-weird dept.

As a regular user of hppa hardware, I have been wanting to use AFS on hppa for a while now. Sadly, AFS requires LWP threads, and the current code in there simply does not work; it's some sort of HP-UX pile of doo doo, so I didn't bother. Anyway, Todd told me that there is a pthread alternative, but alas, we don't have any threading support on hppa as of today. All this, combined with Ted's recent flurry of commits to rthreads, got me excited, and I started to scratch my itch. Shouldn't have...

The first part, getting the _atomic_lock.c file in, was simple (and wrong, as I found out later, but that's for another time). So now I am feeling all 1337, only to find out that it needs another piece called rfork_thread. Crap, that one is actually hard, so I decided it was great for someone else. Needless to say, by midday the itch *had* to be scratched, so I started looking at various implementations (macppc, i386, amd64). The i386 one was simple enough, so I decided to *only* have a look.

Hours later...

I am still wading through the hppa documentation. Wow, this is not CISCy at all. This is some crazy ass assembly. For example, when you branch, the stupid CPU executes one more instruction after the branch, regardless of whether the branch is taken or not. That's stupid! Well, at least at first sight it is. Now that I have seen it in action, it is actually quite clever.

Check this out:


#include <stdio.h>

int moo(int i)
{
        printf("%d\n", i);
        return (i);
}

int main(int argc, char *argv[])
{
        int i = 0;

        for (i = 0; i < 10; i++) {
                printf("%d\n", i);
        }

        moo(15);

        return (0x0a);
}

Compile this like so: # cc moo.c -Wall -ggdb3 -O2

And let's look at it: # objdump -S a.out > a && vim a
...  ...
00001694 <moo>:

int moo(int i)
    1694:       6b c2 3f d9     stw rp,-14(,sp)
    1698:       6f c4 00 80     stw,ma r4,40(,sp)
    169c:       37 44 00 02     ldo 1(r26),r4
        printf("%d\n", i);
    16a0:       23 41 00 00     ldil 2000,r26
    16a4:       37 5a 0e a8     ldo 754(r26),r26
    16a8:       e8 5f 1a b5     b,l 1408 <__init+0x24>,rp
    16ac:       08 04 02 59     copy r4,r25
        return (i);
    16b0:       4b c2 3f 59     ldw -54(,sp),rp
    16b4:       08 04 02 5c     copy r4,ret0
    16b8:       e8 40 c0 00     bv r0(rp)
    16bc:       4f c4 3f 81     ldw,mb -40(,sp),r4

000016c0 <main>:

int main(int argc, char *argv[])
{
    16c0:       6b c2 3f d9     stw rp,-14(,sp)
    16c4:       6f c4 00 80     stw,ma r4,40(,sp)
        int i = 0;

        for (i = 0; i < 10; i++) {
    16c8:       20 81 00 00     ldil 2000,r4
    16cc:       6b c3 3f 89     stw r3,-3c(,sp)
    16d0:       34 03 00 00     ldi 0,r3
                printf("%d\n", i);
    16d4:       08 03 02 59     copy r3,r25
    16d8:       34 9a 0e a8     ldo 754(r4),r26
    16dc:       e8 5f 1a 4d     b,l 1408 <__init+0x24>,rp
    16e0:       34 63 00 02     ldo 1(r3),r3
    16e4:       8c 72 5f dd     cmpib,>= 9,r3,16d8
    16e8:       08 03 02 59     copy r3,r25
        }

        moo(15);
    16ec:       e8 5f 1f 45     b,l 1694 <moo>,rp
    16f0:       34 1a 00 1e     ldi f,r26

Ha, look at that last instruction: the 15 parameter gets loaded right there, after the branch! Crazy.

        return (0x0a);
}
    16f4:       4b c2 3f 59     ldw -54(,sp),rp
    16f8:       4b c3 3f 89     ldw -3c(,sp),r3
    16fc:       34 1c 00 14     ldi a,ret0
    1700:       e8 40 c0 00     bv r0(rp)
    1704:       4f c4 3f 81     ldw,mb -40(,sp),r4

Here it goes again: one more instruction (the ldw) after the bv branch.

I'll leave it to the reader to find the others.

Anyway, back to the actual problem. The function I need to write is:

rfork_thread(int flags, void *stack, int (*func)(void *arg), void *arg);

Now I *only* need to figure out the following (give or take):

* function prologue
* replace current stack with the one provided by rthread_fork
* setup the rfork call on the new stack
* call rfork
* for the parent:
	- fix stack
	- function epilogue
	- return to caller
* for the child
	- jump into thread
	- upon return call _exit()

Sometimes an itch should remain unscratched...

(Comments are closed)

  1. By Anonymous Coward

    I remember this from the advanced computer architectures course I took last year. It is called "delayed branching", and it can even be extended to multiple instructions that are always executed after a branch, though HPPA probably doesn't do this. This is one of the methods that can be used to speed up branches on pipelined microprocessors.

    On a pipelined CPU, there are many stages an instruction has to go through, for example fetch, decode, execute, writeback. There may be more or fewer stages in practice; that depends on the CPU design. This means that several instructions are in different stages of the pipeline at the same time, which gives rise to a problem with respect to branches: in order to keep the pipeline full, we should fetch the next instruction after the branch before we know the outcome of the branch.

    The easiest solution is to just stop filling the pipeline until we know the outcome of the branch, but this slows down the CPU. We could also guess the outcome of the branch (using a simple rule like "forward branches are not taken, backward branches are"), and flush the pipeline and start over if we guessed wrong; this also costs extra time when we guess wrong. (There exist better prediction algorithms that achieve very high rates of success, by either taking hints from the compiler/programmer or by caching the results of each branch.)

    Delayed branching is another mechanism to solve this problem. In short, it works like this: we find some instructions that can be moved to right behind the branch. This means that the outcome of the branch must not depend on the results of these instructions. Now we can fill up the pipeline with these instructions, which are always executed regardless of the outcome of the branch, while the branch instruction is being processed.

    The number of instructions that are always executed after a branch is chosen so that by the time we have to fetch the first instruction that's really after the branch, the branch instruction itself has been executed, so we know which instruction to fetch. Of course, this is quite demanding of the compiler/programmer, who has to make sure such instructions can be found. But in many cases they can easily be found (e.g. simple for-loops, which hardly have any coupling between the loop-control variable and the operations executed). If they can't be found, an appropriate number of NOPs is inserted, and we're back at "stop filling the pipeline".

    1. By Anonymous Coward

      I remember that with PowerPC, when you use an instruction that sets the condition registers (e.g. compares), you get a number of independent instructions, effectively for free, before branching. If you compare and then branch straight away, you get a stall.

  2. By Anonymous Coward

    I remember first encountering this on SPARC. It's really odd when you fix code by adding NOPs. There's a nice article about this at

  3. By Anonymous Coward

    yeah it's hard. but it's harder to just shuddup

  4. By anon et al

    The Linux hppa port has supported pthreads for ages - but performance-wise, it's a joke.

    What's needed is a proper thread implementation for the hppa.

    1. By Marco Peereboom

      Their locking is broken.

  5. By Ober

    Yeah, I've been working with Todd to get OpenAFS working on 3.8 without it core dumping every 3 seconds. Looks like he will have a package ready shortly.

  6. By sand

    keep up with the developer's blog! :)

