Developer blog: marco

Contributed by marco on 2005-12-17 from the hppa-is-weird dept.

As a regular user of hppa hardware I have been wanting to use afs for a while now on hppa. Sadly afs requires LWP threads and the current code in there simply does not work. It's some sort of hpux pile of doo doo so I didn't bother. Anyway, Todd told me that there is a pthread alternative but alas we don't have any threading support on hppa as of today. All this combined with Ted's recent flurry of commits in rthreads got me excited and I started to scratch my itch. Shouldn't have...

The first part, getting the _atomic_lock.c file in was simple (and wrong as I found out later, but that's for another time). So now I am feeling all 1337 to find out that it needs another piece called rfork_thread. Crap, that one is actually hard so I decided that it was great for someone else. Needless to say that by midday the itch *had* to be scratched so I started looking at various implementations (macppc, i386, amd64). The i386 one was simple enough so I decided to *only* have a look.

Hours later...

I am still wading through the hppa documentation. Wow this is not CISCy at all. This is some crazy ass assembly. For example, one branches and the stupid cpu executes one more instruction regardless of hitting it or not. That's stupid! well at least on first sight it is. Now that I have seen it in action it is actually quite clever.

Check this out:

#include 
#include 

int moo(int i)
{
        i++;
        printf("%d\n", i);
        return (i);
}

int main(int argc, char *argv[])
{
        int i = 0;

        for (i = 0; i < 10; i++) {
                printf("%d\n", i);
        }

        moo(15);

        return (0x0a);
}

Compile this like so: # cc moo.c -Wall -ggdb3 -O2

And lets look at it: # objdump -S a.out > a && vim a
...  ...
00001694 :
#include 
#include 

int moo(int i)
{
    1694:       6b c2 3f d9     stw rp,-14(,sp)
    1698:       6f c4 00 80     stw,ma r4,40(,sp)
        i++;
    169c:       37 44 00 02     ldo 1(r26),r4
        printf("%d\n", i);
    16a0:       23 41 00 00     ldil 2000,r26
    16a4:       37 5a 0e a8     ldo 754(r26),r26
    16a8:       e8 5f 1a b5     b,l 1408 <__init+0x24>,rp
    16ac:       08 04 02 59     copy r4,r25
        return (i);
}
    16b0:       4b c2 3f 59     ldw -54(,sp),rp
    16b4:       08 04 02 5c     copy r4,ret0
    16b8:       e8 40 c0 00     bv r0(rp)
    16bc:       4f c4 3f 81     ldw,mb -40(,sp),r4

000016c0 :

int main(int argc, char *argv[])
{
    16c0:       6b c2 3f d9     stw rp,-14(,sp)
    16c4:       6f c4 00 80     stw,ma r4,40(,sp)
        int i = 0;

        for (i = 0; i < 10; i++) {
    16c8:       20 81 00 00     ldil 2000,r4
    16cc:       6b c3 3f 89     stw r3,-3c(,sp)
    16d0:       34 03 00 00     ldi 0,r3
                printf("%d\n", i);
    16d4:       08 03 02 59     copy r3,r25
    16d8:       34 9a 0e a8     ldo 754(r4),r26
    16dc:       e8 5f 1a 4d     b,l 1408 <__init+0x24>,rp
    16e0:       34 63 00 02     ldo 1(r3),r3
    16e4:       8c 72 5f dd     cmpib,>= 9,r3,16d8 
    16e8:       08 03 02 59     copy r3,r25
        }

        moo(15);
    16ec:       e8 5f 1f 45     b,l 1694 ,rp
    16f0:       34 1a 00 1e     ldi f,r26
Ha look at this right here      ^^^^^^^^^
The 15 parameter gets loaded right here, after the branch!  Crazy

        return (0x0a);
}
    16f4:       4b c2 3f 59     ldw -54(,sp),rp
    16f8:       4b c3 3f 89     ldw -3c(,sp),r3
    16fc:       34 1c 00 14     ldi a,ret0
    1700:       e8 40 c0 00     bv r0(rp)
    1704:       4f c4 3f 81     ldw,mb -40(,sp),r4
Here it goes again              ^^^^^^^^^^^^^^^^^^
one more instruction after the branch.

I'll leave it to the reader to find the others.

pid_t
rfork_thread(int flags, void *stack, int (*func)(void *arg), void *arg);

Now I *only* need to figure out the following (give or take):

* function prologue
* replace current stack with the one provided by rthread_fork
* setup the rfork call on the new stack
* call rfork
* for the parent:
	- fix stack
	- function epilogue
	- return to caller
* for the child
	- jump into thread
	- upon return call _exit()

Sometimes an itch should remain unscratched...

(Comments are closed)

Comments

By Anonymous Coward (134.58.253.131) on 2005-12-17 23:18

I remember this from the advanced computer architectures course I took last year. It is called "Delayed branching", and can even be extended to multiple instructions that are always executed after a branch, though HPPA probably doesn't do this. This is one of the methods that can be used to speed up branches on pipelined microprocessors. On a pipelined CPU, there are many stages an instruction has to go through, for example fetch, decode, execute, writeback. There may be more/less stages in practice; that depends on the CPU design. This means that several instructions are in different stages of the pipeline at the same time. This gives rise to a problem with respect to branches. In order to keep the pipeline full, we should fetch the next instruction after the branch before we know the outcome of the branch. The easiest solution to this is to just stop filling the pipeline until we know the outcome of the branch, but this slows down the CPU. We could also guess the outcome of the branch (using a simple rule like forward branches are not taken, backwards branches are), and flush the pipeline and start over if we guessed wrong. This also costs extra time when we guess wrong. (There exist better prediction algorithms that achieve very high rates of success, by either taking hints from the compiler/programmer, or by caching the results of each branch). Delayed branching is another mechanism to solve this problem. In short it works like this: We search some instructions that can be moved to right behind the branch. This means that the result of the branch must not depend on the results of these instructions. Now we can fill up the pipeline with these instructions, that are always executed regardless of the outcome of the branch, while the branch instruction is being processed. The number of instructions that are always executed after a branch is chosen so that when we have to fetch the first instruction that's really after the branch, the branch instruction itself has been executed, so we know which instruction to fetch. Of course this is quite demanding upon the compiler/programmer. They have to make sure such instructions can be found. But in many cases they can easily be found (eg. simple for-loops which hardly have any coupling between the loop-control variable and the operations executed). If they can't be found, an appropriate number of NOP's are inserted, and we're back at "stop filling the pipeline".
Comments
1. By Anonymous Coward (195.224.109.30) on 2005-12-19 14:34
  
  I remember with PowerPC, when you use an instuction that will set the condition registers ( e.g. compares, ) you get a number of independant instructions, effectively for free, before branching. If you were to say, compare and then branch straight away, you get a stall.
By Anonymous Coward (64.180.110.15) on 2005-12-18 08:08

I remember first encountering this on sparc. It's really odd when you fix code by adding NOPs. There's a nice article about this at http://en.wikipedia.org/wiki/Branch_delay_slot
By Anonymous Coward (212.99.198.91) on 2005-12-18 10:16

yeah it's hard. but it's harder to just shuddup
By anon et al (80.213.134.72) on 2005-12-18 13:37

The Linux hppa port has supported pthreads for ages - but performace vise, it's a joke.

What's needed is a proper thread implementation for the hppa,
Comments
1. By Marco Peereboom (67.64.89.177) wbx@openbsd-geek.de on 2005-12-18 18:24 http://openbsd-geek.de
  
  Their locking is broken.
By Ober (67.79.5.180) ober@linbsd.org on 2005-12-18 18:25

Yeah been working with todd to get openafs working on 3.8 without core dumping every 3 seconds. Looks like he will have a package ready shortly.
By sand (80.181.228.129) daniel@spatof.org on 2005-12-19 23:35 http://www.spatof.org

keep up with the developer's blog! :)

Latest Articles

Sat, Jun 28
- 05:57 Game of Trees 0.115 released (0)
Tue, Jun 24
- 07:48 Game of Trees 0.114 released (0)
- 07:23 Call for testing: bge/bnx/iavf/igc/ix/ixl/ngbe/pcn: ifq_restart() fix (0)
Mon, Jun 16
- 08:22 j2k25 hackathon report from kn@: installer, low battery, and more (0)
Fri, Jun 13
- 11:18 dhcpd(8): use UDP sockets instead of BPF (1)
Thu, Jun 12
- 12:32 clang(1)/llvm/lld(1) updated to version 19 (0)
Wed, Jun 11
- 12:22 Source code sandboxing (0)
Tue, Jun 10
- 06:50 TearFree option backported to modesetting(4) driver (0)
Mon, Jun 09
- 07:32 FFS optimizations with dirhash, as blogged by rsadowski@ (1)

Credits

Copyright © 2004-2008 Daniel Hartmeier. All rights reserved. Articles and comments are copyright their respective authors, submission implies license to publish on this web site. Contents of the archive prior to April 2nd 2004 as well as images and HTML templates were copied from the fabulous original deadly.org with Jose's and Jim's kind permission. This journal runs as CGI with httpd(8) on OpenBSD, the source code is BSD licensed. undeadly \Un*dead"ly\, a. Not subject to death; immortal. [Obs.]