OpenBSD Journal

autoconf/clang (No) Fun and Games

Contributed by rueda from the autoconfusion dept.

Robert Nagy (robert@) wrote in with a fascinating story of hunting down a recent problem with ports:

You might have noticed the number of commits to ports regarding autoconf and nested functions and asked yourself… what the hell is this all about?

I was hanging out at my friend Antoine (ajacoutot@)'s place just before EuroBSDCon 2017 started. We were having drinks when he told me about a weird bug where Gnome hangs completely after just a couple of seconds of usage, with the gnome-shell process just sitting in the fsleep state. This started to happen around the time when inteldrm(4) was updated, the default compiler was switched to clang(1), and futexes were turned on by default.

The next day we started to have a look at the issue. Since the process was hanging in fsleep, it seemed clear that the cause must be futexes, so we started bisecting the base system, which resulted in random successes and failures. In the end we figured out that it was related to neither futexes nor inteldrm(4), so the only thing left was the switch to clang.

Now the problem was that we had to figure out which part of the system needed to be built with clang to trigger this issue, so we kept going and systematically recompiled the base system with gcc until everything was ruled out … and it kept on hanging.

We were drunk and angry that we now had to go and check hundreds of ports, because gnome is not a small standalone port. So between two bottles of wine a build VM was fired up to do a package build with gcc, because manually building all the dependencies would just take too long and we had spent almost two days on this already.

Next day ~200 packages were available to bisect and figure out what was going on. After a couple of tries it turned out that the hang was being caused by the gtk+3 package, which is bad since almost everything uses gtk+3. Now it was time to figure out which file in the gtk+3 source, when built by clang, was causing the issue. (Compiler optimizations had already been ruled out at this point.) So another round of bisecting happened, building each subdirectory of gtk+3 with clang and waiting for the hang to manifest … and it did not. What the $f?

Okay so something else is going on and maybe the configure script of gtk+3 is doing something weird with different compilers, so I quickly did two configure runs with gcc and clang and simply diff'd the two directories. Snippets from the diff:

-GDK_HIDDEN_VISIBILITY_CFLAGS = -fvisibility=hidden
+GDK_HIDDEN_VISIBILITY_CFLAGS = 

-lt_cv_prog_compiler_rtti_exceptions=no
+lt_cv_prog_compiler_rtti_exceptions=yes

-#define _GDK_EXTERN __attribute__((visibility("default"))) extern

-lt_prog_compiler_no_builtin_flag=' -fno-builtin'
+lt_prog_compiler_no_builtin_flag=' -fno-builtin -fno-rtti -fno-exceptions'

Okay, okay that's something, but wait … clang has symbol visibility support so what is going on again? Let's take a peek at config.log:

configure:29137: checking for -fvisibility=hidden compiler flag
configure:29150: cc -c -fvisibility=hidden  -I/usr/local/include -I/usr/X11R6/include conftest.c >&5
conftest.c:82:17: error: function definition is not allowed here
int main (void) { return 0; }
               ^
1 error generated.

Okay that's clearly an error but why exactly? autoconf basically generates a huge shell script that will check for whatever you throw at it by creating a file called conftest.c and putting chunks of code into it and then trying to compile it. In this case the relevant part of the code was:

| int
| main ()
| {
| int main (void) { return 0; }
|   ;
|   return 0;
| }

That is a nested function definition, which is a GNU C extension that clang does not support. That's okay, but the question is why the hell you would use nested functions to check for simple compiler flags. The next step was to go and check what is going on in configure.ac to see how the configure script is generated. In the gtk+3 case the following snippet is used:

   AC_MSG_CHECKING([for -fvisibility=hidden compiler flag])
   AC_TRY_COMPILE([], [int main (void) { return 0; }],
                  AC_MSG_RESULT(yes)
                  enable_fvisibility_hidden=yes,
                  AC_MSG_RESULT(no)
                  enable_fvisibility_hidden=no)

According to the autoconf manual the AC_TRY_COMPILE macro accepts the following parameters:

AC_TRY_COMPILE (includes, function-body, [action-if-found], [action-if-not-found])
Create a test program in the current language (see Language Choice) to see whether a function whose
body consists of function-body can be compiled. If the file compiles successfully,
run shell commands action-if-found, otherwise run action-if-not-found.

That clearly states that only a function body has to be specified, because the surrounding function definition is already provided automatically. So writing AC_TRY_COMPILE([], [int main (void) { return 0; }], ...) instead of AC_TRY_COMPILE([], [], ...) results in a nested function definition, which works just fine with gcc even though the autoconf usage is wrong.
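
The wrapping described above can be modelled with a few lines of shell. This is only a simplified sketch, not autoconf's real code: the helper name wrap_body is made up for illustration, and the preamble that a real conftest.c carries is left out. It prints the test program that results from the misuse and from the intended empty body.

```shell
#!/bin/sh
# Simplified model (not autoconf's real code) of how the second argument
# of AC_TRY_COMPILE ends up inside the generated test program.
# wrap_body is a hypothetical helper name used only for this sketch.
wrap_body() {
    # $1: the text passed as the "function-body" argument
    printf 'int\nmain ()\n{\n%s\n  ;\n  return 0;\n}\n' "$1"
}

echo '/* misuse: a complete function definition passed as the body */'
wrap_body 'int main (void) { return 0; }'

echo '/* intended use: an empty body */'
wrap_body ''
```

The first program contains a main() nested inside main(), which gcc accepts as an extension and clang rejects. The second is plain C that any compiler should accept, so the result of the check depends only on the flag being tested.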

A quick example:

AC_INIT(foobar, 1.0)
AC_PROG_CC

CFLAGS="-Wall"
AC_MSG_CHECKING([for -Wall compiler flag])
AC_TRY_COMPILE([], [int main (void) { return 0; }],
       AC_MSG_RESULT(yes),
       AC_MSG_RESULT(no))

GCC output:

checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking for -Wall compiler flag... *yes*

Clang output:

checking for gcc... clang
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether clang accepts -g... yes
checking for clang option to accept ISO C89... none needed
checking for -Wall compiler flag... *no*

The above example clearly shows that switching to clang as the default compiler triggered unintended behaviour in autoconf checks, because people do not use autoconf the way it was intended and only got away with it because they were using GCC.

After fixing the autoconf macro in gtk+3 and rebuilding the complete port from scratch with clang, the hang completely went away as the proper CFLAGS and LDFLAGS were picked up by autoconf for the build.

At this point we realized that most of the ports tree uses autoconf, so this issue might be a lot bigger than we thought. I asked sthen@ to grep the ports object directory for "function definition is not allowed here", which turned up about 60 additional affected ports.
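
A search of that kind might look roughly like the sketch below. It is not the command sthen@ actually ran: the function name find_nested_main is hypothetical, and it assumes each configured port leaves a config.log somewhere under the directory given as the first argument.

```shell
#!/bin/sh
# Hypothetical helper: list every config.log under a build tree that
# contains clang's nested-function error message.
find_nested_main() {
    # $1: root of the build tree to search (assumed layout)
    find "$1" -type f -name config.log \
        -exec grep -l 'function definition is not allowed here' {} +
}
```

It could be called as find_nested_main /usr/obj/ports, where the path is an assumption about where the ports object directory lives. Each file it prints still needs a manual look, since a port may be testing for nested function support on purpose.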

Out of that list of ports there were only two false positives, which were actually trying to test whether the compiler supports nested functions. The rest were a combination of several autoconf macros used in the wrong way, e.g. AC_TRY_COMPILE and AC_TRY_LINK. Most of them were fixable by just removing the extra function definition or by switching to other autoconf macros like AC_LANG_SOURCE, where you can actually define your own functions if need be.

Another gem from one of the ports as the last example :)

| int
| main ()
| {
| 
| #include "stdio.h"

The conclusion is that this issue was a combination of two things: people copy/pasting autoconf snippets instead of reading the documentation and using the macros the way they were intended, and the fact that switching to a new compiler is never easy, with bugs or undefined behaviour always lurking in the dark.

Thanks to everyone who helped fix all the ports this quickly! Hopefully all of the changes can be merged upstream, so that others can benefit as well.

Thanks very much Robert!


Comments
  1. By brynet (Brynet) on https://brynet.biz.tm/

    Amazing write-up from Robert, hard to believe this went unnoticed on other operating systems! :-)

  2. By Billy Larlad (137.229.105.41) larladtech@gmail.com on

    It will be good to have GNOME3 working again! Thanks to you and aja@ (and others) for the hard work fixing these problems.

  3. By Grzegorz Kulewski (194.1.144.110) on

    But... why did it hang even without that visibility option? Last time I checked processes weren't expected to hang just because their symbols weren't hidden...

    Did you debug it further?

    Comments
    1. By Nathan (76.102.253.61) nathan@braiwerk.org on

      I was wondering the same thing...

      Comments
      1. By Anonymous Coward (37.76.58.208) on

        The symbol visibility flag was just one thing that was missing in gtk+3; the others, like -fno-rtti, were not mentioned in the article. Symbol visibility is also important for compiler optimizations, as the optimizer will produce different code with and without it, and if you don't hide your symbols you might end up with symbol collisions where different libraries use the same symbol for different things.

        There was just no time to look further, but feel free to go and look :)

  4. By Anonymous Coward (193.175.69.12) on

    This problem taught me once again that when running primarily on -current with GNOME, one should always keep a working setup of a non-3D backup DE at hand. As much as I like GNOME, this Xenocara/3D graphics drivers/dbus/*kit/gnome-*-daemon mess is impossible to debug for us mere mortals. I wasted many hours myself without even getting a clue about what was going wrong.

    So big thanks to robert@ and aja@ for digging into it and keeping the port running despite the above problems, systemd dependencies and whatnots!

  5. By Ross (173.27.144.33) on

    For us not experienced in the ways of the Force (or the Gnome), would this fix Gnome-shell crashing (with a core dump) every 5 minutes or so in 6.1?

    I'm using an older Sandybridge desktop with the basic Intel on board graphics. Gnome-shell isn't usable at all on OpenBSD for me thanks to those crashes. I've been using XFCE instead.

    Comments
    1. By Anonymous Coward (81.82.229.158) on

      Can you test and report your findings? These guys put a lot of effort into finding and fixing the issue. That would be a great way to express your appreciation for their work, I think.

      Comments
      1. By Anonymous Coward (173.27.144.33) on

        Working on that with the 11/3 snapshot of OpenBSD and a CVS pull of ports from 10/4. Unfortunately it's been building Gnome from 'scratch' for 11 or so hours now, so it'll be a while to see if it works on said system. If it stops the crashes I'll report back, if it doesn't I'll see about asking for help from those far more knowledgeable than I.

      2. By Not Ross (193.175.69.12) on

        I can confirm, that for me Gnome works again with the 2017-10-05 snapshot.

      3. By Ross (173.27.144.33) on

        Yeah, the recently uploaded 6.2 release image w/ Gnome 3.24 installed seems to be stable again. I don't know if it's because of a Gnome update 3.24 in 6.2 v. 3.22 in 6.1 or the recently updated IntelDRM driver in 6.2 or both. It's tentative as I haven't been running it long, but it has been up for about an hour or so without a crash or hang.

        Using a Sandy Bridge Core i5, 8 GB system RAM, 1 GB dedicated to VRAM.
