OpenBSD Journal

LZMA challenge

Contributed by marco on from the dept.

Today Steven committed the LZMA port and unlike bzip2 the algorithm doesn't seem to be patented. I have to admit that I was very impressed with the performance of the compressor after Todd showed me his results. The only downside seems to be the time it takes to compress something. Decompression speed is about the same as gzip or bzip2.

The issues at hand here are that this code is GPLd and is written in C++ for added obfuscation. What would make this algorithm very useful is if someone could write an actual free BSD licensed version with the exact same API as libz. In other words create a drop-in replacement for libz.

If you intend to take on this challenge, here are some guidelines:

* BSD licensed
* written per style(9)
* Written in C
* Should have a full regression test
* It should follow the OpenBSD coding and design practices
Happy coding!

Update: I should have done some more research. I was incorrect about bzip2 being patented. I searched the patent offices and only found references to it as being public domain. My bad, sorry for the confusion. Thanks to tedu to bring this to my attention.

(Comments are closed)


Comments
  1. By Noryungi (82.127.29.248) n o r y u n g i @ y a h o o . c o m on

    Quick question...

    The OpenBSD web site states the following:

    Integrate good code from any source with acceptable copyright (ISC or Berkeley style preferred, GPL acceptable as a last recourse but not in the kernel, NDA never acceptable).

    Since I don't think LZMA will be integrated in the kernel anytime soon, is it really necessary to create another implementation the GPLed LZMA? I can understand the interest of a "pure C" LZMA, but rewriting the whole thing may be overkill... For instance, has anyone contacted the author of the LZMA utilities to arrange a release under the BSD license? Is this possible?

    Comments
    1. By Anonymous Coward (143.166.226.16) on

      A few things:
      * The code is in C++ and its quite ugly.
      * GPL is a virus and won't make it into the tree.
      * The utilities are not well written either.
      
      A drop in replacement would be awesome.

      Comments
      1. By Noryungi (82.127.29.248) n o r y u n g i @ y a h o o . c o m on

        Thanks, I guess that answers my question.

      2. By Anonymous Coward (204.245.224.15) on

        * GPL is a virus and won't make it into the tree.

        Not even as a "last recourse" ?

        Comments
        1. By Fábio Olivé Leite (200.248.155.122) on

          It has been stated several times, in several places and in several ways already that no additional piece of GPL code will ever make it into the tree again, and the remaining GPL portion is expected to gradually be replaced by BSD/ISC/license.template :) alternatives when possible. Someone should definitely update that goals.html...

    2. By Anonymous Coward (62.252.32.11) on

      The 7Zip source download includes a C-implementation which is LGPL/CPLd.

      Comments
      1. By Anonymous Coward (143.166.226.16) on

        GPL is not acceptable as noted several times now.

        Comments
        1. By Anonymous Coward (129.215.13.83) on

          Obviously you can't read. GPL != LGPL.

          The LGPL license terms do not require software which uses an LGPL-ed library (say) to be redistributed under the LGPL.

          Also:
          "SPECIAL EXCEPTION: Igor Pavlov, as the author of this code, expressly permits you to statically or dynamically link your code (or bind by name) to the files from LZMA SDK without subjecting your linked code to the terms of the CPL or GNU LGPL. Any modifications or additions to files from LZMA SDK, however, are subject to the GNU LGPL or CPL terms."
          [http://www.7-zip.org/sdk.html]

          Comments
          1. By Anonymous Coward (66.11.66.41) on

            That's still doesn't make it free. They don't want to include software in openbsd that is not BSD licensed or more free than that.

          2. By Anonymous Coward (143.166.255.16) on

            GPL ~ LGPL is the problem. Maybe this'll make more sence to you *GPL* of ANY version is not acceptable. Does that help?

        2. By Anonymous Coward (131.251.0.11) on

          Well excuse me, Mr Smarty Pants. The reason why I pointed out this C version, is so that people could use it as a reference rather than the "obfuscated" C++ version.

        3. By Anonymous Coward (24.80.111.103) on

          As noted several times now, you have noted that it's not acceptable.

      2. By gwyllion (134.58.253.130) on

        There is only C code for decompression, but not for compression. From http://www.7-zip.org/sdk.html
        LZMA SDK includes:
            * C++ source code of LZMA Encoder and Decoder
            * ANSI-C compatible source code for LZMA decompressing with example
            * C# source code for LZMA compressing and decompressing
            * Java source code for LZMA compressing and decompressing
            * Compiled file->file LZMA compressing/decompressing program for Windows system
        

      3. By sthen (81.168.66.226) on

        fwiw: LZMA is the main compression method used by 7-Zip. One thing I haven't seen mentioned in this discussion yet is that it doesn't need a huge amount of resources to decompress (e.g. significantly less RAM than bzip2) so it's fairly reasonable to use as a file-distribution format (as is done in the popular win32 software packager, NSIS).

    3. By Rembrandt (84.188.234.142) on

      Well let me tell ya some effects of LZMA:

      1st it takes a lot time to process... yeah but
      2nd: The compression is awesome.

      I compressed the sourcode of a bot (malware) wich is abiut ~52.3MB (bzip2). Let`s call this Bot "Phatbot"...

      So guess how small LZMA compressed the file.
      lzmash -9 -> 7.2 MB!
      normal lzma: 9.3MB

      The source is about 153MB uncompressed.
      So the compressionratios is realy.. unbeliveable.

      This is incredible, realy.
      Ok the Phatbot-Source may contain a lot of familiar stuff (like Plugins and bla).

      But even the OpenBSD src.tar.gz is with a normal lzma (no tuning) just 60MB (~100MB as tar.gz provided by the FTP/HTTP-Mirrors).


      Just to show some people how much space this algorithm could save...
      The decompression speed is also much better then bzip2 (with all source-codes-Tarballs I tested)

      Kind regards,
      Rembrandt

  2. By tedu (71.139.175.127) on

    bzip2 isn't patented. it's totally free.

    Comments
    1. By Anonymous Coward (151.41.11.190) on

      Maybe I'm wrong, but I remember a sort of math algorithm used by bzip2 was patented. Something like what happened with .gif

      Comments
      1. By Thorsten Glaser (81.173.171.215) tg@mirbsd.de on http://mirbsd.de/

        That was bzip, not bzip2. Arithmetic encoding/compression.
        Not patented everywhere, though. (I don't care ;)

    2. By Anonymous Coward (128.171.90.200) on

      That is correct, but it is very slow.

      Is LZMA any faster or better at compressing ?

      Comments
      1. By Anonymous Coward (71.145.130.63) on

        Dude, read the posting for shits sake.

        Comments
        1. By Anonymous Coward (128.171.90.200) on

          to quote :

          "I have to admit that I was very impressed with the performance of the compressor after Todd showed me his results. The only downside seems to be the time it takes to compress something. Decompression speed is about the same as gzip or bzip2."

          That tells me very little

  3. By Thorsten Glaser (81.173.175.132) tg@mirbsd.de on http://mirbsd.de/

    The zlib API sucks.

    Compare zopen(3) as on, for example,
    http://cvs.mirbsd.de/src/usr.bin/compress/zopen.c
    with gzopen(3) as in, for example,
    http://cvs.mirbsd.de/src/lib/libz/gzio.c

    The latter requires you to use gzread/write/seek/close.

    I found it ugly. My solution:
    http://cvs.mirbsd.de/src/lib/libz/gzfopen.c

    It's a non-standard extension, admittedly, but works
    well enough to allow for savecore(8) being operable
    on gzipped kernels.

  4. By Richard (195.212.199.56) on

    ...and is written in C++ FOR ADDED OBFUSCATION... [my emphasis]

    An interesting and somewhat different angle to take I suppose :o)

  5. By Anonymous Coward (202.45.99.226) on

    Sorry for posting this here, but I didnt know where else to ask. Does anybody know any good OpenBSD books?

    Comments
    1. By arndt (129.143.29.10) on

      http://www.openbsd.org/books.html

      Comments
      1. Comments
        1. By Fábio Olivé Leite (15.227.249.72) on

          Wikiped-a-holics beware!

          Speaking of Wikipedia, I just found this fairly interesting bit of information in Pufferfish:

          Due to some unknown selection pressure, intronic and extragenic sequences have been drastically reduced within this family. As a result, they have the smallest-known genomes yet found amongst the vertebrate animals, while containing a genetic repertoire very similar to other fishes and thus comparable to vertebrates generally. Since these genomes are relatively compact it is relatively fast and inexpensive to compile their complete sequences, as has been done for two species (Takifugu rubripes and Tetraodon nigroviridis).

          The similarities with OpenBSD's source tree are amazing.

          Damn, every hit on Wikipedia means at least half an hour of unbillable time. ;-)

  6. By Anonymous Coward (202.45.99.226) on

    The code should be secure of course. Remember the recent heap overflow in many RAR libraries.

  7. By Anonymous Coward (63.237.125.191) on

    * It should follow the OpenBSD coding and design practices
    I've always wondered... Are these documented somewhere?

    Comments
    1. By Nate (65.94.61.3) on

      Well, for one, man style

  8. By Chas (147.154.235.51) on

    I have a big data warehouse in HP-UX Oracle 7 and I like compressing everything down to one DLT. gzip takes 36 hours if I don't spread it out over multiple cpus (240Mhz is getting pretty creaky).

    I wish that there was a C compressor as well as a decompressor.

    Comments
    1. By tedu (69.12.168.114) on

      you think lzma is gonna help? it'd take a year to compress that with lzma.

  9. By Anonymous Coward (24.37.81.6) on

    I have a small question regarding the "must be written in C" thing. Why accept C, but not C++? I'm not familiar with development on OpenBSD, I'm merely a user, but is it because more people know C? Because it reduces the number of dependencies? Problems with C++?

    Thanks for the info.

    Comments
    1. By Ray (192.193.220.141) on

      I think this is obvious enough: The issues at hand here are that this code is GPLd and is written in C++ for added obfuscation.

  10. By Nick Holland (68.43.117.34) nick@holland-consulting.net on http://www.openbsd.org/faq/

    Another interesting compression system in the ports tree that doesn't get enough attention, I think, is rzip.

    Three copies of the same file:
    -r--r--r-- 3 archive archive 654813222 Jan 23 11:43 smf
    -r--r--r-- 3 archive archive 654813222 Jan 23 11:43 smf1
    -r--r--r-- 3 archive archive 654813222 Jan 23 11:43 smf2

    These are mbox mail spools...yes, 650M in size. LOTS of binary attachments. gzip did relatively little for me for compression on these files.

    $ time rzip smf1
    3m26.64s real 3m5.39s user 0m6.34s system

    $ time lzma e smf2 smf2.lz
    17m53.29s real 17m30.78s user 0m2.18s system

    And the results:

    $ ls -l smf*
    -r--r--r-- 2 archive archive 654813222 Jan 23 11:43 smf
    -r--r--r-- 1 archive archive 293921881 Feb 1 08:35 smf1.rz
    -r--r--r-- 2 archive archive 654813222 Jan 23 11:43 smf2
    -rw-r--r-- 1 archive archive 359881838 Feb 1 09:00 smf2.lz

    IN THIS CASE, rzip was the clear winner in compression performance and results. HOWEVER, it used huge amounts of memory to accomplish this task, memory the amd64 system I did this on had. If I had only 256M, I suspect the comparative performance would be significantly different due to swap.

    When setting up this system, I tried gzip and bzip2. bzip2 impressed me in how much more time it took than gzip, but provided very little improvement in compression. For this application, bzip2 was just an insulting joke, not a solution. rzip took about as long as bzip2, but gave much greater compression.

    rzip has its quirks: rzip is also not pipable, and the uncompression will thrash the disk rather more than expected, and for that reason, I suspect the uncompression speeds for rzip would be much slower than most other uncompression processes. This creates some problems on a machine with one disk system when you try to unpack multiple rzip files at once (hint: just don't). The lack of pipe-ability is annoying at times, if you want to confirm the md5 file of a compressed file, you must uncompress it to disk. Very "anti-Unix" in style.

    Moral: there are many kinds of compression systems out there. There is all kinds of data. Experiment with your data. Don't use my results for anything other than "hey, maybe we should look at ..."

    Nick.

Credits

Copyright © - Daniel Hartmeier. All rights reserved. Articles and comments are copyright their respective authors, submission implies license to publish on this web site. Contents of the archive prior to as well as images and HTML templates were copied from the fabulous original deadly.org with Jose's and Jim's kind permission. This journal runs as CGI with httpd(8) on OpenBSD, the source code is BSD licensed. undeadly \Un*dead"ly\, a. Not subject to death; immortal. [Obs.]