Contributed by jj from the difference engine dept.
I've just committed changes to pkg_create that will help mirrors sync using much less bandwidth.
I just ran a final test.
Rsyncing a full amd64 snapshot now reports something like:
sent 7,315,796,510 bytes  received 40,292,721 bytes  4,517,095.01 bytes/sec
total size is 28,752,806,019  speedup is 3.91
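As a quick sanity check on those figures, here is a small Python sketch of the arithmetic. It assumes the speedup is the total size divided by the bytes sent plus the bytes received, which is consistent with the numbers quoted above; the variable names are mine.

```python
# Figures copied from the rsync summary above.
sent = 7_315_796_510
received = 40_292_721
total_size = 28_752_806_019

# Assumption: speedup = total size / (bytes sent + bytes received).
transferred = sent + received
speedup = total_size / transferred

print(f"speedup: {speedup:.2f}")                                # speedup: 3.91
print(f"fraction transferred: {transferred / total_size:.1%}")  # about a quarter
```

In other words, roughly a quarter of the snapshot actually went over the wire in this test.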
A few months ago, after the "reorder files in packages" change, Stuart Henderson commented that it would help only the end user, not mirrors, which got me thinking...
(Reminder: archives are compressed files. rsync does not peek inside the compressed data, so its comparison algorithms work poorly on them: the first differing byte changes everything in the rest of the archive, so compressed files get no speed-up.)
I looked at the --rsyncable patch for zlib/gzip, and talked it over with sthen@ and millert@, but pretty soon we discarded that idea. That patch is brittle (every zlib version has got its own flavor of it, with wild differences) and a nightmare to maintain. Plus it won't work at all with other compression formats.
The solution was low-tech: simply cut the archive into more gzip chunks (signatures already split the package into two parts, so we know the tools work). I chose 16 files as a simple guideline to experiment with. There were still some discrepancies, such as tar timestamp metadata, which is why those timestamps migrated to the plist a few weeks ago (side-effect: the tarball effectively says everything dates back to the epoch... not so bad).
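To illustrate why chunking helps, here is a minimal Python sketch. It is not how pkg_create does it: the fake "files", the helper names `make_files` and `chunked_gzip`, and the choice of 16 entries per gzip member are all assumptions made for the demo. It relies on two facts about gzip: concatenated gzip members decompress as one stream, and a change in the input only affects the member that contains it.

```python
import gzip
import hashlib

def make_files(n, changed=None):
    """Return n deterministic fake file bodies; optionally alter one of them."""
    files = []
    for i in range(n):
        seed = f"file-{i}".encode()
        if i == changed:
            seed += b"-v2"  # simulate a file that differs between two builds
        # Stretch the seed into 4 KiB of poorly compressible bytes.
        body = b"".join(hashlib.sha256(seed + bytes([k])).digest()
                        for k in range(128))
        files.append(body)
    return files

def chunked_gzip(files, per_chunk=16):
    """Compress each group of per_chunk files as its own gzip member."""
    members = []
    for start in range(0, len(files), per_chunk):
        group = b"".join(files[start:start + per_chunk])
        # mtime=0 keeps the gzip header timestamp out of the output.
        members.append(gzip.compress(group, mtime=0))
    return members

old = chunked_gzip(make_files(64))
new = chunked_gzip(make_files(64, changed=3))  # only file 3 differs

# The concatenated members still decompress as one ordinary gzip stream.
archive = b"".join(new)
roundtrip_ok = gzip.decompress(archive) == b"".join(make_files(64, changed=3))

unchanged = sum(a == b for a, b in zip(old, new))
print(f"round trip ok: {roundtrip_ok}")
print(f"{unchanged} of {len(old)} gzip members are byte-identical")
```

Only the member holding the modified file differs between the two archives, so a block-matching transfer can reuse the rest. With a single gzip stream, a change near the start would typically alter everything after it.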
I was pleasantly surprised: the size increase is minimal (very much under 1%).
I also whacked the gzip timestamps, which don't serve any useful purpose either, especially since the plist signature also contains a timestamp (and that one is signed, so it's way more trustworthy).
Obviously, the first snapshot out will still copy everything. But from the second one, mirror owners should see a difference.
To benefit:
- mirror owners must now use rsync algorithms. Turn off -W / --whole-file if you were using it.
- turn on -y / --fuzzy, as this will "track" minor package version changes.
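For mirror scripts that assemble the rsync command line programmatically, the two adjustments above might look like the following Python sketch. The helper name `adjust_rsync_args`, the host `rsync://mirror.example.org/...` and the local path are hypothetical placeholders; only the -W / --whole-file and -y / --fuzzy flags come from the text above. The sketch only builds the argument list and does not run rsync.

```python
def adjust_rsync_args(args):
    """Drop whole-file mode and make sure fuzzy matching is enabled.

    Only standalone "-W" / "--whole-file" tokens are handled; clustered
    short options such as "-aW" would need extra care.
    """
    adjusted = [a for a in args if a not in ("-W", "--whole-file")]
    if "-y" not in adjusted and "--fuzzy" not in adjusted:
        # Insert right after the program name, before sources and destination.
        adjusted.insert(1, "--fuzzy")
    return adjusted

# Hypothetical existing invocation of a mirror script.
before = ["rsync", "-a", "-W",
          "rsync://mirror.example.org/snapshots/packages/amd64/",
          "/var/mirror/snapshots/packages/amd64/"]

after = adjust_rsync_args(before)
print(" ".join(after))
```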
Note that this only applies to the "package snapshots" part of OpenBSD.
My test was a bit extreme: I built two snaps from the exact same ports tree, so the similarities are maximal. Nevertheless, there are lots of *huge* packages in the ports tree, so I expect the bandwidth gain to be very significant anyway, especially for fast architectures which turn out one snapshot a week or more. I expect bandwidth use to be more than halved.