OpenBSD Journal

Administrative: RSS rate limiting

Contributed by dhartmei on from the bean-counter dept.

As some readers might remember, we switched to a commercial hosting provider about a month ago. Among other things, this means I now get billed for each GB of traffic generated. While the monthly sum is quite modest, analysis of the logs showed one primary bandwidth consumer, RSS.

If you're using undeadly.org's RSS feed in any way, please read on.
In December 2005, about 17,000 unique IP addresses requested the RSS feed more than 1,000,000 times in total. The returned XML document is about 7kB in size, so that's roughly 7GB of traffic, almost half of the entire traffic generated.

Further investigation showed that about half of the clients do provide an If-Modified-Since: HTTP header, and the CGI is now using that to return a brief 304 Not Modified status code where appropriate (84 bytes vs. 7,000 bytes).
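
For the curious, the mechanics are simple; roughly like the following sketch (illustrative C, not the actual undeadly source; feed_mtime stands for the time of the last posted story):

  /*
   * Sketch: answer conditional GETs with 304 when the feed hasn't
   * changed since the time the client sent in If-Modified-Since:.
   */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>

  static void
  serve_rss(time_t feed_mtime)
  {
          const char *ims = getenv("HTTP_IF_MODIFIED_SINCE");
          char lm[64];
          struct tm tm;

          memset(&tm, 0, sizeof(tm));
          /* RFC 1123 date, e.g. "Sun, 01 Jan 2006 12:00:00 GMT" */
          if (ims != NULL &&
              strptime(ims, "%a, %d %b %Y %H:%M:%S GMT", &tm) != NULL &&
              timegm(&tm) >= feed_mtime) {
                  /* 84 bytes instead of the full 7kB document */
                  printf("Status: 304 Not Modified\r\n\r\n");
                  return;
          }
          /* send Last-Modified so clients can be conditional next time */
          strftime(lm, sizeof(lm), "%a, %d %b %Y %H:%M:%S GMT",
              gmtime(&feed_mtime));
          printf("Status: 200 OK\r\n");
          printf("Last-Modified: %s\r\n", lm);
          printf("Content-Type: application/rss+xml\r\n\r\n");
          /* ... emit the full RSS document ... */
  }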

However, of the remaining 50 percent of RSS readers that don't provide this header, some are polling at a very high frequency, as often as once every minute. A single such client adds up to 300MB/month, a disproportionately high share, which we can't afford to spend for a large number of clients.

I'd like to ask everyone who has an RSS reader/aggregator set up to please:

  • Enable the use of the If-Modified-Since: header to prevent the client from fetching old, redundant documents in full, or switch to a client that supports it (a sketch of such a conditional fetch follows this list). With this enabled, you are free to poll as often as once every 5-15 minutes, as most replies will be small (Not Modified).
  • If this is not possible, and your client fetches the full document on each request, limit polling to once every 15-60 minutes.
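
For reader authors: the conditional fetch mentioned in the first point is cheap to implement. With libcurl, for instance, it takes two extra options; a sketch (only the feed URL is real, the rest is illustrative, and error checking is omitted):

  /* Sketch: a polite RSS fetch using libcurl's built-in
   * If-Modified-Since support. */
  #include <curl/curl.h>
  #include <time.h>

  int
  main(void)
  {
          CURL *curl = curl_easy_init();
          time_t last_fetch = time(NULL) - 900;   /* e.g. stored from last run */
          long code = 0;

          curl_easy_setopt(curl, CURLOPT_URL,
              "http://undeadly.org/cgi?action=rss");
          /* only transfer the body if it changed since last_fetch */
          curl_easy_setopt(curl, CURLOPT_TIMECONDITION,
              (long)CURL_TIMECOND_IFMODSINCE);
          curl_easy_setopt(curl, CURLOPT_TIMEVALUE, (long)last_fetch);
          curl_easy_perform(curl);

          curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &code);
          /* code 304 means "nothing new", 200 means a fresh document */
          curl_easy_cleanup(curl);
          return 0;
  }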

There were 214 stories in 2005, which means the average time between two posted stories was about 40 hours. It's a waste to fetch the very same document hundreds of times just to get the rare changed ones with a latency of less than 15 minutes.

The CGI now (experimentally) keeps track of RSS requests per client IP address. If a client (one not using If-Modified-Since:) fetches the full document more than 96 times within a 24h sliding window, it receives a 503 Service Unavailable status code instead. If this is hitting someone unfairly, please contact me.
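
Here's a sketch of the idea behind that check (illustrative C, not the literal CGI code):

  /*
   * Sketch of the sliding-window check: remember the last 96 full
   * fetches per client address; refuse when all 96 fall within the
   * past 24 hours. In a real CGI this state must persist across
   * processes (e.g. in a file or dbm database), omitted here.
   */
  #include <time.h>

  #define RSS_MAX 96              /* full fetches allowed per window */
  #define RSS_WIN (24 * 60 * 60)  /* window length in seconds */

  struct client {
          time_t  hits[RSS_MAX];  /* ring of recent request times */
          int     next;           /* oldest slot, overwritten next */
  };

  /* Return 1 if the request may be served, 0 if it gets a 503. */
  int
  rss_allow(struct client *c, time_t now)
  {
          time_t oldest = c->hits[c->next];

          /*
           * The slot about to be overwritten holds the 96th-newest
           * request; if it's still inside the window, the limit is
           * reached. Refused requests are not recorded, so service
           * resumes as soon as old entries age out.
           */
          if (oldest != 0 && now - oldest < RSS_WIN)
                  return 0;
          c->hits[c->next] = now;
          c->next = (c->next + 1) % RSS_MAX;
          return 1;
  }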

The more bandwidth we can save, the more we can use for more valuable content, like stories with images.

Happy new year to everyone!

Comments
  1. By Anonymous Coward (213.202.214.156) on

    I subscribe to the RSS feed and had a 30-minute polling interval until reading this article; I've changed that to 120 minutes now.

    I do all my RSS fetching (as well as most of my web browsing) over tor and just wanted to request that you take that special situation into account in your rate-limiting. I have no control over which exit node my RSS requests route through to your server, nor how many others use the same exit node, and it'd be a pity if my well-behaved requests were denied because of that.

    The list of current tor exit nodes can be downloaded and is dynamically updated, so you could put in a special policy for these nodes and either

    1.) Not block them at all

    2.) Only allow RSS download with a username and block subscribers who abuse the service via that username

    Thanks.

    Comments
    1. By Daniel Hartmeier (62.65.145.30) daniel@benzedrine.cx on http://www.benzedrine.cx/dhartmei.html

      Interesting, I forgot about Tor, and it looks like it's being used by several people (1445 requests from Tor exit nodes within the last 24h, with about 12 unique user agent identifiers).

      Note that if your reader DOES supply the If-Modified-Since: header, its requests aren't counted towards rate limiting at all. Of the user agents seen from Tor, only the following don't send that header:

      MagpieRSS/0.72 (+http://magpierss.sf.net)
      RssReader/1.0.91.0 (http://www.rssreader.com) Microsoft Windows NT 5.1.2600.0
      Snownews/1.5.4 (OpenBSD; http://snownews.kcore.de/)

      If you're not using any of those, you're safe.

      The latter two clients do seem to support the feature; I see requests (from other IPs) using it, so I assume they can be configured to do so.

      Comments
      1. By Rembrandt (193.201.54.32) on

        What about Mozilla?
        I'm using Firefox for the RSS stuff.
        And I didn't find any RSS-related options, but I'm no X/GUI wizard either...

        And regarding Tor: some guys don't use Privoxy ;-)

        Comments
        1. By Daniel Hartmeier (62.65.145.30) daniel@benzedrine.cx on http://www.benzedrine.cx/dhartmei.html

          Mozilla, as an interactive client, is more complex. If you simply press Reload, it sends 'If-Modified-Since: last-time-of-full-fetch', while Shift-Reload omits the header. It doesn't seem to store the last fetch time between invocations. There's also the issue of cache expiry, etc.

          But Mozilla is not a problem, as far as I can see. Not a great many people manually click Shift-Reload once a second for hours, and its automatic reloading (when you pull an RSS feed URL into the bookmarks bar, so it appears like a folder with the latest stories underneath) sends the header.

          If you really want to know, run something like

            # tcpdump -s 1600 -nvvvpXi $ext_if tcp and host 66.51.111.60 and port 80

          and check the HTTP request(s).

          Comments
          1. By Daniel Hartmeier (62.65.145.30) daniel@benzedrine.cx on http://www.benzedrine.cx/dhartmei.html

            Ah, it's cleverly hidden. The RSS bookmarks that automatically update themselves are called Live Bookmarks in Mozilla parlance:

            1. Open the URL about:config
            2. Search for the entry browser.bookmarks.livemark_refresh_seconds
            3. If it doesn't exist (it doesn't by default), create the entry (right-click on any existing entry for the context menu, then New, Integer)
            4. The default value (if the entry doesn't exist) is 3600 seconds (one hour), which is fine

          2. By waldo (24.180.148.115) on

            NetNewsWire from Ranchero Software (on the Mac) sends the correct headers.

      2. By Christian Kellermann (85.31.186.61) Christian.Kellermann@nefkom.net on

        According to Snownews author Oliver Feiler, Snownews has used the header since version 1.0, unless people reload manually; in that case the whole feed gets downloaded again.

        So please don't hit that 'R' key too often.

        Cheers,

        Christian

  2. By Daniel Hartmeier (62.65.145.30) daniel@benzedrine.cx on http://www.benzedrine.cx/dhartmei.html

    BTW, you can select how many stories you want in the RSS document, using the CGI parameter &items=n, for instance

    http://undeadly.org/cgi?action=rss&items=3

    will return the newest 3 stories. The default, if the parameter is not supplied, is now 4. The maximum is 10. Obviously, the more stories the RSS document contains, the larger it becomes.
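
    The handling is just a clamp, roughly like this sketch (the function name is made up; not the actual CGI source):

      /* Sketch: number of stories to include, from the optional
       * &items=n CGI parameter; default 4, maximum 10. The clamping
       * choice for nonsense values is illustrative. */
      #include <stdlib.h>

      int
      rss_items(const char *param)
      {
              int n;

              if (param == NULL)
                      return 4;       /* default when not supplied */
              n = atoi(param);
              if (n < 1)
                      n = 1;
              if (n > 10)
                      n = 10;         /* documented maximum */
              return n;
      }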

    If you poll with a high frequency to mail yourself every new story as soon as possible, you can save bandwidth by requesting fewer items, based on the assumption that no more than 3 stories are posted within any 15-minute span (which has been mostly true, afaik).

    On the other hand, if you want something like a daily digest, and have been polling more than once a day because you feared missing stories (in case there were ever more than four on one day), you could safely poll once a day with &items=10.

    In short, the more often you poll, the fewer items per document you generally need.

  3. By Anonymous Coward (62.252.32.11) on

    If you want to save some bandwidth, you could easily get rid of all those *very ugly* font tags using some basic CSS ..

    Comments
    1. By Anonymous Coward (69.18.177.10) on

      But then Undeadly wouldn't work in Dillo & Links....

      Comments
      1. By Anonymous Coward (80.65.225.229) on

        Not really.
        Console clients don't interpret most HTML "font" tag properties (like "face" and "size").
        As for table properties (width, cellspacing, border, ...), all of this would benefit from being factored into a CSS file, for bandwidth's sake (and a faster browsing experience).
        I usually find that CSS-based sites (in layout and design) end up being more lynx-friendly, at the end of the day.
        Dillo is a special case, though.

        Comments
        1. By Anonymous Coward (69.18.177.10) on

          Well, considering that Dillo is a graphical client that doesn't use CSS, that both the graphical and command-line versions of Links don't like CSS too much, and that it was a slight joke, this is rather pointless. Neither really uses the font tags, I'll give you that, but try loading most CSS pages in either and you'll have problems in short order.

          Comments
          1. By Anonymous Coward (66.11.66.41) on

            No you won't. If a client doesn't support CSS, then the CSS is ignored. It won't cause problems at all.

      2. By Anonymous Coward (62.252.32.11) on

        Works fine in Links. It would work fine in Dillo as well... the markup just won't show up; not the end of the world, is it?

        Comments
        1. By Anonymous Coward (24.46.21.229) on

          But it can most definitely be a pain in the ass to view certain websites that don't format their information well without CSS. Many use CSS as a quick and easy way to control the display format, and when you cannot use CSS they are painful to view. Blech

    2. By Anonymous Coward (83.169.165.9) on

      there was a css version but basically nobody cared for it at all

      Comments
      1. By mrkris (67.168.169.105) on

        Why would anyone use a web browser that doesn't support CSS? I mean really, do you need to be that "leet" -- sounds retarded in my opinion.

  4. By Matthias Kilian (84.134.57.236) on

    ... more than 96 times within the last 24h window...

    A 24h window is ineffective for lots of (at least) German users behind typical el-cheapo ADSL connections, since those have to reconnect after 24 hours anyway. So better use something like 32/8h (or even shorter windows).

    Comments
    1. By Daniel Hartmeier (62.65.145.30) daniel@benzedrine.cx on http://www.benzedrine.cx/dhartmei.html

      It's a sliding window, i.e. the CGI knows how many RSS requests an IP address made during the previous 24h, ending right now (not during the previously completed 24h accounting period, ending last midnight, or such). ;)

      Comments
      1. By Florian Hars (217.110.154.194) on http://www.hars.de

        But that was the point: the moment the window becomes effective, the user's connection is terminated, and he gets a new IP address, starting a new 24-hour window.

        Comments
        1. By Daniel Hartmeier (62.65.145.30) daniel@benzedrine.cx on http://www.benzedrine.cx/dhartmei.html

          And with that new IP address, he can request the RSS 96 times, then the threshold (for the new address) is immediately exceeded. So, if he changes IP addresses every 24h, he can fetch at most 96 times with each address, that is 96 times per day. That's the goal. Not to silence anyone for subsequent 24h when he exceeds the threshold.

          Example: you fetch every minute, say for the first time on Monday 02:01. You'll get a reply every minute, 96 in total, until 03:37, when you get the first error (96 requests in the last 24h, Sun 03:37 to Mon 03:37).

          You continue to try to fetch every minute, and get only errors. That is until Tue 02:02 (95 requests in the last 24h, Mon 02:02 to Tue 02:02), when the request succeeds again. So do the next 95 requests. Then everything repeats. You're effectively fetching no more than 96 pages per day.

          Now, with the same example, assume the client switches IP address on Mon 10:00. His next requests will succeed, since there is nothing counted for the new address. But at 11:37 he reaches the limit again, for the new address.

          In short, if you change IP addresses N times a day, you can fetch at most N*96 times a day. Assuming N is not very large (1-3), the limiting still works. At least it's not at all 'ineffective'.

          Comments
          1. By gopher (84.164.205.112) on http://www.redsheep.de/

            Maybe you should also use the user agent string. That would cut off those broken RSS readers without completely blocking the IP address. Or maybe those MSIE-based RSS readers eat cookies... However, it's unlikely that I'd be the one who gets one of those blocked IPs reassigned ;-)

            Comments
            1. By Daniel Hartmeier (62.65.145.30) on

              There's no blocking at the IP level (as in pf blacklisting). When an IP exceeds the RSS threshold, only further RSS requests (and only those without the header) are refused; everything else (HTML pages, images) is still served.

  5. By Nikademus (85.201.20.195) nikademus at llorien . org on http://www.octools.com

    What about Google IG (http://www.google.com/ig)? I assume it uses almost the same IP for every fetch, and the client has no control over how often Google fetches the RSS feed. Blocking Google's IG may not be very handy, because there are probably more people fetching the feed through it than directly from the site; one Google RSS request likely spares you many requests from individual clients. If people stop using Google's RSS client, you may end up getting more requests, not fewer, as they begin using their own clients. I don't know whether Google IG uses the modified flags, however.

    Comments
    1. By Daniel Hartmeier (62.65.145.30) on

      That would be Google's Feedfetcher, which is seen about 60 times per day like

      72.14.199.65 - - [02/Jan/2006:10:36:36 -0700] "GET /undeadly.org/cgi?action=rss HTTP/1.1"
        200 4871 "" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)"
      
      always from the same IP address, and not using the header. But Google itself rate-limits its requests, and doesn't trigger a request every time a reader uses the feed (which would defeat the purpose of a cache completely ;).

  6. By Sitsofe (213.105.224.17) on http://sucs.org/~sits/

    I'm sorry to hear that the bandwidth bill has been driven up in this manner. Can you name and shame those clients that do not send the appropriate header? How much are those clients costing you in comparison to the well behaved ones? Have you also considered using a third party host for the feeds (e.g. FeedBurner)? Would this help solve the problem or would that just mean paying more bills?

    Comments
    1. By Daniel Hartmeier (62.65.145.30) on

      If you're curious, here's a list of all clients that are currently being limited (because they sent more than 96 unconditional requests in the last 24h), together with the User-Agent: they supplied.

      http://undeadly.org/rsslog.txt

      This is updated every 15 minutes. If you find yourself on this list, see the original post above.

  7. By Anonymous Coward (83.59.175.166) on

    me likes good english yours

  8. By Anonymous Coward (80.65.225.229) on

    By the way, wouldn't an on-the-fly compression engine (a la mod_gzip / deflate) be a good bandwidth saver here? There are patches around for thttpd too.

    And CSS would have a nice benefit also.

    Those would save the clients' bandwidth too, and make the pages load faster, so a good win for everyone, eh?

    Comments
    1. By 1uanjo (84.120.176.87) on http://blackshell.usebox.net/

      Good point, but I bet most RSS clients don't handle gzip or deflate content, so compressing the RSS won't fix the problem if nobody asks for it compressed :P

      It should be easy to do a survey, just log 'Accept-Encoding' header from RSS readers and see how many put gzip or deflate there.
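
      Such a survey is only a few lines in a CGI; roughly like this sketch (header names per the CGI interface, output format made up):

        /* Sketch: log each RSS client's Accept-Encoding and User-Agent
         * to stderr (which typically ends up in the httpd error log).
         * strstr() is a rough match; a real parser would honor q-values. */
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        int
        main(void)
        {
                const char *ae = getenv("HTTP_ACCEPT_ENCODING");
                const char *ua = getenv("HTTP_USER_AGENT");
                int gzip_ok = ae != NULL && strstr(ae, "gzip") != NULL;

                fprintf(stderr, "rss-survey: gzip=%d encoding=\"%s\" agent=\"%s\"\n",
                    gzip_ok, ae != NULL ? ae : "-", ua != NULL ? ua : "-");
                return 0;
        }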

  9. By Anonymous Coward (64.180.208.107) on

    Just a quick note: there are currently 228 subscribers using the Bloglines service to monitor Undeadly. I have no idea how many hits Bloglines generates, or whether it uses If-Modified-Since:.

    Comments
    1. By Daniel Hartmeier (62.65.145.30) on

      It's fine, polling only every 30 minutes and supplying the header, too.

  10. By Anonymous Coward (82.143.205.169) on

    Why not just use a hosting provider that doesn't bill the traffic?

    Comments
    1. By Marco Peereboom (67.64.89.177) marco@peereboom.us on http://www.peereboom.us

      That's nice of you to offer money and resources. Please let us know where we can host this.

      Unsolicited advice is great!
      Thanks again!

      Comments
      1. By Ed White (151.38.61.235) on

        If you want a lot of bandwidth and good dedicated hardware, try this address: http://www.layeredtech.com/servers.shtml. They offer 1000GB/month. Servers start from $65/month. OpenBSD is supported.

      2. By Janne Johansson (82.182.176.20) it.su.se (at-reversed) jan.johansson on

        Actually, for something (non-commercial) like undeadly, I can host it on a 100mbit line for free, including one 5-900MHz P3 machine for it to run on, 24/7, dedicated to this. No problems.

      3. By www (195.47.114.89) on

        It's nice to see that people try to avoid waste, using knowledge and technology, when they are forced to pay for the wasting. I'm just glad to see that it's working (you don't lose your comfort) and the world hopefully won't die from "overwasting". Just a philosophical note, never mind me :)))

  11. By Anonymous Coward (216.184.0.157) on

    Maybe you could add an ETag header to each response and support for If-None-Match in requests? It's an HTTP 1.1 thing instead of a 1.0 thing, so maybe wouldn't fit in well, but I've always seen it mentioned in relation to being a polite RSS consumer. I don't know if there are any RSS fetchers which support only ETags instead of modification dates (I'd assume that they'd be more likely to support modification dates if they were going to implement only one of the two), but it can't hurt to cover all your bases.

    http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.19
    http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.24
    http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.26
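
    A sketch of what that could look like server-side, using the feed's modification time as the fingerprint (illustrative only; a real implementation would also handle entity-tag lists and W/ weak validators in If-None-Match):

      /* Sketch: emit an ETag on every reply and honor If-None-Match,
       * complementing the If-Modified-Since handling. Any stable
       * fingerprint of the document would do as the tag. */
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <time.h>

      static void
      serve_with_etag(time_t feed_mtime)
      {
              char etag[32];
              const char *inm = getenv("HTTP_IF_NONE_MATCH");

              snprintf(etag, sizeof(etag), "\"%lld\"", (long long)feed_mtime);
              if (inm != NULL && strcmp(inm, etag) == 0) {
                      printf("Status: 304 Not Modified\r\n");
                      printf("ETag: %s\r\n\r\n", etag);
                      return;
              }
              printf("Status: 200 OK\r\n");
              printf("ETag: %s\r\n", etag);
              printf("Content-Type: application/rss+xml\r\n\r\n");
              /* ... full document ... */
      }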

  12. By Coward, not totally anonymous (69.163.38.27) matt@reliabledata.com on

    7 gigs? I don't know what you pay, but if 7 gigs is making you worry, you need a new host. There are plenty of hosts that will move 7 gigs for very little money.

    I do this for a living. I buy my bandwidth by the megabit, but I generally have to sell it by the total megabytes or gigabytes transferred, because that's a concept customers can get their heads around. If a customer came to me with a circumstance like this, I would suggest we do the billing by the total amount transferred, but do some pf/altq rate limiting to save some money. You see, as a host, I am not necessarily concerned by how many gigabytes you move. My primary concern is the percentage of my available bandwidth that is being used up at any one point in time. I like to stay below 70%. It seems to me that some things you do with your site may not be as important as other things you do, and I bet you could slow down the rate a bit. We are talking about OpenBSD after all. There is usually no good reason why someone that receives a file in 60 seconds cannot wait 70, or 80, or even 90 seconds for that file. It's a small price for the end-user to pay, and can make all the difference at billing time.

    Comments
    1. By Daniel Hartmeier (195.234.187.87) daniel@benzedrine.cx on http://www.benzedrine.cx/dhartmei.html

      It's about USD $2 per GB, so we're talking about $10-$40 per month. That's what I meant by 'quite modest'. It's not that I'll starve because of an additional $30/month, but neither do I like wasting $360/year because a few people don't bother configuring their RSS clients.

      Those who suggest different providers, please check all related costs (including power, cooling, and support) and assume you want to bring your own box, which you (and only you) have full access to.

      Comments
      1. By Dormando (66.134.95.38) on

        Well, what are the actual requirements? undeadly needs a dedicated OpenBSD box somewhere with no other shared customers on it? Would an RSS/image/etc. mirror be of any use? I.e., undeadly randomly redirects to the RSS mirror instead of serving up the page locally.

      2. By Anonymous Coward (216.220.116.154) on

        Just a note about the bandwidth prices. Even if you buy a full 10Mb/s port from that provider, they are really marking up the bandwidth. It's really hard to pay more than $100/Mb these days, even if you buy from the more overpriced ones. So on that $2500/10Mb port, minimum markup is $1500, and that assumes you saturate it 24/7.

        Comments
        1. By wob (216.150.214.170) wob@bonch.org on

          From what I am reading here, the point is NOT THE COST OF THE BANDWIDTH.

          Most nerds that I know (including myself) will obsess over inefficient software, hardware, protocols, life itself :) etc. We hate waste. If we can make something more efficient, we will. Obviously it has been pointed out how some of the RSS clients are wasteful in how they do things, which wastes some money that could go somewhere else (like more geek toys!) instead of throwing it away because of bad behaving clients. It has nothing to do with the *affordability* of the bandwidth cost. The bandwidth can be paid for, but why pay for inefficient usage because of poorly designed clients?

          Geesh, how hard is that to understand... I could not bite my tongue any longer and had to point this out.

  13. By Anonymous Coward (70.74.75.227) on

    To save bandwidth from text, you could drop isspace(3) between ">" and "<" in served pages (not templates.) For instance, right now I see:
    <html>
    
            <head>
                    <title>Administrative: RSS rate limiting</title>
            </head>
    You could strip the isspace(3) between ">" and "<", and have
    <html><head><title>Administrative: RSS rate limiting</title></head>

    You can save on average 10% of text bandwidth doing this.
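
    Such a filter can be tiny; a sketch (it would still need to special-case <pre> blocks, where whitespace is significant):

      /* Sketch: copy HTML from stdin to stdout, dropping whitespace
       * runs that sit between a '>' and the next '<'. */
      #include <ctype.h>
      #include <stdio.h>

      int
      main(void)
      {
              int c, prev = 0;

              while ((c = getchar()) != EOF) {
                      if (prev == '>' && isspace(c)) {
                              while ((c = getchar()) != EOF && isspace(c))
                                      ;       /* swallow the run */
                              if (c == EOF)
                                      break;
                              if (c != '<')
                                      putchar(' ');   /* keep one separator before text */
                      }
                      putchar(c);
                      if (!isspace(c))
                              prev = c;
              }
              return 0;
      }
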
    Happy New Year!!!

  14. By tedu (69.12.168.114) on

    while we're playing "nice to have", adding the most recent comment timestamp to the front, in addition to "X comments", would be convenient.

    Comments
    1. By Daniel Hartmeier (62.65.145.30) daniel@benzedrine.cx on http://www.benzedrine.cx/dhartmei.html

      Happy? :)

      I assume most people prefer a relative time, rather than an absolute (which would have to be in GMT), but I might be wrong.

      Comments
      1. By tedu (69.12.168.114) on

        sweet

      2. By Anonymous Coward (213.118.74.206) on

        Sweet! This is a very nice feature!

        Don't think I've ever seen anything like that on any website. But it certainly is very cool.

  15. By Daniel Hartmeier (62.65.145.30) daniel@benzedrine.cx on http://www.benzedrine.cx/dhartmei.html

    On-the-fly gzip compression is now enabled. The front page compresses from 36.1 kB to 7.4 kB, so I guess I can leave the HTML whitespace in there now ;)

    Please let me know if anything breaks. Lynx seems to uncompress into a local file first, even though it links against zlib and should be able to inflate on-the-fly. Maybe I'm missing something there.

  16. By Fred Cirera (208.54.95.129) hwo-32ie@iximail.com on

    First, you should consider adding the ttl tag to your RSS file: <ttl>120</ttl> tells the RSS aggregator that the information is valid for the next two hours. The ETag field in the HTTP response also helps to save a lot of bandwidth. -fred-

Credits

Copyright © - Daniel Hartmeier. All rights reserved. Articles and comments are copyright their respective authors, submission implies license to publish on this web site. Contents of the archive prior to as well as images and HTML templates were copied from the fabulous original deadly.org with Jose's and Jim's kind permission. This journal runs as CGI with httpd(8) on OpenBSD, the source code is BSD licensed. undeadly \Un*dead"ly\, a. Not subject to death; immortal. [Obs.]