Contributed by dhartmei on from the bean-counter dept.
If you're using undeadly.org's RSS feed in any way, please read on.
In December 2005, about 17,000 unique IP addresses requested the RSS feed more than 1,000,000 times in total. The returned XML document is about 7kB in size, so that's roughly 7GB of traffic, almost half of the entire traffic generated.
Further investigation showed that about half of the clients do provide an If-Modified-Since: HTTP header, and the CGI is now using that to return a brief 304 Not Modified status code where appropriate (84 bytes vs. 7,000 bytes).
However, of the remaining 50 percent of RSS readers which don't provide this header, some are polling at a very high frequency, like once every minute. Such a client adds up to 300MB/month, a disproportionately high number, which we can't afford to spend on a large number of clients.
I'd like to ask everyone who has an RSS reader/aggregator set up to please:
- Enable the use of the If-Modified-Since: header, so your client doesn't fetch old, redundant documents in full, or switch to a client which supports it. With this enabled, you are free to poll as often as once every 5-15 minutes, as most replies will be small (Not Modified).
- If this is not possible, and your client fetches the full document on each request, limit polling to once every 15-60 minutes.
There were 214 stories in 2005, which means the average time between two posted stories was about 40 hours. It's a waste to fetch the very same document hundreds of times just to get the rare changed ones with a latency of less than 15 minutes.
The CGI now (experimentally) keeps track of RSS requests per client IP address. If a client (who is not using If-Modified-Since:) fetches the full document more than 96 times within a 24h sliding window, it receives a 503 Service Unavailable status code instead. If this is hitting someone unfairly, please contact me.
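For illustration, a minimal sketch of a polite poller along these lines (Python; my own hypothetical example, not undeadly.org's code): remember the Last-Modified header from the previous reply, send it back as If-Modified-Since:, and treat 304 and 503 as "nothing new".

import time
import urllib.error
import urllib.request

FEED_URL = "http://undeadly.org/cgi?action=rss"
last_modified = None  # Last-Modified value from the previous successful fetch

def poll():
    global last_modified
    req = urllib.request.Request(FEED_URL)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(req) as resp:
            last_modified = resp.headers.get("Last-Modified")
            return resp.read()  # the full ~7kB document, only when it changed
    except urllib.error.HTTPError as e:
        if e.code in (304, 503):  # Not Modified, or rate limited
            return None  # nothing new, try again later
        raise

while True:
    doc = poll()
    if doc is not None:
        pass  # hand the fresh XML to your feed parser here
    time.sleep(15 * 60)  # poll every 15 minutes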
The more bandwidth we can save, the more we can use for more valuable content, like stories with images.
Happy new year to everyone!
By Anonymous Coward (213.202.214.156) on
I subscribe to the RSS feed and had a 30-minute polling interval until reading this article; I've changed that to 120 minutes now.
I do all my RSS fetching (as well as most of my web browsing) over tor and just wanted to request that you take that special situation into account in your rate-limiting. I have no control over which exit node my RSS requests route through to your server, nor how many others use the same exit node, and it'd be a pity if my well-behaved requests were denied because of that.
The list of current tor exit nodes can be downloaded and is dynamically updated, so you could put in a special policy for these nodes and either
1.) Not block them at all
2.) Only allow RSS download with a username and block subscribers who abuse the service via that username
Thanks.
By Daniel Hartmeier (62.65.145.30) daniel@benzedrine.cx on http://www.benzedrine.cx/dhartmei.html
Note that if your reader DOES supply the If-Modified-Since: header, it's not counted towards rate limiting at all. Of the user agents seen from Tor, only the following don't send that header:
MagpieRSS/0.72 (+http://magpierss.sf.net)
RssReader/1.0.91.0 (http://www.rssreader.com) Microsoft Windows NT 5.1.2600.0
Snownews/1.5.4 (OpenBSD; http://snownews.kcore.de/)
If you're not using any of those, you're safe.
The latter two clients seem to support the feature; I see requests (from other IPs) using it, so I assume they can be configured to use it.
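If you want to see for yourself what the CGI does with a conditional request, you can reproduce one by hand; a rough sketch (hypothetical example, with an arbitrary HTTP-date):

import urllib.error
import urllib.request

req = urllib.request.Request("http://undeadly.org/cgi?action=rss")
req.add_header("If-Modified-Since", "Sun, 01 Jan 2006 00:00:00 GMT")
try:
    with urllib.request.urlopen(req) as resp:
        print(resp.status, len(resp.read()), "bytes")  # 200: changed since that date
except urllib.error.HTTPError as e:
    print(e.code)  # 304: the brief Not Modified reply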
By Rembrandt (193.201.54.32) on
I'm using Firefox for the RSS stuff.
I didn't find any RSS-related options, but I'm no X/GUI wizard either...
And regarding Tor: some guys don't use Privoxy ;-)
By Daniel Hartmeier (62.65.145.30) daniel@benzedrine.cx on http://www.benzedrine.cx/dhartmei.html
But Mozilla is not a problem, as far as I can see. Not many people manually click Shift-Reload once a second for hours, and its automatic reloading (when you pull an RSS feed URL into the bookmarks bar, so it appears like a folder with the latest stories underneath) sends the header.
If you really want to know, run something like
# tcpdump -s 1600 -nvvvpXi $ext_if tcp and host 66.51.111.60 and port 80
and check the HTTP request(s).
By Daniel Hartmeier (62.65.145.30) daniel@benzedrine.cx on http://www.benzedrine.cx/dhartmei.html
1. Open the URL about:config
2. Search for the entry browser.bookmarks.livemark_refresh_seconds
3. If it doesn't exist (which is the default), create the entry (right-click on any existing entry for the context menu, then New, Integer)
4. The default value (if the entry doesn't exist) is 3600 seconds (one hour), which is fine
By Christian Kellermann (85.31.186.61) Christian.Kellermann@nefkom.net on
So please don't hit that 'R' key too often.
Cheers,
Christian
By Daniel Hartmeier (62.65.145.30) daniel@benzedrine.cx on http://www.benzedrine.cx/dhartmei.html
http://undeadly.org/cgi?action=rss&items=3
will return the newest 3 stories. The default, if the parameter is not supplied, is now 4. The maximum is 10. Obviously, the more stories the RSS document contains, the larger it becomes.
If you poll with a high frequency to mail yourself every new story as soon as possible, you can save bandwidth by requesting fewer items, based on the assumption that no more than 3 stories are posted within 15 minutes (which has mostly been true, afaik).
On the other hand, if you want something like a daily digest, and have been polling more than once a day because you feared missing stories (if there were ever more than four on one day), you could safely poll once a day with &items=10.
In short, the more often you poll, the fewer items per document you generally need.
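A hypothetical helper that applies this guideline, picking the items= value from the poll interval (the thresholds are just the figures mentioned above):

def items_for_interval(minutes):
    # frequent polls need few items; rare polls want the full backlog
    if minutes <= 15:
        return 3   # rarely more than 3 stories per 15 minutes
    if minutes <= 240:
        return 4   # the default
    return 10      # daily digest: the maximum

url = "http://undeadly.org/cgi?action=rss&items=%d" % items_for_interval(1440)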
By Anonymous Coward (80.65.225.229) on
Console clients don't interpret most HTML "font" tag properties (like the implied "type" and "size" ones).
As for table properties (width, cellspacing, border, ...), all of this would benefit from being factored out into a CSS file, for bandwidth's sake (and for a faster browsing experience).
I find that CSS-based sites (layout and design) are usually much more lynx-friendly, at the end of the day.
Dillo is a special case though.
By Matthias Kilian (84.134.57.236) on
A 24h window is ineffective for lots of (at least) German users behind typical el-cheapo ADSL connections, since those have to reconnect after 24 hours anyway. So better to use something like 32/8h (or even shorter windows).
By Daniel Hartmeier (62.65.145.30) daniel@benzedrine.cx on http://www.benzedrine.cx/dhartmei.html
Example: you fetch every minute, say for the first time on Monday 02:01. You'll get a reply every minute, 96 in total, until 03:37, when you get the first error (96 requests in the last 24h, Sun 03:37 to Mon 03:37).
You continue to try to fetch every minute, and get only errors. That is until Tue 02:02 (95 requests in the last 24h, Mon 02:02 to Tue 02:02), when the request succeeds again. So do the next 95 requests. Then everything repeats. You're effectively fetching no more than 96 pages per day.
Now, with the same example, assume the client switches IP address on Mon 10:00. His next requests will succeed, since there is nothing counted for the new address. But at 11:37 he reaches the limit again, for the new address.
In short, if you change IP addresses N times a day, you can fetch at most N*96 times a day. Assuming N is not very large (1-3), the limiting still works. At least it's not at all 'ineffective'.
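The behaviour above is what falls out of a plain sliding-window counter; a minimal sketch of the idea (my own illustration, not the actual CGI code):

import time
from collections import defaultdict, deque

WINDOW = 24 * 3600  # seconds
LIMIT = 96          # full fetches allowed per window

hits = defaultdict(deque)  # client IP -> timestamps of counted fetches

def allow(ip, now=None):
    # Return True if this fetch may be served in full, False for a 503.
    now = time.time() if now is None else now
    q = hits[ip]
    while q and q[0] <= now - WINDOW:  # forget fetches older than 24h
        q.popleft()
    if len(q) >= LIMIT:
        return False  # over the limit: answer 503 Service Unavailable
    q.append(now)  # only served requests are counted, denied ones are not
    return True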
By Daniel Hartmeier (62.65.145.30) on
http://undeadly.org/rsslog.txt
This is updated every 15 minutes. If you find yourself on this list, see the original post above.
By Anonymous Coward (80.65.225.229) on
And CSS would have a nice benefit also.
Those would save the clients bandwidth too, and make the pages load faster, so a good win for everyone, eh?
By 1uanjo (84.120.176.87) on http://blackshell.usebox.net/
Good point, but I bet most RSS clients don't handle gzip or deflate content, so the RSS problem won't be fixed by compressing RSS if nobody asks for it compressed :P
It should be easy to do a survey: just log the 'Accept-Encoding' header from RSS readers and see how many put gzip or deflate there.
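A sketch of that survey over an httpd access log, assuming the server was configured to log the header (e.g. an Apache LogFormat with %{Accept-Encoding}i as the last quoted field; adjust the parsing to your own format):

import re
import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:
    if "action=rss" not in line:  # only look at RSS feed requests
        continue
    quoted = re.findall(r'"([^"]*)"', line)
    enc = quoted[-1] if quoted else ""  # the logged Accept-Encoding value
    counts["compressed" if ("gzip" in enc or "deflate" in enc) else "plain"] += 1

for kind, n in counts.most_common():
    print(kind, n)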
By Marco Peereboom (67.64.89.177) marco@peereboom.us on http://www.peereboom.us
Unsolicited advice is great!
Thanks again!
By Janne Johansson (82.182.176.20) it.su.se (at-reversed) jan.johansson on
a 100mbit line for free. Including one 5-900MHz P3 machine for it to run 24/7 dedicated to this.
No problems.
By Anonymous Coward (216.184.0.157) on
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.19
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.24
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.26
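Those sections cover the entity-tag and conditional request headers; the ETag-based variant of a conditional GET looks roughly like this (a sketch, assuming the server sends an ETag at all):

import urllib.error
import urllib.request

url = "http://undeadly.org/cgi?action=rss"
with urllib.request.urlopen(url) as resp:
    etag = resp.headers.get("ETag")  # opaque validator, if the server sends one

req = urllib.request.Request(url)
if etag:
    req.add_header("If-None-Match", etag)
try:
    with urllib.request.urlopen(req):
        print("changed")  # 200: a new document
except urllib.error.HTTPError as e:
    if e.code == 304:
        print("unchanged")  # the brief Not Modified reply
    else:
        raise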
By Coward, not totally anonymous (69.163.38.27) matt@reliabledata.com on
I do this for a living. I buy my bandwidth by the megabit, but I generally have to sell it by the total megabytes or gigabytes transferred, because that's a concept customers can get their heads around. If a customer came to me with a circumstance like this, I would suggest we do the billing by the total amount transferred, but do some pf/altq rate limiting to save some money.

You see, as a host, I am not necessarily concerned by how many gigabytes you move. My primary concern is the percentage of my available bandwidth that is being used up at any one point in time. I like to stay below 70%.

It seems to me that some things you do with your site may not be as important as other things you do, and I bet you could slow down the rate a bit. We are talking about OpenBSD after all. There is usually no good reason why someone that receives a file in 60 seconds cannot wait 70, or 80, or even 90 seconds for that file. It's a small price for the end-user to pay, and can make all the difference at billing time.
By Daniel Hartmeier (195.234.187.87) daniel@benzedrine.cx on http://www.benzedrine.cx/dhartmei.html
To those who suggest different providers: please check all related costs (including power, cooling, and support), and assume you want to bring your own box, which you (and only you) have full access to.
By wob (216.150.214.170) wob@bonch.org on
Most nerds that I know (including myself) will obsess over inefficient software, hardware, protocols, life itself :) etc. We hate waste. If we can make something more efficient, we will. It has been pointed out how some of the RSS clients are wasteful in how they do things, which wastes money that could go somewhere else (like more geek toys!) instead of being thrown away on badly behaving clients. It has nothing to do with the *affordability* of the bandwidth cost. The bandwidth can be paid for, but why pay for inefficient usage because of poorly designed clients?
Geesh, how hard is that to understand... I could not bite my tongue any longer and not point this out.
By Anonymous Coward (70.74.75.227) on
<html><head>Administrative: RSS rate limiting</title></head>
You can save on average 10% of text bandwidth doing this.
Happy New Year!!!
By Daniel Hartmeier (62.65.145.30) daniel@benzedrine.cx on http://www.benzedrine.cx/dhartmei.html
I assume most people prefer a relative time, rather than an absolute (which would have to be in GMT), but I might be wrong.
By Anonymous Coward (213.118.74.206) on
Don't think I've ever seen anything like that on any website. But it certainly is very cool.
By Daniel Hartmeier (62.65.145.30) daniel@benzedrine.cx on http://www.benzedrine.cx/dhartmei.html
Please let me know if anything breaks. Lynx seems to uncompress into a local file first, even though it links against zlib and should be able to inflate on-the-fly. Maybe I'm missing something there.
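On the client side, a reader opts in by sending Accept-Encoding and inflating the reply itself; a minimal sketch:

import gzip
import urllib.request

req = urllib.request.Request("http://undeadly.org/cgi?action=rss")
req.add_header("Accept-Encoding", "gzip")
with urllib.request.urlopen(req) as resp:
    body = resp.read()
    if resp.headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)  # inflate back to the plain XML
print(len(body), "bytes of XML")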