Podcast Downloads Part Deux


I just checked the Apache logs and grepped out the lines dealing with the podcasts, and it looks like the podcasts have been downloaded a total of 661 times (not the 1377 I first reported; see the update below)! Holy crap! Who is downloading this stuff? No, really. I'd love to know. If you've downloaded any of the experimental podcasts, please let me know!

The simple and not exactly efficient command I call to count podcatches is:


grep "/~dnorman/podcasts/" < /var/log/httpd/commons_access_log | grep "GET " | grep " 200 " | grep ".mp3" | grep -v "65536" | wc -l

The basic logic of that statement is something like: "Look in the Apache log, and pull out all lines referring to files in the /~dnorman/podcasts/ directory. Of those lines, retain only the requests that were GET (ignoring HEAD requests...), and of those, retain only the ones that returned 200 (not incomplete, file not found, etc.). Of those lines, retain only the ones that point directly to a .mp3 file (ignoring directory listings...), ignore the silly repeated downloads by podcatching software (which appears to download 65536 bytes of each file just to check that it's still there...), and feed what's left into the wc (word count) command, telling it to return the number of lines left."

I'm sure there's a more efficient and reliable way of doing this (if I knew wtf I was doing with regular expressions, I could probably combine all of the piped grep statements into a single one with a more complex pattern), but it seems to work... If I want to see how many times a particular podcast has been downloaded, I can change the ".mp3" grep to the filename I'm looking for.
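For what it's worth, here is a rough sketch of how the same filtering could be done in a single pass with awk instead of a chain of greps. This is not the command I actually ran. The function name count_podcast_downloads is made up for illustration, and the field positions assume Apache's default common/combined log format, so a custom LogFormat would need different field numbers. It is also slightly stricter than the grep chain: it only looks at the request path, status, and byte-count fields, rather than matching "65536" or ".mp3" anywhere on the line.

```shell
#!/bin/sh
# Hypothetical helper (not from the original command): count successful GET
# requests for .mp3 files under a path prefix, skipping 65536-byte probes.
# Assumes whitespace-split fields of Apache's common/combined log format:
#   $6 = "GET (method, with its leading quote)   $7 = request path
#   $9 = status code                             $10 = bytes sent
count_podcast_downloads() {
    log_file="$1"
    prefix="${2:-/~dnorman/podcasts/}"   # default taken from the grep above
    awk -v prefix="$prefix" '
        $6 == "\"GET" &&              # GET only, ignoring HEAD and friends
        index($7, prefix) == 1 &&     # path starts with the prefix
        $7 ~ /\.mp3$/ &&              # path ends in a literal ".mp3"
        $9 == 200 &&                  # successful responses only
        $10 != 65536 { n++ }          # skip the podcatcher probes
        END { print n + 0 }           # print 0 rather than an empty line
    ' "$log_file"
}
```

Called as count_podcast_downloads /var/log/httpd/commons_access_log, it prints a single number. Passing a full path to one file as the second argument should narrow the count to that one podcast, which is the same trick as swapping the ".mp3" grep for a filename.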

UPDATE: Christian let me know in the comments that podcatching software typically does a very silly thing: it repeatedly downloads about 65K of the mp3 to see if it's changed. These programs don't do the right thing and send a HEAD request, which would tell them the same thing in a few bytes of text. Instead they download about 65K of the actual file and then abort. That's so unbelievably silly that my mind reels. It also pollutes the Apache logs, so it's MUCH harder to see how many real downloads are going on. Perhaps I could modify my grep to drop lines containing the (hopefully fixed-size) downloads associated with podcatch pinging...
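To get a feel for how much of the log is podcatcher pinging versus real downloads, something like the following sketch could tally the two separately. The function name summarize_podcast_requests is hypothetical, the field numbers again assume Apache's default common/combined log format, and it assumes the probes are logged as status 200 with exactly 65536 bytes sent. If a podcatcher aborts at some other size, or the server logs partial responses differently, those probes would be counted as downloads.

```shell
#!/bin/sh
# Hypothetical helper: split successful podcast GETs into probable podcatcher
# probes (exactly 65536 bytes sent) and everything else.
# Assumes the default Apache log layout: $6 = "GET, $7 = path,
# $9 = status, $10 = bytes sent.
summarize_podcast_requests() {
    awk '
        $6 == "\"GET" &&
        index($7, "/~dnorman/podcasts/") == 1 &&
        $7 ~ /\.mp3$/ &&
        $9 == 200 {
            if ($10 == 65536) probes++; else downloads++
        }
        END { printf "downloads=%d probes=%d\n", downloads, probes }
    ' "$1"
}
```

Running summarize_podcast_requests /var/log/httpd/commons_access_log prints both counts on one line, which makes it easy to compare the "real" number against the inflated "raw" one.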

Anyway, I'm not disappointed by the lower "real" numbers - I was more freaked out by the extremely inflated "raw" numbers. Still, if you're downloading these things, I'd love to know...

