The last post was about the size of bash.org; this one is about xkcd, the famous comic site. With a simple set of scripts you get the whole archive and a few stats.
Use the script wisely; it puts a strain on the servers.
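#!/bin/bash
# Fetch xkcd pages 1..395 and the comic image each page embeds.
last=395
mkdir -p xkcd comics
echo "Downloading $last pages."
for i in $(seq 1 "$last"); do
    # Skip pages that were already fetched (file exists and is non-empty).
    if [ -s "xkcd/$i" ]; then
        continue
    fi
    echo -n "$(date +%H:%M:%S): Trying $i ..."
    lynx --source "http://xkcd.com/$i" > "xkcd/$i"
    echo -n " Done. Image:.. "
    # Pull the image file name out of the saved page's <img> tag (first match only).
    img=$(awk 'BEGIN{FS="<img src=\"http://imgs.xkcd.com/comics/";RS="\" title="}/<img/ && NF > 1{print $2; exit}' "xkcd/$i")
    # -P sets the download directory; lowercase -p would mean --page-requisites.
    wget -q -P "comics" -nH "http://imgs.xkcd.com/comics/$img"
    echo " Done."
    sleep 2s    # pause between requests to go easy on the server
done
echo "All done."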
This piece of code does something special: it extracts the image file name from each saved page and uses wget to download that image.
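The extraction step can also be tried on its own. Below is a minimal sketch, assuming the page embeds the comic as an <img> whose src starts with http://imgs.xkcd.com/comics/. The function name extract_image_name and the sample markup are made up for illustration and are not part of the script above; it uses grep and sed instead of the awk one-liner.

```shell
#!/bin/sh
# Hypothetical helper: read page HTML on stdin and print the file name of
# the first image served from imgs.xkcd.com/comics/ (prints nothing if none).
extract_image_name() {
    grep -o 'imgs\.xkcd\.com/comics/[^"]*' | head -n 1 | sed 's|.*/||'
}

# Made-up sample markup, shaped like the tag the awk command looks for.
sample='<img src="http://imgs.xkcd.com/comics/example_comic.png" title="hover text" alt="Example" />'
printf '%s\n' "$sample" | extract_image_name
```

Something like `lynx --source "http://xkcd.com/1" | extract_image_name` would then print just the file name, which can be appended to the image base URL for wget.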
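<?php
// $fajl holds the contents of one saved page, $vse counts all entries printed.
$n = 1;
$vse = 0;
while ($n < 410) {
    unset($fajl);
    $fajl = file_get_contents("original/" . $n);
    // Capture the entry number ($out[2]) and its text ($out[4]) for every entry on the page.
    preg_match_all("| <p class=\"quote\">(.*)<b>#(.*)</b>(.*) <p class=\"qt\">(.*) |Us", $fajl, $out);
    $i = 0;
    while (isset($out[0][$i])) {
        // Print the entry with its number in parentheses...
        echo '(' . $out[2][$i] . ")\n" . $out[4][$i] . "\n";
        // ...and once more with the bare number.
        echo $out[2][$i] . "\n" . $out[4][$i] . "\n";
        $i++;
        $vse++;
    }
    $n++;
}
// Total number of entries found.
echo "\n(" . $vse . ")";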
And this is the parser that builds the final big file of everything, coincidentally also making the comments easy to read.
The comics make up most of the download, at ~22 MB.
And as usual, the download link: LINK (22 MB), or email me for the data.
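A figure like this can be checked locally. Here is a minimal sketch, assuming the images were saved under comics/ as in the download script; dir_size_bytes is a hypothetical helper, not one of the original scripts.

```shell
#!/bin/sh
# Hypothetical helper: print the total size in bytes of all regular files
# below the directory given as $1.
dir_size_bytes() {
    find "$1" -type f -exec cat {} + | wc -c | tr -d ' '
}

# Report the image directory if it exists (the directory name is an assumption).
if [ -d comics ]; then
    echo "comics: $(dir_size_bytes comics) bytes"
fi
```

`du -sh comics` gives a quicker, human-readable figure, though it reports disk usage rather than the exact byte count.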