The last post was about the size of bash.org; this one is about xkcd, the famous comic site. With a simple set of scripts you get the whole set of comics and a few stats.
Use the script wisely; it puts a strain on the servers.
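#!/bin/bash
# Download xkcd pages 1..395 and the comic image each page references.
mkdir -p xkcd comics
echo "Downloading 395 pages."
for i in $(seq 1 395); do
    # Skip pages that were already saved (non-empty file), so the script can be resumed.
    if [ -s "xkcd/$i" ]; then
        continue
    fi
    echo -n "$(date +%H:%M:%S): Trying $i ..."
    lynx --source "http://xkcd.com/$i" > "xkcd/$i"
    echo -n " Done. Image:.. "
    # Pull the image file name out of the saved page (a multi-character RS needs gawk or mawk).
    img=$(awk 'BEGIN{FS="<img src=\"http://imgs.xkcd.com/comics/";RS="\" title="}/<img/{print $2; exit}' "xkcd/$i")
    # -P comics saves the image into the comics/ directory.
    wget -q -P comics -nH "http://imgs.xkcd.com/comics/$img"
    echo " Done."
    sleep 2s   # pause between requests to go easy on the server
done
echo "All done."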
This piece of code does something special: it extracts the file name of the comic image from each saved page and uses wget to download it.
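The extraction step can be tried on its own, without touching the server. The following is only a minimal sketch: it uses sed instead of the awk one-liner, and the function name extract_image_name and the sample HTML line are made up for illustration, assuming the page markup looks like what the script above expects.

```shell
#!/bin/bash
# Hypothetical helper: print the comic image file name found in a saved page.
# Assumes the image tag has the form <img src="http://imgs.xkcd.com/comics/NAME" ...>.
extract_image_name() {
    sed -n 's|.*<img src="http://imgs\.xkcd\.com/comics/\([^"]*\)".*|\1|p' "$1" | head -n 1
}

# Try it on a made-up sample page instead of a real download.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div id="comic">
<img src="http://imgs.xkcd.com/comics/example_comic.png" title="Example title text" alt="Example" />
</div>
EOF
extract_image_name "$sample"   # prints: example_comic.png
rm -f "$sample"
```

The printed name is what gets appended to http://imgs.xkcd.com/comics/ before the URL is handed to wget.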
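<?php
// Walk the saved pages in original/ and print every entry with its number.
$n = 1;
$vse = 0; // running total of entries found
while ($n < 410) {
    $fajl = file_get_contents("original/" . $n);
    // U = ungreedy, s = dot also matches newlines.
    preg_match_all("|
<p class=\"quote\">(.*)<b>#(.*)</b>(.*)
<p class=\"qt\">(.*)
|Us", $fajl, $out);
    // Group 2 is the number after '#', group 4 is the body text.
    $i = 0;
    while (isset($out[0][$i])) {
        echo '(' . $out[2][$i] . ")\n" . $out[4][$i] . "\n";
        // Alternative format without the parentheses:
        // echo $out[2][$i] . "\n" . $out[4][$i] . "\n";
        $i++;
        $vse++;
    }
    $n++;
}
echo "\n(" . $vse . ")";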
And a parser that builds the final big file of everything, which also happens to make the comments easy to read.
The comics make up most of the download, at ~22 MB.
And as usual, the download link: LINK (22 MB), or email me for the data.
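For anyone who wants to reproduce that kind of figure, here is a rough sketch of how the stats could be gathered. It assumes the pages sit in xkcd/ and the images in comics/ (the directory named in the wget call); dir_stats is a hypothetical helper, not part of the scripts above.

```shell
#!/bin/bash
# Hypothetical helper: print the number of files in a directory and their total size in bytes.
dir_stats() {
    stats_count=$(find "$1" -type f | wc -l | tr -d ' ')
    stats_bytes=$(find "$1" -type f -exec cat {} + | wc -c | tr -d ' ')
    echo "$1: $stats_count files, $stats_bytes bytes"
}

# Report on whichever of the two download directories exist.
for d in xkcd comics; do
    if [ -d "$d" ]; then
        dir_stats "$d"
    fi
done
```

Dividing the byte count by 1048576 gives the size in MB.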