I spent the last few hours on a simple question: how large is the world's largest IRC quote database (bash.org)?
I mean specifically the quotes themselves, not the pages around them.
First I had to get them all; a simple bash script was sufficient.
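#!/bin/bash
# Download all 409 browse pages from bash.org into ./original,
# skipping any page already saved by an earlier run.
mkdir -p original
echo "Downloading 409 pages."
for i in $(seq 1 409); do
    # A non-empty file means this page was already fetched.
    if [ -s "original/$i" ]; then
        continue
    fi
    echo -n "$(date +%H:%M:%S): Trying $i ..."
    lynx --source "http://www.bash.org/?browse=$i" > "original/$i"
    echo "Done."
    sleep 10s    # pause between requests to go easy on the server
done
echo "All done."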
Please do not use that script; it puts a strain on the bash.org servers. Instead, you can grab the original files at the end of the article.
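Since the download loop skips files that already exist and are non-empty, a run that was interrupted can leave gaps. Here is a minimal sketch of a check for that, assuming the same layout as above (pages saved as original/1 through original/409). The function name missing_pages is my own invention for illustration, not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper: print every page number from 1 to $2 whose file
# in directory $1 is missing or empty.
missing_pages() {
    local dir=$1 last=$2 i
    for i in $(seq 1 "$last"); do
        # -s is true only for a file that exists and is non-empty
        [ -s "$dir/$i" ] || echo "$i"
    done
}
```

Calling missing_pages original 409 should print nothing once every page has been saved.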
After a couple of hours that was done, and I had my next script ready as well:
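<?php
// Pull every quote out of the 409 downloaded pages and print it.
$n = 1;     // current page number
$vse = 0;   // running total of quotes found
while ($n < 410) {
    $fajl = file_get_contents("original/" . $n);
    // Group 2 holds the quote number, group 4 the quote text.
    preg_match_all("|
(.*)#(.*)(.*)
(.*)|Us", $fajl, $out);
    $i = 0;
    while (isset($out[0][$i])) {
        echo '('.$out[2][$i].")\n".$out[4][$i]."\n";
        echo $out[2][$i]."\n".$out[4][$i]."\n";
        $i++;
        $vse++;
    }
    $n++;
}
// Print the total so it can be compared with the expected quote count.
echo "\n(".$vse.")";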
The last line is there to make sure I got all of them: 20440 at the time.
I ran it from the shell and redirected the output to a file named “final”: php parser.php > final
So the conclusion is that bash.org, counting only the quotes themselves, is ~5 MB.
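For anyone who wants to check that number on their own copy, here is a minimal sketch of how the size of the output file can be read off. It assumes the file is called final as above. The function name file_size_report is hypothetical, and I am counting 1 MB as 1,000,000 bytes.

```shell
#!/bin/bash
# Hypothetical helper: print the size of file $1 in bytes and in MB
# (assuming 1 MB = 1,000,000 bytes).
file_size_report() {
    local file=$1 bytes
    bytes=$(wc -c < "$file") || return 1
    bytes=$((bytes))    # drop the padding some wc implementations add
    awk -v b="$bytes" 'BEGIN { printf "%d bytes (%.2f MB)\n", b, b / 1000000 }'
}
```

Running file_size_report final prints one line with both figures.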
These are the files if you want them: link. (or email me)