I spent the last few hours on a simple question: how large is the world's largest IRC quote database (bash.org)?
I mean specifically the quotes themselves.
First I had to get them all; a simple bash script was sufficient.
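#!/bin/bash
# Download every browse page into original/, skipping pages already fetched.
mkdir -p original
echo "Downloading 409 pages."
for i in $(seq 1 409); do
    # -s: the file exists and is not empty, so this page is already done
    if [ -s "original/$i" ]; then
        continue
    else
        echo -n "$(date +%H:%M:%S): Trying $i ..."
        lynx --source "http://www.bash.org/?browse=$i" > "original/$i"
        echo "Done."
        # wait between requests to go easy on the server
        sleep 10s
    fi
done
echo "All done."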
Please do not use that script; it puts a strain on the bash.org servers. Instead, you can grab the original files at the end of the article.
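If you already have an original/ directory from a run like this, a quick sanity check is to list the pages that are missing or empty. This is only a minimal sketch: the function name missing_pages is made up for illustration, and the directory name and page count are simply the values used in the script above.

```shell
#!/bin/bash
# Sketch: list page numbers that are missing or empty in a download directory.
# missing_pages is a hypothetical helper, not part of the original scripts.

missing_pages() {
    local dir="$1" last="$2" i
    for i in $(seq 1 "$last"); do
        # -s is true only if the file exists and is not empty
        [ -s "$dir/$i" ] || echo "$i"
    done
}

# Only run the check when the download directory is present.
if [ -d original ]; then
    missing_pages original 409
fi
```

An empty result means every page from 1 to 409 is there and has content; any numbers printed are pages that would need another attempt.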
After a couple of hours that was done, and I had my next script ready as well:
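<?php
// Parse the downloaded pages and print every quote number with its text.
$n = 1;   // current page number
$vse = 0; // running total of quotes found
while ($n < 410) {
    unset($fajl);
    $fajl = file_get_contents("original/" . $n);
    // As used below, group 2 is the quote number and group 4 the quote text.
    preg_match_all("| <p class=\"\">(.*)<strong>#(.*)</strong>(.*) <p class=\"\">(.*)|Us", $fajl, $out);
    $i = 0;
    while (isset($out[0][$i])) {
        echo '(' . $out[2][$i] . ")\n" . $out[4][$i] . "\n";
        echo $out[2][$i] . "\n" . $out[4][$i] . "\n";
        $i++;
        $vse++;
    }
    $n++;
}
// Print the total so the count can be checked.
echo "\n(" . $vse . ")";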
The last line is there to make sure I got all of them: 20440 at the time.
I ran it from the shell and redirected the output to a file called "final":
php parser.php > final
So the conclusion: the size of bash.org is ~5 MB.
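Checking the result can be sketched roughly like this, assuming the output file is called final and its last line is the "(N)" total written by the parser. The function names quote_count, size_bytes and size_mb are hypothetical helpers, and 1 MB is taken here as 1048576 bytes.

```shell
#!/bin/bash
# Sketch: read back the quote count and measure the size of the parser output.
# All three helpers are illustrative names, not part of the original scripts.

# Print the number from the last line, which the parser writes as "(N)".
quote_count() {
    tail -n 1 "$1" | tr -d '()'
}

# Print the file size in bytes.
size_bytes() {
    wc -c < "$1" | tr -d ' '
}

# Print the file size in MB (1 MB assumed to be 1048576 bytes), two decimals.
size_mb() {
    awk -v b="$(size_bytes "$1")" 'BEGIN { printf "%.2f\n", b / 1048576 }'
}

# Only report when the parser output is present.
if [ -f final ]; then
    echo "quotes: $(quote_count final)"
    echo "size:   $(size_bytes final) bytes ($(size_mb final) MB)"
fi
```

Run from the directory that contains final, this prints the quote total and the file size next to each other.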
Here are the files if you want them: link (or email me).