RawDev.net - Developing developement developing developers.

Posts Tagged "script"

Size of XKCD

Saturday, March 15th, 2008 by Hekos

The last post was about the size of bash.org; this one is about xkcd, the famous webcomic. A simple set of scripts gets you the whole archive and a few stats.
Use the script wisely; it is a strain on the servers.

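#!/bin/bash
# Make sure the output directories exist.
mkdir -p xkcd comics
echo "Downloading 395 pages."
for i in `seq 1 395`;
do
	# Skip pages that were already saved (non-empty file).
	if [ -s "xkcd/$i" ]; then
		continue
	else
		echo -n "`date +%H:%M:%S`: Trying $i ..."
		lynx --source "http://xkcd.com/$i" > "xkcd/$i"
		echo -n " Done. Image:.. "
		# awk pulls the image file name out of the saved page;
		# wget -P sets the directory the image is saved into.
		wget -q -P "comics" -nH "http://imgs.xkcd.com/comics/"`awk 'BEGIN{FS="<img src=\"http://imgs.xkcd.com/comics/";RS="\" title="}/<img/{print $2}' "xkcd/$i"`
		echo " Done."
		# Pause between requests to go easy on the server.
		sleep 2s
	fi
done
echo "All done."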

This piece of code does something special: it pulls the file name of the image out of each saved page and uses wget to download it.
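To make the extraction step easier to follow, here is a minimal sketch of the same idea as a standalone function. It assumes the page markup the awk pattern above expects (an img tag whose src is followed directly by a title attribute). The function name extract_image_name and the sample line are made up for illustration, and it uses sed instead of the awk one-liner.

```shell
#!/bin/bash
# Hypothetical helper: read an xkcd page on stdin and print the file name
# of the comic image. Assumes markup of the form
#   <img src="http://imgs.xkcd.com/comics/NAME" title="...">
extract_image_name() {
	# Keep only the part between the comics/ prefix and the closing quote;
	# head stops after the first matching line.
	sed -n 's|.*<img src="http://imgs\.xkcd\.com/comics/\([^"]*\)" title=.*|\1|p' | head -n 1
}

# Made-up sample line, for illustration only.
sample='<img src="http://imgs.xkcd.com/comics/example_comic.png" title="Example hover text" alt="Example" />'
name=$(printf '%s\n' "$sample" | extract_image_name)
echo "$name"
```

In the download loop this could be fed from the saved page, for example with extract_image_name < "xkcd/$i", and the result appended to the image base URL.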

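<?php
// As posted, this listing reads the saved pages in original/ and matches bash.org's quote markup.
$n=1;
$vse=0; // running total of matched entries
while ($n < 410) {
	unset ($fajl);
	$fajl=file_get_contents("original/".$n);

	// Assumption: the pattern ends at the closing </p> of the quote body;
	// without an end anchor the ungreedy (.*) would capture nothing.
	preg_match_all("|<p class=\"quote\">(.*)<b>#(.*)</b>(.*)<p class=\"qt\">(.*)</p>|Us", $fajl, $out);
	$i=0;
	while (isset($out[0][$i])) {
		// Print the quote number in parentheses, then the quote text.
		echo '('.$out[2][$i].")\n".$out[4][$i]."\n";
		// Variant without the parentheses, disabled here so each entry is written once:
		// echo $out[2][$i]."\n".$out[4][$i]."\n";
		$i++;
		$vse++;
	}
	$n++;
}
// Total number of entries found.
echo "\n(".$vse.")";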

This is the parser that builds the final big file of everything; as a side effect, it also makes the comments easy to read.
The comics make up most of the download, at ~22 MB.
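For anyone who wants to check a figure like that themselves, here is a rough sketch that adds up the file sizes in a directory. The function name dir_bytes is made up for illustration, and the comics directory name is only assumed from the download script; du -sh would give a similar number in one call.

```shell
#!/bin/bash
# Hypothetical helper: sum the sizes, in bytes, of the regular files
# directly inside a directory (subdirectories are ignored).
dir_bytes() {
	local dir=$1 total=0 size f
	for f in "$dir"/*; do
		[ -f "$f" ] || continue
		# wc -c reads from stdin here, so it prints only the byte count.
		size=$(wc -c < "$f")
		total=$((total + size))
	done
	echo "$total"
}

# Directory to measure; defaults to comics (assumed name). Prints 0 if it is missing.
dir_bytes "${1:-comics}"
```

Dividing the result by 1048576 gives the size in MB.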

And as usual, the download link: LINK (22 MB), or email me for the data.

Posted in Hacking, Scripting - No Comments

Size of Bash.org

Saturday, March 15th, 2008 by Hekos

I spent the last few hours on a simple question: how large is the world's largest IRC quote database (bash.org)?
Specifically, how large are the quotes themselves?

So first I had to get them all; a simple bash script was sufficient.

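#!/bin/bash
# Make sure the output directory exists.
mkdir -p original
echo "Downloading 409 pages."
for i in `seq 1 409`;
do
	# Skip pages that were already saved (non-empty file).
	if [ -s "original/$i" ]; then
		continue
	else
		echo -n "`date +%H:%M:%S`: Trying $i ..."
		lynx --source "http://www.bash.org/?browse=$i" > "original/$i"
		echo "Done."
		# Long pause between requests to go easy on the server.
		sleep 10s
	fi
done
echo "All done."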

Please do not use that script; it is a strain on the bash.org servers. Instead, you can grab the original files at the end of the article.
After a couple of hours the download was done, and I had my next script ready as well:

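<?php
$n=1;
$vse=0; // running total of quotes found
while ($n < 410) {
	unset ($fajl);
	$fajl=file_get_contents("original/".$n);

	// Assumption: the pattern ends at the closing </p> of the quote body;
	// without an end anchor the ungreedy (.*) would capture nothing.
	preg_match_all("|<p class=\"quote\">(.*)<b>#(.*)</b>(.*)<p class=\"qt\">(.*)</p>|Us", $fajl, $out);
	$i=0;
	while (isset($out[0][$i])) {
		// Print the quote number in parentheses, then the quote text.
		echo '('.$out[2][$i].")\n".$out[4][$i]."\n";
		// Variant without the parentheses, disabled here so each quote is written once:
		// echo $out[2][$i]."\n".$out[4][$i]."\n";
		$i++;
		$vse++;
	}
	$n++;
}
// Total number of quotes found.
echo "\n(".$vse.")";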
 

The last line is there to make sure I got all of them: 20440 at the time.
I ran it from the shell and redirected the output to "final": php parser.php > final
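The total printed by the parser can be cross-checked against the raw pages. The sketch below simply counts how often the quote header marker shows up in the saved files; the function name count_quotes is made up for illustration, and the marker assumes the bash.org markup of the time.

```shell
#!/bin/bash
# Hypothetical helper: count occurrences of the quote header marker
# across all files in a directory.
count_quotes() {
	local dir=$1
	# grep -o prints every match on its own line, so wc -l counts matches
	# rather than matching lines; tr strips padding that some wc versions add.
	cat "$dir"/* 2>/dev/null | grep -o '<p class="quote">' | wc -l | tr -d ' '
}

# Directory with the saved pages; defaults to original. Prints 0 if it is missing.
count_quotes "${1:-original}"
```

If this number and the one at the end of "final" disagree, some pages were probably saved incomplete and are worth downloading again.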

So the conclusion is that the size of bash.org is ~5 MB.
These are the files if you want them: link (or email me).

Posted in Hacking, Scripting - No Comments