Saturday, 15 November 2008

A tour through home 'pageranking' - try it yerself!

Originally posted in The Daily Daniel
_____________________________

Now that I live in the foothills of Silicon Valley, I thought I'd write a piece on one of its most famous, useful and revolutionary spawns, Google, and specifically its PageRank algorithm. This is the piece of code that Google's search engine uses to find the webpages most relevant to your enquiry and rank them so that the most relevant are near the top. It's a bloody marvel, as we all know (remember the bad old days of Lycos and Yahoo searches etc - pah!)

Instead of writing stuff on it, which has been done elsewhere numerous times, I thought I would show you how to do it using Matlab.

First, you need to get the code. Cleve Moler at MathWorks has written a library which contains a primitive attempt at the proprietary algorithm. You need to download it, unzip it and untar it. This can all be done within Matlab, like so:

>> url='http://www.mathworks.com/moler/ncm.tar.gz';
>> gunzip(url,'ncm')
>> untar('ncm/ncm.tar','ncm')

Next, cd to where you unzipped it to, which is probably -

>> cd([pwd,filesep,'ncm'])

Then run the surfer function on a website of your choosing (this one, for example) and tell it to crawl n nodes (in this case 200 - warning, it takes a while):

>> [U,G]=surfer('http://www.thedailydanielblog.blogspot.com',200);

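U is a list of the web pages that surfer found by following links from yours, and G is the matrix recording which of them link to which. My website surfer matrix looks a little like this:

[Figure: plot of the surfer matrix G]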

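To make U and G a bit less abstract, here is a minimal sketch on a made-up four-page web rather than my real crawl. It assumes the convention that G(i,j) = 1 when page j links to page i, which is my understanding of how the ncm code stores links; the URLs and the from/to link lists are hypothetical.

```matlab
% Hypothetical 4-page web (NOT the crawl above).
% Assumed convention: G(i,j) = 1 means page j links to page i.
U = {'http://a.example'; 'http://b.example'; 'http://c.example'; 'http://d.example'};
from = [1 1 2 3 4];          % linking page
to   = [2 3 3 1 3];          % page being linked to
n = numel(U);
G = sparse(to, from, 1, n, n);

indeg  = full(sum(G, 2));    % row sums: how many pages link to each page
outdeg = full(sum(G, 1))';   % column sums: how many links leave each page

% Print one line per page
for i = 1:n
    fprintf('%-18s in=%d out=%d\n', U{i}, indeg(i), outdeg(i));
end
```

On a real crawl, spy(G) draws a dot for every nonzero entry, which is a quick way to see the link structure at a glance.
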
I was interested in what these sites were so I wrote a bit of code to print them to a text file:

>> fid=fopen('dansweb.txt','wt');
>> for i=1:numel(U), fprintf(fid,'%s\n',U{i}); end   % one URL per line
>> fclose(fid)

dansweb.txt then contains a list of 200 websites.

I then applied the pagerank function to U and G to score each page:

>> pagerank(U,G)
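
If you are curious what a function like this does under the bonnet, here is a rough sketch of the textbook PageRank power iteration on a made-up four-page web. It is not the ncm code itself: the graph, the damping factor p = 0.85 and the stopping tolerance are all my assumptions, and it uses the same assumed convention that G(i,j) = 1 means page j links to page i.

```matlab
% Hypothetical 4-page web; G(i,j) = 1 means page j links to page i (assumed convention)
n = 4;
from = [1 1 2 3 4];
to   = [2 3 3 1 3];
G = sparse(to, from, 1, n, n);

p = 0.85;                     % assumed damping factor: chance of following a link
c = full(sum(G, 1));          % out-degree of each page (column sums)

% Scale each column so page j shares its score equally among its out-links
k = find(c ~= 0);
D = sparse(k, k, 1 ./ c(k), n, n);
A = G * D;

% Power iteration: start uniform, repeat until the scores stop changing
x = ones(n, 1) / n;
for iter = 1:500
    xold = x;
    % Pages with no out-links (c == 0) spread their score over every page
    x = p * (A * xold) + (p * sum(xold(c == 0)) + (1 - p)) / n;
    if norm(x - xold, 1) < 1e-12, break; end
end

disp(x')                      % scores are probabilities that sum to 1
```

The scores come out as probabilities: each one is the long-run chance that a random surfer, who follows a link with probability p and otherwise jumps to a random page, is sitting on that page.
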

The results are unsurprising: it charts links I have made, blogspot pages associated with my labels, and widgets I have used. There is even a bar chart output, which looks slightly better if you add the websites to it. Here is the table it prints:

index page-rank in out url
190 0.0432 9 2 http://www.usgs.gov
191 0.0383 4 2 http://www.usgs.gov/ask
192 0.0383 4 2 http://search.usgs.gov
72 0.0137 21 0 http://www.blogger.com
79 0.0131 40 0 http://purl.org/syndication/thread/1.0
71 0.0129 20 0 http://www.blogger.com/profile/
184 0.0098 5 0 http://planet.ubuntu.com
5 0.0092 60 6 http://www.blogger.com/profile/02981038739002302942
189 0.0092 1 0 http://wiki.octave.org
193 0.0091 6 5 http://marine.usgs.gov
10 0.0087 61 0 http://ims.ucsc.edu
3 0.0086 59 117 http://thedailydanielblog.blogspot.com/feeds/posts/default
4 0.0085 58 118 http://www.blogger.com/feeds/1936258002310927863/posts/default
8 0.0082 57 1 http://www.gnu.org/software/octave
11 0.0082 57 0 http://twitter.com
12 0.0082 57 0 http://hypem.com
63 0.0082 57 7 http://www.ourblogtemplates.com
21 0.0081 56 54 http://thedailydanielblog.blogspot.com/search/label/photos
24 0.0081 56 50 http://thedailydanielblog.blogspot.com/search/label/santa%20cruz
23 0.0081 56 49 http://thedailydanielblog.blogspot.com/search/label/san%20francisco
26 0.0081 56 46 http://thedailydanielblog.blogspot.com/search/label/travel
22 0.0081 56 43 http://thedailydanielblog.blogspot.com/search/label/preparations
25 0.0081 56 42 http://thedailydanielblog.blogspot.com/search/label/santa%20news
14 0.0080 56 36 http://thedailydanielblog.blogspot.com/search/label/internet
15 0.0080 56 36 http://thedailydanielblog.blogspot.com/search/label/job
17 0.0080 56 36 http://thedailydanielblog.blogspot.com/search/label/map
18 0.0080 56 36 http://thedailydanielblog.blogspot.com/search/label/matlab
20 0.0080 56 35 http://thedailydanielblog.blogspot.com/search/label/musings
16 0.0080 56 34 http://thedailydanielblog.blogspot.com/search/label/linux
27 0.0080 56 34 http://thedailydanielblog.blogspot.com/search/label/work
19 0.0080 56 33 http://thedailydanielblog.blogspot.com/search/label/move
6 0.0078 21 0 http://www.blogger.com/openid-server.g
67 0.0069 20 0 http://www.blogger.com/openid-server.g\42 /\76\n
196 0.0057 3 4 http://walrus.wr.usgs.gov/infobank
197 0.0057 3 2 http://mrib.usgs.gov
183 0.0053 4 1 http://fridge.ubuntu.com
194 0.0052 3 0 http://vineyard.er.usgs.gov/query.html
66 0.0052 17 0 http://www.blogger.com/profile/02981038739002302942\42 /\76\n
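
If you want to build a table like this yourself, something along these lines should do it. This is only a sketch: the URLs, scores and link counts below are made up, and on a real crawl I am assuming you would capture the scores with x = pagerank(U,G) and take the in and out counts from the row and column sums of G.

```matlab
% Hypothetical stand-ins for U, the scores x, and the in/out link counts
U = {'http://a.example'; 'http://b.example'; 'http://c.example'; 'http://d.example'};
x      = [0.37; 0.20; 0.39; 0.04];   % made-up scores that sum to 1
indeg  = [1; 1; 3; 0];
outdeg = [2; 1; 1; 1];

% Sort pages from highest to lowest score
[sorted, idx] = sort(x, 'descend');

% Print index, score, in-links, out-links and URL for each page
fprintf(' index  page-rank   in  out  url\n');
for k = 1:numel(idx)
    j = idx(k);
    fprintf('%6d %10.4f %4d %4d  %s\n', j, x(j), indeg(j), outdeg(j), U{j});
end
```
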

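So my pages score close to 0, which is about as low as it gets. Bear in mind that these scores are probabilities that sum to 1 across the 200 pages crawled, so they are not directly comparable with the 0 to 10 scale of Google's toolbar pagerank, where 10 is the highest. This isn't a great example, admittedly, but I'll let interested readers have a play.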

I discovered that facebook has a 9 and plymouth.ac.uk has a 7. Now, before you comment, I know that there are already loads of web-based pagerank checkers out there, but this is more fun, and here you get a breakdown of which sites contribute to each other's scores, OK?

Said in the voice of Neil Buchanan from Art Attack (ITV since the 1990s) - 'try it yerself'!!

>> clear, clc, exit
