Source code for Page rank algorithm
Posted: Tue Nov 17, 2009 10:41 pm
by ved2210
Hi friends,
I have been working on the PageRank algorithm for the last month.
I have to implement the algorithm and develop the code for it.
Basically, I have the number of inbound and outbound links for each individual page of a website. I also have the list of pages that link to other pages of the website (I have developed a crawler for that), and now I am stuck on the actual implementation of the algorithm.
Please, geeks, I need your help.
Thanks ,
Ved.
Re: Source code for Page rank algorithm
Posted: Wed Nov 18, 2009 1:32 am
by requinix
There is no single "page rank algorithm". You have to come up with it yourself.
You'll need more than just the number of inbound and outbound links to calculate a good ranking.
Re: Source code for Page rank algorithm
Posted: Wed Nov 18, 2009 2:17 am
by Weiry
tasairis wrote:You'll need more than just the number of inbound and outbound links to calculate a good ranking.
From what I understand, most ranking sites (the ones I'm thinking of) use nothing more than inbound and outbound counts to determine ranking.
That said, you could of course implement some sort of algorithm taking many variables into account, but I can see the point of doing that when you're looking for the most active pages (that's my assumption).
The real question is: how are you storing your inbound and outbound traffic numbers?
If they were stored in a database, you could simply select all your entries and order by your inbound field descending, which would give you a list of entries ordered by highest inbound first (which I would treat as a good ranking).
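As a naive sketch of that ORDER BY idea, using SQLite in Python (the `pages` table, its columns, and the sample rows are all invented for illustration):

```python
import sqlite3

# Invented schema: one row per page with raw inbound/outbound link counts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, inbound INTEGER, outbound INTEGER)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [("/home", 50, 10), ("/about", 5, 3), ("/blog", 120, 40)],
)

# Highest inbound count first -- the naive "ranking" described above.
for url, inbound in conn.execute(
    "SELECT url, inbound FROM pages ORDER BY inbound DESC"
):
    print(url, inbound)
```

This is exactly the over-simplification the next reply pushes back on: it treats every inbound link as equally valuable.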
Re: Source code for Page rank algorithm
Posted: Wed Nov 18, 2009 2:20 am
by onion2k
Weiry wrote:If they were stored in a database, you could simply select all your entries and order by your inbound field descending, which would give you a list of entries ordered by highest inbound first (which I would treat as a good ranking).
So you think spammers' link farms should be ranked top?
You need to consider the quality of the inbound links. That's the hard bit.
Re: Source code for Page rank algorithm
Posted: Wed Nov 18, 2009 2:42 am
by Weiry
onion2k wrote:You need to consider the quality of the inbound links. That's the hard bit.
Well, I suppose in that case you could also record the IP addresses associated with each inbound/outbound link and only select distinct results. Yes, that may still be prone to spammers, but it should eliminate the majority.
But other than recording IPs, would there actually be a way to stop the spammers, assuming they were to use dynamic IPs?
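The distinct-by-IP idea above could be sketched like this (the hit data and the page/IP values are invented for illustration):

```python
# Count each linking source IP at most once per page, so a spammer
# hammering a page from one address only contributes a single point.
inbound_hits = [
    ("/home", "1.2.3.4"),
    ("/home", "1.2.3.4"),   # repeat link from the same source IP
    ("/home", "5.6.7.8"),
    ("/about", "1.2.3.4"),
]

distinct_inbound = {}
for page, ip in inbound_hits:
    distinct_inbound.setdefault(page, set()).add(ip)

counts = {page: len(ips) for page, ips in distinct_inbound.items()}
print(counts)  # /home counted twice, not three times
```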
Re: Source code for Page rank algorithm
Posted: Wed Nov 18, 2009 3:56 am
by iankent
Weiry wrote:
Well, I suppose in that case you could also record the IP addresses associated with each inbound/outbound link and only select distinct results. Yes, that may still be prone to spammers, but it should eliminate the majority.
There's a reason Google was so successful:

largely down to their rather accurate page ranking system, something that other search engines failed miserably at. There are a lot of variables you need to take into account, for example page size, link-to-text ratio, inbound-to-outbound ratio, metadata-to-content comparisons, etc.
Basically, anything that can be checked should be checked. And ratings shouldn't be stored directly in the database; rather, 'points' should be awarded, and different points would have different weights. This allows for adjusting the algorithm without dealing with massive database updates.
Unfortunately you won't be able to do most of that using MySQL, and you would probably struggle to do it in PHP; something like C would be a more suitable language.
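The "points with weights" idea above might look something like this sketch, where raw signal points are stored per page and the weights live outside the data, so re-tuning doesn't mean re-scoring every row (the signal names, weights, and page data are all invented):

```python
# Weights are kept separate from the stored points, so adjusting the
# algorithm is just an edit here, not a massive database update.
WEIGHTS = {"inbound": 3.0, "text_ratio": 1.5, "metadata_match": 0.5}

# Invented per-page raw points, one entry per checked signal.
pages = {
    "/home":  {"inbound": 40, "text_ratio": 8, "metadata_match": 6},
    "/about": {"inbound": 5,  "text_ratio": 9, "metadata_match": 9},
}

def score(points):
    """Weighted sum of a page's raw points."""
    return sum(WEIGHTS[signal] * value for signal, value in points.items())

ranking = sorted(pages, key=lambda p: score(pages[p]), reverse=True)
print(ranking)
```

Changing a weight in `WEIGHTS` reorders the ranking without touching the stored points at all, which is the whole appeal of the scheme.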
Re: Source code for Page rank algorithm
Posted: Wed Nov 18, 2009 6:06 am
by onion2k
Weiry wrote:onion2k wrote:You need to consider the quality of the inbound links. That's the hard bit.
Well, I suppose in that case you could also record the IP addresses associated with each inbound/outbound link and only select distinct results. Yes, that may still be prone to spammers, but it should eliminate the majority.
So you think proper sites on shared hosts should be marked down?
Ranking of unorganised linked data (like websites) is really, really hard. Stop trying to over-simplify the problem. Any "obvious" solution will be so badly flawed it'll be unusable. Google have had some very, very clever people working on the problem for a decade and they still haven't really solved it.
Re: Source code for Page rank algorithm
Posted: Wed Nov 18, 2009 7:39 am
by Weiry
onion2k wrote:Stop trying to over-simplify the problem.
I'm simply trying to provide a direction or give some suggestions; no one else seems to be saying anything other than "it's really hard".
All I am trying to do is offer some sort of direction, over-simplified as my suggestions may be. I was grateful when iankent replied, because that gave a better understanding of what is really required behind the scenes, which I think is more important than simply saying "it's really hard".
Re: Source code for Page rank algorithm
Posted: Wed Nov 18, 2009 7:42 am
by iankent
Have a look here for some more detailed info:
http://infolab.stanford.edu/~backrub/google.html
and importantly, note who the authors are
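For reference, the recursive PageRank definition in that paper can be approximated with a simple power iteration. This is only a sketch: the four-page graph is invented, and d = 0.85 is the damping factor the paper suggests.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively apply PR(p) = (1-d)/n + d * sum(PR(q)/outdegree(q))
    over every page q that links to p."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {}
        for p in pages:
            # Rank flowing into p: each linking page's rank is split
            # evenly over that page's outbound links.
            incoming = sum(
                rank[q] / len(links[q]) for q in pages if p in links[q]
            )
            new[p] = (1 - d) / n + d * incoming
        rank = new
    return rank

# Invented graph: page -> set of pages it links out to. Every page here
# has at least one outbound link; a real crawl needs dangling-node
# handling to avoid dividing by zero.
graph = {
    "A": {"B", "C"},
    "B": {"C"},
    "C": {"A"},
    "D": {"C"},
}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))  # "C", with the most inbound rank, comes out on top
```

Note how this answers the "quality of inbound links" point earlier in the thread: a link is worth more when it comes from a page that is itself highly ranked, which is exactly what the recursion captures.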

Re: Source code for Page rank algorithm
Posted: Wed Nov 18, 2009 7:49 am
by onion2k
Also note that that was around 1997, before spammers and "SEO experts" started to work out ways to cheat.
Re: Source code for Page rank algorithm
Posted: Wed Nov 18, 2009 7:55 am
by iankent
onion2k wrote:
Also note that that was around 1997, before spammers and "SEO experts" started to work out ways to cheat.
Well, yes, Google's algorithms have clearly evolved since that was written 12 years ago, but the principles are the same and it does give a good indication of how pages can be sorted. Bear in mind, of course, that even Google's indexes aren't free from spam, and large numbers of spam sites do appear as top results in some search queries. But that's an ongoing challenge that is almost certainly never going to be solved (much like piracy: the bad guys are always at least one step ahead, and have been since the mid '70s).