Calculating percent match

Posted: Fri Apr 06, 2007 3:08 pm
by GeXus
Okay, I have a theoretical question which I can't seem to figure out... maybe a math wiz might be able to help :)

Let's pretend you asked 100 people 100 questions. The answer to each question is either 'A lot' (value: 2), 'A Little' (value: 1), or 'Don't Care' (value: 0). Based on those values, you would then want to show which people have similar personalities, expressed as a percent match (assuming the questions are geared towards personality)... Any idea how I would do that?

Posted: Fri Apr 06, 2007 3:40 pm
by Oren
Well, I think you need to use OOP (it can be done without OOP, but I'll explain the OOP way).
So what you need, I think, is a "person" object, with a new instance created for each of the 100 people. You pass each person's answers to this object, and it builds a profile from them. The "person" object itself is really just a holder for the person's ID, a list of characteristics (kind, neat, smart), and possibly any other data you want to keep; that's up to you. You will also need a second object that guides the first: the second object, together with the answers you supplied, is used to initialize the characteristics in the "person" object.
Finally, you will need another object that takes all the "person" objects and tells you how they are related, based on rules that you decide.
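
A minimal sketch of that three-object layout might look like the following. The class names (Person, ProfileBuilder, Matcher), the question-to-trait map, and the comparison rule (sum of absolute trait differences) are all assumptions made for illustration; none of them come from the thread.

```php
<?php
// Hypothetical sketch of the three-object design; all names are illustrative.

// Holds one person's ID, raw answers, and the traits derived from them.
class Person
{
    public $id;
    public $answers;          // question ID => answer value (0, 1 or 2)
    public $traits = array(); // trait name => accumulated score

    public function __construct($id, array $answers)
    {
        $this->id = $id;
        $this->answers = $answers;
    }
}

// The "guide" object: knows which question feeds which trait.
class ProfileBuilder
{
    private $questionTraits; // question ID => trait name

    public function __construct(array $questionTraits)
    {
        $this->questionTraits = $questionTraits;
    }

    // Fills in $person->traits by adding each answer to its mapped trait.
    public function build(Person $person)
    {
        $person->traits = array();
        foreach ($person->answers as $question => $answer) {
            if (!isset($this->questionTraits[$question])) {
                continue; // question is not mapped to any trait
            }
            $trait = $this->questionTraits[$question];
            if (!isset($person->traits[$trait])) {
                $person->traits[$trait] = 0;
            }
            $person->traits[$trait] += $answer;
        }
    }
}

// Compares two profiles. The rule used here is an assumption, not a given.
class Matcher
{
    public function distance(Person $a, Person $b)
    {
        $traits = array_unique(array_merge(array_keys($a->traits), array_keys($b->traits)));
        $total = 0;
        foreach ($traits as $trait) {
            $scoreA = isset($a->traits[$trait]) ? $a->traits[$trait] : 0;
            $scoreB = isset($b->traits[$trait]) ? $b->traits[$trait] : 0;
            $total += abs($scoreA - $scoreB);
        }
        return $total;
    }
}

$builder = new ProfileBuilder(array(1 => 'kind', 2 => 'kind', 3 => 'neat'));
$alice = new Person(1, array(1 => 2, 2 => 1, 3 => 0));
$bob   = new Person(2, array(1 => 1, 2 => 1, 3 => 2));
$builder->build($alice);
$builder->build($bob);
$matcher = new Matcher();
echo $matcher->distance($alice, $bob), "\n";
```

A smaller distance means more similar profiles; how that distance becomes a percentage is left open here, since it depends on the rules you pick.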

I know it's not a formula but just an abstract idea, but this is the power of OOP and programming - you solve a difficult problem by breaking it into smaller sub-problems.

I hope this helps somehow and gives you a general direction :P

P.S. Very interesting topic by the way :wink:

Posted: Fri Apr 06, 2007 3:43 pm
by RobertGonzalez
What is the rest of the criteria? I think I have an idea of what you are doing, but there has to be more.

Posted: Fri Apr 06, 2007 3:47 pm
by Oren
Everah wrote: What is the rest of the criteria? I think I have an idea of what you are doing, but there has to be more.
Yeah I agree, that's what I thought just after I had read the original post.

Posted: Fri Apr 06, 2007 4:25 pm
by GeXus
That's pretty much it, I think... There could be 1000 people, where one answers 5 questions, another answers 20, etc. So in the end you could say persons A, B, and C are close to person X. The weight of each question is equal, but a match on a higher-valued answer would count for more on that question. Each question would have a question ID, so if 5 people said 'A Little' (value: 1) to question ID 555, those people would be a match; then, as more questions are answered, they may not be a match anymore, based on the overall percent... Hope that helps.

My question is not so much how this would be done programmatically but, in theory, how you would set it up: what math would you use?
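
One possible answer to the "what math" part, sketched under stated assumptions: compare two people only on the questions both have answered, add up the absolute differences, and divide by the largest difference possible (2 per shared question). The function name percentMatch and the sample data are made up, and this sketch does not give extra weight to matches on higher-valued answers.

```php
<?php
// Percent match between two people, using only questions both have answered.
// Answers are keyed by question ID with values 0, 1 or 2.
function percentMatch(array $a, array $b)
{
    $shared = array_intersect_key($a, $b);
    if (count($shared) === 0) {
        return null; // nothing in common to compare
    }
    $distance = 0;
    foreach ($shared as $question => $answer) {
        $distance += abs($answer - $b[$question]);
    }
    // Worst case: every shared answer differs by the full range of 2.
    $maxDistance = 2 * count($shared);
    return 100 * (1 - $distance / $maxDistance);
}

$x = array(555 => 1, 556 => 2, 557 => 0);
$y = array(555 => 1, 556 => 0);
echo percentMatch($x, $y), "\n";
```

Here the two people share questions 555 and 556, the total difference is 2 out of a possible 4, and the result is a 50 percent match. Because only shared questions count, the percentage can move up or down as more questions are answered.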

Posted: Fri Apr 06, 2007 4:43 pm
by Christopher
I think you need to define a weighting for each personality for each question. So say you had three personalities: Maddog, Sheepish and Foxy. For question #1 you might say that Maddog has a +1 weighting, Sheepish has a -1 weighting, and Foxy has a 0 weighting. If the user answers 2 for question #1 then the scores are Maddog=2, Sheepish=-2, Foxy=0. Just keep adding up the scores.

You can of course change the algorithm and values to make the weightings work specifically how you need them to work.
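
As a rough sketch of that running total: the weights for question 1 below are the ones from the example above, while the weights for question 2, the function name scorePersonalities, and the array layout are assumptions added for illustration.

```php
<?php
// Weight table: question ID => personality => weight.
// Question 1 follows the example in the post; question 2 is made up.
$weights = array(
    1 => array('Maddog' => 1, 'Sheepish' => -1, 'Foxy' => 0),
    2 => array('Maddog' => 0, 'Sheepish' => 1, 'Foxy' => 1),
);

// Adds answer * weight to each personality's running score.
function scorePersonalities(array $answers, array $weights)
{
    $scores = array();
    foreach ($answers as $question => $answer) {
        if (!isset($weights[$question])) {
            continue; // no weights defined for this question
        }
        foreach ($weights[$question] as $personality => $weight) {
            if (!isset($scores[$personality])) {
                $scores[$personality] = 0;
            }
            $scores[$personality] += $answer * $weight;
        }
    }
    return $scores;
}

// A user who answered 2 to question 1 and 1 to question 2.
$scores = scorePersonalities(array(1 => 2, 2 => 1), $weights);
print_r($scores);
```

With these weights the user ends up at Maddog=2, Sheepish=-1, Foxy=1, so Maddog would be the closest personality.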

Posted: Fri Apr 06, 2007 5:02 pm
by bubblenut
There are lots of different similarity algorithms out there and it's an incredibly interesting area. I have gone with a very simple distance approach but, as Everah mentioned, without more information on your criteria I can't know if this is what you're looking for.

I've gone with a DB example to make things a little simpler. The first function simply pulls the info we need from the database, either for a particular user or for everyone. The second function finds close matches. The idea behind it is to determine the score distance of every other user from the user we're interested in, then take the n closest of these and calculate a percentage for each.

Code:

<?php

function extractAnswers( $user=null )
{
    global $con;
    $sql = "SELECT user, question, answer FROM user_answers";
    if( $user ) {
        $sql .= " WHERE user=" . (int)$user;
    }
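    // The mysql_* functions were removed in PHP 7; the mysqli equivalents are
    // used here, so $con must be a mysqli connection.
    $res = mysqli_query( $con, $sql );
    $return = array();
    while( $row = mysqli_fetch_assoc( $res ) ) {
        // Cast to int so the distance arithmetic works on numbers, not strings.
        $return[ $row['user'] ][ $row['question'] ] = (int)$row['answer'];
    }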
    if( $user ) {
        if( !isset( $return[ $user ] ) ) {
            return false;
        }
        return $return[ $user ];
    } else {
        return $return;
    }
}

function findMatchesFor( $user, $number_of_matches=5 )
{
    if(!$user_answers = extractAnswers( $user )) {
        return false;
    }
    if( count( $all_answers = extractAnswers() ) < 2 ) {
        return false;
    }
    $user_distances = array();
    foreach( $all_answers as $other_user => $other_user_answers ) {
        if( $other_user == $user ) continue;
        $user_distances[ $other_user ] = 0;
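        foreach( $user_answers as $question => $answer ) {
            // A question the other user has not answered counts as 0 ("Don't Care"),
            // which is what the bare array lookup did implicitly (with a notice).
            $other_answer = isset( $other_user_answers[ $question ] ) ? $other_user_answers[ $question ] : 0;
            $user_distances[ $other_user ] += abs( $answer - $other_answer );
        }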
    }

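    // asort() keeps the user IDs as keys (sort() would discard them), and the
    // fourth argument to array_slice() stops numeric keys from being renumbered.
    asort( $user_distances );
    $user_distances = array_slice( $user_distances, 0, $number_of_matches, true );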

    $max_score = count( $user_answers ) * 2;
    $user_percentages = array();
    // Use a separate loop variable so the $user parameter is not overwritten.
    foreach( $user_distances as $other_user => $distance ) {
        $user_percentages[ $other_user ] = ( ( $max_score - $distance ) / $max_score ) * 100;
    }
    return $user_percentages;
}
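
To try the same distance idea without a database, here is a small self-contained sketch. The array stands in for the user_answers table, the data is invented, and the function name rankMatches is hypothetical; unanswered questions are treated as 0, which is an assumption rather than a requirement.

```php
<?php
// In-memory stand-in for the user_answers table:
// user ID => question ID => answer. The data is made up.
$all_answers = array(
    10 => array(1 => 2, 2 => 1, 3 => 0),
    11 => array(1 => 2, 2 => 0, 3 => 2),
    12 => array(1 => 2, 2 => 1, 3 => 1),
    13 => array(1 => 0, 2 => 2, 3 => 2),
);

// Returns the closest matches for $user as user ID => percent match.
function rankMatches(array $all_answers, $user, $number_of_matches = 5)
{
    if (empty($all_answers[$user])) {
        return false; // unknown user, or no answers to compare against
    }
    $user_answers = $all_answers[$user];
    $max_score = count($user_answers) * 2; // largest possible total distance
    $percentages = array();
    foreach ($all_answers as $other_user => $other_answers) {
        if ($other_user == $user) {
            continue;
        }
        $distance = 0;
        foreach ($user_answers as $question => $answer) {
            // Assumption: an unanswered question counts as 0.
            $other = isset($other_answers[$question]) ? $other_answers[$question] : 0;
            $distance += abs($answer - $other);
        }
        $percentages[$other_user] = ($max_score - $distance) / $max_score * 100;
    }
    arsort($percentages); // highest percentage first, user IDs kept as keys
    return array_slice($percentages, 0, $number_of_matches, true);
}

$top = rankMatches($all_answers, 10, 2);
print_r($top);
```

For user 10, user 12 is at distance 1 out of 6 and user 11 at distance 3 out of 6, so the top two matches come back as user 12 (about 83 percent) followed by user 11 (50 percent).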

Posted: Fri Apr 06, 2007 8:05 pm
by GeXus
Thanks for all the input... I looked up some of the similarity algorithms, and let me just say that they are WAAAY beyond me. But this is what I've come down to.

Each question will be placed inside a category.

A query will get the sum of all values to questions for a user from a particular category.

From that sum I will get the standard deviation.

We now have a range in which we could consider others in this same range compatible for that particular category.

I'll then repeat this process across all categories and combine the results into one percent. Then, based on the overall deviation, people who fall within that range would be considered compatible.
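
Here is one way that plan could be read, as a sketch only. It assumes the standard deviation is taken over every user's sum within a category, that two people are compatible in a category when their sums differ by no more than that deviation, and that the final percent is the share of categories in which they are compatible. The function names and the sample sums are made up.

```php
<?php
// Population standard deviation of a list of numbers.
function standardDeviation(array $values)
{
    $n = count($values);
    if ($n === 0) {
        return 0.0;
    }
    $mean = array_sum($values) / $n;
    $sumSquares = 0.0;
    foreach ($values as $v) {
        $sumSquares += ($v - $mean) * ($v - $mean);
    }
    return sqrt($sumSquares / $n);
}

// $sums: category => user ID => sum of that user's answers in the category.
// Returns the percentage of categories in which $other's sum lies within
// one standard deviation of $user's sum, or null if no category is shared.
function categoryCompatibility(array $sums, $user, $other)
{
    $compatible = 0;
    $compared = 0;
    foreach ($sums as $category => $byUser) {
        if (!isset($byUser[$user], $byUser[$other])) {
            continue; // one of them has no answers in this category
        }
        $compared++;
        $range = standardDeviation(array_values($byUser));
        if (abs($byUser[$user] - $byUser[$other]) <= $range) {
            $compatible++;
        }
    }
    return $compared === 0 ? null : 100 * $compatible / $compared;
}

// Made-up category sums for three users.
$sums = array(
    'social' => array(1 => 8, 2 => 7, 3 => 2),
    'work'   => array(1 => 3, 2 => 9, 3 => 4),
);
echo categoryCompatibility($sums, 1, 2), "\n";
```

In this sample, users 1 and 2 are close on 'social' but far apart on 'work', so they come out as compatible in one of two categories.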

What do you think? I'm still not 100% on it...