Determining shortest unique substring

Questions about the MySQL, PostgreSQL, and most other databases, as well as using it with PHP can be asked here.

Moderator: General Moderators

User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Determining shortest unique substring

Post by Ambush Commander »

Let's suppose I have a column of sha1 hashes, each 40 characters long (ideally, I'd be storing them as binary, but lets ignore that for now).

Code: Select all

86f7e437faa5a7fce15d1ddcb9eaeaea377667b8
e9d71f5ee7c92d6dc9e92ffdad17b8bd49418f98
84a516841ba77a5b4648de2cd0dfcb30ea46dbb4
...
While the hash is most secure when I display it, 40 characters is a little unwieldy, so I'd like to shorten it into a unique identifier, say, 12 characters long. For the sample set I gave above:

Code: Select all

86f7e437faa5a
e9d71f5ee7c9
84a516841ba7
...
This results in no collisions. However, for a large amount of possible values, the probability of two of these twelve-character strings colliding increases.

My question is, given a column of hashes (with an appropriate UNIQUE KEY added to them), for any given hash how do I find the shortest unique substring starting from the beginning of the hash? This appears to be a trivial extension to the binary search algorithm, but I can't figure out how one would make it work in SQL, except string matching (which I suspect wouldn't be as fast as it should be).
User avatar
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Re: Determining shortest unique substring

Post by Christopher »

Brute force:

Code: Select all

$n = 32;
$sql = "SELECT DISTINCT LEFT(myhash, $n) FROM mytable";
$result = $db->execute($sql);
$max row = $result->numberOfRows();
while ($n > 0) {
     --$n;
     $sql = "SELECT DISTINCT LEFT(myhash, $n) FROM mytable";
     $result = $db->execute($sql);
     if ($result->numberOfRows() < $max rows) {
          break;
     }
}
echo "Duplicate rows at $n characters.<br/>";
 
and count to see
(#10850)
User avatar
Kieran Huggins
DevNet Master
Posts: 3635
Joined: Wed Dec 06, 2006 4:14 pm
Location: Toronto, Canada
Contact:

Re: Determining shortest unique substring

Post by Kieran Huggins »

ugh - smells a little to me... why not use an auto inc. id?
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Re: Determining shortest unique substring

Post by Ambush Commander »

Brute force
Heh, I was afraid of that. That'll mean a lot of queries, and a full table scan each time (unless we build indices for LEFT.)
ugh - smells a little to me... why not use an auto inc. id?
Actually, the system already does! I should explain this is with relation to Mercurial; they use a revision ID and a SHA1 hash to identify all revisions. However, the revision ID is repository specific and not portable; thus SHA1.
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Determining shortest unique substring

Post by VladSun »

Ambush Commander wrote:This appears to be a trivial extension to the binary search algorithm.
I was able to write this (and it doesn't always work as expected :) ):

Code: Select all

function find_func($string)
{
//  return ([mysql_found($string)]);
    return (strpos('12346', $string) === 0);
}
 
function search_min_substr($hash, $find_func)
{
    for ($l = 0, $r = strlen($hash) - 1; $l <= $r;) 
    {
        $c = (int)(($l + $r) / 2);
 
        if (!$find_func(substr($hash, 0, $c)))
            $r = $c - 1;
        else
            $l = $c + 1;
    }
    return $find_func(substr($hash, 0, max($l, 1))) ? false : substr($hash, 0, max($l, 1));
}
 
$shash = '1234567';
 
echo substr($shash, 0, search_min_substr($shash, 'find_func'));
 
It uses a test find_func() - you need an SQL query like I have commented out.
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Re: Determining shortest unique substring

Post by Ambush Commander »

Your algorithm would work, but there are two troubles:

- Even with O(lg n) performance, the worst case on 40 character hash will require about 6 queries
- Finding an efficient implementation of mysql_found. Remember, we're matching substrings, so we can't simply index the column.

I was thinking of something on the DB side...
User avatar
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Re: Determining shortest unique substring

Post by Christopher »

Ambush Commander wrote:
Brute force
Heh, I was afraid of that. That'll mean a lot of queries, and a full table scan each time (unless we build indices for LEFT.)
Technically you only need to run it when you insert a new row. And you could reduce the number of queries by starting at the current value and counting up after each insert. It would probably only require 1-2 queries ever.
(#10850)
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Re: Determining shortest unique substring

Post by Ambush Commander »

I must be missing something: even though the queries are short, wouldn't the queries themselves still need full table scans? And to throw another monkey-wrench in the system, lets say that we want different hashes to be able to have different lengths. For instance:

Code: Select all

 
aaa
bbb
ccc
cdd
 
have unique representations as:

Code: Select all

 
a
b
cc
cd
 
I know I'm being slightly crazy. But it should be so simple...
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Determining shortest unique substring

Post by VladSun »

Ambush Commander wrote:Your algorithm would work, but there are two troubles:

- Even with O(lg n) performance, the worst case on 40 character hash will require about 6 queries

I don't think 6 queries it's too much in your case :)
Ambush Commander wrote:Remember, we're matching substrings, so we can't simply index the column.
No, we are matching the whole strings in the DB - so, we can index the column.

I said, that my solution would not work always as expected. Let me explain myself. The closest test function to the mysql_found() function is:

Code: Select all

function find_func($string)
{
       $db_string = '1234567';
       return ($db_string == $string);
}
 
Suppose we have $shash='1234567', we have these results:

Code: Select all

 
$db_string = '1';       -> '12'      OK
$db_string = '12';     -> '1'        OK
$db_string = '123';    -> '1234'  ! OK (not the shortest, but still unique substring)
$db_string = '1234';  -> '1'        OK
$db_string = '12345';  -> '1'      OK
$db_string = '123456';  -> '1'    OK
$db_string = '1234567';  -> '1'   OK
 
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Determining shortest unique substring

Post by VladSun »

OK, I've tried it with MySQL :)

Code: Select all

$qc = 0;
 
mysql_query('delete from hashes');
 
function find_func($string)
{
    $query = 'select count(*) from hashes where hash="'.$string.'"';
    $result = mysql_query($query);
    $row = mysql_fetch_array($result);
    return($row[0] > 0);
}
 
function search_min_substr($hash, $find_func)
{
    global $qc;
    
    for ($l = 0, $r = strlen($hash) - 1; $l <= $r;)
    {
        $c = (int)(($l + $r) / 2);
 
        if (!$find_func(substr($hash, 0, $c)))
            $r = $c - 1;
        else
            $l = $c + 1;
        $qc++;
    }
    $qc++;
    return $find_func(substr($hash, 0, max($l, 1))) ? false : substr($hash, 0, max($l, 1));
}
 
$cc = 0;
 
for ($i=0; $i<1000; $i++)
{
    $shash = md5(rand(0, 10000000));
    if (($result = search_min_substr($shash, 'find_func')) !== false)
        mysql_query('insert into hashes values("'.$result.'")');
    else 
        $cc++;
    
}
 
echo "<hr />Queries count: $qc, Collisions count: $cc";
 
Results:
$shash = md5(rand(0, 10000000)); -> Queries count: 5994, Collisions count: 0
$shash = md5(rand(0, 100)); -> Queries count: 6000, Collisions count: 0
$shash = md5(rand(0, 10)) ; -> Queries count: 6674, Collisions count: 652

PS: [sql]SELECT * FROM hashes ORDER BY length(`hash`), `hash`;[/sql]
Beautiful :)
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Determining shortest unique substring

Post by VladSun »

@Ambush Commander - So, have I helped you ?

PS: If you wnat to do this at DB layer, you could use a stored procedure :D :D :D
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Re: Determining shortest unique substring

Post by Ambush Commander »

I don't know. I have to test the function, but from a cursory glance it doesn't look like find_func() works properly? (Not really sure...)
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Determining shortest unique substring

Post by VladSun »

Well, my post is a copy-paste real test ...
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Re: Determining shortest unique substring

Post by Ambush Commander »

Ok, tried it out, I see what the function is doing. It isn't however, exactly what I'm looking for. This is because even though we're looking for a unique substring, the original string is still important. In this implementation, you discard all data after you find that there is no collision. So, for example, the data is generated like this:

Code: Select all

aaa -> a
baa -> b
aab -> aa
aba -> ab
aa4 -> aa4
What I'm looking for is this:

Code: Select all

aaa -> aaa
baa -> b
aab -> aab
aba -> ab
aa4 -> aa4
because an substring like 'a' is ambiguous: it could match aaa, aab, or aa4, depending on which one was inserted first. What this also means is that each hash needs to be re-compared against the database after a new hash is inserted to ensure that there are no collisions.
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Re: Determining shortest unique substring

Post by VladSun »

Ambush Commander wrote:because an substring like 'a' is ambiguous: it could match aaa, aab, or aa4, depending on which one was inserted first.
I don't follow you ... I made it work as your 4-th post requires to ... Your last post and your 4-th post requirements (and I think it should be "ccc" -> "c", not-> "cc") don't match ...
There are 10 types of people in this world, those who understand binary and those who don't
Post Reply