Page 1 of 2
Determining shortest unique substring
Posted: Mon Jan 28, 2008 9:13 pm
by Ambush Commander
Let's suppose I have a column of sha1 hashes, each 40 characters long (ideally, I'd be storing them as binary, but lets ignore that for now).
Code: Select all
86f7e437faa5a7fce15d1ddcb9eaeaea377667b8
e9d71f5ee7c92d6dc9e92ffdad17b8bd49418f98
84a516841ba77a5b4648de2cd0dfcb30ea46dbb4
...
While the hash is most secure when I display it, 40 characters is a little unwieldy, so I'd like to shorten it into a unique identifier, say, 12 characters long. For the sample set I gave above:
Code: Select all
86f7e437faa5a
e9d71f5ee7c9
84a516841ba7
...
This results in no collisions. However, for a large amount of possible values, the probability of two of these twelve-character strings colliding increases.
My question is, given a column of hashes (with an appropriate UNIQUE KEY added to them), for any given hash how do I find the shortest unique substring starting from the beginning of the hash? This appears to be a trivial extension to the binary search algorithm, but I can't figure out how one would make it work in SQL, except string matching (which I suspect wouldn't be as fast as it should be).
Re: Determining shortest unique substring
Posted: Mon Jan 28, 2008 10:26 pm
by Christopher
Brute force:
Code: Select all
$n = 32;
$sql = "SELECT DISTINCT LEFT(myhash, $n) FROM mytable";
$result = $db->execute($sql);
$max row = $result->numberOfRows();
while ($n > 0) {
--$n;
$sql = "SELECT DISTINCT LEFT(myhash, $n) FROM mytable";
$result = $db->execute($sql);
if ($result->numberOfRows() < $max rows) {
break;
}
}
echo "Duplicate rows at $n characters.<br/>";
and count to see
Re: Determining shortest unique substring
Posted: Mon Jan 28, 2008 10:27 pm
by Kieran Huggins
ugh - smells a little to me... why not use an auto inc. id?
Re: Determining shortest unique substring
Posted: Tue Jan 29, 2008 3:39 pm
by Ambush Commander
Brute force
Heh, I was afraid of that. That'll mean a lot of queries, and a full table scan each time (unless we build indices for LEFT.)
ugh - smells a little to me... why not use an auto inc. id?
Actually, the system already does! I should explain this is with relation to Mercurial; they use a revision ID and a SHA1 hash to identify all revisions. However, the revision ID is repository specific and not portable; thus SHA1.
Re: Determining shortest unique substring
Posted: Tue Jan 29, 2008 5:30 pm
by VladSun
Ambush Commander wrote:This appears to be a trivial extension to the binary search algorithm.
I was able to write this (and it doesn't always work as expected

):
Code: Select all
function find_func($string)
{
// return ([mysql_found($string)]);
return (strpos('12346', $string) === 0);
}
function search_min_substr($hash, $find_func)
{
for ($l = 0, $r = strlen($hash) - 1; $l <= $r;)
{
$c = (int)(($l + $r) / 2);
if (!$find_func(substr($hash, 0, $c)))
$r = $c - 1;
else
$l = $c + 1;
}
return $find_func(substr($hash, 0, max($l, 1))) ? false : substr($hash, 0, max($l, 1));
}
$shash = '1234567';
echo substr($shash, 0, search_min_substr($shash, 'find_func'));
It uses a test find_func() - you need an SQL query like I have commented out.
Re: Determining shortest unique substring
Posted: Tue Jan 29, 2008 8:21 pm
by Ambush Commander
Your algorithm would work, but there are two troubles:
- Even with O(lg n) performance, the worst case on 40 character hash will require about 6 queries
- Finding an efficient implementation of mysql_found. Remember, we're matching substrings, so we can't simply index the column.
I was thinking of something on the DB side...
Re: Determining shortest unique substring
Posted: Tue Jan 29, 2008 8:28 pm
by Christopher
Ambush Commander wrote:Brute force
Heh, I was afraid of that. That'll mean a lot of queries, and a full table scan each time (unless we build indices for LEFT.)
Technically you only need to run it when you insert a new row. And you could reduce the number of queries by starting at the current value and counting up after each insert. It would probably only require 1-2 queries ever.
Re: Determining shortest unique substring
Posted: Tue Jan 29, 2008 8:41 pm
by Ambush Commander
I must be missing something: even though the queries are short, wouldn't the queries themselves still need full table scans? And to throw another monkey-wrench in the system, lets say that we want different hashes to be able to have different lengths. For instance:
have unique representations as:
I know I'm being slightly crazy. But it should be so simple...
Re: Determining shortest unique substring
Posted: Wed Jan 30, 2008 1:48 am
by VladSun
Ambush Commander wrote:Your algorithm would work, but there are two troubles:
- Even with O(lg n) performance, the worst case on 40 character hash will require about 6 queries
I don't think 6 queries it's too much in your case
Ambush Commander wrote:Remember, we're matching substrings, so we can't simply index the column.
No, we are matching the whole strings in the DB - so, we can index the column.
I said, that my solution would not work always as expected. Let me explain myself. The closest test function to the mysql_found() function is:
Code: Select all
function find_func($string)
{
$db_string = '1234567';
return ($db_string == $string);
}
Suppose we have $shash='1234567', we have these results:
Code: Select all
$db_string = '1'; -> '12' OK
$db_string = '12'; -> '1' OK
$db_string = '123'; -> '1234' ! OK (not the shortest, but still unique substring)
$db_string = '1234'; -> '1' OK
$db_string = '12345'; -> '1' OK
$db_string = '123456'; -> '1' OK
$db_string = '1234567'; -> '1' OK
Re: Determining shortest unique substring
Posted: Wed Jan 30, 2008 2:15 am
by VladSun
OK, I've tried it with MySQL
Code: Select all
$qc = 0;
mysql_query('delete from hashes');
function find_func($string)
{
$query = 'select count(*) from hashes where hash="'.$string.'"';
$result = mysql_query($query);
$row = mysql_fetch_array($result);
return($row[0] > 0);
}
function search_min_substr($hash, $find_func)
{
global $qc;
for ($l = 0, $r = strlen($hash) - 1; $l <= $r;)
{
$c = (int)(($l + $r) / 2);
if (!$find_func(substr($hash, 0, $c)))
$r = $c - 1;
else
$l = $c + 1;
$qc++;
}
$qc++;
return $find_func(substr($hash, 0, max($l, 1))) ? false : substr($hash, 0, max($l, 1));
}
$cc = 0;
for ($i=0; $i<1000; $i++)
{
$shash = md5(rand(0, 10000000));
if (($result = search_min_substr($shash, 'find_func')) !== false)
mysql_query('insert into hashes values("'.$result.'")');
else
$cc++;
}
echo "<hr />Queries count: $qc, Collisions count: $cc";
Results:
$shash = md5(rand(0, 10000000)); -> Queries count: 5994, Collisions count: 0
$shash = md5(rand(0, 100)); -> Queries count: 6000, Collisions count: 0
$shash = md5(rand(0, 10)) ; -> Queries count: 6674, Collisions count: 652
PS: [sql]SELECT * FROM hashes ORDER BY length(`hash`), `hash`;[/sql]
Beautiful

Re: Determining shortest unique substring
Posted: Thu Jan 31, 2008 5:34 am
by VladSun
@Ambush Commander - So, have I helped you ?
PS: If you wnat to do this at DB layer, you could use a stored procedure

Re: Determining shortest unique substring
Posted: Thu Jan 31, 2008 12:23 pm
by Ambush Commander
I don't know. I have to test the function, but from a cursory glance it doesn't look like find_func() works properly? (Not really sure...)
Re: Determining shortest unique substring
Posted: Thu Jan 31, 2008 3:48 pm
by VladSun
Well, my post is a copy-paste real test ...
Re: Determining shortest unique substring
Posted: Thu Jan 31, 2008 4:09 pm
by Ambush Commander
Ok, tried it out, I see what the function is doing. It isn't however, exactly what I'm looking for. This is because even though we're looking for a unique substring, the original string is still important. In this implementation, you discard all data after you find that there is no collision. So, for example, the data is generated like this:
Code: Select all
aaa -> a
baa -> b
aab -> aa
aba -> ab
aa4 -> aa4
What I'm looking for is this:
Code: Select all
aaa -> aaa
baa -> b
aab -> aab
aba -> ab
aa4 -> aa4
because an substring like 'a' is ambiguous: it could match aaa, aab, or aa4, depending on which one was inserted first. What this also means is that each hash needs to be re-compared against the database after a new hash is inserted to ensure that there are no collisions.
Re: Determining shortest unique substring
Posted: Thu Jan 31, 2008 4:30 pm
by VladSun
Ambush Commander wrote:because an substring like 'a' is ambiguous: it could match aaa, aab, or aa4, depending on which one was inserted first.
I don't follow you ... I made it work as your 4-th post requires to ... Your last post and your 4-th post requirements (and I think it should be "ccc" -> "c", not-> "cc") don't match ...