Page 1 of 1

Help with multi-value join

Posted: Sun Sep 09, 2007 3:16 pm
by Ambush Commander
Alright, I have two tables: `threads` and `threadtags`. The format of threads is thread ID, and the related thread info, the format of `threadtags` is thread ID and tag name.

What I would like to do is construct an efficient query that brings back threads with a certain set of tags. For example, if I were to query "question, answered", the database would return threads that had both tags (and possibly more), but not those with just question or just answered. There are also constraints on threads that are being returned (such as forum ID, and limit).

I think I need a join to do this, but I'm not exactly sure how. It's easy enough doing a subquery for one tag:

Code: Select all

SELECT * FROM `threads` WHERE `forum` = $forum_id AND EXISTS (SELECT * FROM `threadtags` WHERE `threadtags`.`thread` = `threads`.`id` AND `tag` = $tag) LIMIT 10
But I suspect this is very inefficient. Joins went completely over my head, sorry.

Posted: Sun Sep 09, 2007 3:55 pm
by josa
This is the same query but with a join instead:

Code: Select all

SELECT
	*
FROM
	`threads`
INNER JOIN
	`threadtags`
ON
	`threads`.`id` = `threadtags`.`thread`
WHERE
	`forum` = $forum_id
AND
	`threadtags`.`tag` = $tag
LIMIT 10
/josa

Posted: Sun Sep 09, 2007 4:34 pm
by Ambush Commander
But because threads can have multiple tags, how would the DB handle multiple possible rows that it can join with?

Posted: Sun Sep 09, 2007 6:00 pm
by Christopher
When you join you will get all sets of records that match the "ON threads.id=threadtags.thread" condition. If there are multiple threadtags records for each thread record you will get the thread record data duplicated in each joined record returned.

Posted: Sun Sep 09, 2007 10:25 pm
by Ambush Commander
Makes enough sense, although that doesn't help at all for my situation.

I am not really interested in having the tag data present in the result (although it would be a nice present). How would I ensure that records exist for tag1, tag2 and tag3? Consider a database with this data:

Code: Select all

thread_id, tag
1, question
1, answered
2, question
3, bug
And assume that thread_id is linked to another table with appropriate data. An inner join for 'question' would result in:

Code: Select all

1, question
2, question
But an inner join for 'question' OR 'answered' (using AND would give no results) would result in:

Code: Select all

1, question
1, answered
2, question
What kind of query would give me

Code: Select all

1, question
1, answered
Or, even more ideally, just one entry, truncating the rest of the tags?

Code: Select all

1, question

Posted: Mon Sep 10, 2007 1:27 am
by Christopher
If you join two tables on id=thread_id such as:

Code: Select all

id, name
1, foo
2, bar
3, baz

Code: Select all

thread_id, tag
1, question
1, answered
2, question
3, bug
You will get the following:

Code: Select all

id, name, thread_id, tag
1, foo, 1, question
1, foo, 1, answered
2, bar, 2, question
3, baz, 3, bug

Posted: Mon Sep 10, 2007 7:01 pm
by Ambush Commander
Yes, I've figured out that much. (I omitted it from the example code for brevity) The important tidbit is that I need MySQL to give me exactly what I need, that is, only rows that have all the tags I want. This is because I'm using a LIMIT clause, so if I grab a superset and then filter things out PHP-side, I'll end up with less items then I want.

If there's another way to do this (without pulling every entry or I sizable amount over), please, tell me.

Posted: Mon Sep 10, 2007 9:37 pm
by Christopher
I am not sure exactly what you mean my "exactly what you need." You could try different joins (e.g., LEFT JOIN) or limit to fewer records with LIMIT or WHERE, or perhaps DISTINCT would help eliminate duplicate records.

Posted: Mon Sep 10, 2007 9:52 pm
by Ambush Commander
To illustrate the problem:

Suppose I have a table that looks like this:

Code: Select all

id, name
1, one
2, two
3, three
4, four
And a tag table that looks like:

Code: Select all

thread_id, tag
1, question
1, answered
2, question
4, question
4, answered
I am searching for threads that are both a question, and answered.

Using a regular join, I get a result set like this:

Code: Select all

id, name, tag
1, one, question
1, one, answered
2, two, question
3, three, NULL
4, four, question
4, four, answered
By stipulating that tag = question OR tag = answered, I can winnow down the selection to

Code: Select all

id, name, tag
1, one, question
1, one, answered
2, two, question
4, four, question
4, four, answered
But the thread with the ID 2 is still hanging around, and it is not what one wants. If we use SORT DESC, UNIQUEROW and LIMIT 2 on this result set, we'll end up with rows 4 and 2, but this is not one wants: row 2 is only tagged question, not answered, and there's no way to get this data!

Any suggestions?

Posted: Tue Sep 11, 2007 1:40 pm
by josa
This will give you all threads that are both 'question' and 'answered'.

Code: Select all

SELECT
	`threads`.`id`
FROM
	`threads`
INNER JOIN
	`threadtags`
ON
	`threads`.`id` = `threadtags`.`thread`
WHERE
	(`threadtags`.`tag` = 'question'
OR
	`threadtags`.`tag` = 'answered')
GROUP BY
	`threads`.`id`
HAVING
	COUNT(`threads`.`id`) = 2
If you add `threadtags`.`tag` to the selected columns you get one tag. Except from ordering the rows you can't really control which tag this is going to be. [s]If you use this query you should consider indexing the tag column for efficiency.[/s]

/josa

Posted: Tue Sep 11, 2007 2:06 pm
by Ambush Commander
Ah, that's exactly what I need (If I'm understanding you correctly. Any efficiency / optimization notes? Does this approach scale for, say, five tags? Ten tags? (i.e. should I cap the number of tags).

Posted: Wed Sep 12, 2007 2:30 am
by josa
I edited out that last part about indexing because I felt I had to test it first. Now that I have I can say that a two-column primary key (thread, tag) for table threadtags will do the trick. Try running the query without the primary key and put EXPLAIN before the select statement and then do the same after adding the key. You will see that it's now using the key instead of doing a full table scan.

If you want to search for more tags than two, just add them to the where clause and update the number at the end to the number of tags you are searching for.

/josa