MySQL Intersect

Questions about the MySQL, PostgreSQL, and most other databases, as well as using it with PHP can be asked here.

Moderator: General Moderators

User avatar
inghamn
Forum Contributor
Posts: 174
Joined: Mon Apr 16, 2007 10:33 am
Location: Bloomington, IN, USA

MySQL Intersect

Post by inghamn »

I've been working on a Tagging system for my photo album. Since MySQL doesn't support INTERSECT, I'm kinda stumped how to get a set of images that all contain all of a given set of tags.

Code: Select all

 
CREATE TABLE images (
id int UNSIGNED PRIMARY KEY,
filename varchar(128)
);
 
CREATE TABLE tags (
id int UNSIGNED PRIMARY KEY,
name varchar(128)
);
 
CREATE TABLE image_tags (
image_id int UNSIGNED,
tag_id int UNSIGNED
);
 
So, given a set of tags...say tag_id=15 and tag_id=6, I'm trying to write SQL that will return all images that have both tags.

UNION gives me all the images that have 15 OR 6, not AND.

I'm sure there's a way, but I haven't wrapped my brain around it, yet.
mabwi
Forum Commoner
Posts: 27
Joined: Wed Aug 01, 2007 4:51 pm

Re: MySQL Intersect

Post by mabwi »

This should work, but it's probably not the best way

[sql]  SELECT filename FROM images     WHERE id IN (SELECT image_id FROM image_tags WHERE tag_id = 6)       AND id IN (SELECT image_id FROM image_tags WHERE tag_id=15)  [/sql]
User avatar
inghamn
Forum Contributor
Posts: 174
Joined: Mon Apr 16, 2007 10:33 am
Location: Bloomington, IN, USA

Re: MySQL Intersect

Post by inghamn »

So far, I've got something like this to get all the images based on a set of tags.

Code: Select all

 
SELECT filename FROM media m
INNER JOIN media_tags mt ON m.id=mt.media_id AND mt.tag_id IN (15,6)
GROUP BY m.id HAVING count(*)=2
 
But this technique starts breaking down when I want to get the related tags to display. In other words....the set of tags in a set of images defined by a set of tags. Ugh, recursion hurts my brain...
The only way I could think of to get the related tags is to put the above into a subquery.

Code: Select all

 
SELECT DISTINCT t.id FROM tags t
INNER JOIN media_tags mt ON t.id=mt.tag_id AND mt.media_id IN
(SELECT m.id AS id FROM media m
INNER JOIN media_tags mt ON m.id=mt.media_id AND mt.tag_id IN (15,6)
GROUP BY m.id HAVING count(*)=2 ORDER BY id) 
This, however takes around 20 seconds to run and is not viable. Makes me feel like I'm missing something.

(forgive the table name changes from the original example)
dml
Forum Contributor
Posts: 133
Joined: Sat Jan 26, 2008 2:20 pm

Re: MySQL Intersect

Post by dml »

* How many rows in those tables?
* What indexes do you have?
* The last query is a bit puzzling. You're trying to get the tag ids which are associated with the media that are associated with the tag ids 6 and 15?
* Can you put an EXPLAIN in front of the problem query to determine what's going on? I wonder if that inner select is getting executed for every combination of tags and media_tags.
User avatar
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: MySQL Intersect

Post by Eran »

Are you just trying to get the images tagged with specific tags? it seems a bit straightforward, but maybe I missed something:

Code: Select all

 
SELECT filename FROM images
     INNER JOIN image_tags ON image_tags.image_id=images.id
     INNER JOIN tags ON tags.id=image_tags.tag_id
WHERE tags.id = 15 AND tags.id = 6
 
User avatar
inghamn
Forum Contributor
Posts: 174
Joined: Mon Apr 16, 2007 10:33 am
Location: Bloomington, IN, USA

Re: MySQL Intersect

Post by inghamn »

I was trying to keep the example simple by not bringing in all the other fields I'm selecting media on. But, for example, selecting the media list will include other things in the where, like:

Code: Select all

 
SELECT filename FROM media m
INNER JOIN media_tags mt ON m.id=mt.media_id AND mt.tag_id IN (15,6)
WHERE roll_id=5 AND year(date)>2004
GROUP BY m.id HAVING count(*)=2
 
This works reasonably well. It's when I want to find the tags that are related to this set of selected media that stuff goes all wonky. Maybe it's just late, and my brain will work better in the morning.

But right now, it seems the counterpart to the above query is something like:

Code: Select all

 
SELECT name FROM tags t
INNER JOIN media_tags mt ON t.id=mt.tag_id AND mt.media_id IN (1,2,3,....n)
GROUP BY t.id HAVING count(*)=n
 
Except that doesnt' account for the above where clause. (
Last edited by inghamn on Fri Jul 18, 2008 3:50 pm, edited 1 time in total.
User avatar
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: MySQL Intersect

Post by Eran »

Did you try my solution? Can you show us the full query you are trying to modify?
User avatar
inghamn
Forum Contributor
Posts: 174
Joined: Mon Apr 16, 2007 10:33 am
Location: Bloomington, IN, USA

Re: MySQL Intersect

Post by inghamn »

Here's the raw queries being run right now -- given a user selecting Roll 4, and Tag 15 and Tag 6.

The media query is:

Code: Select all

 
SELECT m.id AS id FROM media m
INNER JOIN media_tags mt ON m.id=mt.media_id AND mt.tag_id IN (15,6)
WHERE m.roll_id='4' GROUP BY m.id HAVING count(*)=2 ORDER BY id
 
And then, to get the tags related to that media

Code: Select all

 
SELECT DISTINCT t.id AS id FROM tags t
INNER JOIN media_tags mt ON t.id=mt.tag_id AND mt.media_id IN 
    (SELECT m.id AS id FROM media m
    INNER JOIN media_tags mt ON m.id=mt.media_id AND mt.tag_id IN (15,6)
    WHERE m.roll_id='4' GROUP BY m.id HAVING count(*)=2 ORDER BY id )
ORDER BY name
 
Why the sub-select? Because I'm trying to avoid imploding an array of 20,000 media elements. So, I'm trying to base the set of media off of the query that originally selected the media.
User avatar
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: MySQL Intersect

Post by Eran »

Can you also specify your full table structure?
And explain in words what you are trying to retrieve and with what restrictions
User avatar
inghamn
Forum Contributor
Posts: 174
Joined: Mon Apr 16, 2007 10:33 am
Location: Bloomington, IN, USA

Re: MySQL Intersect

Post by inghamn »

So, now with a good night's sleep and a fresh cup of coffee, let's try this agin. :) I'll abandon my attempts to keep it terse.

I have been building a photo system for my family's photos. We're nearing the 20,000 photo mark right now. What I've chosen to build is a system where the user adds filters over time, thus shrinking the set of photos that they would be currently browsing. Something along the lines of, "Show me all the photos from Georgia with VW Beetles and my brother John."


Instead of some tree-style categories that I predetermine, I'm attempting to apply tagging to the system. The tags will be just another one of the filters the user can apply to the collection.

Code: Select all

 
CREATE TABLE media (
    id int UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
    filename varchar(128) NOT NULL,
    mime_type varchar(128) NOT NULL,
    media_type varchar(24) NOT NULL,
    title varchar(128),
    description text,
    md5 varchar(32) NOT NULL,
    user_id int UNSIGNED NOT NULL,
    roll_id int UNSIGNED NOT NULL,
    date datetime,
    city varchar(128),
    state varchar(128),
    country varchar(128),
    notes varchar(255),
    FOREIGN KEY (user_id) REFERENCES users(id),
    FOREIGN KEY (roll_id) REFERENCES rolls(id)
) engine=InnoDB;
 

Code: Select all

 
CREATE TABLE tags (
    id int UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
    name varchar(128) NOT NULL UNIQUE
) engine=InnoDB;
 

Code: Select all

 
CREATE TABLE media_tags (
    media_id int UNSIGNED NOT NULL,
    tag_id int UNSIGNED NOT NULL,
    FOREIGN KEY (media_id) REFERENCES media(id),
    FOREIGN KEY (tag_id) REFERENCES tags(id)
) engine=InnoDB;
 

The problem I'm having is with queries to get the intersection of a user-filtered set of media and the set of tags. This case comes up once the user has filtered the media by choosing a few tags. If the user chooses a few tags, then the media should be filtered to only the media where each photo has every tag the user has chosen. This is, of course, in addition to any other filters the user could have chosen, like say, roll_id or date.

I was reasonable successful getting all (and only) the media that matches each of the user's given tags. So, for example, if the user chooses tag 15, we show only the media that has that tag. If the user then chooses tag 6, we show only the media that has both tag 15 and tag 6. The best query I could come up with for this is (and this is the raw sql).

Code: Select all

 
SELECT m.id AS id FROM media m
INNER JOIN media_tags mt ON m.id=mt.media_id AND mt.tag_id IN (15,6)
GROUP BY m.id HAVING count(*)=2 ORDER BY id
 
This query is great, since if the user chooses other filters, I can add them in to a where clause, like so:

Code: Select all

 
SELECT m.id AS id FROM media m
INNER JOIN media_tags mt ON m.id=mt.media_id AND mt.tag_id IN (15,6)
WHERE roll_id=5 AND year(date)>1999
GROUP BY m.id HAVING count(*)=2 ORDER BY id
 
This means keeping track of the user's choices in an array, and imploding the id's of the tags. I also need to add the count of the array in the query. For user choices, this query is reasonable, since the number of tags they choose will most likely be small.

So, now, once the user is viewing a set of photos that has been filtered to these two tagsI decided (and it may be misguided) that I wanted to show the user a box of "Related" tags. I decided that a tag would be "Related" if every last photo in the current set had that tag. In our case, if the user filtered down to tags 15 and 6, I would expect the related tags box to display 15 and 6 but also maybe one or two other tags that the set happened to share.

So my challenge has been to write a query to give me only the set of tags that appear in each and every photo of a set....a set that has been defined by a couple tags that the user has already chosen. My attempts at this query have not been too successful. The only accurate one that I've got takes 20 seconds or so to run.

Code: Select all

 
SELECT DISTINCT t.id FROM tags t
INNER JOIN media_tags mt ON t.id=mt.tag_id AND mt.media_id IN
(SELECT m.id AS id FROM media m
INNER JOIN media_tags mt ON m.id=mt.media_id AND mt.tag_id IN (15,6)
GROUP BY m.id HAVING count(*)=2 ORDER BY id)
 

And just as a sidenote, elsewhere I am displaying the list of all the tags that appear at least once in a given collection of photos. That's a whole lot easier query. But in this case, I'm trying to get just the tags that each and every photo in a defined set has.
dml
Forum Contributor
Posts: 133
Joined: Sat Jan 26, 2008 2:20 pm

Re: MySQL Intersect

Post by dml »

It looks like the tables are indexed ok. Can you put an EXPLAIN in front of the problem query, and run it in the mysql console to find out what mysql is trying to do:

Code: Select all

 
EXPLAIN SELECT DISTINCT t.id FROM tags t
INNER JOIN media_tags mt ON t.id=mt.tag_id AND mt.media_id IN
(SELECT m.id AS id FROM media m
INNER JOIN media_tags mt ON m.id=mt.media_id AND mt.tag_id IN (15,6)
GROUP BY m.id HAVING count(*)=2 ORDER BY id)\G
 
Probably the best way for mysql to execute the query is from bottom to top: get the list of media ids from the subselect, iterate through those media ids and generate a list of related tag ids from media_tags, and then it'll have to put those tag ids into a temporary file and sort it to get the distinct tag ids (which will slow things down a bit). If mysql is doing the query the other way around, starting from the tag ids and looking up related media, the danger is that it's doing the subselect on every iteration instead of just once. That's just speculation, though: the EXPLAIN output will say exactly what's going on.
User avatar
inghamn
Forum Contributor
Posts: 174
Joined: Mon Apr 16, 2007 10:33 am
Location: Bloomington, IN, USA

Re: MySQL Intersect

Post by inghamn »

Here's the results:

Code: Select all

 
SELECT DISTINCT t.id AS id FROM tags t
INNER JOIN media_tags mt ON t.id=mt.tag_id AND mt.media_id IN
(SELECT m.id AS id FROM media m
INNER JOIN media_tags mt ON m.id=mt.media_id AND mt.tag_id IN (6,15)
GROUP BY m.id HAVING count(*)=2 ORDER BY id )
ORDER BY name
 
I'm not sure if the query can be done without the subqueries. Although I would love to be able to do it using only joins. But, for the life of me, I can't figure out how I would write it differently.

Code: Select all

 
+----+--------------------+-------+--------+-----------------+---------+---------+-----------------------+------+----------------------------------------------+
| id | select_type        | table | type   | possible_keys   | key     | key_len | ref                   | rows | Extra                                        |
+----+--------------------+-------+--------+-----------------+---------+---------+-----------------------+------+----------------------------------------------+
|  1 | PRIMARY            | t     | index  | PRIMARY         | name    | 386     | NULL                  |   32 | Using index; Using temporary; Using filesort |
|  1 | PRIMARY            | mt    | ref    | tag_id          | tag_id  | 4       | photobase.t.id        |  164 | Using where; Distinct                        |
|  2 | DEPENDENT SUBQUERY | mt    | range  | media_id,tag_id | tag_id  | 4       | NULL                  |  642 | Using where; Using temporary; Using filesort |
|  2 | DEPENDENT SUBQUERY | m     | eq_ref | PRIMARY         | PRIMARY | 4       | photobase.mt.media_id |    1 | Using index                                  |
+----+--------------------+-------+--------+-----------------+---------+---------+-----------------------+------+----------------------------------------------+
 
dml
Forum Contributor
Posts: 133
Joined: Sat Jan 26, 2008 2:20 pm

Re: MySQL Intersect

Post by dml »

It looks like mysql isn't being very clever. It should know that it only has to process the subquery once, but it's being executed as a DEPENDENT SUBQUERY, which means that mysql thinks it's going to get a different result for every row in media_tags, so it rexecutes it every time. Can you try something like this (not 100% sure if I've got the syntax right): it separates off the subquery a bit more so that mysql should recognise that it only has to execute it once.

Code: Select all

 
SELECT DISTINCT 
  t.id AS id 
FROM 
  tags t
  JOIN media_tags mt ON t.id=mt.tag_id
  JOIN (
    SELECT media_id 
    FROM media_tags
    WHERE mt.tag_id IN (6,15)
    GROUP BY media_id HAVING count(*)=2
   )related_media_ids ON mt.media_id = related_media_ids.media_id
 
User avatar
inghamn
Forum Contributor
Posts: 174
Joined: Mon Apr 16, 2007 10:33 am
Location: Bloomington, IN, USA

Re: MySQL Intersect

Post by inghamn »

Much obliged for the idea.

But, alas, that query returns the list of tags that appear at least once in the set of media. In order to limit the results to just the set of tags where each tags exists in each and every media, I'm beginning to believe the dependent subqueries are necessary.
dml
Forum Contributor
Posts: 133
Joined: Sat Jan 26, 2008 2:20 pm

Re: MySQL Intersect

Post by dml »

Doh! It seems so simple, but then my brain gets tangled up trying to parse the subtleties of "any of the x's for which each of the y's have all of the z's". I think it's easy to do it in two requests, in a reasonably logical way that will probably perform not horribly, but I can't think of a way of doing it in one request - I don't even understand how that dependent subquery version works.

If you have the following photos+tags, and we're interested in the related tags for "Fred"+"2007".
  • Photo 1: Fred, Beer, 2007, Picnic
  • Photo 2: Fred, 2008, Concert
  • Photo 3: Fred, Maud
  • Photo 4: Cat
  • Photo 5: Dog, Picnic
  • Photo 6: Fred, Alice, 2007, Picnic
  • Photo 7: Fred, Maurice, 2007, Picnic
It happens that all the 2007 photos of Fred were taken at a Picnic, so that's what we want when we do a query on related tags.

We're ok on the first part of it, we want to get the ids of the photos tagged Fred+2007, ie photos 1, 6, and 7.

Code: Select all

/* IDS OF PHOTOS TAGGED WITH Fred+2007 -> RETURNS 1, 6, 7*/
SELECT photo_id FROM photos_tags WHERE tag IN ('Fred', '2007') 
GROUP BY photo_id HAVING count(*) = 2
The next part is to extract the tags associated with all of the related photos, and the structure of the query is the same, just the fields reversed:

Code: Select all

/* COMMON TAGS OF PHOTOS 1, 6, 7 -> RETURNS 'Fred', '2007', 'Picnic'*/
SELECT tag FROM photos_tags WHERE photo_id IN (1, 6, 7)
GROUP BY tag HAVING count(*)=3
The first query could be made into a subquery of the second, except for the count(*)=3 part: you don't know what to make the count until you've run the first query. The photo_ids from the first_query needn't be brought all the way back into php, they can be thrown into a temporary table and a join done on that. Another optimization is to filter out the original tags from the second query: fewer records to have to group and sort.
Post Reply