Page 1 of 2

MySQL Get records that have 3+ specific related records

Posted: Thu Oct 13, 2011 7:34 am
by batfastad
Hi everyone

I've got a reasonably complex SQL query I'm looking to build.
I have 2 related tables (for this question)... companies and tags_data.
tags_data stores the information about which companies are tagged with which tags.
The companies table is related to tags_data by a company_id field. Tags_data has 3 fields in total: tagdata_id (the auto-enter primary key), company_id and tag_id.

I'm looking to get a list of all companies which have the same 3 tags. But I also want the ability to use it for a variable number of tags, so to list companies that have the same 2 tags or the same 10 tags. That's why I have initially set it up using INNER JOIN to compute the company_ids that have the required tags, then return those companies from the companies table.

Here's what I've got so far to list companies that have all 3 of the following tag ids... 713, 718, 724

Code: Select all

SELECT `companies`.`company_id`, `companies`.`company_name` 
FROM `companies` 
INNER JOIN (
SELECT `company_id`, COUNT(`tagdata_id`) FROM `tags_data` WHERE `tag_id` IN(713, 718, 724) GROUP BY `company_id` HAVING COUNT(`tagdata_id`)=3
) AS `require_tags` ON `companies`.`company_id`=`require_tags`.`company_id`
GROUP BY `companies`.`company_id` 
ORDER BY `companies`.`company_name`
I've simplified this by removing many fields from the parent SELECT and several other LEFT JOINs which get related info from other tables.

Performance of this is adequate, taking about 0.1 seconds to look through 150k tags_data records and 40k companies records to return 99 companies matching the query. The tables have indexes on company_id and tag_id fields.
The great thing about the INNER JOIN on a derived table is that I can easily change it to list companies that don't have any of those tags, or companies that have more than x and less than y number of multiple tags by just modifying the HAVING clause. I'm planning to make a query builder so users can build these complex queries for themselves.

Performance dives when I add a WHERE clause to the parent query (eg: outside the INNER JOIN) to do an even more complex search but I guess the answer to that is to move the WHERE clause inside the INNER JOIN.

Is there a method that would be faster but still give me the same flexibility as using an INNER JOIN in this way?
Or should the INNER JOIN theoretically be the fastest method and any further speed improvements would come from improved indexes and hardware?

Cheers, B

Re: MySQL Get records that have 3+ specific related records

Posted: Thu Oct 13, 2011 6:39 pm
by Christopher
Just an idea to get rid of the JOIN on a SELECT, maybe try a LEFT JOIN which will give NULLs where there is no tags_data. You can then remove the NULL records and check if the counts are 3. Something like that anyway:

Code: Select all

SELECT `companies`.`company_id`, `companies`.`company_name` 
FROM `companies` 
LEFT JOIN `tags_data` ON companies.company_id=tags_data.company_id
WHERE `tag_id` IN(713, 718, 724) AND tags_data.company_id NOT NULL
GROUP BY `company_id` HAVING COUNT(companies.tagdata_id)=3

Re: MySQL Get records that have 3+ specific related records

Posted: Fri Oct 14, 2011 1:39 am
by VladSun
Christopher, I haven't checked it, but removing the NULL LEFT joined records sounds like an INNER JOIN?

Re: MySQL Get records that have 3+ specific related records

Posted: Fri Oct 14, 2011 2:24 am
by Christopher
VladSun wrote:Christopher, I haven't checked it, but removing the NULL LEFT joined records sounds like an INNER JOIN?
Yeah they probably are the same. This problem would need some testing. I was mainly trying to show the idea of simplifying the query where you would have some IDs with 3 results and others with 1 or 2. If you started with a LEFT JOIN then you could see NULLs for the tags that did not match. Then remove the NULL records with INNER JOIN or WHERE NOT NULL. Then check the COUNT()s.

It seems like it was the JOIN to a SELECT that was making it slow.

Re: MySQL Get records that have 3+ specific related records

Posted: Mon Oct 17, 2011 10:56 am
by batfastad
Hi guys
Sorry for the delay in checking this out. I've done some testing and that revised query seems to produce a vastly expanded result set.

The problem is the WHERE ... IN() as that does an OR search for companies that have one of those tag_ids set.
Not companies that have all of those tag_ids set.
The HAVING clause is restricting the results to companies that have exactly 3 tags set. But any 3 tags - not just those specified in the WHERE ... IN() clause.

But I do agree that it is the INNER JOIN slowing things down. MySQL first has to compute the list of company_ids returned by the INNER JOIN before going through each company_id and retrieving further data from the companies table and possibly doing further WHERE operations on other related tables.

Cheers, B

Re: MySQL Get records that have 3+ specific related records

Posted: Mon Oct 17, 2011 9:29 pm
by Christopher
batfastad wrote:The problem is the WHERE ... IN() as that does an OR search for companies that have one of those tag_ids set.
Not companies that have all of those tag_ids set.
Then don't use IN() but instead build AND conditions. PHP is excellent at building strings like this.

Re: MySQL Get records that have 3+ specific related records

Posted: Tue Oct 18, 2011 3:14 am
by batfastad
It certainly is excellent for building strings like that. The simple search part of our SQL query builder already constructs WHERE clauses by adding criteria to an array and imploding.

Using AND produces no results.
By definition there are no related tag records that have a tag_id of 713 AND 718 AND 724.
Each tag for a company is it's own related record.

Cheers, B

Re: MySQL Get records that have 3+ specific related records

Posted: Tue Oct 18, 2011 5:29 am
by VladSun
Christopher wrote:
batfastad wrote:The problem is the WHERE ... IN() as that does an OR search for companies that have one of those tag_ids set.
Not companies that have all of those tag_ids set.
Then don't use IN() but instead build AND conditions. PHP is excellent at building strings like this.
Isn't it "OR conditions"?

Re: MySQL Get records that have 3+ specific related records

Posted: Tue Oct 18, 2011 5:46 am
by VladSun

Code: Select all

SELECT 
	`company`.`id`, 
	`company`.`name`,
	COUNT(DISTINCT `company_tag`.`tag_id`) as `tags_found`
FROM 
	`company` 
INNER JOIN
	`company_tag`	ON 
		`company_tag`.`company_id` = `company`.`id`
		AND
		`company_tag`.`tag_id` in (713, 718, 724)
GROUP BY
	`company`.`id`, 
	`company`.`name`
HAVING
	`tags_found` = 3
?

Re: MySQL Get records that have 3+ specific related records

Posted: Tue Oct 18, 2011 5:50 am
by VladSun
batfastad wrote:Tags_data has 3 fields in total: tagdata_id (the auto-enter primary key), company_id and tag_id.
Do you have duplicated company_id-tag_id pairs ? I doubt :)
So, that would be a perfect example of when NOT to use a surrogate PK ;) Your PK should be a composed natural PK - i.e. the "company_id-tag_id" :) and drop the surrogate key.

Re: MySQL Get records that have 3+ specific related records

Posted: Tue Oct 18, 2011 8:00 am
by batfastad
VladSun wrote:Do you have duplicated company_id-tag_id pairs ? I doubt :)
So, that would be a perfect example of when NOT to use a surrogate PK ;) Your PK should be a composed natural PK - i.e. the "company_id-tag_id" :) and drop the surrogate key.
Correct. Each company can only be tagged with the same tag, once.

Generating my own primary key in this instance combining company_id and tag_id does sound like a good idea. Rather than having a separate autoincrement as the PK.
But then do I still need separate fields for company_id and tag_id, or would that just be repetition of data that's already in the table?
Would that enable me to rewrite this query in a way that could be faster?

Cheers, B

Re: MySQL Get records that have 3+ specific related records

Posted: Tue Oct 18, 2011 8:10 am
by VladSun
batfastad wrote:But then do I still need separate fields for company_id and tag_id, or would that just be repetition of data that's already in the table?
Huh? Sorry, couldn't get that one. The company_tag table (i.e. tags_data) will contain only primary key data, though it's composite. You will still have access to both columns.

I meant:

Code: Select all

create table company_tag
(
   company_id int(10) not null,
   tag_id int(10) not null,
   primary key (company_id, tag_id)
);
Did you try the query I posted?

Re: MySQL Get records that have 3+ specific related records

Posted: Tue Oct 18, 2011 3:58 pm
by batfastad
Just tried that query and it seems to be slower than my original. I think because it first gets 20k records in the INNER JOIN before restricting it to those records that have all 3 of those tags with the HAVING.

Just wondering what field you mean by in the 2nd one of the GROUP BY?

I now understand your point about the compound primary key, so there's only 2 fields in the table.
I thought you might mean having 3 fields... a primary key which is the company_id and tag_id concatenated, then a company_id field, then a tag_id field.
So if you only have 2 fields per your suggestion, when deleting a record it means you would need to provide both the company_id and tag_id to isolate that unique record?

Cheers, B

Re: MySQL Get records that have 3+ specific related records

Posted: Tue Oct 18, 2011 7:21 pm
by Weirdan
batfastad wrote:Just tried that query and it seems to be slower than my original. I think because it first gets 20k records in the INNER JOIN before restricting it to those records that have all 3 of those tags with the HAVING.
Most likely you're correct. As we're talking about performance, can you please post EXPLAINs of your original query, latest query as well as results of

Code: Select all

show indexes from company
and

Code: Select all

show indexes from company_tag
?
So if you only have 2 fields per your suggestion, when deleting a record it means you would need to provide both the company_id and tag_id to isolate that unique record?
Yes

Re: MySQL Get records that have 3+ specific related records

Posted: Wed Oct 19, 2011 1:45 am
by VladSun
batfastad wrote:Just tried that query and it seems to be slower than my original. I think because it first gets 20k records in the INNER JOIN before restricting it to those records that have all 3 of those tags with the HAVING.
Post the results of EXPLAIN query also.
batfastad wrote:Just wondering what field you mean by in the 2nd one of the GROUP BY?
It's a habbit. Although recent SQL standarts permit grouping by some and not all of the non aggregating columns in the SELECT clause (in case they have 1:1 relationship, just like company.id and company.name columns), some of the SQL engines (e.g. PostgreSQL) will raise an error if the set of the columns in the GROUP BY clause does not contain ALL of the non agregating columns in the SELECT clause.

Now, since you said company.id/tag.id pairs are unique, you can remove the DISTINCT modifier from the COUNT(...) column. That should improve performance too.