Login

I am an old-school MySQL user and have always preferred `JOIN` over sub-query. But nowadays everyone uses sub-query, and I hate it; I don't know why.

I lack the theoretical knowledge to judge for myself if there is any difference. Is a sub-query as good as a `JOIN` and therefore is there nothing to worry about?

Sub-queries are the logically correct way to solve problems of the form, "Get facts from A, conditional on facts from B". In such instances, it makes more logical sense to stick B in a sub-query than to do a join. It is also safer, in a practical sense, since you don't have to be cautious about getting duplicated facts from A due to multiple matches against B.

Practically speaking, however, the answer usually comes down to performance. Some optimisers suck lemons when given a join vs a sub-query, and some suck lemons the other way, and this is optimiser-specific, DBMS-version-specific and query-specific.

Historically, explicit joins usually win, hence the established wisdom that joins are better, but optimisers are getting better all the time, and so I prefer to write queries first in a logically coherent way, and then restructure if performance constraints warrant this.

Use EXPLAIN to see how your database executes the query on your data. There is a huge "it depends" in this answer...

PostgreSQL can rewrite a subquery to a join or a join to a subquery when it thinks one is faster than the other. It all depends on the data, indexes, correlation, amount of data, query, etc.

In most cases `JOIN`s are faster than sub-queries and it is very rare for a sub-query to be faster.

In `JOIN`s RDBMS can create an execution plan that is better for your query and can predict what data should be loaded to be processed and save time, unlike the sub-query where it will run all the queries and load all their data to do the processing.

The good thing in sub-queries is that they are more readable than `JOIN`s: that's why most new SQL people prefer them; it is the easy way; but when it comes to performance, JOINS are better in most cases even though they are not hard to read too.

[MSDN Documentation for SQL Server says][1]
> Many Transact-SQL statements that include subqueries can be alternatively formulated as joins. Other questions can be posed only with subqueries. In Transact-SQL, there is usually no performance difference between a statement that includes a subquery and a semantically equivalent version that does not. However, in some cases where existence must be checked, a join yields better performance. Otherwise, the nested query must be processed for each result of the outer query to ensure elimination of duplicates. In such cases, a join approach would yield better results.

so if you need something like

select * from t1 where exists select * from t2 where t2.parent=t1.id

try to use join instead. In other cases, it makes no difference.

I say: Creating **functions** for subqueries eliminate the problem of cluttter and allows you to implement additional logic to subqueries. So I recommend creating functions for subqueries whenever possible.

Clutter in code is a big problem and the industry has been working on avoiding it for decades.

[1]:

[To see links please register here]

[2]:

[To see links please register here]

[3]:

[To see links please register here]

MySQL version: 5.5.28-0ubuntu0.12.04.2-log

I was also under the impression that JOIN is always better than a sub-query in MySQL, but EXPLAIN is a better way to make a judgment. Here is an example where sub queries work better than JOINs.

Here is my query with 3 sub-queries:

EXPLAIN SELECT vrl.list_id,vrl.ontology_id,vrl.position,l.name AS list_name, vrlih.position AS previous_position, vrl.moved_date
FROM `vote-ranked-listory` vrl
INNER JOIN lists l ON l.list_id = vrl.list_id
INNER JOIN `vote-ranked-list-item-history` vrlih ON vrl.list_id = vrlih.list_id AND vrl.ontology_id=vrlih.ontology_id AND vrlih.type='PREVIOUS_POSITION'
INNER JOIN list_burial_state lbs ON lbs.list_id = vrl.list_id AND lbs.burial_score < 0.5
WHERE vrl.position <= 15 AND l.status='ACTIVE' AND l.is_public=1 AND vrl.ontology_id < 1000000000
AND (SELECT list_id FROM list_tag WHERE list_id=l.list_id AND tag_id=43) IS NULL
AND (SELECT list_id FROM list_tag WHERE list_id=l.list_id AND tag_id=55) IS NULL
AND (SELECT list_id FROM list_tag WHERE list_id=l.list_id AND tag_id=246403) IS NOT NULL
ORDER BY vrl.moved_date DESC LIMIT 200;

EXPLAIN shows:

+----+--------------------+----------+--------+-----------------------------------------------------+--------------+---------+-------------------------------------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+----------+--------+-----------------------------------------------------+--------------+---------+-------------------------------------------------+------+--------------------------+
| 1 | PRIMARY | vrl | index | PRIMARY | moved_date | 8 | NULL | 200 | Using where |
| 1 | PRIMARY | l | eq_ref | PRIMARY,status,ispublic,idx_lookup,is_public_status | PRIMARY | 4 | ranker.vrl.list_id | 1 | Using where |
| 1 | PRIMARY | vrlih | eq_ref | PRIMARY | PRIMARY | 9 | ranker.vrl.list_id,ranker.vrl.ontology_id,const | 1 | Using where |
| 1 | PRIMARY | lbs | eq_ref | PRIMARY,idx_list_burial_state,burial_score | PRIMARY | 4 | ranker.vrl.list_id | 1 | Using where |
| 4 | DEPENDENT SUBQUERY | list_tag | ref | list_tag_key,list_id,tag_id | list_tag_key | 9 | ranker.l.list_id,const | 1 | Using where; Using index |
| 3 | DEPENDENT SUBQUERY | list_tag | ref | list_tag_key,list_id,tag_id | list_tag_key | 9 | ranker.l.list_id,const | 1 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | list_tag | ref | list_tag_key,list_id,tag_id | list_tag_key | 9 | ranker.l.list_id,const | 1 | Using where; Using index |
+----+--------------------+----------+--------+-----------------------------------------------------+--------------+---------+-------------------------------------------------+------+--------------------------+

The same query with JOINs is:

EXPLAIN SELECT vrl.list_id,vrl.ontology_id,vrl.position,l.name AS list_name, vrlih.position AS previous_position, vrl.moved_date
FROM `vote-ranked-listory` vrl
INNER JOIN lists l ON l.list_id = vrl.list_id
INNER JOIN `vote-ranked-list-item-history` vrlih ON vrl.list_id = vrlih.list_id AND vrl.ontology_id=vrlih.ontology_id AND vrlih.type='PREVIOUS_POSITION'
INNER JOIN list_burial_state lbs ON lbs.list_id = vrl.list_id AND lbs.burial_score < 0.5
LEFT JOIN list_tag lt1 ON lt1.list_id = vrl.list_id AND lt1.tag_id = 43
LEFT JOIN list_tag lt2 ON lt2.list_id = vrl.list_id AND lt2.tag_id = 55
INNER JOIN list_tag lt3 ON lt3.list_id = vrl.list_id AND lt3.tag_id = 246403
WHERE vrl.position <= 15 AND l.status='ACTIVE' AND l.is_public=1 AND vrl.ontology_id < 1000000000
AND lt1.list_id IS NULL AND lt2.tag_id IS NULL
ORDER BY vrl.moved_date DESC LIMIT 200;

and the output is:

+----+-------------+-------+--------+-----------------------------------------------------+--------------+---------+---------------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------------------------------------------+--------------+---------+---------------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | lt3 | ref | list_tag_key,list_id,tag_id | tag_id | 5 | const | 2386 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | l | eq_ref | PRIMARY,status,ispublic,idx_lookup,is_public_status | PRIMARY | 4 | ranker.lt3.list_id | 1 | Using where |
| 1 | SIMPLE | vrlih | ref | PRIMARY | PRIMARY | 4 | ranker.lt3.list_id | 103 | Using where |
| 1 | SIMPLE | vrl | ref | PRIMARY | PRIMARY | 8 | ranker.lt3.list_id,ranker.vrlih.ontology_id | 65 | Using where |
| 1 | SIMPLE | lt1 | ref | list_tag_key,list_id,tag_id | list_tag_key | 9 | ranker.lt3.list_id,const | 1 | Using where; Using index; Not exists |
| 1 | SIMPLE | lbs | eq_ref | PRIMARY,idx_list_burial_state,burial_score | PRIMARY | 4 | ranker.vrl.list_id | 1 | Using where |
| 1 | SIMPLE | lt2 | ref | list_tag_key,list_id,tag_id | list_tag_key | 9 | ranker.lt3.list_id,const | 1 | Using where; Using index |
+----+-------------+-------+--------+-----------------------------------------------------+--------------+---------+---------------------------------------------+------+----------------------------------------------+

A comparison of the `rows` column tells the difference and the query with JOINs is using `Using temporary; Using filesort`.

Of course when I run both the queries, the first one is done in 0.02 secs, the second one does not complete even after 1 min, so EXPLAIN explained these queries properly.

If I do not have the INNER JOIN on the `list_tag` table i.e. if I remove

AND (SELECT list_id FROM list_tag WHERE list_id=l.list_id AND tag_id=246403) IS NOT NULL

from the first query and correspondingly:

INNER JOIN list_tag lt3 ON lt3.list_id = vrl.list_id AND lt3.tag_id = 246403

from the second query, then EXPLAIN returns the same number of rows for both queries and both these queries run equally fast.

Run on a very large database from an old Mambo CMS:

SELECT id, alias
FROM
mos_categories
WHERE
id IN (
SELECT
DISTINCT catid
FROM mos_content
);
0 seconds

SELECT
DISTINCT mos_content.catid,
mos_categories.alias
FROM
mos_content, mos_categories
WHERE
mos_content.catid = mos_categories.id;

~3 seconds

An EXPLAIN shows that they examine the exact same number of rows, but one takes 3 seconds and one is near instant. Moral of the story? If performance is important (when isn't it?), try it multiple ways and see which one is fastest.

And...

SELECT
DISTINCT mos_categories.id,
mos_categories.alias
FROM
mos_content, mos_categories
WHERE
mos_content.catid = mos_categories.id;

0 seconds

Again, same results, same number of rows examined. My guess is that DISTINCT mos_content.catid takes far longer to figure out than DISTINCT mos_categories.id does.

First of all, to compare the two first you should distinguish queries with subqueries to:

1. a class of subqueries that always have corresponding equivalent query written with joins
2. a class of subqueries that can not be rewritten using joins

**For the first class** of queries a good RDBMS will see joins and subqueries as equivalent and will produce same query plans.

These days even mysql does that.

Still, sometimes it does not, but this does not mean that joins will always win - I had cases when using subqueries in mysql improved performance. (For example if there is something preventing mysql planner to correctly estimate the cost and if the planner doesn't see the join-variant and subquery-variant as same then subqueries can outperform the joins by forcing a certain path).

Conclusion is that you should test your queries for both join and subquery variants if you want to be sure which one will perform better.

**For the second class** the comparison makes no sense as those queries can not be rewritten using joins and in these cases subqueries are natural way to do the required tasks and you should not discriminate against them.

These days, many dbs can optimize subqueries and joins. Thus, you just gotto examine your query using explain and see which one is faster. If there is not much difference in performance, I prefer to use subquery as they are simple and easier to understand.

The difference is only seen when the second joining table has significantly more data than the primary table. I had an experience like below...

We had a users table of one hundred thousand entries and their membership data (friendship) about 3 hundred thousand entries. It was a join statement in order to take friends and their data, but with a great delay. But it was working fine where there was only a small amount of data in the membership table. Once we changed it to use a sub-query it worked fine.

But in the mean time the join queries are working with other tables that have fewer entries than the primary table.

So I think the join and sub query statements are working fine and it depends on the data and the situation.

gaylenefljkoclof

jemimahidpqmqtcit

spryness531214

seltzer739

akinesia929510

trudytrue704

talyahccncvi

nephogram558

wisent575

Droverfly3