Subqueries can be a performance killer in MySQL, but the surprising truth is that most of them can be rewritten as joins and run much faster.
Let’s see this in action. Imagine we have two tables: orders and customers.
-- orders table
CREATE TABLE orders (
order_id INT PRIMARY KEY AUTO_INCREMENT,
customer_id INT,
order_date DATE,
amount DECIMAL(10, 2)
);
-- customers table
CREATE TABLE customers (
customer_id INT PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(100),
email VARCHAR(100)
);
-- Sample Data
INSERT INTO orders (customer_id, order_date, amount) VALUES
(1, '2023-01-15', 100.50),
(2, '2023-01-16', 75.20),
(1, '2023-01-17', 200.00),
(3, '2023-01-18', 50.75),
(2, '2023-01-19', 120.00);
INSERT INTO customers (name, email) VALUES
('Alice', 'alice@example.com'),
('Bob', 'bob@example.com'),
('Charlie', 'charlie@example.com');
Now, let’s say we want to find all orders placed by customers named 'Alice'. A common way to do this with a subquery is:
SELECT *
FROM orders
WHERE customer_id IN (SELECT customer_id FROM customers WHERE name = 'Alice');
When MySQL executes this, it often has to run the subquery for every row in the outer query. This is like asking "Is this order’s customer ID in the list of Alice’s customer IDs?" for each and every order. If you have millions of orders, that’s millions of subquery executions.
The same logic can be expressed using a JOIN:
SELECT o.*
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE c.name = 'Alice';
Here, MySQL’s query optimizer can be much more efficient. It can scan the customers table once to find 'Alice', get her customer_id, and then efficiently look up all matching orders in the orders table using the index on customer_id. This is typically one or a few table scans and index lookups, rather than millions of subquery executions.
The core problem subqueries solve is relating data across tables when you only need a subset of data from one table to filter another. The IN subquery, for example, is checking for membership. A JOIN achieves the same goal by directly linking rows from both tables based on a common key.
The key levers you control are the JOIN type (INNER JOIN, LEFT JOIN, etc.) and the ON condition. INNER JOIN is the most common for replacing IN subqueries because it only returns rows where a match exists in both tables, mirroring the behavior of IN. The ON clause is the direct equivalent of the WHERE clause used in the subquery to link the tables.
Consider another common pattern: correlated subqueries in the SELECT list.
SELECT
o.order_id,
o.amount,
(SELECT c.name FROM customers c WHERE c.customer_id = o.customer_id) AS customer_name
FROM orders o;
This subquery runs for every single row in the orders table, fetching the customer’s name. Rewriting this as a join:
SELECT
o.order_id,
o.amount,
c.name AS customer_name
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id;
Using LEFT JOIN here is crucial. If an order had a customer_id that didn’t exist in the customers table (which shouldn’t happen with proper foreign keys, but can in messy data), the LEFT JOIN would still return the order row, showing NULL for customer_name. The subquery version would also return NULL if the subquery found no match. If you only wanted orders with valid customers, you’d use an INNER JOIN here.
The most impactful performance gains come from replacing subqueries that are executed repeatedly for each row of the outer query. This often happens with IN or EXISTS in the WHERE clause, or scalar subqueries in the SELECT list. The query optimizer in modern MySQL versions is quite good at automatically converting some subqueries to joins, but it’s not perfect, and explicitly writing them as joins guarantees the efficient execution plan.
The fundamental difference in how the database processes these queries lies in the execution plan. A subquery often leads to a plan where the outer query is processed, and for each row, the subquery is executed. A join typically allows the database to build a plan that reads from both tables simultaneously or in a coordinated manner, using indexes and join algorithms (like hash joins or nested loop joins) to efficiently combine the data. When you rewrite a subquery as a join, you’re essentially guiding the optimizer towards a more efficient execution strategy by providing the explicit relationships between the tables.
The next logical step after mastering query optimization with joins is understanding how to leverage indexes effectively to further accelerate these join operations.