MySQL’s PARTITION BY HASH is surprisingly ineffective for distributing data evenly across partitions, leading to performance bottlenecks that most users don’t anticipate.

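Let’s see what happens when we partition a table by HASH on a column other than its surrogate ID. MySQL requires the primary key and every unique key on a partitioned table to include all columns used in the partitioning expression (otherwise CREATE TABLE fails with error 1503), so product_id has to be part of the primary key here.

CREATE TABLE sales (
    sale_id INT NOT NULL AUTO_INCREMENT,
    product_id INT NOT NULL,
    sale_date DATE NOT NULL,
    amount DECIMAL(10, 2) NOT NULL,
    PRIMARY KEY (sale_id, product_id) -- must include the partitioning column
) ENGINE=InnoDB
PARTITION BY HASH(product_id)
PARTITIONS 4;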

INSERT INTO sales (product_id, sale_date, amount) VALUES
(101, '2023-01-15', 50.00),
(102, '2023-01-16', 75.00),
(101, '2023-01-17', 60.00),
(103, '2023-01-18', 100.00),
(102, '2023-01-19', 80.00),
(101, '2023-01-20', 55.00);

-- Simulate a large number of inserts
INSERT INTO sales (product_id, sale_date, amount)
SELECT product_id, sale_date, amount
FROM sales; -- Repeat this many times to fill the table

Now, imagine we want to query sales for a specific product_id.

SELECT * FROM sales WHERE product_id = 101;

Because product_id is the partitioning expression and the condition is an equality, MySQL can compute MOD(101, 4) = 1 and read only partition p1. The trouble is that this is close to the only query shape hash partitioning helps. A wide range such as product_id > 100, or a filter on sale_date alone, gives the optimizer nothing to map to a specific partition, so it scans all four, negating the benefit of partitioning for that query. Pruning also does nothing about skew: if product 101 accounts for most of the rows, p1 holds most of the table no matter how many partitions you declare.

The core problem PARTITION BY HASH is meant to solve is distributing rows evenly. Despite the name, there is no real hash function involved: MySQL evaluates the partitioning expression, which must return an integer, and takes it modulo the number of partitions, here MOD(product_id, 4). The distribution is therefore only as even as the values themselves. If every product_id happened to be a multiple of 4, every row would land in p0. There is also a structural constraint: the partitioning column must appear in the PRIMARY KEY and in every UNIQUE index, or MySQL rejects the table definition. And because consecutive values are scattered across partitions, any query that doesn’t pin the partitioning expression to specific values has to scan multiple partitions to be sure it finds all matching rows.
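Since the partition number comes from product_id modulo 4, you can predict the distribution with an ordinary query and compare it with what MySQL reports. The following is a minimal sketch against the sales table above; the column aliases are illustrative, and TABLE_ROWS is only an estimate for InnoDB.

```sql
-- Predicted partition number for each row under HASH(product_id) PARTITIONS 4.
SELECT MOD(product_id, 4) AS partition_no,
       COUNT(*)           AS row_count
FROM sales
GROUP BY partition_no
ORDER BY partition_no;

-- Per-partition row counts as MySQL sees them. For InnoDB these are
-- estimates, so refresh the statistics first and treat them as approximate.
ANALYZE TABLE sales;

SELECT PARTITION_NAME, TABLE_ROWS
FROM information_schema.PARTITIONS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'sales';
```

With the sample rows, product IDs 101, 102, and 103 map to partitions 1, 2, and 3, so p0 stays empty however many times the rows are duplicated.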

The primary goal of partitioning is to reduce the scope of data that a query needs to examine. When you partition a table, you’re essentially dividing it into smaller, more manageable chunks based on a specific column or expression. For range partitioning, this is straightforward: sale_date BETWEEN '2023-01-01' AND '2023-01-31'. For list partitioning, it’s discrete values: country IN ('USA', 'Canada'). For hash partitioning, it’s supposed to be a mathematical distribution.

Let’s consider a more effective strategy: partitioning by RANGE on the sale_date.

CREATE TABLE sales_ranged (
    sale_id INT NOT NULL AUTO_INCREMENT,
    product_id INT NOT NULL,
    sale_date DATE NOT NULL,
    amount DECIMAL(10, 2) NOT NULL,
    PRIMARY KEY (sale_id, sale_date) -- Include partitioning key in PK
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(sale_date)) (
    PARTITION p202301 VALUES LESS THAN (TO_DAYS('2023-02-01')),
    PARTITION p202302 VALUES LESS THAN (TO_DAYS('2023-03-01')),
    PARTITION p202303 VALUES LESS THAN (TO_DAYS('2023-04-01')),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);

-- Insert data into the new table
INSERT INTO sales_ranged (product_id, sale_date, amount)
SELECT product_id, sale_date, amount FROM sales; -- Assuming 'sales' table exists

Now, a query for a specific date range can efficiently target only the relevant partition(s).

SELECT * FROM sales_ranged WHERE sale_date BETWEEN '2023-01-15' AND '2023-01-20';

MySQL can now look at TO_DAYS(sale_date) for the query’s WHERE clause and determine that it only needs to scan p202301. This is partition pruning at its best. The key here is that the query condition directly matches the partitioning scheme.
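To see how sensitive pruning is to the shape of the WHERE clause, it can help to compare two logically similar queries. This sketch assumes the sales_ranged table above; on MySQL 5.6 and earlier you would write EXPLAIN PARTITIONS instead, and the exact output depends on your version.

```sql
-- Range on the bare column: expect only p202301 in the "partitions" column.
EXPLAIN SELECT * FROM sales_ranged
WHERE sale_date BETWEEN '2023-01-15' AND '2023-01-20';

-- The column wrapped in a function that the partitioning expression does not use.
-- The optimizer cannot turn this into a range on sale_date,
-- so expect every partition to be listed.
EXPLAIN SELECT * FROM sales_ranged
WHERE MONTH(sale_date) = 1;
```

Both queries are about January, but only the first gives the optimizer a range it can map onto partition boundaries.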

The most surprising aspect of MySQL partitioning is how often the optimizer fails to prune partitions even when it seems like it should. Pruning depends on the WHERE clause comparing the partitioning column itself to constants. For HASH and KEY partitioning that means an equality, an IN list, or a very short integer range; for RANGE and LIST it also covers range comparisons such as <, >, and BETWEEN. Wrap the column in an unrelated function, compare it to another column, or filter on a different column entirely, and the optimizer can no longer guarantee that all matching rows reside in a subset of partitions. When it can’t get that guarantee, it defaults to scanning all partitions to avoid missing data.

If you’re using PARTITION BY HASH and noticing that queries aren’t performing as expected, the most common culprit is that the WHERE clause doesn’t pin the hash key down with an equality or an IN list. Without that, MySQL can’t prune partitions.

To fix this, you have a few options:

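  1. Make sure the partitioning column is in the Primary Key: MySQL enforces this, so if you are converting an existing table whose key is only sale_id and product_id is your hash key, widen the key first:

    ALTER TABLE sales DROP PRIMARY KEY, ADD PRIMARY KEY (sale_id, product_id);

    Keeping sale_id first satisfies InnoDB’s rule that an AUTO_INCREMENT column must be the leftmost column of some index. This doesn’t make product_id unique or improve pruning by itself; it lets MySQL check the key’s uniqueness inside a single partition, which is why the rule exists.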

  2. Use PARTITION BY RANGE or LIST: If your queries are typically based on date ranges or specific discrete values, switch to RANGE or LIST partitioning. This aligns the partitioning scheme with common query patterns.

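    -- Example for RANGE on date. sale_date must be in the primary key,
    -- so the key change and the repartitioning go in one statement.
    ALTER TABLE sales
        DROP PRIMARY KEY,
        ADD PRIMARY KEY (sale_id, sale_date)
    PARTITION BY RANGE (TO_DAYS(sale_date)) (
        PARTITION p202301 VALUES LESS THAN (TO_DAYS('2023-02-01')),
        PARTITION pmax VALUES LESS THAN MAXVALUE
    );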
    
  3. Change the Partitioning Key: If product_id is not suitable for range/list partitioning and you can’t make it part of the PK, consider partitioning on a different column that is suitable for range/list partitioning or that can be part of a composite PK.

  4. Confirm pruning with EXPLAIN: Always run EXPLAIN on your critical queries and look at the partitions column in the output. If it lists a subset of your total partitions (e.g., p0, p1 out of p0, p1, p2, p3), pruning is working. If it lists all partitions, it’s not. On MySQL 5.6 and earlier, use EXPLAIN PARTITIONS to get that column; it is shown by default from 5.7 onward, and the PARTITIONS keyword was removed in 8.0.

    EXPLAIN SELECT * FROM sales WHERE product_id = 101;
    
  5. Rethink the key if distribution is skewed: If HASH partitioning is set up correctly but data distribution is still uneven due to the nature of the data itself (e.g., one product_id has 99% of sales), changing the number of partitions won’t help, because every row with the same value lands in the same partition. Partition on a different column or expression, or use a different partitioning strategy.

  6. Consider KEY partitioning: PARTITION BY KEY() works similarly to HASH but uses MySQL’s internal hashing algorithm. It’s often more effective when partitioning by columns that are part of the primary key.

    ALTER TABLE sales PARTITION BY KEY(sale_id) PARTITIONS 4;
    

    This only helps queries that look rows up by sale_id; a query that filters on product_id would then have to scan all four partitions.

  7. Check for NULL values in the partitioning column: If the partitioning expression can evaluate to NULL, RANGE partitioning places the row in the lowest partition, LIST partitioning rejects it unless some partition’s value list explicitly includes NULL, and HASH and KEY partitioning treat NULL as 0. Ensure your strategy accounts for NULLs if they exist.
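The NULL handling for HASH can be checked directly. The sketch below uses a hypothetical null_demo table (not part of the sales schema above) with a nullable partitioning column, and relies on the explicit partition selection syntax available from MySQL 5.6.

```sql
-- Hypothetical table for illustration only.
CREATE TABLE null_demo (
    id INT NOT NULL,
    bucket INT NULL
) ENGINE=InnoDB
PARTITION BY HASH(bucket)
PARTITIONS 4;

INSERT INTO null_demo (id, bucket) VALUES
(1, NULL), -- NULL is treated as 0, so this row goes to p0
(2, 0),    -- MOD(0, 4) = 0, also p0
(3, 5);    -- MOD(5, 4) = 1, goes to p1

-- Read a single partition explicitly to see which rows it holds.
SELECT * FROM null_demo PARTITION (p0);
```

The final SELECT should return the rows with id 1 and 2, which shows that NULL and 0 share a partition and that a nullable hash key can quietly add to the load on p0.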

After fixing partitioning, the next hurdle is often managing partition maintenance, such as adding new partitions for future data or dropping old ones.
