How to merge several rows into one row using SQL

I’m currently dealing with a database where data is spread across different rows. My goal is to merge everything into a single row for each employee. The database consists of two related tables, and I need to restructure the data for easier analysis.

Employee Table:

| EmployeeID | FullName |
|---|---|
| 100 | John Smith |
| 200 | Sarah Johnson |

Reviews Table:

| ReviewID | EmployeeID | Sequence | Subject | Feedback | Score |
|---|---|---|---|---|---|
| 1 | 100 | 1 | Project A | Great work | 5 |
| 2 | 100 | 2 | Project B | Needs improvement | 3 |
| 3 | 200 | 1 | Task X | Excellent | 4 |
| 4 | 200 | 2 | Task Y | Outstanding | 5 |

I aim for the output to appear like this:

| FullName | First Subject | First Score | Second Subject | Second Score |
|---|---|---|---|---|
| John Smith | Project A | 5 | Project B | 3 |
| Sarah Johnson | Task X | 4 | Task Y | 5 |

I’ve tried using multiple joins for this, but the performance isn’t adequate with my larger database. Below is what I have tried:

WITH review_data AS (
    SELECT EmployeeID, Sequence, Subject, Score
    FROM Reviews
)
SELECT 
    e.FullName,
    r1.Subject,
    r1.Score,
    r2.Subject,
    r2.Score
FROM Employee e
LEFT JOIN review_data r1 ON e.EmployeeID = r1.EmployeeID AND r1.Sequence = 1
LEFT JOIN review_data r2 ON e.EmployeeID = r2.EmployeeID AND r2.Sequence = 2

What would be the best method to enhance the speed and efficiency of this query?

What’s your table size? How many employees and reviews? Also, which DB are you using? You may just need an index on (EmployeeID, Sequence) in the Reviews table. That’s usually the bottleneck, not the query itself. Have you tried that yet?

The Problem:

You’re trying to merge data from multiple rows into a single row per employee in your database, but your current query using multiple joins is slow. You have two tables: Employee and Reviews, and you want to consolidate review data for each employee into a single row.

:thinking: Understanding the “Why” (The Root Cause):

Each LEFT JOIN in your query has to find the matching Reviews row for every employee, so the table is read once per sequence you pivot. Without an index on (EmployeeID, Sequence), each of those lookups can turn into a scan, and the cost grows with every additional sequence. Conditional aggregation avoids this: it joins Reviews once, groups by employee, and picks the value for each sequence with a CASE expression. On large tables that is usually cheaper than repeated self-joins, although the actual difference depends on your database system and indexes.
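As a quick sanity check before changing anything in production, the two query shapes can be compared on the sample data from the question. The sketch below uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in for your real system; the helper names build_sample_db and run are made up for this example.

```python
import sqlite3

# The self-join approach from the question (one join per sequence).
SELF_JOIN = """
SELECT e.FullName, r1.Subject, r1.Score, r2.Subject, r2.Score
FROM Employee e
LEFT JOIN Reviews r1 ON e.EmployeeID = r1.EmployeeID AND r1.Sequence = 1
LEFT JOIN Reviews r2 ON e.EmployeeID = r2.EmployeeID AND r2.Sequence = 2
ORDER BY e.EmployeeID
"""

# Conditional aggregation (Reviews is joined once).
CONDITIONAL_AGG = """
SELECT e.FullName,
       MAX(CASE WHEN r.Sequence = 1 THEN r.Subject END),
       MAX(CASE WHEN r.Sequence = 1 THEN r.Score END),
       MAX(CASE WHEN r.Sequence = 2 THEN r.Subject END),
       MAX(CASE WHEN r.Sequence = 2 THEN r.Score END)
FROM Employee e
LEFT JOIN Reviews r ON e.EmployeeID = r.EmployeeID
GROUP BY e.EmployeeID, e.FullName
ORDER BY e.EmployeeID
"""


def build_sample_db():
    """Create an in-memory database holding the sample rows from the question."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Employee (EmployeeID INTEGER PRIMARY KEY, FullName TEXT);
        CREATE TABLE Reviews (
            ReviewID INTEGER PRIMARY KEY, EmployeeID INTEGER, Sequence INTEGER,
            Subject TEXT, Feedback TEXT, Score INTEGER);
        INSERT INTO Employee VALUES (100, 'John Smith'), (200, 'Sarah Johnson');
        INSERT INTO Reviews VALUES
            (1, 100, 1, 'Project A', 'Great work', 5),
            (2, 100, 2, 'Project B', 'Needs improvement', 3),
            (3, 200, 1, 'Task X', 'Excellent', 4),
            (4, 200, 2, 'Task Y', 'Outstanding', 5);
    """)
    return conn


def run(conn, sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()


conn = build_sample_db()
# True when both approaches produce identical rows.
print(run(conn, SELF_JOIN) == run(conn, CONDITIONAL_AGG))
```

This only shows that the two queries agree on the result; timing has to be measured on your own data and database.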

:gear: Step-by-Step Guide:

  1. Use Conditional Aggregation: Replace the multiple JOINs with conditional aggregation using CASE expressions. The query groups the rows by employee, and each CASE returns a value only for its own sequence and NULL otherwise. MAX ignores NULLs, so it simply collapses each group to that one value. The following query produces your desired output:
SELECT 
    e.FullName,
    MAX(CASE WHEN r.Sequence = 1 THEN r.Subject END) AS "First Subject",
    MAX(CASE WHEN r.Sequence = 1 THEN r.Score END) AS "First Score",
    MAX(CASE WHEN r.Sequence = 2 THEN r.Subject END) AS "Second Subject",
    MAX(CASE WHEN r.Sequence = 2 THEN r.Score END) AS "Second Score"
FROM Employee e
LEFT JOIN Reviews r ON e.EmployeeID = r.EmployeeID
GROUP BY e.EmployeeID, e.FullName;
  2. Optimize with Indexing: For larger datasets, index the columns used in the join and in the CASE conditions. Here a single composite index on (EmployeeID, Sequence) in the Reviews table is enough, for example CREATE INDEX idx_reviews_employeeid_sequence ON Reviews (EmployeeID, Sequence); (this syntax works in MySQL, PostgreSQL, and SQL Server).

  3. Verify Database System: The optimal solution might vary slightly depending on the database system you are using (MySQL, PostgreSQL, SQL Server, etc.). If performance issues persist, check the query's execution plan and consult the documentation for your database system for further optimization techniques.
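Whether the index actually gets used is something the query plan can tell you. Below is a rough sketch using SQLite through Python's sqlite3 module as a stand-in; EXPLAIN QUERY PLAN is SQLite's syntax, and other systems expose the same information through their own EXPLAIN or execution-plan tools. The function name explain_with_index is made up for this example, and the plan text differs between SQLite versions, so read it rather than parse it.

```python
import sqlite3

QUERY = """
SELECT e.FullName,
       MAX(CASE WHEN r.Sequence = 1 THEN r.Subject END),
       MAX(CASE WHEN r.Sequence = 2 THEN r.Subject END)
FROM Employee e
LEFT JOIN Reviews r ON e.EmployeeID = r.EmployeeID
GROUP BY e.EmployeeID, e.FullName
"""


def explain_with_index():
    """Create the schema plus the composite index and return (index names, plan lines)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Employee (EmployeeID INTEGER PRIMARY KEY, FullName TEXT);
        CREATE TABLE Reviews (
            ReviewID INTEGER PRIMARY KEY, EmployeeID INTEGER, Sequence INTEGER,
            Subject TEXT, Feedback TEXT, Score INTEGER);
        CREATE INDEX idx_reviews_employeeid_sequence
            ON Reviews (EmployeeID, Sequence);
    """)
    # Column 1 of PRAGMA index_list is the index name.
    indexes = [row[1] for row in conn.execute("PRAGMA index_list(Reviews)")]
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail text.
    plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]
    return indexes, plan


indexes, plan = explain_with_index()
print(indexes)
for line in plan:
    print(line)
```

If the index name shows up in the line for Reviews, the lookup is using it; if that line describes a scan of the whole table instead, the index is not helping this query.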

:mag: Common Pitfalls & What to Check Next:

  • Data Integrity: Ensure that the Sequence column in your Reviews table accurately reflects the order of reviews. Inconsistencies here could lead to incorrect results.
  • NULL Handling: Your LEFT JOIN ensures that all employees are included in the result, even if they have no reviews. Consider how you want to handle NULL values – for example, replacing them with a default value like “N/A” or 0.
  • Scalability: For significantly larger datasets, consider further optimization techniques, such as partitioning or materialized views, if your database system supports them.
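For the NULL handling point, one option is to wrap each aggregate in COALESCE. The sketch below again uses an in-memory SQLite database via Python as a stand-in; employee 300 ("Alex Doe") is a made-up row with no reviews, and the defaults 'N/A' and 0 are just the ones mentioned above.

```python
import sqlite3

QUERY_WITH_DEFAULTS = """
SELECT e.FullName,
       COALESCE(MAX(CASE WHEN r.Sequence = 1 THEN r.Subject END), 'N/A'),
       COALESCE(MAX(CASE WHEN r.Sequence = 1 THEN r.Score END), 0),
       COALESCE(MAX(CASE WHEN r.Sequence = 2 THEN r.Subject END), 'N/A'),
       COALESCE(MAX(CASE WHEN r.Sequence = 2 THEN r.Score END), 0)
FROM Employee e
LEFT JOIN Reviews r ON e.EmployeeID = r.EmployeeID
GROUP BY e.EmployeeID, e.FullName
ORDER BY e.EmployeeID
"""


def rows_with_defaults():
    """Return pivoted rows where missing reviews show defaults instead of NULL."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Employee (EmployeeID INTEGER PRIMARY KEY, FullName TEXT);
        CREATE TABLE Reviews (
            ReviewID INTEGER PRIMARY KEY, EmployeeID INTEGER, Sequence INTEGER,
            Subject TEXT, Feedback TEXT, Score INTEGER);
        -- Employee 300 is hypothetical and has no reviews at all.
        INSERT INTO Employee VALUES (100, 'John Smith'), (300, 'Alex Doe');
        INSERT INTO Reviews VALUES
            (1, 100, 1, 'Project A', 'Great work', 5),
            (2, 100, 2, 'Project B', 'Needs improvement', 3);
    """)
    return conn.execute(QUERY_WITH_DEFAULTS).fetchall()


for row in rows_with_defaults():
    print(row)
```

Keep in mind that a default of 0 is indistinguishable from a real score of 0, so leaving the NULLs in place may be the better choice if the output feeds further analysis.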

:speech_balloon: Still running into issues? Share your (sanitized) config files, the exact command you ran, and any other relevant details. The community is here to help!

The Problem:

You’re trying to optimize a slow SQL Server stored procedure that uses a WHILE loop to process payment calculations for a large number of policies (300K). The current loop iterates row-by-row, leading to significant performance issues. The goal is to replace the loop with a set-based approach using CTEs and bulk updates while maintaining the exact same output.

:thinking: Understanding the “Why” (The Root Cause):

The bottleneck is the WHILE loop itself: it handles one policy per iteration and runs several subqueries each time, so 300K policies mean hundreds of thousands of small statements, each executed and logged on its own. SQL Server is optimized for set-based work and can process the same rows far more cheaply as one statement. Replacing the loop with a set-based approach, using CTEs (Common Table Expressions) and bulk UPDATE statements, lets the optimizer plan the whole operation at once, and window functions inside the CTEs cover the calculations the loop used to do iteratively.
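The difference between the two styles can be seen in a toy example. The sketch below is not the payment procedure: it uses a made-up policy table in an in-memory SQLite database (via Python's sqlite3) and a deliberately simple calculation, only to show that one set-based UPDATE produces the same rows as a per-row loop.

```python
import sqlite3


def make_db(n=1000):
    """Build a hypothetical policy table with n rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE policy (policy_id INTEGER PRIMARY KEY, "
        "premium INTEGER, paid INTEGER, outstanding INTEGER)"
    )
    conn.executemany(
        "INSERT INTO policy (policy_id, premium, paid) VALUES (?, ?, ?)",
        [(i, 100 + i, i % 50) for i in range(1, n + 1)],
    )
    return conn


def update_row_by_row(conn):
    """Loop style: one UPDATE statement per policy."""
    ids = [row[0] for row in conn.execute("SELECT policy_id FROM policy")]
    for policy_id in ids:
        conn.execute(
            "UPDATE policy SET outstanding = premium - paid WHERE policy_id = ?",
            (policy_id,),
        )


def update_set_based(conn):
    """Set-based style: a single UPDATE covers every row."""
    conn.execute("UPDATE policy SET outstanding = premium - paid")


def snapshot(conn):
    """Return (policy_id, outstanding) pairs in a stable order."""
    return conn.execute(
        "SELECT policy_id, outstanding FROM policy ORDER BY policy_id"
    ).fetchall()


loop_db, set_db = make_db(), make_db()
update_row_by_row(loop_db)
update_set_based(set_db)
print(snapshot(loop_db) == snapshot(set_db))
```

The loop issues one statement per row while the set-based version issues one in total, which is where the savings come from as the row count grows.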

:gear: Step-by-Step Guide:

  1. Implement a Set-Based Approach Using a CTE and a Temp Table: Replace the WHILE loop with a single CTE that computes every value in one pass. A CTE is visible only to the one statement that follows it, so the result is materialized into a temp table (#payment_calc, a name chosen here for illustration) that the UPDATE and INSERT statements can share. Treat this as a sketch to adapt to your schema, not a drop-in replacement.

    WITH CalculationCTE AS (
        -- This CTE performs all the calculations previously done in the loop
        SELECT 
            pm.record_id,  -- key used to join back to policy_transaction_main
            pb.contract_id,
            pb.system_id,
            pb.premium_amount as payment_amount,
            CASE 
                WHEN pm.total_premium = 0 THEN CAST(CASE WHEN ISNULL(pm.outstanding_premium,0) <> 0 THEN pm.outstanding_premium ELSE pm.premium_change END AS NUMERIC(18,3))
                ELSE CASE 
                        WHEN CAST(((CAST(CASE WHEN ISNULL(pm.outstanding_premium,0) <> 0 THEN pm.outstanding_premium ELSE pm.premium_change END AS NUMERIC(18,3)) / pm.total_premium) * pb.premium_amount) AS NUMERIC(18,2)) > CAST(CASE WHEN ISNULL(pm.outstanding_premium,0) <> 0 THEN pm.outstanding_premium ELSE pm.premium_change END AS NUMERIC(18,3)) AND CAST(CASE WHEN ISNULL(pm.outstanding_premium,0) <> 0 THEN pm.outstanding_premium ELSE pm.premium_change END AS NUMERIC(18,3)) > 0
                            THEN CAST(CASE WHEN ISNULL(pm.outstanding_premium,0) <> 0 THEN pm.outstanding_premium ELSE pm.premium_change END AS NUMERIC(18,3))
                            ELSE CAST(((CAST(CASE WHEN ISNULL(pm.outstanding_premium,0) <> 0 THEN pm.outstanding_premium ELSE pm.premium_change END AS NUMERIC(18,3)) / pm.total_premium) * pb.premium_amount) AS NUMERIC(18,2))
                    END
            END as calculated_commission,
            CASE 
                WHEN pm.total_premium = 0 THEN CAST(CASE WHEN ISNULL(pm.outstanding_premium,0) <> 0 THEN pm.outstanding_premium ELSE pm.premium_change END AS NUMERIC(18,3))
                ELSE CAST(CASE WHEN ISNULL(pm.outstanding_premium,0) <> 0 THEN pm.outstanding_premium ELSE pm.premium_change END AS NUMERIC(18,3)) - CAST(((CAST(CASE WHEN ISNULL(pm.outstanding_premium,0) <> 0 THEN pm.outstanding_premium ELSE pm.premium_change END AS NUMERIC(18,3)) / pm.total_premium) * pb.premium_amount) AS NUMERIC(18,2))
            END as calculated_outstanding,
            pm.total_premium as total_amount,
            ISNULL(pb.has_payment,0) as has_payment,
            pb.process_date,
            pb.transaction_id,
            pb.original_transaction_id,
            pb.created_date as payment_created_date_new,
            ROW_NUMBER() OVER (PARTITION BY pb.contract_id, pb.system_id ORDER BY pb.created_date) as rn  -- orders payments within each contract/system pair
        FROM payment_batch_temp pb
        JOIN (SELECT SUM(CAST(CASE WHEN ISNULL(mt.outstanding_premium,0) <> 0 THEN mt.outstanding_premium ELSE mt.premium_change END AS NUMERIC(18,3))) OVER(PARTITION BY mt.contract_id,mt.system_id) as total_premium, mt.*
              FROM (SELECT *, DENSE_RANK() OVER(PARTITION BY a.contract_id,a.system_id ORDER BY a.effective_date,a.expiration_date ASC,contract_year,transaction_ref_id) as date_rank
                    FROM policy_transaction_main a
                    WHERE batch_id = @current_batch_id AND processed_status IN ('N','2')
              ) mt
              JOIN payment_batch_temp pi ON mt.contract_id = pi.contract_id AND mt.system_id = pi.system_id
                                          AND (ISNULL(mt.premium_type,'STANDARD')) = ISNULL(pi.premium_type,'STANDARD')
              WHERE mt.batch_id = @current_batch_id
                AND CAST(CASE WHEN ISNULL(mt.outstanding_premium,0) <> 0 THEN mt.outstanding_premium ELSE mt.premium_change END AS NUMERIC(18,3)) <> 0
                AND mt.record_type IN (1,4)
                AND mt.processed_status IN ('N','2')
                AND pi.batch_num = @min_batch
                AND (ISNULL(pi.agent_id,0) = 0 OR ISNULL(pi.agent_id,0) = mt.agent_id)
                AND date_rank = 1
         ) pm ON pb.contract_id = pm.contract_id AND pb.system_id = pm.system_id AND (ISNULL(pb.premium_type,'STANDARD')) = ISNULL(pm.premium_type,'STANDARD')
        WHERE pm.batch_id = @current_batch_id
          AND pm.total_premium <> 0
          AND pm.record_type IN (1,4)
          AND pm.processed_status IN ('N','2')
    )
    -- Materialize once so all three statements below read the same rows.
    SELECT *
    INTO #payment_calc
    FROM CalculationCTE;

    -- If several payment rows match one record_id, UPDATE ... FROM applies an
    -- arbitrary one of them. Reduce #payment_calc to one row per record_id first
    -- (for example by filtering on rn) if that can happen in your data.
    UPDATE ptm
    SET ptm.commission_premium = pd.calculated_commission,
        ptm.outstanding_premium = pd.calculated_outstanding,
        ptm.processed_status = CASE WHEN pd.calculated_outstanding = 0 AND pd.total_amount <> 0 AND pd.has_payment = 1 THEN '1' ELSE CASE WHEN pd.total_amount <> 0 AND pd.has_payment = 1 THEN '2' ELSE 'N' END END,
        ptm.collected_premium = ptm.term_premium - pd.calculated_outstanding,
        ptm.has_payment_flag = CASE WHEN pd.has_payment = 0 THEN 0 WHEN pd.total_amount <> 0 THEN 1 ELSE 0 END,
        ptm.payment_process_date = pd.process_date,
        ptm.payment_created_date = pd.payment_created_date_new,
        ptm.process_step = 'STEP: 3', 
        ptm.notes = 'Applied payment processing for batch - ' + CAST(@current_batch_id AS VARCHAR(100))
    FROM policy_transaction_main ptm
    JOIN #payment_calc pd ON ptm.record_id = pd.record_id;
    
    INSERT INTO payment_processing_log (transaction_id)
    SELECT transaction_id
    FROM #payment_calc;
    
    UPDATE payment_batch_temp
    SET processed_status = 1
    WHERE transaction_id IN (SELECT transaction_id FROM payment_processing_log);
    
    
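  2. Indexing: Index the join columns. A composite index on (contract_id, system_id) in both payment_batch_temp and policy_transaction_main usually helps these joins the most. A table can have only one clustered index, so on policy_transaction_main this will typically be a non-clustered index unless the table is currently a heap. Consider non-clustered indexes on other frequently filtered columns as well, such as batch_id and processed_status.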

  3. Verify Results: Carefully compare the results of the set-based query with the results produced by the original WHILE loop to ensure complete accuracy.
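One way to do that comparison is to save both outputs into tables and diff them with EXCEPT in both directions. The sketch below uses an in-memory SQLite database via Python as a stand-in (EXCEPT also exists in SQL Server); the table names loop_result and set_result, their columns, and the helper names are all made up for this example.

```python
import sqlite3


def diff_results(conn, left, right):
    """Return (rows only in left, rows only in right).

    Table names are interpolated into the SQL, so pass trusted identifiers only.
    EXCEPT compares distinct rows, so differing duplicate counts go unnoticed.
    """
    only_left = conn.execute(
        f"SELECT * FROM {left} EXCEPT SELECT * FROM {right}"
    ).fetchall()
    only_right = conn.execute(
        f"SELECT * FROM {right} EXCEPT SELECT * FROM {left}"
    ).fetchall()
    return only_left, only_right


def make_demo_db():
    """Two hypothetical result tables that disagree on record 2."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE loop_result (record_id INTEGER, outstanding INTEGER, status TEXT);
        CREATE TABLE set_result  (record_id INTEGER, outstanding INTEGER, status TEXT);
        INSERT INTO loop_result VALUES (1, 100, '1'), (2, 250, '2'), (3, 0, 'N');
        INSERT INTO set_result  VALUES (1, 100, '1'), (2, 255, '2'), (3, 0, 'N');
    """)
    return conn


only_loop, only_set = diff_results(make_demo_db(), "loop_result", "set_result")
print("only in loop output:", only_loop)
print("only in set-based output:", only_set)
```

Two empty lists mean the outputs match row for row (ignoring duplicates); anything else points directly at the records to investigate.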

:mag: Common Pitfalls & What to Check Next:

  • Data Type Mismatches: Double-check that all data types used in calculations are compatible and consistent across tables. Implicit conversions can lead to unexpected results and performance issues.
  • NULL Handling: Explicitly handle NULL values in your CASE statements and calculations to prevent errors.
  • Error Handling: CTEs cannot carry their own error handling. Wrap the whole batch in a transaction inside TRY...CATCH so that a failure in any statement rolls back the others.
  • Statistics Update: After making significant schema changes (indexes, etc.), update statistics using UPDATE STATISTICS to ensure the query optimizer has the latest information.
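The all-or-nothing behaviour you want from error handling can be illustrated outside SQL Server. The sketch below uses Python's sqlite3 module, where using the connection as a context manager commits on success and rolls back if an exception is raised; in T-SQL the equivalent would be BEGIN TRANSACTION inside TRY...CATCH. The tables policy and processing_log and the function names are made up for this example.

```python
import sqlite3


def make_db():
    """Hypothetical tables: policies to update and a log with a unique key."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE policy (policy_id INTEGER PRIMARY KEY, status TEXT);
        CREATE TABLE processing_log (transaction_id INTEGER PRIMARY KEY);
        INSERT INTO policy VALUES (1, 'N'), (2, 'N');
    """)
    conn.commit()
    return conn


def apply_batch(conn, log_ids):
    """Run the UPDATE and the log INSERTs as one unit; return True on success."""
    try:
        with conn:  # commits on success, rolls back if the block raises
            conn.execute("UPDATE policy SET status = '1'")
            conn.executemany(
                "INSERT INTO processing_log (transaction_id) VALUES (?)",
                [(i,) for i in log_ids],
            )
        return True
    except sqlite3.IntegrityError:
        # A duplicate transaction_id aborts the batch; the UPDATE is undone too.
        return False


def statuses(conn):
    return [row[0] for row in conn.execute("SELECT status FROM policy ORDER BY policy_id")]


conn = make_db()
print(apply_batch(conn, [1, 1]), statuses(conn))  # failing batch leaves statuses untouched
print(apply_batch(conn, [1, 2]), statuses(conn))  # clean batch applies everything
```

The point of the pattern is that a failed INSERT never leaves policies marked as processed without a matching log entry.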
