As an SQL developer, you may be well-versed in the power and versatility of JOINs, essential for combining data from multiple tables. While JOINs are indispensable tools, filtering joined data introduces a set of challenges and pitfalls that even seasoned professionals may encounter. Filtering joined data plays a pivotal role in shaping the accuracy and specificity of your SQL queries. It enables you to retrieve precisely the data you need, narrowing down results based on specific criteria. We'll start with a concise overview of JOINs, both INNER JOIN and LEFT JOIN, to lay a solid foundation. Then, we'll delve into the challenges that arise when applying filtering conditions to your JOINed data.
Table Introduction: Instagram Examples
In this table introduction, we present two essential tables from our Instagram examples: the users
table and the posts_upload
table.
In users
table, we store key attributes of users, such as their unique user_id
, chosen username
, number of followers
and following
, and an intriguing measure of their online presence - the reputation
.
In posts_upload
table each entry corresponds to a distinct post and is linked to its creator through the user_id
attribute, which establishes a connection with the users
table. Alongside the caption
, the likes
, comments
, and creation_date
provide vital insights into the post's popularity and the timeline of its creation.
Part 1: The Solid Foundation - INNER JOIN vs. LEFT JOIN
1. INNER JOIN:
INNER JOIN combines rows from two tables based on a common column, known as the join condition.
It filters out rows that do not have matching values in both tables.
The result contains only the rows that meet the matching criteria from both tables.
SELECT u.*,pu.post_id,pu.creation_date FROM users as u JOIN posts_upload as pu ON u.user_id = pu.user_id;
output:
When we did Inner Join, user_id
the present in both the table are matched.
2. LEFT JOIN:
LEFT JOIN returns all rows from the left table and matching rows from the right table.
When there's no match in the right table, it fills the result with NULL values for the right table columns.
This makes LEFT JOIN perfect for keeping all rows from the left table, even if there are no corresponding matches in the right table.
SELECT u.*,pu.post_id,pu.creation_date FROM users as u LEFT JOIN posts_upload as pu ON u.user_id = pu.user_id;
output:
The user_id
100 has not created any posts yet.
Part 2: Accidental INNER JOIN - Beware the WHERE Clause!
Now that we've established the foundation, let's dive into the hidden intricacies of LEFT JOIN. Brace yourself for the accidental INNER JOIN!
Assume you want to see the users who have not created any posts between the '2023-07-16' and '2023-07-18'
SELECT u.*,pu.post_id,pu.creation_date
FROM users as u LEFT JOIN posts_upload as pu
ON u.user_id = pu.user_id
WHERE pu.creation_date NOT BETWEEN '2023-07-16' AND '2023-07-18';
output:
There is a problem with this query because if you pay attention to the user with the user_id
100 is missing from the above output. yes, the user_id
100 has not yet created any posts which means he has to be in the above output but is not present.
This happened because SQL doesn't evaluate value comparisons to TRUE when one of the values being compared is NULL. Which makes the above query accidentally behave like an INNER JOIN.
Part 3: Preserving LEFT JOIN - Filter Conditions Done Right
It's time to correct our mistakes and discover the art of preserving the LEFT JOIN behavior while still applying filtering conditions.
1. Filter Conditions in the JOIN:
To preserve the LEFT JOIN and filter rows in the right table, move the condition from the WHERE clause to the JOIN condition.
This ensures the filtering occurs before the LEFT JOIN, retaining NULL values in the result.
SELECT u.*,pu.post_id,pu.creation_date FROM users as u LEFT JOIN posts_upload as pu ON u.user_id = pu.user_id AND pu.creation_date NOT BETWEEN '2023-07-16' AND '2023-07-18';
output:
Now have a look at the user with
user_id
100, who is present in the above output and has not created any posts between the '2023-07-16' and '2023-07-18'.
2. Add NULL filter in the WHERE clause:
We can add a second condition that checks for rows that
pu.user_id
is NULL. This condition ensures that any users without corresponding posts in theposts_upload
table are included in the result set.SELECT u.*,pu.post_id,pu.creation_date FROM users as u LEFT JOIN posts_upload as pu ON u.user_id = pu.user_id WHERE (pu.creation_date NOT BETWEEN '2023-07-16' AND '2023-07-18') OR (pu.user_id IS NULL) ;
output:
The result will include all users, along with their corresponding posts except for those uploaded between '2023-07-16' and '2023-07-18'. Users without any posts (with NULL
user_id
inposts_upload
) will also be included.Conclusion:
In this blog post, we explored the challenges and pitfalls of filtering joined data, and we learned two crucial techniques to preserve the desired LEFT JOIN behavior.
First, by moving the filtering conditions to the JOIN clause, we ensure that filtering occurs before the JOIN, avoiding accidental exclusions of rows. Second, adding a NULL filter in the WHERE clause enables us to include rows without matching data in the right table.
Mastering these approaches empowers SQL developers to retrieve accurate and meaningful results, enhancing the precision and reliability of data analysis and reporting. By understanding the art of filtering joined data, we can make the most of SQL's power and versatility while avoiding common traps and pitfalls.
I hope you found this blog post useful. In the next section, we will see the concept of Granularity in SQL. Make sure to check it.
Thank you for reading!
HAPPY ANALYSIS !!!!!!!!!!!