DISTINCT VS GROUP BY
The DISTINCT statement retrieves unique values by removing duplicate rows from the result set based on the specified columns.
SELECT DISTINCT location_id, department_name
FROM departments;
The GROUP BY clause groups rows based on shared values in specified columns. It is typically used with aggregate functions to compute measures for each group. When no aggregate function is present, it yields the same result as DISTINCT and can filter out duplicates during the grouping process, depending on the query.
SELECT location_id, department_name,
COUNT (*) AS Count
FROM departments
GROUP BY location_id,department_name;
WHERE VS HAVING
The WHERE clause filters records based on specified condition and appears before the GROUP BY Clause.
It cannot be used with aggregate functions in the SQL Statement.
SELECT location_id,department_name,
COUNT (*) AS department_Count
FROM departments
WHERE department_name = 'IT'
GROUP BY location_id,department_name;
The HAVING clause is commonly used with the GROUP BY clause to filter groups based on specified conditions, and it can include aggregate functions in the SELECT statement.
SELECT location_id,department_name,
COUNT (*) AS department_Count
FROM departments
GROUP BY location_id,department_name
HAVING COUNT (department_name) > 2;
JOINS /UNION / UNION ALL
The JOINS combine all records and may include duplicates into new columns from multiple tables based on related columns.
SELECT department_name,city,country_name
FROM departments d
JOIN locations l
ON l.location_id = d.location_id
JOIN countries c
ON l.country_id = c.country_id;
The UNION Operator concatenates result sets from two or more SELECT statements, combining the records into new distinct rows.
SELECT country_name
FROM countries
UNION
SELECT country_name
FROM new_countries;
In Contrast, The UNION ALL Operator concatenates result sets from multiple SELECT statements while including duplicate records.
SELECT country_name
FROM countries
UNION ALL
SELECT country_name
FROM new_countries;
BETWEEN / IN / EXISTS
The BETWEEN Operator selects a range of data between two values (begin and end values).
SELECT country_name FROM countries
WHERE region_id
BETWEEN 1 AND 3;
Whereas the IN Operator specifies multiple values in the WHERE Clause.
SELECT country_name FROM countries
WHERE region_id IN (1, 2, 3);
The EXISTS Operator returns True or False based on the existence of the records in a subquery.
SELECT country_name FROM countries c
WHERE EXISTS (SELECT * FROM regions r WHERE c.region_id=r.region_id);
DROP/ DELETE /TRUNCATE
The DROP command is a part of Data Definition Language (DDL), removes a table from the database.
DROP TABLE regions;
The DELETE command is a part of Data Manipulation Language (DML), removes one or more records from a table.
DELETE FROM regions WHERE region_id = 2;
The TRUNCATE command, also a DDL operation, deletes all rows from an existing table.
TRUNCATE TABLE regions;
COALESCE VS NULLIF
The COALESCE () function returns the first non-null value from a list of arguments. While it can accept multiple arguments, they must all be of the same data type.
SELECT COALESCE(NULL,NULL,'US','UK') AS Country;
On the other hand, The NULLIF () function takes two values and returns the first value unless the two values are equal, in which case it returns NULL.
SELECT NULLIF ('US','US') AS Country;
AGGREGATE VS WINDOW FUNCTIONS
SQL aggregate functions are essential for performing calculations on groups of data.
AVG () computes the average of a numeric column.
SELECT ROUND (AVG (min_salary),2) FROM jobs;
SUM () adds up all the values in a numeric column.
SELECT ROUND(SUM(max_salary),2) AS Total_Max_salary FROM jobs;
MAX () retrieves the highest value in a specified column.
SELECT ROUND(MAX(max_salary),2) AS Highest_Max_Salary FROM jobs;
MIN () finds the lowest value in a specified column.
SELECT ROUND(MIN(max_salary),2) AS Lowest_Min_Salary FROM jobs;
COUNT () counts the number of rows in a result set.
SELECT COUNT(*) FROM jobs;
The above functions can be used with GROUP BY to perform calculations on subsets of data as well.
In SQL, WINDOW FUNCTIONS operate on a set of rows called a window frame. They return a single value for each row from the underlying query. A window function allows you to perform calculations across a set of rows related to the current row, and defining that set is done with the OVER () clause.
We can use AGGREGATE FUNCTION in window functions.
SELECT
job_id, min_salary,
SUM(min_salary) OVER (ORDER BY job_id) AS running_total
FROM jobs;
The PARTITION BY clause in a window function allows to define distinct groups (or partitions) within the result set.
SELECT
job_id, max_salary,
ROUND(AVG(max_salary) OVER (PARTITION BY job_id)::numeric,2) AS avg_max_salary
FROM jobs;
The ORDER BY clause within the OVER () clause sorts the result set in ascending or descending order.
SELECT
job_id,
max_salary,
RANK () OVER (ORDER BY max_salary DESC) AS max_salary_rank
FROM jobs;
The ROW_NUMBER () window function assigns a unique sequential integer to each row within a partition of a result set, starting at 1 for the first row. This numbering is based on the order specified in the ORDER BY clause within the OVER () statement.
SELECT
job_id,
max_salary,
ROW_NUMBER () OVER (ORDER BY max_salary DESC) AS max_salary_row_num
FROM jobs;
Both RANK () and DENSE_RANK () is used for ranking rows within a partition, but they handle ties (duplicate values) differently compared to ROW_NUMBER ()
In RANK () When there are identical values, they receive the same rank. If two rows are tied for rank 1, the next rank will be 3.
SELECT
job_id,
min_salary,
RANK() OVER (ORDER BY min_salary DESC) AS min_salary_rank
FROM jobs;
In Contrast, DENSE_RANK () identical rows receive the same rank and the next rank after ties will continue sequentially. So, if two rows are tied for rank 1, the following row will be ranked 2.
SELECT
job_id,
min_salary,
DENSE_RANK() OVER (ORDER BY min_salary DESC) AS min_salary_rank
FROM jobs;
The NTILE () function is a powerful window function in SQL that divides the result set into n buckets (or tiles) and assigns a bucket number to each row. This is particularly useful for identifying percentiles, quartiles, or any other subdivision of your data.
SELECT
job_id,
max_salary,
NTILE(4) OVER (ORDER BY max_salary) AS max_salary_quartile
FROM jobs;
LAG () function allows you to access data from a previous row in the same result set. It’s useful for comparing the current row’s value with a preceding row's value. Conversely, LEAD () function allows you to access data from a following row. It’s helpful when you want to compare the current row’s value with a subsequent row's value.
SELECT job_id, min_salary,
LAG(min_salary) OVER (ORDER BY min_salary desc) AS previous_min_salary,
LEAD(min_salary) OVER (ORDER BY min_salary DESC) AS next_min_salary
FROM jobs
ORDER BY min_salary desc;
CONCLUSION
SQL is an invaluable tool for data analysis, but its vast array of statements and functions can be daunting for beginners. In this guide, we have explored some fundamental concepts that often lead to confusion. Understanding these distinctions is crucial for effective data manipulation. Happy querying!