Chapter 6: Mastering SQL for Data Science with Python

Chapter 6: Mastering SQL for Data Science with Python#

The chapter introduces Structured Query Language (SQL), a powerful and essential tool for data professionals. SQL is the standard language used to communicate with and manage relational databases. This chapter focuses on its application in data science, demonstrating how to use SQL for data retrieval, manipulation, and analysis. It also covers the integration of SQL with Python, a crucial skill for any data scientist.

Introduction to SQL for Data Science#

Data has become the foundation of decision-making in modern organizations. From social media platforms storing billions of user interactions to hospitals managing electronic health records, most of this information is stored in databases. Among different types of databases, relational databases are the most widely used.

Relational databases organize data into tables, which consist of rows (records) and columns (attributes). This design reflects the way data naturally relates to entities in the real world. For example:

*A retail store has customers (with IDs, names, and ages).

Each customer places orders (with product details, dates, and amounts).
The relationship between customers and orders can be represented through keys.

To interact with these databases, we use a language called Structured Query Language (SQL).

Structured Query Language (SQL).#

SQL provides a standardized way to create, read, update, and delete data (commonly referred to as CRUD operations). Unlike programming languages such as Python or Java, SQL is declarative: you specify what you want, and the database figures out how to get it.

This makes it highly efficient for managing large datasets. For data science, SQL is invaluable for:

Data Retrieval: Extracting specific subsets of data from large databases.
Data Cleaning and Transformation: Handling missing values, standardizing formats, and creating new features.
Exploratory Data Analysis (EDA): Performing quick summaries, aggregations, and data profiling.
Feature Engineering: Creating new variables from existing ones before feeding them into machine learning models.

The Evolution of Databases and SQL#

Early data management: Before databases, organizations stored information in files. This approach led to redundancy, inconsistency, and inefficiency.
Birth of the relational model: In 1970, Edgar F. Codd introduced the relational model, a mathematical foundation for organizing data in tables with well-defined relationships.
Development of SQL: By the late 1970s, IBM developed SEQUEL (Structured English Query Language), which evolved into SQL. It became the ANSI standard in 1986.
SQL today: Almost every relational database system (MySQL, PostgreSQL, Oracle, SQL Server, SQLite) supports SQL, with minor dialect differences.
Takeaway: SQL is not just a programming tool—it is the backbone of modern data storage and analytics.

Relational Databases: Core Concepts#

Relational databases are the backbone of structured data storage. They organize information in a way that ensures consistency, integrity, and efficient retrieval. The fundamental ideas of tables, keys, and relationships help us understand how real-world data is modeled.

Tables, Rows, and Columns#

A table is like a spreadsheet.

Rows (records/tuples): Each row corresponds to a single entity or instance of data. For example, one row might represent a single customer.
Columns (fields/attributes): Each column stores one specific type of information about the entity, such as name, age, or gender.
Schema: The structure of the table, which defines what columns exist and their data types (e.g., integer, string, date).

Example: A Customers table may contain columns such as Customer_ID, Name, Age, Gender, and Email. Each row represents one customer.

Keys#

Keys in Relational Databases are crucial for ensuring that data remains unique and consistent across tables.

1. Primary Key (PK)#

The primary key uniquely identifies each row. It contain UNIQUE values in column, and does not allows NULL values.

Here, Empid is a Primary Key. Example: **Customers Table**

Customer_ID	Name	Age	Gender
101	Alice	25	F
102	Bob	30	M
103	Charlie	28	M

Primary Key: Customer_ID
Ensures each customer is uniquely identifiable.

2. Foreign Key (FK)#

A foreign key links one table to another. It creates a relationship between two or more tables, a primary key of one table is referred as a foreign key in another table. It can also accept multiple null values and duplicate values.

Orders Table

Order_ID	Customer_ID	Product	Quantity
5001	101	Laptop	1
5002	102	Keyboard	2
5003	101	Mouse	1

Customer_ID here is a foreign key connecting each order to the Customers table.
Prevents creating an order for a non-existent customer.

3. Composite Key#

Sometimes, no single column uniquely identifies a row. Composite Key is a combination of more than one columns of a table. It can be a Candidate key and Primary key.

Enrollments Table

Student_ID	Course_ID	Grade
S001	CSE101	A
S001	MTH201	B+
S002	CSE101	A-
S002	PHY110	B

Neither Student_ID nor Course_ID alone is unique.
Together (Student_ID, Course_ID) form a composite key.
Ensures a student cannot enroll in the same course twice.

4. Candidate Key#

A candidate key is any column (or set of columns) that could serve as a primary key. Candidate Key(s) an identify a record uniquely in a table and which can be selected as a primary key of the table.

It contains UNIQUE values in column, and does not allows NULL values.

Here, Empid, EmpLicence and EmpPassport are candidate keys.

Example: Employees Table

Employee_ID	Email	SSN	Name
E001	alice@company.com	123-45-6789	Alice
E002	bob@company.com	987-65-4321	Bob
E003	charlie@company.com	111-22-3333	Charlie

Possible unique identifiers:
- Employee_ID
- Email
- SSN
Each is a candidate key.
One (e.g., Employee_ID) is chosen as the primary key.

Remember, Each table can have only one Primary key and multiple Candidate keys

PK-FK Relationships#

Relational databases use primary keys (PK) and foreign keys (FK) to maintain data integrity and model relationships.

Types of Relationships#

One-to-One (1:1): Each person has one passport; each passport belongs to one person.

One-to-Many (1:N): A customer can have many orders; each order belongs to one customer.

In this figure, a customer can have many accounts; each account belongs to one customer.

Many-to-Many (M:N): Students enroll in many courses; courses have many students.

In this figure, each customer can buy more than one product and a product can be bought by many different customers.

Customers Table#

Customer_ID (PK)	Name	Age	Gender
101	Alice	25	F
102	Bob	30	M
103	Charlie	28	M

Orders Table#

Order_ID (PK)	Customer_ID (FK)	Product_ID (FK)	Quantity
5001	101	P100	1
5002	102	P101	2
5003	101	P102	1

Customer_ID is a foreign key referencing Customers.Customer_ID.
Product_ID is a foreign key referencing Products.Product_ID.

Products Table#

Product_ID (PK)	Product_Name	Price
P100	Laptop	1000
P101	Keyboard	50
P102	Mouse	30

Passports Table (One-to-One Example)#

Passport_ID (PK)	Customer_ID (FK)	Expiration_Date
P001	101	2030-12-31
P002	102	2031-06-30
P003	103	2030-09-15

Customer_ID is a foreign key referencing Customers.Customer_ID.
Each customer has exactly one passport.

Enrollments Table (Many-to-Many Example)#

Student_ID	Course_ID	Grade
S001	CSE101	A
S001	MTH201	B+
S002	CSE101	A-
S002	PHY110	B

Neither Student_ID nor Course_ID alone is unique.
The combination (Student_ID, Course_ID) forms a composite key.
Students can enroll in many courses, and courses can have many students.

Relationships Overview#

From Table	To Table	Type	Notes
Customers	Orders	1:N	One customer → many orders
Products	Orders	1:N	One product → many orders
Customers	Passports	1:1	One customer → one passport
Students	Courses	M:N	Implemented via Enrollments table

Key Points#

Primary Key (PK): Unique identifier for each record. Cannot be NULL.
Foreign Key (FK): Links a table to another table’s primary key. Maintains referential integrity.
Composite Key: Combination of columns used when a single column is not unique.
Candidate Key: Any column or combination of columns that could serve as a primary key.
Constraints: Rules to maintain data validity (e.g., NOT NULL, UNIQUE, CHECK, FOREIGN KEY).

The Role of SQL in Data Science#

Think of SQL as your conversation partner with the data. It’s a declarative language, which means you simply state your desired outcome, and the database handles the complex task of finding and organizing the data for you. This makes it incredibly efficient for handling massive datasets. A typical data science workflow using SQL might look like this:

Data Extraction: You use a SELECT query to pull a specific subset of data relevant to your project.
Data Wrangling: You perform initial cleaning, filtering (WHERE), and aggregation (GROUP BY) directly in the database.
Analysis: The prepared data is loaded into Python (often as a Pandas DataFrame) for more sophisticated analysis, modeling, and visualization.

Core SQL Commands: Your Essential Toolkit#

We have already learned that SQL is the standard language for managing and querying relational databases.

These core commands allow you to create tables, insert data, retrieve information, update records, and maintain data integrity.

Whether you are analyzing sales data, customer information, or product inventories, mastering these commands is essential for data-driven tasks.

Command	Purpose	Example
CREATE TABLE	Create a new table in the database	`sql CREATE TABLE Customers (CustomerID INT PRIMARY KEY AUTO_INCREMENT, Name VARCHAR(50), Email VARCHAR(100));`
INSERT INTO	Adds new rows of data to a table	`sql INSERT INTO Customers (CustomerID, Name, Email) VALUES (1, 'John Doe', 'john.doe@example.com');`
SELECT	Retrieves data from one or more tables	`sql SELECT Name, Email FROM Customers;`
WHERE	Filters records based on a condition	`sql SELECT Name, Email FROM Customers WHERE CustomerID = 1;`
UPDATE	Modifies existing data in a table	`sql UPDATE Customers SET Email = 'john.doe@newdomain.com' WHERE CustomerID = 1;`
DELETE	Removes rows from a table	`sql DELETE FROM Customers WHERE CustomerID = 1;`
DROP TABLE	Deletes the entire table and all its data	`sql DROP TABLE Customers;`

Notes:

CustomerID is the primary key and uses AUTO_INCREMENT to generate unique IDs automatically.
Always use WHERE in UPDATE and DELETE to avoid modifying all rows by mistake.
DROP TABLE permanently deletes the table and its data, so use with caution.