Chapter 6: Mastering SQL for Data Science with Python#

The chapter introduces Structured Query Language (SQL), a powerful and essential tool for data professionals. SQL is the standard language used to communicate with and manage relational databases. This chapter focuses on its application in data science, demonstrating how to use SQL for data retrieval, manipulation, and analysis. It also covers the integration of SQL with Python, a crucial skill for any data scientist.

Introduction to SQL for Data Science#

Data has become the foundation of decision-making in modern organizations. From social media platforms storing billions of user interactions to hospitals managing electronic health records, most of this information is stored in databases. Among different types of databases, relational databases are the most widely used.

Relational databases organize data into tables, which consist of rows (records) and columns (attributes). This design reflects the way data naturally relates to entities in the real world. For example:

*A retail store has customers (with IDs, names, and ages).

  • Each customer places orders (with product details, dates, and amounts).

  • The relationship between customers and orders can be represented through keys.

To interact with these databases, we use a language called Structured Query Language (SQL).

Structured Query Language (SQL).#

SQL provides a standardized way to create, read, update, and delete data (commonly referred to as CRUD operations). Unlike programming languages such as Python or Java, SQL is declarative: you specify what you want, and the database figures out how to get it.

This makes it highly efficient for managing large datasets. For data science, SQL is invaluable for:

  • Data Retrieval: Extracting specific subsets of data from large databases.

  • Data Cleaning and Transformation: Handling missing values, standardizing formats, and creating new features.

  • Exploratory Data Analysis (EDA): Performing quick summaries, aggregations, and data profiling.

  • Feature Engineering: Creating new variables from existing ones before feeding them into machine learning models.

The Evolution of Databases and SQL#

  • Early data management: Before databases, organizations stored information in files. This approach led to redundancy, inconsistency, and inefficiency.

  • Birth of the relational model: In 1970, Edgar F. Codd introduced the relational model, a mathematical foundation for organizing data in tables with well-defined relationships.

  • Development of SQL: By the late 1970s, IBM developed SEQUEL (Structured English Query Language), which evolved into SQL. It became the ANSI standard in 1986.

  • SQL today: Almost every relational database system (MySQL, PostgreSQL, Oracle, SQL Server, SQLite) supports SQL, with minor dialect differences.

  • Takeaway: SQL is not just a programming tool—it is the backbone of modern data storage and analytics.

Relational Databases: Core Concepts#

Relational databases are the backbone of structured data storage. They organize information in a way that ensures consistency, integrity, and efficient retrieval. The fundamental ideas of tables, keys, and relationships help us understand how real-world data is modeled.

Tables, Rows, and Columns#

A table is like a spreadsheet.

  • Rows (records/tuples): Each row corresponds to a single entity or instance of data. For example, one row might represent a single customer.

  • Columns (fields/attributes): Each column stores one specific type of information about the entity, such as name, age, or gender.

  • Schema: The structure of the table, which defines what columns exist and their data types (e.g., integer, string, date).

Example: A Customers table may contain columns such as Customer_ID, Name, Age, Gender, and Email. Each row represents one customer.

Keys#

Keys in Relational Databases are crucial for ensuring that data remains unique and consistent across tables.

1. Primary Key (PK)#

The primary key uniquely identifies each row. It contain UNIQUE values in column, and does not allows NULL values.

SQL Candidate key
Here, Empid is a Primary Key. Example: **Customers Table**

Customer_ID

Name

Age

Gender

101

Alice

25

F

102

Bob

30

M

103

Charlie

28

M

  • Primary Key: Customer_ID

  • Ensures each customer is uniquely identifiable.


2. Foreign Key (FK)#

A foreign key links one table to another. It creates a relationship between two or more tables, a primary key of one table is referred as a foreign key in another table. It can also accept multiple null values and duplicate values.

SQL Candidate key

Orders Table

Order_ID

Customer_ID

Product

Quantity

5001

101

Laptop

1

5002

102

Keyboard

2

5003

101

Mouse

1

  • Customer_ID here is a foreign key connecting each order to the Customers table.

  • Prevents creating an order for a non-existent customer.


3. Composite Key#

Sometimes, no single column uniquely identifies a row. Composite Key is a combination of more than one columns of a table. It can be a Candidate key and Primary key.

SQL Candidate key

Enrollments Table

Student_ID

Course_ID

Grade

S001

CSE101

A

S001

MTH201

B+

S002

CSE101

A-

S002

PHY110

B

  • Neither Student_ID nor Course_ID alone is unique.

  • Together (Student_ID, Course_ID) form a composite key.

  • Ensures a student cannot enroll in the same course twice.


4. Candidate Key#

A candidate key is any column (or set of columns) that could serve as a primary key. Candidate Key(s) an identify a record uniquely in a table and which can be selected as a primary key of the table.

It contains UNIQUE values in column, and does not allows NULL values.

SQL Candidate key

Here, Empid, EmpLicence and EmpPassport are candidate keys.

Example: Employees Table

Employee_ID

Email

SSN

Name

E001

alice@company.com

123-45-6789

Alice

E002

bob@company.com

987-65-4321

Bob

E003

charlie@company.com

111-22-3333

Charlie

  • Possible unique identifiers:

    • Employee_ID

    • Email

    • SSN

  • Each is a candidate key.

  • One (e.g., Employee_ID) is chosen as the primary key.

Remember, Each table can have only one Primary key and multiple Candidate keys

PK-FK Relationships#

Relational databases use primary keys (PK) and foreign keys (FK) to maintain data integrity and model relationships.


Types of Relationships#

  • One-to-One (1:1): Each person has one passport; each passport belongs to one person.

SQL Candidate key
  • One-to-Many (1:N): A customer can have many orders; each order belongs to one customer.

SQL Candidate key
In this figure, a customer can have many accounts; each account belongs to one customer.
  • Many-to-Many (M:N): Students enroll in many courses; courses have many students.

SQL Candidate key

In this figure, each customer can buy more than one product and a product can be bought by many different customers.


Customers Table#

Customer_ID (PK)

Name

Age

Gender

101

Alice

25

F

102

Bob

30

M

103

Charlie

28

M


Orders Table#

Order_ID (PK)

Customer_ID (FK)

Product_ID (FK)

Quantity

5001

101

P100

1

5002

102

P101

2

5003

101

P102

1

  • Customer_ID is a foreign key referencing Customers.Customer_ID.

  • Product_ID is a foreign key referencing Products.Product_ID.


Products Table#

Product_ID (PK)

Product_Name

Price

P100

Laptop

1000

P101

Keyboard

50

P102

Mouse

30


Passports Table (One-to-One Example)#

Passport_ID (PK)

Customer_ID (FK)

Expiration_Date

P001

101

2030-12-31

P002

102

2031-06-30

P003

103

2030-09-15

  • Customer_ID is a foreign key referencing Customers.Customer_ID.

  • Each customer has exactly one passport.


Enrollments Table (Many-to-Many Example)#

Student_ID

Course_ID

Grade

S001

CSE101

A

S001

MTH201

B+

S002

CSE101

A-

S002

PHY110

B

  • Neither Student_ID nor Course_ID alone is unique.

  • The combination (Student_ID, Course_ID) forms a composite key.

  • Students can enroll in many courses, and courses can have many students.


Relationships Overview#

From Table

To Table

Type

Notes

Customers

Orders

1:N

One customer → many orders

Products

Orders

1:N

One product → many orders

Customers

Passports

1:1

One customer → one passport

Students

Courses

M:N

Implemented via Enrollments table


Key Points#

  • Primary Key (PK): Unique identifier for each record. Cannot be NULL.

  • Foreign Key (FK): Links a table to another table’s primary key. Maintains referential integrity.

  • Composite Key: Combination of columns used when a single column is not unique.

  • Candidate Key: Any column or combination of columns that could serve as a primary key.

  • Constraints: Rules to maintain data validity (e.g., NOT NULL, UNIQUE, CHECK, FOREIGN KEY).


The Role of SQL in Data Science#

Think of SQL as your conversation partner with the data. It’s a declarative language, which means you simply state your desired outcome, and the database handles the complex task of finding and organizing the data for you. This makes it incredibly efficient for handling massive datasets. A typical data science workflow using SQL might look like this:

  • Data Extraction: You use a SELECT query to pull a specific subset of data relevant to your project.

  • Data Wrangling: You perform initial cleaning, filtering (WHERE), and aggregation (GROUP BY) directly in the database.

  • Analysis: The prepared data is loaded into Python (often as a Pandas DataFrame) for more sophisticated analysis, modeling, and visualization.

Core SQL Commands: Your Essential Toolkit#

We have already learned that SQL is the standard language for managing and querying relational databases.

These core commands allow you to create tables, insert data, retrieve information, update records, and maintain data integrity.

Whether you are analyzing sales data, customer information, or product inventories, mastering these commands is essential for data-driven tasks.

Command

Purpose

Example

CREATE TABLE

Create a new table in the database

sql CREATE TABLE Customers (CustomerID INT PRIMARY KEY AUTO_INCREMENT, Name VARCHAR(50), Email VARCHAR(100));

INSERT INTO

Adds new rows of data to a table

sql INSERT INTO Customers (CustomerID, Name, Email) VALUES (1, 'John Doe', 'john.doe@example.com');

SELECT

Retrieves data from one or more tables

sql SELECT Name, Email FROM Customers;

WHERE

Filters records based on a condition

sql SELECT Name, Email FROM Customers WHERE CustomerID = 1;

UPDATE

Modifies existing data in a table

sql UPDATE Customers SET Email = 'john.doe@newdomain.com' WHERE CustomerID = 1;

DELETE

Removes rows from a table

sql DELETE FROM Customers WHERE CustomerID = 1;

DROP TABLE

Deletes the entire table and all its data

sql DROP TABLE Customers;

Notes:

  • CustomerID is the primary key and uses AUTO_INCREMENT to generate unique IDs automatically.

  • Always use WHERE in UPDATE and DELETE to avoid modifying all rows by mistake.

  • DROP TABLE permanently deletes the table and its data, so use with caution.