CMSC320 Textbook
Dr. Fardina Alam and Gavin Hung
Table of Contents
About the Book
Chapter 1 - Intro to Data Science
Chapter 2 - Intro to Data Science
Chapter 3 - Experimental Design
Chapter 4 - Pandas
- Pandas: Turning Raw Data into Usable Data
- What is Pandas?
- Why Use Pandas?
- Installing Pandas
- Loading Data with Pandas
- A Beginner’s Working Model
- Core Data Structures
- DataFrame: The Two-Dimensional Powerhouse
- Inspecting Data: First Contact
- Adding a new column to DataFrame
- Arithmetic Operations and Functions in Pandas
- Applying Built-in Pandas/Numpy Functions
- Filtering Data in Pandas
- Combining Multiple Conditions
- Filtering Strings
- Applying Aggregation Functions Directly to a DataFrame
- More Advanced: Filtering Data & Apply Statistical Functions
- Grouping Data with groupby
- Exporting Data in Pandas
- Exporting Pandas Data in Google Colab
- The Pandas Ecosystem: How It Fits In
- When to Use Pandas (And When Not To)
- Summary of the Chapter
- Interactive Pandas Playground
Chapter 5 - SQL
- Chapter 6: Mastering SQL for Data Science with Python
- The Evolution of Databases and SQL
- Relational Databases: Core Concepts
- PK-FK Relationships
- The “Big 6” Elements of a SQL Select Statement
- Some More SQL Essentials
- SQL JOINs: Combining Data from Multiple Tables
- Summary of the Chapter
- Interactive SQL Playground
Chapter 6 - Probability and Distributions
- Why Probability Matters in Data Science
- A Gentle Walk Through Probability: Making Sense of Uncertainty
- Sample Space and Events: Defining the World
- Assigning Probabilities to Events
- The Rules of the Game: Probability Axioms
- Conditional Probability: Updating Beliefs With Context
- Law of Total Probability
- Conditional Independence
- Expected Value: Summarizes Average Behavior
- Probability Distribution: From Single Events to Patterns
- Types of Probability Distributions and How They Connect to Data Science
- Statistical Distribution Explorer
- Central Limit Theorem (Why Normal Appears Everywhere)
- Summary of the Chapter
- Knowledge Check
- Mathematical Examples (with Solutions)
- Practice Problems (No Solutions, Try Yourself)
Chapter 7 - Descriptive Statistics
- Descriptive Statistics: Understanding Data Before Modeling
- The Four Pillars of Describing Data
- (I). Measures of Location: Where Is the Center?
- Pythagorean Means: Three Ways to Average
- (II) Measures of Shape: Skewness, Modality, and Distribution Behavior
- (II.a) Skewness: Which Direction Is the Tail?
- (II.b) Modality: How Many Peaks?
- (III) Measures of Variability: How Spread Out Is the Data?
- (IV) Correlation and Relationships Between Variables
- Summary of the Chapter
Chapter 8 - Hypothesis Testing
- Hypothesis Testing in Data Science
- Steps in Hypothesis Testing
- Step 1: Define the Hypotheses
- Step 3: Choose a Significance Level (α)
- Example Walkthrough
- What Hypothesis Testing Does NOT Tell You
- Two Common Mistakes (Errors)
- Why This Matters in Data Science
- Running a Real Hypothesis Test in Python
- Different Types of Statistical Tests
- Summary of the Chapter
- BONUS: Python Visualization
- Bonus: Animated Sampling Distribution
Chapter 9 - Exploratory Data Analysis
- The Story of Exploration
- Understanding Data Structure
- Understanding Variable Types
- Exploratory Data Analysis (EDA): A Practical Workflow
- Summarizing Variables: Categorical vs Numerical
- Exploring Relationships Between Variables
- Multivariate Exploration
- Handling Missing Data and Data Quality Issues
- Advanced EDA Techniques
- Quick EDA Template
- Case Study: Exploratory Data Analysis of the Titanic Dataset
- Mini Exercises: Try Yourself
- BONUS: Visual Cheat Sheet
- Summary of the Chapter
- Student EDA Checklist
- BONUS: EDA Example for Common Tasks
Chapter 12 - Intro to ML
- Introduction to Machine Learning
- Section 5. Your First Algorithm: k-Nearest Neighbors (KNN)
- Steps of standard KNN Algorithm:
- Section 5. Evaluation, Boundaries, and Generalization (Some more ML Concepts)
- Distance Metrics and Variants in KNN
- 1. Euclidean Distance (L2 Norm)
- 2. Manhattan (City Block) Distance (L1 Norm)
- 3. Hamming Distance
- 4. Cosine Similarity
- Weighted KNN
- Section 6: Ethics and Responsible ML
- Section 7: Hands-on Practice
- Section 8: Reflection Questions
- Knowledge Check
- Interactive K-Nearest Neighbors
Chapter 14 - Decision Tree
- Introduction: The Power of Simple Questions
- Section 1: Basic Concepts of Decision Tree
- Section 2: The Tree-Building Algorithm: A Greedy, Recursive Process
- Splitting Criteria (Attribute Selection Measures):
- Information Gain (IG)
- How Decision Trees Decide Which Feature to Split (Using Entropy and Information Gain)
- Detailed Example 01: How Decision Trees Decide Which Feature to Split
- Overfitting and Pruning
- Interactive Entropy and Information Gain
Chapter 17 - Clustering and PCA
Chapter 18 - Neural Network
- Chapter 18: Introduction to the Neural Network
- 18.1 Activation Functions
- Loss Functions
- Interactive Activation Functions
- Gradient Descent
- 18.2 Optimizers
- Training Techniques & Regularization
- Forward Pass: Step-by-Step
- Backward Pass (Backpropagation)
- Regularization Techniques
- Putting It All Together: How Neural Network Training Happens
- Neural Network Training: Batch Size, Iteration, and Epoch
- Interactive Forward Propagation
- Interactive Backward Propagation
- Chapter Summary
- Knowledge Check
Chapter 19 - Convolutional Neural Network
Chapter 20 - NLPs
- Introduction: What is Natural Language Processing?
- 19.1 The Importance of NLP
- 19.2 Historical Evolution of NLP
- 19.4 Fundamental NLP Tasks: Pipeline of NLP
- 19.4 Text Representation and Feature Extraction/Representation (Turning texts into vectors)
- 19.3 The Challenge of Human Language
- Interactive N-Gram
- Interactive Text Preprocessing
- Interactive Text Frequency
Chapter 21 - Graph Theory