Skip to content

Astik97/EdStatsAnalysis

Repository files navigation

πŸŽ“ World Bank Education Statistics β€” Data Analysis Pipeline

Python Pandas MySQL Jupyter License

A complete end-to-end data analysis pipeline on the World Bank Education Statistics dataset β€” covering data loading, cleaning, EDA, statistical analysis, visualization, MySQL storage, and Excel dashboard export.


πŸ“Œ Project Overview

This project analyzes global education indicators across 886,930 rows and 69 columns spanning years 1970 to 2075, covering countries worldwide. The goal is to extract meaningful insights on education trends β€” literacy rates, enrollment rates, government expenditure, and more β€” using a full data pipeline built in Python.


πŸ“‚ Dataset

Property Details
Name Education Statistics (EdStatsData.csv)
Source World Bank via Kaggle
Kaggle Link https://www.kaggle.com/datasets/theworldbank/education-statistics
Provider The World Bank Open Data
Size ~300 MB (886,930 rows Γ— 69 columns)
Coverage 200+ countries, 4,000+ indicators, 1970–2075
License World Bank Open Data License (CC BY 4.0)

⚠️ The raw CSV file is NOT included in this repository due to its large size.
Download it directly from Kaggle using the link above.

Dataset Columns

Column Description
Country Name Full name of the country
Country Code ISO 3-letter country code
Indicator Name Full name of the education indicator
Indicator Code World Bank indicator code
1970 – 2075 Indicator value for each year

πŸ—‚οΈ Repository Structure

EdStatsAnalysis/
β”‚
β”œβ”€β”€EdStatsCode.ipynb          # Main analysis notebook (11 sections)
β”œβ”€β”€ EdStats_Dashboard_Data.xlsx   # Excel dashboard (3 summary sheets)
β”œβ”€β”€ README.md                     # Project documentation
β”œβ”€β”€ .gitignore                    # Excludes CSV, cache, checkpoints
β”‚
β”œβ”€β”€ plot_global_trend.png         # Global average value trend (1970–2023)
β”œβ”€β”€ plot_top_countries.png        # Top 10 countries by data availability
β”œβ”€β”€ plot_distribution.png         # Value distribution histogram + KDE
└── plot_regression.png           # Linear regression β€” actual vs predicted

πŸ” Pipeline

Kaggle CSV (886k rows)
        ↓
Chunked Memory-Safe Loading (50k rows/chunk)
        ↓
Data Cleaning & Null Handling
        ↓
Wide β†’ Long Format Reshape (melt)
        ↓
NumPy Statistical Analysis
        ↓
Matplotlib & Seaborn Visualizations
        ↓
Linear Regression (sklearn)
        ↓
MySQL Storage (chunked push, 10k rows/chunk)
        ↓
Excel Dashboard Export (3 sheets)

πŸ““ Notebook Sections

Section Description
1 Import libraries
2 Memory-safe chunked CSV loading with dtype optimization
3 Basic EDA β€” shape, dtypes, missing values
4 Wide β†’ Long format reshape using melt()
5 NumPy statistical analysis (mean, median, std, percentiles, decade stats)
6 Reusable filter helper functions
7 5 visualizations β€” trend, bar, distribution, country filter, heatmap
8 Linear regression with StandardScaler, RΒ² and RMSE evaluation
9 Excel export β€” 3 summarized sheets for dashboard
10 MySQL push β€” chunked, memory-safe
11 Power BI / Excel dashboard guide

πŸ“Š Excel Dashboard Sheets

Sheet Contents Best Visual in Excel
Country_Year_Avg Average value per country per year Line chart with country slicer
Indicator_Summary Data count, mean, std per indicator Bar chart β€” top indicators
Decade_Country_Avg Average value per country per decade PivotTable matrix / heatmap

πŸ› οΈ Tech Stack

Tool Purpose
Python 3.10+ Core programming language
Pandas Data loading, cleaning, reshaping
NumPy Vectorized statistical computations
Matplotlib Static plotting
Seaborn Statistical visualizations
Scikit-learn Linear regression, scaling, train-test split
SQLAlchemy + PyMySQL MySQL database connection
OpenPyXL Excel file export
MySQL 8.0 Relational database storage
Jupyter Notebook Interactive development environment

βš™οΈ Setup & Installation

1. Clone the repository

git clone https://github.com/yourusername/edstats-analysis.git
cd edstats-analysis

2. Install dependencies

pip install pandas numpy matplotlib seaborn scikit-learn sqlalchemy pymysql openpyxl jupyter

3. Download the dataset

4. Set up MySQL (optional β€” for Section 10)

CREATE DATABASE ed_stats_data_db;
USE ed_stats_data_db;

Then update Section 10 in the notebook with your MySQL credentials.

5. Run the notebook

jupyter notebook EdStatsCode.ipynb

Run cells top to bottom. Section 10 (MySQL push) is optional β€” set PUSH_TO_MYSQL = True only when MySQL is configured.


πŸ”‘ Key Fixes from Original Notebook

Original Issue Fix Applied
Dropped Country/Indicator columns Retained all 4 ID columns throughout
X = 1970 col, y = 1971 col (meaningless) X = Year, y = Value per filtered country+indicator
OneHotEncoder on numeric data Removed β€” not applicable
StandardScaler applied after model fit Scaler now applied before fitting
Plots on future NaN columns (2080–2100) Only historical years 1970–2023 used
No memory management Chunked loading + dtype spec + del after concat

πŸ“ˆ Sample Visualizations

Plot File
Global average indicator trend Global Trend
Top 10 countries by data volume Top 10 Countries
Value distribution (histogram + KDE) Value Distribution
Regression β€” actual vs predicted Regression

⚠️ Memory Safety Notes

This dataset is large (~300MB CSV, 886k rows). The notebook is designed to handle it safely:

  • CSV loaded in 50,000 row chunks β€” prevents RAM crash
  • dtype specified upfront β€” reduces memory by ~40%
  • Only historical years (1970–2023) used β€” avoids mostly-NaN future columns
  • MySQL push done in 10,000 row chunks β€” stable on any system
  • Excel export uses aggregated summaries only β€” not raw 886k rows

πŸ‘€ Author

Astik Mohapatra
B.Tech Computer Science Engineering
Government College of Engineering Keonjhar (CGPA: 8.09, 2026)
πŸ“§ astikm7007@gmail.com
πŸ”— https://linkedin.com/in/astik-mohapatra
πŸ™ https://github.com/Astik97


πŸ“„ License

This project is licensed under the MIT License.
The dataset is provided by The World Bank under the Creative Commons Attribution 4.0 License (CC BY 4.0).


πŸ™ Acknowledgements

About

Educational statistics analysis and reporting dashboard built using Python and data visualization libraries.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors