
Anime Data Platform

Anime Harbor

A medallion-architecture data platform that ingests, transforms, and serves anime analytics datasets on AWS.

Platform Summary

Anime Harbor is a cloud-native data platform on AWS. It ingests anime data from the MyAnimeList API and public datasets, refines it through Bronze, Silver, and Gold layers, and serves analytics-ready datasets to APIs, dashboards, and ML workflows.

Medallion Architecture

Bronze

Raw Data Layer

Unprocessed API and dataset payloads in JSON/CSV on S3, partitioned by source and ingest date.
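
As an illustration of the Bronze partitioning described above, here is a minimal Python sketch of how an object key could be built. The function name bronze_key, the "source=" and "year=/month=/day=" naming, and the sample filename are assumptions made for this example, not the project's confirmed layout.

```python
from datetime import date


def bronze_key(source: str, ingest_date: date, filename: str) -> str:
    """Build a Bronze object key partitioned by source and ingest date.

    The Hive-style key=value segments are an assumed convention; they are
    used here because Glue crawlers and Athena can read them as partitions.
    """
    return (
        f"bronze/source={source}/"
        f"year={ingest_date:%Y}/month={ingest_date:%m}/day={ingest_date:%d}/"
        f"{filename}"
    )


print(bronze_key("myanimelist", date(2024, 3, 9), "anime_page_001.json"))
# bronze/source=myanimelist/year=2024/month=03/day=09/anime_page_001.json
```

Zero-padded month and day values keep keys in chronological order when listed lexicographically.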

Silver

Cleaned Data Layer

Glue ETL standardizes schema, removes duplicates, validates records, and writes partitioned Parquet.
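
The Silver-layer rules can be sketched in plain Python as a stand-in for the Glue/PySpark job. The field names (anime_id, title, score) and the choice to drop invalid rows rather than quarantine them are assumptions for illustration only.

```python
from typing import Any, Iterable

# Hypothetical required fields; the real schema is not given in the source.
REQUIRED_FIELDS = ("anime_id", "title")


def clean_records(raw: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Validate, type-coerce, and deduplicate raw records by anime_id."""
    seen: set[int] = set()
    cleaned: list[dict[str, Any]] = []
    for rec in raw:
        # Validation: skip records missing a required field.
        if any(rec.get(field) in (None, "") for field in REQUIRED_FIELDS):
            continue
        # Schema standardization: coerce types, skip rows that cannot be coerced.
        try:
            anime_id = int(rec["anime_id"])
            raw_score = rec.get("score")
            score = float(raw_score) if raw_score not in (None, "") else None
        except (TypeError, ValueError):
            continue
        # Deduplication: keep the first record seen for each anime_id.
        if anime_id in seen:
            continue
        seen.add(anime_id)
        cleaned.append(
            {"anime_id": anime_id, "title": str(rec["title"]).strip(), "score": score}
        )
    return cleaned
```

In the actual pipeline the same three steps would run as DataFrame operations, with the result written as partitioned Parquet.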

Gold

Serving Layer

Analytics-ready aggregates such as top-rated anime, trending titles, and genre-level insights.
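
Two of the Gold aggregates named above (top-rated anime and genre-level insights) could be computed along these lines. This is a sketch over in-memory records; the field names (score, genres) and the use of a mean score as the genre-level metric are assumptions.

```python
from collections import defaultdict
from typing import Any


def top_rated(records: list[dict[str, Any]], n: int = 3) -> list[dict[str, Any]]:
    """Return the n highest-scored records, ignoring unscored titles."""
    scored = [r for r in records if r.get("score") is not None]
    return sorted(scored, key=lambda r: r["score"], reverse=True)[:n]


def genre_average_scores(records: list[dict[str, Any]]) -> dict[str, float]:
    """Return the mean score per genre, rounded to two decimals."""
    totals: dict[str, list[float]] = defaultdict(lambda: [0.0, 0])
    for r in records:
        if r.get("score") is None:
            continue
        for genre in r.get("genres", []):
            totals[genre][0] += r["score"]  # running sum of scores
            totals[genre][1] += 1           # number of scored titles
    return {g: round(total / count, 2) for g, (total, count) in totals.items()}
```

A title with several genres contributes to each of them, so genre averages are not mutually exclusive.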

Data Flow

Ingest from the MyAnimeList API and public datasets into Bronze S3 prefixes.
Run Glue ETL jobs to normalize, deduplicate, enforce schema, and convert to Parquet.
Register metadata in the Glue Data Catalog for discoverability and governance.
Serve curated Gold datasets through Athena and API endpoints.
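
For the serving step, an API endpoint would typically hand Athena a SQL string. The helper below sketches how such a query might be assembled safely. The table name gold.top_rated_anime and the columns title and score are hypothetical, and submitting the query to Athena is left out.

```python
# Hypothetical Gold table; the real database and table names are not given.
GOLD_TOP_RATED_TABLE = "gold.top_rated_anime"


def top_rated_query(limit: int = 10) -> str:
    """Build the SQL text for a top-rated lookup against the Gold layer.

    The limit is validated as a positive integer before interpolation so that
    a request parameter cannot inject arbitrary SQL.
    """
    if isinstance(limit, bool) or not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT title, score\n"
        f"FROM {GOLD_TOP_RATED_TABLE}\n"
        "ORDER BY score DESC\n"
        f"LIMIT {limit}"
    )


print(top_rated_query(5))
```

Keeping the table name as a module constant rather than a caller-supplied value means only the validated limit varies per request.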

Tech Stack

Area	Tools
Storage	Amazon S3 (bronze/silver/gold prefixes)
ETL + Catalog	AWS Glue jobs, crawlers, Data Catalog, PySpark
Orchestration	EventBridge schedules, optional Lambda triggers
Serving	Athena queries + Node.js/Express REST API
Reliability	Schema checks, null/duplicate validation, CloudWatch logging

Implementation Priorities

Modular pipeline folders: /ingestion, /etl, /models, /api.
Idempotent pipeline runs with retry handling and clear step-level logging.
Partitioned S3 layout by layer and year/month/day for scale.
Gold-layer datasets optimized for analytics, dashboards, and downstream ML features.
Schema evolution handling and type-safe transformations as source payloads change.
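
The idempotency and retry priority in the list above can be sketched as a small step runner. This is one possible shape, not the project's implementation: the run_step name, the in-memory completed set (a real pipeline would persist this, for example as marker objects), and the linear backoff are all assumptions.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger("pipeline")


def run_step(
    name: str,
    func: Callable[[], None],
    completed: set[str],
    retries: int = 3,
    delay_seconds: float = 0.0,
) -> bool:
    """Run a step at most once per name, retrying on failure.

    Returns True if the step ran, False if it was skipped as already done.
    Re-raises the last error once all attempts are exhausted.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    if name in completed:
        # Idempotency: a rerun of the pipeline skips finished steps.
        logger.info("step=%s status=skipped", name)
        return False
    for attempt in range(1, retries + 1):
        try:
            func()
        except Exception:
            logger.exception("step=%s attempt=%d status=failed", name, attempt)
            if attempt == retries:
                raise
            time.sleep(delay_seconds * attempt)  # linear backoff between attempts
        else:
            completed.add(name)
            logger.info("step=%s attempt=%d status=succeeded", name, attempt)
            return True
    return False  # unreachable; kept so every path returns a bool
```

Using a step name that includes the partition date (for example "etl/2024-03-09") makes each daily run independently repeatable.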
