Cory Zeitz

Anime Data Platform

Anime Harbor

A medallion-architecture data platform that ingests, transforms, and serves anime analytics datasets on AWS.

Platform Summary

Cloud-native anime data platform implementing a medallion architecture on AWS to ingest, transform, and serve analytics-ready datasets for APIs, dashboards, and ML workflows.

Medallion Architecture

Bronze

Raw Data Layer

Unprocessed API and dataset payloads in JSON/CSV on S3, partitioned by source and ingest date.
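One way the Bronze layout could be expressed is a small key builder. This is a minimal sketch, not the project's actual code: the function name `bronze_key`, the `bronze/` prefix, and the Hive-style `key=value` folder names are assumptions chosen for illustration.

```python
from datetime import date


def bronze_key(source: str, ingest_date: date, filename: str) -> str:
    """Build an S3 object key partitioned by source and ingest date.

    The prefix and folder names are hypothetical; only the idea of
    partitioning by source and ingest date comes from the description.
    """
    return (
        f"bronze/source={source}/"
        f"year={ingest_date.year:04d}/"
        f"month={ingest_date.month:02d}/"
        f"day={ingest_date.day:02d}/"
        f"{filename}"
    )


key = bronze_key("myanimelist", date(2024, 5, 17), "top_anime.json")
print(key)
# bronze/source=myanimelist/year=2024/month=05/day=17/top_anime.json
```

Hive-style `key=value` folders are used here because Glue crawlers and Athena can map them to partition columns, which keeps the raw layer queryable without renaming objects later.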

Silver

Cleaned Data Layer

Glue ETL standardizes schema, removes duplicates, validates records, and writes partitioned Parquet.
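The Silver rules can be illustrated without Spark. The sketch below is plain Python rather than a Glue job, and the field names (`id`, `title`, `score`) and the 0 to 10 score range are assumptions about the payload, not confirmed details of the pipeline. In a real Glue job the same logic would likely be written with PySpark DataFrame operations and end with a Parquet write.

```python
from typing import Any, Optional


def clean_record(raw: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Standardize one raw record; return None if it fails validation."""
    try:
        anime_id = int(raw["id"])
    except (KeyError, TypeError, ValueError):
        return None  # missing or non-numeric id
    title = raw.get("title")
    if not isinstance(title, str) or not title.strip():
        return None  # missing or empty title
    score_raw = raw.get("score")
    try:
        score = None if score_raw in (None, "") else float(score_raw)
    except (TypeError, ValueError):
        return None  # score present but not numeric
    if score is not None and not 0.0 <= score <= 10.0:
        return None  # assumed valid range
    return {"anime_id": anime_id, "title": title.strip(), "score": score}


def to_silver(raw_records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Validate, standardize, and deduplicate by anime_id (last record wins)."""
    by_id: dict[int, dict[str, Any]] = {}
    for raw in raw_records:
        record = clean_record(raw)
        if record is not None:
            by_id[record["anime_id"]] = record
    return list(by_id.values())


raw_batch = [
    {"id": "1", "title": " Cowboy Bebop ", "score": "8.75"},
    {"id": 1, "title": "Cowboy Bebop", "score": 8.75},   # duplicate id
    {"id": 2, "title": "", "score": 7.0},                # empty title
    {"id": "x", "title": "Bad id", "score": 5},          # invalid id
    {"id": 3, "title": "Mushishi", "score": None},       # null score kept
]
silver = to_silver(raw_batch)
```

Keeping rows with a null score while rejecting rows with a bad id is one possible policy; the right choice depends on which fields downstream consumers treat as required.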

Gold

Serving Layer

Analytics-ready aggregates such as top-rated anime, trending titles, and genre-level insights.
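To make the Gold aggregates concrete, here is a minimal sketch of two of them computed over cleaned records. The record shape (`title`, `score`, `genres`) and the function names are hypothetical, and the sample rows are placeholder data rather than real results.

```python
from collections import defaultdict
from typing import Any


def top_rated(records: list[dict[str, Any]], n: int = 3) -> list[dict[str, Any]]:
    """Return the n highest-scored records, ignoring unscored ones."""
    scored = [r for r in records if r["score"] is not None]
    return sorted(scored, key=lambda r: r["score"], reverse=True)[:n]


def genre_average(records: list[dict[str, Any]]) -> dict[str, float]:
    """Average score per genre; a title counts once for each of its genres."""
    totals: dict[str, list[float]] = defaultdict(lambda: [0.0, 0])
    for r in records:
        if r["score"] is None:
            continue
        for genre in r["genres"]:
            totals[genre][0] += r["score"]  # running sum
            totals[genre][1] += 1           # running count
    return {g: round(s / c, 2) for g, (s, c) in totals.items()}


# Placeholder rows for illustration only.
silver_rows = [
    {"title": "A", "score": 9.0, "genres": ["Drama", "Sci-Fi"]},
    {"title": "B", "score": 8.0, "genres": ["Drama"]},
    {"title": "C", "score": None, "genres": ["Comedy"]},
    {"title": "D", "score": 7.0, "genres": ["Comedy", "Sci-Fi"]},
]
best = [r["title"] for r in top_rated(silver_rows, n=2)]
by_genre = genre_average(silver_rows)
```

Precomputing aggregates like these in the Gold layer means dashboards and API endpoints read small, ready-made tables instead of rescanning the Silver data on every request.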

Data Flow

  1. Ingest from the MyAnimeList API and public datasets into Bronze S3 prefixes.
  2. Run Glue ETL jobs to normalize, deduplicate, enforce schema, and convert to Parquet.
  3. Register metadata in Glue Data Catalog for discoverability and governance.
  4. Serve curated Gold datasets through Athena and API endpoints.
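The serving step in the flow above comes down to running a parameterized query against a Gold table. The sketch below uses Python's built-in `sqlite3` as a local stand-in so it runs without AWS; it is not Athena code. The table name `gold_top_rated`, its columns, and the sample rows are all hypothetical.

```python
import sqlite3

# Local stand-in for a Gold table; in the platform this would be a
# cataloged table queried through Athena.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gold_top_rated (title TEXT, genre TEXT, avg_score REAL)"
)
conn.executemany(
    "INSERT INTO gold_top_rated VALUES (?, ?, ?)",
    [
        ("Title A", "Drama", 9.0),   # placeholder rows
        ("Title B", "Drama", 8.5),
        ("Title C", "Comedy", 7.9),
    ],
)

QUERY = """
    SELECT title, avg_score
    FROM gold_top_rated
    WHERE genre = ?
    ORDER BY avg_score DESC
    LIMIT ?
"""


def top_titles(connection: sqlite3.Connection, genre: str, limit: int):
    """Return (title, avg_score) rows for one genre, best first."""
    return connection.execute(QUERY, (genre, limit)).fetchall()


drama_top = top_titles(conn, "Drama", 1)
```

An API endpoint would wrap a function like `top_titles` and return the rows as JSON. Binding `genre` and `limit` as parameters rather than formatting them into the SQL string avoids injection through request input.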

Tech Stack

Area            Tools
Storage         Amazon S3 (bronze/silver/gold prefixes)
ETL + Catalog   AWS Glue jobs, crawlers, Data Catalog, PySpark
Orchestration   EventBridge schedules, optional Lambda triggers
Serving         Athena queries + Node.js/Express REST API
Reliability     Schema checks, null/duplicate validation, CloudWatch logging

Implementation Priorities

  • Modular pipeline folders: /ingestion, /etl, /models, /api.
  • Idempotent pipeline runs with retry handling and clear step-level logging.
  • Partitioned S3 layout by layer and year/month/day for scale.
  • Gold-layer datasets optimized for analytics, dashboards, and downstream ML features.
  • Schema evolution handling and type-safe transformations as source payloads change.
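The idempotency and retry priority in the list above can be sketched as a small step runner. This is an illustration under assumptions, not the project's implementation: `run_step` and the in-memory `completed` set are hypothetical, and a real pipeline would need to persist completion state somewhere durable (Glue job bookmarks or a marker object in S3 are possible options).

```python
import logging
import time
from typing import Callable

logger = logging.getLogger("pipeline")


def run_step(
    name: str,
    func: Callable[[], None],
    completed: set[str],
    retries: int = 3,
    delay_seconds: float = 0.0,
) -> bool:
    """Run a step once; return True if it ran, False if it was skipped."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    if name in completed:
        # Idempotency: a rerun does not repeat finished work.
        logger.info("step=%s status=skipped", name)
        return False
    for attempt in range(1, retries + 1):
        try:
            func()
        except Exception as exc:
            logger.warning(
                "step=%s attempt=%d status=failed error=%s", name, attempt, exc
            )
            if attempt == retries:
                raise  # out of retries: surface the last error
            time.sleep(delay_seconds)
        else:
            completed.add(name)
            logger.info("step=%s attempt=%d status=ok", name, attempt)
            return True
    return False  # not reached; keeps the return type explicit


# Usage: a step that fails twice with a transient error, then succeeds.
calls = {"n": 0}


def flaky_etl() -> None:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")


done: set[str] = set()
first_run = run_step("etl", flaky_etl, done)   # runs, succeeds on attempt 3
second_run = run_step("etl", flaky_etl, done)  # skipped, already completed
```

Logging one line per attempt with the step name gives the step-level trace the priorities call for, and the skip check is what makes a full rerun safe after a partial failure.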