GitHub - TechMitten/Origami-AI: Transforms static PDF documents into professional narrated presentations using advanced local AI.

Turn PDF slides into narrated videos — entirely in your browser.

AI-written narration, local text-to-speech, in-browser FFmpeg rendering, and smart screen recording. No upload, no render farm, no subscription.

Quick Start · Features · How It Works · Configuration · Contributing

What is Origami AI?

Upload a PDF deck, and Origami AI extracts the slides, writes a narration script with an LLM, voices it with local text-to-speech, and renders a finished MP4. WebLLM (accelerated by WebGPU), Kokoro.js, and FFmpeg.wasm all run directly in your browser. The Express/Cloudflare server only proxies optional cloud LLM calls so that API keys never reach the client bundle; your slides, audio, and video never have to leave your machine.

It also doubles as a screen recorder with cinematic auto-zoom, an MP4 scene analyzer, and an AI assistant chat.

	Traditional video editors	Cloud AI services	Origami AI
Learning curve	Steep	Easy	Minimal — automated
Privacy	Local	Cloud-based	Local-first
Cost	One-time / free	Pay-per-minute or credits	Free & open source
Voice	Your own / hire talent	Pay per minute	Unlimited local TTS
Time to video	Hours	Minutes	~10–30 min

✨ Features

🎬 AI narration scripts — generated locally with WebLLM, or via Gemini/OpenAI-compatible APIs
🎙️ In-browser TTS — Kokoro.js with multiple voices, no server round-trip
⚡ WebGPU acceleration — for both narration generation and the AI assistant
📹 In-browser rendering — FFmpeg.wasm composes slides, audio, music, and pan/zoom into a 720p/1080p MP4
🎯 Smart screen recording — auto-zoom on idle, with an optional Chrome extension for richer cursor/DOM telemetry
🔍 Scene-aware video analysis — turn an MP4 into a timestamped scene breakdown
💬 AI assistant chat — local WebLLM models or cloud fallback, with image/video attachments
🐛 Issue reporter — record a bug, get an AI-generated debugging writeup
🔒 Server-side key proxying — LLM_API_KEY never ships in the production client bundle
🎵 Background music & mixing — auto-ducking under narration with per-slide control
📦 Portable projects — export/import a full project (slides, media, audio, settings) as a .origami archive

🚀 Quick Start

Requirements: Node.js ≥ 20.19.0 and a WebGPU-capable browser.

git clone https://github.com/TechMitten/Origami-AI.git
cd Origami-AI
npm install
npm run dev

Open http://localhost:3000.

Important

Don't open index.html directly. The dev server sets the COOP/COEP headers that SharedArrayBuffer/FFmpeg.wasm need — without them, rendering and TTS init silently fail.
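
For reference, the two response headers in question are `Cross-Origin-Opener-Policy` and `Cross-Origin-Embedder-Policy`. The sketch below is not taken from the project's server code; `missingIsolationHeaders` is a hypothetical helper that checks whether a set of response headers would make a page cross-origin isolated, which browsers require before exposing `SharedArrayBuffer`.

```typescript
// Header values that browsers accept for cross-origin isolation.
// Keys are lower-case because HTTP header names are case-insensitive.
const ACCEPTED_ISOLATION_VALUES: Record<string, string[]> = {
  "cross-origin-opener-policy": ["same-origin"],
  "cross-origin-embedder-policy": ["require-corp", "credentialless"],
};

// Hypothetical helper (not part of Origami AI): returns the names of the
// isolation headers that are absent or carry a value that does not isolate.
function missingIsolationHeaders(
  responseHeaders: Record<string, string>,
): string[] {
  const normalized: Record<string, string> = {};
  for (const headerName of Object.keys(responseHeaders)) {
    normalized[headerName.toLowerCase()] = responseHeaders[headerName]
      .trim()
      .toLowerCase();
  }
  return Object.keys(ACCEPTED_ISOLATION_VALUES).filter(
    (headerName) =>
      ACCEPTED_ISOLATION_VALUES[headerName].indexOf(normalized[headerName]) === -1,
  );
}
```

An empty result means both headers are present with isolating values. In the browser itself, the read-only global `crossOriginIsolated` reports the same outcome at runtime.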

Command	Purpose
`npm run dev`	Express + Vite dev server with HMR
`npm run build`	Production build → `dist/`
`npm run preview`	Serve the production build
`npm run lint`	Lint plain `.js` files
`npm run stop`	Kill whatever is on port 3000

Run with Docker instead

docker compose up --build

Available at http://localhost:3000.

Optional: install the Chrome extension

The extension adds DOM-level cursor/click/scroll telemetry for more precise auto-zoom during screen recording. Origami AI works without it via an in-page fallback.

Open chrome://extensions
Enable Developer mode
Click Load unpacked → select the chrome-extension/ folder

You can also download a packaged ZIP from inside the app (header menu → Download Chrome Extension, or Slide Editor → Slide Media tab). See chrome-extension/README.md for details.

🧭 How It Works

Primary flow — PDF → video:

Upload a PDF; slide images and text are extracted automatically
An LLM (local WebLLM or a remote API) drafts a narration script per slide
Kokoro.js synthesizes speech for each script
Edit scripts, timing, transitions, and music in the visual editor
FFmpeg.wasm renders a 720p/1080p MP4, fully in-browser
Download the finished video

Typical end-to-end time is 10–30 minutes, depending on slide count and GPU.
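
To make the rendering step more concrete, here is a minimal sketch of the kind of FFmpeg argument list that could turn one slide image plus its narration audio into a clip. The `slideClipArgs` helper, the file names, and the choice of flags are assumptions for illustration; the project's actual render pipeline is not shown in this README and may differ.

```typescript
type Resolution = "720p" | "1080p";

// Frame sizes for the two export resolutions mentioned above.
const FRAME_SIZES: Record<Resolution, string> = {
  "720p": "1280:720",
  "1080p": "1920:1080",
};

// Hypothetical helper: build FFmpeg arguments for a single still-image clip.
// Note: a plain scale filter stretches slides that are not 16:9.
function slideClipArgs(
  imageFile: string,
  audioFile: string,
  outputFile: string,
  resolution: Resolution,
): string[] {
  return [
    "-loop", "1",            // repeat the still image as a video stream
    "-i", imageFile,
    "-i", audioFile,
    "-vf", `scale=${FRAME_SIZES[resolution]}`,
    "-c:v", "libx264",
    "-tune", "stillimage",   // x264 tuning for static content
    "-pix_fmt", "yuv420p",   // widely compatible pixel format
    "-c:a", "aac",
    "-shortest",             // stop when the audio (the finite input) ends
    outputFile,
  ];
}
```

With `@ffmpeg/ffmpeg` 0.12.x, an array like this would typically be passed to `ffmpeg.exec(...)` after the input files have been written into the in-memory file system. Whether a given FFmpeg.wasm build includes `libx264` and `aac` depends on how its core was compiled.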

Other entry points:

Screen recording — capture a tab or desktop, auto-zoom on idle (>2s), combine with PDF slides or use standalone
Scene analysis — upload an MP4, get a timestamped scene breakdown via the Gemini API
AI assistant chat — ask questions, attach images/video, local or cloud models
Issue reporter — record a bug, get an AI-generated analysis and debugging suggestions
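
The idle rule for screen recording (no activity for more than 2 seconds) can be sketched as a pass over activity timestamps. The `ActivitySample` shape and `findIdleWindows` function below are hypothetical; how the recorder actually collects telemetry and what it does inside an idle window are not described in this README.

```typescript
// Hypothetical shape for one cursor/click/scroll sample from the recorder.
interface ActivitySample {
  timeMs: number; // milliseconds since the recording started
}

interface IdleWindow {
  startMs: number; // moment the idle threshold was crossed
  endMs: number;   // moment activity resumed (or the recording ended)
}

// Return every stretch where no activity was seen for longer than the
// threshold. The 2000 ms default mirrors the ">2s" rule described above.
function findIdleWindows(
  samples: ActivitySample[],
  recordingEndMs: number,
  idleThresholdMs = 2000,
): IdleWindow[] {
  const times = samples.map((s) => s.timeMs).sort((a, b) => a - b);
  const windows: IdleWindow[] = [];
  let previous = 0;
  // Treat the end of the recording as a final boundary.
  for (const t of times.concat([recordingEndMs])) {
    if (t - previous > idleThresholdMs) {
      windows.push({ startMs: previous + idleThresholdMs, endMs: t });
    }
    previous = t;
  }
  return windows;
}
```

For samples at 500 ms, 1000 ms, and 4500 ms in a 5000 ms recording, this yields a single window from 3000 ms to 4500 ms: the gap after the 1000 ms sample exceeds the threshold, and the window opens once 2 seconds of inactivity have elapsed.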

⚙️ Configuration

Open the app and click ⚙️ Settings for:

Tab	Controls
General	Intro fade timing, post-audio delay, default transition, recording options
TTS Model	Kokoro.js quantization (`q4` quality vs. `q8` speed)
WebLLM	Enable/disable local AI, model selection, precision filter (f16/f32)
API	Remote OpenAI-compatible provider (Gemini, OpenRouter, Ollama, etc.)
AI Prompt	Narration script generation behavior
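
As a rough illustration of how the General timing settings could combine with per-slide narration length, the sketch below lays slides out on a timeline. The field names and the rule itself (intro fade before the first narration, post-audio delay after every narration) are assumptions, not the app's documented behavior.

```typescript
// Hypothetical inputs: one entry per slide, plus two timing settings.
interface SlideAudio {
  slideIndex: number;
  audioSeconds: number; // length of the synthesized narration
}

interface TimingSettings {
  introFadeSeconds: number;      // applied once, before the first narration
  postAudioDelaySeconds: number; // pause after each slide's narration
}

interface TimelineEntry {
  slideIndex: number;
  startSeconds: number;
  endSeconds: number;
}

// Lay slides end to end; each slide lasts as long as its narration plus
// the post-audio delay, and the first slide also absorbs the intro fade.
function buildTimeline(
  slides: SlideAudio[],
  settings: TimingSettings,
): TimelineEntry[] {
  const timeline: TimelineEntry[] = [];
  let cursor = 0;
  slides.forEach((slide, position) => {
    const lead = position === 0 ? settings.introFadeSeconds : 0;
    const end = cursor + lead + slide.audioSeconds + settings.postAudioDelaySeconds;
    timeline.push({
      slideIndex: slide.slideIndex,
      startSeconds: cursor,
      endSeconds: end,
    });
    cursor = end;
  });
  return timeline;
}
```

With two slides narrated for 4 s and 6 s, a 1 s intro fade, and a 0.5 s post-audio delay, the first slide spans 0 to 5.5 s and the second 5.5 to 12 s.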

API keys: dev vs. production

Origami AI works with zero API keys via local WebLLM. Cloud APIs (Gemini, OpenAI-compatible) are optional, for narration, video analysis, and issue reporting.

cp .env.example .env

# Dev only — Vite bakes VITE_-prefixed vars into the client bundle
VITE_LLM_API_KEY=your_api_key_here
VITE_LLM_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
VITE_LLM_MODEL=gemini-flash-latest

In production, set LLM_API_KEY (no VITE_ prefix) on the server/host instead. The client detects there's no client-side key and automatically routes calls through the server proxy (POST /api/llm/chat, /api/llm/analyze-video, /api/llm/analyze-issue) so the key never reaches the browser.
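
The routing decision described above can be sketched as a small pure function. `ClientEnv`, `ChatRoute`, and `resolveChatRoute` are hypothetical names, and the direct-call path assumes the provider follows the usual OpenAI-compatible `/chat/completions` convention; the real client logic may be organized differently.

```typescript
// Hypothetical view of the values Vite would expose to the client.
interface ClientEnv {
  VITE_LLM_API_KEY?: string;
  VITE_LLM_BASE_URL?: string;
}

interface ChatRoute {
  mode: "direct" | "proxy";
  url: string;
  headers: Record<string, string>;
}

// Decide where a chat request should go.
function resolveChatRoute(env: ClientEnv): ChatRoute {
  const key = (env.VITE_LLM_API_KEY ?? "").trim();
  if (key !== "" && env.VITE_LLM_BASE_URL) {
    // Dev: a client-side key exists, so call the provider directly.
    const base = env.VITE_LLM_BASE_URL.replace(/\/+$/, "");
    return {
      mode: "direct",
      url: `${base}/chat/completions`,
      headers: { Authorization: `Bearer ${key}` },
    };
  }
  // Production: no client-side key, so let the server attach LLM_API_KEY.
  return { mode: "proxy", url: "/api/llm/chat", headers: {} };
}
```

In the proxy case the browser sends no credentials of its own, which is what keeps the key out of the client bundle.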

Warning

Never set VITE_LLM_API_KEY in production — anything with the VITE_ prefix is compiled into the public client bundle.

Full environment variable reference

Variable	Context	Purpose
`VITE_LLM_API_KEY`	Client (dev only)	Exposes the API key to the browser for development. Never set in production.
`LLM_API_KEY`	Server (prod)	Server-side key used by the proxy endpoints; never sent to the client.
`VITE_LLM_BASE_URL`	Client	OpenAI-compatible endpoint base URL
`VITE_LLM_MODEL`	Client	Model identifier (e.g. `gemini-flash-latest`)
`CLIENT_URL`	Server	Comma-separated allowed CORS origins
`PORT`	Server	Port to listen on (default `3000`)
`NODE_ENV`	Server	Set to `production` for production builds
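
A minimal sketch of how the server-side variables in this table might be read, assuming `CLIENT_URL` is split on commas and `PORT` falls back to `3000`. `loadServerConfig` and `isOriginAllowed` are hypothetical helpers, not functions from `server.ts`.

```typescript
interface ServerConfig {
  port: number;
  allowedOrigins: string[];
  isProduction: boolean;
}

// Parse the server-side variables from an environment-like object.
function loadServerConfig(
  env: Record<string, string | undefined>,
): ServerConfig {
  const parsedPort = Number.parseInt(env.PORT ?? "", 10);
  return {
    // Fall back to the documented default when PORT is unset or invalid.
    port: Number.isNaN(parsedPort) ? 3000 : parsedPort,
    // "a, b" -> ["a", "b"]; blank entries are dropped.
    allowedOrigins: (env.CLIENT_URL ?? "")
      .split(",")
      .map((entry) => entry.trim())
      .filter((entry) => entry.length > 0),
    isProduction: env.NODE_ENV === "production",
  };
}

// Exact-match check of a request's Origin header against the allow-list.
function isOriginAllowed(config: ServerConfig, requestOrigin: string): boolean {
  return config.allowedOrigins.indexOf(requestOrigin) !== -1;
}
```

On the server this would be called once at startup with `process.env`, and the resulting allow-list handed to whatever CORS middleware is in use.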

🖥️ Requirements

Node.js ≥ 20.19.0
A WebGPU-capable browser (below) — required for local narration generation, the AI assistant, and zoom effects during screen recording. Without it, fall back to a remote OpenAI-compatible API.
A stable connection for first-run model downloads (roughly 1–5GB depending on models chosen)
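
The WebGPU requirement and its remote-API fallback can be sketched as a capability check followed by a backend choice. `navigator.gpu.requestAdapter()` is the standard WebGPU entry point; the `GpuCapableNavigator` type, `hasUsableWebGpu`, and `pickLlmBackend` are hypothetical names, and the app's real selection logic may weigh other settings.

```typescript
// Minimal structural type for the part of `navigator` this sketch touches.
interface GpuCapableNavigator {
  gpu?: { requestAdapter(): Promise<unknown> };
}

// True only if the WebGPU API exists and yields an adapter.
async function hasUsableWebGpu(nav: GpuCapableNavigator): Promise<boolean> {
  if (!nav.gpu) return false;
  try {
    // requestAdapter() resolves to null when no suitable adapter exists.
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null && adapter !== undefined;
  } catch {
    return false;
  }
}

type LlmBackend = "local-webllm" | "remote-api" | "unavailable";

// Prefer local inference when possible, otherwise fall back to a
// configured remote OpenAI-compatible API.
function pickLlmBackend(
  webGpuUsable: boolean,
  remoteApiConfigured: boolean,
  preferLocal = true,
): LlmBackend {
  if (webGpuUsable && (preferLocal || !remoteApiConfigured)) return "local-webllm";
  if (remoteApiConfigured) return "remote-api";
  return "unavailable";
}
```

In a browser, the first function would be called with the global `navigator`, and its result passed to the second along with whether a remote provider has been set up in Settings.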

Browser support

Browser	Min. version	Notes
Chrome / Chromium	113+	Chrome extension available for enhanced recording
Edge	113+	Chrome extension available for enhanced recording
Firefox	Nightly	Enable `dom.webgpu.enabled` in `about:config`
Safari	18+ (macOS Sonoma)	Desktop recording supported

System specs & model sizes

Minimum: 4-core CPU, 8GB RAM, integrated GPU (expect 1–2 hours for first-run downloads + rendering)
Recommended: 8-core CPU, 16GB RAM, dedicated GPU with F16 support, NVMe SSD

AI assistant chat model options:

Model	Download	VRAM
Gemma 2 2B	1.4GB	~2GB
Llama 3.2 1B	800MB	~1.5GB
Llama 3.2 3B	1.7GB	~2.5GB
Phi 3.5 Vision	3.9GB	~4GB (adds image/video analysis)
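
One way to read this table is as a lookup: given the VRAM you can spare, pick the largest model that fits. The sketch below encodes the table's figures; `suggestChatModel` and its "largest that fits" heuristic are assumptions for illustration, not how the app chooses a model.

```typescript
interface ChatModel {
  name: string;
  downloadGb: number;
  vramGb: number;  // approximate figure from the table above
  vision: boolean; // true if the model adds image/video analysis
}

const CHAT_MODELS: ChatModel[] = [
  { name: "Gemma 2 2B", downloadGb: 1.4, vramGb: 2, vision: false },
  { name: "Llama 3.2 1B", downloadGb: 0.8, vramGb: 1.5, vision: false },
  { name: "Llama 3.2 3B", downloadGb: 1.7, vramGb: 2.5, vision: false },
  { name: "Phi 3.5 Vision", downloadGb: 3.9, vramGb: 4, vision: true },
];

// Hypothetical heuristic: the model with the largest VRAM footprint that
// still fits the budget, optionally restricted to vision-capable models.
function suggestChatModel(
  availableVramGb: number,
  needsVision = false,
): ChatModel | null {
  const candidates = CHAT_MODELS.filter(
    (m) => m.vramGb <= availableVramGb && (!needsVision || m.vision),
  );
  if (candidates.length === 0) return null;
  return candidates.reduce((best, m) => (m.vramGb > best.vramGb ? m : best));
}
```

For example, a 3GB budget selects Llama 3.2 3B, while the same budget with `needsVision` set returns `null` because Phi 3.5 Vision needs roughly 4GB.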

🏗️ Tech Stack

Frontend, core libraries, and backend

Frontend — React 19 + TypeScript, Vite 7, Tailwind CSS 4, React Router 7

Core libraries

@mlc-ai/web-llm — local LLM inference for narration and chat
@ffmpeg/ffmpeg — in-browser video rendering
pdfjs-dist — PDF text/image extraction
kokoro-js — text-to-speech
@dnd-kit — drag-and-drop slide reordering

Backend — Express 5 + TypeScript (server.ts), with a parallel Cloudflare Pages Functions implementation (functions/) for the Wrangler deployment target. See CLAUDE.md for why both exist and how they need to stay in sync.

Chrome extension — plain JS Manifest V3, MessagePort-based telemetry, optional

Project structure

src/
├── components/      # React UI components
├── pages/           # Routed pages (AssistantPage, IssueReporterPage, ...)
├── services/        # Business logic — aiService, webLlmService, ttsService,
│                     #   BrowserVideoRenderer, storage, projectArchiveService
├── hooks/           # Custom React hooks
├── context/         # React context providers
└── utils/           # Helpers

App.tsx owns most cross-cutting state for the editor flow; SlideEditor.tsx is the main per-slide editing surface (Overview, Voice Settings, Audio Mixing, Batch Tools, Slide Media tabs).

🐛 Troubleshooting

See TROUBLESHOOTING.md for the full guide. Quick fixes:

Symptom	Try
WebGPU not detected	Enable hardware acceleration, update GPU drivers, switch to a supported browser
FFmpeg / dev server errors	Run via `npm run dev` — never open `index.html` directly
Model download failures	Check connection stability, clear browser cache, verify storage permissions
Out of memory	Use a smaller model, close background apps, lower export resolution
COOP/COEP warnings	Confirm the dev server (not a static file) is serving the app

🤝 Contributing

Contributions are welcome — see CONTRIBUTING.md for setup, coding standards, commit conventions, and the PR process.

When reporting a bug, please include your browser + version, OS, node -v, repro steps, and any console output. File issues at GitHub Issues.

📄 License

Licensed under the MIT License.

🙏 Credits

WebLLM · Kokoro.js · FFmpeg.wasm · PDF.js · React · Tailwind CSS · Lucide · dnd-kit

Made with ❤️ by TechMitten LLC
