Text-to-Video AI
Text-to-video generator creating 7-second cinematic videos from natural language prompts using OpenAI DALL-E 3, Ken Burns motion effects, and ffmpeg encoding. Python 3.13+ application with Gradio UI.
Role
AI Engineer & Full-stack Developer
Team
Solo
Company/Organization
Personal Project
The Problem
Creating cinematic videos from text descriptions required expensive video generation APIs — RunwayML Gen-2 ($0.50/video), Runway Gen-3 ($2-5/video),...
Existing AI video APIs had slow sequential processing — single scene generation took 20-40s, multi-scene videos required 60-120s total. No...
Manual workflows were complex and time-consuming — generate images in DALL-E → download individually → import to video editor (Premiere, Final Cut) →...
Static image outputs lacked cinematic feel — DALL-E 3 produces high-quality images but no motion. Videos need dynamic camera movement (zoom, pan) for...
Browser compatibility issues — many video generation tools output codecs (H.265, VP9) not universally supported in browsers, resulting in 'codec not supported' playback errors.
No cost-efficient solution for multi-scene cinematic videos — needed 3+ camera angles per story (wide establishing shot, close-up for detail, dramatic...
The Solution
Built a Python application combining OpenAI DALL-E 3 for image generation with custom Ken Burns motion effects and ffmpeg encoding for browser-compatible H.264 output.
Video Generation Pipeline (5 Stages)
Prompt Engineering — User enters text description, system generates 3 scene-specific prompts:
- Wide shot: "[user prompt], wide establishing shot, cinematic composition, natural lighting"
- Close-up: "[user prompt], close-up detail shot, shallow depth of field, dramatic focus"
- Dramatic angle: "[user prompt], dramatic low-angle shot, epic composition, cinematic lighting"
- Ensures variety in camera angles for storytelling impact
Parallel Image Generation — DALL-E 3 API calls run concurrently via asyncio:
- `asyncio.gather()` sends 3 API requests simultaneously
- Each request: OpenAI DALL-E 3 standard (1024×1024, quality='standard', style='vivid')
- Downloads complete in ~20s total (vs ~60s sequential)
- Images saved temporarily for processing
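The fan-out pattern behind `asyncio.gather()` can be sketched independently of the API itself. Here `gather_images` and `_fake_fetch` are hypothetical names for illustration; in the real pipeline the fetch coroutine would wrap `AsyncOpenAI().images.generate(model="dall-e-3", prompt=p, size="1024x1024", quality="standard", style="vivid")` and return `resp.data[0].url`:

```python
import asyncio
from typing import Awaitable, Callable, List

async def gather_images(prompts: List[str],
                        fetch: Callable[[str], Awaitable[str]]) -> List[str]:
    """Run one fetch coroutine per prompt concurrently; order is preserved."""
    return await asyncio.gather(*(fetch(p) for p in prompts))

async def _fake_fetch(prompt: str) -> str:
    # Stand-in for the real DALL-E 3 call (~20 s round-trip in production).
    await asyncio.sleep(0)
    return f"https://example.com/{abs(hash(prompt))}.png"

urls = asyncio.run(gather_images(["wide", "close-up", "dramatic"], _fake_fetch))
```

Because the three requests are in flight at once, total latency is that of the slowest single call rather than the sum of all three.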
Ken Burns Motion Effect — Each image converted to 28-frame sequence:
- Start frame: image at 100% scale, centered
- End frame: image at 120% scale, panned (random direction: left/right/up/down)
- Interpolation: linear zoom + pan across 28 frames (2.33s at 12 fps)
- Implementation: Pillow (PIL) for image manipulation, NumPy for affine transformations
- Creates dynamic motion from static images
Cross-Fade Transitions — Scenes stitched with smooth dissolves:
- Scene 1 frames: 1-28 (full motion)
- Transition 1→2: frames 26-28 of scene 1 alpha-blended with frames 1-3 of scene 2
- Scene 2 frames: 29-56 (full motion)
- Transition 2→3: frames 54-56 of scene 2 blended with frames 1-3 of scene 3
- Scene 3 frames: 57-84 (full motion)
- Alpha blending: `output = img1 * (1 - alpha) + img2 * alpha` with alpha 0.33/0.66/1.0
- Total 84 frames = 7 seconds at 12 fps
Video Encoding — ffmpeg H.264 MP4 output:
- Command: `ffmpeg -framerate 12 -i frame_%04d.png -c:v libx264 -profile:v baseline -pix_fmt yuv420p -an output.mp4`
- `-profile:v baseline`: most compatible H.264 profile (plays in all browsers)
- `-pix_fmt yuv420p`: chroma subsampling for universal compatibility
- `-an`: no audio track (reduces file size, not needed for short cinematic clips)
- Output: 2-5MB MP4 file, ~500 kbps bitrate
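The ffmpeg invocation above, assembled in Python so the flags are kept in one place (`build_ffmpeg_cmd` and `encode` are illustrative names; the source lists the real entry point as `_encode_video`):

```python
import subprocess
from typing import List

def build_ffmpeg_cmd(pattern: str, output_path: str, fps: int = 12) -> List[str]:
    """Assemble the ffmpeg command for a browser-safe H.264 MP4."""
    return [
        "ffmpeg", "-y",
        "-framerate", str(fps),
        "-i", pattern,                 # e.g. frames/frame_%04d.png
        "-c:v", "libx264",
        "-profile:v", "baseline",      # broadest browser support
        "-pix_fmt", "yuv420p",         # 4:2:0 chroma, required by many players
        "-an",                         # strip the audio track
        output_path,
    ]

def encode(pattern: str, output_path: str, fps: int = 12) -> None:
    """Run the encode; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_cmd(pattern, output_path, fps), check=True)
```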
Application Architecture
app.py — Entry point, loads environment variables, launches Gradio UI:
`load_dotenv()` reads `.env` for OPENAI_API_KEY
`ui.launch()` starts Gradio server on port 7860
Server accessible at http://127.0.0.1:7860
video_generator.py — Core video pipeline logic:
`generate_video(prompt: str) -> str` — Main function orchestrating pipeline
`_generate_scene_prompts(prompt: str) -> List[str]` — Creates wide/close-up/dramatic prompts
`_generate_images_parallel(prompts: List[str]) -> List[Image]` — Async DALL-E 3 calls
`_apply_ken_burns(image: Image, direction: str) -> List[Image]` — Motion effect frames
`_create_crossfade(frames1: List[Image], frames2: List[Image]) -> List[Image]` — Transition blending
`_encode_video(frames: List[Image], output_path: str) -> None` — ffmpeg encoding
Error handling: retry logic for API failures, cleanup of temporary files
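The source mentions retry logic for API failures without detail; one common shape is exponential backoff, sketched here (the `with_retries` name and parameters are assumptions, not the project's actual implementation):

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

async def with_retries(op: Callable[[], Awaitable[T]],
                       attempts: int = 3, base_delay: float = 1.0) -> T:
    """Retry a failing async operation with exponential backoff (1s, 2s, ...)."""
    for attempt in range(attempts):
        try:
            return await op()
        except Exception:
            if attempt == attempts - 1:
                raise                  # out of attempts: surface the error
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")  # keeps type checkers happy
```

Each DALL-E 3 call would be wrapped as `await with_retries(lambda: _generate_one(...))` so transient network or rate-limit errors don't abort the whole pipeline.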
ui.py — Gradio interface definition:
`gr.Textbox(label="Prompt", placeholder="A futuristic city at sunset")` — User input
`gr.Video(label="Generated Video")` — Video preview and download
`gr.Button("Generate Video")` — Trigger generation
Examples: Pre-filled prompts ("Space station orbiting Mars", "Medieval castle in fog", "Cyberpunk street market at night")
Makefile Automation (8 commands)
`make install` — Install Python dependencies (requirements.txt), optionally create venv
`make run` — Launch Gradio UI (python app.py), opens browser to http://127.0.0.1:7860
`make dev` — Development mode with hot reload (`gradio app.py`; the Gradio CLI watches source files and reloads on change)
`make lint` — Run ruff linter on all Python files
`make format` — Auto-format with ruff (fixes style issues)
`make check` — Lint + format check + syntax validation (python -m py_compile) + import check
`make clean` — Remove generated files (*.mp4, *.png, __pycache__/, .ruff_cache/)
`make test` — Run encoding test (generate sample video, verify output exists and plays)
GitHub Actions CI/CD (.github/workflows/ci.yml)
Runs on every push and pull request:
Ruff Lint — Check code quality (unused imports, undefined names, syntax errors)
Ruff Format — Verify code follows formatting standards (fails if not formatted)
Python Syntax — Compile all .py files (python -m py_compile), catch syntax errors
Import Validation — Verify all imports resolve (import app, import video_generator, import ui)
Video Pipeline Test — Generate 3 sample images, apply Ken Burns, create transitions (verify frame count = 84)
Encoding Test — Run ffmpeg on sample frames, verify output.mp4 exists and is >1MB
Security Configuration
`.env` gitignored — OPENAI_API_KEY never committed (.gitignore blocks .env, .env.*, *.env)
`.env.example` template — Safe to commit with placeholder: `OPENAI_API_KEY=your_openai_api_key_here`
Health check — app.py verifies OPENAI_API_KEY set before launching (raises error if missing)
No secrets in code — all sensitive values loaded from environment
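The startup health check can be a few lines of stdlib code; this sketch (function name assumed, behavior as described above) also rejects the `.env.example` placeholder so a copied-but-unedited file fails fast:

```python
import os

def check_openai_key() -> None:
    """Abort startup early if OPENAI_API_KEY is absent or still a placeholder."""
    key = os.environ.get("OPENAI_API_KEY", "")
    if not key or key == "your_openai_api_key_here":
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Copy .env.example to .env and add your key."
        )
```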
Deployment Options
Vercel — Deploy Gradio as serverless function:
- Add `vercel.json` with Python runtime config
- Set OPENAI_API_KEY in Vercel Environment Variables
- Deploy: `vercel --prod`
GCP Cloud Run — Containerized deployment:
- Dockerfile provided (Python 3.13, install deps, run app.py)
- Build: `gcloud builds submit --tag gcr.io/PROJECT_ID/text-to-video`
- Deploy: `gcloud run deploy --image gcr.io/PROJECT_ID/text-to-video --set-env-vars OPENAI_API_KEY=sk-...`
- Auto-scales to zero when idle (cost-efficient)
Railway / Render — Git-push deploy:
- Connect GitHub repo
- Set start command: `python app.py`
- Add OPENAI_API_KEY as environment variable in dashboard
- Auto-deploy on git push
Cost Analysis
| Component | Cost per Video | Details |
|-----------|---------------|---------|
| DALL-E 3 standard (1024×1024) | $0.04 × 3 = $0.12 | 3 images (wide, close-up, dramatic) |
| ffmpeg encoding | $0.00 | Local processing, no API cost |
| **Total** | **$0.12** | |
For high volume (100 videos/month): $12 total vs $50 (RunwayML) or $200-500 (Runway Gen-3).
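The arithmetic behind those figures, as a quick sanity check (the $0.04 standard-image rate is taken from the table above):

```python
# Cost model: 3 DALL-E 3 standard images per video, free local ffmpeg.
DALLE3_STANDARD_PRICE = 0.04        # USD per 1024x1024 standard-quality image
IMAGES_PER_VIDEO = 3                # wide, close-up, dramatic angle

cost_per_video = IMAGES_PER_VIDEO * DALLE3_STANDARD_PRICE   # ffmpeg adds $0.00
monthly_cost = 100 * cost_per_video                         # 100 videos/month
```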
Design Decisions
Chose DALL-E 3 over video generation APIs (RunwayML, Stability AI) — roughly 4-40x cheaper ($0.12 vs $0.50-5), faster with parallelization (20s vs 60-120s), more...
Implemented parallel image generation with asyncio — 3 concurrent DALL-E 3 API calls reduce total time to ~20s (vs ~60s sequential). Uses...
Applied Ken Burns motion effect instead of static images — zoom + pan creates cinematic feel from still images. Each scene gets 28 frames (2.33s at 12 fps).
Used cross-fade transitions (3-frame dissolves) instead of hard cuts — alpha blending between scenes creates smooth, professional transitions. Overlaps...
Encoded to H.264 baseline profile with yuv420p — most compatible codec for browsers (Chrome, Safari, Firefox all support). Baseline profile ensures...
Set 12 fps instead of 24/30 fps — sufficient for Ken Burns motion (slow zoom/pan), reduces file size by 50-60%, and lowers ffmpeg encoding time. Higher...
Removed audio track (-an flag in ffmpeg) — not needed for short cinematic clips, reduces file size by 20-30%, simplifies encoding pipeline.
Generated 3 scene-specific prompts (wide/close-up/dramatic) instead of using user prompt directly — ensures variety in camera angles for storytelling....
Used Gradio for UI instead of Flask/FastAPI — simpler for ML demos (3 lines to create text input + video output), built-in file handling (video preview...
Implemented Makefile automation — 8 commands (install, run, dev, lint, format, check, clean, test) provide consistent workflow across developers,...
Added GitHub Actions CI/CD — lint/format/syntax/import/pipeline/encoding checks on every push catch errors before merge, prevent broken builds,...
Gitignored generated files (*.mp4, *.png, temp frames) — keeps repository clean, avoids large file commits (videos can be 2-5MB), regenerated locally...
Provided .env.example template — onboarding guide for new developers, shows required environment variables (OPENAI_API_KEY), safe to commit (no real...
Used ruff for linting/formatting instead of flake8/black — faster (10-100x), single tool for both lint and format, modern Python syntax support, fewer...
Tradeoffs & Constraints
DALL-E 3 standard quality instead of HD — HD costs $0.08 per 1024×1024 image (2x more expensive; the larger 1792-pixel sizes cost more still and produce bigger files), but marginal quality...
12 fps instead of 24/30 fps — reduces file size and encoding time but motion appears slightly less smooth. Acceptable for Ken Burns slow zoom/pan but...
Fixed 7-second duration (84 frames) — 3 scenes × 28 frames each. Longer videos would require more DALL-E 3 images (more cost) or stretching motion...
No audio generation — videos are silent. Adding AI-generated music (e.g., Suno, MusicGen) would cost $0.10-0.50 per track, increase complexity, and...
Local ffmpeg encoding instead of cloud transcoding (AWS MediaConvert, Zencoder) — free but requires ffmpeg installed locally. Cloud transcoding would...
Gradio UI instead of custom React frontend — simpler to build (no frontend code) but less customizable. Can't add advanced features like video...
Sequential frame generation (Ken Burns, cross-fade) instead of GPU-accelerated — CPU-based Pillow/NumPy is slower (~5s per scene) but works on any...
No video editing features — can't trim, crop, adjust speed, or add text overlays. Would need video editor integration (MoviePy, FFmpeg filters) to...
Single-user Gradio app — no authentication, user management, or video history. For multi-user deployment, would need auth (OAuth), database (PostgreSQL...
Would improve: Add audio generation (Suno, MusicGen) for background music, implement custom transition types (wipe, zoom, rotate), support longer...
Outcome & Impact
Production-ready text-to-video generator creating 7-second cinematic videos from natural language prompts in ~20 seconds with cost-efficient OpenAI...
DALL-E 3 parallel generation produces 3 distinct camera angles (wide establishing shot, close-up detail shot, dramatic low-angle shot) concurrently...
Ken Burns motion effect adds cinematic zoom + pan to each scene — 28 frames per scene with linear interpolation from 100% scale (centered) to 120%...
Cross-fade transitions provide smooth dissolves between scenes — alpha blending overlaps last 3 frames of scene N with first 3 frames of scene N+1...
ffmpeg H.264 baseline encoding ensures browser compatibility — plays in Chrome, Safari, Firefox, Edge without plugins or transcoding. Uses yuv420p...
Gradio UI enables instant experimentation — text input field for prompts, video preview player, download button, example prompts (space station,...
Cost-efficient at $0.12 per video (3 × $0.04 DALL-E 3 standard + $0.00 local ffmpeg) — 75% cheaper than RunwayML Gen-2 ($0.50/video), 95% cheaper than...
Makefile provides consistent workflow — 8 commands simplify development: `make install` (dependencies), `make run` (launch UI), `make dev`...
GitHub Actions CI/CD validates every commit — ruff lint checks code quality (unused imports, undefined names), ruff format verifies formatting...
Comprehensive documentation in README — quick start (4 commands: clone, install, configure, run), features (3-scene generation, Ken Burns, cross-fade,...
Secure .env configuration isolates secrets — OPENAI_API_KEY loaded from environment, never committed (.gitignore blocks .env, .env.*, *.env),...
Deployment options accommodate different platforms — Vercel (serverless Gradio with vercel.json config, OPENAI_API_KEY environment variable), GCP Cloud...
Video generation pipeline: text prompt → 3 scene-specific prompts (wide/close-up/dramatic with cinematic composition/lighting/angle keywords) →...
Ruff linter/formatter integration — 10-100x faster than flake8/black, single tool for both lint and format, modern Python syntax support (match...
MIT license enables open experimentation — developers, students, researchers can use, modify, and distribute for personal or commercial projects with...
Tech Stack
Language: Python 3.13+ (asyncio for parallel API calls, type hints for clarity)
Image Generation: OpenAI DALL-E 3 (1024×1024, quality='standard', style='vivid')
Image Processing: Pillow (PIL) for Ken Burns motion frames, NumPy for affine transformations
Video Encoding: ffmpeg (libx264 codec, baseline profile, yuv420p pixel format, 12 fps)
UI Framework: Gradio (text input, video preview, download, example prompts)
Automation: Makefile (8 commands: install, run, dev, lint, format, check, clean, test)
Linting/Formatting: Ruff (fast Python linter and formatter, replaces flake8/black/isort)
CI/CD: GitHub Actions (lint, format, syntax, import, pipeline, encoding tests)
Async: asyncio (concurrent DALL-E 3 API calls, parallel image downloads)
Environment: python-dotenv (load .env variables), os module (environment access)