Text-to-Video AI
Text-to-video generator creating 7-second cinematic videos from natural language prompts using OpenAI DALL-E 3, Ken Burns motion effects, and ffmpeg encoding. Python 3.13+ application with Gradio UI.
Role
AI Engineer & Full-stack Developer
Team
Solo
Company/Organization
Personal Project
The Problem
Creating cinematic videos from text descriptions required expensive video generation APIs — RunwayML Gen-2 ($0.50/video), Runway Gen-3 ($2-5/video),...
Existing AI video APIs had slow sequential processing — single scene generation took 20-40s, multi-scene videos required 60-120s total. No...
Manual workflows were complex and time-consuming — generate images in DALL-E → download individually → import to video editor (Premiere, Final Cut) →...
Static image outputs lacked cinematic feel — DALL-E 3 produces high-quality images but no motion. Videos need dynamic camera movement (zoom, pan) for...
Browser compatibility issues — many video generation tools output codecs (H.265, VP9) not universally supported in browsers, resulting in 'codec not supported' playback errors.
No cost-efficient solution for multi-scene cinematic videos — needed 3+ camera angles per story (wide establishing shot, close-up for detail, dramatic...
The Solution
Built a Python application combining OpenAI DALL-E 3 for image generation with custom Ken Burns motion effects and ffmpeg encoding for browser-compatible H.264 output.
Video Generation Pipeline (5 Stages)
Prompt Engineering — User enters text description, system generates 3 scene-specific prompts:
- Wide shot: "[user prompt], wide establishing shot, cinematic composition, natural lighting"
- Close-up: "[user prompt], close-up detail shot, shallow depth of field, dramatic focus"
- Dramatic angle: "[user prompt], dramatic low-angle shot, epic composition, cinematic lighting"
- Ensures variety in camera angles for storytelling impact
Parallel Image Generation — DALL-E 3 API calls run concurrently via asyncio:
- `asyncio.gather()` sends 3 API requests simultaneously
- Each request: OpenAI DALL-E 3 standard (1024×1024, quality='standard', style='vivid')
- Downloads complete in ~20s total (vs ~60s sequential)
- Images saved temporarily for processing
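The fan-out pattern behind `asyncio.gather()` can be sketched independently of the API itself. Here `gather_images` and `_fake_fetch` are hypothetical names for illustration; in the real pipeline the fetch coroutine would wrap `AsyncOpenAI().images.generate(model="dall-e-3", prompt=p, size="1024x1024", quality="standard", style="vivid")` and return `resp.data[0].url`:

```python
import asyncio
from typing import Awaitable, Callable, List

async def gather_images(prompts: List[str],
                        fetch: Callable[[str], Awaitable[str]]) -> List[str]:
    """Run one fetch coroutine per prompt concurrently; order is preserved."""
    return await asyncio.gather(*(fetch(p) for p in prompts))

async def _fake_fetch(prompt: str) -> str:
    # Stand-in for the real DALL-E 3 call (~20 s round-trip in production).
    await asyncio.sleep(0)
    return f"https://example.com/{abs(hash(prompt))}.png"

urls = asyncio.run(gather_images(["wide", "close-up", "dramatic"], _fake_fetch))
```

Because the three requests are in flight at once, total latency is that of the slowest single call rather than the sum of all three.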
Ken Burns Motion Effect — Each image converted to 28-frame sequence:
- Start frame: image at 100% scale, centered
- End frame: image at 120% scale, panned (random direction: left/right/up/down)
- Interpolation: linear zoom + pan across 28 frames (2.33s at 12 fps)
- Implementation: Pillow (PIL) for image manipulation, NumPy for affine transformations
- Creates dynamic motion from static images
Cross-Fade Transitions — Scenes stitched with smooth dissolves:
- Scene 1 frames: 1-28 (full motion)
- Transition 1→2: frames 26-28 of scene 1 alpha-blended with frames 1-3 of scene 2
- Scene 2 frames: 29-56 (full motion)
- Transition 2→3: frames 54-56 of scene 2 blended with frames 1-3 of scene 3
- Scene 3 frames: 57-84 (full motion)
- Alpha blending: `output = img1 * (1 - alpha) + img2 * alpha` with alpha 0.33/0.66/1.0
- Total 84 frames = 7 seconds at 12 fps
Video Encoding — ffmpeg H.264 MP4 output:
- Command: `ffmpeg -framerate 12 -i frame_%04d.png -c:v libx264 -profile:v baseline -pix_fmt yuv420p -an output.mp4`
- `-profile:v baseline`: most compatible H.264 profile (plays in all browsers)
- `-pix_fmt yuv420p`: chroma subsampling for universal compatibility
- `-an`: no audio track (reduces file size, not needed for short cinematic clips)
- Output: 2-5MB MP4 file, ~500 kbps bitrate
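The ffmpeg invocation above, assembled in Python so the flags are kept in one place (`build_ffmpeg_cmd` and `encode` are illustrative names; the source lists the real entry point as `_encode_video`):

```python
import subprocess
from typing import List

def build_ffmpeg_cmd(pattern: str, output_path: str, fps: int = 12) -> List[str]:
    """Assemble the ffmpeg command for a browser-safe H.264 MP4."""
    return [
        "ffmpeg", "-y",
        "-framerate", str(fps),
        "-i", pattern,                 # e.g. frames/frame_%04d.png
        "-c:v", "libx264",
        "-profile:v", "baseline",      # broadest browser support
        "-pix_fmt", "yuv420p",         # 4:2:0 chroma, required by many players
        "-an",                         # strip the audio track
        output_path,
    ]

def encode(pattern: str, output_path: str, fps: int = 12) -> None:
    """Run the encode; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_cmd(pattern, output_path, fps), check=True)
```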
Application Architecture
app.py — Entry point, loads environment variables, launches Gradio UI:
`load_dotenv()` reads `.env` for OPENAI_API_KEY
`ui.launch()` starts Gradio server on port 7860
Server accessible at http://127.0.0.1:7860
video_generator.py — Core video pipeline logic:
`generate_video(prompt: str) -> str` — Main function orchestrating pipeline
`_generate_scene_prompts(prompt: str) -> List[str]` — Creates wide/close-up/dramatic prompts
`_generate_images_parallel(prompts: List[str]) -> List[Image]` — Async DALL-E 3 calls
`_apply_ken_burns(image: Image, direction: str) -> List[Image]` — Motion effect frames
`_create_crossfade(frames1: List[Image], frames2: List[Image]) -> List[Image]` — Transition blending
`_encode_video(frames: List[Image], output_path: str) -> None` — ffmpeg encoding
Error handling: retry logic for API failures, cleanup of temporary files
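The source mentions retry logic for API failures without detail; one common shape is exponential backoff, sketched here (the `with_retries` name and parameters are assumptions, not the project's actual implementation):

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

async def with_retries(op: Callable[[], Awaitable[T]],
                       attempts: int = 3, base_delay: float = 1.0) -> T:
    """Retry a failing async operation with exponential backoff (1s, 2s, ...)."""
    for attempt in range(attempts):
        try:
            return await op()
        except Exception:
            if attempt == attempts - 1:
                raise                  # out of attempts: surface the error
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")  # keeps type checkers happy
```

Each DALL-E 3 call would be wrapped as `await with_retries(lambda: _generate_one(...))` so transient network or rate-limit errors don't abort the whole pipeline.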
ui.py — Gradio interface definition:
`gr.Textbox(label="Prompt", placeholder="A futuristic city at sunset")` — User input
`gr.Video(label="Generated Video")` — Video preview and download
`gr.Button("Generate Video")` — Trigger generation
Examples: Pre-filled prompts ("Space station orbiting Mars", "Medieval castle in fog", "Cyberpunk street market at night")
Makefile Automation (8 commands)
`make install` — Install Python dependencies (requirements.txt), optionally create venv
`make run` — Launch Gradio UI (python app.py), opens browser to http://127.0.0.1:7860
`make dev` — Development mode with hot reload (`gradio app.py`; the Gradio CLI watches source files and reloads on change)
`make lint` — Run ruff linter on all Python files
`make format` — Auto-format with ruff (fixes style issues)
`make check` — Lint + format check + syntax validation (python -m py_compile) + import check
`make clean` — Remove generated files (*.mp4, *.png, __pycache__/, .ruff_cache/)
`make test` — Run encoding test (generate sample video, verify output exists and plays)
GitHub Actions CI/CD (.github/workflows/ci.yml)
Runs on every push and pull request:
Ruff Lint — Check code quality (unused imports, undefined names, syntax errors)
Ruff Format — Verify code follows formatting standards (fails if not formatted)
Python Syntax — Compile all .py files (python -m py_compile), catch syntax errors
Import Validation — Verify all imports resolve (import app, import video_generator, import ui)
Video Pipeline Test — Generate 3 sample images, apply Ken Burns, create transitions (verify frame count = 84)
Encoding Test — Run ffmpeg on sample frames, verify output.mp4 exists and is >1MB
Security Configuration
`.env` gitignored — OPENAI_API_KEY never committed (.gitignore blocks .env, .env.*, *.env)
`.env.example` template — Safe to commit with placeholder: `OPENAI_API_KEY=your_openai_api_key_here`
Health check — app.py verifies OPENAI_API_KEY set before launching (raises error if missing)
No secrets in code — all sensitive values loaded from environment
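The startup health check can be a few lines of stdlib code; this sketch (function name assumed, behavior as described above) also rejects the `.env.example` placeholder so a copied-but-unedited file fails fast:

```python
import os

def check_openai_key() -> None:
    """Abort startup early if OPENAI_API_KEY is absent or still a placeholder."""
    key = os.environ.get("OPENAI_API_KEY", "")
    if not key or key == "your_openai_api_key_here":
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Copy .env.example to .env and add your key."
        )
```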
Deployment Options
Vercel — Deploy Gradio as serverless function:
- Add `vercel.json` with Python runtime config
- Set OPENAI_API_KEY in Vercel Environment Variables
- Deploy: `vercel --prod`
GCP Cloud Run — Containerized deployment:
- Dockerfile provided (Python 3.13, install deps, run app.py)
- Build: `gcloud builds submit --tag gcr.io/PROJECT_ID/text-to-video`
- Deploy: `gcloud run deploy --image gcr.io/PROJECT_ID/text-to-video --set-env-vars OPENAI_API_KEY=sk-...`
- Auto-scales to zero when idle (cost-efficient)
Railway / Render — Git-push deploy:
- Connect GitHub repo
- Set start command: `python app.py`
- Add OPENAI_API_KEY as environment variable in dashboard
- Auto-deploy on git push
Cost Analysis
| Component | Cost per Video | Details |
|-----------|---------------|---------|
| DALL-E 3 standard (1024×1024) | $0.04 × 3 = $0.12 | 3 images (wide, close-up, dramatic) |
| ffmpeg encoding | $0.00 | Local processing, no API cost |
| **Total** | **$0.12** | |
For high volume (100 videos/month): $12 total vs $50 (RunwayML) or $200-500 (Runway Gen-3).
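The arithmetic behind those figures, as a quick sanity check (the $0.04 standard-image rate is taken from the table above):

```python
# Cost model: 3 DALL-E 3 standard images per video, free local ffmpeg.
DALLE3_STANDARD_PRICE = 0.04        # USD per 1024x1024 standard-quality image
IMAGES_PER_VIDEO = 3                # wide, close-up, dramatic angle

cost_per_video = IMAGES_PER_VIDEO * DALLE3_STANDARD_PRICE   # ffmpeg adds $0.00
monthly_cost = 100 * cost_per_video                         # 100 videos/month
```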
Design Decisions
Chose DALL-E 3 over video generation APIs (RunwayML, Stability AI) — roughly 4-40x cheaper ($0.12 vs $0.50-5), faster with parallelization (20s vs 60-120s), more...
Implemented parallel image generation with asyncio — 3 concurrent DALL-E 3 API calls reduce total time to ~20s (vs ~60s sequential). Uses...
Applied Ken Burns motion effect instead of static images — zoom + pan creates cinematic feel from still images. Each scene gets 28 frames (2.33s at 12 fps).
Used cross-fade transitions (3-frame dissolves) instead of hard cuts — alpha blending between scenes creates smooth, professional transitions. Overlaps...
Encoded to H.264 baseline profile with yuv420p — most compatible codec for browsers (Chrome, Safari, Firefox all support). Baseline profile ensures...
Set 12 fps instead of 24/30 fps — sufficient for Ken Burns motion (slow zoom/pan), reduces file size by 50-60%, and lowers ffmpeg encoding time. Higher...
Removed audio track (-an flag in ffmpeg) — not needed for short cinematic clips, reduces file size by 20-30%, simplifies encoding pipeline.
Generated 3 scene-specific prompts (wide/close-up/dramatic) instead of using user prompt directly — ensures variety in camera angles for storytelling....
Used Gradio for UI instead of Flask/FastAPI — simpler for ML demos (3 lines to create text input + video output), built-in file handling (video preview...
Implemented Makefile automation — 8 commands (install, run, dev, lint, format, check, clean, test) provide consistent workflow across developers,...
Added GitHub Actions CI/CD — lint/format/syntax/import/pipeline/encoding checks on every push catch errors before merge, prevent broken builds,...
Gitignored generated files (*.mp4, *.png, temp frames) — keeps repository clean, avoids large file commits (videos can be 2-5MB), regenerated locally...
Provided .env.example template — onboarding guide for new developers, shows required environment variables (OPENAI_API_KEY), safe to commit (no real...
Used ruff for linting/formatting instead of flake8/black — faster (10-100x), single tool for both lint and format, modern Python syntax support, fewer...
Tradeoffs & Constraints
DALL-E 3 standard quality instead of HD — HD costs $0.08 per 1024×1024 image (2x more expensive; the larger 1792-pixel sizes cost more still and produce bigger files), but marginal quality...
12 fps instead of 24/30 fps — reduces file size and encoding time but motion appears slightly less smooth. Acceptable for Ken Burns slow zoom/pan but...
Fixed 7-second duration (84 frames) — 3 scenes × 28 frames each. Longer videos would require more DALL-E 3 images (more cost) or stretching motion...
No audio generation — videos are silent. Adding AI-generated music (e.g., Suno, MusicGen) would cost $0.10-0.50 per track, increase complexity, and...
Local ffmpeg encoding instead of cloud transcoding (AWS MediaConvert, Zencoder) — free but requires ffmpeg installed locally. Cloud transcoding would...
Gradio UI instead of custom React frontend — simpler to build (no frontend code) but less customizable. Can't add advanced features like video...
Sequential frame generation (Ken Burns, cross-fade) instead of GPU-accelerated — CPU-based Pillow/NumPy is slower (~5s per scene) but works on any...
No video editing features — can't trim, crop, adjust speed, or add text overlays. Would need video editor integration (MoviePy, FFmpeg filters) to...
Single-user Gradio app — no authentication, user management, or video history. For multi-user deployment, would need auth (OAuth), database (PostgreSQL...
Would improve: Add audio generation (Suno, MusicGen) for background music, implement custom transition types (wipe, zoom, rotate), support longer...
Outcome & Impact
Production-ready text-to-video generator creating 7-second cinematic videos from natural language prompts in ~20 seconds with cost-efficient OpenAI...
DALL-E 3 parallel generation produces 3 distinct camera angles (wide establishing shot, close-up detail shot, dramatic low-angle shot) concurrently...
Ken Burns motion effect adds cinematic zoom + pan to each scene — 28 frames per scene with linear interpolation from 100% scale (centered) to 120%...
Cross-fade transitions provide smooth dissolves between scenes — alpha blending overlaps last 3 frames of scene N with first 3 frames of scene N+1...
ffmpeg H.264 baseline encoding ensures browser compatibility — plays in Chrome, Safari, Firefox, Edge without plugins or transcoding. Uses yuv420p...
Gradio UI enables instant experimentation — text input field for prompts, video preview player, download button, example prompts (space station,...
Cost-efficient at $0.12 per video (3 × $0.04 DALL-E 3 standard + $0.00 local ffmpeg) — 75% cheaper than RunwayML Gen-2 ($0.50/video), 95% cheaper than...
Makefile provides consistent workflow — 8 commands simplify development: `make install` (dependencies), `make run` (launch UI), `make dev`...
GitHub Actions CI/CD validates every commit — ruff lint checks code quality (unused imports, undefined names), ruff format verifies formatting...
Comprehensive documentation in README — quick start (4 commands: clone, install, configure, run), features (3-scene generation, Ken Burns, cross-fade,...
Secure .env configuration isolates secrets — OPENAI_API_KEY loaded from environment, never committed (.gitignore blocks .env, .env.*, *.env),...
Deployment options accommodate different platforms — Vercel (serverless Gradio with vercel.json config, OPENAI_API_KEY environment variable), GCP Cloud...
Video generation pipeline: text prompt → 3 scene-specific prompts (wide/close-up/dramatic with cinematic composition/lighting/angle keywords) →...
Ruff linter/formatter integration — 10-100x faster than flake8/black, single tool for both lint and format, modern Python syntax support (match...
MIT license enables open experimentation — developers, students, researchers can use, modify, and distribute for personal or commercial projects with...
Tech Stack
Language: Python 3.13+ (asyncio for parallel API calls, type hints for clarity)
Image Generation: OpenAI DALL-E 3 (1024×1024, quality='standard', style='vivid')
Image Processing: Pillow (PIL) for Ken Burns motion frames, NumPy for affine transformations
Video Encoding: ffmpeg (libx264 codec, baseline profile, yuv420p pixel format, 12 fps)
UI Framework: Gradio (text input, video preview, download, example prompts)
Automation: Makefile (8 commands: install, run, dev, lint, format, check, clean, test)
Linting/Formatting: Ruff (fast Python linter and formatter, replaces flake8/black/isort)
CI/CD: GitHub Actions (lint, format, syntax, import, pipeline, encoding tests)
Async: asyncio (concurrent DALL-E 3 API calls, parallel image downloads)
Environment: python-dotenv (load .env variables), os module (environment access)