
Text-to-Video AI

Text-to-video generator creating 7-second cinematic videos from natural language prompts using OpenAI DALL-E 3, Ken Burns motion effects, and ffmpeg encoding. Python 3.13+ application with Gradio UI.

Python 3.13+ · OpenAI DALL-E 3 · Gradio · ffmpeg · Pillow (PIL) · NumPy · asyncio · GitHub Actions · Makefile · Ruff (linter)

Role

AI Engineer & Full-stack Developer

Team

Solo

Company/Organization

Personal Project

The Problem

Creating cinematic videos from text descriptions required expensive video generation APIs — RunwayML Gen-2 ($0.50/video), Runway Gen-3 ($2-5/video),...

Existing AI video APIs had slow sequential processing — single scene generation took 20-40s, multi-scene videos required 60-120s total. No...

Manual workflows were complex and time-consuming — generate images in DALL-E → download individually → import to video editor (Premiere, Final Cut) →...

Static image outputs lacked cinematic feel — DALL-E 3 produces high-quality images but no motion. Videos need dynamic camera movement (zoom, pan) for...

Browser compatibility issues — many video generation tools output codecs (H.265, VP9) not universally supported in browsers. Resulted in 'codec not...

No cost-efficient solution for multi-scene cinematic videos — needed 3+ camera angles per story (wide establishing shot, close-up for detail, dramatic...

The Solution

Built a Python application combining OpenAI DALL-E 3 for image generation with custom Ken Burns motion effects and ffmpeg encoding for...

Video Generation Pipeline (5 Stages)

Prompt Engineering — User enters text description, system generates 3 scene-specific prompts:

- Wide shot: "[user prompt], wide establishing shot, cinematic composition, natural lighting"

- Close-up: "[user prompt], close-up detail shot, shallow depth of field, dramatic focus"

- Dramatic angle: "[user prompt], dramatic low-angle shot, epic composition, cinematic lighting"

- Ensures variety in camera angles for storytelling impact
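The prompt-expansion step above can be sketched as a small pure function. This is a minimal sketch, not the project's actual code; the function name mirrors the `_generate_scene_prompts` helper listed later, and the suffixes are taken from the three templates above.

```python
from typing import List

# Scene-specific suffixes from the templates above (wide / close-up / dramatic).
SCENE_SUFFIXES = [
    "wide establishing shot, cinematic composition, natural lighting",
    "close-up detail shot, shallow depth of field, dramatic focus",
    "dramatic low-angle shot, epic composition, cinematic lighting",
]

def _generate_scene_prompts(prompt: str) -> List[str]:
    """Expand one user prompt into three scene-specific prompts."""
    return [f"{prompt}, {suffix}" for suffix in SCENE_SUFFIXES]

prompts = _generate_scene_prompts("A futuristic city at sunset")
```

Keeping this step a pure string transform makes it trivial to unit-test without touching the OpenAI API.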

Parallel Image Generation — DALL-E 3 API calls run concurrently via asyncio:

- `asyncio.gather()` sends 3 API requests simultaneously

- Each request: OpenAI DALL-E 3 standard (1024×1024, quality='standard', style='vivid')

- Downloads complete in ~20s total (vs ~60s sequential)

- Images saved temporarily for processing
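The concurrency pattern can be sketched as follows. This is an illustration only: `generate_image` here simulates a DALL-E 3 request with `asyncio.sleep`, whereas the real pipeline would await the OpenAI client instead. The speedup comes from `asyncio.gather` dispatching all three requests at once.

```python
import asyncio
from typing import List

async def generate_image(prompt: str) -> str:
    # Stand-in for a ~20s DALL-E 3 round-trip; the real code awaits
    # the OpenAI API here and downloads the resulting image.
    await asyncio.sleep(0.1)
    return f"image for: {prompt}"

async def generate_images_parallel(prompts: List[str]) -> List[str]:
    # All requests start immediately; total wall time is roughly the
    # slowest single request, not the sum of all three.
    return await asyncio.gather(*(generate_image(p) for p in prompts))

images = asyncio.run(generate_images_parallel(["wide", "close-up", "dramatic"]))
```

With three ~20s requests, the gathered version finishes in about 20s instead of about 60s sequentially, matching the timing above.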

Ken Burns Motion Effect — Each image converted to 28-frame sequence:

- Start frame: image at 100% scale, centered

- End frame: image at 120% scale, panned (random direction: left/right/up/down)

- Interpolation: linear zoom + pan across 28 frames (2.33s at 12 fps)

- Implementation: Pillow (PIL) for image manipulation, NumPy for affine transformations

- Creates dynamic motion from static images
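The interpolation geometry can be sketched without touching image data: for each of the 28 frames, compute the crop box that a Pillow-based implementation would cut from the source image before resizing back to full resolution. This is a sketch under stated assumptions — a fixed rightward pan, where the real pipeline picks a random direction.

```python
FRAMES = 28
SIZE = 1024  # DALL-E 3 output is 1024x1024

def ken_burns_boxes(frames: int = FRAMES, size: int = SIZE):
    """Crop box (left, top, right, bottom) per frame for a zoom+pan."""
    boxes = []
    for i in range(frames):
        t = i / (frames - 1)        # 0.0 -> 1.0 across the clip
        scale = 1.0 + 0.2 * t       # linear zoom from 100% to 120%
        crop = size / scale         # visible window shrinks as we zoom in
        max_offset = size - crop    # slack available for panning
        left = max_offset * t       # drift right over the clip
        top = max_offset / 2        # stay vertically centered
        boxes.append((left, top, left + crop, top + crop))
    return boxes

boxes = ken_burns_boxes()
```

Cropping each box and resizing it back to 1024×1024 (e.g. with Pillow's `Image.crop` and `Image.resize`) yields the 28-frame motion sequence described above.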

Cross-Fade Transitions — Scenes stitched with smooth dissolves:

- Scene 1 frames: 1-28 (full motion)

- Transition 1→2: frames 26-28 of scene 1 alpha-blended with frames 1-3 of scene 2

- Scene 2 frames: 29-56 (full motion)

- Transition 2→3: frames 54-56 of scene 2 blended with frames 1-3 of scene 3

- Scene 3 frames: 57-84 (full motion)

- Alpha blending: `output = img1 * (1 - alpha) + img2 * alpha` with alpha 0.33/0.66/1.0

- Total 84 frames = 7 seconds at 12 fps
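The blending formula above can be sketched on toy data. Frames are shown here as flat lists of pixel intensities; the real pipeline applies the same formula to full RGB arrays with NumPy.

```python
ALPHAS = [0.33, 0.66, 1.0]  # alpha ramp across the 3 overlapping frames

def crossfade(tail, head, alphas=ALPHAS):
    """Blend the last len(alphas) frames of scene N with the first
    len(alphas) frames of scene N+1: out = img1*(1-a) + img2*a."""
    blended = []
    for a, f1, f2 in zip(alphas, tail, head):
        blended.append([p1 * (1 - a) + p2 * a for p1, p2 in zip(f1, f2)])
    return blended

# Toy example: scene 1 ends on black frames, scene 2 opens on white.
out = crossfade([[0, 0]] * 3, [[255, 255]] * 3)
```

The final overlap frame (alpha 1.0) is entirely the incoming scene, so the dissolve lands cleanly on scene N+1.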

Video Encoding — ffmpeg H.264 MP4 output:

- Command: `ffmpeg -framerate 12 -i frame_%04d.png -c:v libx264 -profile:v baseline -pix_fmt yuv420p -an output.mp4`

- `-profile:v baseline`: most compatible H.264 profile (plays in all browsers)

- `-pix_fmt yuv420p`: chroma subsampling for universal compatibility

- `-an`: no audio track (reduces file size, not needed for short cinematic clips)

- Output: 2-5MB MP4 file, ~500 kbps bitrate
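Invoked from Python, the command above is typically built as an argument list for `subprocess`. Running it assumes ffmpeg is on PATH, so this sketch only constructs the list; the invocation is left commented out.

```python
import subprocess

def ffmpeg_cmd(frame_pattern: str = "frame_%04d.png",
               fps: int = 12,
               out: str = "output.mp4") -> list:
    """Build the H.264 encoding command described above."""
    return [
        "ffmpeg",
        "-framerate", str(fps),    # input frame rate (12 fps)
        "-i", frame_pattern,       # numbered PNG frames
        "-c:v", "libx264",         # H.264 encoder
        "-profile:v", "baseline",  # most browser-compatible profile
        "-pix_fmt", "yuv420p",     # chroma subsampling for universal playback
        "-an",                     # drop the audio track
        out,
    ]

cmd = ffmpeg_cmd()
# subprocess.run(cmd, check=True)  # uncomment when ffmpeg is installed
```

Passing a list (rather than a shell string) avoids quoting issues with the `%04d` frame pattern.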

Application Architecture

app.py — Entry point, loads environment variables, launches Gradio UI:

`load_dotenv()` reads `.env` for OPENAI_API_KEY

`ui.launch()` starts Gradio server on port 7860

Server accessible at http://127.0.0.1:7860

video_generator.py — Core video pipeline logic:

`generate_video(prompt: str) -> str` — Main function orchestrating the pipeline

`_generate_scene_prompts(prompt: str) -> List[str]` — Creates wide/close-up/dramatic prompts

`_generate_images_parallel(prompts: List[str]) -> List[Image]` — Async DALL-E 3 calls

`_apply_ken_burns(image: Image, direction: str) -> List[Image]` — Motion effect frames

`_create_crossfade(frames1: List[Image], frames2: List[Image]) -> List[Image]` — Transition blending

`_encode_video(frames: List[Image], output_path: str) -> None` — ffmpeg encoding

Error handling: retry logic for API failures, cleanup of temporary files

ui.py — Gradio interface definition:

`gr.Textbox(label="Prompt", placeholder="A futuristic city at sunset")` — User input

`gr.Video(label="Generated Video")` — Video preview and download

`gr.Button("Generate Video")` — Trigger generation

Examples: Pre-filled prompts ("Space station orbiting Mars", "Medieval castle in fog", "Cyberpunk street market at night")

Makefile Automation (8 commands)

`make install` — Install Python dependencies (requirements.txt), optionally create venv

`make run` — Launch Gradio UI (python app.py), opens browser to http://127.0.0.1:7860

`make dev` — Development mode with auto-reload (gradio app.py --reload)

`make lint` — Run ruff linter on all Python files

`make format` — Auto-format with ruff (fixes style issues)

`make check` — Lint + format check + syntax validation (python -m py_compile) + import check

`make clean` — Remove generated files (*.mp4, *.png, __pycache__/, .ruff_cache/)

`make test` — Run encoding test (generate sample video, verify output exists and plays)

GitHub Actions CI/CD (.github/workflows/ci.yml)

Runs on every push and pull request:

Ruff Lint — Check code quality (unused imports, undefined names, syntax errors)

Ruff Format — Verify code follows formatting standards (fails if not formatted)

Python Syntax — Compile all .py files (python -m py_compile), catch syntax errors

Import Validation — Verify all imports resolve (import app, import video_generator, import ui)

Video Pipeline Test — Generate 3 sample images, apply Ken Burns, create transitions (verify frame count = 84)

Encoding Test — Run ffmpeg on sample frames, verify output.mp4 exists and is >1MB

Security Configuration

`.env` gitignored — OPENAI_API_KEY never committed (.gitignore blocks .env, .env.*, *.env)

`.env.example` template — Safe to commit with placeholder: `OPENAI_API_KEY=your_openai_api_key_here`

Health check — app.py verifies OPENAI_API_KEY is set before launching (raises an error if missing)

No secrets in code — all sensitive values loaded from environment
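The startup health check can be sketched as below. This is a minimal sketch, not the project's exact code: it fails fast when the key is absent instead of erroring mid-generation, and takes the environment mapping as a parameter so it can be exercised without touching the real environment.

```python
import os

def check_api_key(env=os.environ) -> str:
    """Return OPENAI_API_KEY or raise before the UI launches."""
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; copy .env.example to .env "
            "and add your key."
        )
    return key

# Illustration against explicit mappings rather than the real environment:
try:
    check_api_key(env={})
    missing_key_raises = False
except RuntimeError:
    missing_key_raises = True
```

Calling this before `ui.launch()` means a misconfigured deployment fails immediately with a clear message.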

Deployment Options

Vercel — Deploy Gradio as serverless function:

- Add `vercel.json` with Python runtime config

- Set OPENAI_API_KEY in Vercel Environment Variables

- Deploy: `vercel --prod`

GCP Cloud Run — Containerized deployment:

- Dockerfile provided (Python 3.13, install deps, run app.py)

- Build: `gcloud builds submit --tag gcr.io/PROJECT_ID/text-to-video`

- Deploy: `gcloud run deploy --image gcr.io/PROJECT_ID/text-to-video --set-env-vars OPENAI_API_KEY=sk-...`

- Auto-scales to zero when idle (cost-efficient)

Railway / Render — Git-push deploy:

- Connect GitHub repo

- Set start command: `python app.py`

- Add OPENAI_API_KEY as environment variable in dashboard

- Auto-deploy on git push

Cost Analysis

| Component | Cost per Video | Details |
|-----------|----------------|---------|
| DALL-E 3 standard (1024×1024) | $0.04 × 3 = $0.12 | 3 images... |

For high volume (100 videos/month): $12 total vs $50 (RunwayML) or $200-500 (Runway Gen-3).
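The arithmetic behind these numbers is straightforward; the sketch below reproduces it, assuming the $0.04-per-image standard-quality price quoted above and free local ffmpeg encoding.

```python
PRICE_PER_IMAGE = 0.04   # DALL-E 3 standard, 1024x1024 (from the table above)
IMAGES_PER_VIDEO = 3     # wide / close-up / dramatic scenes

def generation_cost(videos: int) -> float:
    """Total DALL-E 3 cost for a batch; ffmpeg encoding adds nothing."""
    return videos * IMAGES_PER_VIDEO * PRICE_PER_IMAGE

per_video = generation_cost(1)
per_100 = generation_cost(100)
```

At 100 videos/month this gives the $12 figure above, versus $50 for RunwayML Gen-2 at $0.50/video.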

Design Decisions

Chose DALL-E 3 over video generation APIs (RunwayML, Stability AI) — 4-40x cheaper ($0.12 vs $0.50-5), faster with parallelization (20s vs 60-120s), more...

Implemented parallel image generation with asyncio — 3 concurrent DALL-E 3 API calls reduce total time to ~20s (vs ~60s sequential). Uses...

Applied Ken Burns motion effect instead of static images — zoom + pan creates a cinematic feel from still images. Each scene gets 28 frames (2.33s at 12...

Used cross-fade transitions (3-frame dissolves) instead of hard cuts — alpha blending between scenes creates smooth, professional transitions. Overlaps...

Encoded to H.264 baseline profile with yuv420p — the most compatible codec for browsers (Chrome, Safari, and Firefox all support it). Baseline profile ensures...

Set 12 fps instead of 24/30 fps — sufficient for Ken Burns motion (slow zoom/pan), reduces file size by 50-60%, and lowers ffmpeg encoding time. Higher...

Removed audio track (-an flag in ffmpeg) — not needed for short cinematic clips, reduces file size by 20-30%, simplifies the encoding pipeline.

Generated 3 scene-specific prompts (wide/close-up/dramatic) instead of using the user prompt directly — ensures variety in camera angles for storytelling....

Used Gradio for UI instead of Flask/FastAPI — simpler for ML demos (3 lines to create text input + video output), built-in file handling (video preview...

Implemented Makefile automation — 8 commands (install, run, dev, lint, format, check, clean, test) provide a consistent workflow across developers,...

Added GitHub Actions CI/CD — lint/format/syntax/import/pipeline/encoding checks on every push catch errors before merge, prevent broken builds,...

Gitignored generated files (*.mp4, *.png, temp frames) — keeps the repository clean, avoids large file commits (videos can be 2-5MB), regenerated locally...

Provided .env.example template — onboarding guide for new developers, shows required environment variables (OPENAI_API_KEY), safe to commit (no real...

Used ruff for linting/formatting instead of flake8/black — faster (10-100x), single tool for both lint and format, modern Python syntax support, fewer...

Tradeoffs & Constraints

DALL-E 3 standard quality instead of HD — HD costs $0.08 per image (2x more expensive) and supports larger outputs (up to 1792×1024), but marginal quality...

12 fps instead of 24/30 fps — reduces file size and encoding time, but motion appears slightly less smooth. Acceptable for Ken Burns slow zoom/pan but...

Fixed 7-second duration (84 frames) — 3 scenes × 28 frames each. Longer videos would require more DALL-E 3 images (more cost) or stretching motion...

No audio generation — videos are silent. Adding AI-generated music (e.g., Suno, MusicGen) would cost $0.10-0.50 per track, increase complexity, and...

Local ffmpeg encoding instead of cloud transcoding (AWS MediaConvert, Zencoder) — free but requires ffmpeg installed locally. Cloud transcoding would...

Gradio UI instead of a custom React frontend — simpler to build (no frontend code) but less customizable. Can't add advanced features like video...

Sequential frame generation (Ken Burns, cross-fade) instead of GPU-accelerated — CPU-based Pillow/NumPy is slower (~5s per scene) but works on any...

No video editing features — can't trim, crop, adjust speed, or add text overlays. Would need video editor integration (MoviePy, FFmpeg filters) to...

Single-user Gradio app — no authentication, user management, or video history. For multi-user deployment, would need auth (OAuth), a database (PostgreSQL...

Would improve: Add audio generation (Suno, MusicGen) for background music, implement custom transition types (wipe, zoom, rotate), support longer...

Outcome & Impact

Production-ready text-to-video generator creating 7-second cinematic videos from natural language prompts in ~20 seconds with cost-efficient OpenAI...

DALL-E 3 parallel generation produces 3 distinct camera angles (wide establishing shot, close-up detail shot, dramatic low-angle shot) concurrently...

Ken Burns motion effect adds cinematic zoom + pan to each scene — 28 frames per scene with linear interpolation from 100% scale (centered) to 120%...

Cross-fade transitions provide smooth dissolves between scenes — alpha blending overlaps the last 3 frames of scene N with the first 3 frames of scene N+1...

ffmpeg H.264 baseline encoding ensures browser compatibility — plays in Chrome, Safari, Firefox, Edge without plugins or transcoding. Uses yuv420p...

Gradio UI enables instant experimentation — text input field for prompts, video preview player, download button, example prompts (space station,...

Cost-efficient at $0.12 per video (3 × $0.04 DALL-E 3 standard + $0.00 local ffmpeg) — 75% cheaper than RunwayML Gen-2 ($0.50/video), 95% cheaper than...

Makefile provides a consistent workflow — 8 commands simplify development: `make install` (dependencies), `make run` (launch UI), `make dev`...

GitHub Actions CI/CD validates every commit — ruff lint checks code quality (unused imports, undefined names), ruff format verifies formatting...

Comprehensive documentation in README — quick start (4 commands: clone, install, configure, run), features (3-scene generation, Ken Burns, cross-fade,...

Secure .env configuration isolates secrets — OPENAI_API_KEY loaded from environment, never committed (.gitignore blocks .env, .env.*, *.env),...

Deployment options accommodate different platforms — Vercel (serverless Gradio with vercel.json config, OPENAI_API_KEY environment variable), GCP Cloud...

Video generation pipeline: text prompt → 3 scene-specific prompts (wide/close-up/dramatic with cinematic composition/lighting/angle keywords) →...

Ruff linter/formatter integration — 10-100x faster than flake8/black, single tool for both lint and format, modern Python syntax support (match...

MIT license enables open experimentation — developers, students, researchers can use, modify, and distribute for personal or commercial projects with...

Tech Stack

Language: Python 3.13+ (asyncio for parallel API calls, type hints for clarity)

Image Generation: OpenAI DALL-E 3 (1024×1024, quality='standard', style='vivid')

Image Processing: Pillow (PIL) for Ken Burns motion frames, NumPy for affine transformations

Video Encoding: ffmpeg (libx264 codec, baseline profile, yuv420p pixel format, 12 fps)

UI Framework: Gradio (text input, video preview, download, example prompts)

Automation: Makefile (8 commands: install, run, dev, lint, format, check, clean, test)

Linting/Formatting: Ruff (fast Python linter and formatter, replaces flake8/black/isort)

CI/CD: GitHub Actions (lint, format, syntax, import, pipeline, encoding tests)

Async: asyncio (concurrent DALL-E 3 API calls, parallel image downloads)

Environment: python-dotenv (load .env variables), os module (environment access)
