Build Visual AI Agents

Andrew Ng
May 20, 2026

Why It Matters

Visual AI agents enable rapid creation of high-quality media, which can reduce production costs and shorten time-to-market for brands and developers. Inconsistent output is a critical hurdle in generative media adoption, and evaluation techniques are how teams keep quality consistent at scale.

Key Takeaways

  • Course teaches evaluation pipelines using SigLIP, LLM judges, and rubrics.
  • Build image agent that converts brand guidelines into UI mockups.
  • Create video agent for multi‑scene explainers with synchronized audio.
  • Master prompt engineering for high‑quality image and video generation.
  • Use Gemini CLI to build generative media applications from natural language.

Pulse Analysis

The demand for visual content has surged as companies seek engaging assets for websites, ads, and explainer videos. While models like Google's Nano Banana and Veo can produce impressive images and clips from a single prompt, the real challenge lies in maintaining quality across dozens or hundreds of outputs, where there is no single correct answer to compare against. This course tackles that gap by teaching three complementary evaluation methods: SigLIP image-text similarity scores, LLM-based judges, and structured rubrics. Together they let developers automate quality checks and iterate rapidly.
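As a rough illustration of how the three evaluation signals might be combined into one automated quality gate, consider the sketch below. It is not the course's pipeline: the `EvalResult` type, the rubric criteria and weights, and the thresholds are all hypothetical, and the similarity and judge scores are passed in as plain numbers rather than computed by a real model.

```python
from dataclasses import dataclass

# Hypothetical rubric: criterion name -> weight (weights sum to 1.0).
RUBRIC_WEIGHTS = {"brand_colors": 0.4, "layout": 0.3, "legibility": 0.3}


@dataclass
class EvalResult:
    similarity: float  # image-text similarity in [0, 1], e.g. from a SigLIP-style model
    judge: float       # LLM-judge score in [0, 1]
    rubric: float      # weighted rubric score in [0, 1]
    passed: bool       # True only if every signal clears its threshold


def rubric_score(ratings: dict) -> float:
    """Weighted average of per-criterion ratings, each expected in [0, 1]."""
    return sum(weight * ratings[name] for name, weight in RUBRIC_WEIGHTS.items())


def evaluate(similarity: float, judge: float, ratings: dict,
             min_similarity: float = 0.5,
             min_judge: float = 0.7,
             min_rubric: float = 0.75) -> EvalResult:
    """Gate an output on all three signals (thresholds are illustrative assumptions)."""
    rubric = rubric_score(ratings)
    passed = (similarity >= min_similarity
              and judge >= min_judge
              and rubric >= min_rubric)
    return EvalResult(similarity, judge, rubric, passed)
```

Requiring every signal to pass, rather than averaging them, keeps a strong score on one signal from hiding a failure on another. Whether that is the right trade-off depends on the use case.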

Beyond evaluation, the course covers prompt engineering techniques that combine LLM-enhanced prompting with reference images and starting frames. These techniques help steer generative models toward brand-consistent visuals, which can reduce the need for costly post-production editing. The hands-on labs guide learners through building an image agent that translates brand guidelines into UI mockups, and a video agent that plans multi-scene explainers, animates reference frames with synchronized audio, and checks temporal consistency across scenes.
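The description does not say how temporal consistency is measured. One common proxy, sketched here purely as an assumption, is the cosine similarity between embeddings of adjacent frames or scenes, flagging any transition that drops below a threshold. The function names and the 0.8 threshold are hypothetical, and the embeddings are plain lists standing in for the output of whatever image encoder is used.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


def adjacent_similarities(embeddings):
    """Similarity of each frame embedding to the one that follows it."""
    return [cosine(a, b) for a, b in zip(embeddings, embeddings[1:])]


def flag_inconsistent(embeddings, threshold=0.8):
    """Indices i where the transition from frame i to frame i + 1 falls below threshold."""
    sims = adjacent_similarities(embeddings)
    return [i for i, sim in enumerate(sims) if sim < threshold]
```

A flagged index points at the transition to regenerate or review, so a single bad scene does not force a full rerun.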

Finally, the course introduces Gemini CLI, which learners use to build a generative media agent from natural-language prompts and to package what they have learned into reusable agent skills. This lets developers prototype custom media pipelines with less engineering effort, speeding up product demos and internal workflows. As visual AI moves from experimental to operational, professionals equipped with both generation and rigorous evaluation skills will be positioned to lead the next wave of automated content creation.

Original Description

Join our new short course, AI Agents for Image and Video Generation, built in partnership with Google and taught by Katie Nguyen, Developer Relations Engineer at Google Cloud AI, and Wafae Bakkali, Staff Generative AI Specialist at Google.
Most agents you've worked with probably produce text. But whether you're building a product demo, a website asset, or an explainer video, you're working with visual media. With models like Google's Nano Banana for images and Veo for video, generating a single output from a prompt is straightforward. The harder problem is producing high-quality results consistently at scale, and the bottleneck there is evaluation: there is no single correct answer to compare against, so quality depends on context and use case.
In this course, you'll learn three complementary evaluation techniques, then combine them with image and video generation to build autonomous media agents. You'll build an image agent that turns brand guidelines into UI mockups, and a video agent that plans multi-scene explainers, animates reference frames with synchronized audio, and checks consistency across scenes. In the final lesson, you'll use Gemini CLI to build a generative media agent in natural language, packaging what you've learned into reusable agent skills.
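To make the idea of a reference-free score concrete, here is a minimal sketch of a SigLIP-style image-text similarity: both embeddings are L2-normalized, and their scaled, shifted dot product is passed through a sigmoid to give a per-pair score between 0 and 1. The embeddings here are toy vectors, and the `logit_scale` and `logit_bias` defaults are illustrative placeholders. In a trained model both are learned parameters, and the embeddings come from the image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def sigmoid_similarity(image_emb, text_emb, logit_scale=10.0, logit_bias=-5.0):
    """SigLIP-style pairwise score: sigmoid(scale * cosine + bias).

    logit_scale and logit_bias are placeholder values for illustration only.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    cos = sum(a * b for a, b in zip(img, txt))
    logit = logit_scale * cos + logit_bias
    return 1.0 / (1.0 + math.exp(-logit))
```

Because each image-text pair is scored independently, the result can be thresholded on its own, without ranking it against other candidate captions.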
In detail, you'll:
- Get a clear mental model of the generative media landscape and the architectures behind image, video, and audio generation.
- Engineer prompts for high-quality images and video, using techniques like LLM-enhanced prompting, reference images, and starting frames.
- Build evaluation pipelines that combine SigLIP image-text similarity scores, LLM-based judges, and structured rubrics to assess output at scale.
- Build an image agent that turns brand guidelines into UI mockups, generating, evaluating, and iterating until designs pass your bar.
- Build a video agent that plans multi-scene explainers, generates and animates reference frames with audio, and evaluates temporal consistency.
- Package what you've learned into reusable agent skills, and use Gemini CLI to build a generative media application from natural language prompts.
By the end, you'll be ready to build agents that generate visual media, evaluate it, and iterate to improve outputs.
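The generate, evaluate, and iterate cycle described above can be sketched as a small loop. This is an assumption-laden outline rather than the course's implementation: `generate`, `evaluate`, and `revise` are hypothetical callables supplied by the caller (for example, wrappers around an image model, an evaluation pipeline, and an LLM that rewrites the prompt), and the threshold and round limit are arbitrary.

```python
from typing import Callable, Tuple


def refine_until_pass(
    prompt: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str], Tuple[float, str]],
    revise: Callable[[str, str], str],
    threshold: float = 0.8,
    max_rounds: int = 3,
) -> Tuple[str, float, int]:
    """Generate, score, and revise until the score passes or rounds run out.

    Returns (best_output, best_score, rounds_used).
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    best_output, best_score = "", float("-inf")
    rounds_used = 0
    for _ in range(max_rounds):
        rounds_used += 1
        output = generate(prompt)
        score, feedback = evaluate(output)
        # Keep the best attempt so far in case no round clears the bar.
        if score > best_score:
            best_output, best_score = output, score
        if score >= threshold:
            break
        # Feed the evaluator's feedback into the next prompt.
        prompt = revise(prompt, feedback)
    return best_output, best_score, rounds_used
```

Capping the number of rounds bounds cost when the evaluator never passes an output, and returning the best attempt gives a human reviewer something to work from.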
