Gesamtkunstvektoren: Perceptual Embeddings for the Performing Arts (Peter Broadwell)

UC Berkeley School of Information
Apr 8, 2026

Why It Matters

These multimodal embeddings promise new ways to index, analyze, and teach the performing arts, but their current gaps highlight the need for specialized training to unlock their full scholarly value.

Key Takeaways

  • Multimodal embeddings now align text, audio, video, and PDFs.
  • Google Gemini 2 and Meta's open-weight Perception Encoder enable cross-modal search.
  • Experiments show the embeddings capture contextual visual cues, such as an onstage Mao portrait.
  • Audio‑text similarity drops during operatic coloratura, exposing model limits.
  • Fine‑tuning could improve art‑historical inference for teaching and research.

Summary

Peter Broadwell, head of AI Modeling at Stanford Libraries, presented how recent multimodal embedding models (Google’s Gemini 2, Amazon’s Nova, OpenAI’s CLIP, Meta’s open-weight Perception Encoder, and the audio-focused CLAP) enable unified queries across text, images, video, audio, and even PDFs. He framed the discussion with Wagner’s concept of the Gesamtkunstwerk, recast in the talk’s title as “Gesamtkunstvektoren,” arguing that the performing arts have long sought a synthesis of modalities that AI now mirrors.
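
As a concrete illustration of this kind of cross-modal querying, the sketch below ranks stage images against a text prompt using OpenAI's CLIP through the Hugging Face transformers library. This is a minimal sketch, not the talk's actual pipeline: the checkpoint name is a real public one, but the image filenames and query text are placeholder assumptions.

```python
# Minimal text-to-image retrieval with CLIP via Hugging Face transformers.
# The checkpoint is real; the frame paths and query are illustrative only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder frame images extracted from a recorded performance.
frames = [Image.open(p) for p in ["frame_001.jpg", "frame_002.jpg"]]
inputs = processor(text=["negative space between two figures"],
                   images=frames, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)

# CLIP returns L2-normalized projections of both modalities into one
# shared space, so a dot product is a cosine similarity.
scores = (out.image_embeds @ out.text_embeds.T).squeeze(-1)
print(scores.argsort(descending=True))  # frame indices, best match first
```

Because both modalities land in one shared space, the same dot product supports queries in either direction, from text to images or from images to text.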

Broadwell demonstrated the models’ capabilities by searching for the Japanese aesthetic concept “ma,” retrieving images that highlighted negative space, and probing Heian-period visual art, where the system surfaced a later illustration of *The Tale of Genji*. He then applied the embeddings to John Adams’s opera *Nixon in China*, generating cosine-similarity graphs that compared audio, video, and lyric embeddings over time; these revealed strong audio-video-text alignment when a Mao portrait appears onstage, but a sharp drop during a coloratura aria.
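
A similarity timeline of that kind could be approximated as follows. This is a minimal sketch using LAION's public CLAP checkpoint in transformers, not Broadwell's code: the audio filename, the 10-second window length, and the use of a single static lyric line (here, the opening of the opera's famous coloratura aria, rather than time-aligned libretto text) are all simplifying assumptions.

```python
# Sketch of an audio-lyric similarity timeline, in the spirit of the talk's
# cosine-similarity graphs. Checkpoint is real; the filename, window size,
# and single static lyric line are illustrative assumptions.
import torch
import librosa
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# CLAP's feature extractor expects 48 kHz audio.
audio, sr = librosa.load("nixon_in_china_excerpt.wav", sr=48000)
window = 10 * sr  # 10-second analysis windows (an assumed granularity)

# One lyric line standing in for per-window libretto text.
text_inputs = processor(text=["I am the wife of Mao Tse-tung"],
                        return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

scores = []
for start in range(0, len(audio) - window + 1, window):
    segment = audio[start:start + window]
    audio_inputs = processor(audios=segment, sampling_rate=sr,
                             return_tensors="pt")
    with torch.no_grad():
        audio_emb = model.get_audio_features(**audio_inputs)
    audio_emb = audio_emb / audio_emb.norm(dim=-1, keepdim=True)
    scores.append((audio_emb @ text_emb.T).item())  # cosine per window

print(scores)  # plot against window start times for a similarity curve
```

Plotted over time, dips in such a curve flag passages, like melismatic coloratura where the sung text is hard to recover, that the audio encoder fails to align with the lyrics.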

A striking quotation highlighted Wagner’s belief that drama should dominate, with music in its service; the models mirror this today by letting the drama (text) drive visual and auditory retrieval. The Mao portrait example showed the vision transformer recognizing contextual cues, while the failure to link the “book” prop during the aria underscored current limitations in visual-semantic understanding.

Broadwell concluded that while these models excel at indexing and cross‑modal search, their inconsistent grasp of nuanced artistic elements suggests a need for domain‑specific fine‑tuning. Successful integration could transform digital humanities, offering scholars powerful tools for art‑historical research, teaching, and archival discovery.

Original Description

Deep neural models capable of encoding semantic aspects of multiple perceptual modalities offer exciting new opportunities for computational analyses of cultural expressive forms in the performing arts, particularly those that derive much of their effectiveness from the layering of different forms of media.
This presentation will detail efforts to expand existing inquiries into the analysis of recorded theater performances via the application of new technologies such as Meta’s Perception Encoder Audiovisual (PE-AV) family of models, which can generate aligned semantic vector embeddings of video, audio, and language modalities singly or jointly. These models raise the possibility of being able to augment computationally the consideration of works through the lens of the Wagnerian Gesamtkunstwerk, the notion that the worth of art can derive from the totality of its interwoven components (though first acknowledging the biases inherent in both the theory and the AI models).
Of particular interest is how these models can complement and extend previous inquiries involving AI-based stylometry of pose and action in contemporary theater, as well as prior efforts to annotate manually the degrees of intermediality in the formal sections of specific recorded performances of Japanese Noh plays.
Co-sponsored by the Berkeley Institute for Data Science, the School of Information, and the Department of Scandinavian.
Speaker
Peter Broadwell
Peter Broadwell is the head of AI modeling and inference in Research Data Services at the Stanford University Libraries, where his team’s work applies AI and machine learning, web-based visualization, and other methods of digital analysis to complex cultural data.
He has a Ph.D. in musicology from UCLA and an M.S. in computer science from the University of California, Berkeley. Recently, he has contributed to projects involving automatic translation and indexing of folklore collections in multiple languages, deep learning-based analyses of theater choreography from video sources, and web-based parsing and playback of digitized player piano rolls.