Gesamtkunstvektoren: Perceptual Embeddings for the Performing Arts (Peter Broadwell)
Why It Matters
These multimodal embeddings promise new ways to index, analyze, and teach performing arts, but their current gaps highlight the need for specialized training to unlock full scholarly value.
Key Takeaways
- Multimodal embeddings now align text, audio, video, and PDFs.
- Google Gemini 2 and Meta's open-weight model enable cross-modal search.
- Experiments reveal embeddings capture visual cues such as a Mao portrait.
- Audio‑text similarity drops during operatic coloratura, exposing model limits.
- Fine‑tuning could improve art‑historical inference for teaching and research.
Summary
Peter Broadwell, head of AI Modeling at Stanford Libraries, presented how recent multimodal embedding models—Google’s Gemini 2, Amazon’s Nova, OpenAI’s CLIP, Meta’s open‑weight model, and audio‑focused CLAP—enable unified queries across text, images, video, audio, and even PDFs. He framed the discussion with Wagner’s Gesamtkunstwerk (“total work of art”) concept, arguing that the performing arts have long sought a synthesis of modalities, now mirrored by AI.
Broadwell demonstrated the models’ capabilities by searching for the Japanese aesthetic concept “ma,” retrieving images that highlighted negative space, and probing Heian‑period visual art, where the system surfaced a later illustration of The Tale of Genji. He then applied the embeddings to John Adams’s opera *Nixon in China*, generating cosine‑similarity graphs that compared audio, video, and lyric embeddings over time, revealing strong audio‑video‑text alignment when a Mao portrait appears, but a sharp drop during a coloratura aria.
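The cosine‑similarity curves described above can be sketched in a few lines. This is a minimal illustration, not Broadwell's actual pipeline: the per‑segment embedding arrays here are random stand‑ins for whatever an embedding model (e.g. CLIP or CLAP) would produce for time‑aligned audio, video, and lyric segments.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
n_segments, dim = 10, 16  # hypothetical: 10 time segments, 16-d embeddings

# Stand-ins for real model outputs: one embedding per segment per modality.
audio_emb = rng.normal(size=(n_segments, dim))
video_emb = rng.normal(size=(n_segments, dim))
lyric_emb = rng.normal(size=(n_segments, dim))

# Similarity curves over time, one value per segment. A dip in
# audio_text during a coloratura passage would show up here.
audio_text = [cosine_similarity(a, t) for a, t in zip(audio_emb, lyric_emb)]
video_text = [cosine_similarity(v, t) for v, t in zip(video_emb, lyric_emb)]
```

Plotting `audio_text` and `video_text` against segment time yields graphs of the kind described, where aligned moments (the Mao portrait) appear as peaks and mismatches (the coloratura aria) as dips.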
A striking quote highlighted Wagner’s belief that drama should dominate, with music serving it—mirrored today as the models let drama (text) drive visual and auditory retrieval. The Mao portrait example showed the vision transformer recognizing contextual cues, while the failure to link the “book” prop during the aria underscored current limitations in visual‑semantic understanding.
Broadwell concluded that while these models excel at indexing and cross‑modal search, their inconsistent grasp of nuanced artistic elements suggests a need for domain‑specific fine‑tuning. Successful integration could transform digital humanities, offering scholars powerful tools for art‑historical research, teaching, and archival discovery.