Company launches Marengo-2.6, a state-of-the-art multimodal foundation model, as the latest version of its Pegasus model enters open beta; in an industry first, Twelve Labs introduces native support for multimodality in a single API.
Twelve Labs, the video understanding company, today announced that it raised $50 million in Series A funding to fuel the ongoing development of its industry-leading foundation models dedicated to all aspects of video. The round was co-led by new investor New Enterprise Associates (NEA) and NVentures, NVIDIA's venture capital arm, which recently participated in Twelve Labs' strategic round. Previous investors, including Index Ventures, Radical Ventures, WndrCo, and Korea Investment Partners, also joined the round. In addition to R&D, funds will be used to nearly double headcount. Twelve Labs plans to add more than 50 employees by the end of the year.
Twelve Labs has integrated a number of NVIDIA technologies within its platform, including the NVIDIA H100 Tensor Core GPU and NVIDIA L40S GPU, as well as inference frameworks such as NVIDIA Triton Inference Server and NVIDIA TensorRT. These technologies have enabled Twelve Labs to develop first-of-their-kind foundation models for multimodal video understanding. Twelve Labs is also exploring product and research collaborations with NVIDIA to bring best-in-class multimodal foundation models and enabling frameworks to market.
"We believe the future is multimodal, and Twelve Labs is leading the charge in terms of figuring out how to make multimodal AI efficient, purposeful and enterprise grade,"
said Tiffany Luck, Partner at NEA and new Twelve Labs board member.
"The company has brought together an incredibly talented team from across the globe to solve one of the most complex and exciting problems in AI. Twelve Labs is building the future of video understanding and multimodal AI, and we're excited to support them as they execute their vision and impact our world in a positive way."
"As a core component of generative AI, multimodal video understanding is a key to delivering more robust LLMs across industries,"
said Mohamed "Sid" Siddeek, corporate vice president and head of NVentures.
"The world-class team at Twelve Labs is leveraging NVIDIA accelerated computing together with their incredible capacity for video understanding, leading to new ways for enterprise customers to take advantage of generative AI."
Pushing Video Forward
Understanding across modalities can't just be bolted on as a feature to existing LLMs. Multimodal foundation models have to be built as multimodal from inception. Other approaches try to shoehorn video understanding into an LLM paradigm by running transcript analysis and traditional computer vision separately and then gluing the results together. In contrast to other foundation model providers, Twelve Labs was created specifically for multimodal video understanding.
The company's newly released Marengo-2.6, a state-of-the-art multimodal embedding model, is unlike anything currently available to companies. Marengo-2.6 offers a pioneering approach to multimodal representation tasks, covering not just video but also image and audio, and performs any-to-any search tasks including Text-To-Video, Text-To-Image, Text-To-Audio, Audio-To-Video, Image-To-Video, and more. Marengo-2.6's unique architecture is based on the concept of "Gated Modality Experts," which processes multimodal inputs through specialized encoders before combining them into a comprehensive multimodal representation. This model represents a significant leap in video understanding technology, enabling more intuitive and comprehensive search capabilities across various media types.
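As a rough illustration of the general idea of combining specialized encoder outputs through a gate, the following minimal Python sketch may help. It is not Twelve Labs' implementation; the function names, the softmax gate, the three example modalities, and the toy vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Turn raw gate scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(expert_outputs, gate_scores):
    """Weighted sum of per-modality vectors (hypothetical gating scheme)."""
    if len(expert_outputs) != len(gate_scores):
        raise ValueError("need one gate score per expert output")
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    fused = [0.0] * dim
    for weight, vector in zip(weights, expert_outputs):
        if len(vector) != dim:
            raise ValueError("all expert outputs must share one dimension")
        for i, value in enumerate(vector):
            fused[i] += weight * value
    return fused


# Hypothetical outputs of three specialized encoders for one video clip.
visual = [1.0, 0.0, 0.0]
audio = [0.0, 1.0, 0.0]
speech = [0.0, 0.0, 1.0]

# Equal gate scores give each modality the same weight in the fused vector.
fused = gated_fusion([visual, audio, speech], gate_scores=[0.0, 0.0, 0.0])
print(fused)
```

In a trained model the gate scores would be learned rather than fixed by hand, so that the combined representation leans on whichever modality is most informative for a given input.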
Twelve Labs also opened its beta of Pegasus-1, which sets a new standard in video-language modeling. Pegasus-1 is designed to understand and articulate complex video content, transforming how we interact with and analyze multimedia. It can process and generate language from video input with exceptional accuracy and detail. The model has been continually refined since its initial closed beta release in October, and the open beta is faster and more accessible, with enhanced performance. To get there, the Twelve Labs team drastically reduced the model's size, from 80 billion parameters to 17 billion, with three jointly trained components: a video encoder, a video-language alignment model, and a language decoder. Twelve Labs will release additional flagship Pegasus models in the coming months for organizations that can support larger models.
Multimodal Embeddings API
For optimal performance, these models leverage embeddings. Most embeddings available today, however, are "unimodal text embeddings," meaning developers can work only with text-based data. To address this, Twelve Labs introduced its Embeddings API, which gives users direct access to the raw multimodal embeddings that power the existing Video Search API and Classify API. This first-of-its-kind API supports all data modalities (image, text, audio, and video), turning data into vectors in the same space, without relying on siloed solutions for each modality.
The new Embeddings API is powered by Twelve Labs' video foundation model and inference infrastructure, which are fundamentally different from approaches that process images one by one and stitch them together. By providing native support for multimodality in a single API, Twelve Labs can handle the large volume of assets its models need to understand while keeping latency low. Collectively, these advancements represent a meaningful step towards achieving Twelve Labs' mission of making videos just as easy as text.
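To make the idea of a shared vector space concrete, here is a minimal Python sketch of cross-modal search by cosine similarity. It does not call the Twelve Labs Embeddings API; the file names, the three-dimensional vectors, and the `search` helper are hypothetical stand-ins for embeddings that such an API might return.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vector, index, top_k=2):
    """Return the names of the top_k assets most similar to the query."""
    scored = [(cosine_similarity(query_vector, vector), name)
              for name, vector in index.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical embeddings for a video, an audio clip, and an image.
# Because they share one space, a single index can hold all of them.
index = {
    "goal_highlight.mp4": [0.9, 0.1, 0.0],
    "crowd_cheering.wav": [0.7, 0.6, 0.1],
    "press_conference.jpg": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a text query such as "winning goal".
text_query = [1.0, 0.2, 0.0]

results = search(text_query, index)
print(results)
```

Since every asset lives in the same space regardless of modality, the same `search` function covers text-to-video, text-to-audio, and text-to-image lookups without a separate index per modality.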
"Through our work, particularly our perceptual-reasoning research, we are solving the problems associated with multimodal AI. We seek to become the semantic encoder for all future AI agents that need to understand the world as humans do,"
said Jae Lee, co-founder and CEO of Twelve Labs.
"Marengo-2.6, Pegasus-1, and our Embeddings API mark a significant leap forward. With our Series A funding, we can invest into further research and development, hire aggressively across all roles, as well as to extend our reach and continue building partnerships with the most innovative, forward-thinking companies in existence to eliminate the boundaries of video understanding."
Customers and partners have been captivated by Twelve Labs' technology. Since debuting its platform, Twelve Labs has grown to 30,000 users who use its APIs for tasks such as semantic video search and summarization, including notable organizations in sports, media and entertainment, advertising, automotive, and security. Along the way, the company has started establishing deep industry partnerships and integrations with companies like Vidispine, EMAM, Blackbird, and more.
Media Contact
Amber D Moore, Moore Communications, 1 5039439381, [email protected]
SOURCE Twelve Labs