ASID-Caption: Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions

ASID-1M, ASID-Captioner, ASID-Verify, Attribute-Structured, Quality-Verified, Audiovisual

Yunheng Li¹, Hengrui Zhang¹, Meng-Hao Guo³, Wenzhao Gao², Shaoyong Jia², Shaohui Jiao², Qibin Hou¹,†, Ming-Ming Cheng¹

¹VCIP, School of Computer Science, Nankai University ²ByteDance Inc. ³Tsinghua University

†Corresponding author

Abstract

Universal video understanding requires modeling fine-grained visual and audio information over time in diverse real-world scenarios. However, existing models are constrained by video-instruction data that reduces complex audiovisual content to single, incomplete captions lacking fine-grained organization and reliable annotation. We introduce (i) ASID-1M, an open-source collection of one million attribute-structured audiovisual instructions with single- and multi-attribute supervision; (ii) ASID-Verify, a scalable curation pipeline that performs automatic verification and refinement to enforce semantic and temporal consistency with audiovisual evidence; and (iii) ASID-Captioner, a video understanding model trained via supervised fine-tuning on ASID-1M. Experiments across seven benchmarks show improved fine-grained caption quality, reduced hallucinations, and better instruction following, with state-of-the-art results among open-source models and performance competitive with Gemini-3-Pro.

Motivation

Existing video instruction datasets often describe each video with a single unstructured caption, which leads to incomplete descriptions and makes controllable, fine-grained understanding hard to learn. Without systematic verification, simply lengthening captions can introduce more hallucinations.

Our key idea is to provide attribute-structured supervision and verify each attribute against audiovisual evidence, enabling more reliable fine-grained learning.

Video Demo

Response:
At 0s, the video opens on the deck of a grand ship bathed in warm golden sunset light, with white railings framing the scene and a sky painted in orange and pink hues; a gentle melancholic orchestral score plays softly in the background. A woman with curly reddish-brown hair styled in an updo adorned with small earrings and a delicate necklace stands beside a man in a dark coat over a crisp white shirt, their hands clasped gently as they gaze at each other. At 4s, the man speaks in a calm, reassuring tone: “Now close your eyes.” The woman responds softly, “Go on,” then closes her eyes slowly, her expression filled with trust. At 12s, he guides her hand to the metal railing and instructs her to hold on firmly. At 25s, he encourages her to step up onto the rail while supporting her balance. An overhead shot captures the pair standing side by side, arms outstretched like birds ready to take flight. At 34s, he asks, “Do you trust me?” She replies confidently, “I trust you.” At 46s, he invites her to open her eyes. She gasps in awe as the ocean shimmers below in the fading sunset, exclaiming joyfully, “I’m flying, Jack!” The camera alternates between intimate close-ups and sweeping wide shots, reinforcing the romantic and adventurous atmosphere.

Response:
At 0s, the video opens with Elsa standing alone on a snow-covered mountain peak at night, surrounded by dark jagged mountains under a starry sky; she wears a teal dress with black sleeves, a purple cape that flows behind her, and a small golden crown, her posture tense as she looks around. The audio features a gentle melancholic piano melody creating a somber introspective atmosphere, and a female voice soft and filled with quiet sorrow sings, 'The snow is white on the mountain tonight Not a footprint to be seen A kingdom of isolation And it looks like I'm the queen.' At 5s, she takes slow deliberate steps forward, her expression shifting from contemplative to vulnerable as she glances down then back up, the wind rustling softly through the snowflakes falling gently around her. At 14s, the camera zooms in closer to her face, capturing her furrowed brow and trembling lips as she speaks, 'The wind is howling like this swirling storm inside Couldn't keep it in, heaven knows I tried Don't let them in, don't let them see Be the good girl you always have to be Conceal, don't feel, don't let them know Well now they know Let it go,' her hands clasped tightly in front of her chest before she gestures outward with one hand, her eyes wide with surprise. At 39s, she raises both arms high above her head, her cape billowing in the wind, and a bright blue light erupts from her palms, forming intricate swirling patterns of ice and snow that dance across the sky; the camera pulls back to show her full figure against the vast snowy landscape, emphasizing the scale of her power. At 42s, she continues speaking, 'Let it go Can't hold it back anymore Let it go Let it go Turn away and slam the door,' her voice growing more powerful and determined as the music swells into an uplifting anthemic chorus, and the camera shifts between close-ups of her expressive face showing relief and joy and wide shots of the swirling ice patterns filling the frame. 
At 56s, she spreads her arms wide, embracing the moment, her face lighting up with unbridled joy as the ice swirls continue to spin around her; the camera captures her from multiple angles, highlighting the contrast between her earlier vulnerability and current exuberance, while the wind carries snowflakes around her, enhancing the magical serene atmosphere. The video ends with a final shot of Elsa looking directly at the camera, her expression one of pure, unadulterated happiness and freedom, as the swirling ice patterns fade into the background.

Response:
At 0s, the video opens with Thanos-a purple-skinned, bald figure clad in cracked, metallic armor-standing in a desolate, war-torn landscape where smoldering fires flicker against a hazy sky dotted with distant city lights; his right hand is raised, palm open, revealing the Infinity Gauntlet adorned with glowing stones, while his left hand rests on his hip, exuding quiet confidence. The audio begins with a deep, menacing voice speaking slowly and deliberately: 'I am inevitable.' At 1s, Thanos clenches his gauntlet's fingers into a fist, then relaxes them again, maintaining an unwavering gaze forward. At 5s, he shifts his stance slightly, still holding the gauntlet aloft. At 10s, the scene transitions to Tony Stark-Iron Man-kneeling amidst rubble and smoke in the same devastated setting; his red-and-gold suit is damaged, with cracks spiderwebbing across the chest arc reactor visible, and his face bears faint scratches and bruises. He raises his left hand, which also holds the Infinity Gauntlet, and the stones within begin to glow brighter-first a soft blue, then shifting through hues of green, yellow, pink, and purple-each stone pulsing with energy that crackles around his fingers. Stark's expression mixes determination and desperation, his posture tense yet resolute as he struggles to control the gauntlet's power. At 20s, the camera cuts back to Thanos, who remains motionless but his eyes narrow slightly, hinting at growing interest or concern. Stark continues to manipulate the gauntlet, his movements deliberate and strained, as the stones' colors shift dynamically. At 34s, Stark speaks in a low, strained voice filled with awe and fear: 'Iron Man,' before being cut off by a loud, explosive sound like a powerful energy blast followed by a sharp crack. The screen fades to white, leaving the outcome unresolved. 
Throughout the sequence, the camera alternates between close-ups of the gauntlet's glowing stones highlighting their vibrant colors and Stark's focused, pained expressions and wider shots of the ruined landscape emphasizing the scale of destruction, creating a tense atmosphere charged with impending conflict. The emotional tone shifts from Thanos' calm dominance to Stark's desperate struggle, underscoring the high stakes of the moment.

Response:
At 0s, the video opens with a chaotic crowd of photographers holding cameras, their flashes illuminating a dark night. At 2s, the scene shifts to a close-up of Detective Conan Kudo Shinichi-a high school detective with spiky black hair, wearing a red bow tie and suit-standing against a starry sky dotted with pink laser beams, his expression determined. At 4s, a newspaper headline introduces him: 'His name is Kudo Shinichi,' followed by another headline reading "High School Student Detective Solves Another Case." At 5s, he says, 'I'm a high school detective, Kudo Shinichi.' At 8s, the setting transitions to a lively amusement park at night where Conan and Ran Moriyama, his childhood friend, stand among crowds; Ran points excitedly, and Conan looks on with curiosity. At 9s, Conan says, 'I went to an amusement park with my childhood friend, Moriyama, and witnessed a suspicious transaction between two men in black suits.' At 13s, a close-up shows hands exchanging a purple box filled with money, hinting at a secret transaction. At 15s, the scene moves to a dark, grassy area near a castle-like structure; Conan crouches, fixated on the transaction, but doesn't notice a man in a black coat approaching from behind. At 18s, Conan says, 'I was so obsessed with watching the transaction that I didn't notice another friend approaching me from behind.' At 24s, the man forces Conan to drink poison; Conan lies on the ground, eyes wide with shock, as the man's hand approaches his mouth. At 25s, Conan says, 'I was forced to take poison by that man, and when I woke up, my body had shrunk!' At 30s, Conan wakes up smaller, sitting on the ground looking up with a sad expression as a shadowy figure stands over him. At 32s, the scene shifts to a library where Conan-now in a blue uniform-stands before a stern older man Dr. Agasa, who warns him about being exposed. 
At 33s, Conan says, 'If they find out that Kudo Shinichi is alive, they will be killed again, and it will also affect the people around them.' At 39s, Dr. Agasa advises Conan to hide his true identity; Conan smiles and gestures with his hands while speaking, then changes into a new outfit. At 40s, Conan says, 'With the help of Dr. Agasa, I decided to hide my true identity.' At 44s, Conan is asked for his name by Ran, and he quickly responds, 'My name is Edogawa Conan.' At 46s, Conan runs towards Ran outside a building labeled 'Mori Detective Agency.' At 47s, Conan says, 'In order to get information about them, I ran to Ran's house, where her father is a detective.' The video ends with a shot of the 'Mori Detective Agency' sign on a building facade.

Response:
At 0s, the video opens with a sunlit urban sidewalk lined with parked cars and ornate buildings where a woman with long brown hair wearing a red cardigan over a light top and layered scarf walks briskly beside a man in a dark jacket and blue shirt; the camera follows them from behind initially, then shifts to a side view showing the woman gesturing animatedly as she speaks: "I thought that the dream space would be all about the visuals, but it's more about the feel of it." At 7s, her tone turns curious and intense as she continues: "My question is what happens when you start messing with the physics of it?" The camera cuts to a close-up of the woman looking directly at the man with an earnest expression while he stands still, listening intently against a backdrop of a white van and city street. At 12s, the camera pulls back to show both walking down the narrow street flanked by tall, classic-style buildings, their pace steady as they move forward. At 23s, the sky above the city suddenly warps into a surreal, multi-layered landscape of mirrored, geometric structures twisting the familiar streets below into a disorienting maze; the camera tilts upward to reveal this otherworldly expanse, emphasizing its complexity and scale, while the audio transitions to a deep, resonant, mysterious ambient soundscape filled with low-frequency hums and echoing tones that build suspense and wonder. At 45s, the camera returns to the pair, capturing the woman's wide-eyed amazement and the man's thoughtful gaze as they pause to take in the sight; the woman says: "Something isn't it?" At 56s, the man responds calmly: "Yes, it is," his voice carrying quiet awe. 
The camera alternates between close-ups of their faces highlighting the woman's wonder and the man's contemplation and wide shots of the warped sky underscoring the surrealism of the moment, then pans across the complex, symmetrical, three-dimensional structure hovering between the buildings to create a dizzying effect before returning to the two standing on the sidewalk, looking up with fascination. The video concludes with a final sweeping shot of the surreal, geometric cityscape, reinforcing the theme of reality bending under unexpected forces.

Response:
At 0s, the video opens with a high-angle wide shot capturing six players seated around a black felt poker table in an elegantly decorated indoor room featuring red armchairs, plush seating, wooden floors, and warm lighting; on-screen graphics include GSN logo top right, and text at the bottom. The players include Esfandiari in a light-colored jacket with glasses, Laak in a dark hoodie, Ivey in a blue shirt, Elezra in a black leather jacket, and Negreanu in a gray sweater-all deeply focused on their hands and chips. A male voice says "High stakes poker with six players." At 3s, another voice adds "I wonder who's going broke next," followed by a third saying "Beat him in the pot call." At 9s, a voice states "Phil Ivy has over a million dollars now," then continues "Phil Ivy could actually have a hamburger and have enough money to pay off his million dollar bet with Drawn" while on-screen text identifies Ivey as having over $1 million in chips. At 21s, a voice says "He can actually have a whole barbecue right now if he wants," then another adds "Punish the Unabomber." At 31s, a voice says "I'm easily punishable," followed by "There's many weaknesses in my game that people can exploit." At 37s, a voice says "You can ask Durr about them," then adds "He'll give you the list." At 40s, a voice says "He's very good here," then another says "We might have to pay him something." At 42s, a voice asks "What's it?" then follows with "30 what?" as the camera zooms in for a close-up low-angle shot of a player's hand placing a chip into the pot, with on-screen graphics showing Ivey's name, his cards Ace of Diamonds and 6 of Diamonds, and the pot size increasing from $6, 300 to $14, 100. Throughout the sequence, the camera alternates between close-ups of players' hands, chips, and cards, medium shots of the table, and wider views of the room, emphasizing the strategic tension and high stakes of the game. 
The atmosphere remains intense yet conversational, with players discussing strategies, chip counts, and potential outcomes-creating a dynamic blend of competitive focus and casual banter.

Response:
At 0s, the video opens with a close-up of two men in a dimly lit taxi at night-one with curly dark hair in a black coat, the other with short light brown hair in a dark jacket over a checkered shirt-both looking forward with serious expressions; the taxi's interior is dark, with blurred city lights visible through the windows. At 1s, the curly-haired man speaks in a calm, analytical tone: 'When I met you for the first time yesterday, I said Afghanistan or Iraq.' At 4s, he continues: 'You looked surprised.' At 5s, the short-haired man asks: 'How did you know?' At 6s, the curly-haired man replies: 'I didn't know, I saw.' The scene shifts to a lab with shelves of glass beakers and equipment, where the short-haired man stands with a rigid posture, looking down with a concerned expression. At 8s, the curly-haired man says: 'Your haircut, the way you hold yourself, says military.' At 10s, he adds: 'But your conversation as you entered the room has been a bit different from my day.' The scene cuts back to the taxi, where the curly-haired man looks out the window with a determined expression. At 13s, he says: 'Said trained at Bart, so army doctor, obvious.' The scene returns to the lab, where the short-haired man looks down with a somber expression. At 17s, the curly-haired man says: 'Your face is tanned.' At 19s, he continues: 'But no tan above the wrists.' At 20s, he adds: 'You've been abroad, but not sunbathing.' A close-up shows the short-haired man's arm, with the curly-haired man's hand holding a phone near it. At 22s, the curly-haired man says: 'Your lip's really bad when you walk.' At 24s, he continues: 'Don't ask for a chair when you stand like you've forgotten about it.' At 27s, he adds: 'So he's partly psychosomatic.' At 29s, he says: 'That says the original circumstances of the injury were traumatic.' At 32s, he continues: 'Wounded in action, then.' At 33s, he adds: 'Wounded in action, sun tan, Afghanistan or Iraq.' 
The scene shifts back to the taxi, where the curly-haired man looks at the short-haired man with a knowing expression. At 37s, the short-haired man says: 'You said I had a therapist.' At 39s, the curly-haired man replies: 'Of course you've got a therapist.' At 41s, he adds: 'Then there's your brother.' At 43s, he continues: 'Your phone is expensive, email enabled, MP3 player.' At 45s, he says: 'You're looking for a flasher.' At 46s, he adds: 'You wouldn't waste money on this.' At 47s, he continues: 'It's a gift, then.' At 48s, he says: 'Scratches, not one.' At 49s, he adds: 'Many over time.' At 50s, he continues: 'It's been in the same pocket as keys and coins.' At 52s, he says: 'That's something nice people don't treat as one luxury item like this.' At 54s, he adds: 'So it's had a previous owner.' At 56s, he says: 'Next bit's easy, you know it already.' The camera zooms in on the short-haired man's phone, showing an engraving on its back. At 57s, the short-haired man asks: 'The engraving?' The final shot is a close-up of the phone's engraving, which reads 'Harry Watson from Clara.' Throughout the video, a low, tense, and atmospheric electronic score plays in the background, creating a sense of suspense and focus. The camera alternates between close-ups of the men's faces, the short-haired man's arm, and the phone, emphasizing the curly-haired man's meticulous observations and the short-haired man's growing realization. The taxi's dim interior and lab's cluttered shelves create a focused, investigative atmosphere, while the short-haired man's rigid posture and somber expressions highlight his growing realization of his own condition.

Response:
At 0s, the video opens on a high-angle wide shot of an intense NBA2K ONLINE 2 matchup between the Memphis Grizzlies (MEM, white jerseys) and the Orlando Magic (ORL, blue jerseys) on a polished wooden court labeled “BERLIN GAME 2026” presented by Tissot; the arena is packed with spectators behind barrier boards featuring Revolut logos, and on-screen graphics show the NBA2K ONLINE 2 logo at top left, a bottom scoreboard reading MEM 107–107 ORL with 2:38 remaining in the 4th quarter, a “4K ULTRA HD” watermark at the right, and Chinese text with a bilibili logo at bottom right. A white-jersey ballhandler works near the three-point line under pressure and the play quickly turns physical at the rim; at 5s, the commentator bursts out “Jackson!” and then immediately “No whistle! Turnover!” as possession flips and both teams sprint into transition. At 9s, another call—“Take that with you!”—rides over the roar of the crowd and the squeal of sneakers as the fast break develops, with the camera tracking the bouncing ball and cutting in on determined faces and rapid footwork. Around 15s, a blue-jersey player attacks the basket and collides mid-air with multiple white-jersey defenders, including Jackson (#45) and Williams Jr., sending bodies tumbling to the floor as the commentator reacts, “Here we go the other way,” followed by a joking “Gotcha airball,” and then, “I’m sorry, Grady Jackson,” briefly lightening the tension. Referees in black-and-white striped shirts hover nearby as teammates gasp and look on. At 21s, the commentator exclaims, “Anthony Black got me out of my seat! Did you see that?” and at 24s adds, “Of the year candidate,” setting up the highlight moment. At 25s, a new sequence shows a blue-jersey ballhandler driving forcefully through several white-jersey defenders into the paint; at 26s, the commentator shouts, “That is the dunk of the year! I’m calling that right now!” as the player explodes upward and hammers the orange ball through the rim. 
The broadcast then emphasizes that same dunk via multiple replay angles—cutting between a wider court view, a lower/side perspective that accentuates the takeoff and contact, and a closer look near the basket—before the final frame holds on the dunker landing and looking upward with a triumphant expression while the crowd remains fully engaged in the electric atmosphere.

Response:
At 0s, the video opens with a breathtaking aerial view of a lush mountain valley where mist clings to rolling green fields dotted with quaint red-roofed houses; towering rocky cliffs rise on either side, one crowned by a cascading waterfall that glistens under soft diffused light, while snow-capped peaks peek through distant clouds. The camera slowly pans across the landscape, revealing more of the village nestled among the greenery, then zooms out slightly to show the vastness of the valley and surrounding mountains. A gentle serene instrumental piece-likely played on a piano or similar keyboard instrument, slow-paced and contemplative-fills the audio, creating a calm meditative atmosphere perfect for relaxation. At 10s, white elegant serif text appears over the valley: 'THE ALPS' at the top, below it Chinese characters translating to 'Alps', and at the bottom '4K SCENIC RELAXATION FILM'. The camera continues its smooth pan before transitioning. At 18s, the frame shifts to a sunlit meadow where golden grass stretches toward scattered wooden huts and small rustic cabins, with majestic snow-dusted mountains rising in the distance; warm sunlight bathes the landscape, casting long shadows and highlighting vibrant greens and earthy browns, evoking a peaceful pastoral mood. Text reading 'SCENIC RELAXATION' appears in the bottom left corner, and 'ALPE DI SUISI' is displayed in the bottom right. At 26s, the scene changes to a winding road curving through a verdant valley leading to a charming white church with a tall pointed steeple perched atop a grassy hill; behind it looms a grand snow-capped peak under a bright blue sky with fluffy clouds, and a person walks along the road adding scale to the picturesque setting. Text 'SCENIC RELAXATION' remains in the bottom left, and 'BERCHTESGADEN' is shown in the bottom right. 
At 34s, an aerial shot reveals Neuschwanstein Castle-a fairytale-like structure with multiple towers and turrets-nestled among autumn-hued trees in shades of orange and yellow; the castle overlooks expansive green fields and a lake, with soft evening light enhancing its fairy-tale allure. Text 'SCENIC RELAXATION' stays in the bottom left, and 'NEUSCHWANSTEIN CASTLE' is visible in the bottom right. At 42s, the frame transitions to Lake Bled's sunset: a small island with a white church featuring two towers sits in the middle of the calm turquoise water, which reflects the golden glow of the setting sun; distant mountains fade into a hazy horizon, and the sun creates a beautiful lens flare effect. Text 'SCENIC RELAXATION' is in the bottom left, and 'LAKE BLED' is in the bottom right. At 50s, the scene shifts to the Church of St. Tomasj-a small white chapel with a red-topped tower-sitting alone on a grassy hill surrounded by bare trees hinting at late autumn or early winter; rolling hills and valleys stretch into the distance under a clear sky, and the camera provides a wide scenic view emphasizing solitude and tranquility. Text 'SCENIC RELAXATION' remains in the bottom left, and 'CHURCH OF SV. TOMAZ' is in the bottom right. At 57s, the final shot shows Grindelwald's valley: a small town nestled between steep rocky mountains-one with a prominent snow patch-under a partly cloudy sky, with lush green forests surrounding the settlement. Text 'SCENIC RELAXATION' is in the bottom left, and 'GRINDELWALD' is in the bottom right. Throughout the video, there is no narration or dialogue-only the soothing instrumental music-allowing viewers to immerse themselves in the tranquil beauty of each location. The camera alternates between sweeping aerial views showcasing landscapes like the valley's mist-shrouded cliffs, Alpe di Suisi's golden meadows, Berchtesgaden's church, Neuschwanstein's autumnal turrets, Lake Bled's reflective waters, St. 
Tomasj's isolated hilltop, and Grindelwald's forested valley and intimate ground-level shots highlighting details such as the church's steeple, castle's towers, and chapel's red roof, balancing grandeur with intimacy. Lighting varies from soft diffused light in the opening valley to warm golden hues at sunrise/sunset, enhancing moods ranging from mystical to serene to fairy-tale-like. Each transition flows smoothly, weaving together diverse Alpine landscapes to create a cohesive relaxing journey through stunning natural beauty.

Response:
At 0s, the video opens with a woman walking confidently down a vibrant nighttime urban street-likely a Japanese district like Shinjuku-with wet pavement reflecting neon signs in red, yellow, blue, and white; tall buildings line both sides, their facades adorned with colorful advertisements, and pedestrians bustle in the background, creating a lively yet moody atmosphere. She has long dark hair, wears large black sunglasses, hoop earrings, bold red lipstick, and a stylish ensemble: a black leather jacket over a form-fitting red dress, paired with black boots and holding a small black handbag in her right hand. The camera follows her from behind initially, capturing her steady stride against the backdrop of glowing storefronts and moving crowds. At 32s, she turns slightly left, revealing more of her profile while maintaining an upright posture that exudes self-assurance. At 37s, the camera shifts to a close-up of her face, highlighting her calm expression, the details of her sunglasses, and her hoop earrings; the reflection of the city lights is visible on the lenses of her sunglasses, adding depth to the shot. At 45s, the camera pulls back to show her upper body again, emphasizing the contrast between her dark attire and the bright, reflective surroundings. At 51s, the camera focuses on her hands resting lightly at her sides, underscoring her poised demeanor amid the bustling environment. Throughout the sequence, no one speaks, allowing the visual elements-the wet pavement's shimmering reflections, the neon-lit architecture, the woman's deliberate movements-to carry the narrative. The overall tone remains confident and stylish, blending the energy of the urban night with the woman's composed presence.

Attribute Taxonomy

We define an attribute taxonomy for video understanding with eight core categories (Scene, Object, Character, Action, Emotion, Camera, Narrative Element, Speech), each divided into multiple fine-grained sub-attributes.
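As a rough illustration of how the eight core categories could be represented in code, the sketch below models them as an enum with one caption slot per category. The names `Attribute`, `empty_annotation`, and `missing_attributes` are hypothetical and not taken from the ASID release; the sub-attributes are omitted because this page does not list them.

```python
from enum import Enum


class Attribute(Enum):
    """The eight core categories named in the taxonomy (sub-attributes omitted)."""
    SCENE = "scene"
    OBJECT = "object"
    CHARACTER = "character"
    ACTION = "action"
    EMOTION = "emotion"
    CAMERA = "camera"
    NARRATIVE_ELEMENT = "narrative element"
    SPEECH = "speech"


def empty_annotation():
    """Return one empty caption slot per core attribute (hypothetical layout)."""
    return {attr: "" for attr in Attribute}


def missing_attributes(annotation):
    """List the attributes whose caption slot is absent or blank."""
    return [attr for attr in Attribute if not annotation.get(attr, "").strip()]


annotation = empty_annotation()
annotation[Attribute.SCENE] = "The deck of a ship at sunset."
print([attr.value for attr in missing_attributes(annotation)])
```

Keeping one slot per category makes it straightforward to request or check a single attribute at a time, which is the kind of controllable, per-attribute supervision the taxonomy is meant to support.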

ASID-Verify Pipeline

ASID-Verify is a three-stage pipeline designed to produce reliable, fine-grained audiovisual instructions. Instead of generating a single caption in one pass, it progressively integrates, verifies, and refines annotations to ensure semantic and temporal consistency.

S1 Multi-Source Annotation: Generate complementary captions using audiovisual models, ASR transcripts, and multimodal annotators to capture diverse visual and speech information.
S2 Ensembling & Verification: Integrate multi-source captions into a unified draft and verify it against audiovisual evidence, enforcing ASR alignment and timestamp consistency.
S3 Attribute-Level Evaluation & Refinement: Decompose captions into semantic attributes and check each attribute for errors and omissions, followed by targeted refinement to suppress hallucinations and improve completeness.

Overview of ASID-Verify: multi-stage verification and refinement for attribute-structured captions.
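To make the timestamp-consistency idea in S2 concrete, here is a minimal sketch of one check such a stage might run. It is an assumption for illustration, not the actual ASID-Verify implementation: it relies on the "At Ns" marker style seen in the demo captions, and the function names and the `duration_s` parameter are hypothetical.

```python
import re

# Matches markers such as "At 12s" or "at 4s" (the style used in the demo captions).
TIMESTAMP = re.compile(r"\b[Aa]t (\d+)s\b")


def extract_timestamps(caption):
    """Return the timestamps (in seconds) in the order they appear in the caption."""
    return [int(value) for value in TIMESTAMP.findall(caption)]


def timestamps_consistent(caption, duration_s):
    """Check that timestamps lie within the clip and never move backwards."""
    stamps = extract_timestamps(caption)
    in_range = all(0 <= t <= duration_s for t in stamps)
    ordered = all(a <= b for a, b in zip(stamps, stamps[1:]))
    return in_range and ordered


draft = "At 0s, the video opens on a ship deck. At 4s, the man speaks. At 12s, he guides her hand."
print(timestamps_consistent(draft, duration_s=60))
```

A draft that fails a check like this would be a natural candidate for the targeted refinement described in S3, rather than being discarded outright.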

ASID-Captioner

ASID-Captioner is trained via three-stage progressive SFT: attribute-wise → all-attribute (short clips) → all-attribute (long clips, up to 3 min). At inference, it supports controllable captioning conditioned on a user-selected subset of attributes.
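The three-stage schedule can be pictured as an ordered curriculum in which each stage starts from the previous stage's checkpoint. The sketch below is only a schematic under that assumption; `SFTStage`, `run_curriculum`, and the `train_stage` callback are hypothetical names, and no hyperparameters are implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SFTStage:
    name: str
    supervision: str  # "attribute-wise" or "all-attribute"
    clips: str        # clip regime as described on this page


# Stage order follows the text: attribute-wise, then all-attribute on short
# clips, then all-attribute on long clips (up to 3 minutes).
STAGES = (
    SFTStage("stage1", "attribute-wise", "not specified"),
    SFTStage("stage2", "all-attribute", "short clips"),
    SFTStage("stage3", "all-attribute", "long clips, up to 3 min"),
)


def run_curriculum(train_stage, checkpoint):
    """Run the stages in order, feeding each stage the previous checkpoint.

    `train_stage(checkpoint, stage)` is a user-supplied callback (hypothetical)
    that fine-tunes on the data for `stage` and returns the new checkpoint.
    """
    for stage in STAGES:
        checkpoint = train_stage(checkpoint, stage)
    return checkpoint


# Toy callback that only records the order in which stages are visited.
visited = []
final = run_curriculum(lambda ckpt, stage: visited.append(stage.name) or ckpt, "base")
print(visited)
```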

🎯 Controllable Captioning Prompts

"Describe the scene in the video in detail. Write your answer as one coherent paragraph."
"Describe the scene, characters, actions in the video in detail. Write your answer as one coherent paragraph."
"Provide a comprehensive description of all the content in the video, leaving out no details, and naturally covering the scene, characters, objects, actions, narrative elements, speech, camera, and emotions in a single coherent account."
"Please describe all the information in the video without sparing every detail in it. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions."
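The first two prompts above follow a common template, so a prompt for any attribute subset can be assembled programmatically. The helper below is a hypothetical sketch: `build_prompt` is not part of the released code, and the attribute names are taken from the wording of the third prompt.

```python
# Attribute names as they are worded in the prompts above.
ALLOWED = (
    "scene", "characters", "objects", "actions",
    "narrative elements", "speech", "camera", "emotions",
)


def build_prompt(attributes):
    """Build a controllable-captioning prompt for a non-empty attribute subset."""
    if not attributes:
        raise ValueError("select at least one attribute")
    unknown = [a for a in attributes if a not in ALLOWED]
    if unknown:
        raise ValueError(f"unknown attributes: {unknown}")
    return (
        f"Describe the {', '.join(attributes)} in the video in detail. "
        "Write your answer as one coherent paragraph."
    )


print(build_prompt(["scene", "characters", "actions"]))
```

Called with `["scene"]` or `["scene", "characters", "actions"]`, the helper reproduces the first and second prompts listed above word for word.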

Performance

Citation

@article{li2026asid,
    title={Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions},
    author={Li, Yunheng and Zhang, Hengrui and Guo, Meng-Hao and Gao, Wenzhao and Jia, Shaoyong and Jiao, Shaohui and Hou, Qibin and Cheng, Ming-Ming},
    journal={arXiv preprint arXiv:2602.13013},
    year={2026}
}