JointI2V: A Multimodal Framework for Fine-Grained Joint Control of Camera and Scene Subjects

Abstract

In recent years, video generation has seen remarkable progress in both content quality and generation efficiency. However, jointly controlling scene subjects and camera motion, and generating fine-grained camera trajectories that span large ranges and arbitrary directions, remains a significant challenge. This is especially true in complex scenarios where the camera must execute precise motions around dynamically moving objects, which places higher demands on generative models. To address these issues, we propose JointI2V, an end-to-end multimodal video generation framework that enables simultaneous control of both scene subjects and camera trajectories. Our method supports various forms of control input, including textual prompts, trajectory data, and exemplar videos, and unifies them into fine-grained frame-level manipulation signals. We introduce a novel Camera-Aware Attention module that deeply integrates camera motion cues into the generative process. To equip the model with robust joint control capabilities, we design a multi-stage curriculum learning strategy that progressively disentangles and integrates subject and camera controls. Furthermore, we construct MMJointCtrl, a large-scale, high-quality multimodal control dataset comprising over 100K control instances, organized to support progressive learning from simple to complex tasks. Experimental results demonstrate that JointI2V achieves state-of-the-art performance in fine-grained, complex multimodal joint control of camera and subject, offering a powerful solution for multimodal video generation. We will soon release our source code.

Examples

Prompt Trajectory JointI2V

Object: A cozy home interior with a beige carpet, light walls, and a wooden console table with a green plant, books, and a decorative bowl. a wicker chair and a desk with a computer, papers, and a lamp are visible, along with a shelving unit with books and decorative items. the room is well-lit by natural light from windows with sheer curtains. a black leather chair, a wicker armchair, and a desk with a computer, printer, and speakers are also present, alongside a small plant and framed pictures. The space is tidy, with a desk cluttered with paper and a printer, and a guitar on a stand adds a personal touch. ; Camera: The camera moves according to the #trajectory.

Object: The area is furnished with a round glass-top table, wicker chairs with red cushions, and a ceiling fan with a light fixture. decorative elements include framed artwork, a potted plant, and a vase with flowers. the space opens to a living area with a sofa and armchairs, suggesting an open-plan design. The ambiance is warm and inviting, with natural light enhancing the neutral color palette and decorative items. ; Camera: The camera pans continuously to the left.

Object: A two-lane road stretches into the distance under a cloudy sky, bordered by trees on both sides. The scene is dimly lit, suggesting early morning or late evening, with a sense of quiet and solitude. The object in the image is stationary; Camera: From 1 to 3 seconds, the camera moves slowly forward.

Object: A well-maintained lawn with vibrant green grass and scattered autumn leaves leads to a charming white gazebo with a dark roof and railing, nestled in a dense forest. sunlight filters through the trees, casting dynamic shadows on the grass. the gazebo, featuring a cupola and railing, stands on a wooden platform, surrounded by mature trees with rich foliage, suggesting a secluded, tranquil setting. The scene is consistently bathed in sunlight, highlighting the lush greenery and the gazebo's architectural details, with the forest floor showing signs of seasonal change. ; Camera: The camera moves according to the #trajectory.

Object: A modern kitchen and dining area with wooden cabinetry, granite countertops, and stainless steel appliances. a breakfast bar with two grey stools is visible, and the dining space includes a wooden table set for four with a green and white patterned rug. the area is well-lit by natural light from large windows with white blinds. Decorative elements like a bowl of red apples, a vase with greenery, and a framed abstract painting add to the ambiance. The kitchen's open design extends into the dining area, creating a spacious and inviting atmosphere. ; Camera: The camera moves according to the #trajectory.

Object: A view of a wooden staircase leading to an upper floor, with a small window and a framed artwork on the wall. the scene transitions to a serene interior with a wooden handrail, a window, and a framed artwork, with a bedroom visible in the background. next, the view opens to a spacious bedroom with a slanted ceiling, a large bed, and a wooden dresser. The bedroom features a modern design with a neutral color palette, recessed lighting, and a flat-screen TV. The video concludes with a similar bedroom scene, emphasizing the room's tranquil ambiance. ; Camera: The camera moves according to the #trajectory.
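The prompts in this gallery share a visible layout: an "Object:" description, followed by a "Camera:" instruction that either names a motion in words, optionally with a time span, or defers to "#trajectory". As a rough illustration only, the sketch below splits such a prompt into its two parts and turns a "From A to B seconds" span into a frame range. The names `JointPrompt`, `parse_joint_prompt`, and `span_to_frames`, and the frame rate used in the usage note, are hypothetical; this is not the JointI2V input pipeline, whose code has not been released.

```python
import re
from typing import NamedTuple, Optional, Tuple


class JointPrompt(NamedTuple):
    """Hypothetical container for the two halves of a gallery prompt."""
    subject: str                               # text after "Object:"
    camera: str                                # text after "Camera:"
    uses_trajectory: bool                      # camera text refers to "#trajectory"
    time_span: Optional[Tuple[float, float]]   # (start_s, end_s), if stated


# "Object: <subject> [;] Camera: <camera>", tolerant of spacing around ";".
_PROMPT_RE = re.compile(
    r"^\s*Object:\s*(?P<subject>.*?)\s*;?\s*Camera:\s*(?P<camera>.*?)\s*$",
    re.DOTALL,
)
# "From 1 to 3 seconds" style time spans inside the camera instruction.
_SPAN_RE = re.compile(
    r"From\s+(\d+(?:\.\d+)?)\s+to\s+(\d+(?:\.\d+)?)\s+seconds",
    re.IGNORECASE,
)


def parse_joint_prompt(prompt: str) -> JointPrompt:
    """Split a prompt into subject and camera parts (assumed layout)."""
    match = _PROMPT_RE.match(prompt)
    if match is None:
        raise ValueError("expected a prompt of the form 'Object: ...; Camera: ...'")
    camera = match.group("camera")
    span = _SPAN_RE.search(camera)
    time_span = (float(span.group(1)), float(span.group(2))) if span else None
    return JointPrompt(
        subject=match.group("subject"),
        camera=camera,
        uses_trajectory="#trajectory" in camera,
        time_span=time_span,
    )


def span_to_frames(time_span: Tuple[float, float], fps: int) -> Tuple[int, int]:
    """Map a (start_s, end_s) span to frame indices; fps is an assumed value."""
    start_s, end_s = time_span
    return int(start_s * fps), int(end_s * fps)
```

For instance, the road prompt above would parse into a camera instruction with a time span of 1 to 3 seconds and no trajectory reference, and `span_to_frames` with an assumed 8 frames per second would map that span to frames 8 through 24. Prompts that lack the "Object:" prefix raise a `ValueError` under this sketch.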

Command control capability

Prompt JointI2V

Object: A boat sits on the shore of a lake with Mt Fuji in the background. Camera: The camera moves according to the #trajectory.

Object: A farm in the middle of the day. Camera: The camera moves according to the #trajectory.

Object: a large waterfall in the middle of a lush green hillside. Camera: The camera moves according to the #trajectory.

Object: a man and a boy sitting on a beach near the ocean. Camera: The camera moves according to the #trajectory.

Object: a person riding a skateboard on a concrete floor. Camera: The camera moves according to the #trajectory.

Object: a beach with a lot of buildings on the side of a cliff. Camera: The camera moves according to the #trajectory.

Object: a group of people standing on top of a green hill. Camera: The camera moves according to the #trajectory.

Object: a hot-air balloon flying over a desert landscape. Camera: The camera moves according to the #trajectory.

Object: a small bird sits on a moss covered branch. Camera: The camera moves according to the #trajectory.

Object: a snowboarder is in the air doing a trick. Camera: The camera moves according to the #trajectory.

A car speeds by as the camera smoothly moves sideways. Camera: The camera moves according to the #trajectory.

A group of children are playing as the camera quickly pulls back to show the panorama. Camera: The camera moves according to the #trajectory.

A person is dancing while the camera moves around them. Camera: The camera moves according to the #trajectory.

We borrow the source code of this project page from DreamBooth.