Using Beats to Orchestrate Multi-Shot Films with Seedance 2.5 and 2.0

A year ago, the honest answer to "can I make a finished, multi-shot commercial from a single prompt?" was no. You got one beautiful shot, a couple of seconds of motion, and then you went back to an editor to stitch. Seedance 2.0 changed that — text, image, video, and audio in; a directed sequence with native sound out — and its successor, Seedance 2.5, pushes the same idea further. But "one prompt, one film" only works if you prompt like a director instead of typing a wish into a box.

We made a batch of example films on Hedra to work out exactly how to do that — and the useful part is that everything it takes lives in one workspace: the image model that builds the references, Seedance itself, and the real-person likeness support, all under one roof. Two of the films are worth walking through in detail: Ember Coffee, a fifteen-second specialty-coffee brand film, and a Koh Samui travel diary narrated by a real person with a plush sidekick. Both are single Seedance generations. Neither was a lucky roll. The technique underneath them is something we started calling Beats — and by the end you'll see how Hedra's agent can run the whole thing for you.

A Beat is a single, timed shot

The core move is to stop writing a paragraph and start writing a shot list — with timecodes. A Beat is one directed shot, written explicitly into the prompt with its in and out points. Here is the skeleton of Ember Coffee:

Beat 1 (0–2.5s): a slow push toward the glowing "EMBER COFFEE" storefront at blue-hour dusk.
Beat 2 (2.5–5s): a macro of fresh grounds streaming into the portafilter.
Beat 3 (5–7.5s): the steam wand whipping milk into a microfoam whirlpool.
Beat 4 (7.5–9.5s): a slow espresso pull, golden crema building.
Beat 5 (9.5–12s): an overhead pour, a rosetta leaf blooming across the crema.
Beat 6 (12–15s): the barista serving a customer, then the finished latte beside the sign.

Seedance reads those timecodes as a cut list. It renders six distinct shots and holds continuity — the same café, the same grade, the same light — across all of them, inside one generation. Seedance 2.0 gives you up to fifteen seconds to work with; Seedance 2.5 extends the same approach.

One small but real lesson: tell it one clean shot per beat, no repeats. An early Koh Samui draft asked for "quick cuts of the feast" and the model dutifully flashed the same plate twice. Changing it to "show the feast exactly once" fixed it. Be explicit about what you do not want repeated.

Nail the look in the references before you ever touch the video model

This is the highest-leverage habit we picked up: generate your reference stills first, and bake the entire visual language into them. We built ours with GPT Image 2, right alongside Seedance on Hedra — references and final render in one place, with no API-juggling between separate tools. The film stock, the color grade, the lens, the lighting, the composition — all of it lives in the still before a single frame of video exists. Ember Coffee is Cinestill 800T (amber highlight halation, teal shadows). The Koh Samui diary is 16mm Kodak Ektachrome (saturated, sun-drenched, gentle light leaks). The electric-SUV spot is anamorphic teal-and-orange. The real-estate tour is Kodak Portra 400. When the model starts from a frame that already looks exactly right, you've eliminated most of the ways a generation can drift.

Put effects on your reference images, before sending them to video models

Don't one-shot the stills, and don't generate each one independently — that's how you get five frames that don't look like the same world. Establish a hero, then go image-to-image from it. For the SUV, one text-to-image hero locked the matte-silver car and the grade; every other car frame was generated from it — "the exact same SUV from the reference image" — so the vehicle stays identical shot to shot. Same trick for Peachy, the plush in the Koh Samui film: we cleaned the toy into a studio cutout, then dressed it in boho beach gear while insisting it stay "unmistakably the same peach plush," so it reads as the same character at the beach and at the restaurant table.

From Peachy to "Beachy": Incremental edits using GPT Image 2

Hook each reference to the beat it belongs to

References aren't a mood board you toss at the model — they're an ordered set, and order is a signal. Pass your references in the sequence your beats unfold, and write each beat in the language of its still. Ember Coffee's six reference frames — storefront, grinder, steam wand, espresso pull, latte pour, serve — map one-to-one to its six beats, in that order. The prompt's Beat 2 describes "a dense, continuous stream of fresh coffee grounds into a stainless portafilter" because that is precisely the coffee grinder still. When you need to remove all doubt about which frame drives which moment, you can name a reference directly in the prompt (Seedance honors image handles like @Image1). The aim is simple: never leave the model guessing which look belongs to which beat.

Hedra allows real people — and why most platforms won't let you

The Koh Samui diary is narrated by a real, identifiable 30-year-old woman, on camera, talking to the lens. That sounds unremarkable until you try it elsewhere: most video platforms refuse a reference image that contains a real person, blocking it at the door. Hedra has special API access that lets you use a real person's likeness in your references — which is exactly what makes a genuine on-camera narrator possible instead of a generic synthetic face that nobody recognizes.

That access comes with responsibility, and it should. Using someone's likeness is governed by our Biometric Data Policy, and it's intended for likenesses you actually have the right to use.

Camera lives in two places — use both

A reference still can carry the look, but it cannot carry motion. So you specify the camera twice, in two different places, and each does a different job. The still owns the static frame: lens, grade, lighting, composition. The prompt owns the move. In our films that's "a slow dolly push toward the cedar-and-glass exterior" (real estate), "a 180-degree orbit around the levitating shoe" (sneaker ad), "a whip-pan into a sweeping bend" (SUV), and "smooth macro push-ins" (Ember Coffee).

The point is to do both, deliberately. Put the entire aesthetic into the still so the model isn't guessing at the look, then describe the motion in the prompt so it isn't guessing at the choreography. It's also worth saying what you don't want — "smooth gimbal, no shake" earns its place in almost every brief. Hone the visual style hard in the reference images first; that's what gives you the best odds of a great output once it reaches the video model.

The mistake that taught us reference discipline

One we got wrong, because it's the most useful lesson here. In the electric-SUV spot, one of our reference frames placed the car on the wrong side of the road — on the oncoming side of the double-yellow line. We noticed too late, with the still already in the reference set, and tried to fix it in words: the prompt carries an explicit, capitalized instruction to keep the car "to the RIGHT of the double-yellow centre line" and to "NEVER cross" it. Seedance ignored the sentence and followed the picture.

Seedance won't fix a reference image that's wrong

The lesson is blunt and worth internalizing: a reference image is a stronger signal than a corrective instruction. The model anchors to what it sees, not to what you tell it to fix. Vet every still for the specific details you care about — which side of the road, which hand, which logo, which way the latte art swirls — before it enters the reference set. A guardrail in the prompt will not reliably override a bad frame. We should have caught it in the still, and now we check.

Audio is the part most people skip — don't

If you say nothing about audio, Seedance doesn't stay quiet. It invents, and it defaults to Mandarin dialogue over mellow, vaguely Chinese-sounding music. That single fact is why so many otherwise-great Seedance clips sound off. Always specify sound, and specify it precisely. What worked for us:

Name the language and the accent. "English" on its own tends to drift toward British RP. We asked for a warm Irish narrator on Ember Coffee, a sun-warmed Australian voice on Koh Samui, and a calm, low-register Scandinavian-accented male voice on the SUV — and we negative-prompted the defaults outright ("NOT American, NOT British RP, NOT Chinese"). Distinct accents also simply read less like stock.
Layer the mix. One narrator, a music bed, and diegetic sound effects matched to each beat. Ember Coffee has the burr grinder's whir on Beat 2, the steam wand's hiss on Beat 3, the espresso trickle on Beat 4, a soft ceramic clink on the serve — all under an indistinct café murmur. It's ASMR for coffee, and it's specified shot by shot.
Decide who speaks and who doesn't. Peachy the plush is explicitly silent — no voice, no lip-sync, a fixed embroidered smile that never opens to talk. On Ember Coffee the barista and customer don't speak; only the narrator does. If you don't say it, the model will hand voices to characters you wanted quiet.
Place the music in time. On Koh Samui we wanted wanderlust energy from the very first frame, so the brief says the track "begins immediately at the very first frame" rather than easing in under the opening shot.

Sync the narration to the picture

The reason the words land on the right shots is that we encode the voiceover twice. Once inside each beat ("narrator says: …"), and again as a full, ordered script in the audio block — "speaking these lines in order, timed to the beats above." That redundancy is what locks line five to shot five instead of letting the narration float free of the cuts.

Two more sync tricks earned their keep:

Spell hard words phonetically — and accept that pronunciation is a dice roll. Getting the narrator to say "Koh Samui" was a genuine saga, and a humbling one. Spelling it normally gave us "Ko Samoo." Re-spelling it phonetically as "Ko Sah-Moo-Ee" — and explaining the phonetics (four syllables, stress on "Moo," ending on a clear "ee") — moved it in the right direction, but it never fully landed: even the final cut you can watch above says a five-syllable "Ko sam-oo-EE-eh." That's the real lesson. Pronunciation is non-deterministic — the same prompt will say the word differently on different runs, and no amount of prompting reliably tames it. Phonetic respelling is a genuine lever that shifts the odds, but it is not a guarantee. The more dependable fix is to route around the problem entirely: we kept getting "turquoise" mangled too, so we simply dropped it and wrote "clear blue water" instead. When a word matters and the model keeps fumbling it, respell it phonetically and re-roll; when it doesn't, just pick a word the model can already say.
Give a tagline room to finish. On the SUV spot the single spoken line had to land in the middle, not the end. The brief has it begin around 3.5 seconds and finish by 6.0, leaving the final two seconds music-only so the slogan is never clipped by the cut to black. Put a line at the very end and you should expect it to get cut off.

Hedra's agent can run the whole method for you

Everything above is a workflow: generate the references, lock the look, order them to your beats, write the prompt, design the audio, vet the stills, generate, review. You can run it by hand — but you don't have to. It's exactly the kind of multi-step creative work Hedra's agent was built to orchestrate. Hand it a brief and it can move through the pipeline for you — drafting reference stills, sequencing the beats, and generating the Seedance render with synced audio — then loop you in on the calls that need a human eye. The models, the real-person likeness support, and the agent that ties them together all live in one place, which is the entire point: directing a multi-shot film shouldn't mean gluing five different tools together.

The whole method, in one breath

None of this is exotic. Generate your references first and put the entire look into them. Order them to your beats and describe each beat in the language of its frame. Specify the camera move in the prompt and the camera look in the still. Say exactly who speaks, in what accent, over what music — and write the narration into the beats and again as a timed script. Then check your references one more time before you press go.

That's the recipe, and it's the same recipe on Seedance 2.5. Watch Ember Coffee and the Koh Samui diary with this post open and you can see every one of these decisions playing out on screen. Then create your own on Hedra — the models, the references, and the agent to direct them are all in one place, waiting.

Company