How to Make AI Images Believable (With Reference Images)

A single prompt to GPT Image 2 Medium can only carry so much. Whatever you don't spell out, the model fills in with its default — the clean, generic, obvious version — and that generic default is the slop. A longer prompt doesn't fix it; there's no room to art-direct every wall, sign, machine, and face inside one sentence. The answer is pre-work: build the hard, specific parts as their own reference images first, where each one gets your full attention, then composite them into the scene. A dedicated reference holds more detail than a single prompt can, keeps repeated things consistent, and grounds the image in something real.
And, as in the companion post on prompting, these stills are inputs for video, not the end goal: an authentic video starts with an authentic input. Image-to-video and reference-driven models build from a start frame or reference image, so the more believable that input, the more believable the footage. References are how you build inputs detailed and consistent enough to be believable.
Every example here is a real GPT Image 2 Medium generation built from references. Four kinds of references, one workflow.
Why one-shots fail at specifics
The pattern is the same every time: left undirected, the model takes the obvious, low-effort path on each detail. A row of machines becomes a row of mismatched machines — it won't hold one design across repeats. A complex, unusual object becomes an "impossible object," with parts in front melting into parts behind. And signage defaults to a too-clean, brand-new board, sometimes with on-the-nose text like a bakery sign bragging "FRESH TODAY" (everything at a bakery is fresh today). None of this is the model malfunctioning — it's the model doing the average, expected thing. To get something specific, you have to supply it yourself.
Reference people: cast a non-model "actor"
To get a real-looking and consistent person — not a glossy AI "model," and not a different face in every render — generate the person once as a plain reference portrait, deliberately ordinary (real skin texture, an everyday build, no glamour), then composite that same face into every scene. Our yoga-studio owner is a synthetic "actor" cast this way; the barber and bakery owner are two more. A synthetic actor is also fully yours — no real person's likeness, no rights problem (more below).
Reference repeatable objects: make them match
AI renders specifics inconsistently: ask for a wall of machines and you get a wall of different machines. Real places don't look like that; a laundromat buys matching units so it can maintain them. So generate one accurate object as a reference, then tell the model to repeat that exact one. Our modern laundromat is rows of a single washer reference, repeated, which is what makes it read as a real, well-run shop. And when an object is too obscure to render at all (the Pilates reformer that came out an "impossible object"), either reference a real one or swap it for something simpler.
Reference signage: art-direct the words
Drop a menu into a one-shot scene and it comes out too clean and generic — a crisp, brand-new board that's obviously rendered. Build the sign as its own image first, where you can give it your full attention: a gritty, hand-chalked menu board — grease, smudges, a couple of stickers, an erased line or two, real items and prices — art-directed until it looks genuinely used, then composite it in. (Think of it as giving that one element its own focused generation, instead of hoping the model nails it inside a busy frame.) Our late-night taco cart and bakery both do this. A bonus lesson the cart taught us: put the sign where it belongs — the menu goes on the side, not floating over the serving window where a customer has to stand.
Reference real locations: bring your own photo
An invented room or landscape is a giveaway, so ground the scene in a real photo instead. Our rafting sign-up booth is composited onto a genuine public-domain river photograph, so the water and banks are actually real; our barber scene is a licensed photo of a real barbershop interior with a synthetic barber composited in. This is especially useful for a real business: your best inputs are your own photos (your actual space, your actual product), which can become polished marketing without hiring a photographer.
The one rule for real people
Real spaces and objects are easy: an empty room or a product has no likeness, so a permissive, commercial-use photo is fine to build from. Real people are different — using an identifiable person's face in commercial AI content is a rights question, governed by policies like our Biometric Data Policy. The clean path is to cast a synthetic actor (above) or use photos of people who've consented. Don't pull a stranger's face off the internet.
How to composite
Pass your reference(s) to GPT Image 2 Medium's image-to-image mode with a prompt that (a) names what to keep — "the woman / the machine / the menu / the river from the reference" — and (b) describes the scene around it. For repeatable objects, say "repeat this exact one." Then bring the prompting craft from Post 1 — humble camera, honest light, real imperfections — so the composite still reads as a photograph, not a clean paste-up.
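As a rough sketch of that two-part prompt structure, the Python below only assembles the prompt text; it does not call any image model. The names Reference, build_composite_prompt, and CRAFT_NOTES are hypothetical, made up for illustration, and the "Scene:" and "Style:" labels are one possible layout rather than anything the model requires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    """One reference image and the thing in it that must survive the composite."""
    subject: str                # e.g. "the woman", "the machine", "the menu"
    repeat_exact: bool = False  # True for repeatable objects such as washers


# Craft notes taken from the article's wording; adjust to taste (assumption).
CRAFT_NOTES = "humble camera, honest light, real imperfections"


def build_composite_prompt(references: list[Reference], scene: str) -> str:
    """Assemble prompt text: (a) what to keep, (b) the scene, then craft notes."""
    if not references:
        raise ValueError("at least one reference is required")
    keep_parts = []
    for ref in references:
        part = f"Keep {ref.subject} from the reference."
        if ref.repeat_exact:
            # Wording suggested in the article for repeatable objects.
            part += " Repeat this exact one."
        keep_parts.append(part)
    return " ".join(keep_parts + [f"Scene: {scene.strip()}", f"Style: {CRAFT_NOTES}."])


prompt = build_composite_prompt(
    [Reference("the machine", repeat_exact=True)],
    "a modern laundromat with rows of washers along both walls.",
)
print(prompt)
```

The resulting string would be sent along with the reference image(s) through whatever image-to-image interface you use. Keeping the "what to keep" clause first is a design choice here, so the preserved element does not get buried in the scene description.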
The recipe
Build the hard parts first: a non-model actor for people, one accurate object for anything repeated, an art-directed board for signage, a real photo for the location. Composite, keep the references, and shoot it with real camera craft.
Then feed that believable still into a video model like Seedance as a start frame or reference, and the realism carries into the footage; our Beats walkthrough covers directing the video itself. See all six examples in the GPT Image 2 Medium gallery, then build your own on Hedra.
