Research & Progress
GPT Image 2 rewards structure over adjectives. Words like "stunning" or "beautiful" do very little. What works is organized, specific description in a deliberate order: materials and format first, then scene, characters, environment, and rendering.
The model allocates more attention to earlier tokens in the prompt. Putting materials and format first tells the model "this is a mixed-media collage" before it starts imagining the scene, which fundamentally changes how it renders everything that follows. Burying the style at the end means the model may have already committed to a photorealistic rendering before it sees "watercolor."
GPT Image 2 follows long prompts more reliably than most other models. Our best results used 300-500 word prompts. If you're getting generic results, the answer is usually more specificity, not less. Be explicit about every element you care about. An example opening that leads with materials and format:
Dry torn paper coral with visible fiber texture exists deep
underwater alongside photographic reef life.
Gold foil metallic stipple shimmers on surfaces submerged
in real ocean water.
Gouache painted characters swim through photographic
Hawaiian reef. Heavy watercolor paper base.
Full bleed 3:2 children's book illustration. Mixed media
collage blending with underwater photography.
SCENE: A dramatic steep Hawaiian reef drop-off descending
into the deep...
OLLIE: A cartoon baby octopus. [full character anchor with
colors, proportions, anti-patterns]...
DOT: Solid natural Tahitian pearl. [full character anchor
with eye shape, anti-patterns]...
ENVIRONMENT: Rich shallow coral reef dropping into deep...
RENDERING: Warm golden god-ray light transitioning to deep
navy. [hex palette]. Full bleed 3:2 aspect ratio.
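For repeatability we assemble these blocks in code rather than hand-editing prompts. A minimal sketch (the block text is placeholder; the ordering — format first, rendering last — is the point):

```python
# Assemble a GPT Image 2 prompt in the order described above:
# materials/format first, then scene, characters, environment, rendering.
BLOCKS = [
    ("FORMAT", "Full bleed 3:2 children's book illustration. Mixed media "
               "collage blending with underwater photography."),
    ("SCENE", "A dramatic steep Hawaiian reef drop-off descending into the deep."),
    ("OLLIE", "A cartoon baby octopus. [full character anchor]"),
    ("DOT", "Solid natural Tahitian pearl. [full character anchor]"),
    ("ENVIRONMENT", "Rich shallow coral reef dropping into deep water."),
    ("RENDERING", "Warm golden god-ray light transitioning to deep navy. "
                  "Full bleed 3:2 aspect ratio."),
]

def build_prompt(blocks=BLOCKS):
    # Leading with format/materials commits the model to the medium
    # before it starts imagining the scene.
    return "\n\n".join(f"{label}: {text}" for label, text in blocks)
```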
GPT Image 2 achieves ~99% accuracy on text (vs ~60% for DALL-E 3). Place exact text in quotation marks or ALL CAPS, specify font style and placement explicitly. Best at 1-5 words per text element. Not relevant for our book illustrations (we avoid in-image text).
Shortly after GPT Image 2 launched on April 21, 2026, users reported a consistent visual artifact: generated images were layered with persistent tiling textures and grime artifacts:
Primary cause: Steganographic Watermarking. OpenAI embeds a dual-layer watermark into every generated image: (1) C2PA metadata and (2) an imperceptible pixel-level watermark with no public detector yet. The pixel watermark is baked into the generation process itself, creating visible texture patterns especially on smooth, uniform surfaces. Downstream workflows like upscaling or compositing can amplify these artifacts.
Secondary cause: Context-dependent noise accumulation. The model propagates noise from previous generations within the same conversation. Artifacts carry over and compound with each successive image in a chat thread. OpenAI fixed a noise amplification bug, but some baseline texture persists.
Tertiary cause: Training data contamination. GPT Image 2 has been observed generating images containing visible Gemini branding and other AI-model artifacts — evidence that AI-generated content in the training data introduces learned artifacts.
Mitigations in our prompts and settings: ask for "smooth matte finish, clean flat color"; use quality: "high" for finals, since lower quality settings show more compression artifacts. Analysis of our 5-spread OpenAI batch also revealed likely artifact amplifiers in our prompts.
| Parameter | Value |
|---|---|
| Supported sizes | 1024×1024 (1:1), 1536×1024 (3:2), 1024×1536 (2:3), auto |
| Our output | 1536×1024 (3:2 landscape) |
Key finding: GPT-Image-2 only supports 4 fixed sizes: 1024×1024 (1:1), 1536×1024 (3:2 landscape), 1024×1536 (2:3 portrait), and auto. For print-quality book illustrations, generate at 1536×1024 (3:2) and upscale externally (Real-ESRGAN, Topaz Gigapixel).
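A sketch of the generation call via the OpenAI Python SDK, assuming GPT Image 2 keeps gpt-image-1's images.generate signature; the "gpt-image-2" model id is our placeholder, not a confirmed identifier:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="gpt-image-2",    # assumed model id
    prompt=build_prompt(),  # ordered prompt from the sketch above
    size="1536x1024",       # 3:2 landscape — the largest supported size
    quality="high",         # finals only; use "low" for drafts
)

# gpt-image-1 returns base64-encoded image data; we assume the same here.
with open("spread.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```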
GPT Image 2 struggles with complex multi-species underwater compositions. When a prompt lists many species, the model often merges features across species and renders anatomically incorrect creatures.
Spread 05 requested maximum density with 10+ named species and showed morphed fish and anatomically incorrect creatures. Spread 01 and 08, which had fewer species described more carefully, produced better individual creature accuracy. For future batches: prioritize quality of each creature over quantity of species.
GPT Image 2 processes all reference images at high fidelity automatically, with no adjustable knob. For edit requests, we structure the instruction as:
Change: [exactly what should change]
Preserve: [face, identity, pose, lighting, framing, background, geometry, text, layout]
Constraints: [no extra objects, no redesign, no logo drift, no watermark]
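A small helper that formats this template (the helper and its field names are our own convention, mirroring the blocks above):

```python
def edit_instruction(change: str, preserve: list[str], constraints: list[str]) -> str:
    # Mirrors the Change / Preserve / Constraints template above.
    return (
        f"Change: {change}\n"
        f"Preserve: {', '.join(preserve)}\n"
        f"Constraints: {', '.join(constraints)}"
    )

# Example:
print(edit_instruction(
    change="make the scarf red instead of green",
    preserve=["face", "identity", "pose", "lighting", "framing", "background"],
    constraints=["no extra objects", "no redesign", "no watermark"],
))
```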
| Setting | Use Case | Cost (1024x1024) |
|---|---|---|
| low | Fast drafts, thumbnails | ~$0.006 |
| medium | General use | ~$0.053 |
| high | Final assets, dense layouts | ~$0.211 |
| auto | Let the model decide | varies |
| Format | Notes |
|---|---|
png | Default. Lossless. Best for illustration work. |
jpeg | Faster than PNG. Good when latency matters. |
webp | Smallest file size. Good for web delivery. |
| Feature | GPT Image 2 | DALL-E 3 |
|---|---|---|
| Text rendering accuracy | ~99% | ~60% |
| Texture/photorealism | More natural, contextually grounded | Good but less precise |
| Character consistency | Strong with proper anchoring | Inconsistent |
| Prompt following | Follows long, structured prompts well | Better with shorter prompts |
| Max resolution | Up to 2K (2048x2048) reported; the API exposes only the fixed sizes above | 1024x1024 |
Our settings: quality: "high" and png output format for all finals. Iteration ladder: quality: "low" for rapid iteration on composition and pose; quality: "medium" to nail down details and expressions; quality: "high" with png format at full resolution for finals.

Midjourney's --sref (Style Reference) is a parameter that transfers the aesthetic of a reference image or a numeric style code to a new generation. It operates in two modes:
--sref [image URL] — Midjourney analyzes the image and attempts to transfer its style.
--sref [code] — a numeric identifier mapping to a pre-defined internal style within Midjourney's latent space.

Midjourney has not published its architecture, but based on community analysis:
| Feature | --sref (Style Reference) | --cref (Character Reference) |
|---|---|---|
| Purpose | Keep the "camera"/art direction the same | Keep the "actor"/character the same |
| Captures | Colors, textures, lighting, composition, rendering style | Facial features, hair, body proportions, clothing |
| Weight param | --sw (0-1000, default 100) | --cw (0-100, default 100) |
| Metaphor | Same cinematographer, different scene | Same actor, different scene |
| Capability | Nano Banana Pro (3 Pro) | Nano Banana 2 (3.1 Flash) |
|---|---|---|
| Character reference images | Up to 5 | Up to 4 |
| Object fidelity images | Up to 6 | Up to 10 |
| Total reference images | Up to 11 | Up to 14 |
| Supported formats | PNG, JPEG, WebP, HEIC, HEIF | PNG, JPEG, WebP, HEIC, HEIF |
Important: The 14-image limit is NOT freely allocatable. Character and object quotas are independent.
GPT-Image-2 supports up to 16 reference images (JPEG, PNG, WebP, under 30MB each) via the images.edit endpoint.
Image 1: Character reference sheet showing Ollie the fox
Image 2: Approved illustration showing target color palette
Image 3: Scene composition reference
Generate a new illustration of Ollie walking through a forest.
Apply the watercolor style and color palette from Image 2.
Maintain Ollie's exact appearance from Image 1.
Use similar composition depth as Image 3.
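A sketch of the corresponding API call. gpt-image-1's images.edit accepts a list of input images, and we assume GPT Image 2 does the same, referencing inputs in order as "Image 1", "Image 2", and so on; the model id and file names are placeholders:

```python
import base64
from openai import OpenAI

client = OpenAI()

refs = [open(p, "rb") for p in (
    "ollie_character_sheet.png",  # Image 1
    "approved_palette.png",       # Image 2
    "composition_ref.png",        # Image 3
)]

result = client.images.edit(
    model="gpt-image-2",  # assumed model id
    image=refs,           # up to 16 references (JPEG/PNG/WebP, <30MB each)
    prompt=(
        "Generate a new illustration of Ollie walking through a forest. "
        "Apply the watercolor style and color palette from Image 2. "
        "Maintain Ollie's exact appearance from Image 1. "
        "Use similar composition depth as Image 3."
    ),
    size="1536x1024",
)

with open("ollie_forest.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```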
| Quality | Approx Cost per Image (1024x1024) |
|---|---|
| Low | ~$0.006 |
| Medium | ~$0.053 |
| High | ~$0.211 |
| Batch (50% discount) | Half of above |
Both Gemini and Claude can analyze an approved illustration and produce a structured style description, identifying attributes such as rendering technique, color palette, line work, lighting, and texture.
Key takeaway: For our purposes, using a vision model to extract a structured text description is more practical than training custom models. It is free/cheap, fast to iterate, and the output (text) can be directly used as prompt input for both NB2 and GPT-Image-2.
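A sketch of the extraction step using GPT-4o's vision input (Gemini or Claude work the same way); the extraction prompt is abbreviated here, and in practice we send the full 10-category version:

```python
import base64
from openai import OpenAI

client = OpenAI()

def to_data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

# Abbreviated extraction prompt; the full version covers all 10 categories.
EXTRACTION_PROMPT = (
    "Do NOT describe what is depicted. ONLY describe HOW it is depicted: "
    "rendering technique, color palette with hex codes, line work, lighting, "
    "texture, character proportions, background treatment, and what this "
    "style is NOT. Answer as JSON."
)

gold_standards = ["gold_01.png", "gold_02.png", "gold_03.png"]  # placeholder paths
content = [{"type": "text", "text": EXTRACTION_PROMPT}] + [
    {"type": "image_url", "image_url": {"url": to_data_url(p)}}
    for p in gold_standards
]

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": content}],
)
print(resp.choices[0].message.content)  # reusable style profile text
```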
[STYLE BLOCK - immutable]
Watercolor children's book illustration. Soft pencil outlines
with warm earthy palette.
[CHARACTER BLOCK - immutable per character]
Ollie: small orange fox with oversized round head, bright
curious eyes (#4169E1 blue), wearing a green scarf (#2E8B57).
[SCENE BLOCK - variable]
Ollie standing at the edge of a meadow, looking up at
fireflies. Golden hour lighting from the left.
Strategy: Vision-Extracted Style Profile + Multi-Reference Generation
| Factor | Rationale |
|---|---|
| No LoRA needed | Works natively with NB2 and GPT-Image-2 APIs |
| Low setup cost | Vision-based extraction is essentially free |
| Iterative | Profile can be refined as more illustrations are approved |
| Cross-model | Same style profile works for both NB2 and GPT-Image-2 |
| Automated QC | Vision models can score consistency without human bottleneck |
Phase 1: Select gold standards, run style extraction, merge into canonical profile, validate with text-only generations.
Phase 2: Test 0, 1, and 3 reference images on both NB2 and GPT-Image-2. Identify the optimal consistency-to-cost ratio (sketched after this list).
Phase 3: Test prompt ordering variations: style-first, scene-first, and interleaved. Measure style fidelity.
Phase 4: Test vision-model-based quality scoring. Validate against human judgment (>80% agreement target).
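A sketch of the Phase 2 grid, with generate() and score_consistency() as stubs for the API calls sketched elsewhere in this document; per-image costs are rough placeholders:

```python
from itertools import product

REF_COUNTS = (0, 1, 3)
MODELS = ("nb2", "gpt-image-2")
COST_PER_IMAGE = {"nb2": 0.04, "gpt-image-2": 0.053}  # placeholder estimates

def run_phase2(generate, score_consistency, scene_prompt, refs):
    """Return the (model, ref count) setting with the best consistency per dollar."""
    results = []
    for model, n in product(MODELS, REF_COUNTS):
        img = generate(model=model, prompt=scene_prompt, references=refs[:n])
        results.append({
            "model": model,
            "refs": n,
            "consistency": score_consistency(img),  # vision-model score, 0-1
            "cost": COST_PER_IMAGE[model],
        })
    return max(results, key=lambda r: r["consistency"] / r["cost"])
```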
| Phase | Estimated Cost |
|---|---|
| Phase 1 | $0.50-$1.00 |
| Phase 2 | $2.00-$5.00 |
| Phase 3 | $2.00-$3.00 |
| Phase 4 | $0.00 |
| Total | $4.50-$9.00 |
Estimated generations: 50-70 images total across all phases.
A 10-category JSON schema covering medium, color palette, line work, lighting, texture, character rendering, background treatment, mood/atmosphere, negative constraints, and prompt templates. The schema supports cross-model adaptation (NB2 and GPT-Image-2) with model-specific formatting differences.
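A skeleton of that schema as a Python dict (serialize with json.dumps when embedding it in a prompt); every value below is an illustrative placeholder:

```python
STYLE_PROFILE = {
    "medium": "watercolor with soft pencil outlines",
    "color_palette": {"primary": ["#E8A13C", "#2E8B57"], "mood": "warm, earthy"},
    "line_work": "loose pencil, consistent light weight",
    "lighting": "soft directional light, low contrast (~2:1)",
    "texture": "visible paper grain, dry-brush edges",
    "character_rendering": "oversized heads, simplified hands and feet",
    "background_treatment": "desaturated, less detailed than foreground",
    "mood_atmosphere": "gentle, curious, golden-hour warmth",
    "negative_constraints": ["no hard black outlines", "no photorealism"],
    "prompt_templates": {"style_block": "[STYLE] ...", "scene_block": "[SCENE] ..."},
}
```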
NB2 weights the start of the prompt heavily. Put your subject first, then style/environment details. Starting with style descriptors produces softer, less focused results.
Do: "A small brown dog wearing a red cape, standing on a hilltop at sunset, digital illustration style"
Don't: "Digital illustration style, sunset lighting, a small brown dog on a hilltop"
NB2 is a thinking model that understands intent, physics, and composition. Write prompts as if briefing a human illustrator. Avoid comma-separated tag lists.
Community benchmarks show JSON-structured prompts improve consistency by ~25% over plain text. Structure with labeled fields:
{
"subject": "A small brown dog named Ollie wearing a red cape",
"pose": "standing heroically, looking to the right",
"environment": "grassy hilltop at golden hour",
"style": "soft watercolor children's book illustration",
"lighting": "warm golden backlight with soft shadows",
"camera": "low angle, medium shot"
}
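Both APIs take plain-text prompts, so the structure travels inside the prompt string; a minimal sketch:

```python
import json

fields = {
    "subject": "A small brown dog named Ollie wearing a red cape",
    "pose": "standing heroically, looking to the right",
    "environment": "grassy hilltop at golden hour",
    "style": "soft watercolor children's book illustration",
}
prompt = json.dumps(fields, indent=2)  # pass this string as the prompt
```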
For different content in different parts of an image, divide the canvas into regions with a prompt for each. Useful for complex scenes with foreground characters and detailed backgrounds.
NB2 supports up to 10 reference images simultaneously. When using uploaded images, clearly define each role: "Use Image A for the character's pose," "Use Image B for the art style," "Use Image C for the background environment."
The most reliable consistency method: lock a character reference sheet first, then attach it to every scene generation with an explicit role assignment as above.
NB2 can retrieve real-world reference images via Google Search during generation. This is unique to NB2 (not available in Pro) and helps ground outputs in real-world visual context.
Include terms like "HD," "4K," or "HDR" for better clarity. Set a fixed seed for reproducibility, and combine it with weighted descriptors like "(same face:1.3)" to enforce consistency.
| Feature | Nano Banana 2 | Nano Banana Pro |
|---|---|---|
| Architecture | Gemini 3.1 Flash | Gemini 3 Pro |
| Speed | 4-6 sec at 1K | 10-20 sec at 1K |
| Quality | ~95% of Pro | Best absolute quality |
| Textures/Lighting | Very good | Superior |
| Unique Features | Image Search Grounding, Thinking Mode | N/A |
| Cost | Significantly cheaper | Higher per image |
| Best For | Speed, batch generation, stylized work | Precision, identity preservation |
| Capability | NB2 | GPT-Image-2 |
|---|---|---|
| Photorealism | Superior | Strong |
| Text/Typography | Good | Superior (best in class) |
| Anime/Illustration | Superior | Good |
| Structural Control | Moderate | Superior |
| Speed | Faster | Slower |
| Batch Consistency | Good | Superior |
| Cinematic Lighting | Superior | Good |
| Cost (API) | $60/M output tokens | $30/M output tokens |
Choose NB2 for: stylized illustration, photorealism, cinematic scenes, rapid iteration, anime. Choose GPT-Image-2 for: text-heavy designs, precise layouts, batch production, structural accuracy.
Single-model approaches force compromise. Every model has blind spots. The 2026 best practice is to route each task to the model that handles it best.
For children's book illustration: GPT-Image-2 excels at structural accuracy, text rendering, and batch consistency. NB2 excels at stylized illustration, cinematic lighting, and rapid iteration. Combining them produces results neither can achieve alone.
| Capability | NB2 | GPT-Image-2 | Best Choice |
|---|---|---|---|
| Photorealism | Excellent | Very Good | NB2 |
| Stylized Illustration | Excellent | Good | NB2 |
| Text/Typography | Good | Best in class | GPT-Image-2 |
| Precise Layout | Moderate | Excellent | GPT-Image-2 |
| Cinematic Lighting | Superior | Good | NB2 |
| Batch Consistency | Good | Excellent | GPT-Image-2 |
| Character Identity | Good (improving) | Strong | GPT-Image-2 |
| Speed | 4-6 sec/image | Slower | NB2 |
| Cost | $60/M output | $30/M output | GPT-Image-2 |
| Conversational Editing | Excellent | Good | NB2 |
| Multi-Reference | Up to 10 | Limited | NB2 |
Flow: GPT-Image-2 (structure) → NB2 (style transfer)
Flow: NB2 (ideation) → GPT-Image-2 (final production)
Both models generate independently on the same prompt. Compare and select the best per scene. Best when you have prompt budget and want maximum quality.
Phase 1 — Character Design (GPT-Image-2): Generate definitive character sheets. Lock proportions, colors, features. GPT-Image-2's batch consistency ensures reliable reference sheets.
Phase 2 — Scene Exploration (NB2): Rapid scene iteration with character sheet references. Explore compositions and lighting at 3-5x speed. Multi-reference conditioning with character + environment + style references. Generate 4-6 variants per scene.
Phase 3 — Final Production (Model-Dependent): Text scenes → GPT-Image-2. Atmospheric/stylistic scenes → NB2. Precise character placement → GPT-Image-2. Wide environmental scenes → NB2.
Phase 4 — Consistency Pass: Review all finals. Use NB2 conversational editing for minor adjustments. Use GPT-Image-2 for structural corrections.
Leonardo AI's Pro Upscaler is purpose-built for AI-generated images, capable of scaling up to 105 megapixels. Key differentiator: while most upscalers are trained on real photographs, Leonardo's understands synthetic grain and AI textures, cleaning them rather than amplifying them.
Verdict: Strong candidate. 105MP ceiling far exceeds print needs (12×8 inch spread at 300 DPI = ~8.6MP).
Leonardo's Canvas mode includes Inpainting/Outpainting tools for extending images beyond original boundaries.
Converting 1536×1024 (3:2) to 2048×1024 (2:1) requires ~512px of added width (~256px per side if extended symmetrically) — moderate and within the tool's sweet spot. Works best when extending backgrounds/environments rather than adding character elements.
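The arithmetic, holding height fixed and widening to 2:1:

```python
src_w, src_h = 1536, 1024
target_w = int(src_h * 2 / 1)        # 2048 px wide at 2:1
total_extension = target_w - src_w   # 512 px of new width
per_side = total_extension // 2      # 256 px per side if extended symmetrically
```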
| Plan | Price | Tokens/Month | API Access |
|---|---|---|---|
| Free | $0 | 150/day | No |
| Apprentice | $12/mo | 8,500 | Limited |
| Artisan | $30/mo | 25,000 | Yes |
| Maestro | $48-60/mo | 60,000 | Full + priority |
Token costs: Standard generation 2-25 tokens, upscaling ~2x generation cost. Cost optimization: generate low-res, only upscale selections (saves 40-50%). Artisan plan handles ~500-1,000 upscale operations. $5 free API credit for testing.
| Tool | Best For | API | Pricing | AI-Image Aware |
|---|---|---|---|---|
| Leonardo Pro | AI-generated images | Yes | Token-based | Yes (trained on synthetic) |
| Real-ESRGAN | General-purpose, free | Self-hosted | Free | No |
| Topaz Photo AI | Photographers | No | $199 one-time | No |
| Magnific AI | Creative upscaling | Yes | Subscription | Partial (over-hallucination risk) |
Recommendation: Leonardo Pro is the strongest candidate: purpose-built for AI images, API available, token-based pricing, outpainting and upscaling in one platform. Runner-up: Real-ESRGAN as free fallback. Avoid: Magnific — tendency to hallucinate new details could compromise illustration consistency.
| Version | Release | Status |
|---|---|---|
| V6 | Late 2024 | Legacy |
| V7 | 2025 | Stable, was default |
| V8.0 Alpha | March 17, 2026 | Superseded |
| V8.1 Alpha | April 14, 2026 | Current latest |
Hands/bodies/objects dramatically better from V7 onward. V8 further improves physical coherence. V8.1 corrects V8.0's over-processed default look. Some legacy features still missing (image prompting, in-painting). Reduced ability for abstract/surreal compositions compared to V7.
The --sref parameter extracts aesthetic qualities from reference images: color palette, lighting, texture treatment, composition tendencies. Unlike regular image prompts, it focuses on style elements rather than content.
The --sv (style version) parameter selects the style-reference algorithm; --sv 6 is the default (latest). New in V8: shape aesthetics visually — enter a prompt, get aesthetic options, select preferences, and receive a unique reusable style code. Web-only (midjourney.com), not Discord.
Moodboards: curate collections of images to define a persistent style. Upload images, name and save them for reuse, and apply them across generations.
Recommended workflow: explore styles interactively, save winning codes or moodboards, and export selected images for use as cross-model style references.
No public API. Enterprise dashboard users only, must apply. Unofficial APIs (PiAPI, APIFRAME, ImagineAPI) violate ToS and risk account bans. Midjourney cannot be reliably integrated into an automated pipeline.
Best used as a manual tool for style exploration and reference generation, not batch production.
| Plan | Price | Images/Month |
|---|---|---|
| Basic | $10/mo | ~200 |
| Standard | $30/mo | ~900 |
| Pro | $60/mo | ~1,800 fast |
For our use case (style references, not production), Basic at $10/mo would suffice.
| Capability | Score |
|---|---|
| Image quality (V8.1) | 9/10 |
| Style consistency (sref/moodboards) | 8/10 |
| Artifact avoidance | 8/10 |
| API for pipeline | 2/10 |
| Cross-model style transfer | 7/10 |
| Cost efficiency | 8/10 |
| Overall pipeline fit | 6/10 (style reference tool only) |
A structured text description of an image's visual style, produced by sending approved reference images to a vision model (Gemini, Claude, GPT-4o) and asking it to analyze and catalog style attributes. The output is a reusable text block for generation prompts.
| Count | Pros | Cons |
|---|---|---|
| 1 image | Simple, fast | Overfits to one scene's lighting/composition |
| 2 images | Better than one | Still too few to separate scene from style |
| 3-5 images | Sweet spot: enough diversity to find the common thread (recommended) | None significant |
| 6-10 images | Very thorough | Diminishing returns, may average too aggressively |
| 10+ images | Comprehensive | Descriptions become generic |
Key principle: Images should be diverse in SCENE but consistent in STYLE. Forces the vision model to identify style (what stays the same) vs. content (what changes).
A 10-category extraction covering: (1) Rendering Technique, (2) Color Palette with hex codes, (3) Line Work, (4) Lighting and Shadow with contrast ratios, (5) Texture and Surface, (6) Character Rendering proportions, (7) Background Treatment, (8) Composition and Framing, (9) What This Style Is NOT, (10) One-Sentence Test. Forces specificity, anchors to observable details, includes negatives.
The 2025-2026 standard: 70-78% of professional AI creators use JSON-formatted style guides. Structured prompts improve accuracy by 60-80% for complex scenes. Eliminates ambiguity, machine-readable, reproducible. GPT-Image-2 responds especially well to structured input.
Midjourney uses a dedicated internal style encoder (likely CLIP-based) to extract style features from a reference image and inject them as a separate conditioning signal. Style is kept separate from content.
| Feature | Midjourney --sref | NB2/GPT-Image-2 |
|---|---|---|
| Style encoding | Dedicated encoder, separate channel | General vision encoder, mixed with content |
| Style vs content separation | Explicit | Implicit (must be prompted) |
| Reproducibility | Deterministic codes | No equivalent |
| Weight control | --sw 0-1000 | No direct param |
| Technique | Can Use with NB2? | Can Use with GPT-Image-2? | Post-Processing? | Open Source? |
|---|---|---|---|---|
| StyleTokenizer (ECCV 2024) | No | No | Potentially | Yes |
| StyleBrush (2024) | No | No | Yes (promising) | Yes |
| InstantStyle (2024) | No | No | Yes (via ComfyUI) | Yes |
| IP-Adapter (2023-2025) | No | No | Yes (via ComfyUI) | Yes |
| Z-STAR (CVPR 2024) | No | No | Potentially | Yes |
Bottom line: None can be plugged directly into NB2 or GPT-Image-2 APIs. All require Stable Diffusion. However, they CAN be used as post-processing layers, and their design principles (separate style encoding, targeted injection, JSON prompts) inform better prompt writing.
Option B: generate a base image with GPT-Image-2 (composition, character consistency), then run it through a style-transfer layer for the specific aesthetic.
Current recommendation: Don't add post-processing yet. Focus on making GPT-Image-2/NB2 produce the right style natively. Post-processing is the fallback. If needed later, start with Option B.
"Colorful watercolor children's book illustration with warm tones" describes half of all children's books. Fix: Use the detailed extraction prompt with hex codes, ratios, named techniques.
Single image = profile captures scene-specific attributes instead of style attributes. Fix: Use 3-5 images with different scenes but same style. Intersection = style.
Style buried in middle/end of prompt. Fix: Style FIRST, clear section markers, repeat critical elements at end.
Vision model defaults to "what is in the image." Fix: Explicitly say "Do NOT describe what is depicted. ONLY describe HOW it is depicted."
Measurable attributes correct but images still look "off." Fix: Always pair text with reference images. Use one-sentence descriptor for emergent quality. Use negative constraints. Iterate.
Model's style distribution bias. Fix: Try the other model, use fine-tuned SD, apply post-processing, or adjust target style.
STEP 1: STYLE EXTRACTION (One-time, ~30 min)
- User provides 3-5 gold standard images
- Run extraction prompt on each via Gemini/Claude
- Aggregate into canonical JSON style profile
- User reviews and approves
STEP 2: PROMPT TEMPLATE CREATION (~15 min)
- Build three-block template: [STYLE] + [CHARACTER] + [SCENE]
- Test with 2-3 generations
- Adjust style profile based on test results
STEP 3: BATCH GENERATION (Per illustration, ~5-10 min)
- Fill in [SCENE] block
- Select 2-3 reference images
- Generate 3-5 candidates
- Automated QC: vision model scores consistency
- Present top 2-3 to user
STEP 4: ITERATIVE REFINEMENT (As needed)
- Fix one attribute at a time
- Re-run with updated profile
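A compact sketch of Steps 2-4 as a single function; generate() and score_consistency() are stubs for the API calls sketched earlier, and the 0.8 QC threshold is a placeholder:

```python
def produce_spread(scene_text, style_profile, character_blocks, references,
                   generate, score_consistency, n_candidates=4, qc_threshold=0.8):
    # Step 2: three-block prompt, style first (GPT Image 2 ordering).
    prompt = "\n\n".join([
        f"[STYLE] {style_profile}",
        *(f"[CHARACTER] {block}" for block in character_blocks),
        f"[SCENE] {scene_text}",
    ])
    # Step 3: generate candidates with 2-3 references, then score each one.
    candidates = [generate(prompt=prompt, references=references[:3])
                  for _ in range(n_candidates)]
    scored = sorted(((score_consistency(img), img) for img in candidates),
                    key=lambda pair: pair[0], reverse=True)
    # Step 4: surface the top candidates that clear QC; if none do,
    # fix one attribute in the profile and re-run.
    return [img for score, img in scored if score >= qc_threshold][:3]
```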
| Step | Images | Est. Cost |
|---|---|---|
| Style extraction (vision, text only) | 0 | $0 |
| Template validation | 3-5 | $0.30-$1.00 |
| Per illustration (3-5 candidates) | 3-5 | $0.30-$1.00 |
| Automated QC (vision, text only) | 0 | $0 |
| Full 24-page book (12 spreads) | 36-60 | $3.60-$12.00 |