Ollie & Dot
Draft 3 Planning

Research & Progress: 8 research reports (7 complete, 0 in progress, 1 with experiments pending)
Research Reports
RESEARCH
GPT-Image-2 Prompt Engineering
Complete
Research Finding
Skill Status
Prompt Structure
Material/Medium → Use Case → Scene → Subject → Details → Rendering → Constraints ordering. Model gives more attention to earlier tokens.
✓ In Skill 8-step prompt structure template.
Longer prompts (300-500 words) outperform shorter ones. Structure over adjectives.
✓ In Skill Long prompt guidance encoded.
Texture Artifacts & Watermarking
Steganographic watermarking causes muddy grain on smooth surfaces. Worse on photorealistic renders, better on collage/mixed-media styles.
✓ In Skill Fresh API call rule + surface quality language.
Global texture language ("gold foil stipple shimmers on surfaces") amplifies artifacts by giving model permission to add texture everywhere.
✓ In Skill Stipple locked to specific elements only.
Community workaround: 35mm film aesthetic / intentional grain language can mask artifacts by making texture look intentional (Kodak Portra 400 look).
To Explore Test in next batch.
Upscale-then-downscale pipeline smooths over watermark artifacts during reconstruction.
To Explore Post-processing option, not yet standard.
"Maximum density" language worsened artifacts in spread 05 by encouraging noise fill.
✓ In Skill Scene accuracy rules encoded.
Resolution & Sizes
Only 4 fixed sizes: 1024×1024, 1536×1024, 1024×1536, auto. Generate at 1536×1024 for book spreads, upscale externally for print.
✓ In Skill 1536x1024 locked in API params.
Character Consistency
Full character anchors must repeat in every prompt. Model does NOT remember characters between API calls. Up to 16 reference images, labeled by role.
✓ In Skill Canonical character anchors + pre-send checklist.
Dot rendered with mouths/crescents instead of glowing U-shapes. Ollie got ears. Anti-patterns and ALL-CAPS emphasis needed.
✓ In Skill Anti-patterns for Dot and Ollie encoded.
Vague descriptions produce different characters every time. Short character blocks give model too much creative freedom.
✓ In Skill Exhaustive character anchors required.
Scene Accuracy / Anatomical Errors
Multi-species compositions cause morphed creatures, extra fins, blob-like fish. Worse when species listed vaguely or in excess.
✓ In Skill Species limit and anatomy rules.
Background creatures can be simplified (silhouettes, "schools of small fish") to save attention budget for foreground species.
Not Started Could be formalized as a rule.
API Parameters
Quality tiers: low (~$0.006), medium (~$0.053), high (~$0.21). Output formats: png (lossless), jpeg (faster), webp (smallest).
✓ In Skill Quality tiers and cost table encoded.
Generate 4 variants per spread for selection. Concept → refinement → final render workflow.
✓ In Skill 4-variant workflow encoded.
Full Report — 2026-05-05/06

1. Prompt Structure

The Recommended Order

GPT Image 2 rewards structure over adjectives. Words like "stunning" or "beautiful" do very little. What works is organized, specific description in this order:

  • Material/Medium Block — What the image is made of (torn paper, gold foil, gouache, watercolor paper)
  • Use Case / Format — What the image is for (children's book illustration, full bleed 3:2)
  • Scene/Setting — The environment, background, lighting conditions
  • Subject(s) — Each character described individually with full anchors
  • Environment Details — Specific visual attributes, sea life, props
  • Rendering / Palette — Colors, light direction, depth zones
  • Constraints — What to avoid or preserve

Why Order Matters

The model allocates more attention to earlier tokens in the prompt. Putting materials and format first tells the model "this is a mixed-media collage" before it starts imagining the scene, which fundamentally changes how it renders everything that follows. Burying the style at the end means the model may have already committed to a photorealistic rendering before it sees "watercolor."

Prompt Length

GPT Image 2 follows long prompts more reliably than most other models. Our best results used 300-500 word prompts. If you're getting generic results, the answer is usually more specificity, not less. Be explicit about every element you care about.

Concrete Example (from our Spread 01 prompt)

Dry torn paper coral with visible fiber texture exists deep
underwater alongside photographic reef life.
Gold foil metallic stipple shimmers on surfaces submerged
in real ocean water.
Gouache painted characters swim through photographic
Hawaiian reef. Heavy watercolor paper base.

Full bleed 3:2 children's book illustration. Mixed media
collage blending with underwater photography.

SCENE: A dramatic steep Hawaiian reef drop-off descending
into the deep...

OLLIE: A cartoon baby octopus. [full character anchor with
colors, proportions, anti-patterns]...

DOT: Solid natural Tahitian pearl. [full character anchor
with eye shape, anti-patterns]...

ENVIRONMENT: Rich shallow coral reef dropping into deep...

RENDERING: Warm golden god-ray light transitioning to deep
navy. [hex palette]. Full bleed 3:2 aspect ratio.

Text Rendering

GPT Image 2 achieves ~99% accuracy on text (vs ~60% for DALL-E 3). Place exact text in quotation marks or ALL CAPS, specify font style and placement explicitly. Best at 1-5 words per text element. Not relevant for our book illustrations (we avoid in-image text).

2. Texture Artifacts and Watermarking

What the Artifacts Are

Shortly after GPT Image 2 launched on April 21, 2026, users reported a consistent pattern of visual artifacts: generated images layered with persistent tiling textures and grime:

  • Muddy smears or dirt overlays on photorealistic outputs
  • Tiling noise patterns visible across interiors, landscapes, and other scenes
  • Subtle grime textures that appear embedded at the pixel level
  • Noise pattern amplification where artifacts compound across successive generations

What Causes It

Primary cause: Steganographic Watermarking. OpenAI embeds a dual-layer watermark into every generated image: (1) C2PA metadata and (2) an imperceptible pixel-level watermark with no public detector yet. The pixel watermark is baked into the generation process itself, creating visible texture patterns especially on smooth, uniform surfaces. Downstream workflows like upscaling or compositing can amplify these artifacts.

Secondary cause: Context-dependent noise accumulation. The model propagates noise from previous generations within the same conversation. Artifacts carry over and compound with each successive image in a chat thread. OpenAI fixed a noise amplification bug, but some baseline texture persists.

Tertiary cause: Training data contamination. GPT Image 2 has been observed generating images containing visible Gemini branding and other AI-model artifacts — evidence that AI-generated content in the training data introduces learned artifacts.

Which Images Show Artifacts MORE

  • Photorealistic renders with smooth surfaces (skin, fabric, plastic)
  • Sky gradients and large uniform color fields
  • Interior scenes with walls, floors, ceilings
  • Organic landscapes with dense foliage (forests reveal a distinct synthetic rendering pattern)
  • Any image where you zoom in on micro-textures (skin pores, fine detail)

Which Images Show Artifacts LESS

  • Heavily textured subjects where noise blends in (collage, mixed media)
  • Watercolor and gouache styles (intentional texture masks artifact texture)
  • Subjects with inherent visual complexity (dense coral reefs, busy compositions)
  • Stylized/illustrated looks rather than photorealistic

Community Workarounds

  • AI eraser spot cleanup: Export the image, use an AI eraser tool to brush over specific artifact areas — faster than regenerating the whole image
  • Upscale-then-downscale pipeline: Upscale with an external tool (Real-ESRGAN, Topaz), then downscale back. The upscaler smooths over watermark artifacts during reconstruction
  • Controlled grain language: Specify intentional grain ("35mm film aesthetic," "Kodak Portra 400 look") to replace artifact grain with controlled, aesthetically pleasing grain
  • Fresh API calls: Each generation should be a fresh API call with no conversation history
  • Positive surface language: Describe what you want: smooth matte finish, clean flat color
  • Use quality: "high" for finals — lower quality settings show more compression artifacts (see the sketch below)
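
The fresh-call, quality, and format rules above are easy to mechanize. Below is a minimal sketch using the OpenAI Python SDK; the "gpt-image-2" model id is an assumption, as is the premise that the images API keeps the parameter shape (size, quality, output_format) this report describes:

import base64
from openai import OpenAI

def generate_spread(prompt: str, out_path: str) -> None:
    # One fresh API call per image: no conversation history,
    # so no cross-generation noise accumulation.
    client = OpenAI()
    result = client.images.generate(
        model="gpt-image-2",      # assumed model id
        prompt=prompt,            # full structured prompt, character anchors included
        size="1536x1024",         # 3:2 landscape for book spreads
        quality="high",           # finals only; iterate at "low"/"medium"
        output_format="png",      # lossless for illustration work
    )
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(result.data[0].b64_json))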

Our Batch Analysis (May 5-6, 2026)

Analysis of our 5-spread OpenAI batch revealed likely artifact amplifiers in our prompts:

  • Global texture language: Phrases like "gold foil metallic stipple shimmers on surfaces" gave the model permission to add texture everywhere, including areas that should be smooth. Fix: target texture to specific elements ("gold foil stipple on Ollie's skin only")
  • "Maximum density" language: Spread 05 used "maximum density" which likely worsened artifacts by encouraging the model to fill negative space with noise/texture
  • Mixed photorealistic/illustrated language: Asking for "photographic Hawaiian reef" alongside "gouache painted characters" forces a hybrid rendering mode where photorealistic elements are more susceptible to watermark artifacts
  • Our style inherently helps: The mixed-media collage look (torn paper, watercolor, gouache) naturally masks artifacts that would be obvious on photorealistic surfaces

3. Resolution Capabilities

Actual Specifications

Parameter       | Value
Supported sizes | 1024×1024 (1:1), 1536×1024 (3:2), 1024×1536 (2:3), auto
Our output      | 1536×1024 (3:2 landscape)

Key finding: GPT-Image-2 only supports 4 fixed sizes: 1024×1024 (1:1), 1536×1024 (3:2 landscape), 1024×1536 (2:3 portrait), and auto. For print-quality book illustrations, generate at 1536×1024 (3:2) and upscale externally (Real-ESRGAN, Topaz Gigapixel).
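
As a sanity check on the upscaling requirement (the 12×8 inch spread figure comes from the Leonardo post-processing report later on this page):

# Print target: 12 x 8 inch full-bleed spread at 300 DPI
target_w, target_h = 12 * 300, 8 * 300      # 3600 x 2400 px, ~8.6 MP
native_w, native_h = 1536, 1024             # GPT-Image-2's 3:2 output
print(target_w / native_w, target_h / native_h)
# 2.34, 2.34 -> a ~2.5x external upscale covers print comfortably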

4. Character Consistency

What Works

  • Detailed character anchors repeated in every prompt — full physical description with colors, proportions, materials. Our Ollie anchor is ~100 words; Dot is ~80 words. Both must appear in EVERY prompt
  • Explicit anti-pattern lists — "NO mouth, NO ears, NO antenna, NO hardware" prevents the most common character drift
  • Reference images labeled by role — "Image 1: character sheet. Image 2: approved style reference." Up to 16 reference images supported
  • Iterative refinement language — "Maintain the same composition and character but change [specific element]"

What Does Not Work

  • Vague descriptions — "a cute octopus" produces different characters every time
  • Relying on model memory — the model does NOT remember characters between separate API calls
  • Short character blocks — fewer details = more creative freedom for the model = more drift

Our Batch 1 Issues

  • Dot rendered with mouths, crescents, and arcs rather than the specified glowing U-shapes. Needed more aggressive anti-patterns and ALL-CAPS emphasis on the eye shape
  • Ollie occasionally got ears despite "NO EARS" in the prompt. Repeating "NO EARS" in the rendering block as well as the character block improved compliance
  • Scene concepts came through well (reef, depth zones, color palette), but character-level detail was the weak point. The model allocates attention broadly across long prompts, so character-critical details need emphasis (CAPS, repetition, anti-patterns); see the sketch below
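
A sketch of how these fixes could be enforced mechanically. The anchor strings here are abbreviated stand-ins, not the canonical ~100-word blocks:

# Canonical anchors live in one place; every prompt gets them verbatim.
OLLIE_ANCHOR = "OLLIE: A cartoon baby octopus. [...full ~100-word anchor...] NO EARS, NO MOUTH."
DOT_ANCHOR = ("DOT: Solid natural Tahitian pearl. [...full ~80-word anchor...] "
              "glowing U-shaped eyes, NO mouth, NO crescents.")

def build_prompt(scene: str, rendering: str) -> str:
    # Critical anti-patterns are repeated in the rendering block as well as
    # the character block -- the fix that improved "NO EARS" compliance.
    rendering += " Ollie has NO EARS. Dot's eyes are glowing U-shapes, NOT crescents."
    return "\n\n".join([scene, OLLIE_ANCHOR, DOT_ANCHOR, f"RENDERING: {rendering}"])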

5. Scene Accuracy and Anatomical Errors

The Multi-Species Problem

GPT Image 2 struggles with complex multi-species underwater compositions. When a prompt lists many species, the model often:

  • Morphs multiple species together (fish with wrong fin configurations, hybrid creatures)
  • Generates blob-like creatures that do not match any real species
  • Adds extra fins, limbs, or appendages to animals
  • Gets species colors wrong (yellow tang rendered as blue, etc.)

How to Fix

  • Limit unique species to 4-6 per scene — fewer species = more attention per species = better anatomy
  • Describe each creature individually with real anatomy: "a Moorish idol with its distinctive tall triangular dorsal fin, black-and-white vertical banding, and trailing yellow caudal filament"
  • Be explicit about count and placement: "three yellow tangs swimming left-to-right through the midground"
  • Separate foreground from background creatures: give detailed anatomy only to foreground species; background species can be silhouettes or "schools of small fish"
  • Use real species names with key identifying features: the model renders more accurately when given the correct name plus 1-2 anatomical features

Our Batch Findings

Spread 05 requested maximum density with 10+ named species and showed morphed fish and anatomically incorrect creatures. Spread 01 and 08, which had fewer species described more carefully, produced better individual creature accuracy. For future batches: prioritize quality of each creature over quantity of species.

6. Editing and Reference Images

How Reference Images Work

GPT Image 2 processes all reference images at high fidelity automatically with no adjustable knob.

Multi-Image Input

  • Label each by number and role: "Image 1: character reference. Image 2: style reference. Image 3: background scene."
  • Describe how they interact: "Apply the style from Image 1 to the character in Image 2, placed in the setting of Image 3."
  • Downscale references to what the task actually needs

The Edit Pattern

Change: [exactly what should change]
Preserve: [face, identity, pose, lighting, framing, background, geometry, text, layout]
Constraints: [no extra objects, no redesign, no logo drift, no watermark]
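
The edit pattern is mechanical enough to template. A small helper, illustrative only (field contents are placeholders):

def edit_prompt(change: str, preserve: list[str], constraints: list[str]) -> str:
    # Mirrors the Change / Preserve / Constraints pattern above.
    return (
        f"Change: {change}\n"
        f"Preserve: {', '.join(preserve)}\n"
        f"Constraints: {', '.join(constraints)}"
    )

# e.g. edit_prompt("make the water one shade warmer",
#                  ["Ollie's pose", "composition", "palette"],
#                  ["no extra objects", "no redesign"])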

7. API Parameters

Setting | Use Case                    | Cost (1024×1024)
low     | Fast drafts, thumbnails     | ~$0.006
medium  | General use                 | ~$0.053
high    | Final assets, dense layouts | ~$0.211
auto    | Let the model decide        | varies
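
Rough batch math using the figures above. Note the costs are quoted at 1024×1024, so treating them as estimates for 1536×1024 output is an assumption:

COST = {"low": 0.006, "medium": 0.053, "high": 0.211}  # approx $/image

def batch_cost(spreads: int, variants: int = 4, quality: str = "high") -> float:
    # 4 variants per spread for selection, per the workflow above.
    return spreads * variants * COST[quality]

print(batch_cost(5))            # 5-spread batch at high: ~$4.22
print(batch_cost(5, 4, "low"))  # same batch as low-quality drafts: ~$0.12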

Output Format

Format | Notes
png    | Default. Lossless. Best for illustration work.
jpeg   | Faster than PNG. Good when latency matters.
webp   | Smallest file size. Good for web delivery.

8. GPT Image 2 vs DALL-E 3

Feature                 | GPT Image 2                           | DALL-E 3
Text rendering accuracy | ~99%                                  | ~60%
Texture/photorealism    | More natural, contextually grounded   | Good but less precise
Character consistency   | Strong with proper anchoring          | Inconsistent
Prompt following        | Follows long, structured prompts well | Better with shorter prompts
Max resolution          | Up to 2K (2048×2048)                  | 1024×1024

9. Recommendations for Ollie and Dot

Avoiding Texture Issues

  • Fresh API call for each image — never iterate within a conversation
  • Use quality: "high" and png output format for all finals
  • Target texture language to specific elements, not globally
  • Keep "maximum density" language out of prompts — specify exact counts and placements
  • Post-process with AI eraser for spot artifacts rather than regenerating entire images
  • Our mixed-media style is a natural advantage — lean into it

Improving Character Accuracy

  • Repeat character anchors verbatim in every prompt (100+ words per character)
  • Use ALL-CAPS for critical anti-patterns: "NO EARS," "NO MOUTH"
  • Repeat critical constraints in both the character block AND the rendering block
  • Attach 2-3 approved reference images labeled by role

Improving Scene Accuracy

  • Limit to 4-6 unique species per scene
  • Describe each creature with real anatomical features
  • Specify exact counts and positions: "three yellow tangs in the midground"
  • Use "schools of small colorful fish" for background density

Iteration Workflow

  • Concept phase: quality: "low", rapid iteration on composition and pose
  • Refinement phase: quality: "medium", nail down details and expressions
  • Final render: quality: "high", png format, full resolution
  • Post-processing: AI eraser for spot artifacts, external upscaling for print resolution

10. Sources

  • OpenAI Developer Community — GPT-image-generator 2.0 issues, bugs, and workarounds thread. Primary source for texture artifact reports, noise amplification fix, community workarounds
  • Startup Fortune — GPT Image 2 grime artifacts and watermark strategy analysis. Steganographic watermark analysis, pixel-level provenance embedding
  • OpenAI Help Center — C2PA in ChatGPT Images. Official C2PA metadata documentation
  • OpenAI Cookbook — Image Generation Models Prompting Guide. Official prompt structure recommendations
  • fal.ai — GPT Image 2 Prompting Guide. Structured prompting best practices, edit patterns
  • a2a-mcp.org — GPT Image 2 Prompts Guide 2026. Advanced prompt techniques, character consistency
  • Framia — GPT Image 2 Resolution analysis. Native 2K ceiling, resolution specifications
  • PixVerse — GPT Image 2 Review 2026. Prompt guide, use cases, quality comparison
  • PixNova — GPT Image 2 48-Hour Stress Test. Post-processing workflows, AI eraser technique
  • ImagesPlatform — GPT Image 2 In-Depth Technical Review. 4K claims vs reality, character consistency
  • Our own generation experiments — Batch 1 (5 spreads, May 5-6 2026): prompt analysis of spreads 01, 05, 07, 08, 13
RESEARCH
Style Profile System Research
Complete
Research Finding
Skill Status
Reference Image Capabilities
NB2 supports up to 14 reference images (4 character + 10 object). Attach 2-3 approved illustrations and say "match this style."
✓ In Skill NB2 reference image strategy encoded.
GPT-Image-2 supports up to 16 reference images via edit endpoint. Label each by role ("Image 1: style reference, Image 2: character reference").
✓ In Skill GPT-Image-2 labeled role system encoded.
NBP (Gemini 3 Pro) supports up to 11 references (5 character + 6 object). Higher quality, slower.
✓ In Skill NBP reserved for hero images.
Midjourney --sref Analysis
Midjourney --sref uses a dedicated style encoder to separate content from style. Extracts colors, textures, lighting, brushwork as a conditioning signal. Not available via API.
✓ In Skill Vision-extracted style profiles as alternative.
--sref has numeric codes for reproducible styles, weight control (--sw), and can blend multiple references. --cref separates character from style.
Not Started Prompt weighting experiments needed.
Vision-Extracted Style Profiles
Feed approved illustrations to Gemini/Claude and extract structured style description: materials, textures, colors, lighting, rendering technique. Common elements become the reusable "style profile."
✓ In Skill Full style profile built.
Style profile should be a 10-category JSON schema covering medium, color palette, line work, lighting, texture, character rendering, background treatment, mood, constraints, and prompt templates.
✓ In Skill All 10 categories covered in skill.
Academic Papers
StyleTokenizer (ECCV 2024): trains a style feature extractor on 30K images across 300+ categories. Single-image style extraction and injection into diffusion models.
To Explore Could inform future custom model training.
StyleBrush (2024): single-image style transfer via diffusion model trained on Wikiart. Proves single-image style extraction is viable for production.
To Explore Potential alternative pipeline.
Semantic Guidance (Nature, 2025): uses semantic segmentation to decouple multiple styles from reference images with parallel decoupling adapters.
To Explore Worth monitoring for future use.
QC Scoring
Automated QC: feed generated image + approved reference to vision model. Score 1-10 on material authenticity, color match, texture quality, lighting. Below 7 = reject, 7-8 = review, 9-10 = pass.
✓ In Skill 10-dimension scoring checklist encoded.
Materials / Textures / Color Palette
4 locked materials (torn paper coral, gold foil stipple, gouache characters, watercolor paper base). 6 locked textures. 7-layer media stack. Color palette varies by depth zone.
✓ In Skill Materials, textures, palette fully encoded.
Anti-patterns identified: pure digital look, flat vector, smooth plastic, anime style, pure black outlines, neon colors, cardboard feel, collage on top of photo.
✓ In Skill 12 anti-patterns in dedicated table.
Experiment Plan
4 phases: style extraction ($0.50-$1), reference image testing ($2-$5), prompt template optimization ($2-$3), automated QC validation ($0). Total: ~$4.50-$9.00, 50-70 images.
Not Started Phases 2-4 not yet executed.
Full Report — 2026-05-06

1. Midjourney --sref: How It Works

Mechanism

Midjourney's --sref (Style Reference) is a parameter that transfers the aesthetic of a reference image or a numeric style code to a new generation. It operates in two modes:

  • Image URL mode: --sref [image URL] — Midjourney analyzes the image and attempts to transfer its style.
  • Numeric code mode: --sref [code] — A numeric identifier mapping to a pre-defined internal style within Midjourney's latent space.

What --sref Captures

  • Drawing/collage techniques and rendering method
  • Photography styles (if photographic)
  • Color palettes and contrast levels
  • Light and shadow ratios
  • Compositions and layouts
  • Texture and surface quality
  • Overall mood/atmosphere

Internal Implementation

Midjourney has not published their architecture, but based on community analysis:

  • Style codes are points in a learned style latent space. Each numeric code maps to a specific region producing a reproducible aesthetic.
  • Image-based --sref likely uses a vision encoder (possibly CLIP-based) to extract style features, then conditions the diffusion process on those features.
  • The system separates content from style. --sref transfers aesthetics without subject matter; --cref does the opposite.
  • Style versions (--sv) indicate different encoder architectures. The current default is --sv 6.

--sref vs --cref

Feature      | --sref (Style Reference)                                 | --cref (Character Reference)
Purpose      | Keep the "camera"/art direction the same                 | Keep the "actor"/character the same
Captures     | Colors, textures, lighting, composition, rendering style | Facial features, hair, body proportions, clothing
Weight param | --sw (0-1000, default 100)                               | --cw (0-100, default 100)
Metaphor     | Same cinematographer, different scene                    | Same actor, different scene

Why It Works Well

  • Dedicated style encoding — Style is extracted as a separate conditioning signal
  • Numeric reproducibility — Same code always produces the same aesthetic
  • Weight control — The --sw parameter lets you blend style influence
  • Multiple references — You can provide multiple style images to blend

Limitations

  • Illustrations get more creative reinterpretation than photographs
  • Not available via API for external tooling (Discord-only or Midjourney web)
  • Style codes are opaque; cannot engineer a code for a specific style without trial and error

2. Nano Banana 2 / Gemini Style Capabilities

Model Overview

  • Nano Banana Pro = Gemini 3 Pro Image — highest quality, up to 5 character + 6 object references (11 total)
  • Nano Banana 2 = Gemini 3.1 Flash Image — faster, cheaper, up to 4 character + 10 object references (14 total)

Style Reference Image Support

Capability                 | Nano Banana Pro (3 Pro)     | Nano Banana 2 (3.1 Flash)
Character reference images | Up to 5                     | Up to 4
Object fidelity images     | Up to 6                     | Up to 10
Total reference images     | Up to 11                    | Up to 14
Supported formats          | PNG, JPEG, WebP, HEIC, HEIF | PNG, JPEG, WebP, HEIC, HEIF

Important: The 14-image limit is NOT freely allocatable. Character and object quotas are independent.

Best Approach for Style Consistency

  • Provide 2-4 approved style reference images alongside each generation prompt
  • Use explicit referential language: "Use the same watercolor style, color palette, and line weight as the reference images."
  • Maintain a character sheet and use it as reference for all subsequent generations
  • Text prompts should precede reference images in the API call
  • Recommended order: Subject > Composition > Action > Location > Style > Editing instructions

3. GPT-Image-2 Style Reference Capabilities

GPT-Image-2 supports up to 16 reference images (JPEG, PNG, WebP, under 30MB each) via the images.edit endpoint.

How Style Matching Works

  • Analyzes provided reference images for visual features at high fidelity automatically
  • Applies style, palette, edge treatment, and silhouette language from references
  • Maintains character consistency across variations, style transfers, and partial edits

Best Prompt Structure for Style Consistency

Image 1: Character reference sheet showing Ollie the fox
Image 2: Approved illustration showing target color palette
Image 3: Scene composition reference

Generate a new illustration of Ollie walking through a forest.
Apply the watercolor style and color palette from Image 2.
Maintain Ollie's exact appearance from Image 1.
Use similar composition depth as Image 3.

Cost Structure

Quality              | Approx Cost per Image (1024×1024)
Low                  | ~$0.006
Medium               | ~$0.053
High                 | ~$0.211
Batch (50% discount) | Half of above

4. Automated Style Extraction Approaches

Vision Models as Style Analyzers

Both Gemini and Claude can analyze an approved illustration and produce a structured style description, identifying:

  • Color palette: Dominant colors, accent colors, approximate hex values
  • Rendering technique: Watercolor, gouache, digital paint, pencil
  • Line quality: Weight, softness, color of outlines
  • Texture: Paper grain, brush strokes, noise patterns
  • Lighting: Direction, warmth, contrast ratio, shadow softness
  • Composition patterns: Typical framing, depth, character-to-background ratio
  • Character proportions: Head-to-body ratio, stylization level

Practical Workflow

  • Select 3-5 approved "gold standard" illustrations
  • Send each to a vision model with a structured analysis prompt
  • Aggregate and normalize the extracted attributes into a single style profile
  • Use the style profile as a prompt prefix for all subsequent generations
  • Periodically re-validate by comparing new generations against gold standards (extraction sketch below)
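
A sketch of the extraction step, using GPT-4o as the vision analyzer (any of the vision models named above would do; the extraction prompt here is a condensed stand-in for the 10-category version described elsewhere in this report):

import base64
from openai import OpenAI

EXTRACTION_PROMPT = (
    "Analyze this illustration's STYLE, not its content. Describe: rendering "
    "technique, color palette (approx hex), line work, lighting, texture, "
    "character proportions, background treatment. Ignore the specific scene."
)

def extract_style(image_path: str) -> str:
    client = OpenAI()
    b64 = base64.b64encode(open(image_path, "rb").read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": EXTRACTION_PROMPT},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    return resp.choices[0].message.content

# Run on each of the 3-5 gold standards, then merge the common attributes.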

Academic Research

  • StyleTokenizer (ECCV 2024): Trains a dedicated style feature extractor on 30,000 images across 300+ style categories. Extracts style from a single reference image and injects it into a diffusion model.
  • StyleBrush (2024): Extracts and transfers style from a single image using a diffusion model trained on Wikiart. Demonstrates single-image style extraction is viable for production.
  • Semantic Guidance (Nature, 2025): Uses semantic segmentation to decouple multiple styles from reference images with parallel decoupling adapters.

Key takeaway: For our purposes, using a vision model to extract a structured text description is more practical than training custom models. It is free/cheap, fast to iterate, and the output (text) can be directly used as prompt input for both NB2 and GPT-Image-2.

5. Professional AI Illustration Studio Best Practices

Tier 1: LoRA Fine-Tuning (Most Reliable, Most Effort)

  • Train a custom LoRA model on 15-30 character reference images
  • Produces 95%+ consistency for books over 20 pages
  • Not applicable since NB2 and GPT-Image-2 don't support custom LoRA injection

Tier 2: Reference Image + Prompt Engineering (Our Sweet Spot)

  • Create a canonical character sheet (front view, side view, expressions, outfit variations)
  • Build a style guide with palette hex codes, line weight, shading rules
  • Use a base prompt template with immutable character traits separated from variable scene elements
  • Append scene-specific details to the template for each illustration
[STYLE BLOCK - immutable]
Watercolor children's book illustration. Soft pencil outlines
with warm earthy palette.

[CHARACTER BLOCK - immutable per character]
Ollie: small orange fox with oversized round head, bright
curious eyes (#4169E1 blue), wearing a green scarf (#2E8B57).

[SCENE BLOCK - variable]
Ollie standing at the edge of a meadow, looking up at
fireflies. Golden hour lighting from the left.

Quality Control Process

  • Automated similarity scoring — SSIM and LPIPS metrics vs reference standards
  • Human review — Side-by-side comparison with approved references
  • Targeted regeneration — Failed pages get img2img refinement, not full recreation

6. Recommended Approach for Ollie and Dot

Strategy: Vision-Extracted Style Profile + Multi-Reference Generation

  • Step 1: Select 3-5 gold standard illustrations from existing corpus
  • Step 2: Extract structured style profiles via vision model (Gemini/Claude)
  • Step 3: Create three-block prompt templates (Style + Character + Scene)
  • Step 4: Generate with 2-3 gold standards + 1-2 character sheets as reference images
  • Step 5: Automated QC loop — vision model scores consistency, regenerate if score < 7

Factor         | Rationale
No LoRA needed | Works natively with NB2 and GPT-Image-2 APIs
Low setup cost | Vision-based extraction is essentially free
Iterative      | Profile can be refined as more illustrations are approved
Cross-model    | Same style profile works for both NB2 and GPT-Image-2
Automated QC   | Vision models can score consistency without human bottleneck
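
Step 5's QC loop, sketched with hypothetical generate/score_consistency helpers standing in for the model calls described above (neither is a real API; both would wrap the generation and vision-scoring calls):

def qc_loop(spread_prompt: str, references: list[str], max_tries: int = 3):
    for _ in range(max_tries):
        image = generate(spread_prompt, references)    # hypothetical: Step 4 model call
        score = score_consistency(image, references)   # hypothetical: vision model, 1-10
        if score >= 7:                                 # below 7 = reject, per QC rule
            return image
        # optionally fold the scorer's critique into the prompt before retrying
    return image  # surface the best attempt for human review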

7. Experiment Plan

Phase 1: Style Profile Extraction (Cost: ~$0-1)

Select gold standards, run style extraction, merge into canonical profile, validate with text-only generations.

Phase 2: Reference Image Strategy Testing (Cost: ~$2-5)

Test 0, 1, and 3 reference images on both NB2 and GPT-Image-2. Identify optimal consistency-to-cost ratio.

Phase 3: Prompt Template Optimization (Cost: ~$2-3)

Test prompt ordering variations: style-first, scene-first, and interleaved. Measure style fidelity.

Phase 4: Automated QC Validation (Cost: ~$0)

Test vision-model-based quality scoring. Validate against human judgment (>80% agreement target).

Phase   | Estimated Cost
Phase 1 | $0.50-$1.00
Phase 2 | $2.00-$5.00
Phase 3 | $2.00-$3.00
Phase 4 | $0.00
Total   | $4.50-$9.00

Estimated generations: 50-70 images total across all phases.

8. Proposed Style Profile Schema

A 10-category JSON schema covering medium, color palette, line work, lighting, texture, character rendering, background treatment, mood/atmosphere, negative constraints, and prompt templates. The schema supports cross-model adaptation (NB2 and GPT-Image-2) with model-specific formatting differences.
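
A skeleton of what the schema might look like in practice. Category names follow the list above; all values are placeholders, not the canonical profile:

STYLE_PROFILE = {
    "medium": "mixed-media collage: torn paper, gold foil, gouache, watercolor base",
    "color_palette": {"core": ["#..."], "by_depth_zone": {}},
    "line_work": "...",
    "lighting": "...",
    "texture": "...",
    "character_rendering": "...",
    "background_treatment": "...",
    "mood_atmosphere": "...",
    "negative_constraints": ["pure digital look", "flat vector", "neon colors"],
    "prompt_templates": {"nb2": "...", "gpt_image_2": "..."},
}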

RESEARCH
Character Consistency Research
Not Started
Research Finding
Skill Status
Turnaround Sheets
Turnaround sheet effectiveness across NB2 and GPT-Image-2 needs testing.
Not Started Generate Ollie turnaround candidates first.
Reference Image Count
Optimal reference image count per model for character stability. NB2 (14 max), GPT-Image-2 (16 max), NBP (11 max).
Not Started Test 1 vs 3 vs 5 reference images.
Dot Description Variants
Dot text description variants: text-only vs. reference images. Early finding: text-only produces better iridescent/pearlescent results.
✓ In Skill Text-only Dot rule encoded.
Cross-Scene Stability
Cross-scene stability scoring methodology — how to measure if a character looks "the same" across different scenes/depth zones.
Not Started Adapt QC scoring for character measurement.
RESEARCH
NB2 Prompt Engineering
Complete
Research Finding
Skill Status
Prompt Structure
Subject-first ordering: NB2 weights start of prompt heavily. Put subject first, then style/environment. Natural language over tag soup. JSON-structured prompts improve consistency ~25%.
✓ In Skill JSON prompt template built.
Regional prompts: divide canvas into regions with separate prompts for different areas. Useful for complex foreground/background scenes.
To Explore Test for spread compositions.
Reference Image Strategy
Up to 10 reference images simultaneously. Clearly define each role: pose, style, environment. 360-degree character sheets are the most reliable consistency method.
✓ In Skill Multi-reference strategy encoded.
Image Search Grounding (NB2 exclusive): retrieves real-world reference images via Google Search during generation for visual grounding.
To Explore Test for reef reference grounding.
Temperature / Parameters
Temperature 0-0.5 for consistency-critical work (character sheets, batches). 0.8-1.2 for balanced illustration. 1.5-2.0 for exploration. Thinking Mode: Minimal (fast), High (best quality), Dynamic (auto).
✓ In Skill Low temp for batches, high for hero.
Seed control for reproducibility. Include "HD"/"4K" keywords for sharpness. Weighted descriptors like "(same face:1.3)" enforce consistency.
✓ In Skill Seed + quality keywords encoded.
Character Consistency
Named characters tracked better across multi-turn sessions. Consistency anchors ("same character", "maintain facial features") needed in every prompt. Iterative editing (not regeneration) preserves identity better.
✓ In Skill Named characters + anchors encoded.
Style locking: prompt "same art style" across sequences. Define style guide once, reference every prompt. Combine with negative prompts to prevent common failures.
✓ In Skill Style lock rules in prompt template.
NB2 vs GPT-Image-2 Strengths
NB2 superior: photorealism, stylized illustration, cinematic lighting, speed (4-6s), anime, conversational editing. GPT-Image-2 superior: text rendering, precise layouts, batch consistency, character identity. NB2 reaches ~95% of Pro quality at 3-5x speed.
✓ In Skill Model routing rules encoded.
Full Report — 2026-05-06

1. Prompt Structure Best Practices

Subject-First Ordering

NB2 weights the start of the prompt heavily. Put your subject first, then style/environment details. Starting with style descriptors produces softer, less focused results.

Do: "A small brown dog wearing a red cape, standing on a hilltop at sunset, digital illustration style"

Don't: "Digital illustration style, sunset lighting, a small brown dog on a hilltop"

Natural Language Over Tag Soup

NB2 is a thinking model that understands intent, physics, and composition. Write prompts as if briefing a human illustrator. Avoid comma-separated tag lists.

JSON-Structured Prompts

Community benchmarks show JSON-structured prompts improve consistency by ~25% over plain text. Structure with labeled fields:

{
  "subject": "A small brown dog named Ollie wearing a red cape",
  "pose": "standing heroically, looking to the right",
  "environment": "grassy hilltop at golden hour",
  "style": "soft watercolor children's book illustration",
  "lighting": "warm golden backlight with soft shadows",
  "camera": "low angle, medium shot"
}

Regional Prompts

For different content in different parts of an image, divide the canvas into regions with a prompt for each. Useful for complex scenes with foreground characters and detailed backgrounds.

2. Reference Image Strategy

Multi-Reference Conditioning

NB2 supports up to 10 reference images simultaneously. When using uploaded images, clearly define each role: "Use Image A for the character's pose," "Use Image B for the art style," "Use Image C for the background environment."

Character Sheets (360-Degree)

The most reliable consistency method:

  • Generate 2-3 images of your character within a single frame showing multiple angles (front, left, right, back)
  • Use this sheet as reference input for all subsequent generations
  • The character sheet becomes the "ingredient" that maintains identity across scenes

Image Search Grounding (NB2 Exclusive)

NB2 can retrieve real-world reference images via Google Search during generation. This is unique to NB2 (not available in Pro) and helps ground outputs in real-world visual context.

3. Temperature and Parameter Optimization

Temperature Settings

  • Range: 0 to 2 (default: 1)
  • Low (0-0.5): More deterministic, predictable. Best for character sheets and batch generation.
  • Medium (0.8-1.2): Balanced creativity and adherence. Good default for most illustration work.
  • High (1.5-2.0): More diverse, surprising. Good for exploration, less predictable.

Thinking Mode (NB2 Exclusive)

  • Minimal: Fastest generation, good for rapid iteration
  • High: Best quality, slower. Use for final/hero images
  • Dynamic: Model decides the appropriate level per request

Quality Keywords and Seed Control

Include terms like "HD," "4K," or "HDR" for better clarity. Start every prompt with a fixed seed for reproducibility. Combine with weighted descriptors like "(same face:1.3)" to enforce consistency.
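
A minimal config sketch via the google-genai SDK. The NB2 model id is an assumption, and the temperature/seed values mirror the recommendations above:

from google import genai
from google.genai import types

client = genai.Client()
response = client.models.generate_content(
    model="gemini-3.1-flash-image",  # assumed NB2 model id
    contents=["A small brown dog named Ollie wearing a red cape, ..."],  # subject first
    config=types.GenerateContentConfig(
        temperature=0.3,  # consistency-critical batch work (0-0.5 band)
        seed=1234,        # fixed seed for reproducibility
    ),
)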

4. Character Consistency Approaches

  • Naming characters: Assign distinct names; model tracks named entities better across multi-turn sessions
  • Consistency anchors: "same character," "maintain facial features," "preserve proportions," "consistent with previous image"
  • Multi-turn continuity: Remind NB2 to "use same character identity as the previous image"
  • Style locking: Prompt "same art style" across sequences. Define style guide once, reference in every prompt
  • Iterative editing: If a result is 80% correct, edit rather than regenerate. NB2 excels at conversational edits
  • Negative prompts: Combine positive traits with negatives to prevent common failure modes

5. NB2 vs Nano Banana Pro

Feature           | Nano Banana 2                          | Nano Banana Pro
Architecture      | Gemini 3.1 Flash                       | Gemini 3 Pro
Speed             | 4-6 sec at 1K                          | 10-20 sec at 1K
Quality           | ~95% of Pro                            | Best absolute quality
Textures/Lighting | Very good                              | Superior
Unique Features   | Image Search Grounding, Thinking Mode  | N/A
Cost              | Significantly cheaper                  | Higher per image
Best For          | Speed, batch generation, stylized work | Precision, identity preservation

6. NB2 vs GPT-Image-2

Capability         | NB2                 | GPT-Image-2
Photorealism       | Superior            | Strong
Text/Typography    | Good                | Superior (best in class)
Anime/Illustration | Superior            | Good
Structural Control | Moderate            | Superior
Speed              | Faster              | Slower
Batch Consistency  | Good                | Superior
Cinematic Lighting | Superior            | Good
Cost (API)         | $60/M output tokens | $30/M output tokens

Choose NB2 for: stylized illustration, photorealism, cinematic scenes, rapid iteration, anime. Choose GPT-Image-2 for: text-heavy designs, precise layouts, batch production, structural accuracy.

7. Community Tips

  • Prompt like a Creative Director — describe the scene, mood, intent
  • Use the 80/20 edit rule — if mostly right, edit rather than regenerate
  • Build character sheets first — invest upfront, save time across the project
  • Test with 1-2 images before scaling
  • Use specific color palettes: "ochre, sap green, dusty blue" beats "warm earth tones"
  • Limit core colors to 3-5 plus occasional accents
  • Lock your style guide in writing — define and reuse
RESEARCH
Multi-Model Pipeline (NB2 + GPT-Image-2)
Complete
Research Finding
Skill Status
Complementary Strengths
GPT-Image-2 excels at structural accuracy, text rendering, batch consistency. NB2 excels at stylized illustration, cinematic lighting, rapid iteration. Combining produces results neither achieves alone.
✓ In Skill Model routing table encoded.
NB2 supports up to 10 references, Image Search Grounding, and conversational editing. GPT-Image-2 has better identity preservation and strict layout adherence.
✓ In Skill Per-model capability rules.
Pipeline Architectures
Architecture A: GPT-Image-2 for composition → NB2 for style. Best for scenes with precise spatial requirements.
To Explore Test on one spread.
Architecture B: NB2 for rapid exploration → GPT-Image-2 for final production. Fast ideation with polished finals.
To Explore Test on one spread.
Architecture C: Parallel generation with best-of selection. Both models on same prompt, pick best per scene.
Not Started
Architecture D: Layered composition. NB2 for backgrounds (lighting), GPT-Image-2 for characters (accuracy), then composite.
Not Started
Ollie & Dot Recommendation
4-phase pipeline: (1) Character Design via GPT-Image-2 for reliable sheets, (2) Scene Exploration via NB2 at 3-5x speed, (3) Final Production routed by scene type, (4) Consistency Pass using NB2 conversational editing.
✓ In Skill 4-phase pipeline encoded.
Cost Optimization
Cheap models for iteration (NB2), expensive models for delivery (GPT-Image-2 where strengths matter). Never iterate at scale on the expensive model. Validate with 1-2 test images first. ComfyUI or custom Python script for orchestration.
✓ In Skill Cost-aware routing rules.
Full Report — 2026-05-06

1. Why Multi-Model?

Single-model approaches force compromise. Every model has blind spots. The 2026 best practice is to route each task to the model that handles it best.

For children's book illustration: GPT-Image-2 excels at structural accuracy, text rendering, and batch consistency. NB2 excels at stylized illustration, cinematic lighting, and rapid iteration. Combining them produces results neither can achieve alone.

2. Model Strengths Comparison

Capability             | NB2              | GPT-Image-2   | Best Choice
Photorealism           | Excellent        | Very Good     | NB2
Stylized Illustration  | Excellent        | Good          | NB2
Text/Typography        | Good             | Best in class | GPT-Image-2
Precise Layout         | Moderate         | Excellent     | GPT-Image-2
Cinematic Lighting     | Superior         | Good          | NB2
Batch Consistency      | Good             | Excellent     | GPT-Image-2
Character Identity     | Good (improving) | Strong        | GPT-Image-2
Speed                  | 4-6 sec/image    | Slower        | NB2
Cost                   | $60/M output     | $30/M output  | GPT-Image-2
Conversational Editing | Excellent        | Good          | NB2
Multi-Reference        | Up to 10         | Limited       | NB2

3. Pipeline Architectures

Architecture A: GPT-Image-2 for Composition, NB2 for Style

Flow: GPT-Image-2 (structure) → NB2 (style transfer)

  • GPT-Image-2 generates the base composition with precise character placement and layout
  • NB2 receives the output as reference and applies stylistic treatment: watercolor textures, warm lighting, soft edges
  • Best for scenes with specific spatial requirements

Architecture B: NB2 for Exploration, GPT-Image-2 for Production

Flow: NB2 (ideation) → GPT-Image-2 (final production)

  • NB2 generates rapid concept variations at 3-5x the speed
  • Select best concepts from NB2 explorations
  • GPT-Image-2 produces final version with batch consistency
  • Best when still figuring out composition/mood

Architecture C: Parallel Generation with Best-Of Selection

Both models generate independently on the same prompt. Compare and select the best per scene. Best when you have prompt budget and want maximum quality.

Architecture D: Layered Composition

  • NB2 generates backgrounds/environments (superior cinematic lighting)
  • GPT-Image-2 generates characters (better identity preservation)
  • Composite layers, then final pass for unified style
  • Best for complex scenes where both background quality and character accuracy are critical

4. Ollie & Dot Recommended Pipeline

Phase 1 — Character Design (GPT-Image-2): Generate definitive character sheets. Lock proportions, colors, features. GPT-Image-2's batch consistency ensures reliable reference sheets.

Phase 2 — Scene Exploration (NB2): Rapid scene iteration with character sheet references. Explore compositions and lighting at 3-5x speed. Multi-reference conditioning with character + environment + style references. Generate 4-6 variants per scene.

Phase 3 — Final Production (Model-Dependent): Text scenes → GPT-Image-2. Atmospheric/stylistic scenes → NB2. Precise character placement → GPT-Image-2. Wide environmental scenes → NB2.

Phase 4 — Consistency Pass: Review all finals. Use NB2 conversational editing for minor adjustments. Use GPT-Image-2 for structural corrections.

5. Cost Optimization

  • Cheap models for iteration (NB2 for rapid exploration)
  • Expensive models for delivery (GPT-Image-2 for finals where its strengths matter)
  • Don't iterate at scale on the expensive model — validate with 1-2 test images first

6. Pipeline Orchestration Tools

  • ComfyUI: Open-source visual workflow builder. Chain multiple models, apply controls, automate batches.
  • MindStudio: Visual builder for chaining 200+ models. Manages credit costs and routing.
  • Custom Python pipeline: Maximum control. Call GPT-Image-2 API for base, pass to NB2 via Gemini API as reference, automated QC via Gemini text model. Sketched below.
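
Expanding the custom-pipeline option, a sketch of Architecture A. The gpt_image_2_generate, nb2_edit, and qc_score functions are hypothetical wrappers around the two APIs and the vision-QC call described in this report, not real library calls:

def produce_spread(prompt: str, refs: list[bytes]) -> bytes:
    # Structure first, style second (Architecture A from section 3).
    base = gpt_image_2_generate(prompt, size="1536x1024")  # layout + character pass
    styled = nb2_edit(
        image=base,                                        # base as reference
        instruction="Apply the approved watercolor/collage style; "
                    "preserve composition and character placement.",
        references=refs,                                   # character sheet + style refs
    )
    if qc_score(styled, refs) < 7:                         # threshold from the QC research
        styled = nb2_edit(image=styled,
                          instruction="Fix the flagged inconsistencies only.",
                          references=refs)
    return styled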

7. Character Consistency Across Models

  • Shared character sheet as anchor for both models
  • Detailed character description block prepended to every prompt
  • Style guide document with palette, lighting, texture rules
  • Post-generation review via Gemini text model to flag inconsistencies
  • One model for character, one for environment to avoid cross-model drift

8. Key Takeaways

  • Don't pick one model — route by task
  • Use NB2 for speed/exploration, GPT-Image-2 for precision/production
  • Character sheets are the bridge between models
  • NB2's conversational editing is the best finisher
  • Automate the pipeline with ComfyUI or custom scripts
  • Always test with 1-2 images before scaling
RESEARCH
Leonardo AI Post-Processing
Complete
Research Finding
Skill Status
Upscaling
Pro Upscaler scales up to 105MP. Purpose-built for AI-generated images — understands synthetic grain, cleans rather than amplifies. At 2x (3072×2048) results are highly usable. 12×8 inch spread at 300 DPI only needs ~8.6MP.
To Explore Test on GPT-Image-2 outputs.
Maintains structural integrity under intense magnification. Standard upscalers (Real-ESRGAN) may amplify AI-generated artifacts that Leonardo handles correctly.
Not Started
Outpainting (3:2 → 2:1)
Canvas mode outpainting: extend 1536×1024 (3:2) to 2048×1024 (2:1) by adding ~256px per side (512px total). 60/40 overlap ratio within tool's comfort zone. Best for extending backgrounds, not adding character elements.
To Explore Test aspect ratio conversion.
API / Pricing
Developer API with pay-as-you-go. Artisan plan ($30/mo, 25K tokens) handles ~500-1,000 upscales. Upscaling costs ~2x generation tokens. $5 free credit for new API accounts. Generating low-res and upscaling only the selected images saves 40-50% of tokens.
Not Started
vs Other Upscalers
Leonardo Pro: Best for AI images, API available, up to 105MP. Real-ESRGAN: Free, open-source fallback, doesn't understand AI patterns. Topaz: Desktop only, no API. Magnific: Over-hallucination risk, may add unwanted detail to illustrations.
✓ In Skill Leonardo as primary, ESRGAN as fallback.
Full Report — 2026-05-06

1. Upscaling GPT-Image-2 Outputs

Pro Upscaler Overview

Leonardo AI's Pro Upscaler is purpose-built for AI-generated images, capable of scaling up to 105 megapixels. Key differentiator: while most upscalers are trained on real photographs, Leonardo's understands synthetic grain and AI textures, cleaning them rather than amplifying them.

Performance

  • 1536×1024 to print resolution: At 2x (3072×2048), results are highly usable. At 4x (6144×4096), minor artifacts but still outperforms traditional upscalers.
  • AI-generated image handling: Specifically handles noise patterns and stylistic grain from AI models. Standard upscalers misread these as defects.
  • Structural integrity: Maintained under intense magnification — important for character consistency.

Verdict: Strong candidate. 105MP ceiling far exceeds print needs (12×8 inch spread at 300 DPI = ~8.6MP).

2. Outpainting (3:2 to 2:1 Aspect Ratio)

Leonardo's Canvas mode includes Inpainting/Outpainting tools for extending images beyond original boundaries.

How It Works

  • Place generation box overlapping ~60% existing image, ~40% empty canvas
  • Keep original prompt or leave blank
  • AI extends the scene into empty space

Suitability for 3:2 to 2:1

Converting 1536×1024 (3:2) to 2048×1024 (2:1) requires extending the width by 512px total, ~256px per side — moderate and within the tool's sweet spot. Works best when extending backgrounds/environments rather than adding character elements.

3. Other Post-Processing

  • Alchemy v4: Proprietary pipeline with Hyper-Realism and Abstract Concept modes. Abstract mode more relevant for Ollie & Dot.
  • Background removal: Available as post-processing action
  • Image-to-image: Style refinement passes
  • Canvas editor: Manual touch-up and detail refinement

4. API and Pricing

Plan       | Price     | Tokens/Month | API Access
Free       | $0        | 150/day      | No
Apprentice | $12/mo    | 8,500        | Limited
Artisan    | $30/mo    | 25,000       | Yes
Maestro    | $48-60/mo | 60,000       | Full + priority

Token costs: standard generation runs 2-25 tokens; upscaling costs ~2x generation, i.e. roughly 25-50 tokens for a typical upscale, which squares with the Artisan plan's 25,000 tokens covering ~500-1,000 upscale operations. Cost optimization: generate low-res, only upscale selections (saves 40-50%). $5 free API credit for testing.

5. Comparison with Other Upscalers

Tool           | Best For              | API         | Pricing       | AI-Image Aware
Leonardo Pro   | AI-generated images   | Yes         | Token-based   | Yes (trained on synthetic)
Real-ESRGAN    | General-purpose, free | Self-hosted | Free          | No
Topaz Photo AI | Photographers         | No          | $199 one-time | No
Magnific AI    | Creative upscaling    | Yes         | Subscription  | Partial (over-hallucination risk)

Recommendation: Leonardo Pro is the strongest candidate: purpose-built for AI images, API available, token-based pricing, outpainting and upscaling in one platform. Runner-up: Real-ESRGAN as free fallback. Avoid: Magnific — tendency to hallucinate new details could compromise illustration consistency.

RESEARCH
Midjourney V8.1
Complete
Research Finding
Skill Status
Latest Model
V8.1 Alpha (April 14, 2026): native 2K resolution, ~5x faster than V7, HD now default, physically grounded lighting, much better text rendering. Returns to V7's aesthetic while keeping V8's technical improvements.
✓ In Skill V8.1 capabilities noted.
Artifacts/morphing largely resolved from V7 onward. V8.1 fixed V8.0's over-polished look and unstable moodboards. Some legacy features (in-painting) still missing.
Not Started
Style References (--sref)
--sref extracts aesthetic qualities from reference images: color palette, lighting, texture, composition. Versioned system (--sv 6 default). "Super-stable" moodboards and srefs in V8.1 — major V8.0 fix. Style Creator tool generates unique reusable style codes.
✓ In Skill Vision-extracted profiles as alternative.
Style codes are model-version dependent: V6 codes may differ in V8. Moodboards allow persistent style from curated image collections. Backward compatible with V7 profiles.
To Explore Generate style targets for extraction.
Artifacts Status
Hands/bodies/objects dramatically better from V7+. V8.1 corrects V8.0's over-processed look. Age drift issues from V8.0 appear improved. Reduced ability for abstract/surreal compositions vs V7.
Not Started
API Access
No public API. Enterprise-only via application. Unofficial APIs (PiAPI, APIFRAME, ImagineAPI) violate ToS and risk bans. Cannot be reliably integrated into automated pipelines.
✓ In Skill Manual style tool only, not production.
Cross-Model Style Transfer
Generate style exemplars in Midjourney → use vision model (GPT-4o, Gemini) to extract structured style description → convert to prompt components for NB2/GPT-Image-2. One-time investment, reusable indefinitely. Basic plan ($10/mo) sufficient.
To Explore Test style extraction workflow.
Full Report — 2026-05-06

1. Latest Model: V8.1 Alpha

Version Timeline

Version    | Release        | Status
V6         | Late 2024      | Legacy
V7         | 2025           | Stable, was default
V8.0 Alpha | March 17, 2026 | Superseded
V8.1 Alpha | April 14, 2026 | Current latest

V7 Improvements Over V6

  • Much smarter text prompt interpretation
  • Significantly better coherence for hands, bodies, objects
  • Draft Mode: half cost, 10x speed for iteration
  • Omni Reference replaced --cref with a dedicated reference tab

V8 Improvements Over V7

  • ~5x faster rendering
  • Native 2K resolution with --hd flag
  • Sharper photorealism with physically grounded lighting
  • Much better text rendering in images
  • More accurate handling of complex prompts

V8.1 Improvements Over V8.0

  • HD now default (native 2K without --hd flag)
  • HD mode 3x faster and 3x cheaper
  • Standard resolution 50% faster, 25% cheaper
  • Returns to V7's consistent aesthetic, fixing V8.0's over-polished look
  • Super-stable moodboards and srefs — the major V8.0 weak point fixed

2. Artifact and Morphing Status

Hands/bodies/objects dramatically better from V7 onward. V8 further improves physical coherence. V8.1 corrects V8.0's over-processed default look. Some legacy features still missing (image prompting, in-painting). Reduced ability for abstract/surreal compositions compared to V7.

3. Style References (--sref)

How Style References Work

The --sref parameter extracts aesthetic qualities from reference images: color palette, lighting, texture treatment, composition tendencies. Unlike regular image prompts, it focuses on style elements rather than content.

SREF System

  • Versioned system with 6 sub-versions via --sv
  • --sv 6: Default (latest)
  • V8.1 srefs are "super-stable" — major V8.0 fix
  • Backward compatible with V7 profiles
  • Style codes are model-version dependent

Style Creator Tool

New in V8: shape aesthetics visually. Enter a prompt, get aesthetic options, select preferences, receive a unique reusable style code. Web-only (midjourney.com), not Discord.

Moodboards

Curate collections of images to define persistent style. Upload images, name and save for reuse, apply across generations.

4. Cross-Model Style Transfer

Recommended workflow:

  • Generate style exemplars in Midjourney using crafted prompts + style codes
  • Feed those images as references to NB2 or GPT-Image-2
  • Or: use a vision model (GPT-4o, Gemini) to analyze the Midjourney outputs and produce structured style descriptions
  • Convert descriptions into prompt components for other models
  • One-time investment, reusable indefinitely

5. API Access

No public API. Enterprise dashboard users only, must apply. Unofficial APIs (PiAPI, APIFRAME, ImagineAPI) violate ToS and risk account bans. Midjourney cannot be reliably integrated into an automated pipeline.

Best used as a manual tool for style exploration and reference generation, not batch production.

6. Strategic Recommendation

Use For (Manual, Occasional)

  • Style exploration via Style Creator and moodboards
  • Reference image generation (high-quality style exemplars)
  • Style extraction via vision model analysis
  • Quality benchmarking against GPT-Image-2 outputs

Do NOT Use For

  • Batch production (no reliable API)
  • Automated workflows (too fragile)
  • Character consistency at scale (better via GPT-Image-2)

7. Pricing

Plan     | Price  | Images/Month
Basic    | $10/mo | ~200
Standard | $30/mo | ~900
Pro      | $60/mo | ~1,800 fast

For our use case (style references, not production), Basic at $10/mo would suffice.

8. Summary Scorecard

Capability                          | Score
Image quality (V8.1)                | 9/10
Style consistency (sref/moodboards) | 8/10
Artifact issues                     | 8/10
API for pipeline                    | 2/10
Cross-model style transfer          | 7/10
Cost efficiency                     | 8/10
Overall pipeline fit                | 6/10 (style reference tool only)
RESEARCH
Vision-Extracted Style Profiles
Complete
Research Finding
Skill Status
Extraction Best Practices
Optimal: 3-5 images, diverse in SCENE but consistent in STYLE. Forces vision model to identify what stays the same (style) vs. what changes (content). 10-category extraction prompt covering rendering, palette, lines, lighting, texture, character rendering, backgrounds, composition, negatives, one-sentence test.
✓ In Skill Extraction prompt template built.
JSON style guides adopted by 70-78% of professional AI creators. Structured prompts improve task accuracy 60-80% for complex scenes. Eliminates ambiguity, machine-readable, reproducible.
✓ In Skill JSON style profile in use.
Replicating --sref
No direct equivalent on NB2/GPT-Image-2. Practical approximations: (1) Reference images + explicit style instructions, (2) IP-Adapter via ComfyUI (closest open-source equivalent), (3) exactly.ai (managed LoRA), (4) Custom LoRA fine-tuning. Only option 1 works with our models natively.
✓ In Skill Reference + explicit instructions used.
Academic Techniques
StyleTokenizer (ECCV 2024): style features aligned with text embedding space. StyleBrush: dual-branch style transfer from single image. InstantStyle: targeted style injection into specific attention layers. IP-Adapter: CLIP-based image conditioning. Z-STAR: training-free attention reweighting. None directly usable with NB2/GPT-Image-2 APIs — all require Stable Diffusion.
To Explore Post-processing layer potential.
Style Layering
Generate base with GPT-Image-2 (composition, character) → style transfer layer for aesthetic. Four options evaluated: (A) Neural Style Transfer (not recommended, looks processed), (B) IP-Adapter via ComfyUI (recommended investigation), (C) StyleBrush (promising if available), (D) Img2Img with low denoising (quick and dirty). Current recommendation: don't add post-processing yet — focus on native generation quality first.
Not Started Fallback option if native fails.
Why It Failed Before
6 diagnosed failure modes: (1) descriptions too generic, (2) not enough reference images, (3) correct profile but wrong prompt placement, (4) vision model described content not style, (5) profile misses "ineffable" quality, (6) model cannot produce the style. Each has specific fixes documented.
✓ In Skill Anti-patterns encoded.
Practical Workflow
4-step pipeline: (1) Style Extraction with 3-5 gold standards, ~30 min, (2) Prompt Template Creation with 3-block structure, ~15 min, (3) Batch Generation with 3-5 candidates per spread, ~5-10 min each, (4) Iterative Refinement as needed. Full 24-page book estimated $3.60-$12.00 for 36-60 images.
✓ In Skill 4-step pipeline encoded.
Full Report — 2026-05-06

1. Vision-Extracted Style Profiles: Best Practices

What Is a Vision-Extracted Style Profile?

A structured text description of an image's visual style, produced by sending approved reference images to a vision model (Gemini, Claude, GPT-4o) and asking it to analyze and catalog style attributes. The output is a reusable text block for generation prompts.
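
As a minimal sketch, extraction can be one vision call per reference image. This assumes the OpenAI Python SDK with gpt-4o; any of the vision models above would work the same way:

import base64
from openai import OpenAI

client = OpenAI()

def extract_style(image_path: str, extraction_prompt: str) -> str:
    # Encode the reference image for the vision request
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": extraction_prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content  # one per-image style profile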

How Many Images to Analyze

Count | Pros | Cons
1 image | Simple, fast | Overfits to one scene's lighting/composition
2 images | Better than one | Still too few to separate scene from style
3-5 images | Sweet spot: enough diversity to find common thread | Recommended
6-10 images | Very thorough | Diminishing returns, may average too aggressively
10+ images | Comprehensive | Descriptions become generic

Key principle: Images should be diverse in SCENE but consistent in STYLE. Forces the vision model to identify style (what stays the same) vs. content (what changes).

The Extraction Prompt

A 10-category extraction covering: (1) Rendering Technique, (2) Color Palette with hex codes, (3) Line Work, (4) Lighting and Shadow with contrast ratios, (5) Texture and Surface, (6) Character Rendering proportions, (7) Background Treatment, (8) Composition and Framing, (9) What This Style Is NOT, (10) One-Sentence Test. Forces specificity, anchors to observable details, includes negatives.
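
A condensed sketch of that prompt; the wording here is illustrative, not the skill's full template:

Analyze ONLY the visual style of this image. Do NOT describe what is
depicted; describe HOW it is depicted. Cover:
1. Rendering technique (medium, digital vs. traditional, named techniques)
2. Color palette (dominant colors as hex codes, saturation, temperature)
3. Line work (weight, consistency, outline treatment)
4. Lighting and shadow (direction, softness, approximate contrast ratio)
5. Texture and surface (grain, paper, brush texture, where it appears)
6. Character rendering (proportions, facial simplification, edge treatment)
7. Background treatment (detail level vs. foreground, atmosphere)
8. Composition and framing (camera height, margins, focal placement)
9. What this style is NOT (nearby styles to exclude)
10. One-sentence test: a single sentence capturing the overall feel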

Structuring Output

  • Run extraction on each of 3-5 gold standard images separately
  • Per-image profiles: keep each one
  • Intersection profile: attributes in ALL profiles = core style
  • Conflict resolution: note ranges, decide canonical values
  • Final canonical profile: merged, validated, human-approved
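
A minimal sketch of the intersection step, assuming each per-image profile has already been normalized into a flat dict of attribute → value:

def intersect_profiles(profiles: list[dict]) -> tuple[dict, dict]:
    """Split attributes into core style (agreed everywhere) and conflicts."""
    keys = set(profiles[0])
    for p in profiles[1:]:
        keys &= set(p)                 # attribute must appear in ALL profiles
    core, conflicts = {}, {}
    for key in keys:
        values = [p[key] for p in profiles]
        if all(v == values[0] for v in values):
            core[key] = values[0]      # agreed value goes into canonical profile
        else:
            conflicts[key] = values    # note the range, decide canonical value
    return core, conflicts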

JSON Style Guides

The 2025-2026 standard: 70-78% of professional AI creators use JSON-formatted style guides. Structured prompts improve accuracy by 60-80% for complex scenes. Eliminates ambiguity, machine-readable, reproducible. GPT-Image-2 responds especially well to structured input.
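
An abbreviated example of the shape such a profile can take; the values below are placeholders, not our canonical profile:

{
  "rendering": "flat digital gouache with soft paper grain",
  "palette": {"dominant": ["#1B3A5C", "#F4A940"], "temperature": "warm"},
  "line_work": "no outlines; shapes separated by value contrast",
  "lighting": "single soft key light, low contrast (about 2:1)",
  "character_rendering": "rounded, two-heads-tall proportions",
  "negatives": ["photorealism", "3D render", "texture on every surface"],
  "one_sentence_test": "cozy picture-book flatness with a warm evening glow"
}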

2. Replicating Midjourney --sref

What --sref Actually Does

Midjourney uses a dedicated internal style encoder (likely CLIP-based) to extract style features from a reference image and inject them as a separate conditioning signal. Style is kept separate from content.

No Direct Equivalent Exists

Feature | Midjourney --sref | NB2/GPT-Image-2
Style encoding | Dedicated encoder, separate channel | General vision encoder, mixed with content
Style vs. content separation | Explicit | Implicit (must be prompted)
Reproducibility | Deterministic codes | No equivalent
Weight control | --sw 0-1000 | No direct param

Practical Approximations

  • Approach 1: Reference images + explicit style instructions (available now, works with our models)
  • Approach 2: IP-Adapter via ComfyUI (closest open-source, requires Stable Diffusion)
  • Approach 3: exactly.ai (managed LoRA, tied to their platform)
  • Approach 4: Custom LoRA fine-tuning (SD only)
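
A sketch of Approach 1 as an API call. This assumes GPT-Image-2 is reachable through the same images.edit endpoint shape as gpt-image-1 in the OpenAI SDK; the model name and reference-image handling are assumptions to verify:

from openai import OpenAI

client = OpenAI()

# Reference images carry the style; the prompt makes the instruction explicit.
refs = [open(p, "rb") for p in ("style_ref_1.png", "style_ref_2.png")]
result = client.images.edit(
    model="gpt-image-2",              # assumed model name
    image=refs,                       # style exemplars as reference inputs
    prompt=(
        "[STYLE] Match the visual style of the reference images exactly: "
        "flat digital gouache, warm palette, soft paper grain. "
        "[SCENE] Ollie and Dot drift above a kelp forest at dusk."
    ),
    size="1536x1024",
    quality="high",
)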

3. Academic Techniques

Technique | Can Use with NB2? | Can Use with GPT-Image-2? | Post-Processing? | Open Source?
StyleTokenizer (ECCV 2024) | No | No | Potentially | Yes
StyleBrush (2024) | No | No | Yes (promising) | Yes
InstantStyle (2024) | No | No | Yes (via ComfyUI) | Yes
IP-Adapter (2023-2025) | No | No | Yes (via ComfyUI) | Yes
Z-STAR (CVPR 2024) | No | No | Potentially | Yes

Bottom line: None can be plugged directly into NB2 or GPT-Image-2 APIs. All require Stable Diffusion. However, they CAN be used as post-processing layers, and their design principles (separate style encoding, targeted injection, JSON prompts) inform better prompt writing.

4. Style Layering as Post-Processing

Generate a base image with GPT-Image-2 (composition, character consistency), then run it through a style transfer layer for the specific aesthetic.

Options Evaluated

  • Option A: Neural Style Transfer — Not recommended. Produces "painterly filter" effect, looks obviously processed.
  • Option B: IP-Adapter via ComfyUI — Recommended investigation. Best balance of quality, control, accessibility. Uses ControlNet for composition preservation.
  • Option C: StyleBrush — Promising if accessible. Adjustable style strength, better separation than IP-Adapter.
  • Option D: Img2Img with low denoising — Quick and dirty. Good for experimentation, risky for production.

Current recommendation: Don't add post-processing yet. Focus on making GPT-Image-2/NB2 produce the right style natively. Post-processing is the fallback. If needed later, start with Option B.
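
For completeness, a minimal sketch of Option D (Img2Img at low denoising strength) with the diffusers library; the checkpoint, prompt, and strength value are illustrative:

import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

base = Image.open("base_spread.png").convert("RGB")
styled = pipe(
    prompt="children's book illustration, flat gouache, warm palette",
    image=base,
    strength=0.3,          # low denoising: restyle surfaces, keep composition
    guidance_scale=7.0,
).images[0]
styled.save("styled_spread.png")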

5. Why It Hasn't Worked Before

Problem 1: Descriptions Too Generic

"Colorful watercolor children's book illustration with warm tones" describes half of all children's books. Fix: Use the detailed extraction prompt with hex codes, ratios, named techniques.

Problem 2: Not Enough Reference Images

Single image = profile captures scene-specific attributes instead of style attributes. Fix: Use 3-5 images with different scenes but same style. Intersection = style.

Problem 3: Wrong Prompt Placement

Style buried in middle/end of prompt. Fix: Style FIRST, clear section markers, repeat critical elements at end.

Problem 4: Content Described, Not Style

Vision model defaults to "what is in the image." Fix: Explicitly say "Do NOT describe what is depicted. ONLY describe HOW it is depicted."

Problem 5: Missing the "Ineffable" Quality

Measurable attributes correct but images still look "off." Fix: Always pair text with reference images. Use one-sentence descriptor for emergent quality. Use negative constraints. Iterate.

Problem 6: Model Cannot Produce the Style

Model's style distribution bias. Fix: Try the other model, use fine-tuned SD, apply post-processing, or adjust target style.

6. Practical Workflow

STEP 1: STYLE EXTRACTION (One-time, ~30 min)
- User provides 3-5 gold standard images
- Run extraction prompt on each via Gemini/Claude
- Aggregate into canonical JSON style profile
- User reviews and approves

STEP 2: PROMPT TEMPLATE CREATION (~15 min)
- Build three-block template: [STYLE] + [CHARACTER] + [SCENE]
- Test with 2-3 generations
- Adjust style profile based on test results

STEP 3: BATCH GENERATION (Per illustration, ~5-10 min)
- Fill in [SCENE] block
- Select 2-3 reference images
- Generate 3-5 candidates
- Automated QC: vision model scores consistency (see the sketch after Step 4)
- Present top 2-3 to user

STEP 4: ITERATIVE REFINEMENT (As needed)
- Fix one attribute at a time
- Re-run with updated profile
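
A sketch of the Step 3 automated QC pass, reusing the vision setup from the extraction sketch above; the rubric wording is illustrative:

import base64
import json
from openai import OpenAI

client = OpenAI()

def qc_score(candidate_path: str, profile_json: str) -> dict:
    # Ask a vision model to grade one candidate against the style profile
    with open(candidate_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Score this image 1-10 for adherence to the style profile "
                    'below. Return JSON {"score": n, "issues": [...]}.\n'
                    + profile_json)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)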

Cost Estimate

Step | Images | Est. Cost
Style extraction (vision, text only) | 0 | $0
Template validation | 3-5 | $0.30-$1.00
Per illustration (3-5 candidates) | 3-5 | $0.30-$1.00
Automated QC (vision, text only) | 0 | $0
Full 24-page book (12 spreads) | 36-60 | $3.60-$12.00

Sanity check: 12 spreads × 3-5 candidates = 36-60 images, and 12 × ($0.30-$1.00 per spread) = $3.60-$12.00, so the book row is simply the per-illustration row scaled by spread count.

7. Key Principles

  • Start with extraction, not generation
  • Reference images are not optional — always include 2-3
  • The profile is a living document — refine after test generations
  • Separate style from content in prompts with clear section markers
  • JSON > prose for style descriptions
  • Iterate on one attribute at a time

Next Steps
  1. Style Profile Builder — Run vision extraction on 5-8 approved illustrations. User provides: favorite images. Output: structured style profile text block.
  2. Character Consistency Testing — Generate Ollie turnaround candidates. Test reference image counts (1 vs 3 vs 5). Optimize Dot text description. User provides: approved Ollie reference images.
  3. NB2 Prompt Engineering — Research and build NB2-specific prompt skill (separate from GPT-Image-2).
  4. Post-Processing Pipeline — Research Leonardo AI for upscaling/outpainting, Midjourney for style generation.
  5. Multi-Model Workflow — Test complementary use of GPT-Image-2 + NB2 (composition vs style strengths).