Research & Progress
GPT Image 2 rewards structure over adjectives. Words like "stunning" or "beautiful" do very little. What works is organized, specific description in a deliberate order: materials and format first, then scene, characters, environment, and rendering.
The model allocates more attention to earlier tokens in the prompt. Putting materials and format first tells the model "this is a mixed-media collage" before it starts imagining the scene, which fundamentally changes how it renders everything that follows. Burying the style at the end means the model may have already committed to a photorealistic rendering before it sees "watercolor."
GPT Image 2 follows long prompts more reliably than most other models. Our best results used 300-500 word prompts. If you're getting generic results, the answer is usually more specificity, not less. Be explicit about every element you care about. An example opening that leads with materials and format:
Dry torn paper coral with visible fiber texture exists deep
underwater alongside photographic reef life.
Gold foil metallic stipple shimmers on surfaces submerged
in real ocean water.
Gouache painted characters swim through photographic
Hawaiian reef. Heavy watercolor paper base.
Full bleed 3:2 children's book illustration. Mixed media
collage blending with underwater photography.
SCENE: A dramatic steep Hawaiian reef drop-off descending
into the deep...
OLLIE: A cartoon baby octopus. [full character anchor with
colors, proportions, anti-patterns]...
DOT: Solid natural Tahitian pearl. [full character anchor
with eye shape, anti-patterns]...
ENVIRONMENT: Rich shallow coral reef dropping into deep...
RENDERING: Warm golden god-ray light transitioning to deep
navy. [hex palette]. Full bleed 3:2 aspect ratio.
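For repeatability we assemble these blocks in code rather than hand-editing prompts. A minimal sketch (the block text is placeholder; the ordering — format first, rendering last — is the point):

```python
# Assemble a GPT Image 2 prompt in the order described above:
# materials/format first, then scene, characters, environment, rendering.
BLOCKS = [
    ("FORMAT", "Full bleed 3:2 children's book illustration. Mixed media "
               "collage blending with underwater photography."),
    ("SCENE", "A dramatic steep Hawaiian reef drop-off descending into the deep."),
    ("OLLIE", "A cartoon baby octopus. [full character anchor]"),
    ("DOT", "Solid natural Tahitian pearl. [full character anchor]"),
    ("ENVIRONMENT", "Rich shallow coral reef dropping into deep water."),
    ("RENDERING", "Warm golden god-ray light transitioning to deep navy. "
                  "Full bleed 3:2 aspect ratio."),
]

def build_prompt(blocks=BLOCKS):
    # Leading with format/materials commits the model to the medium
    # before it starts imagining the scene.
    return "\n\n".join(f"{label}: {text}" for label, text in blocks)
```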
GPT Image 2 achieves ~99% accuracy on text (vs ~60% for DALL-E 3). Place exact text in quotation marks or ALL CAPS, specify font style and placement explicitly. Best at 1-5 words per text element. Not relevant for our book illustrations (we avoid in-image text).
Shortly after GPT Image 2 launched on April 21, 2026, users reported a consistent visual artifact: generated images were layered with persistent tiling textures and grime artifacts:
Primary cause: Steganographic Watermarking. OpenAI embeds a dual-layer watermark into every generated image: (1) C2PA metadata and (2) an imperceptible pixel-level watermark with no public detector yet. The pixel watermark is baked into the generation process itself, creating visible texture patterns especially on smooth, uniform surfaces. Downstream workflows like upscaling or compositing can amplify these artifacts.
Secondary cause: Context-dependent noise accumulation. The model propagates noise from previous generations within the same conversation. Artifacts carry over and compound with each successive image in a chat thread. OpenAI fixed a noise amplification bug, but some baseline texture persists.
Tertiary cause: Training data contamination. GPT Image 2 has been observed generating images containing visible Gemini branding and other AI-model artifacts — evidence that AI-generated content in the training data introduces learned artifacts.
Mitigations in our prompts and settings: ask for "smooth matte finish, clean flat color"; use quality: "high" for finals, since lower quality settings show more compression artifacts. Analysis of our 5-spread OpenAI batch also revealed likely artifact amplifiers in our prompts.
| Parameter | Value |
|---|---|
| Supported sizes | 1024×1024 (1:1), 1536×1024 (3:2), 1024×1536 (2:3), auto |
| Our output | 1536×1024 (3:2 landscape) |
Key finding: GPT-Image-2 only supports 4 fixed sizes: 1024×1024 (1:1), 1536×1024 (3:2 landscape), 1024×1536 (2:3 portrait), and auto. For print-quality book illustrations, generate at 1536×1024 (3:2) and upscale externally (Real-ESRGAN, Topaz Gigapixel).
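A sketch of the generation call via the OpenAI Python SDK, assuming GPT Image 2 keeps gpt-image-1's images.generate signature; the "gpt-image-2" model id is our placeholder, not a confirmed identifier:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="gpt-image-2",    # assumed model id
    prompt=build_prompt(),  # ordered prompt from the sketch above
    size="1536x1024",       # 3:2 landscape — the largest supported size
    quality="high",         # finals only; use "low" for drafts
)

# gpt-image-1 returns base64-encoded image data; we assume the same here.
with open("spread.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```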
GPT Image 2 struggles with complex multi-species underwater compositions. When a prompt lists many species, the model often merges features across species and renders anatomically incorrect creatures.
Spread 05 requested maximum density with 10+ named species and showed morphed fish and anatomically incorrect creatures. Spread 01 and 08, which had fewer species described more carefully, produced better individual creature accuracy. For future batches: prioritize quality of each creature over quantity of species.
GPT Image 2 processes all reference images at high fidelity automatically, with no adjustable knob. For edit requests, we structure the instruction as:
Change: [exactly what should change]
Preserve: [face, identity, pose, lighting, framing, background, geometry, text, layout]
Constraints: [no extra objects, no redesign, no logo drift, no watermark]
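A small helper that formats this template (the helper and its field names are our own convention, mirroring the blocks above):

```python
def edit_instruction(change: str, preserve: list[str], constraints: list[str]) -> str:
    # Mirrors the Change / Preserve / Constraints template above.
    return (
        f"Change: {change}\n"
        f"Preserve: {', '.join(preserve)}\n"
        f"Constraints: {', '.join(constraints)}"
    )

# Example:
print(edit_instruction(
    change="make the scarf red instead of green",
    preserve=["face", "identity", "pose", "lighting", "framing", "background"],
    constraints=["no extra objects", "no redesign", "no watermark"],
))
```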
| Setting | Use Case | Cost (1024x1024) |
|---|---|---|
| low | Fast drafts, thumbnails | ~$0.006 |
| medium | General use | ~$0.053 |
| high | Final assets, dense layouts | ~$0.211 |
| auto | Let the model decide | varies |
| Format | Notes |
|---|---|
png | Default. Lossless. Best for illustration work. |
jpeg | Faster than PNG. Good when latency matters. |
webp | Smallest file size. Good for web delivery. |
| Feature | GPT Image 2 | DALL-E 3 |
|---|---|---|
| Text rendering accuracy | ~99% | ~60% |
| Texture/photorealism | More natural, contextually grounded | Good but less precise |
| Character consistency | Strong with proper anchoring | Inconsistent |
| Prompt following | Follows long, structured prompts well | Better with shorter prompts |
| Max resolution | Up to 2K (2048x2048) reported; the API exposes only the fixed sizes above | 1024x1024 |
Our settings: quality: "high" and png output format for all finals. Iteration ladder: quality: "low" for rapid iteration on composition and pose; quality: "medium" to nail down details and expressions; quality: "high" with png format at full resolution for finals.

Midjourney's --sref (Style Reference) is a parameter that transfers the aesthetic of a reference image or a numeric style code to a new generation. It operates in two modes:
--sref [image URL] — Midjourney analyzes the image and attempts to transfer its style.
--sref [code] — a numeric identifier mapping to a pre-defined internal style within Midjourney's latent space.

Midjourney has not published its architecture, but based on community analysis:
| Feature | --sref (Style Reference) | --cref (Character Reference) |
|---|---|---|
| Purpose | Keep the "camera"/art direction the same | Keep the "actor"/character the same |
| Captures | Colors, textures, lighting, composition, rendering style | Facial features, hair, body proportions, clothing |
| Weight param | --sw (0-1000, default 100) | --cw (0-100, default 100) |
| Metaphor | Same cinematographer, different scene | Same actor, different scene |
| Capability | Nano Banana Pro (3 Pro) | Nano Banana 2 (3.1 Flash) |
|---|---|---|
| Character reference images | Up to 5 | Up to 4 |
| Object fidelity images | Up to 6 | Up to 10 |
| Total reference images | Up to 11 | Up to 14 |
| Supported formats | PNG, JPEG, WebP, HEIC, HEIF | PNG, JPEG, WebP, HEIC, HEIF |
Important: The 14-image limit is NOT freely allocatable. Character and object quotas are independent.
GPT-Image-2 supports up to 16 reference images (JPEG, PNG, WebP, under 30MB each) via the images.edit endpoint.
Image 1: Character reference sheet showing Ollie the fox
Image 2: Approved illustration showing target color palette
Image 3: Scene composition reference
Generate a new illustration of Ollie walking through a forest.
Apply the watercolor style and color palette from Image 2.
Maintain Ollie's exact appearance from Image 1.
Use similar composition depth as Image 3.
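A sketch of the corresponding API call. gpt-image-1's images.edit accepts a list of input images, and we assume GPT Image 2 does the same, referencing inputs in order as "Image 1", "Image 2", and so on; the model id and file names are placeholders:

```python
import base64
from openai import OpenAI

client = OpenAI()

refs = [open(p, "rb") for p in (
    "ollie_character_sheet.png",  # Image 1
    "approved_palette.png",       # Image 2
    "composition_ref.png",        # Image 3
)]

result = client.images.edit(
    model="gpt-image-2",  # assumed model id
    image=refs,           # up to 16 references (JPEG/PNG/WebP, <30MB each)
    prompt=(
        "Generate a new illustration of Ollie walking through a forest. "
        "Apply the watercolor style and color palette from Image 2. "
        "Maintain Ollie's exact appearance from Image 1. "
        "Use similar composition depth as Image 3."
    ),
    size="1536x1024",
)

with open("ollie_forest.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```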
| Quality | Approx Cost per Image (1024x1024) |
|---|---|
| Low | ~$0.006 |
| Medium | ~$0.053 |
| High | ~$0.211 |
| Batch (50% discount) | Half of above |
Both Gemini and Claude can analyze an approved illustration and produce a structured style description, identifying attributes such as rendering technique, color palette, line work, lighting, and texture.
Key takeaway: For our purposes, using a vision model to extract a structured text description is more practical than training custom models. It is free/cheap, fast to iterate, and the output (text) can be directly used as prompt input for both NB2 and GPT-Image-2.
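A sketch of the extraction step using GPT-4o's vision input (Gemini or Claude work the same way); the extraction prompt is abbreviated here, and in practice we send the full 10-category version:

```python
import base64
from openai import OpenAI

client = OpenAI()

def to_data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

# Abbreviated extraction prompt; the full version covers all 10 categories.
EXTRACTION_PROMPT = (
    "Do NOT describe what is depicted. ONLY describe HOW it is depicted: "
    "rendering technique, color palette with hex codes, line work, lighting, "
    "texture, character proportions, background treatment, and what this "
    "style is NOT. Answer as JSON."
)

gold_standards = ["gold_01.png", "gold_02.png", "gold_03.png"]  # placeholder paths
content = [{"type": "text", "text": EXTRACTION_PROMPT}] + [
    {"type": "image_url", "image_url": {"url": to_data_url(p)}}
    for p in gold_standards
]

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": content}],
)
print(resp.choices[0].message.content)  # reusable style profile text
```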
[STYLE BLOCK - immutable]
Watercolor children's book illustration. Soft pencil outlines
with warm earthy palette.
[CHARACTER BLOCK - immutable per character]
Ollie: small orange fox with oversized round head, bright
curious eyes (#4169E1 blue), wearing a green scarf (#2E8B57).
[SCENE BLOCK - variable]
Ollie standing at the edge of a meadow, looking up at
fireflies. Golden hour lighting from the left.
Strategy: Vision-Extracted Style Profile + Multi-Reference Generation
| Factor | Rationale |
|---|---|
| No LoRA needed | Works natively with NB2 and GPT-Image-2 APIs |
| Low setup cost | Vision-based extraction is essentially free |
| Iterative | Profile can be refined as more illustrations are approved |
| Cross-model | Same style profile works for both NB2 and GPT-Image-2 |
| Automated QC | Vision models can score consistency without human bottleneck |
Phase 1: Select gold standards, run style extraction, merge into canonical profile, validate with text-only generations.
Phase 2: Test 0, 1, and 3 reference images on both NB2 and GPT-Image-2. Identify the optimal consistency-to-cost ratio (sketched after this list).
Phase 3: Test prompt ordering variations: style-first, scene-first, and interleaved. Measure style fidelity.
Phase 4: Test vision-model-based quality scoring. Validate against human judgment (>80% agreement target).
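A sketch of the Phase 2 grid, with generate() and score_consistency() as stubs for the API calls sketched elsewhere in this document; per-image costs are rough placeholders:

```python
from itertools import product

REF_COUNTS = (0, 1, 3)
MODELS = ("nb2", "gpt-image-2")
COST_PER_IMAGE = {"nb2": 0.04, "gpt-image-2": 0.053}  # placeholder estimates

def run_phase2(generate, score_consistency, scene_prompt, refs):
    """Return the (model, ref count) setting with the best consistency per dollar."""
    results = []
    for model, n in product(MODELS, REF_COUNTS):
        img = generate(model=model, prompt=scene_prompt, references=refs[:n])
        results.append({
            "model": model,
            "refs": n,
            "consistency": score_consistency(img),  # vision-model score, 0-1
            "cost": COST_PER_IMAGE[model],
        })
    return max(results, key=lambda r: r["consistency"] / r["cost"])
```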
| Phase | Estimated Cost |
|---|---|
| Phase 1 | $0.50-$1.00 |
| Phase 2 | $2.00-$5.00 |
| Phase 3 | $2.00-$3.00 |
| Phase 4 | $0.00 |
| Total | $4.50-$9.00 |
Estimated generations: 50-70 images total across all phases.
A 10-category JSON schema covering medium, color palette, line work, lighting, texture, character rendering, background treatment, mood/atmosphere, negative constraints, and prompt templates. The schema supports cross-model adaptation (NB2 and GPT-Image-2) with model-specific formatting differences.
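A skeleton of that schema as a Python dict (serialize with json.dumps when embedding it in a prompt); every value below is an illustrative placeholder:

```python
STYLE_PROFILE = {
    "medium": "watercolor with soft pencil outlines",
    "color_palette": {"primary": ["#E8A13C", "#2E8B57"], "mood": "warm, earthy"},
    "line_work": "loose pencil, consistent light weight",
    "lighting": "soft directional light, low contrast (~2:1)",
    "texture": "visible paper grain, dry-brush edges",
    "character_rendering": "oversized heads, simplified hands and feet",
    "background_treatment": "desaturated, less detailed than foreground",
    "mood_atmosphere": "gentle, curious, golden-hour warmth",
    "negative_constraints": ["no hard black outlines", "no photorealism"],
    "prompt_templates": {"style_block": "[STYLE] ...", "scene_block": "[SCENE] ..."},
}
```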
NB2 weights the start of the prompt heavily. Put your subject first, then style/environment details. Starting with style descriptors produces softer, less focused results.
Do: "A small brown dog wearing a red cape, standing on a hilltop at sunset, digital illustration style"
Don't: "Digital illustration style, sunset lighting, a small brown dog on a hilltop"
NB2 is a thinking model that understands intent, physics, and composition. Write prompts as if briefing a human illustrator. Avoid comma-separated tag lists.
Community benchmarks show JSON-structured prompts improve consistency by ~25% over plain text. Structure with labeled fields:
{
"subject": "A small brown dog named Ollie wearing a red cape",
"pose": "standing heroically, looking to the right",
"environment": "grassy hilltop at golden hour",
"style": "soft watercolor children's book illustration",
"lighting": "warm golden backlight with soft shadows",
"camera": "low angle, medium shot"
}
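Both APIs take plain-text prompts, so the structure travels inside the prompt string; a minimal sketch:

```python
import json

fields = {
    "subject": "A small brown dog named Ollie wearing a red cape",
    "pose": "standing heroically, looking to the right",
    "environment": "grassy hilltop at golden hour",
    "style": "soft watercolor children's book illustration",
}
prompt = json.dumps(fields, indent=2)  # pass this string as the prompt
```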
For different content in different parts of an image, divide the canvas into regions with a prompt for each. Useful for complex scenes with foreground characters and detailed backgrounds.
NB2 supports up to 10 reference images simultaneously. When using uploaded images, clearly define each role: "Use Image A for the character's pose," "Use Image B for the art style," "Use Image C for the background environment."
The most reliable consistency method: lock a character reference sheet first, then attach it to every scene generation with an explicit role assignment as above.
NB2 can retrieve real-world reference images via Google Search during generation. This is unique to NB2 (not available in Pro) and helps ground outputs in real-world visual context.
Include terms like "HD," "4K," or "HDR" for better clarity. Set a fixed seed for reproducibility, and combine it with weighted descriptors like "(same face:1.3)" to enforce consistency.
| Feature | Nano Banana 2 | Nano Banana Pro |
|---|---|---|
| Architecture | Gemini 3.1 Flash | Gemini 3 Pro |
| Speed | 4-6 sec at 1K | 10-20 sec at 1K |
| Quality | ~95% of Pro | Best absolute quality |
| Textures/Lighting | Very good | Superior |
| Unique Features | Image Search Grounding, Thinking Mode | N/A |
| Cost | Significantly cheaper | Higher per image |
| Best For | Speed, batch generation, stylized work | Precision, identity preservation |
| Capability | NB2 | GPT-Image-2 |
|---|---|---|
| Photorealism | Superior | Strong |
| Text/Typography | Good | Superior (best in class) |
| Anime/Illustration | Superior | Good |
| Structural Control | Moderate | Superior |
| Speed | Faster | Slower |
| Batch Consistency | Good | Superior |
| Cinematic Lighting | Superior | Good |
| Cost (API) | $60/M output tokens | $30/M output tokens |
Choose NB2 for: stylized illustration, photorealism, cinematic scenes, rapid iteration, anime. Choose GPT-Image-2 for: text-heavy designs, precise layouts, batch production, structural accuracy.
Single-model approaches force compromise. Every model has blind spots. The 2026 best practice is to route each task to the model that handles it best.
For children's book illustration: GPT-Image-2 excels at structural accuracy, text rendering, and batch consistency. NB2 excels at stylized illustration, cinematic lighting, and rapid iteration. Combining them produces results neither can achieve alone.
| Capability | NB2 | GPT-Image-2 | Best Choice |
|---|---|---|---|
| Photorealism | Excellent | Very Good | NB2 |
| Stylized Illustration | Excellent | Good | NB2 |
| Text/Typography | Good | Best in class | GPT-Image-2 |
| Precise Layout | Moderate | Excellent | GPT-Image-2 |
| Cinematic Lighting | Superior | Good | NB2 |
| Batch Consistency | Good | Excellent | GPT-Image-2 |
| Character Identity | Good (improving) | Strong | GPT-Image-2 |
| Speed | 4-6 sec/image | Slower | NB2 |
| Cost | $60/M output | $30/M output | GPT-Image-2 |
| Conversational Editing | Excellent | Good | NB2 |
| Multi-Reference | Up to 10 | Limited | NB2 |
Flow: GPT-Image-2 (structure) → NB2 (style transfer)
Flow: NB2 (ideation) → GPT-Image-2 (final production)
Both models generate independently on the same prompt. Compare and select the best per scene. Best when you have prompt budget and want maximum quality.
Phase 1 — Character Design (GPT-Image-2): Generate definitive character sheets. Lock proportions, colors, features. GPT-Image-2's batch consistency ensures reliable reference sheets.
Phase 2 — Scene Exploration (NB2): Rapid scene iteration with character sheet references. Explore compositions and lighting at 3-5x speed. Multi-reference conditioning with character + environment + style references. Generate 4-6 variants per scene.
Phase 3 — Final Production (Model-Dependent): Text scenes → GPT-Image-2. Atmospheric/stylistic scenes → NB2. Precise character placement → GPT-Image-2. Wide environmental scenes → NB2.
Phase 4 — Consistency Pass: Review all finals. Use NB2 conversational editing for minor adjustments. Use GPT-Image-2 for structural corrections.
Leonardo AI's Pro Upscaler is purpose-built for AI-generated images, capable of scaling up to 105 megapixels. Key differentiator: while most upscalers are trained on real photographs, Leonardo's understands synthetic grain and AI textures, cleaning them rather than amplifying them.
Verdict: Strong candidate. 105MP ceiling far exceeds print needs (12×8 inch spread at 300 DPI = ~8.6MP).
Leonardo's Canvas mode includes Inpainting/Outpainting tools for extending images beyond original boundaries.
Converting 1536×1024 (3:2) to 2048×1024 (2:1) requires ~512px of added width (~256px per side if extended symmetrically) — moderate and within the tool's sweet spot. Works best when extending backgrounds/environments rather than adding character elements.
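The arithmetic, holding height fixed and widening to 2:1:

```python
src_w, src_h = 1536, 1024
target_w = int(src_h * 2 / 1)        # 2048 px wide at 2:1
total_extension = target_w - src_w   # 512 px of new width
per_side = total_extension // 2      # 256 px per side if extended symmetrically
```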
| Plan | Price | Tokens/Month | API Access |
|---|---|---|---|
| Free | $0 | 150/day | No |
| Apprentice | $12/mo | 8,500 | Limited |
| Artisan | $30/mo | 25,000 | Yes |
| Maestro | $48-60/mo | 60,000 | Full + priority |
Token costs: Standard generation 2-25 tokens, upscaling ~2x generation cost. Cost optimization: generate low-res, only upscale selections (saves 40-50%). Artisan plan handles ~500-1,000 upscale operations. $5 free API credit for testing.
| Tool | Best For | API | Pricing | AI-Image Aware |
|---|---|---|---|---|
| Leonardo Pro | AI-generated images | Yes | Token-based | Yes (trained on synthetic) |
| Real-ESRGAN | General-purpose, free | Self-hosted | Free | No |
| Topaz Photo AI | Photographers | No | $199 one-time | No |
| Magnific AI | Creative upscaling | Yes | Subscription | Partial (over-hallucination risk) |
Recommendation: Leonardo Pro is the strongest candidate: purpose-built for AI images, API available, token-based pricing, outpainting and upscaling in one platform. Runner-up: Real-ESRGAN as free fallback. Avoid: Magnific — tendency to hallucinate new details could compromise illustration consistency.
| Version | Release | Status |
|---|---|---|
| V6 | Late 2024 | Legacy |
| V7 | 2025 | Stable, was default |
| V8.0 Alpha | March 17, 2026 | Superseded |
| V8.1 Alpha | April 14, 2026 | Current latest |
Hands/bodies/objects dramatically better from V7 onward. V8 further improves physical coherence. V8.1 corrects V8.0's over-processed default look. Some legacy features still missing (image prompting, in-painting). Reduced ability for abstract/surreal compositions compared to V7.
The --sref parameter extracts aesthetic qualities from reference images: color palette, lighting, texture treatment, composition tendencies. Unlike regular image prompts, it focuses on style elements rather than content.
The --sv (style version) parameter selects the style-reference algorithm; --sv 6 is the default (latest). New in V8: shape aesthetics visually — enter a prompt, get aesthetic options, select preferences, and receive a unique reusable style code. Web-only (midjourney.com), not Discord.
Moodboards: curate collections of images to define a persistent style. Upload images, name and save them for reuse, and apply them across generations.
Recommended workflow: explore styles interactively, save winning codes or moodboards, and export selected images for use as cross-model style references.
No public API. Enterprise dashboard users only, must apply. Unofficial APIs (PiAPI, APIFRAME, ImagineAPI) violate ToS and risk account bans. Midjourney cannot be reliably integrated into an automated pipeline.
Best used as a manual tool for style exploration and reference generation, not batch production.
| Plan | Price | Images/Month |
|---|---|---|
| Basic | $10/mo | ~200 |
| Standard | $30/mo | ~900 |
| Pro | $60/mo | ~1,800 fast |
For our use case (style references, not production), Basic at $10/mo would suffice.
| Capability | Score |
|---|---|
| Image quality (V8.1) | 9/10 |
| Style consistency (sref/moodboards) | 8/10 |
| Artifact avoidance | 8/10 |
| API for pipeline | 2/10 |
| Cross-model style transfer | 7/10 |
| Cost efficiency | 8/10 |
| Overall pipeline fit | 6/10 (style reference tool only) |
A structured text description of an image's visual style, produced by sending approved reference images to a vision model (Gemini, Claude, GPT-4o) and asking it to analyze and catalog style attributes. The output is a reusable text block for generation prompts.
| Count | Pros | Cons |
|---|---|---|
| 1 image | Simple, fast | Overfits to one scene's lighting/composition |
| 2 images | Better than one | Still too few to separate scene from style |
| 3-5 images | Sweet spot: enough diversity to find the common thread (recommended) | None significant |
| 6-10 images | Very thorough | Diminishing returns, may average too aggressively |
| 10+ images | Comprehensive | Descriptions become generic |
Key principle: Images should be diverse in SCENE but consistent in STYLE. Forces the vision model to identify style (what stays the same) vs. content (what changes).
A 10-category extraction covering: (1) Rendering Technique, (2) Color Palette with hex codes, (3) Line Work, (4) Lighting and Shadow with contrast ratios, (5) Texture and Surface, (6) Character Rendering proportions, (7) Background Treatment, (8) Composition and Framing, (9) What This Style Is NOT, (10) One-Sentence Test. Forces specificity, anchors to observable details, includes negatives.
The 2025-2026 standard: 70-78% of professional AI creators use JSON-formatted style guides. Structured prompts improve accuracy by 60-80% for complex scenes. Eliminates ambiguity, machine-readable, reproducible. GPT-Image-2 responds especially well to structured input.
Midjourney uses a dedicated internal style encoder (likely CLIP-based) to extract style features from a reference image and inject them as a separate conditioning signal. Style is kept separate from content.
| Feature | Midjourney --sref | NB2/GPT-Image-2 |
|---|---|---|
| Style encoding | Dedicated encoder, separate channel | General vision encoder, mixed with content |
| Style vs content separation | Explicit | Implicit (must be prompted) |
| Reproducibility | Deterministic codes | No equivalent |
| Weight control | --sw 0-1000 | No direct param |
| Technique | Can Use with NB2? | Can Use with GPT-Image-2? | Post-Processing? | Open Source? |
|---|---|---|---|---|
| StyleTokenizer (ECCV 2024) | No | No | Potentially | Yes |
| StyleBrush (2024) | No | No | Yes (promising) | Yes |
| InstantStyle (2024) | No | No | Yes (via ComfyUI) | Yes |
| IP-Adapter (2023-2025) | No | No | Yes (via ComfyUI) | Yes |
| Z-STAR (CVPR 2024) | No | No | Potentially | Yes |
Bottom line: None can be plugged directly into NB2 or GPT-Image-2 APIs. All require Stable Diffusion. However, they CAN be used as post-processing layers, and their design principles (separate style encoding, targeted injection, JSON prompts) inform better prompt writing.
Option B: generate a base image with GPT-Image-2 (composition, character consistency), then run it through a style-transfer layer for the specific aesthetic.
Current recommendation: Don't add post-processing yet. Focus on making GPT-Image-2/NB2 produce the right style natively. Post-processing is the fallback. If needed later, start with Option B.
"Colorful watercolor children's book illustration with warm tones" describes half of all children's books. Fix: Use the detailed extraction prompt with hex codes, ratios, named techniques.
Single image = profile captures scene-specific attributes instead of style attributes. Fix: Use 3-5 images with different scenes but same style. Intersection = style.
Style buried in middle/end of prompt. Fix: Style FIRST, clear section markers, repeat critical elements at end.
Vision model defaults to "what is in the image." Fix: Explicitly say "Do NOT describe what is depicted. ONLY describe HOW it is depicted."
Measurable attributes correct but images still look "off." Fix: Always pair text with reference images. Use one-sentence descriptor for emergent quality. Use negative constraints. Iterate.
Model's style distribution bias. Fix: Try the other model, use fine-tuned SD, apply post-processing, or adjust target style.
STEP 1: STYLE EXTRACTION (One-time, ~30 min)
- User provides 3-5 gold standard images
- Run extraction prompt on each via Gemini/Claude
- Aggregate into canonical JSON style profile
- User reviews and approves
STEP 2: PROMPT TEMPLATE CREATION (~15 min)
- Build three-block template: [STYLE] + [CHARACTER] + [SCENE]
- Test with 2-3 generations
- Adjust style profile based on test results
STEP 3: BATCH GENERATION (Per illustration, ~5-10 min)
- Fill in [SCENE] block
- Select 2-3 reference images
- Generate 3-5 candidates
- Automated QC: vision model scores consistency
- Present top 2-3 to user
STEP 4: ITERATIVE REFINEMENT (As needed)
- Fix one attribute at a time
- Re-run with updated profile
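A compact sketch of Steps 2-4 as a single function; generate() and score_consistency() are stubs for the API calls sketched earlier, and the 0.8 QC threshold is a placeholder:

```python
def produce_spread(scene_text, style_profile, character_blocks, references,
                   generate, score_consistency, n_candidates=4, qc_threshold=0.8):
    # Step 2: three-block prompt, style first (GPT Image 2 ordering).
    prompt = "\n\n".join([
        f"[STYLE] {style_profile}",
        *(f"[CHARACTER] {block}" for block in character_blocks),
        f"[SCENE] {scene_text}",
    ])
    # Step 3: generate candidates with 2-3 references, then score each one.
    candidates = [generate(prompt=prompt, references=references[:3])
                  for _ in range(n_candidates)]
    scored = sorted(((score_consistency(img), img) for img in candidates),
                    key=lambda pair: pair[0], reverse=True)
    # Step 4: surface the top candidates that clear QC; if none do,
    # fix one attribute in the profile and re-run.
    return [img for score, img in scored if score >= qc_threshold][:3]
```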
| Step | Images | Est. Cost |
|---|---|---|
| Style extraction (vision, text only) | 0 | $0 |
| Template validation | 3-5 | $0.30-$1.00 |
| Per illustration (3-5 candidates) | 3-5 | $0.30-$1.00 |
| Automated QC (vision, text only) | 0 | $0 |
| Full 24-page book (12 spreads) | 36-60 | $3.60-$12.00 |