Research & Progress
Test new photorealistic character refs + natural language NB2 prompts across 3 scenes.
nb2-consistency-prompt skill with narrative format.

| Model | Character Refs | Object Refs | Total Max |
|---|---|---|---|
| NB2 (Gemini 3.1 Flash) | 4 | 10 | 14 |
| NBP (Gemini 3 Pro) | 5 | 6 | 11 |
Reference images are passed as inlineData in the contents array. The model interprets image roles through the prompt text, so explicitly describe what each ref shows. The character and object quotas are independent hard limits, not a flexible shared pool.
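The independent quotas can be checked before a request is sent. The following is a minimal sketch, assuming the limits from the table above; the function name `build_parts` and the `(role, mime_type, raw_bytes)` tuple shape are hypothetical, and the `inlineData` part layout (`mimeType` plus base64 `data`) should be confirmed against the current Gemini API reference.

```python
import base64

# (character refs, object refs) per model, copied from the table above.
REF_LIMITS = {"NB2": (4, 10), "NBP": (5, 6)}


def build_parts(model, prompt, refs):
    """Build the parts list for one generation request.

    refs: list of (role, mime_type, raw_bytes); role is "character" or "object".
    The two quotas are validated separately because they are independent.
    """
    max_char, max_obj = REF_LIMITS[model]
    roles = [role for role, _, _ in refs]
    if any(role not in ("character", "object") for role in roles):
        raise ValueError("role must be 'character' or 'object'")
    if roles.count("character") > max_char:
        raise ValueError(f"{model} allows at most {max_char} character refs")
    if roles.count("object") > max_obj:
        raise ValueError(f"{model} allows at most {max_obj} object refs")

    parts = [
        {"inlineData": {"mimeType": mime,
                        "data": base64.b64encode(raw).decode("ascii")}}
        for _, mime, raw in refs
    ]
    # The prompt text must name the role of each image, since the model
    # infers roles from text rather than from image position alone.
    parts.append({"text": prompt})
    return parts
```

Checking each quota on its own mirrors the rule that unused character slots cannot be spent on object references.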
nb2-consistency-prompt skill.
ollie-dot-deploy + cloudflare-deploy skills with Playwright verification.

GPT Image 2 rewards structure over adjectives. Words like "stunning" or "beautiful" do very little. What works is organized, specific description in this order:
The model allocates more attention to earlier tokens in the prompt. Putting materials and format first tells the model "this is a mixed-media collage" before it starts imagining the scene, which fundamentally changes how it renders everything that follows. Burying the style at the end means the model may have already committed to a photorealistic rendering before it sees "watercolor."
GPT Image 2 follows long prompts more reliably than most other models. Our best results used 300-500 word prompts. If you're getting generic results, the answer is usually more specificity, not less. Be explicit about every element you care about.
Dry torn paper coral with visible fiber texture exists deep
underwater alongside photographic reef life.
Gold foil metallic stipple shimmers on surfaces submerged
in real ocean water.
Gouache painted characters swim through photographic
Hawaiian reef. Heavy watercolor paper base.
Full bleed 3:2 children's book illustration. Mixed media
collage blending with underwater photography.
SCENE: A dramatic steep Hawaiian reef drop-off descending
into the deep...
OLLIE: A cartoon baby octopus. [full character anchor with
colors, proportions, anti-patterns]...
DOT: Solid natural Tahitian pearl. [full character anchor
with eye shape, anti-patterns]...
ENVIRONMENT: Rich shallow coral reef dropping into deep...
RENDERING: Warm golden god-ray light transitioning to deep
navy. [hex palette]. Full bleed 3:2 aspect ratio.
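The materials-first ordering in the example above can be enforced when prompts are assembled in code. This is a minimal sketch: `build_prompt` and `word_count` are hypothetical helper names, and the section labels are whatever the caller passes in.

```python
def build_prompt(materials, sections):
    """Put the materials/format statement first, then labeled sections.

    materials: the medium and format description that must lead the prompt.
    sections: ordered list of (label, text) pairs, e.g. ("SCENE", "...").
    """
    lines = [materials.strip()]
    lines += [f"{label}: {text.strip()}" for label, text in sections]
    return "\n".join(lines)


def word_count(prompt):
    """Whitespace-separated word count, for checking the 300-500 word range."""
    return len(prompt.split())
```

Keeping the materials statement as a separate required argument makes it hard to bury the style at the end of the prompt by accident.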
GPT Image 2 achieves ~99% accuracy on text (vs ~60% for DALL-E 3). Place exact text in quotation marks or ALL CAPS, specify font style and placement explicitly. Best at 1-5 words per text element. Not relevant for our book illustrations (we avoid in-image text).
Shortly after GPT Image 2 launched on April 21, 2026, users reported a consistent visual problem: generated images were overlaid with persistent tiling textures and grime-like artifacts.
Primary cause: Steganographic Watermarking. OpenAI embeds a dual-layer watermark into every generated image: (1) C2PA metadata and (2) an imperceptible pixel-level watermark with no public detector yet. The pixel watermark is baked into the generation process itself, creating visible texture patterns especially on smooth, uniform surfaces. Downstream workflows like upscaling or compositing can amplify these artifacts.
Secondary cause: Context-dependent noise accumulation. The model propagates noise from previous generations within the same conversation. Artifacts carry over and compound with each successive image in a chat thread. OpenAI fixed a noise amplification bug, but some baseline texture persists.
Tertiary cause: Training data contamination. GPT Image 2 has been observed generating images containing visible Gemini branding and other AI-model artifacts — evidence that AI-generated content in the training data introduces learned artifacts.
- smooth matte finish, clean flat color
- quality: "high" for finals; lower quality settings show more compression artifacts

Analysis of our 5-spread OpenAI batch revealed likely artifact amplifiers in our prompts:
| Parameter | Value |
|---|---|
| Supported sizes | 1024×1024 (1:1), 1536×1024 (3:2), 1024×1536 (2:3), auto |
| Our output | 1536×1024 (3:2 landscape) |
Key finding: GPT-Image-2 only supports 4 fixed sizes: 1024×1024 (1:1), 1536×1024 (3:2 landscape), 1024×1536 (2:3 portrait), and auto. For print-quality book illustrations, generate at 1536×1024 (3:2) and upscale externally (Real-ESRGAN, Topaz Gigapixel).
GPT Image 2 struggles with complex multi-species underwater compositions. When a prompt lists many species, the model often:
Spread 05 requested maximum density with 10+ named species and showed morphed fish and anatomically incorrect creatures. Spread 01 and 08, which had fewer species described more carefully, produced better individual creature accuracy. For future batches: prioritize quality of each creature over quantity of species.
GPT Image 2 processes all reference images at high fidelity automatically with no adjustable knob.
Change: [exactly what should change]
Preserve: [face, identity, pose, lighting, framing, background, geometry, text, layout]
Constraints: [no extra objects, no redesign, no logo drift, no watermark]
| Setting | Use Case | Cost (1024x1024) |
|---|---|---|
| low | Fast drafts, thumbnails | ~$0.006 |
| medium | General use | ~$0.053 |
| high | Final assets, dense layouts | ~$0.211 |
| auto | Let the model decide | varies |
| Format | Notes |
|---|---|
| png | Default. Lossless. Best for illustration work. |
| jpeg | Faster than PNG. Good when latency matters. |
| webp | Smallest file size. Good for web delivery. |
| Feature | GPT Image 2 | DALL-E 3 |
|---|---|---|
| Text rendering accuracy | ~99% | ~60% |
| Texture/photorealism | More natural, contextually grounded | Good but less precise |
| Character consistency | Strong with proper anchoring | Inconsistent |
| Prompt following | Follows long, structured prompts well | Better with shorter prompts |
| Max resolution | Up to 2K (2048x2048) | 1024x1024 |
- quality: "high" and png output format for all finals
- quality: "low", rapid iteration on composition and pose
- quality: "medium", nail down details and expressions
- quality: "high", png format, full resolution

Midjourney's --sref (Style Reference) is a parameter that transfers the aesthetic of a reference image or a numeric style code to a new generation. It operates in two modes:
- --sref [image URL]: Midjourney analyzes the image and attempts to transfer its style.
- --sref [code]: A numeric identifier mapping to a pre-defined internal style within Midjourney's latent space.

Midjourney has not published its architecture, but based on community analysis:
| Feature | --sref (Style Reference) | --cref (Character Reference) |
|---|---|---|
| Purpose | Keep the "camera"/art direction the same | Keep the "actor"/character the same |
| Captures | Colors, textures, lighting, composition, rendering style | Facial features, hair, body proportions, clothing |
| Weight param | --sw (0-1000, default 100) | --cw (0-100, default 100) |
| Metaphor | Same cinematographer, different scene | Same actor, different scene |
| Capability | Nano Banana Pro (3 Pro) | Nano Banana 2 (3.1 Flash) |
|---|---|---|
| Character reference images | Up to 5 | Up to 4 |
| Object fidelity images | Up to 6 | Up to 10 |
| Total reference images | Up to 11 | Up to 14 |
| Supported formats | PNG, JPEG, WebP, HEIC, HEIF | PNG, JPEG, WebP, HEIC, HEIF |
Important: The 14-image limit is NOT freely allocatable. Character and object quotas are independent.
GPT-Image-2 supports up to 16 reference images (JPEG, PNG, WebP, under 30MB each) via the images.edit endpoint.
Image 1: Character reference sheet showing Ollie the fox
Image 2: Approved illustration showing target color palette
Image 3: Scene composition reference
Generate a new illustration of Ollie walking through a forest.
Apply the watercolor style and color palette from Image 2.
Maintain Ollie's exact appearance from Image 1.
Use similar composition depth as Image 3.
| Quality | Approx Cost per Image (1024x1024) |
|---|---|
| Low | ~$0.006 |
| Medium | ~$0.053 |
| High | ~$0.211 |
| Batch (50% discount) | Half of above |
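The per-image prices above make batch budgets easy to estimate. A minimal sketch, assuming the approximate 1024x1024 prices from the table and a flat 50% batch discount; `estimate_cost` is a hypothetical helper, not part of any SDK.

```python
# Approximate cost per 1024x1024 image, in USD, from the table above.
COST_PER_IMAGE = {"low": 0.006, "medium": 0.053, "high": 0.211}


def estimate_cost(quality, count, batch=False):
    """Estimated USD cost for `count` images at one quality setting."""
    cost = COST_PER_IMAGE[quality] * count
    if batch:
        cost *= 0.5  # batch pricing is half of the standard price
    return round(cost, 3)
```

For example, `estimate_cost("high", 12)` covers one high-quality candidate for each of 12 spreads.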
Both Gemini and Claude can analyze an approved illustration and produce a structured style description, identifying:
Key takeaway: For our purposes, using a vision model to extract a structured text description is more practical than training custom models. It is free/cheap, fast to iterate, and the output (text) can be directly used as prompt input for both NB2 and GPT-Image-2.
[STYLE BLOCK - immutable]
Watercolor children's book illustration. Soft pencil outlines
with warm earthy palette.
[CHARACTER BLOCK - immutable per character]
Ollie: small orange fox with oversized round head, bright
curious eyes (#4169E1 blue), wearing a green scarf (#2E8B57).
[SCENE BLOCK - variable]
Ollie standing at the edge of a meadow, looking up at
fireflies. Golden hour lighting from the left.
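The three-block structure above lends itself to a small template function in which only the scene text changes between generations. This is a sketch with a hypothetical function name; the block headers are copied from the example.

```python
def build_three_block_prompt(style, character, scene):
    """Join the immutable style and character blocks with a variable scene."""
    return "\n".join([
        "[STYLE BLOCK - immutable]",
        style.strip(),
        "[CHARACTER BLOCK - immutable per character]",
        character.strip(),
        "[SCENE BLOCK - variable]",
        scene.strip(),
    ])
```

Holding the first two arguments in constants and passing a new `scene` per illustration keeps the style and character wording identical across a batch.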
Strategy: Vision-Extracted Style Profile + Multi-Reference Generation
| Factor | Rationale |
|---|---|
| No LoRA needed | Works natively with NB2 and GPT-Image-2 APIs |
| Low setup cost | Vision-based extraction is essentially free |
| Iterative | Profile can be refined as more illustrations are approved |
| Cross-model | Same style profile works for both NB2 and GPT-Image-2 |
| Automated QC | Vision models can score consistency without human bottleneck |
Select gold standards, run style extraction, merge into canonical profile, validate with text-only generations.
Test 0, 1, and 3 reference images on both NB2 and GPT-Image-2. Identify optimal consistency-to-cost ratio.
Test prompt ordering variations: style-first, scene-first, and interleaved. Measure style fidelity.
Test vision-model-based quality scoring. Validate against human judgment (>80% agreement target).
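The agreement target in the last phase can be measured as the share of images where the vision model and a human reviewer reach the same pass/fail verdict. A minimal sketch, assuming boolean verdicts per image; both function names are hypothetical.

```python
def agreement_rate(model_verdicts, human_verdicts):
    """Fraction of images where model and human verdicts match."""
    if not model_verdicts or len(model_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must be non-empty and equal length")
    matches = sum(m == h for m, h in zip(model_verdicts, human_verdicts))
    return matches / len(model_verdicts)


def meets_target(rate, target=0.80):
    """The target is strictly greater than 80% agreement."""
    return rate > target
```

Simple agreement does not correct for chance, so a heavily skewed pass rate can inflate the number; that caveat is worth keeping in mind when reading the result.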
| Phase | Estimated Cost |
|---|---|
| Phase 1 | $0.50-$1.00 |
| Phase 2 | $2.00-$5.00 |
| Phase 3 | $2.00-$3.00 |
| Phase 4 | $0.00 |
| Total | $4.50-$9.00 |
Estimated generations: 50-70 images total across all phases.
A 10-category JSON schema covering medium, color palette, line work, lighting, texture, character rendering, background treatment, mood/atmosphere, negative constraints, and prompt templates. The schema supports cross-model adaptation (NB2 and GPT-Image-2) with model-specific formatting differences.
NB2 weights the start of the prompt heavily. Put your subject first, then style/environment details. Starting with style descriptors produces softer, less focused results.
Do: "A small brown dog wearing a red cape, standing on a hilltop at sunset, digital illustration style"
Don't: "Digital illustration style, sunset lighting, a small brown dog on a hilltop"
NB2 is a thinking model that understands intent, physics, and composition. Write prompts as if briefing a human illustrator. Avoid comma-separated tag lists.
Community benchmarks show JSON-structured prompts improve consistency by ~25% over plain text. Structure with labeled fields:
{
"subject": "A small brown dog named Ollie wearing a red cape",
"pose": "standing heroically, looking to the right",
"environment": "grassy hilltop at golden hour",
"style": "soft watercolor children's book illustration",
"lighting": "warm golden backlight with soft shadows",
"camera": "low angle, medium shot"
}
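A structured prompt like the one above can be generated from code so that field names and order stay fixed across a batch. This is a sketch: the field list is taken from the example, and `to_json_prompt` is a hypothetical helper.

```python
import json

# Field order copied from the example above.
FIELDS = ["subject", "pose", "environment", "style", "lighting", "camera"]


def to_json_prompt(**fields):
    """Serialize prompt fields as JSON in a fixed order, rejecting typos."""
    unknown = sorted(set(fields) - set(FIELDS))
    missing = [name for name in FIELDS if name not in fields]
    if unknown or missing:
        raise ValueError(f"unknown={unknown} missing={missing}")
    ordered = {name: fields[name] for name in FIELDS}
    return json.dumps(ordered, indent=2, ensure_ascii=False)
```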
For different content in different parts of an image, divide the canvas into regions with a prompt for each. Useful for complex scenes with foreground characters and detailed backgrounds.
NB2 supports up to 10 reference images simultaneously. When using uploaded images, clearly define each role: "Use Image A for the character's pose," "Use Image B for the art style," "Use Image C for the background environment."
The most reliable consistency method:
NB2 can retrieve real-world reference images via Google Search during generation. This is unique to NB2 (not available in Pro) and helps ground outputs in real-world visual context.
Include terms like "HD," "4K," or "HDR" for better clarity. Start every prompt with a fixed seed for reproducibility. Combine with weighted descriptors like "(same face:1.3)" to enforce consistency.
| Feature | Nano Banana 2 | Nano Banana Pro |
|---|---|---|
| Architecture | Gemini 3.1 Flash | Gemini 3 Pro |
| Speed | 4-6 sec at 1K | 10-20 sec at 1K |
| Quality | ~95% of Pro | Best absolute quality |
| Textures/Lighting | Very good | Superior |
| Unique Features | Image Search Grounding, Thinking Mode | N/A |
| Cost | Significantly cheaper | Higher per image |
| Best For | Speed, batch generation, stylized work | Precision, identity preservation |
| Capability | NB2 | GPT-Image-2 |
|---|---|---|
| Photorealism | Superior | Strong |
| Text/Typography | Good | Superior (best in class) |
| Anime/Illustration | Superior | Good |
| Structural Control | Moderate | Superior |
| Speed | Faster | Slower |
| Batch Consistency | Good | Superior |
| Cinematic Lighting | Superior | Good |
| Cost (API) | $60/M output tokens | $30/M output tokens |
Choose NB2 for: stylized illustration, photorealism, cinematic scenes, rapid iteration, anime. Choose GPT-Image-2 for: text-heavy designs, precise layouts, batch production, structural accuracy.
Single-model approaches force compromise. Every model has blind spots. The 2026 best practice is to route each task to the model that handles it best.
For children's book illustration: GPT-Image-2 excels at structural accuracy, text rendering, and batch consistency. NB2 excels at stylized illustration, cinematic lighting, and rapid iteration. Combining them produces results neither can achieve alone.
| Capability | NB2 | GPT-Image-2 | Best Choice |
|---|---|---|---|
| Photorealism | Excellent | Very Good | NB2 |
| Stylized Illustration | Excellent | Good | NB2 |
| Text/Typography | Good | Best in class | GPT-Image-2 |
| Precise Layout | Moderate | Excellent | GPT-Image-2 |
| Cinematic Lighting | Superior | Good | NB2 |
| Batch Consistency | Good | Excellent | GPT-Image-2 |
| Character Identity | Good (improving) | Strong | GPT-Image-2 |
| Speed | 4-6 sec/image | Slower | NB2 |
| Cost | $60/M output | $30/M output | GPT-Image-2 |
| Conversational Editing | Excellent | Good | NB2 |
| Multi-Reference | Up to 10 | Limited | NB2 |
Flow: GPT-Image-2 (structure) → NB2 (style transfer)
Flow: NB2 (ideation) → GPT-Image-2 (final production)
Both models generate independently on the same prompt. Compare and select the best per scene. Best when you have prompt budget and want maximum quality.
Phase 1 — Character Design (GPT-Image-2): Generate definitive character sheets. Lock proportions, colors, features. GPT-Image-2's batch consistency ensures reliable reference sheets.
Phase 2 — Scene Exploration (NB2): Rapid scene iteration with character sheet references. Explore compositions and lighting at 3-5x speed. Multi-reference conditioning with character + environment + style references. Generate 4-6 variants per scene.
Phase 3 — Final Production (Model-Dependent): Text scenes → GPT-Image-2. Atmospheric/stylistic scenes → NB2. Precise character placement → GPT-Image-2. Wide environmental scenes → NB2.
Phase 4 — Consistency Pass: Review all finals. Use NB2 conversational editing for minor adjustments. Use GPT-Image-2 for structural corrections.
Leonardo AI's Pro Upscaler is purpose-built for AI-generated images and can scale up to 105 megapixels. Key differentiator: while most upscalers are trained on real photographs, Leonardo's upscaler understands synthetic grain and AI textures, cleaning them rather than amplifying them.
Verdict: Strong candidate. 105MP ceiling far exceeds print needs (12×8 inch spread at 300 DPI = ~8.6MP).
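The print-size arithmetic behind this verdict can be reproduced directly. A minimal sketch with hypothetical helper names; the 1536×1024 source size in the usage note is the generation size used elsewhere in these notes.

```python
def print_pixels(width_in, height_in, dpi=300):
    """Pixel dimensions and megapixels needed for a print size."""
    w, h = round(width_in * dpi), round(height_in * dpi)
    return w, h, w * h / 1_000_000


def upscale_factor(src_w, src_h, target_w, target_h):
    """Smallest uniform scale that covers the target in both dimensions."""
    return max(target_w / src_w, target_h / src_h)
```

A 12×8 inch spread needs 3600×2400 pixels (8.64MP), which is about a 2.34x upscale from a 1536×1024 source.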
Leonardo's Canvas mode includes Inpainting/Outpainting tools for extending images beyond original boundaries.
Converting 1536×1024 (3:2) to 2048×1024 (2:1) requires extending the width by 512px in total (about 256px per side), which is moderate and within the tool's sweet spot. Works best when extending backgrounds/environments rather than adding character elements.
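The padding needed for an aspect-ratio conversion follows from the source height and the target ratio. A minimal sketch; `outpaint_padding` is a hypothetical helper and assumes the height is kept and the extension is split evenly left and right.

```python
def outpaint_padding(src_w, src_h, target_ratio):
    """Return (target_width, left_pad, right_pad) for a wider aspect ratio."""
    target_w = round(src_h * target_ratio)
    extra = target_w - src_w
    if extra < 0:
        raise ValueError("target ratio is narrower than the source")
    left = extra // 2
    return target_w, left, extra - left
```

For 1536×1024 converted to 2:1, this gives a 2048px target width: 512px of new canvas in total, 256px on each side.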
| Plan | Price | Tokens/Month | API Access |
|---|---|---|---|
| Free | $0 | 150/day | No |
| Apprentice | $12/mo | 8,500 | Limited |
| Artisan | $30/mo | 25,000 | Yes |
| Maestro | $48-60/mo | 60,000 | Full + priority |
Token costs: Standard generation 2-25 tokens, upscaling ~2x generation cost. Cost optimization: generate low-res, only upscale selections (saves 40-50%). Artisan plan handles ~500-1,000 upscale operations. $5 free API credit for testing.
| Tool | Best For | API | Pricing | AI-Image Aware |
|---|---|---|---|---|
| Leonardo Pro | AI-generated images | Yes | Token-based | Yes (trained on synthetic) |
| Real-ESRGAN | General-purpose, free | Self-hosted | Free | No |
| Topaz Photo AI | Photographers | No | $199 one-time | No |
| Magnific AI | Creative upscaling | Yes | Subscription | Partial (over-hallucination risk) |
Recommendation: Leonardo Pro is the strongest candidate: purpose-built for AI images, API available, token-based pricing, outpainting and upscaling in one platform. Runner-up: Real-ESRGAN as free fallback. Avoid: Magnific — tendency to hallucinate new details could compromise illustration consistency.
| Version | Release | Status |
|---|---|---|
| V6 | Late 2024 | Legacy |
| V7 | 2025 | Stable, was default |
| V8.0 Alpha | March 17, 2026 | Superseded |
| V8.1 Alpha | April 14, 2026 | Current latest |
Hands/bodies/objects dramatically better from V7 onward. V8 further improves physical coherence. V8.1 corrects V8.0's over-processed default look. Some legacy features still missing (image prompting, in-painting). Reduced ability for abstract/surreal compositions compared to V7.
The --sref parameter extracts aesthetic qualities from reference images: color palette, lighting, texture treatment, composition tendencies. Unlike regular image prompts, it focuses on style elements rather than content.
--sv
--sv 6: Default (latest)

New in V8: shape aesthetics visually. Enter a prompt, get aesthetic options, select preferences, receive a unique reusable style code. Web-only (midjourney.com), not Discord.
Curate collections of images to define persistent style. Upload images, name and save for reuse, apply across generations.
Recommended workflow:
No public API. Enterprise dashboard users only, must apply. Unofficial APIs (PiAPI, APIFRAME, ImagineAPI) violate ToS and risk account bans. Midjourney cannot be reliably integrated into an automated pipeline.
Best used as a manual tool for style exploration and reference generation, not batch production.
| Plan | Price | Images/Month |
|---|---|---|
| Basic | $10/mo | ~200 |
| Standard | $30/mo | ~900 |
| Pro | $60/mo | ~1,800 fast |
For our use case (style references, not production), Basic at $10/mo would suffice.
| Capability | Score |
|---|---|
| Image quality (V8.1) | 9/10 |
| Style consistency (sref/moodboards) | 8/10 |
| Artifact issues | 8/10 |
| API for pipeline | 2/10 |
| Cross-model style transfer | 7/10 |
| Cost efficiency | 8/10 |
| Overall pipeline fit | 6/10 (style reference tool only) |
A structured text description of an image's visual style, produced by sending approved reference images to a vision model (Gemini, Claude, GPT-4o) and asking it to analyze and catalog style attributes. The output is a reusable text block for generation prompts.
| Count | Pros | Cons |
|---|---|---|
| 1 image | Simple, fast | Overfits to one scene's lighting/composition |
| 2 images | Better than one | Still too few to separate scene from style |
| 3-5 images | Sweet spot: enough diversity to find common thread | Recommended |
| 6-10 images | Very thorough | Diminishing returns, may average too aggressively |
| 10+ images | Comprehensive | Descriptions become generic |
Key principle: Images should be diverse in SCENE but consistent in STYLE. This forces the vision model to separate style (what stays the same) from content (what changes).
A 10-category extraction covering: (1) Rendering Technique, (2) Color Palette with hex codes, (3) Line Work, (4) Lighting and Shadow with contrast ratios, (5) Texture and Surface, (6) Character Rendering proportions, (7) Background Treatment, (8) Composition and Framing, (9) What This Style Is NOT, (10) One-Sentence Test. Forces specificity, anchors to observable details, includes negatives.
The 2025-2026 standard: 70-78% of professional AI creators use JSON-formatted style guides. Structured prompts improve accuracy by 60-80% for complex scenes. Eliminates ambiguity, machine-readable, reproducible. GPT-Image-2 responds especially well to structured input.
Community analysis suggests Midjourney uses a dedicated internal style encoder (likely CLIP-based) to extract style features from a reference image and inject them as a separate conditioning signal. Style is kept separate from content.
| Feature | Midjourney --sref | NB2/GPT-Image-2 |
|---|---|---|
| Style encoding | Dedicated encoder, separate channel | General vision encoder, mixed with content |
| Style vs content separation | Explicit | Implicit (must be prompted) |
| Reproducibility | Deterministic codes | No equivalent |
| Weight control | --sw 0-1000 | No direct param |
| Technique | Can Use with NB2? | Can Use with GPT-Image-2? | Post-Processing? | Open Source? |
|---|---|---|---|---|
| StyleTokenizer (ECCV 2024) | No | No | Potentially | Yes |
| StyleBrush (2024) | No | No | Yes (promising) | Yes |
| InstantStyle (2024) | No | No | Yes (via ComfyUI) | Yes |
| IP-Adapter (2023-2025) | No | No | Yes (via ComfyUI) | Yes |
| Z-STAR (CVPR 2024) | No | No | Potentially | Yes |
Bottom line: None can be plugged directly into NB2 or GPT-Image-2 APIs. All require Stable Diffusion. However, they CAN be used as post-processing layers, and their design principles (separate style encoding, targeted injection, JSON prompts) inform better prompt writing.
Generate a base image with GPT-Image-2 (composition, character consistency) then run through a style transfer layer for the specific aesthetic.
Current recommendation: Don't add post-processing yet. Focus on making GPT-Image-2/NB2 produce the right style natively. Post-processing is the fallback. If needed later, start with Option B.
"Colorful watercolor children's book illustration with warm tones" describes half of all children's books. Fix: Use the detailed extraction prompt with hex codes, ratios, named techniques.
Single image = profile captures scene-specific attributes instead of style attributes. Fix: Use 3-5 images with different scenes but same style. Intersection = style.
Style buried in middle/end of prompt. Fix: Style FIRST, clear section markers, repeat critical elements at end.
Vision model defaults to "what is in the image." Fix: Explicitly say "Do NOT describe what is depicted. ONLY describe HOW it is depicted."
Measurable attributes correct but images still look "off." Fix: Always pair text with reference images. Use one-sentence descriptor for emergent quality. Use negative constraints. Iterate.
Model's style distribution bias. Fix: Try the other model, use fine-tuned SD, apply post-processing, or adjust target style.
STEP 1: STYLE EXTRACTION (One-time, ~30 min)
- User provides 3-5 gold standard images
- Run extraction prompt on each via Gemini/Claude
- Aggregate into canonical JSON style profile
- User reviews and approves
STEP 2: PROMPT TEMPLATE CREATION (~15 min)
- Build three-block template: [STYLE] + [CHARACTER] + [SCENE]
- Test with 2-3 generations
- Adjust style profile based on test results
STEP 3: BATCH GENERATION (Per illustration, ~5-10 min)
- Fill in [SCENE] block
- Select 2-3 reference images
- Generate 3-5 candidates
- Automated QC: vision model scores consistency
- Present top 2-3 to user
STEP 4: ITERATIVE REFINEMENT (As needed)
- Fix one attribute at a time
- Re-run with updated profile
| Step | Images | Est. Cost |
|---|---|---|
| Style extraction (vision, text only) | 0 | $0 |
| Template validation | 3-5 | $0.30-$1.00 |
| Per illustration (3-5 candidates) | 3-5 | $0.30-$1.00 |
| Automated QC (vision, text only) | 0 | $0 |
| Full 24-page book (12 spreads) | 36-60 | $3.60-$12.00 |
Nano Banana Pro is Google DeepMind's premium image generation model, built on Gemini 3 Pro. It is a "Thinking" model that understands intent, physics, and composition rather than simply matching keywords.
| Feature | NBP (Pro) | NB2 (Flash) |
|---|---|---|
| Quality | Highest quality, richer textures, more natural lighting | ~95% of Pro quality |
| Speed | Slower (standard generation time) | 3-5x faster than Pro |
| Cost | $0.134/image (Google AI Studio) | Free tier available |
| Text Rendering | Best-in-class legible text | Improved but inferior to Pro |
| Lighting/Composition | Superior complex lighting, depth, perspective | Good but less nuanced |
| Reasoning | Advanced semantic understanding of spatial relationships | Fast but less compositional reasoning |
NBP processes prompts as semantic structures, reasoning about subject relationships, spatial positions, and instructional requirements before generating.
Supports up to 14 reference images. Practical optimum: 5-6 for best consistency. Above 8-10, consistency degrades.
| Slot | Type | Purpose |
|---|---|---|
| 1 | Style Reference | Visual quality, lighting, material feel |
| 2 | Composition Reference | Scene layout, camera angle |
| 3-4 | Character Reference(s) | Identity lock for recurring characters |
| 5 | Quality Anchor | Best previous generation for this scene |
| 6 | Brand Bible (optional) | Color palette, typography rules |
Critical: Every prompt must open with explicit role assignment. Character refs should include 3 views (frontal, 45-degree, side). Minimum 1024x1024 resolution.
| Goal | Temperature | Notes |
|---|---|---|
| Maximum consistency | 0.8-1.0 | Best for character-critical scenes |
| Balanced (default) | 1.0 | Gemini 3 default |
| Creative variation | 1.0-1.4 | Good for exploring compositions |
| Too random (avoid) | >1.5 | Degrades quality and consistency |
Community workflow: Generate 4 variants at temperatures 0.85, 0.95, 1.05, 1.15. Pick best. Edit pass at 0.9. Final polish at 0.85.
Key rules: Never chain edits. Always go back to best original source. Max 10 changes per edit. Short, numbered, concrete visual instructions.
| Spec | NBP (Pro) | NB2 (Flash) |
|---|---|---|
| Base Model | Gemini 3 Pro Image | Gemini 3.1 Flash Image |
| Max Resolution | 4K (4096x4096) | 4K (4096x4096) |
| Max Reference Images | 14 (optimal: 6) | 10 (optimal: 5-6) |
| Max Character Tracking | 5 simultaneous | 3-4 simultaneous |
| Generation Speed (1K) | 10-20 sec | 4-6 sec |
| Cost (2K) | ~$0.134/image | ~$0.08/image |
| Text Rendering | Superior | Good (can blur small text) |
| Criterion | Winner | Notes |
|---|---|---|
| Torn paper fiber texture | NBP | Richer detail, more natural fiber separation |
| Gold foil metallic rendering | NBP | Superior metallic sheen, realistic stipple |
| Gouache paint texture | NBP | More natural brushwork, visible paint body |
| Watercolor transparency | NBP | Better wet-on-wet bleeding, truer pigment mixing |
| Photo-real caustics | NBP | Superior photorealistic integration |
| Overall material fidelity | NBP | NB2 is 90-95% of Pro quality |
Verdict: NBP is superior for character consistency, particularly for maintaining specific material qualities (gold foil, nacre) that define the characters.
| Model | Drift Pattern | Frequency |
|---|---|---|
| NBP | Over-refinement (too polished vs handmade) | ~10% |
| NBP | Unwanted photorealistic elements | ~5% |
| NB2 | Smoothing/simplifying textural detail | ~20% |
| NB2 | Treating collage as filter rather than physical media | ~15% |
| NB2 | Defaulting to more "illustrated" vs "mixed-media" | ~10% |
| Phase | Model | Purpose | Cost |
|---|---|---|---|
| Exploration | NB2 | Rapid composition/layout testing | ~$0.06/img |
| Refinement | NB2 | Prompt tuning before committing to Pro | ~$0.06/img |
| Finals | NBP | Hero images with full reference stack | ~$0.134/img |
| Edits/Fixes | NBP | Targeted edits on near-final images | ~$0.134/img |
Three-model triangle: GPT-Image-2 (structural accuracy, text-heavy spreads) + NB2 (rapid iteration, prompt validation) + NBP (final quality, material fidelity, metallic/iridescent rendering).
| Scenario | Images | Cost |
|---|---|---|
| NB2-only | 80 | ~$4.80 |
| NBP-only | 80 | ~$10.72 |
| Hybrid (NB2 explore + NBP finals) | 80 NB2 + 40 NBP | ~$10.16 |
| Hybrid optimized | 40 NB2 + 20 NBP finals + 20 NBP edits | ~$7.76 |
Recommended total budget: $12-18 for full Draft 3 hybrid approach (covers exploration, refinement, finals, and edits with margin for iteration).
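The scenario totals in the table can be recomputed from the per-image prices. A minimal sketch, assuming ~$0.06 per NB2 image and ~$0.134 per NBP image as listed in the phase table; `scenario_cost` is a hypothetical helper.

```python
NB2_COST = 0.06   # USD per image, exploration and refinement
NBP_COST = 0.134  # USD per image, finals and edits


def scenario_cost(nb2_images, nbp_images):
    """Total USD cost for a mix of NB2 and NBP generations."""
    return round(nb2_images * NB2_COST + nbp_images * NBP_COST, 2)
```

In the optimized hybrid row, the 20 NBP finals and 20 NBP edits count as 40 NBP images, so `scenario_cost(40, 40)` reproduces the ~$7.76 figure.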
Current rule "USE ONLY NB2" should be updated to:
- 67, Phoenix 166, Flux Dev 299, Flux Schnell 298, Lucid Realism 431. Multiple style refs supported with influence weighting.
- POST /variations/universal-upscaler. ARTISTIC mode recommended for illustrated content. Ultra Mode supports up to 8K output. Parameters: creativityStrength, detailContrast, similarity, upscaleMultiplier. AI-native training handles synthetic grain better than general-purpose upscalers.
- de7d3faf-762f-48e0-b3b7-9d0ac3a3fcf3. Built for prompt adherence. Improved style transfer fidelity in 2.0. Rewards precise, descriptive prompting over vague/artistic language. Phoenix-specific params: contrast (3/3.5/4), alchemy: true, ultra: true. Leans photorealistic by default, so it needs prompting effort for stylized looks.

Leonardo AI has a fully public REST API at https://cloud.leonardo.ai/api/rest/v1/. Authentication via Bearer token. New API accounts receive $5 in free credit.
| Operation | Method | Endpoint |
|---|---|---|
| Image Generation | POST | /generations |
| Upload Init Image | POST | /init-image |
| Universal Upscaler | POST | /variations/universal-upscaler |
| Create Dataset | POST | /datasets |
| Train Element (LoRA) | POST | /elements |
Leonardo's direct equivalent to Midjourney --sref. Upload a reference image and include it as a controlnet with the Style Reference preprocessor ID. The system extracts visual style (not content/composition) and applies it to new generations.
| Model | Preprocessor ID |
|---|---|
| SDXL | 67 |
| Phoenix | 166 |
| Flux Dev | 299 |
| Flux Schnell | 298 |
| Lucid Realism | 431 |
Strength Types: Low, Mid, High, Ultra, Max
Multiple Style References: Combine multiple style reference images with an influence parameter (0-1) to blend different style aspects (e.g., torn paper texture at 0.6 + gold foil gouache at 0.4).
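A request body for blended style references could be assembled as below. This is a sketch under assumptions: the preprocessor IDs and strength types come from the tables above, but the controlnet field names (`initImageId`, `initImageType`, `preprocessorId`, `strengthType`, `influence`) are assumed and should be verified against the Leonardo API reference before use. The image IDs in the usage note are placeholders.

```python
# Style Reference preprocessor IDs, copied from the table above.
STYLE_REF_PREPROCESSOR = {
    "SDXL": 67, "Phoenix": 166, "Flux Dev": 299,
    "Flux Schnell": 298, "Lucid Realism": 431,
}
STRENGTH_TYPES = {"Low", "Mid", "High", "Ultra", "Max"}


def style_controlnets(model, refs, strength="High"):
    """Build the controlnets list for one or more style references.

    refs: list of (uploaded_image_id, influence) with influence in [0, 1].
    Field names are assumptions, not confirmed API fields.
    """
    if strength not in STRENGTH_TYPES:
        raise ValueError(f"unknown strength type: {strength}")
    preprocessor_id = STYLE_REF_PREPROCESSOR[model]
    controlnets = []
    for image_id, influence in refs:
        if not 0 <= influence <= 1:
            raise ValueError("influence must be between 0 and 1")
        controlnets.append({
            "initImageId": image_id,
            "initImageType": "UPLOADED",
            "preprocessorId": preprocessor_id,
            "strengthType": strength,
            "influence": influence,
        })
    return controlnets
```

A torn-paper reference at 0.6 and a gold-foil gouache reference at 0.4 would be `style_controlnets("Phoenix", [("<torn-paper-id>", 0.6), ("<gold-foil-id>", 0.4)])`.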
For deeper style consistency, train custom LoRA models ("Elements") on a set of reference images. The style becomes baked into a reusable model weight.
POST /datasets)
POST /elements)
userElements array

Limitation: Element training currently supports SDXL-based models. Compatibility with Phoenix and Flux models for custom Elements should be verified.
| Feature | Midjourney --sref | Leonardo Style Ref | Leonardo Elements (LoRA) |
|---|---|---|---|
| Style from image | Yes | Yes (controlnet) | N/A (training-based) |
| API access | No (Discord only) | Yes (REST API) | Yes (REST API) |
| Strength control | --sw 0-1000 | Low/Mid/High/Ultra/Max | weight 0-1.0 |
| Multiple refs | Yes | Yes (multiple controlnets) | Yes (blend Elements) |
| Reusable style code | Yes (style code) | No persistent code | Yes (trained Element ID) |
| Training required | No | No | Yes (10-50 images) |
| Batch consistency | High | Moderate-High | High |
Key advantage: Leonardo's LoRA training on 10-50 reference images can learn nuances more deeply than a single reference image can convey. This dual approach (Element + Style Reference) would likely outperform Midjourney --sref for batch style consistency.
| API Credits | Basic ($) | Standard ($) | Pro ($) |
|---|---|---|---|
| 5,000 | $15.00 | $11.00 | $8.00 |
| 10,000 | $30.00 | $22.00 | $16.00 |
| 25,000 | $75.00 | $55.00 | $40.00 |
| 50,000 | $150.00 | $110.00 | $80.00 |
| Scenario | Credits/image | Cost at Pro | Cost at Basic |
|---|---|---|---|
| Standard gen (SDXL) | ~5 | ~$0.008 | ~$0.015 |
| Alchemy-enhanced | ~12 | ~$0.019 | ~$0.036 |
| Phoenix generation | ~8-15 | ~$0.013-$0.024 | ~$0.024-$0.045 |
| With style reference | +2-5 extra | +$0.003-$0.008 | +$0.006-$0.015 |
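The per-image figures above follow from the credit pack prices, which scale linearly with pack size. A minimal sketch, assuming the 5,000-credit prices from the pricing table; `cost_per_image` is a hypothetical helper.

```python
# USD price of a 5,000-credit pack per plan tier, from the pricing table.
PRICE_PER_5000 = {"Basic": 15.00, "Standard": 11.00, "Pro": 8.00}


def cost_per_image(credits, tier):
    """USD cost of one generation that consumes `credits` API credits."""
    return credits * PRICE_PER_5000[tier] / 5000
```

A Phoenix generation with a style reference uses roughly 10 to 20 credits, which at the Pro tier works out to about $0.016 to $0.032 per image.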
| Model | Cost/image | Notes |
|---|---|---|
| Leonardo (Pro, Phoenix + sref) | ~$0.016-$0.032 | Cheapest option |
| NB2 (Replicate) | $0.04-$0.08 | Fixed per generation |
| NBP (Google AI Studio) | $0.08-$0.134 | Fixed per generation |
| GPT-Image-2 | ~$0.02-$0.08 | Varies by resolution |
Endpoint: POST /variations/universal-upscaler
- ultraUpscaleStyle: ARTISTIC (for illustrations) or REALISTIC
- creativityStrength: How much the AI adds detail (1-10)
- detailContrast: HDR-like detail contrast (1-10)
- similarity: Structural similarity to original (1-10)
- upscaleMultiplier: Scale factor

AI-native training handles synthetic grain better than general-purpose upscalers like Real-ESRGAN.
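A hedged sketch of a request body for the upscaler endpoint, built from the parameters named in these notes. The `initImageId` field, the default values, and the "positive number" check on the multiplier are assumptions for illustration; only the five parameter names and the 1-10 ranges come from the notes. The function makes no network call.

```python
from typing import Any, Dict


def build_upscale_payload(
    image_id: str,
    style: str = "ARTISTIC",
    creativity_strength: int = 5,
    detail_contrast: int = 5,
    similarity: int = 5,
    upscale_multiplier: float = 2.0,
) -> Dict[str, Any]:
    """Build a JSON-ready body for POST /variations/universal-upscaler."""
    if style not in ("ARTISTIC", "REALISTIC"):
        raise ValueError(f"unknown upscale style: {style!r}")
    # The notes give a 1-10 range for these three controls.
    for name, value in (
        ("creativityStrength", creativity_strength),
        ("detailContrast", detail_contrast),
        ("similarity", similarity),
    ):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be in 1-10, got {value}")
    if upscale_multiplier <= 0:
        raise ValueError("upscaleMultiplier must be positive")
    return {
        "initImageId": image_id,  # assumed field name for the source image
        "ultraUpscaleStyle": style,
        "creativityStrength": creativity_strength,
        "detailContrast": detail_contrast,
        "similarity": similarity,
        "upscaleMultiplier": upscale_multiplier,
    }


# ARTISTIC is the mode the notes recommend for illustrations.
upscale_payload = build_upscale_payload("example-image-id", similarity=8)
```

Keeping `similarity` high is one way to bias the upscaler toward preserving the original structure; the right values would need to be found by testing.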
No dedicated API endpoint confirmed. Canvas outpainting appears to be a web UI feature only. For programmatic outpainting, workarounds include:
This is a gap in Leonardo's API for the Ollie & Dot pipeline, where 3:2 to 2:1 aspect ratio conversion is needed.
- de7d3faf-762f-48e0-b3b7-9d0ac3a3fcf3
- contrast (3/3.5/4), alchemy: true, ultra: true

Style Reference Image + Trained Element + Precise Prompt → Leonardo Phoenix API → Raw Generation (1024x768) → Universal Upscaler (ARTISTIC mode) → Print-Ready Illustration (4K+)
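The settings above could be collected into a generation request body along these lines. This is a sketch under stated assumptions: the UUID is treated as the model identifier, and the field names `modelId`, `prompt`, `width`, and `height` are assumed rather than taken from the notes. The `contrast`, `alchemy`, and `ultra` settings and the 1024x768 raw size come from the notes.

```python
from typing import Any, Dict

# UUID from the notes; assumed here to be the model identifier.
PHOENIX_MODEL_ID = "de7d3faf-762f-48e0-b3b7-9d0ac3a3fcf3"
ALLOWED_CONTRAST = (3, 3.5, 4)  # contrast values listed in the notes


def build_generation_payload(prompt: str, contrast: float = 3.5) -> Dict[str, Any]:
    """Build a JSON-ready generation body using the settings from the notes."""
    if contrast not in ALLOWED_CONTRAST:
        raise ValueError(f"contrast must be one of {ALLOWED_CONTRAST}, got {contrast}")
    return {
        "modelId": PHOENIX_MODEL_ID,  # assumed field name
        "prompt": prompt,             # assumed field name
        "width": 1024,                # raw generation size from the notes
        "height": 768,
        "contrast": contrast,
        "alchemy": True,
        "ultra": True,
    }


generation_payload = build_generation_payload(
    "Gouache baby octopus swimming over a photographic Hawaiian reef"
)
```

The raw 1024x768 output would then go to the upscaler stage, per the pipeline line above.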
| Model | Underwater | Mixed-Media | Char. Consistency | Style Consistency | Cost/Image |
|---|---|---|---|---|---|
| Flux 2 Pro | 9/10 | 5/10 | 8/10 | 8/10 | $0.03–0.055 |
| Flux Kontext | — | — | 8/10 | 8/10 | $0.02–0.04 |
| Ideogram 3.0 | 5/10 | 6/10 | 8/10 | 9/10 | $0.03–0.09 |
| Recraft V3 | 4/10 | 5/10 | 6/10 | 9/10 | $0.04–0.08 |
| SD 3.5 | 6/10 | 6/10 | 8/10 | 8/10 | $0.065 |
| Leonardo Phoenix | 6/10 | 4/10 | 7/10 | 7/10 | $0.007–0.02 |
| InstantCharacter | — | — | 7/10 | — | Via FAL.ai |
| Pipeline Approach | Est. Cost (200 gens) |
|---|---|
| Leonardo only | $1.40–$4.00 |
| Flux Dev + Kontext (FAL) | $4.00–$5.00 |
| Flux Pro + Kontext Pro | $7.00–$10.00 |
| NB2 (free) + GPT-Image-2 | $0 + $4.00 |
| Multi-model hybrid | $5.00–$15.00 |
The Ollie and Dot style is fundamentally a compositing problem, not a single-generation problem. The style requires: photorealistic underwater environments (Hawaiian reef, caustics, light physics), torn paper collage elements with visible fiber texture, gold foil metallic stipple dots, gouache-painted cartoon characters (baby octopus + pearl), watercolor paper base texture, bioluminescent glow effects, and nacre/pearl iridescence. No single model excels at all of these simultaneously.
Flux 2: Best-in-class photorealism in the open-weight space. 4MP output, strongest underwater environments of any model evaluated. Available as Pro ($0.03–0.055/image) and Dev ($0.025/image). Full LoRA support on Dev/Klein for custom style training. Flux Kontext enables instruction-based editing with character identity preservation across scenes.
Ideogram 3.0: Style References (up to 3 images) + reusable Style Codes for batch consistency. 4.3 billion style presets. Character Reference for facial/trait consistency. Best-in-class text rendering. Turbo: $0.03/image, Quality: $0.09/image. No LoRA training; inference-time only. May not preserve photo-vs-illustration contrast.
Recraft V3: 20B parameter design-focused model. Style Lock constrains line weight, color application, detail level, proportions, and rendering approach. Native SVG output. Character consistency still developing (feature request on roadmap). Raster: $0.04/image, Vector: $0.08/image. Design-grade illustration, not photorealism.
SD 3.5: Most mature LoRA ecosystem. IP-Adapter + ControlNet = most fine-grained control available. Canny edge, depth maps, blur ControlNets. Can combine Style IP-Adapter + Depth ControlNet + Character LoRA. Lower base quality than Flux 2, but unmatched control mechanisms. $0.065/image (API) or free self-hosted.
Leonardo Phoenix: Style/Character/Content Reference via ControlNet preprocessors. Custom LoRA training ("Elements") on 10–50 images. Very affordable ($0.007–0.02/image). Leans photorealistic by default. Will homogenize mixed-media layers. Best role: cheap iteration and concept testing.
InstantCharacter: Tuning-free character consistency from a single reference image. Built on a scalable diffusion transformer, trained on 10M+ character samples. Available on FAL.ai. Designed primarily for humanoid characters; performance with a cartoon baby octopus character is untested.
| Model | Mechanism | Type |
|---|---|---|
| Leonardo Phoenix | Style/Character/Content Reference (3 preprocessors) | Inference-time |
| Leonardo Elements | LoRA training on 10–50 images | Trained |
| Flux Kontext | Instruction-based editing | Inference-time |
| Flux LoRA | Custom fine-tuning | Trained |
| Ideogram 3.0 | Style Reference images + Style Codes | Inference-time |
| Recraft V3 | Style Lock | Inference-time |
| SD3.5 | IP-Adapter + ControlNet | Inference-time |
| NB2/NBP | Subject consistency (up to 4 character refs on NB2, 5 on NBP) | Inference-time |
| InstantCharacter | Single-image character preservation | Inference-time |
| Model/Platform | Cost/Image | Best For |
|---|---|---|
| Leonardo (Apprentice) | ~$0.007 | Cheapest, token-based |
| Flux Kontext Dev (FAL) | $0.02 | Style transfer, editing |
| Flux 2 Dev (FAL) | ~$0.025 | LoRA-based generation |
| Ideogram 3.0 Turbo | $0.03 | Fast style-consistent |
| Flux 2 Pro (BFL) | $0.03–0.055 | Top photorealism |
| Recraft V3 (raster) | $0.04 | Design illustration |
| GPT Image 2 | $0.04 | Composition + text |
| Ideogram 3.0 Quality | $0.09 | Maximum quality |
| NB2/NBP (Google AI Studio) | $0.134 (paid) / free | Current primary model |
[Flux 2 Pro] [NB2/NBP] [GPT-Image-2]
| | |
Photorealistic Characters Collage/Texture
Underwater BG (octopus, pearl) Elements
| | |
+----------+------------------+------------------------+
|
[Composite Layer]
(manual or GPT-Image-2 compose)
|
[Style Unification]
(Flux Kontext OR trained LoRA)
|
[Upscale + Final QC]
(Flux 2 Pro 4MP or Gemini review)
|
Final Spread
FAL.ai is a unified API aggregator: one API key, one billing system, and access to Flux 2 Pro/Dev/Schnell, Flux Kontext Pro/Dev, Ideogram 3.0, Recraft V3, SD 3.5, InstantCharacter, NB2, and community LoRAs. Pay-per-use with no subscription. Swap models by changing an endpoint string. Ideal platform layer regardless of which models are chosen.
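To illustrate "swap models by changing an endpoint string", here is a minimal sketch that routes each pipeline layer through a lookup table. The endpoint strings are hypothetical placeholders, not verified model IDs, and `submit` is an injected callable standing in for whatever client call the platform provides.

```python
from typing import Any, Callable, Dict

# Hypothetical endpoint strings: placeholders, not verified platform model IDs.
LAYER_ENDPOINTS: Dict[str, str] = {
    "background": "example/flux-2-pro",
    "style_unification": "example/flux-kontext-pro",
}

Submit = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def run_layer(
    layer: str,
    prompt: str,
    submit: Submit,
    endpoints: Dict[str, str] = LAYER_ENDPOINTS,
) -> Dict[str, Any]:
    """Send one layer's prompt to whichever endpoint is mapped to that layer."""
    endpoint = endpoints[layer]
    return submit(endpoint, {"prompt": prompt})


def fake_submit(endpoint: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for a real client call; echoes what would have been sent."""
    return {"endpoint": endpoint, "prompt": arguments["prompt"]}


# Swapping the model for a layer is a one-string change in the mapping.
swapped = dict(LAYER_ENDPOINTS, background="example/flux-2-dev")
default_result = run_layer("background", "Hawaiian reef drop-off", fake_submit)
swapped_result = run_layer("background", "Hawaiian reef drop-off", fake_submit, swapped)
```

Passing `submit` in keeps the routing logic testable without an API key or network access.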
No single model replaces NB2/NBP, but specific models handle specific layers much better.
The best path forward: enhance the current NB2/NBP + GPT-Image-2 pipeline with targeted additions, accessed via FAL.ai as a unified platform.