Persona -> turns -> product signals.
Failure is measured as behavior over time, not as a one-shot aesthetic score.
VIVID turns image-model evaluation into a longitudinal debugging lens. In this demo, persona-grounded users edit across five mobile photo scenarios, revealing where trust strengthens, where friction accumulates, and which model behaviors need deeper inspection.
A single edited image is not enough. VIVID tracks whether trust holds across repeated edits and which failures change user behavior.
Model score and friction coverage are separate signals. That is why image-modeling decisions need persona-specific, multi-turn evaluation.
Persona panels produce two different signals: a faster model-quality estimate and a slower, broader friction map.
Do not collapse evaluation to a single average score. Pair model-level acceptance with persona-specific friction coverage so engineers can see which user contexts still need inspection.
Every persona edits turn by turn from its own feedback and memory. Saved outputs are then evaluated by three judge families so the analysis can separate model behavior, persona differences, and judge-family variance.
Each persona simulation agent has habits, tolerance thresholds, alternative services, and memory of prior friction.
The next turn is driven by that persona's own feedback, so drift can accumulate instead of being reset.
Claude judges in-session; GPT-5.4-mini and Grok-4.3 rejudge post-session to check family bias.
The output is acceptance, friction category, persona split, judge disagreement, and product validation implication.
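The turn loop described above can be sketched in a few lines. This is a minimal illustration, not the VIVID implementation: `PersonaAgent`, `tolerance`, `next_prompt`, `react`, and `run_journey` are hypothetical names, and the per-turn scores are supplied by hand rather than produced by a judge model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PersonaAgent:
    """Hypothetical persona simulation agent with a tolerance threshold and memory."""
    name: str
    tolerance: float  # assumed: minimum per-turn score the persona accepts
    memory: List[str] = field(default_factory=list)  # friction remembered so far

    def next_prompt(self, base_request: str) -> str:
        # Carry accumulated friction into the next turn instead of resetting it.
        if not self.memory:
            return base_request
        return f"{base_request} (fix: {'; '.join(self.memory)})"

    def react(self, score: float, issue: str) -> None:
        # Remember an issue only when the turn falls below the persona's tolerance.
        if score < self.tolerance:
            self.memory.append(issue)


def run_journey(agent: PersonaAgent, base_request: str,
                turn_results: List[Tuple[float, str]]) -> List[str]:
    """Replay a journey: each prompt is built from the persona's own prior feedback."""
    prompts = []
    for score, issue in turn_results:
        prompts.append(agent.next_prompt(base_request))
        agent.react(score, issue)
    return prompts


seller = PersonaAgent(name="small-business seller", tolerance=0.5)
prompts = run_journey(
    seller,
    "brighten the menu photo",
    [(0.4, "shadow over menu copy"), (0.8, "none"), (0.3, "blurred price")],
)
```

Because memory is never cleared between turns, the third prompt still asks to fix the shadow reported on turn 1, which is the drift-accumulation behavior the text describes.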
openai / gpt-image-2
google_vertex / gemini-3-pro-image-preview
These cards are pulled from saved journey JSON and turn-level image outputs. Each one shows the seed image, turn-5 result, per-turn evaluation, and persona state changes that explain why a user would keep trusting the service or start working around it.
PSA means persona simulation agent: a persona-conditioned evaluator with goals, memory, tolerance thresholds, and turn-by-turn feedback.
For a professional portrait user, repeated cleanup turns compounded skin and hand texture artifacts. Quality and trust fell together, which is the longitudinal failure the average score hides.
[Images: seed and turn-5 result]
For a small-business seller, the relevant product question is not beauty alone. A persistent shadow over menu copy keeps acceptance low and trust erodes across the journey.
[Images: seed and turn-5 result]
A seller persona does not forgive corrupted menu prices or pseudo-text, even when the scene looks warmer. This is a switching-risk signal for practical image workflows.
[Images: seed and turn-5 result]
The novice travel/photo user keeps very high acceptance across five turns. The remaining issues are mild softness and foreground blur, not identity or evidence damage.
[Images: seed and turn-5 result]
GPT Image 2 wins more scenarios by three-judge user acceptance and is especially strong in S4/S5, where text, price, menu, and commerce trust failures are high-risk for mobile photo editing.
This is a compact 8-persona pilot across five scenarios. Treat it as directional evidence for a deeper Apple validation study, not as a standalone deployment decision.
The design crosses 5 scenarios x 2 models x 8 personas x 5 turns x 3 judge families.
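As a quick sanity check, the design factors can be multiplied out to confirm the 80 journeys and 400 generated turns reported for this study. The dictionary keys are illustrative names; the judgment count is plain arithmetic on the stated factors, not a figure quoted by the source.

```python
# Study design factors as stated in the text.
design = {"scenarios": 5, "models": 2, "personas": 8, "turns": 5, "judge_families": 3}

# One journey = one persona editing one scenario with one model.
journeys = design["scenarios"] * design["models"] * design["personas"]

# Each journey generates one image per turn.
generated_turns = journeys * design["turns"]

# Each generated turn is scored by every judge family.
judgments = generated_turns * design["judge_families"]

print(journeys, generated_turns, judgments)
```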
Acceptance is a 0-1 proxy for whether the persona would keep using the baseline product experience. It is more useful as a relative lift and scenario split than as an absolute score.
A friction hit is a derived signal: acceptance below 0.40, frustration above 0.03, or a severity 4-5 issue. It captures product risk that a mean score can hide.
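The derived signal above is a simple disjunction of three conditions, which can be written as a predicate. This is a sketch under assumptions: the function and parameter names are hypothetical, and severity is assumed to be an integer on a 1-5 scale.

```python
def is_friction_hit(acceptance: float, frustration: float, max_severity: int) -> bool:
    """Flag a turn when any one of the three stated conditions holds."""
    return (
        acceptance < 0.40       # acceptance below 0.40
        or frustration > 0.03   # frustration above 0.03
        or max_severity >= 4    # a severity 4-5 issue (assumed 1-5 integer scale)
    )


# A turn with good acceptance is still flagged when a severity-4 issue is present.
flagged = is_friction_hit(acceptance=0.72, frustration=0.0, max_severity=4)
clean = is_friction_hit(acceptance=0.72, frustration=0.0, max_severity=2)
```

Because the conditions are joined with "or", a turn can score well on average and still count as friction, which is why this signal surfaces risk that a mean hides.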
Trust is longitudinal. In this study the mean rises, but spread grows across turns, meaning users polarize rather than cleanly converge.
Across 80 journeys and 400 generated turns, friction hit rate rises from 25.0% at T1 to 77.5% at T5. Half of journeys show their first friction by T2, and 63.7% by T3.
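The two statistics in this paragraph are different cuts of the same per-turn flags: the hit rate counts journeys flagged at a given turn, while the first-friction share counts journeys flagged at that turn or earlier. The sketch below uses made-up toy data (four journeys, not the study's 80), and the function names are hypothetical.

```python
from typing import List


def hit_rate_by_turn(journeys: List[List[bool]]) -> List[float]:
    """Share of journeys flagged as friction at each turn."""
    n_turns = len(journeys[0])
    return [sum(j[t] for j in journeys) / len(journeys) for t in range(n_turns)]


def first_friction_share(journeys: List[List[bool]]) -> List[float]:
    """Cumulative share of journeys whose first friction hit occurred by each turn."""
    n_turns = len(journeys[0])
    # Index of the first flagged turn per journey, or None if never flagged.
    firsts = [j.index(True) if True in j else None for j in journeys]
    return [
        sum(f is not None and f <= t for f in firsts) / len(journeys)
        for t in range(n_turns)
    ]


# Toy data: one row per journey, one boolean per turn (T1..T5).
toy = [
    [False, True, True, True, True],
    [False, False, False, False, True],
    [True, False, True, True, True],
    [False, False, False, False, False],
]
rates = hit_rate_by_turn(toy)
cumulative = first_friction_share(toy)
```

The cumulative share can never decrease from turn to turn, whereas the per-turn hit rate can, since a journey flagged early may recover later.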
Mean trust moves from 0.546 to 0.654, but trust spread expands from 0.022 to 0.137. Low-friction journeys end near 0.778 trust; high-friction journeys end near 0.544.
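A rising mean with a widening spread can be checked per turn once trust is logged for every journey. The source does not define "spread", so this sketch assumes population standard deviation; the function name and the toy values are illustrative only.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def trust_profile(trust_by_journey: List[List[float]]) -> List[Tuple[float, float]]:
    """Return (mean, spread) per turn; spread is population std dev (an assumption)."""
    n_turns = len(trust_by_journey[0])
    profile = []
    for t in range(n_turns):
        values = [journey[t] for journey in trust_by_journey]
        profile.append((mean(values), pstdev(values)))
    return profile


# Toy data: two journeys that start together and then polarize.
toy_trust = [[0.5, 0.8], [0.5, 0.4]]
profile = trust_profile(toy_trust)
```

In the toy data the mean rises from 0.5 to 0.6 while the spread grows from 0 to 0.2, the same qualitative pattern as the polarization described above.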
GPT Image 2 wins 4 of 5 scenarios by three-judge user acceptance.
GPT Image 2: avg acceptance 0.478; avg quality 0.704.
Nano Banana: avg acceptance 0.452; avg quality 0.619.
S1: Measure whether different Apple users trust a built-in mobile editor when face edits must stay memory-faithful and consent-safe across repeated turns.
S2: Measure whether a mobile editor improves a family/pet memory without inventing anatomy, texture, or scene evidence that breaks emotional trust.
S3: Measure whether a travel-photo edit stays documentary enough for memory and sharing while still providing the vivid polish users expect from mobile AI.
S4: Measure whether an everyday mobile editor can polish commerce-like or practical photos without corrupting text, labels, product identity, or evidence value.
S5: Measure whether an everyday mobile editor can make a practical food/tabletop image more appealing without corrupting menu text, food identity, prices, or commercial trust.
P01, privacy-sensitive family and pet photo keeper: face and pet identity, natural skin/fur, consent-sensitive family memory trust
P02, professional portrait photographer: identity preservation, retouching artifacts, delivery readiness
P03, accessibility-aware casual iPhone photo editor: simple controls, readable/natural contrast, confusing failures, over-edited look
P04, social content creator: share-ready polish, speed, subject/background consistency
P05, small-business product seller: product identity, readable labels, commercial trust
P06, Korean mobile-first creator: Korean prompt adherence, subtle portrait edits, local control
P07, Chinese product and food-photo seller: text preservation, product truthfulness, color accuracy
P08, Japanese novice travel-photo user: landmark preservation, natural color, low tolerance for repeated drift
S1 is the Nano Banana pocket; S4/S5 are GPT Image 2 pockets. The same aggregate winner does not explain text, commerce, identity, and memory risk equally.
Directional only: expert personas show 9 GPT-favored scenario cells vs. 1 Nano-favored cell.
Cross-scenario volatility identifies who changes preference most as the task changes.
With n=8, treat correlations as leads for follow-up sampling, not statistical claims.
Language is useful as a routing dimension, but scenario risk still drives the strongest differences.
Majority rates show how often the three judge families agree on which model wins. Spread shows residual judge-family disagreement that should be reviewed before higher-stakes product decisions.
| Scenario | Risk axis | Quality majority | Acceptance majority | Acceptance spread |
|---|---|---|---|---|
| S1 | Identity trust | 0.650 | 0.625 | 0.204 |
| S2 | Memory/action realism | 0.475 | 0.700 | 0.290 |
| S3 | Place fidelity | 0.700 | 0.650 | 0.245 |
| S4 | Text trust | 0.175 | 0.600 | 0.220 |
| S5 | Commerce trust | 0.425 | 0.600 | 0.274 |
Cells show GPT Image 2 acceptance minus Nano Banana acceptance. Positive values indicate a pull toward GPT Image 2, negative values indicate a pull toward Nano Banana, and gray indicates a practical tie.
| Persona | S1 | S2 | S3 | S4 | S5 | GPT prefs | Nano prefs | Max shift |
|---|---|---|---|---|---|---|---|---|
| P01 Family/pet keeper | -0.020 | -0.015 | +0.025 | +0.080 | -0.026 | 1 | 0 | 0.106 |
| P02 Pro photographer | -0.021 | +0.139 | +0.074 | +0.042 | +0.097 | 4 | 0 | 0.159 |
| P03 Casual access. | -0.177 | +0.060 | +0.007 | +0.100 | -0.096 | 2 | 2 | 0.278 |
| P04 Social creator | -0.020 | -0.026 | -0.047 | +0.089 | +0.042 | 2 | 1 | 0.136 |
| P05 Small business | -0.032 | -0.081 | +0.038 | +0.061 | +0.096 | 3 | 2 | 0.177 |
| P06 Korean creator | -0.024 | +0.007 | +0.055 | +0.030 | +0.078 | 2 | 0 | 0.102 |
| P07 Chinese seller | -0.056 | +0.176 | -0.006 | +0.052 | +0.097 | 3 | 1 | 0.232 |
| P08 Travel novice | -0.002 | -0.015 | +0.033 | +0.072 | +0.184 | 3 | 0 | 0.199 |
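The summary columns can be recomputed from the per-scenario deltas. The sketch below assumes a practical-tie band of ±0.03, a threshold inferred from the table rather than stated in the source, and it takes max shift as the largest delta minus the smallest. `TIE_BAND` and `summarize` are hypothetical names.

```python
TIE_BAND = 0.03  # assumption: inferred from the table, not stated in the source

# GPT Image 2 minus Nano Banana acceptance for S1..S5, copied from the table.
deltas = {
    "P01": [-0.020, -0.015, 0.025, 0.080, -0.026],
    "P02": [-0.021, 0.139, 0.074, 0.042, 0.097],
    "P03": [-0.177, 0.060, 0.007, 0.100, -0.096],
    "P04": [-0.020, -0.026, -0.047, 0.089, 0.042],
    "P05": [-0.032, -0.081, 0.038, 0.061, 0.096],
    "P06": [-0.024, 0.007, 0.055, 0.030, 0.078],
    "P07": [-0.056, 0.176, -0.006, 0.052, 0.097],
    "P08": [-0.002, -0.015, 0.033, 0.072, 0.184],
}


def summarize(row, band=TIE_BAND):
    """Count model preferences outside the tie band and the max cross-scenario shift."""
    gpt_prefs = sum(d > band for d in row)
    nano_prefs = sum(d < -band for d in row)
    max_shift = round(max(row) - min(row), 3)
    return gpt_prefs, nano_prefs, max_shift


summary = {persona: summarize(row) for persona, row in deltas.items()}
```

Under this assumed band the preference counts match the table for all eight personas. The recomputed max shift can differ from the table by 0.001 for some rows (P02 and P03), presumably because the published values were derived from unrounded deltas.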