Scaling Preference Elicitation
The Engineering Challenge
The theoretical framework tells us what to measure. The engineering challenge is how to measure it reliably, at scale, across diverse populations, without sacrificing the precision that makes the data scientifically useful.
Collecting human preference data sounds simple: show people two things, ask which they prefer, record the answer. In practice, every decision in the elicitation pipeline affects the quality and interpretability of the data.
Three Modalities, One Profile
We collect preferences through three complementary modalities, each probing different aspects of the preference landscape.
Questions (~10 per profile). High-level constraints that partition preference space coarsely. These establish the broad contours—the major axes along which this person's preferences vary. Questions are cheap to answer and provide scaffolding for the more precise modalities that follow.
Reference labels (~30 per profile). The annotator sees a stimulus and marks it on a quality scale. This gives absolute anchoring: it tells you not just ordering but threshold—where "good enough" lives for this person. Labels are more informative per item than comparisons but more susceptible to scale calibration issues.
Pairwise comparisons (~100 per profile). Forced choices between alternatives. This is where the precision lives. Each comparison gives you one bit of ordinal information about the local structure of preference space. A hundred comparisons, chosen adaptively based on previous responses, can recover a detailed preference surface.
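One simple way to choose comparisons adaptively is uncertainty sampling: ask next about the unasked pair whose outcome the current fit is least sure of. The sketch below assumes a fitted utility vector and a set of already-asked pairs; the function name and policy are illustrative, not the platform's actual selection strategy.

```python
import numpy as np
from itertools import combinations

def next_pair(utilities, asked):
    """Pick the most informative next comparison: the unasked pair whose
    predicted outcome is closest to a coin flip under the current utilities.
    A minimal uncertainty-sampling sketch (hypothetical interface)."""
    best, best_gap = None, float("inf")
    for i, j in combinations(range(len(utilities)), 2):
        if (i, j) in asked or (j, i) in asked:
            continue
        # P(i beats j) under a Bradley-Terry-style model.
        p = 1.0 / (1.0 + np.exp(-(utilities[i] - utilities[j])))
        gap = abs(p - 0.5)  # near 0.5 => close to one full bit of information
        if gap < best_gap:
            best, best_gap = (i, j), gap
    return best

# Items 0 and 1 are nearly tied, so that pair is the most informative to ask.
pair = next_pair(np.array([0.0, 0.1, 2.0]), asked=set())
```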
The mathematical foundation is Bradley-Terry preference modeling. Given pairwise outcomes, we fit a latent utility function that maximizes the likelihood of the observed choices. The resulting surface is the taste profile—a differentiable function over stimulus space that predicts preference for novel stimuli.
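The maximum-likelihood fit can be written in a few lines. Below is a minimal gradient-ascent sketch, assuming comparisons arrive as (winner, loser) index pairs; it illustrates the Bradley-Terry likelihood, not the platform's actual implementation.

```python
import numpy as np

def fit_bradley_terry(n_items, comparisons, n_iters=500, lr=0.1):
    """Fit latent utilities u with P(i beats j) = sigmoid(u_i - u_j),
    maximizing the likelihood of observed (winner, loser) pairs.
    A hypothetical interface, sketched for illustration."""
    u = np.zeros(n_items)
    winners = np.array([w for w, _ in comparisons])
    losers = np.array([l for _, l in comparisons])
    for _ in range(n_iters):
        p_win = 1.0 / (1.0 + np.exp(-(u[winners] - u[losers])))
        grad = np.zeros(n_items)
        np.add.at(grad, winners, 1.0 - p_win)    # push winner utility up
        np.add.at(grad, losers, -(1.0 - p_win))  # push loser utility down
        u += lr * grad
        u -= u.mean()  # utilities are identifiable only up to a constant
    return u

# Toy data: 0 beats 1 twice, 1 beats 2 twice, 0 beats 2 once.
utilities = fit_bradley_terry(3, [(0, 1), (0, 1), (1, 2), (1, 2), (0, 2)])
```

The recovered utilities order the items consistently with the observed wins; evaluating the fitted function on unseen pairs gives the predicted preference for novel stimuli.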
Quality Control for Subjective Data
The failure mode specific to subjective data is inconsistency. If someone prefers A to B, B to C, and C to A, the Bradley-Terry model can't fit it cleanly. Some cyclicity is natural—preferences are genuinely intransitive in certain domains—but excessive cyclicity signals inattention or confusion.
We monitor three quality metrics in real time:
- Internal consistency. Fraction of comparison triples that satisfy transitivity. A rate below threshold triggers a quality flag.
- Test-retest reliability. Correlation between the same annotator's preferences measured at different times. Low reliability suggests the annotator isn't engaging with the stimuli.
- Inter-annotator agreement. For calibration stimuli shown to multiple annotators, how much do preferences converge? This separates universal structure from idiosyncratic noise.
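The internal-consistency metric can be computed directly from comparison outcomes. A sketch, assuming preferences are stored as a set of (winner, loser) pairs covering every pair among the items (names hypothetical):

```python
from itertools import combinations

def transitivity_rate(wins, items):
    """Fraction of item triples whose pairwise outcomes are transitive.

    wins: set of (winner, loser) pairs, one per compared pair.
    A hypothetical helper, assuming every pair among `items` was compared.
    """
    def beats(a, b):
        return (a, b) in wins
    consistent = total = 0
    for a, b, c in combinations(items, 3):
        total += 1
        # A triple is cyclic iff the outcomes chain around in either direction.
        if beats(a, b) and beats(b, c) and beats(c, a):
            continue
        if beats(b, a) and beats(c, b) and beats(a, c):
            continue
        consistent += 1
    return consistent / total if total else 1.0

# One cycle among {0, 1, 2}; item 3 loses to everyone.
wins = {(0, 1), (1, 2), (2, 0), (0, 3), (1, 3), (2, 3)}
rate = transitivity_rate(wins, [0, 1, 2, 3])
```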
Low-quality annotations get flagged. Annotators with persistently low quality scores get retrained or removed from the pool. The quality pipeline runs continuously—there's no batch review step where bad data accumulates undetected.
Version Control for Evolving Preferences
Preferences change. What you found compelling last year isn't necessarily what compels you now. A static model misses this—and more importantly, misses the fact that preference evolution is itself data.
We treat taste profiles as versioned artifacts. Each profile has a commit history: immutable snapshots that record the state of preference at a given time. You can branch to explore variations without losing your current state. You can merge branches when an experiment works out. You can diff two versions to see what changed.
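The commit/branch/diff mechanics can be sketched with a small in-memory class. The names and storage format here are illustrative assumptions (preferences as an item-to-utility mapping), not the platform's actual schema.

```python
import time

class TasteProfile:
    """Minimal sketch of a versioned taste profile: immutable snapshots
    plus branch and diff. Hypothetical API, for illustration only."""

    def __init__(self):
        self.prefs = {}    # item -> utility (working state)
        self.history = []  # list of (timestamp, frozen snapshot)

    def commit(self):
        """Record an immutable snapshot of the current state."""
        self.history.append((time.time(), dict(self.prefs)))

    def branch(self):
        """Fork the profile to explore variations without losing state."""
        clone = TasteProfile()
        clone.prefs = dict(self.prefs)
        clone.history = list(self.history)
        return clone

    @staticmethod
    def diff(old, new):
        """Items whose utility changed between two snapshots."""
        keys = set(old) | set(new)
        return {k: (old.get(k), new.get(k))
                for k in keys if old.get(k) != new.get(k)}
```

A branch starts from the parent's snapshots, diverges as new commits accumulate, and a diff between any two snapshots surfaces exactly what changed.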
The trajectory of a person's taste through preference space reveals dynamics. Which aesthetic attractors are stable, which transitions are common, where individual variation is high versus low. This temporal structure is as informative as the cross-sectional data. It tells you about the dynamics of experiential space, not just its topology.
Multi-Modal Coverage
Experience isn't only visual. The platform supports annotation across modalities—images, video, text, code, websites, audio, documents—each with modality-specific rendering and interaction patterns optimized for honest, rapid evaluation.
Different modalities probe different dimensions of experiential space. Visual aesthetics probe spatial integration and valence. Music probes temporal dynamics and arousal. Narrative probes counterfactual engagement and self-model salience. Code aesthetics probe structural elegance—a dimension of experience that programmers know intimately but that has received almost no formal attention.
Cross-modal preference data is especially valuable. If someone's visual preferences and their musical preferences share structure—if what they find beautiful in images predicts what they find beautiful in sound—that shared structure reflects the underlying experiential geometry, not modality-specific processing. Cross-modal coherence is evidence of geometric universality.
The Keyboard-Driven Interface
Annotation interfaces affect data quality. A slow, confusing interface produces noisy data. We optimized for speed and low cognitive overhead: keyboard shortcuts for all actions (1-4 to select quality, S to skip, arrow keys to navigate), minimal visual clutter, large stimulus presentation, and immediate feedback. The goal is to minimize the gap between perceiving and reporting—to get the preference judgment before the annotator has time to overthink it.
This matters because the signal we're after is pre-reflective. The preference you report after deliberation is contaminated by your theory of what you should prefer. The preference you report in 800 milliseconds is closer to the raw geometry.
What the Infrastructure Enables
Every taste profile we collect is a partial observation of one person's experiential geometry. At scale, these partial observations constrain the underlying structure. A million comparisons across a thousand people gives you the empirical distribution of human preference—and that distribution has structure. It clusters in ways that reflect shared human architecture. The clusters are the structural motifs of experience: the shapes that delight takes, the shapes that curiosity takes, the shapes that wonder takes.
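As a toy illustration of that clustering, assume each person's profile has been reduced to a utility vector over a shared stimulus set; a plain k-means pass then groups people by shared preference structure. This is a generic sketch (naive initialization, made-up data), not the analysis pipeline itself.

```python
import numpy as np

def kmeans(profiles, k, n_iters=50):
    """Cluster per-person utility vectors to surface shared preference
    motifs. Plain k-means; each row of `profiles` is one person's fitted
    utilities over a common stimulus set (assumed representation)."""
    centers = profiles[:k].copy()  # naive init: first k profiles as seeds
    for _ in range(n_iters):
        # Assign each profile to its nearest center.
        dists = np.linalg.norm(profiles[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its members.
        for j in range(k):
            if (labels == j).any():
                centers[j] = profiles[labels == j].mean(axis=0)
    return labels, centers

# Two made-up preference motifs: rows 0 and 2 resemble each other, as do 1 and 3.
profiles = np.array([[1.0, 1.0], [-1.0, -1.0], [1.1, 0.9], [-0.9, -1.1]])
labels, centers = kmeans(profiles, k=2)
```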
We built this infrastructure because the science requires data that didn't previously exist in machine-readable form, and the engineering to collect it well hadn't been done. Now it has.