Vibecoding a complex combobox component
Can an LLM build a non-trivial UI component? I tested Claude, Gemini, and o3 on a tree-based combobox. Here’s what worked, what didn’t, and where they fell short.
At WorkOS, we’re constantly exploring how new tools can accelerate product development. Recently, I set out to answer a simple but loaded question: can an LLM build a complex and novel UI component for me?
Now, we’ve all seen LLMs do really impressive work in the front-end domain. But the nature of an LLM—that it is trained on existing data—makes the question here more interesting. How far can we get when the UI deviates from normal patterns?
The component in question is a tree-based combobox: a searchable dropdown with collapsible parent nodes and deeply nested child items. It had to look great, work with keyboard and screen readers, and match our design system. I used Cursor, which is configured and tuned with context from our monorepo, including our design system’s component library.
What follows is a breakdown of how it went—what worked, what didn’t, and how we approach AI-driven UI moving forward.
What I asked it to build
I started by writing a prompt that described the desired behavior of the component in plain language:
Using existing components in our design system and Radix Primitives as a base, build a combobox with the following requirements:
- It should have a search input and a list of options.
- The list of options should be formatted as a tree where parent nodes are collapsible.
- If the user's search matches a parent node, its children will not be shown until the user expands the parent node.
- If the user's search matches a child node, the parent node will be included in the results and expanded to show the matching child node.
- It should be styled to match our design system. Options should look similar to items in our `Select` component.
- It should be usable for keyboard users. The `Enter` key should expand options if the item has children, otherwise it should select the item.
- It should be accessible for screen-reader users.
- It should use similar patterns for the component API as our other design system components (compound components where composition is the primary way to configure the component)
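The trickiest requirement is the search behavior: matching a parent keeps it collapsed, while matching a child forces the parent open. As a rough mental model, it boils down to something like the sketch below, assuming a hypothetical `TreeNode` shape (this is illustrative only, not code any of the models produced):

interface TreeNode {
  value: string;
  label: string;
  children?: TreeNode[];
}

interface FilterResult {
  node: TreeNode;
  // Expanded because a descendant matched; a node that only matched on its
  // own label stays collapsed until the user expands it.
  expanded: boolean;
  children: FilterResult[];
}

function filterTree(nodes: TreeNode[], query: string): FilterResult[] {
  const q = query.trim().toLowerCase();
  if (q === "") {
    // No query: show everything, collapsed by default.
    return nodes.map((node) => ({
      node,
      expanded: false,
      children: filterTree(node.children ?? [], ""),
    }));
  }

  return nodes.flatMap((node) => {
    const matchingChildren = filterTree(node.children ?? [], q);
    const selfMatches = node.label.toLowerCase().includes(q);

    if (matchingChildren.length > 0) {
      // A child matched: include the parent and expand it to reveal the match.
      return [{ node, expanded: true, children: matchingChildren }];
    }
    if (selfMatches) {
      // Only the parent matched: include it, but keep its children hidden
      // until the user expands it.
      return [{ node, expanded: false, children: [] }];
    }
    return [];
  });
}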
What I got
Claude Sonnet 4
I explicitly asked for a set of compound components following the Radix composition model. Instead, I got a single component with a configuration-driven API.
<TreeCombobox.Root
  searchPlaceholder="Search permissions..."
  renderEmpty={() => {/* ... */}}
  renderItem={() => {/* ... */}}
  renderSelectedValue={() => {/* ... */}}
/>
I followed up with a request to break the component into multiple compound components, reiterating that it should look similar to a Radix component. On the second pass it did a much better job, giving me an API that looked more like this:
<TreeCombobox.Root>
  <TreeCombobox.Anchor>
    <Input />
  </TreeCombobox.Anchor>
  <TreeCombobox.Content>
    <TreeCombobox.Item>Node 1</TreeCombobox.Item>
    <TreeCombobox.Item>
      <TreeCombobox.ItemLabel>Node 2</TreeCombobox.ItemLabel>
      <TreeCombobox.Item>
        <TreeCombobox.ItemLabel>Node 2.1</TreeCombobox.ItemLabel>
      </TreeCombobox.Item>
      <TreeCombobox.Item>
        <TreeCombobox.ItemLabel>Node 2.2</TreeCombobox.ItemLabel>
      </TreeCombobox.Item>
    </TreeCombobox.Item>
  </TreeCombobox.Content>
</TreeCombobox.Root>
Unfortunately, the API did not seem to support nesting items at all, so the resulting structure wasn’t even a tree! Not at all what I asked for, so I set this one aside for the moment to see what the other models could do.
Gemini 2.5 Pro
Just like Claude Sonnet, this model failed to understand the composition model I was asking for. The initial component API looked very similar to the one Claude generated, but it was even more restrictive: there was no way to modify the rendering behavior of the contained elements.
<TreeCombobox
  searchPlaceholder="Search permissions..."
  data={{ /* ... */ }}
/>
When I requested changes, Gemini still couldn’t quite grasp the goal. Seems like humans aren’t the only ones who struggle with React composition! The result was still a single component with more configuration-driven props. Continuing to iterate from here felt like it would only produce more confusing output, so I decided to move on and give o3 a try.
o3
This model did the best job right out of the gate, delivering an API I might actually have come up with myself. It mirrored our other design system components and Radix much more closely, needing only a few minor tweaks.
<TreeCombobox.Root
  open={open}
  selectedValue={selected}
  value={search}
  onOpenChange={handleOpenChange}
  onValueChange={setSearch}
  onSelectedValueChange={onSelectedChange}
>
  <TreeCombobox.Anchor>
    <TreeCombobox.Input placeholder="Search permissions..." />
  </TreeCombobox.Anchor>
  <TreeCombobox.Content>
    <TreeCombobox.ScrollArea>
      <TreeCombobox.Item>
        <span>Node 1</span>
      </TreeCombobox.Item>
      <TreeCombobox.Item>
        <span>Node 2</span>
        <TreeCombobox.Item>
          <span>Node 2.1</span>
        </TreeCombobox.Item>
        <TreeCombobox.Item>
          <span>Node 2.2</span>
        </TreeCombobox.Item>
      </TreeCombobox.Item>
    </TreeCombobox.ScrollArea>
  </TreeCombobox.Content>
</TreeCombobox.Root>
This model was much slower to produce output, but the quality was so good that I decided to stick with it for the remaining tasks. So far, so good!
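For reference, the compound-component pattern behind this API is mostly shared state over React context: the Root owns the state, and every part reads it from context instead of receiving it through props. A minimal sketch of that wiring, simplified to uncontrolled state and using hypothetical names rather than the generated code:

import * as React from "react";

// Hypothetical, simplified shape of the state the Root shares with its parts.
interface TreeComboboxContextValue {
  open: boolean;
  setOpen: (open: boolean) => void;
  search: string;
  setSearch: (search: string) => void;
  selectedValue: string | null;
  selectValue: (value: string) => void;
}

const TreeComboboxContext =
  React.createContext<TreeComboboxContextValue | null>(null);

function Root({ children }: { children: React.ReactNode }) {
  const [open, setOpen] = React.useState(false);
  const [search, setSearch] = React.useState("");
  const [selectedValue, setSelectedValue] = React.useState<string | null>(null);

  const value = React.useMemo<TreeComboboxContextValue>(
    () => ({
      open,
      setOpen,
      search,
      setSearch,
      selectedValue,
      selectValue: (v) => {
        setSelectedValue(v);
        setOpen(false);
      },
    }),
    [open, search, selectedValue],
  );

  return (
    <TreeComboboxContext.Provider value={value}>
      {children}
    </TreeComboboxContext.Provider>
  );
}

// Every part (Input, Content, Item, ...) reads the shared state here instead
// of taking it as props, which is what keeps the API composable.
function useTreeCombobox() {
  const ctx = React.useContext(TreeComboboxContext);
  if (!ctx) {
    throw new Error("TreeCombobox parts must be used inside <TreeCombobox.Root>");
  }
  return ctx;
}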
Where it fell short
Getting the overall API and component scaffolding was great, but the moment I started testing it, some big issues surfaced:
- The Radix `Collapsible` component didn’t play well inside a combobox. Event handlers that triggered the collapsible state conflicted with selectable items, breaking key functionality of the combobox.
- Screen reader support was poor. Semantic tree structures didn’t translate intuitively in the combobox context, and ARIA roles and attributes were used incorrectly.
- Keyboard navigation broke. Radix's collection utilities expect flat lists, so nested navigation was a mess (see the sketch after this list).
- Filtering logic was flawed. It searched invisible values, not labels, and sometimes hid parent nodes when child matches existed.
- Focus management was broken. Focus would jump or disappear entirely during interaction, preventing the user from typing after toggling a tree node’s state.
- Popover placement was off. Visually, it almost worked—but spacing and alignment were wrong.
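One way around the keyboard navigation problem (not necessarily what the generated code did) is to flatten the currently visible nodes into an ordered list and drive ArrowUp/ArrowDown over that list, rather than over the nested tree. A rough sketch, reusing the hypothetical FilterResult shape from the earlier sketch:

interface FlatItem {
  value: string;
  label: string;
  depth: number;
  hasChildren: boolean;
}

function flattenVisible(results: FilterResult[], depth = 0): FlatItem[] {
  return results.flatMap(({ node, expanded, children }) => {
    const item: FlatItem = {
      value: node.value,
      label: node.label,
      depth,
      hasChildren: (node.children ?? []).length > 0,
    };
    // Only expanded nodes contribute their children to the navigable list.
    return expanded ? [item, ...flattenVisible(children, depth + 1)] : [item];
  });
}

Arrow keys then just move an active index over this array, and Enter checks hasChildren to decide between expanding and selecting, which lines up with the keyboard requirement from the original prompt.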
Iteration
I hoped I could keep prompting my way out of these issues. That worked … sometimes. But most iterations looked like this:
- Write a prompt to fix a behavior
- Skim the output for obvious issues
- Test it manually
- Discover deeper bugs or regressions
- Decide: prompt again or fix it myself?
- Repeat
For simple changes, prompting worked pretty well. But for behavioral issues—especially those involving coordination between state, accessibility, and layout—the LLM often made things worse or veered off course.
At some point, I realized it would’ve been faster to rewrite much of the implementation manually.
Was it worth it?
It depends on your expectations.
If you think of the LLM as a code generator that builds the whole, fully functional component for you, it might come up short. Unless the model has been trained on a lot of existing code that solves similar problems, chances are that key requirements will be missed.
If you treat it like a scaffolding tool or a rapid prototyping assistant, it saved a ton of time. The structure, naming, and composition patterns were remarkably close to what I would’ve written. And skipping the repetitive typing helped me stay focused on the logic. Had I stopped here and built out most of the implementation myself, I suspect I’d have saved quite a bit of time.
That said, context matters—a lot. The more context you provide your AI tools, the more likely they are to produce output that satisfies your prompt. I’d suggest building a robust set of tools and techniques for feeding context to the LLM to improve your chances of success.
Takeaways
- Scaffolding was strong. The LLM gave me a head start and saved effort on boilerplate and API design.
- Iteration is the bottleneck. The first 80% was fast. The last 20% was slow, frustrating, and not easily prompt-able.
- Complex UI = weak spot. The more stateful, interactive, or nuanced the behavior, the more the LLM struggled.
- Prompt quality matters. Writing a good prompt takes time, especially if you’re not used to guiding LLMs effectively.
- Context matters more. Code comments and co-located tests could’ve helped preserve LLM understanding across prompts.
What I’d do differently
- Start with tests. Codifying the expected behavior upfront would’ve made it easier to catch regressions and clarify intent (a sketch of what that could look like follows this list).
- Comment the files. Losing chat context is painful. In-file context is sticky and reusable.
- Leverage the right tools. I recently started using my own Cursor rules and the WorkOS Docs MCP server when working on our internal codebase, and it’s been a game changer.
- Choose the right model. Not all LLMs are created equal. Some handled logic better, others UI better.
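To make the first point concrete, here’s the kind of behavioral test I wish I had written before prompting at all. It assumes Vitest plus Testing Library and a hypothetical PermissionsCombobox wrapper around the component, so treat it as a sketch rather than real project code:

import { render, screen } from "@testing-library/react";
import userEvent from "@testing-library/user-event";
import { expect, it } from "vitest";
import { PermissionsCombobox } from "./PermissionsCombobox";

it("expands a parent node when a child matches the search", async () => {
  const user = userEvent.setup();
  render(<PermissionsCombobox />);

  await user.type(screen.getByRole("combobox"), "2.1");

  // The matching child and its parent should both be visible...
  expect(screen.getByText("Node 2")).toBeTruthy();
  expect(screen.getByText("Node 2.1")).toBeTruthy();
  // ...while non-matching top-level nodes are filtered out.
  expect(screen.queryByText("Node 1")).toBeNull();
});

it("expands on Enter when the active item has children", async () => {
  const user = userEvent.setup();
  render(<PermissionsCombobox />);

  await user.click(screen.getByRole("combobox"));
  await user.keyboard("{ArrowDown}{ArrowDown}{Enter}");

  // Node 2 has children, so Enter should expand it rather than select it.
  expect(screen.getByText("Node 2.1")).toBeTruthy();
});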
Final thoughts
LLMs aren’t magic. But they’re good at patterns—and for a design system-heavy codebase, that’s meaningful.
This wasn’t a one-click success story. But it also wasn’t a waste of time. The LLM didn’t finish the component for me, but it got me moving fast. And in the early stages of building complex UI, that’s still a huge productivity boost and a win.