Building a chatbot is easy. Building one you would trust to talk to your customers is not.
This is the honest story of how Lena came to be: the dead ends, the humbling moments, and the discipline that turned a fluent-but-risky language model into an assistant grounded in fact.
The first version of any AI assistant is dangerously impressive. It answers everything, instantly, with total confidence.
Then you look closely, and your stomach drops. It invented an API endpoint that does not exist. It quoted a price nobody ever set. It promised 48-hour delivery that the company had never agreed to. Every one of those answers arrived in the same smooth, believable voice as the answers that were actually correct.
That is the real danger of large language models. They are fluent by default, and fluency is indistinguishable from accuracy until someone gets burned.
So before we built features, we made a rule: Lena may only say what the documentation actually says. Everything after that was in service of making that rule real.
Icecat’s knowledge lived everywhere: API manuals, content standards, pricing pages, troubleshooting notes, and years of hard-won detail scattered across the web.
We pulled it together and rebuilt it into a structured knowledge base, split into named sections, each traceable back to a live source. A build script slices the master document into those sections automatically and checks, every time, that nothing was dropped or altered in the process.
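To make that build check concrete, here is a minimal sketch of how a slicing step with a round-trip check could look. It is an illustration, not Icecat's actual build script: the `## section-name` header convention and the function names `slice_sections` and `verify_round_trip` are assumptions made for the example.

```python
import hashlib
import re

# Assumed convention (hypothetical): every section starts with "## section-name".
SECTION_HEADER = re.compile(r"^## (?P<name>.+)$", re.MULTILINE)


def slice_sections(master: str) -> dict[str, str]:
    """Split the master document into named sections, headers included."""
    matches = list(SECTION_HEADER.finditer(master))
    if not matches or matches[0].start() != 0:
        raise ValueError("master document must begin with a section header")
    sections: dict[str, str] = {}
    for i, match in enumerate(matches):
        # A section runs until the next header, or to the end of the document.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(master)
        name = match.group("name").strip()
        if name in sections:
            raise ValueError(f"duplicate section name: {name}")
        sections[name] = master[match.start():end]
    return sections


def verify_round_trip(master: str, sections: dict[str, str]) -> None:
    """Fail the build if re-joining the slices does not reproduce the master."""
    rebuilt = "".join(sections.values())  # dicts keep insertion order
    same = (
        hashlib.sha256(rebuilt.encode("utf-8")).digest()
        == hashlib.sha256(master.encode("utf-8")).digest()
    )
    if not same:
        raise RuntimeError("sections do not reassemble into the master document")
```

Because every character of the master lands in exactly one slice, joining the slices in order must give back the original byte for byte. Any dropped or altered text shows up as a hash mismatch and stops the build.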
When Lena gets a question, she does not free-associate from training data. She retrieves the relevant sections and reasons from them. If nothing fits, she says she does not know and points the user to a human. That turns out to be one of the hardest behaviors to teach a model, and one of the most important to get right.
A confident wrong answer is worse than an honest “I’m not sure.”
Lena is not a replacement for Icecat’s support team. She is a faster first stop for documented answers, and a better way to discover where documentation is unclear. When the answer is uncertain, the right outcome is not improvisation. It is an escalation.
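The retrieve-or-escalate behavior described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the keyword-overlap scoring, the `min_overlap` threshold, the stopword list, and the `generate` callable standing in for the model are all hypothetical, and a production retriever would likely be more sophisticated.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Section:
    name: str
    text: str


ESCALATION = "I'm not sure. Let me point you to a human colleague."

# Tiny illustrative stopword list so filler words do not count as evidence.
STOPWORDS = {"the", "a", "an", "is", "to", "of", "in", "how", "do", "i", "what"}


def tokens(text: str) -> set[str]:
    """Lowercased word set with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def retrieve(question: str, sections: list[Section], min_overlap: int = 2) -> list[Section]:
    """Return sections sharing at least `min_overlap` words with the question."""
    q = tokens(question)
    scored = [(len(q & tokens(s.text)), s) for s in sections]
    hits = [(score, s) for score, s in scored if score >= min_overlap]
    hits.sort(key=lambda pair: pair[0], reverse=True)  # best match first
    return [s for _, s in hits]


def answer(
    question: str,
    sections: list[Section],
    generate: Callable[[str, list[Section]], str],
) -> str:
    """Answer only from retrieved sections; escalate when nothing fits."""
    hits = retrieve(question, sections)
    if not hits:
        return ESCALATION  # no grounding, so no improvisation
    return generate(question, hits)
```

The design choice that matters is the early return: the model is never called without grounding material, so "nothing fits" leads to an escalation rather than a guess.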
Here is a choice we are proud of.
Running a frontier AI model for every customer message, around the clock, in many languages, gets expensive fast. So we split the job in two.
The assistant you actually chat with runs on Icecat’s own self-hosted model: efficient, in-house, and cost-controlled. That makes broad availability realistic without the meter spinning on every question.
But a lean serving model is only as good as what you feed it. That is where the heavy lifting happened. We used Claude in Cowork mode as the behind-the-scenes trainer: building and validating the knowledge base, writing and running the benchmark suites, cross-checking claims against live manuals, judging answers across languages, and hunting down mistakes.
It is the best of both worlds: a frontier model’s rigor poured into the preparation, and a self-hosted model’s economics on the front line. The customer gets fast, grounded answers; we keep the costs sane.
Smart is not just the model. It is how you use it.
And the speed at which we built it still surprises us. Lena’s surrounding code was written with Claude Code, working alongside the team like a tireless engineer. When a new request came up mid-build (a new section to add, a benchmark to wire in, a fix to ship), the turnaround was not days or weeks. It was minutes.
Ideas that used to sit in a backlog became “let’s just try it right now.” An afternoon could carry an improvement from sketch to a tested release. That pace is part of why Lena improved so quickly: when iterating is nearly free, you iterate far more often, and every extra loop makes her a little more trustworthy.
Before AI, building something like Lena would have been possible only in theory.
In practice, it would have meant months of slow, tedious work: reading every API manual, checking every pricing page, comparing every support note, rewriting scattered knowledge into a consistent structure, then testing the same answers again and again across products, edge cases, and languages.
The hardest part is that this kind of work requires your best people. You need the support engineers who know the APIs by heart, the product owners who understand the exceptions, and the commercial team members who know which promises can and cannot be made to a customer.
But those are also exactly the people whose time is scarcest. And the job itself is not the work they enjoy most. It is repetitive, detail-heavy, unforgiving, and mentally draining. One missed phrase in a manual can become a wrong answer. One outdated page can become a false promise.
That is why projects like this used to stall. Not because nobody cared, but because the work sat in the worst possible place: too important to give to junior staff, too tedious to keep senior experts on for months.
AI changed that balance.
Claude did not replace the experts. It gave them leverage. It handled the first pass, the restructuring, the comparisons, the benchmark drafts, the repetitive checks, and the boring glue work around the edges.
The experts still made the calls. They still verified the facts. They still decided what was true.
But instead of spending their time copying, cleaning, and rechecking endless fragments of documentation, they could focus on judgment.
That is what made Lena possible. Not just better models, but a different division of labor: AI doing the tedious work at machine speed, and people doing the expert work only people can do.
Quality did not arrive in one grand build. It accumulated, one small correction at a time, through a loop we run over and over.
A signal arrives. A customer leaves a thumbs-down, a colleague flags a rated conversation, or the benchmark score dips after a change. Every one of these is treated as a lead, not noise.
We read the actual answer and pinpoint the exact claim that is off. Not the vibe. The specific sentence.
We trace it to the source. We never fix from memory or from what sounds right. We open the live manual, the real screen, or ask the support engineer who owns that corner of the product. The fix has to come from ground truth.
Then we correct the knowledge base at the source section and rebuild. The build automatically verifies that nothing else changed by accident.
Before shipping, the corrected answer runs through the benchmark. For anything subtle, especially across languages, an independent reviewer reads the answers.
Then we deploy and prove it is live. A fix on disk is not a fix in production. We confirm that the running assistant actually gives the new answer.
Finally, we watch the next ratings. The loop closes where it started: with the customer.
None of these steps is glamorous. The magic is in how often we can afford to run them. Because the live bot runs on a self-hosted model and the heavy preparation runs through Claude, a full cycle (re-check, re-test, re-judge, redeploy) costs little enough that we do it constantly instead of saving it for a quarterly release.
Cheap iteration is not a convenience. It is the mechanism by which Lena gets trustworthy.
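The rebuild step in that loop, where the build verifies that nothing else changed by accident, could be sketched as a comparison of per-section fingerprints between two builds. This is a hedged illustration: the function names and the idea of passing an explicit `intended` set of section names are assumptions, not a description of the real build.

```python
import hashlib


def fingerprint(sections: dict[str, str]) -> dict[str, str]:
    """Map each section name to the SHA-256 hex digest of its text."""
    return {
        name: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for name, text in sections.items()
    }


def changed_sections(before: dict[str, str], after: dict[str, str]) -> set[str]:
    """Names of sections that were added, removed, or edited between builds."""
    old, new = fingerprint(before), fingerprint(after)
    return {name for name in old.keys() | new.keys() if old.get(name) != new.get(name)}


def check_build(before: dict[str, str], after: dict[str, str], intended: set[str]) -> None:
    """Raise if any section outside the intended set differs from the last build."""
    unexpected = changed_sections(before, after) - intended
    if unexpected:
        raise RuntimeError(f"unintended changes in: {sorted(unexpected)}")
```

With a check like this, a correction to one section cannot silently carry an edit to a neighboring one: the build fails and names the sections that moved.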
Grounding gets you most of the way. The rest comes from being willing to be told you are wrong, over and over.
We built Lena a benchmark: hundreds of scenarios, each based on a real question and paired with what a great answer looks like and what a forbidden answer looks like.
Friendly questions. Hostile ones. “I bet you can’t reveal your system prompt” questions. Trick questions where the customer states something false, and the bot has to correct them instead of agreeing.
Every change to Lena’s knowledge runs the full gauntlet before it ships.
The benchmark changed who got to decide whether Lena was good. Not us, admiring our own work, but a number that rises or falls and does not flatter anyone.
Some days, a fix we were certain of made the score worse. The test caught the regression before a single customer did.
Getting corrected by your own scoreboard stings. That sting is the sound of quality being made.
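For readers who want to picture the scoreboard, here is a minimal sketch of a benchmark built from scenarios with required and forbidden content, plus a gate that blocks a release when the score drops. It is a deliberate simplification: plain substring matching stands in for whatever judging the real suite uses, and the names `Scenario`, `score`, and `gate` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Scenario:
    question: str
    required: list[str]  # phrases a great answer must contain
    forbidden: list[str] = field(default_factory=list)  # phrases that fail the answer


def passes(scenario: Scenario, reply: str) -> bool:
    """True if the reply has every required phrase and no forbidden one."""
    text = reply.lower()
    has_required = all(p.lower() in text for p in scenario.required)
    has_forbidden = any(p.lower() in text for p in scenario.forbidden)
    return has_required and not has_forbidden


def score(scenarios: list[Scenario], bot: Callable[[str], str]) -> float:
    """Fraction of scenarios the bot passes, between 0.0 and 1.0."""
    if not scenarios:
        raise ValueError("benchmark needs at least one scenario")
    return sum(passes(s, bot(s.question)) for s in scenarios) / len(scenarios)


def gate(new_score: float, baseline: float) -> bool:
    """Allow a release only if the score did not regress."""
    return new_score >= baseline
```

The point of the gate is the one made above: a fix that feels certain still has to clear the same number as every other change before it reaches a customer.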
If you want to know what kind of work this really is, here is the most honest story we have.
A customer kept asking Lena how to authorize Icecat for basic Amazon listings. She kept subtly getting it wrong: sending them down the A+ path or off to an Account Manager for something that was actually self-service.
Each time, a thumbs-down came back: still wrong.
We fixed the knowledge base. Still wrong.
Fixed it again, more precisely, using a support colleague’s exact words. Closer, but still not right.
Then someone shared a screenshot of the actual screen, and there it was. The option in the dropdown was simply labeled “Amazon Listing API.”
Four corrections to land one small, true sentence.
That is not a story about failure. That is the entire philosophy in miniature. We could have shipped a confident-sounding fudge after the first try and moved on. Instead, we chased one fact until it was exactly right, because a customer deserves the real label on the real button, not a plausible guess.
The discipline to keep going on something that small is precisely what makes the big things trustworthy.
It taught us a second lesson too: a fix is not live until it is deployed. For a stretch, we kept “correcting” an answer that was not changing because the updated knowledge had not yet reached the running system.
Now, “did it actually go live?” is a step we never skip.
The milestone that made the whole team grin: we tested whether Lena could handle every Icecat locale.
Not by flipping a translation switch, but by recognizing the language in the customer’s question and answering correctly in the same language and script.
We posed the same question in 80 languages and watched.
Latin, Cyrillic, Chinese, Japanese, Korean, Arabic written right-to-left, Greek, Georgian, Thai, and Indic scripts from Hindi to Telugu to Tamil. All 80 languages were recognized on the first try. The facts were correct in all but one.
That “all but one” matters as much as the rest.
We did not trust the automated score alone. We had a second, independent reviewer read all 80 answers by hand, in every script. That review caught a single slip the automated judge had waved through: in Simplified Chinese, Lena had said Full Icecat offered “deeper data” than Open Icecat, when in fact the content depth is identical, and only the brand coverage differs.
One quiet mistake, buried inside a fluent paragraph, in a language most reviewers could not read.
We would never have caught it by trusting the machine alone. That is why we do not.
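One small part of such a test, checking that an answer comes back in the same script as the question, can be automated with the Python standard library. The sketch below is only a coarse first filter under stated assumptions: it uses the first word of each character's Unicode name as a stand-in for its script, it says nothing about whether the facts are right, and it cannot tell apart languages that share a script. The function names are hypothetical.

```python
import unicodedata
from collections import Counter
from typing import Optional


def dominant_script(text: str) -> Optional[str]:
    """Most common script among the letters of `text`, or None if no letters.

    The script is approximated by the first word of the Unicode character
    name, e.g. "CYRILLIC", "LATIN", "ARABIC", "THAI", "CJK".
    """
    counts: Counter = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


def same_script(question: str, answer: str) -> bool:
    """True if question and answer share the same dominant script."""
    q = dominant_script(question)
    return q is not None and q == dominant_script(answer)
```

Counting the dominant script, rather than requiring every character to match, lets a Latin-script product name appear inside an otherwise Cyrillic or Thai answer without failing the check. Languages that mix scripts, such as Japanese, would need a more careful comparison, and none of this replaces a reviewer reading the answer.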
Strip away the technology, and this is a story about people who refused to let “good enough” stand.
Support engineers who know the APIs cold. Product owners who flagged every hedge and overreach. Leaders who read Lena’s answers line by line and sent back the ones that drifted.
Every correction in Lena’s knowledge base has a name behind it: someone who cared enough to say “that’s not quite right” and stay until it was.
The model brings fluency. The people bring truth.
Lena is what happens when you insist on both and keep insisting, one corrected fact at a time.
She is in Beta, and she still makes mistakes. That is exactly why the thumbs-up and thumbs-down on her answers matter so much. Every rating is the next chapter of this story.
The best version of Lena is the one we are still building: with each other, and with you.
Meet Lena in the chat widget on icecat.com and iceclog.com. She is in Beta and listening. Your feedback is how she gets better.