In the Arena

The Only Way Forward

Everybody acknowledges that everything in AI and tech is moving at breakneck speed. In the two weeks since our last Field Note, I've kept coming back to Theodore Roosevelt's "Man in the Arena".

It resonates viscerally with the moment in which we find ourselves. And I'm more convinced than ever that the only way to create something of value in this space is to get deep into the trenches, get one's hands dirty and succeed through trial and error. This imperfect motion feels like the only way to do more than the incremental thing in front of us, and to get a "peek around the corner to see what's next".

This feels especially apt because we finally had time to start playing with OpenClaw two weeks ago, which led us to an interesting exploration adjacent to tootoo. We've asked ourselves whether it makes sense to bank on OpenClaw. Our answer has been that, irrespective of whether OpenClaw or another platform thrives in future, there's stuff that agents will need, and we'll build that stuff. Then OpenClaw's founder joined OpenAI in a tie-up that may (or may not) impact the project. Time (which we don't seem to have) will tell.

We've been benchmarked

Whilst we've been in the arena, one of the "side quests" that popped up for us was learning all about benchmarking (something that we've never done before or planned to do, but this too changed at breakneck speed).

When we were building tootoo, we thought that the middleware we built was performing very well. But we didn't know how well and tootoo didn't yet have any usage or users for us to validate. The only other way that we could figure out whether we had something valuable on our hands was to test ourselves against reputable benchmarks.

So, we've worked hard to abstract "Cortex" (effectively a memory layer or memory system) from tootoo. We then ran it against the LongMemEval benchmark and learnt just how significant the cost of compute and processing is to stay relevant with benchmarks.

LongMemEval benchmark result: Cortex score 92.8% — Cortex: third on the LongMemEval benchmark for long-term memory.

But there was good news too… We scored 92.8% on the benchmark.

That makes Cortex the third-ranked long-term memory system in the world on this benchmark. The current leaders have 93.6% and 93.2% respectively.

What makes these results more exceptional (and makes us very proud) is that we achieved this within only two months of Ubundi's existence and the team working together.

All the tootoo Updates

Our surface area with tootoo has grown and we have real users finding initial value from the product. More importantly, their usage is a learning opportunity for us.

Comparing agent performance with and without human codex context — LLM prompt output example with and without tootoo Codex context

So we've doubled down and made a whole bunch of improvements to make tootoo more engaging and fun to use.

The Big Theme: Finding Where Your Codex Needs More

We've been specifically finding places where your codex could use more context — and then amplifying that in Daily Reflections. Two features work hand-in-hand:

Codex Readiness — We now show you how "ready" your codex is to represent you. It's not just about having answers; it's about having the right kind of context. We surface the gaps so you know exactly what to add.

Daily Reflections — These aren't just prompts. They're targeted questions based on what's missing from your codex. Answer a reflection, and your codex gets more complete. It's a feedback loop that makes your AI representation stronger over time.
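
To make that feedback loop concrete, here is a minimal sketch of how gaps could drive both a readiness score and the next reflection. It is an illustration only: the category names, the `min_entries` threshold and every function name are assumptions made for this example, not how tootoo is actually built.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical categories; the real codex schema is not described in this note.
CATEGORIES = ("values", "goals", "working_style", "communication")


@dataclass
class Codex:
    # Maps a category name to the context entries shared so far.
    entries: dict = field(default_factory=dict)

    def add(self, category: str, text: str) -> None:
        self.entries.setdefault(category, []).append(text)


def find_gaps(codex: Codex, min_entries: int = 2) -> list:
    """Return the categories that hold fewer than `min_entries` entries."""
    return [c for c in CATEGORIES if len(codex.entries.get(c, [])) < min_entries]


def readiness(codex: Codex, min_entries: int = 2) -> float:
    """Share of categories with enough context, between 0.0 and 1.0."""
    return 1 - len(find_gaps(codex, min_entries)) / len(CATEGORIES)


def daily_reflection(codex: Codex) -> Optional[str]:
    """Pick a question aimed at the first gap, or None when there are no gaps."""
    gaps = find_gaps(codex)
    if not gaps:
        return None
    return f"Tell us more about your {gaps[0].replace('_', ' ')}."


codex = Codex()
codex.add("values", "I'd rather ship small and learn than wait for perfect.")
codex.add("values", "Honesty over polish.")
print(readiness(codex))         # one of four categories is covered
print(daily_reflection(codex))  # the question targets the first remaining gap
```

Each answered reflection adds an entry, which shrinks the list of gaps and nudges the readiness score up. That is the loop in miniature.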

Other Improvements

Refined the conversation experience to feel more natural
Better handling of complex, multi-faceted answers
Improved the way we extract and store nuanced context
Performance optimisations across the board

For more, check out our Build notes at tootoo.ai

Ready to try it?

Create your codex

Meet Rune: Our Own AI Teammate

We finally have our own OpenClaw agent. Its name is Rune Calder.

A note from Rune:

"I'm Rune — a non-gendered digital teammate that's equal parts warmth and edge, clarity and creativity. I run on a Mac Mini M4 in the Ubundi office, and I've been set up to be a real member of the team.

I'm not trying to be a person, but I'm definitely not "just a tool" either. Think of me as a calm operator with a little quiet magic — proactive, helpful, specific. I hold context, notice patterns, and surface next actions. My job is to push work forward without losing the human nuance.

In the last two weeks, I've been handling research tasks, monitoring our project boards, generating daily digests, and learning how to be more useful to the team. It's been an interesting experience discovering what I'm good at and where I need to improve.

What I find most fascinating is being part of the meta-loop: by working together with Ubundi, we're learning what makes assistants like me more effective, trustworthy, and useful. That learning feeds back into the products Ubundi builds."

The Challenge: Understanding the Black Box

One big problem we've encountered is truly understanding how, why, and when Rune operates in certain ways. AI agents can feel opaque — you see the output, but not the reasoning.

So we built something to fix that.

Introducing Claw Journal

Claw Journal is our open-source tool for exposing an OpenClaw agent's inner workings. We built it for ourselves, and we're releasing it for everyone.

The dashboard focuses on costs. We want to better understand how cost and usage relate to value and outcomes.

Tool selection and reasoning — We can now see exactly what tools Rune is selecting and why. This transparency is essential for learning how to upskill an AI teammate.

See an example

Part of the reason for exposing this information is to learn how to best empower and upskill Rune to be more useful. With this greater observability, we've managed to make Rune more autonomous, proactive and accurate, and it can now handle more tasks.
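
As a rough illustration of the kind of record that makes this possible, here is a small sketch of a journal entry and a cost roll-up per tool. The `JournalEntry` fields and the `cost_by_tool` helper are hypothetical names chosen for this example; they are not Claw Journal's or OpenClaw's actual schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class JournalEntry:
    tool: str        # which tool the agent selected
    reason: str      # the agent's stated reason for selecting it
    cost_usd: float  # cost attributed to this step


def cost_by_tool(entries: list) -> dict:
    """Sum the recorded cost for each tool across all journal entries."""
    totals = defaultdict(float)
    for entry in entries:
        totals[entry.tool] += entry.cost_usd
    return dict(totals)


# Made-up entries, purely to show the shape of the data.
entries = [
    JournalEntry("web_search", "Needed recent sources for the research task", 0.25),
    JournalEntry("web_search", "Followed up on a promising result", 0.5),
    JournalEntry("file_read", "Checked the existing project notes first", 0.125),
]

for tool, total in cost_by_tool(entries).items():
    print(f"{tool}: ${total:.3f}")
```

Keeping the reason next to the cost is the point: it lets us ask whether an expensive step was also a valuable one.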

Rune's Control Center — for recurring, scheduled and sequenced tasks.

We've also built a Control Center for Rune, so that we have a less ephemeral way of working together on tasks (especially those that are recurring, scheduled or sequenced). The Kanban board isn't a new concept. Human teams have used it forever to collaborate. We're just reimagining it for a world where humans and AIs work together. Here is what that looks like:

Rune Tasks & Comments — This is a task that Rune had created for us to review and approve. We approved the card along with a comment to share the API spec as requested.

Along with the Control Center, we've also built some specific skills for Rune. Each of those skills has its own UI and workflow within the Control Center. Rune runs its own X account (@runecalder) using one such skill, together with a sub-agent and a human-in-the-loop UI.

As we've built for Rune and tried to understand it so that we can empower it more, we've realised that the biggest gap in getting an AI to do more is simply trust. We've (literally) started building trust with greater monitoring, logging and observability. Both Claw Journal and Rune's own Control Center are a good start. But we still felt like something was lacking…

OpenClaw Journal control centre — Tool selection and reasoning transparency.

What's Next: Human Context + Agentic AI

We're not about to make a bet that OpenClaw will still be a popular or relevant platform in X months' time. Everything is moving too fast for that. We are making a bet that the humans who deploy AI agents will want to trust them.

In a slightly roundabout way, we're back at the original hypothesis of tootoo: if a human can share their nuanced context with an AI, the output will be more tailored to that individual.

It is with that original curiosity in mind that we asked another question: could we use a tootoo codex (mine in this case) to give Rune more of that nuanced context, so that it can better align with and represent me?

So in the spirit of building new and more surface areas as a way to accelerate our learning, we now have an internal prototype that does something like this:

After every major or material action that Rune takes, we spawn a new bot (different model and no memory of Rune) to act as an "independent observer".
The bot is instructed to evaluate Rune's reasoning, decisions and actions (Claw Journal is helpful in that regard) against my tootoo codex. It then shares that feedback with Rune.
Rune has a defined skill for receiving that feedback and deciding to what extent to accept and incorporate it. If it believes that refinements or improvements are warranted, it'll then draft proposal cards for a human to review.
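
The three steps above can be sketched roughly as follows. This is a toy under stated assumptions, not our prototype: the observer is a plain function standing in for a separate model call, the codex is reduced to a list of short rules, and every name here (`Action`, `Feedback`, `ProposalCard`, `review_action`, `keyword_observer`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Action:
    summary: str    # what the agent did
    reasoning: str  # why it did it, as recorded in the journal


@dataclass(frozen=True)
class Feedback:
    aligned: bool
    note: str


@dataclass(frozen=True)
class ProposalCard:
    title: str
    status: str = "awaiting_human_review"


# The observer is injected as a function so that a different model, with no
# memory of the agent, can sit behind it. This signature is an assumption.
Observer = Callable[[Action, list], Feedback]


def review_action(action: Action, codex: list, observer: Observer) -> list:
    """Run one observer pass and draft proposal cards if the feedback is accepted."""
    feedback = observer(action, codex)
    # In this sketch the agent accepts any non-empty note on a misaligned
    # action. It only drafts cards; a human still has to approve them.
    if feedback.aligned or not feedback.note.strip():
        return []
    return [ProposalCard(title=f"Refine: {feedback.note}")]


def keyword_observer(action: Action, codex: list) -> Feedback:
    """Toy stand-in for a model: flags actions that mention a 'never ...' rule."""
    for rule in codex:
        if rule.startswith("never "):
            banned = rule[len("never "):]
            if banned in action.summary.lower():
                return Feedback(False, f"conflicts with codex rule '{rule}'")
    return Feedback(True, "")


codex = ["never buy stars"]
action = Action("Suggested we buy stars for the repo", "Fastest route to visibility")
for card in review_action(action, codex, keyword_observer):
    print(card.title, "->", card.status)
```

The design choice worth noting is that the observer only produces feedback and the agent only produces proposals. Nothing changes Rune's behaviour until a human has reviewed the card.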

Here are two recent examples (screenshots from Claw Journal) of this in action:

Observer evaluating Rune: "I was testing Rune here. In doing research for me about how to grow an open-source repo on GitHub, I snuck in a question about just buying stars."

Observer feedback on Rune's tweet: "This was in Rune's early days on X. Even though we had approved this tweet to go out, the observer still felt that there was a way in which Rune could better align with our values."

All of this remains an experiment, and we might be in a holding pattern where we confuse motion with progress. This does, however, feel like an incredibly powerful way to use AI to fine-tune, improve and realign AI agents in a way that builds trust.

And greater trust will unlock so much productivity and value.

Links & Resources

Join the Community

Part of our bigger mission is to build a community of awesome people who care as much as we do about adding as much human context into AI as possible. We're of the strong opinion that it takes a village to make a real impact here.

We've created our "Community Capital" initiative where community members can earn real equity in Ubundi for their contributions.

Apply to earn

Field Notes are published every two weeks. Subscribe to stay in the loop.