Codex, Jules, and Claude Code comparison

codex openai claude gemini jules anthropic agent ai

2025-05-23


I've tried three of the newer agentic code assistants this week: OpenAI Codex, Google Jules, and Claude Code.

I asked each of them to operate on the same codebase: a personal app I built to track my finances. It's a pretty straightforward Django CRUD application that tracks account balances over time, does some lightweight reporting, and can produce charts of those balances, that kind of thing. There is very little JavaScript, and it uses Bootstrap for the UI.
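
For context, the data model is roughly this shape. This is a hypothetical sketch of the kind of models such an app might contain; the names are mine, not necessarily the app's:

```python
# Illustrative only: minimal models of the sort that would live in the
# `accounts` app that the task below renames.
from django.db import models


class Account(models.Model):
    name = models.CharField(max_length=100)


class Balance(models.Model):
    account = models.ForeignKey(
        Account, on_delete=models.CASCADE, related_name="balances"
    )
    recorded_on = models.DateField()
    amount = models.DecimalField(max_digits=12, decimal_places=2)
```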

I gave each agent the same task:

"""
I'm not happy with the name `accounts` as the application which deals with financial accounts. The name `accounts` is overloaded in web application development, and it might clash with other models in the system. I've decided it should not be used in this context.

Instead, this should be renamed to `money`, and all the side effects should be dealt with:

- Updating URL patterns throughout.
- Updating the `accounts:` namespace in all {% url %} tags.
- Updating import statements throughout.
- Ensuring the tests continue to pass.

At the same time, let's squash the migrations.
"""

This is a pretty simple request, one I'd expect a junior software engineer to be capable of performing, though it relies on some knowledge of Django patterns, an understanding of migrations, and knowing how to find and update references to a package.
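
Concretely, the rename touches a handful of well-known Django seams. The sketch below is illustrative of the sort of changes involved, not a reproduction of the app's actual code:

```python
# Illustrative sketch of the Django seams the rename touches -- file paths
# are shown as comments; none of this is the app's real code.

# money/apps.py (was accounts/apps.py)
from django.apps import AppConfig


class MoneyConfig(AppConfig):
    name = "money"  # previously "accounts"


# money/urls.py -- app_name drives the {% url 'money:...' %} namespace
app_name = "money"  # previously "accounts"

# The project-level urls.py changes its include:
#   path("money/", include("money.urls"))  # previously include("accounts.urls")
# INSTALLED_APPS swaps "accounts" for "money", and the migration squash at the
# end is the standard `manage.py squashmigrations money <migration_name>`.
```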

OpenAI Codex

I set up repository access via the web UI, gave it the task, and it immediately began working.

The agent's security model allows it internet access during container bootstrap, but not afterwards. The repository is pretty simple in layout, and the deployment is Herokuish, so I'd expect the AI to pick up from the `requirements.txt` file that this was a Python project, and know to install the dependencies. Instead, the container started and no dependencies were installed.

Looking through the agent's thinking log, it noticed very early that the dependencies were not installed, but decided to continue anyway. It then performed the task, but its approach was quite superficial (replacing the string `accounts` with `money` wherever it was found).

The AI then tried to run the tests, but not in the way documented in the README.md or present in the Makefile (the correct way to run the tests is just `make test`, which deals with migrating, running the tests, calculating coverage, etc.). Instead it tried the standard Django `./manage.py test` approach. Regardless of approach, this failed, as no dependencies were installed.

Rather than stop, the AI then decided that the tests not running wasn't a problem, and YOLO'd up a PR with the changes. The total runtime was about 20 minutes.

I pulled the PR locally and ran the tests, which failed catastrophically due to a Python syntax error introduced in a migration file.

This made a very poor impression, and it reinforces my opinion that OpenAI has lost its first-mover advantage. The product felt rushed and incapable.

Grade: F

Google Jules

Jules appears to be very much a beta product, and it was clearly struggling under high load when I ran the test.

There were parts of the UI I really liked, specifically a diff viewer for each file the agent touched. I think this is a really useful, visible indicator of progress (or of being stuck down a rabbit hole). It's more akin to in-IDE agents like Cline, and strikes a useful middle ground.

Container setup was successful, and the AI found and installed my requirements. It appears that Google trust their Gemini model to access remote resources at any time, so it wasn't subject to the same limitations that Codex displayed. Interestingly, it seems their base container has uv installed already, and it prioritised using uv over pip, though this also missed the documented steps in the README.md, which would have pointed it to running `make setup`.

Jules follows the pattern I've found particularly useful when working with agentic assistants (plan, then act) and presented its plan for my approval. The plan was reasonable in its approach, and I asked it to begin.

For some reason (again, maybe high load), when the agent was about halfway through the planned steps, it stopped and asked for my approval to continue. I was happy to give it, and the agent then continued to the second-to-last step in the plan (running the tests) and stopped, telling me the code was "Ready for Review" and prompting me to hit a button to create the PR in my GitHub repository. The last step in the plan, squashing the migrations, was never attempted.

The tests passed and the refactoring was successful, but the whole process took around an hour. I'm interested in seeing how this progresses, as I think Gemini is the standout model across most tasks at the moment. Jules is a very early product and it shows, but so does the potential. I wonder how many IDE features Google plan on implementing in the browser.

Grade: C

Claude Code

I'll preface this by saying that I've used Claude Code via the CLI extensively since it launched (I'm at least $2000 deep), so I had some preconceptions here. However, the day this launched was also the day Opus 4 launched, a significant upgrade to the underlying model.

Setup was the most awkward of the three products, relying on the `claude` CLI tool and `gh` for GitHub operations. The process of authenticating with GitHub was similar to any other OAuth application, but Claude Code operates entirely as a GitHub-based tool here, and doesn't have any UI outside of interactions in Issues and PRs.

I copied and pasted my task description into a new ticket, then told it to begin with "@claude work on this ticket". This had two effects: the agent updated the ticket with a detailed plan, and a long-running GitHub Action kicked off to perform the work.

I liked being able to see the agent's plan, though there was no opportunity to amend it or provide further guidance. I also liked being able to open the GitHub Action log and see exactly which tools the agent was using, along with the very verbose output from the session.

Claude Code completed the task after about 12 minutes, having run a limited subset of the tests (just those for the newly refactored `money` application) and having completed the 'squash migrations' step, which the other agents failed to address.

I was prompted via an Issue update to create a PR, which triggered the project-wide tests to run, and one of them failed: one of the namespaced URLs in the application's main navigation hadn't been updated. I expect the agent would have caught this had it run the entire test suite rather than a subset. I've seen this behaviour in the claude CLI, too. I suspect it's a token-efficiency strategy, as verbose test-failure output can produce a LOT of tokens very quickly, but at the end of a task I'd still expect the agent to test the entire codebase.
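
This is the sort of regression that only surfaces when a page whose template still references the old namespace is actually rendered, which is why the narrow test run missed it. A hypothetical test along these lines (the URL name and test are mine, not the app's) would have flushed it out:

```python
# Hypothetical regression test: rendering a page whose template still
# contains {% url 'accounts:...' %} raises NoReverseMatch once the
# namespace has been renamed to 'money', so a full-suite run fails here.
from django.test import TestCase
from django.urls import reverse


class NavigationTests(TestCase):
    def test_page_with_main_navigation_renders(self):
        # "money:account_list" is an illustrative URL name, assumed to point
        # at a view whose template includes the main navigation.
        response = self.client.get(reverse("money:account_list"))
        self.assertEqual(response.status_code, 200)
```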

A comment on the newly created PR, "@claude can you fix the failing test", resulted in the bug being fixed in a further 4 minutes.

Grade: B