OpenAI Codex Tips: A Practical Guide for Product Managers and Non-Engineers

I have been using Codex more and more recently. Not in a “look, AI wrote me a todo app” kind of way. More like: I need to ship my actual product, fix real bugs, improve real tests, and not spend my evenings arguing with a robot about why it deleted half of my codebase.

And, as usual with AI tools, the difference between “this is magic” and “this is useless” is often not the model itself. It is how you use it.

So here are a few practical Codex tips that made the biggest difference for me.

Use the app, not just the web

I started using Codex in the web version and I was pretty happy with it.

The flow was simple enough: create a task, review the result, then open a pull request into my repo. It already felt useful. It was not perfect, but it saved me time and helped me move faster.

Then I installed the Codex app on Windows.

That improved my vibe coding life instantly.

The difference is not just the interface. The app can work closer to your actual development environment. It can inspect the repo properly, run commands, use terminals, validate changes, and test its own work in a more realistic setup.

And this matters a lot.

A coding agent that only edits files is useful. A coding agent that can edit files, run the project, read errors, fix them, run tests again, and show you the result is much more useful.

Since I started using the app, the quality of output has improved noticeably. I spend less time reviewing obvious mistakes. I spend less time writing “no, that broke the build”. And most tasks now take fewer prompts to complete.

The agent still needs supervision. It still makes mistakes. But it is a very different experience when it can validate its own assumptions instead of throwing a patch over the wall and hoping for the best.

Always start in planning mode

This might be the simplest improvement with the highest return.

Do not start by asking Codex to immediately change the code. Start by asking it to inspect the repo and create a plan.

This gives you a chance to check whether it actually understood the task. More importantly, it lets you catch bad assumptions before they become bad code.

When reviewing the plan, I usually pay attention to three things:

Does it understand the actual problem?
Is it changing the right files?
How is it planning to test or validate the result?

The third point is the most important one.

A vague validation step like “run tests” is often not good enough. Which tests? Is there a relevant test? Does the project even have tests for that area? Should it run the game, build the app, check logs, test a specific flow, or inspect a generated file?

The better the validation plan, the better the final output.

Planning mode also helps you avoid accidental overengineering. Sometimes Codex will try to solve a small issue with a massive refactor. Sometimes it will suggest creating abstractions you absolutely do not need. Catching that in the plan is much cheaper than discovering it after 900 changed lines.

Write a proper AGENTS.md

AGENTS.md is one of those things that feels optional until you finally use it properly.

Then you realise how much repeated prompting it saves.

Instead of explaining your project rules again and again, write them once. Things like:

how the project is structured
how to run tests
how to build the app
coding conventions
naming conventions
things the agent should never touch
platform-specific gotchas
how you want changes validated
what “done” means in your repo

The important bit: Codex reads AGENTS.md before starting work, so you do not need to reference it every time.

This is especially useful when your project has weird edges. And every real project has weird edges.

For example, in my game project, there are platform-specific details, asset rules, Godot conventions, Android build quirks, and store publishing requirements. I do not want to explain all of that every time I ask the agent to fix a menu bug.

AGENTS.md turns that knowledge into reusable context.

One warning though: do not turn it into a novel. If the file becomes too long, vague, or full of outdated rules, it becomes noise. Keep it practical. Keep it specific. Update it when you notice Codex repeatedly misunderstanding something.
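
To make "keep it practical" a bit more concrete, here is a minimal Python sketch that flags an AGENTS.md which has grown too long or never mentions the basics. The 200-line limit, the topic keywords, and the function name check_agents_md are my own assumptions for illustration. Codex does not require any of them.

```python
# Hypothetical lint for AGENTS.md. The limit and topics are illustrative assumptions.
MAX_LINES = 200
EXPECTED_TOPICS = ("test", "build", "never")


def check_agents_md(text: str, max_lines: int = MAX_LINES) -> list[str]:
    """Return human-readable warnings; an empty list means nothing was flagged."""
    warnings = []
    lines = text.splitlines()
    if len(lines) > max_lines:
        warnings.append(f"too long: {len(lines)} lines (limit {max_lines})")
    lowered = text.lower()
    for topic in EXPECTED_TOPICS:
        # A crude substring check: it only tells you the topic is never mentioned.
        if topic not in lowered:
            warnings.append(f"no mention of '{topic}'")
    return warnings


if __name__ == "__main__":
    sample = "Run tests with the project's test command.\nNever edit generated files.\n"
    for warning in check_agents_md(sample):
        print(warning)
```

A check like this will not tell you whether the rules are any good. It only catches the two easy failure modes: a file that has become a novel, and a file that forgot to say how to test or build.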

Give Codex examples from your own codebase

Codex is much better when you show it what “good” looks like inside your repo.

If you want a new screen, point it to an existing screen you like.

If you want a new test, point it to a similar test.

If you want a new service, show it the service pattern already used elsewhere.

This avoids the classic AI problem: technically correct code that looks like it came from a different project.

Most codebases have local style. Not just formatting, but small decisions. How errors are handled. How files are named. How state is passed around. How UI components are structured. How logs are written. How tests are organised.

Codex can infer some of that, but it does much better when you explicitly anchor the task in examples.

A prompt like this works much better:

Add this feature using the same structure as existing_feature.gd. Follow the same naming style, signal pattern, and validation approach. Do not introduce a new architecture unless necessary.

This sounds boring. It works.

Keep tasks small enough to review

One of the biggest traps with AI coding agents is giving them tasks that are too large.

It is tempting. The agent seems powerful, so you ask it to “refactor the whole save system, improve the UI, fix achievements, and add analytics while you are there”.

That is how you get a giant diff nobody wants to review.

Codex is better when the task has a clear boundary. One bug. One feature. One refactor. One test suite. One screen. One integration step.

Not always tiny, but reviewable.

The size of the task should match the size of your trust. If it is an area of the codebase you know well and the agent has handled before, give it more freedom. If it is risky, new, or business-critical, narrow the scope.

The goal is not to make Codex do the biggest possible task. The goal is to ship correct work faster.
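
If you want a rough guard against giant diffs, something like the sketch below can help. It sums the added and deleted lines reported by `git diff --numstat` and compares the total to a limit. The 400-line threshold and the function names are assumptions of mine, and the sample input is made up.

```python
# Illustrative threshold; pick whatever you can realistically review.
MAX_CHANGED_LINES = 400


def count_changed_lines(numstat: str) -> int:
    """Sum added and deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue
        added, deleted = parts[0], parts[1]
        # Binary files are reported as "-" and are skipped here.
        if added.isdigit() and deleted.isdigit():
            total += int(added) + int(deleted)
    return total


def is_reviewable(numstat: str, limit: int = MAX_CHANGED_LINES) -> bool:
    """True when the diff stays within the chosen line budget."""
    return count_changed_lines(numstat) <= limit


if __name__ == "__main__":
    # Made-up sample in numstat format: added, deleted, path (tab-separated).
    sample = "10\t2\tsrc/menu.gd\n-\t-\tassets/icon.png\n500\t0\tsrc/save.gd\n"
    print(count_changed_lines(sample), is_reviewable(sample))
```

In practice you would feed it the real output of `git diff --numstat` instead of the sample string. Line count is a blunt measure, but it is a cheap early signal that a task was scoped too widely.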

Ask for tests, but do not blindly trust them

One of the best uses of Codex is writing tests.

One of the worst mistakes is assuming that AI-written tests prove anything.

Codex can write useful tests, but it can also write tests that simply confirm the implementation it just created. This is not always intentional. It is just what happens when the same agent writes both the code and the validation.

So I usually ask for tests in a specific way.

Not “write tests”.

More like:

Add tests that would fail with the current bug and pass after the fix. Explain why each test would fail before the change.

That forces the agent to connect the test to the actual behaviour, not just to coverage theatre.
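
For readers who have not written many tests, here is a small Python sketch of what "fails with the bug, passes after the fix" can look like. The function level_progress and its bug are invented for this example. They are not from my project or from Codex.

```python
# Hypothetical example: level_progress and its bug are invented for illustration.
def level_progress(collected: int, total: int) -> int:
    """Return completion as a whole percentage from 0 to 100."""
    # The imagined buggy version divided unconditionally and crashed with
    # ZeroDivisionError on levels that have no collectibles.
    if total <= 0:
        return 0
    return min(100, collected * 100 // total)


def test_empty_level_reports_zero_progress():
    # Before the fix: raised ZeroDivisionError. After the fix: returns 0.
    assert level_progress(0, 0) == 0


def test_normal_level_is_unchanged():
    # Guards against the fix altering behaviour that already worked.
    assert level_progress(3, 4) == 75


if __name__ == "__main__":
    test_empty_level_reports_zero_progress()
    test_normal_level_is_unchanged()
    print("all tests passed")
```

The comment on the first test is the part I care about. It states why the test would have failed before the change, which is exactly the explanation the prompt asks the agent to provide.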

For bug fixes, regression tests are especially valuable. For refactors, I want tests or validation steps that prove behaviour did not change. For UI changes, I want screenshots, logs, or a clear manual validation checklist if automated tests are not realistic.

Tests are not magic. But they are a very good way to make the agent less hand-wavy.

Use screenshots and logs aggressively

Do not describe everything from memory.

If something is broken visually, give Codex a screenshot.

If a build fails, give it the actual logs.

If a store submission rejects something, give it the exact warning or error.

If a UI layout looks wrong, show the before and describe the desired after.

Agents are much worse when they have to guess. They are much better when they can inspect evidence.

This is also where the desktop app experience helps. If Codex can run the project, inspect output, and use your local files, you get a much tighter loop.

The worst version of vibe coding is:

It does not work. Fix it.

The better version is:

This screen is cropped on a 1080x1920 device. Here is the screenshot. The top HUD should remain visible, the board should stay centred, and the bottom controls should not overlap. Inspect the layout code first, propose a plan, then make the smallest fix.

Same task. Very different result.
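
Build logs can be thousands of lines long, so it sometimes helps to cut them down to the part that matters before pasting. The sketch below keeps every line that contains a keyword plus a little surrounding context. The keywords, the function name extract_error_context, and the sample log are all assumptions for illustration. Real tools word their errors differently, so adjust the keywords to what your build actually prints.

```python
def extract_error_context(log: str, keywords=("error", "failed"), context: int = 2) -> str:
    """Keep lines containing any keyword, plus `context` lines on each side."""
    lines = log.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if any(keyword in line.lower() for keyword in keywords):
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            keep.update(range(start, end))
    # Preserve the original order of the lines that were kept.
    return "\n".join(lines[i] for i in sorted(keep))


if __name__ == "__main__":
    # Made-up log, not the output of any real tool.
    sample_log = "\n".join([
        "step 1 ok",
        "step 2 ok",
        "step 3 ok",
        "ERROR: something broke",
        "step 5 skipped",
        "step 6 skipped",
    ])
    print(extract_error_context(sample_log, context=1))
```

The excerpt still contains the exact wording of the failure, which is the point. When in doubt, attach the full log as well.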

Do not forget skills and plugins

Codex is not just a text box where you throw coding requests.

It can use skills and plugins, and those are worth exploring.

Skills are useful when you have repeatable workflows. For example: generating store assets, reviewing tests, preparing release notes, checking localization files, or running a specific validation process.

Plugins can bundle reusable workflows, app integrations, and other setup into something easier to use across projects or teams.

This matters because a lot of coding work is not really “coding”. It is process. Check this file. Run this command. Compare this output. Follow this convention. Update this changelog. Validate this package. Repeat forever.

The more of that you package into reusable instructions or tools, the less you need to micromanage each task.

Browse the gallery. Look at what already exists. And when you notice yourself repeating the same prompt for the fifth time, turn it into a durable instruction or a skill.

Change models depending on the task

The latest model is usually the best. It is also usually the most expensive.

Not every task needs the best model.

If I am doing something complex, risky, or architectural, I want the strongest model available. If I am asking Codex to update documentation, add obvious tests, fix small copy, rename strings, or follow a very clear pattern, a cheaper model is often good enough.

This is especially relevant once you start using agents heavily. Token usage can grow quietly. One task is cheap. A hundred tasks are not.

The trick is not to always use the cheapest model. That often costs more in the end because you spend time correcting bad work.

The trick is to match the model to the task.

Use stronger models when ambiguity is high. Use cheaper models when the pattern is obvious.
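
Written down as a rule of thumb, the decision can be sketched like this. The tier labels "strong" and "cheap", the Task fields, and the function name pick_model_tier are hypothetical. They are not real model names or Codex settings.

```python
from dataclasses import dataclass


@dataclass
class Task:
    description: str
    ambiguous: bool = False        # requirements or approach are still unclear
    risky: bool = False            # touches business-critical or fragile code
    follows_pattern: bool = False  # there is a clear existing example to copy


def pick_model_tier(task: Task) -> str:
    """Return a hypothetical tier label, not a real model name."""
    if task.ambiguous or task.risky:
        return "strong"
    if task.follows_pattern:
        return "cheap"
    # Unsure either way: default to quality, since fixing bad work costs time too.
    return "strong"


if __name__ == "__main__":
    print(pick_model_tier(Task("rename menu strings", follows_pattern=True)))
    print(pick_model_tier(Task("redesign the save system", ambiguous=True)))
```

Defaulting to the stronger tier when nothing is known is a choice on my part. If most of your tasks are routine, you may prefer the opposite default.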

Review diffs like a product person, not just a developer

When reviewing Codex output, do not only ask “does this compile?”

Ask:

did it solve the actual user problem?
did it introduce a new edge case?
did it make the product behaviour more confusing?
did it change something outside the requested scope?
did it add complexity that will hurt you later?
did it preserve the existing feel of the product?

This is where product managers can actually be good at using coding agents.

You do not need to personally write every line of code to evaluate whether the change makes sense. But you do need to stay engaged. The agent can generate implementation. It cannot own product judgement.

That part is still your job.
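
One of those questions, whether the change stayed inside the requested scope, can be partly checked mechanically. Here is a minimal sketch that takes a list of changed file paths (for example from `git diff --name-only`) and reports the ones outside the folders you expected. The function name out_of_scope and the example paths are assumptions for illustration.

```python
def out_of_scope(changed_paths, allowed_prefixes):
    """Return the changed paths that do not start with any allowed prefix."""
    allowed = tuple(allowed_prefixes)
    return [path for path in changed_paths if not path.startswith(allowed)]


if __name__ == "__main__":
    # Hypothetical paths for a task that should only touch the menu UI.
    changed = ["ui/menu.gd", "ui/menu_theme.tres", "save/save_system.gd"]
    for path in out_of_scope(changed, ["ui/"]):
        print(f"outside requested scope: {path}")
```

A file showing up here is not automatically wrong. It is a prompt to ask why the agent touched it, and the remaining questions on the list still need a human.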

Be careful with “while you are there”

“While you are there” is dangerous.

It sounds efficient, but it often creates messy diffs.

Fix this bug, and while you are there, clean up this file, and while you are there, improve the tests, and while you are there, rename this thing, and while you are there…

Now you have a pull request that changes behaviour, structure, naming, and tests all at once. If something breaks, good luck finding the cause.

I try to separate tasks:

fix the bug
add regression coverage
refactor if still needed
clean up naming if still worth it

This feels slower. In practice, it is faster because review becomes easier and mistakes are easier to isolate.

Agents make it cheap to generate code. They do not make it cheap to understand messy changes.

Provide feedback when you have time

Codex makes it easy to provide feedback on each task.

Use it.

I do not know exactly how OpenAI processes every piece of feedback, but I have noticed the product improving over time. Things that used to be clunky get better. Workflows get smoother. Bad behaviours become less common.

And honestly, if you are using these tools seriously, you want them to improve in the direction of real work, not demo work.

So when Codex does something clearly good, say so. When it fails in a repeatable way, report it. When the interface gets in your way, send feedback.

It takes a few seconds, and it helps shape the tool you are increasingly relying on.

The main lesson

Codex is not a magic senior engineer in a box.

It is also not a toy.

It is closer to a very fast, very literal, sometimes brilliant, sometimes careless junior engineer with access to your repo and no fear of changing files.

That means your job changes.

You need to give it context. You need to constrain the task. You need to define what good looks like. You need to review the plan. You need to check the validation. You need to protect the product from unnecessary complexity.

The people getting the best results from coding agents are not the ones writing the longest prompts. They are the ones building better working systems around the agent.

Use the app. Start with a plan. Maintain AGENTS.md. Package repeatable work into skills. Pick the right model. Keep tasks reviewable. Give feedback.

That is not as exciting as “AI will build the whole app for you”.

But it is much closer to how useful work actually gets done.

Test-n-Tell - tiny apps, Big ideas
