Writing prompts has often made me feel like I’m learning a new language. I’m not, of course, but the responses I get aren’t always what I expect.
Years ago, I lived in a small Mexican city for a few months. I’m not a fluent Spanish speaker. At one point, an awesome tutor had helped me reach an intermediate level, but by the time I moved to Mexico I wasn’t using the language as much. Still, I knew enough to navigate my way around the city.
One night, I stopped into a pharmacy to buy a box of Band-Aids. After searching all over the store, I eventually asked a couple of workers behind the counter where to find them.
I had never needed the Spanish word for bandage in a real-life situation, so I had forgotten it. I typed “bandage” into Google Translate, and it gave me the word “vendas.” One of the workers went to the back and grabbed something that looked like gauze.
This wasn’t what I wanted, so I pulled up an image on my phone and showed it to them. Realization hit immediately, and one of the workers exclaimed “curitas.” Of course, I blamed the miscommunication on Google Translate. In the end, I bought what I needed, and the workers enjoyed a good giggle to end their workday.
That feeling of going back and forth with the workers before getting the Band-Aids reminds me of the frustration that can come from working with AI. No matter how many best-practice guides on prompt writing exist, they won’t help if the LLM interprets my prompt differently than I intended.
When AI outputs aren’t what we expect, it’s easy to declare the model, or AI overall, broken. On the other side, it’s tempting to blame the prompt writer for lacking the communication skills to clearly define what they want.
The real issue lies somewhere in the middle. Like humans, AI isn’t perfect, and it never will be. To bridge that gap, we have to learn how to work with LLMs.
You Need a Plan
When we write code, we know that if we write a function that takes two numbers and adds them together, the return value will be their sum. We know this because we planned it that way. But before any of us could do this work almost in our sleep, we had to learn how. We had to understand the fundamentals.
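To make that contrast concrete, here’s a trivial Python sketch of the kind of predictable behavior I mean. Nothing here is tied to any particular project; it’s purely illustrative:

```python
def add(a: float, b: float) -> float:
    """Return the sum of two numbers."""
    return a + b

# The output is exactly what we planned: no interpretation involved.
print(add(2, 3))  # 5
```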
The issue with AI is more subtle. LLMs have already been trained on enormous amounts of data, but they haven’t been trained on how we as individuals think. They don’t know my coding style unless I tell them. They won’t know how I want components broken down unless I give them the details.
That’s where planning comes in: it forces you to think about what you actually want. Before asking an LLM to solve something, ask yourself: Why is this even an issue in the first place? What should the results actually look like? How would I evaluate whether the output successfully solves my problem?
When I’ve worked with AI on projects, especially when we’ve built tools to help others do the same, I’ve noticed that teams who start with this kind of planning get dramatically better results. They’re not just throwing prompts at a wall. They’re being intentional about what they need and how they’ll know they got it right.
Comparing Results Matters
It’s easy to label a model or a prompt as terrible when we don’t get what we expect. But are we being encouraged to test and evaluate before writing things off? You’ll often see claims that “Opus is better than GPT-5.2 for a certain task.” These are almost always opinions, subjective takes. They may not be wrong, but there’s rarely any discussion of the testing or evaluation process behind them.
Changing models and prompts is how we test, but if those changes all happen inside a single chat, or aren’t easy to compare side by side, the exercise is somewhat meaningless. You don’t actually learn what worked and what didn’t.
What I’ve come to value about structured evaluation is the ability to create multiple experiments, test different models and prompt variations, and easily view their outputs side by side. Not only does it help verify which model is best suited for a particular task, it also lets you see the differences in prompts and other configuration settings. You can actually learn from the experiments instead of hoping for the best.
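As a rough sketch of what I mean (this is not Endeavor’s actual API, and `call_model` is a hypothetical placeholder you’d swap for a real client), a structured experiment can be as simple as running every model and prompt combination and keeping the outputs together so they can be compared:

```python
from itertools import product

# Hypothetical placeholder -- in practice, swap this for a real model client call.
def call_model(model: str, prompt: str) -> str:
    return f"(stub output from {model} for: {prompt[:40]}...)"

models = ["model-a", "model-b"]  # candidate LLMs; names are illustrative
prompts = {
    "terse": "Summarize this bug report in one sentence: {report}",
    "detailed": (
        "You are a support engineer. Summarize the bug report below, "
        "noting the affected component and severity: {report}"
    ),
}
report = "Checkout page times out when the cart has more than 50 items."

# Run every model x prompt combination and keep the results together,
# so the outputs can actually be compared side by side.
results = []
for model, (label, template) in product(models, prompts.items()):
    output = call_model(model, template.format(report=report))
    results.append({"model": model, "prompt": label, "output": output})

for r in results:
    print(f"{r['model']:>8} | {r['prompt']:>8} | {r['output']}")
```

Even a small harness like this makes the comparison explicit: the prompt variations, the models, and the outputs sit next to each other, so you can judge them against the criteria you wrote down during planning instead of relying on memory of a single chat.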
The Real Issue Isn’t the AI
The frustration people feel when working with AI often comes from the same place as my moment in that Mexican pharmacy: miscommunication. The good news is that once you understand it isn’t about having the “right” prompts or about AI being broken, you can approach writing prompts differently.
Plan intentionally. Test systematically. Evaluate honestly. These practices work with any tool, any model, any LLM. They may not guarantee the results you expect, but they transform the process from guessing to learning.
Over the past few months, we at Rotational have been building a platform called Endeavor that’s specifically designed to make these practices easier for teams. It’s a tool built from the ground up to support planning, testing, and evaluation of AI agents for custom use cases. The principles behind it are the ones I’ve shared here, and those principles are what matter most, regardless of what tool you use.
Image generated using GPT-5