12 Debugging with a large language model
Some programmers use large language models (LLMs) like ChatGPT and GitHub Copilot as tools to help them get past bugs in their code. As with any other tool, it’s helpful to know what language models can and can’t do, so you can be strategic about how you choose to use them.
12.1 What is an LLM?
LLMs are a kind of machine learning model that represents patterns in huge datasets of text. They can be used not only to represent text, but also to generate text that can sound remarkably conversational and fluent. These models are trained on vast collections of text from the internet, which include a lot of coding examples and user discussions. So for a popular programming language like Python, many common beginner mistakes are well represented in the model’s training data. This training data, combined with recent advances in language model architecture that allow models to capture long-range patterns and structure in text, gives modern LLMs some useful features for debugging help.
12.2 What an LLM is not
By default, unless you explicitly read otherwise, assume that LLMs are not search engines. They don’t run a Google search and report the results back to you. They only have access to the data you share with them in a prompt and the data the model developers collected for training. They also don’t run computations or calculations like a computer or calculator. So the model can’t run your code, and if it presents you with a code suggestion, you shouldn’t automatically assume it is runnable or that the logic it represents is entirely correct.
LLMs may generate text that corresponds to what is true, but they aren’t designed with an explicit fact-checking mechanism. They may produce confident-sounding but ultimately incorrect responses, like suggesting you use a library function that doesn’t actually exist. Treat everything that comes out of an LLM, and we mean everything, as a hypothesis that must be tested. You will still have to think through whether these suggestions are reasonable and supported!
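One quick way to test that kind of hypothesis is to ask Python itself whether a suggested function actually exists before you build anything on top of it. Here is a minimal sketch; the suggested function name is deliberately made up to stand in for a hallucinated suggestion.

```python
import math

# Suppose an LLM suggested calling math.integrate() to do numerical integration.
# Before trusting the suggestion, check whether the function really exists.
suggested_name = "integrate"  # hypothetical, hallucinated function name

if hasattr(math, suggested_name):
    print(f"math.{suggested_name} exists; read its help() before using it")
else:
    print(f"math.{suggested_name} does not exist; the suggestion was a dead end")
```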
12.3 Prompting strategies for debugging
You can actually use LLMs to help you think through your code at each stage of the OILER framework.
12.3.1 Orient yourself.
- Prompt the LLM with a line of code or a user-defined function and ask, “What does this do?”
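For example, you might paste a short user-defined function like the one below into the prompt and ask for a plain-language explanation. This is only a hypothetical sketch of the chapter’s create_combined_inventory function; the code in your own program will differ.

```python
def create_combined_inventory(counts, prices):
    """Combine per-product counts and prices into one inventory dictionary."""
    inventory = {}
    for product, count in counts.items():
        inventory[product] = {
            "count": count,
            "price": prices[product],  # raises KeyError if a product has no price
        }
    return inventory
```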
12.3.2 Investigate the symptoms.
- Pick a keyword from the error message: “What does KeyError mean?”
- Put in the whole error message and ask for an explanation.
- Ask it to help you investigate the state of a program at a certain point: “How do I print out all the keys in my dictionary at line 32?”
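The inspection itself is usually only a line or two that you drop in just before the failing line. A minimal sketch, using a stand-in for the chapter’s inventory dictionary:

```python
inventory = {"orange": {"count": 4}}  # stand-in for the real data in your program

# Temporary debugging output, placed just before the line that raises the error:
print(list(inventory.keys()))            # which products exist?
print(list(inventory["orange"].keys()))  # which fields does "orange" have?
```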
Some beginner programmers use LLMs to help them find the line in their program where the program stopped running, for example, where a syntax error occurs. While LLMs can usually point out a missing parenthesis in a few lines of code, this information is also frequently in the error message itself. If you’re relying on LLMs for this, it might be a sign that you’re not reading the corresponding line number in the error message for yourself. It’s still a good idea to be able to read error messages for clues like line numbers and type keywords.
12.3.3 Locate the root cause.
- Ask it to generate hypotheses. “Given my code, what could cause the ‘price’ key to be missing from the dictionary inventory[‘orange’]?”
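To make the hypotheses concrete, here is a minimal sketch with made-up values showing one way the bug can arise: if the price data never mentions “orange”, the inventory entry for “orange” is built without a “price” key, and a later lookup fails.

```python
prices = {"apple": 0.50, "banana": 0.25}  # note: no entry for "orange"

inventory = {}
for product in ["apple", "banana", "orange"]:
    inventory[product] = {"count": 10}
    if product in prices:
        inventory[product]["price"] = prices[product]

# inventory["orange"] has no "price" key, so this line would raise KeyError: 'price'
# print(inventory["orange"]["price"])
```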
12.3.4 Experiment with a fix.
- You can prompt the model to suggest fixes. “How can I create the ‘prices’ dictionary so that missing prices default to 0?”
- Ask the model to suggest fixes you might not have considered. “How else could I handle a missing ‘price’ key in the dictionary ‘prices’?”
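Two fixes the model might plausibly suggest are sketched below with made-up data: provide a default value at lookup time, or build the prices dictionary so that unknown products automatically map to 0.

```python
from collections import defaultdict

raw_prices = {"apple": 0.50, "banana": 0.25}  # no entry for "orange"

# Option 1: supply a default when you read the price.
orange_price = raw_prices.get("orange", 0)  # returns 0 instead of raising KeyError

# Option 2: build prices as a defaultdict so any missing product defaults to 0.
prices = defaultdict(int, raw_prices)
print(orange_price)      # 0
print(prices["orange"])  # 0 (and the key is added with value 0)
```

Note that the second option silently inserts the missing key, which may or may not be what you want. That trade-off is exactly the kind of thing to think through before accepting a suggestion.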
12.3.5 Reflect.
- Ask the model to suggest comments or other forms of documentation for the parts of the code you’ve modified.
- Ask the model to draft tests, like assert statements, that will help you avoid unwanted behaviors in the future.
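For example, you might keep a few assert statements like these around your fixed code, written here against a hypothetical inventory structure. If a future change drops the “price” key again, the checks fail loudly instead of letting the bug slip back in unnoticed.

```python
inventory = {
    "apple": {"count": 10, "price": 0.50},
    "orange": {"count": 4, "price": 0},  # price defaulted to 0 after the fix
}

# Simple regression checks for the combined inventory.
for product, item in inventory.items():
    assert "price" in item, f"{product} is missing a price"
    assert item["price"] >= 0, f"{product} has a negative price"
```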
12.4 Things to watch out for
One of the most challenging aspects of working with LLMs as a programming aid is that they don’t work exactly as we might intuit, and when they produce code suggestions or explanations that are incorrect, these errors can be very different from the kinds of errors a person would make.
12.4.1 LLMs are non-deterministic.
If you ask an LLM the same prompt repeatedly, it may give a different response each time. Here are four different diagnoses of the bug we explored together, produced by four identical promptings of the Llama 2b model:
1. The error message you’re seeing is caused by the `price` key not being found in the `inventory` dictionary. This is happening because the `prices` CSV file does not have a column for the price of each product.
2. This error occurs because the `prices` dictionary, which is used to create the `inventory` dictionary, does not have a key named `'price'`. The `prices` dictionary only has two columns, `product` and `price`, so it does not have a separate column for the price of each product.
3. The `create_combined_inventory` function is trying to access the `price` key for “oranges” in the `inventory` dictionary, but it does not exist.
4. The error message indicates that there is a key error in the `inventory` dictionary, specifically with the key `price`. The `price` key is not found in the dictionary, which is why the code is raising a `KeyError`.
These are all slightly different, and not all correct (only one is exactly right).
12.4.2 LLMs have limited scope.
The model only has access to the text you’re providing it, so if you’re dealing with an error that originates outside the scope of the code you’ve shared (e.g., a data quality issue in your data files, not your program), the LLM lacks information about it.
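In the running example, that means opening the prices CSV file yourself, because the model never sees it. A minimal sketch follows; the filename and column names are assumptions for illustration, not something the chapter specifies.

```python
import csv

# The LLM cannot open your data files, so check the data yourself.
with open("prices.csv", newline="") as f:
    reader = csv.DictReader(f)
    print("columns:", reader.fieldnames)
    for row in reader:
        if not row.get("price"):  # empty or missing price value
            print("missing price for:", row.get("product"))
```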
12.4.3 Keep it short.
LLMs tend to perform best on tasks that involve relatively short blocks of code, so where you can, prompt with relevant chunks of code rather than an entire codebase. Similarly, the longer a generated code suggestion is, the more opportunities there are for errors. Try to avoid accepting multiple lines of generated code at once unless you are confident you can carefully read and verify each line.