written by Eric J. Ma on 2026-02-01 | tags: agentic coding experiments logging reports journal plots iteration structure exploration
In this blog post, I share ten lessons I've learned from experimenting with agentic coding in data science, from setting clear goals and structuring projects to leveraging coding agents for faster iterations and better insights. I discuss practical tips like maintaining logs, generating diagnostic plots, and treating the agent as a partner in exploration. Curious how you can make AI your jazz partner in data science and boost your productivity?
Having tasted what agentic coding could look like for software development, I wanted to know what it would look like for data science - training machine learning models and answering scientific questions with data. So I started experimenting, both at work and on my own at home. Here are ten lessons I've learned from my experiments thus far.
Similar to building software, you need to know exactly what you want and how you'll evaluate the outcome. The difference is this: with software, you usually know what you need to build, but with data science, you only know which hypotheses need to be tested, which means you have to iterate your way to the answer. Even so, you can leverage coding agents to move quickly, and the parallels are striking: if you frame each question in terms of an observable outcome, you can set up your coding agent to write code whose output can be evaluated for correctness, just like a software test. Your ability to describe precisely the hypothesis you're exploring, and to state in precise language what the result would look like if the hypothesis held or not, is what enables the coding agent to figure out what would need to be true (within the codebase or the data) for your hypothesis to hold.
Here's an example from my work. In a machine learning experiment with synthetic data, I wanted to hit 100% sequence editing performance. (It was synthetic data, after all!) The coding agent hit a scenario where it was only achieving 25%. With that clear goal in mind, it proposed code edits, applied them, and re-ran experiments until it hit 100%. All without cheating; I know, because I checked!
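To make that concrete, here's a minimal sketch of what an outcome check like that could look like. The file name, directory, and column names below are hypothetical, not the actual code from my experiment; the point is that the success criterion is expressed as something a script can verify.

```python
# Sketch: turn "did we hit 100% sequence editing performance?" into a
# checkable outcome the agent can run after every experiment.
# Paths and column names are illustrative placeholders.
import pandas as pd

predictions = pd.read_csv("experiments/seq-editing/data/predictions.csv")
accuracy = (predictions["predicted_sequence"] == predictions["target_sequence"]).mean()

print(f"sequence editing accuracy: {accuracy:.2%}")
assert accuracy == 1.0, "Synthetic data should be perfectly recoverable; investigate before scaling up."
```

Because the check either passes or fails, the agent can treat it the way it would treat a software test: edit, re-run, and repeat until the assertion holds.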
The agent needs a predictable place for experiments. Just as a software repo has a conventional layout (src/, tests/, and so on), your experiments need a conventional layout so the agent knows where to put things and where to look. When doing machine learning experiments, I do my work inside an experiments folder. Underneath that, each experiment gets a README file plus plots, data, and scripts directories. I maintain pyds-cli, which helps you scaffold and manage data science projects with exactly this kind of structure, and I've also written an agent skill that enables this. Naming things logically helps, but the scheme matters more than the exact names; coding agents will follow the patterns you already have.
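As a rough sketch, here is a hand-rolled equivalent of that layout (pyds-cli and the agent skill handle this for real; the experiment name below is made up):

```python
# Sketch: scaffold the per-experiment layout described above.
from pathlib import Path


def scaffold_experiment(name: str, root: Path = Path("experiments")) -> Path:
    """Create README.md plus plots/, data/, and scripts/ for one experiment."""
    exp = root / name
    for subdir in ("plots", "data", "scripts"):
        (exp / subdir).mkdir(parents=True, exist_ok=True)
    readme = exp / "README.md"
    if not readme.exists():
        readme.write_text(f"# {name}\n\nGoal, hypothesis, and success criteria go here.\n")
    return exp


scaffold_experiment("seq-editing")
```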
With software, the feature you ask a coding agent to build either succeeds or fails, and this can be automatically verified with programmatically runnable unit and integration tests. With data science, experimental runs produce logs and metrics, but these don't reduce to a boolean pass/fail the way software tests do. In both cases, however, the agent can introspect logs to figure out what to change!
Your AGENTS.md file should include instructions for putting enough logging in place so the LLM can introspect what's going on during the experiment. I've written elsewhere about how to teach your coding agent with AGENTS.md and using AGENTS.md as repository memory for self-improving agents. Pair that with tools that run code in the terminal so the agent gets logs it can read. When the agent can read the logs, it can figure out what's wrong and what to change.
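Here's a minimal sketch of the kind of logging setup this implies, using Python's standard logging module: everything goes to the terminal (so the agent sees it when it runs the script) and to a per-experiment log file it can re-read later. The paths and the example log message are illustrative, not taken from my codebase.

```python
# Sketch: log to both the terminal and a file so the agent can introspect runs.
import logging
import sys
from pathlib import Path


def setup_logging(log_path: str = "experiments/seq-editing/train.log") -> logging.Logger:
    """Attach a terminal handler and a file handler with the same format."""
    Path(log_path).parent.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger("experiment")
    logger.setLevel(logging.DEBUG)
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (logging.StreamHandler(sys.stderr), logging.FileHandler(log_path)):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


logger = setup_logging()
logger.info("epoch=1 step=10 loss=0.42 masked_fraction=0.25")
```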
In my work, logging and printing to the terminal were what let my agent fix a masking strategy that was only yielding 25% correctness. It read the logs, proposed a fix, re-ran, and got us to where we needed to be. No intervention on my part. A three-day experiment became 20 minutes.
The agent can write code and read mountains of logs, but you need something else: a human-readable summary of what it observed and what looked weird, so you can triage without re-reading every log. Give your coding agent instructions (e.g. in an agent skill) to write out in plain language what it observed during the model evaluation phase. It should read execution logs throughout the run. Tell it to write down anything that looks weird for follow-up. If something is off, it should say so. You get a readable summary and a list of things to dig into.
For reports (e.g. reports.md), encode in the skill that every table and every plot must be scrutinized. Ensure that plots are generated for every table, and that someone (you or the agent, with you verifying) carefully checks for inconsistencies between AI-generated plots and the tables they are supposed to reflect. The agent can miss things. It is valid to ask the AI to check its own work, but only if you have an idea of exactly where it is wrong and you tell it exactly that. A vague "double-check this" rarely helps; "the values in figure 2 do not match the second column of table 1" gives the agent something it can fix.
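One way to make that check cheap, sketched below, is to derive every report table and its companion plot from the same dataframe so the two cannot silently drift apart. The file names and column names here are hypothetical.

```python
# Sketch: table and plot come from one summary dataframe.
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

results = pd.read_csv("experiments/seq-editing/data/metrics.csv")
summary = results.groupby("masking_strategy")["accuracy"].mean().reset_index()

# Table for the report (to_markdown needs the tabulate package installed).
Path("experiments/seq-editing/report_table.md").write_text(summary.to_markdown(index=False))

# Companion plot, generated from the exact same summary dataframe.
ax = summary.plot.bar(x="masking_strategy", y="accuracy", legend=False)
ax.set_ylabel("mean accuracy")
plt.tight_layout()
plt.savefig("experiments/seq-editing/plots/accuracy_by_strategy.png", dpi=150)
```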
Within the skill, instruct the agent to keep a single file (e.g. notes.md or journal.md) that it is told to only append to, never overwrite. The journal is not just for the agent. You should add to it too: things you noticed while looking at the data, gut feelings, weird patterns. It becomes a running log of what was going on, from both sides, that you can go back and summarize later. The point is to capture the thought process while you are doing the work.
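A tiny sketch of what an append-only journal helper might look like, with hypothetical paths and entries:

```python
# Sketch: append-only journal that both the agent and I write to.
from datetime import datetime, timezone
from pathlib import Path


def append_to_journal(entry: str, author: str, path: str = "experiments/seq-editing/journal.md") -> None:
    """Append a timestamped entry; opening in "a" mode means nothing is ever overwritten."""
    journal = Path(path)
    journal.parent.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with journal.open("a") as f:
        f.write(f"\n## {timestamp} ({author})\n\n{entry}\n")


append_to_journal("Accuracy plateaued at 25%; the masking strategy looks suspicious.", author="agent")
append_to_journal("Gut feeling: something is off with how the mask is applied.", author="eric")
```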
Logs and plots are complementary: logs are agent-accessible, while plots are human-accessible views of the same underlying performance data. Have the agent generate diagnostic plots for you. The agent can propose fixes from the logs, but it can't build your intuition; you're the one who has to smell when something is off, and prior experience is what lets you do that. Nothing beats looking at the data yourself - the performance data in the logs and the plots - otherwise you never build intuition for what's happening. I still looked at the logs and plots myself to make sure the metrics were real and the agent wasn't hallucinating.
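For instance, a diagnostic-plot script might look something like this sketch, with a hypothetical metrics file and column names:

```python
# Sketch: render the per-step metrics the agent logged as plots for human eyes.
import matplotlib.pyplot as plt
import pandas as pd

metrics = pd.read_csv("experiments/seq-editing/data/step_metrics.csv")

fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
ax_loss.plot(metrics["step"], metrics["loss"])
ax_loss.set(xlabel="step", ylabel="loss", title="Training loss")
ax_acc.plot(metrics["step"], metrics["accuracy"])
ax_acc.set(xlabel="step", ylabel="accuracy", title="Sequence editing accuracy")
fig.tight_layout()
fig.savefig("experiments/seq-editing/plots/diagnostics.png", dpi=150)
```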
With software, you run tests in seconds. With ML, you're tempted to skip straight to the real data and the real training run, which can take hours. As a human, you don't want to "waste" time proving out the pipeline when you could just run the full thing. That temptation is exactly why you should instruct the agent to do the opposite: write the minimalist version first, then use it to catch elementary errors before scaling up.
That means train for one iteration, not even one epoch. Use miniature versions of the final model (e.g. a tiny custom deep net with the same architecture but a fraction of the parameters). Check for shape errors, data-loading bugs, and that the forward pass runs end to end. All the sanity checks you would do manually to prove that things work, but that you are tempted to skip. Encode in AGENTS.md that the agent must implement and run this minimal version before moving to full-scale training. The agent does not have your impatience; use that. Once the minimal run passes, you can scale up with confidence.
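Here's a sketch of such a smoke test, assuming PyTorch and placeholder dimensions rather than the real model or data:

```python
# Sketch: one forward and backward pass on a tiny model and a tiny batch,
# just to catch shape errors and data-loading bugs before the real run.
import torch
import torch.nn as nn

vocab_size, seq_len, batch_size = 32, 16, 4

# Miniature stand-in for the real model: same overall shape of the problem,
# a tiny fraction of the parameters.
model = nn.Sequential(
    nn.Embedding(vocab_size, 8),
    nn.Flatten(),
    nn.Linear(8 * seq_len, vocab_size),
)

tokens = torch.randint(0, vocab_size, (batch_size, seq_len))
targets = torch.randint(0, vocab_size, (batch_size,))

logits = model(tokens)  # shape check: (batch_size, vocab_size)
loss = nn.functional.cross_entropy(logits, targets)
loss.backward()  # one iteration, not even one epoch
print(f"smoke test passed: loss={loss.item():.3f}, logits shape={tuple(logits.shape)}")
```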
I do this often with large software refactors within canvas-chat, in which I ask the coding agent to prioritize for me a list of manual checks I need to look at. This is particularly helpful when I'm (a) context switching back into the project, or (b) running on fumes but my gut tells me we're so close to the end.
The same applies to data science and data exploration! After having the coding agent autonomously execute on your experiment, you can have it walk through what it's done step by step, giving you the space to operate at your pace - at the speed of your thought! Of course, if you're in a better state than merely "running on fumes", you can (and should) treat the coding agent as a research partner and ask questions back to critically evaluate whether the output is correct or not. What I have found is that there will still be unexplored paths that need to be trodden, and you can send a coding agent off in that direction on the side.
Pay attention to the terms the agent uses when it describes what it did. You can reuse that vocabulary in future prompts and get more precise results. For example, in developing canvas-chat, I used this non-optimal verbiage in my prompt:
ok, I see, the default node size made it such that the next/prev buttons were hidden away. Can we make the pagination controls visible regardless of the size of the node?
Cursor's agent replied with something like "making the pagination toolbar sticky". That gave me a more compact way to express exactly what I needed the next time. If you don't know the right terms at first, this is also a great way to expand your technical vocabulary.
What you don't want in agentic data exploration is for the coding agent to hand you a boatload of output and leave you no room to follow your own curiosity. Flip the table: treat the agent as an executor of your ideas. You lead; it follows. Instruct it that it is not allowed to race ahead. It should only execute on the one thing you want, and it should ask you questions to clarify and narrow down what you actually want before it goes and does it. In other words, it is there to be a jazz partner for your data exploration.
You can run that partnership a few ways. One is to have the agent write scripts that produce plots on disk; you run them, look at the output, then ask for the next thing. Another is to go one level higher and work inside Marimo notebooks, using Marimo's reactive execution so you go one cell at a time, one question at a time. I've written about using coding agents to write Marimo notebooks if you want to try that path.
The agents handle the implementation. You handle the inquiry. The ten practices above - prescriptive goals, clear structure, logging, reports, an append-only journal, diagnostic plots and your own eyes on the data, the minimalist version first, having the agent guide you step by step when you need it, learning the agent's vocabulary, and, in exploration, keeping the agent as your jazz partner - are what make that partnership work. I've spent nearly a decade training ML models by hand, so I know what I want, and I have developed a sense of taste for what success looks like. You can get to the same level of taste with AI assistance, but you must work for it. I'll write separately about how I'm learning new things with AI. The point is not to hand off the science, but to do more of it!
@article{ericmjl-2026-how-to-do-agentic-data-science,
    author = {Eric J. Ma},
    title = {How to Do Agentic Data Science},
    year = {2026},
    month = {02},
    day = {01},
    howpublished = {\url{https://ericmjl.github.io}},
    journal = {Eric J. Ma's Blog},
    url = {https://ericmjl.github.io/blog/2026/2/1/how-to-do-agentic-data-science},
}
I send out a newsletter with tips and tools for data scientists. Come check it out at Substack.
If you would like to sponsor the coffee that goes into making my posts, please consider GitHub Sponsors!
Finally, I do free 30-minute GenAI strategy calls for teams that are looking to leverage GenAI for maximum impact. Consider booking a call on Calendly if you're interested!