Evaluating What Large Language Models Produce

Behavior Analysts are likely already being asked to adopt Large Language Model (LLM) tooling. The promises made by vendors may sound familiar: faster documentation, reduced administrative burden, and more time back in your day. These promises are often enough to generate enthusiasm from organizational leaders eager for a more efficient workforce, increased operational output, and improved decision-making. But the question that needs to be asked, and one behavior analysts are uniquely positioned to answer, is: does any of this actually work? Does it reduce documentation time, and to what degree? Does the quality of the output hold up under scrutiny? And does the time saved justify what may be lost in accuracy, consistency, or clinical utility?

Many practitioners in Applied Behavior Analysis work within healthcare contexts, which demands that the outputs of any system we use be defensible. That means both understanding what a system is producing and being able to explain it to clients, caregivers, supervisors, and payers.

So, as behavior analysts, what should we do? Fortunately, we are well trained in measurement techniques. We can define behaviors operationally, select measurement systems, collect data across conditions, and establish interobserver agreement before drawing conclusions. Measurement is one of our strongest tools for understanding behavior, and evaluating LLMs is no different.

When a large language model generates a response, it is producing a permanent product. That permanent product is an output that we can and should measure. The output can fall into many different categories. We can use labels such as accurate or inaccurate, complete or incomplete, consistent or variable across similar instructions, and even harmful or safe.

But the current state of evaluation may leave practitioners at risk of evaluating LLM outputs the way untrained observers evaluate behavior: informally and without agreed-upon definitions. A response that appears correct may be labeled "good enough," whereas one that looks a bit "off" may be discarded. This is the equivalent of measuring behavior change by anecdote and intuition rather than by what the data do and don't support.

The purpose of this post is to challenge behavior analysts to hold LLM systems to the same standard we hold our own interventions. If we would not accept anecdotal evidence as proof of behavior change, we should not accept a fluent, well-formatted response as proof of clinical utility.

This post isn't aimed at solving all of your technical woes with adopting AI systems (and more specifically LLM based offerings), and it's not going to give you the answer on how to evaluate every AI system that exists (there are many different types of AI systems, and all of them have different levels of explainability). Rather, it is aimed at giving you a way to think about how you approach adoption, and a framework for making defensible decisions about the systems you are likely already being asked to adopt.


Before You Can Evaluate: Narrow the Scope

The capabilities of current large language models are no longer the limiting factor in this conversation. Sit with that for a moment. You can automate almost anything text-based with an LLM. For example, with a free-tier ChatGPT account and five seconds of back and forth, you can have a system conducting an unstructured interview with you, parsing clinical and ethical scenarios, and returning fluent, well-structured responses. And it feels credible. Every idea you input seems well supported, good and bad ideas alike are validated, all wrapped up in a believable stream of text. That isn't a risk of using these systems; it's a feature of them. In this context, the question isn't whether you can automate something, it's whether you should.

But should you? And more importantly: could you evaluate that automation if you did?

When a system can do almost anything, evaluating "the system" as a whole is almost meaningless. What would it mean to evaluate whether an LLM is clinically useful when the same system can write a session note, draft a treatment plan, summarize a research article, and, without missing a step, settle the debate about whether the Giants are the most storied franchise in MLB history? You cannot measure a tool that has no well-defined scope. Before any LLM implementation in a healthcare setting can be evaluated, we MUST make the scope explicit. This means defining the problem you are solving, the specific task the model will execute, who will be affected by its output, and what success looks like before any data are collected.

For our purposes here, that work is already done. The example task we will use to illustrate the evaluation process is session note generation. The users are clinicians and behavior technicians. The stakeholders include clients, caregivers, supervisors, and payers. Your task may differ, but being laser-focused on the specific task under evaluation is a prerequisite to evaluating the system that produces the output.


Step 1: Identifying the Dependent Variable(s)

The first step is breaking the problem down into smaller, more measurable units. Doing so not only gives us something more concrete to track, it also forces us to get explicit about what we expect these systems to produce and what we are willing to accept as evidence that they are working.

Primary Dependent Variable: Time Spent on Session Notes

One dependent variable that may be actionable with documentation summaries is time spent on session notes. This is likely our most straightforward measure: a direct continuous measure if your documentation platform logs timestamps automatically, or a self-reported measure if practitioners record their own time. We could collect these data before and after implementation of the LLM system to monitor its effect on how behavior professionals allocate their time, but duration alone is not sufficient. A note produced in 30 seconds is only useful if it contains valuable information about the session, accurately represents what occurred, and can hold up to professional scrutiny. If those conditions are not met, the time savings are meaningless.

Secondary Dependent Variables: Note Quality Components

So, we need to add additional data elements to the mix so that we can monitor the impact of our adoption decision across multiple quantifiable variables. To do this, we need to break down the components of a session note into observable and measurable characteristics. Think of this like you would a treatment fidelity measurement system: small, specific, and scorable. The characteristics you want to monitor and measure will likely depend on your setting and clinical context.

What makes a good session note is largely outside the scope of this post. For our purposes, we need to identify a handful of components that are observable, measurable, and clinically meaningful enough to tell us something useful about whether the system is producing work we can actually stand behind.

| # | Component | Operational Definition |
|---|---|---|
| 1 | Time spent on note | Duration from initiation to completion of the session note, measured in minutes |
| 2 | Date recorded | Note includes the date on which the service was rendered |
| 3 | Service type recorded | Note specifies the type of service delivered (e.g., direct therapy, supervision) |
| 4 | Location recorded | Note specifies where the session took place (e.g., home, clinic, school) |
| 5 | Behavior described in observable terms | Client behavior is described using objective, observable language with no mentalistic terms (e.g., seemed, appeared, was anxious) |
| 6 | Target goal referenced | Note explicitly references the treatment goal or objective addressed during the session |
| 7 | Measurable progress reported | Note includes a specific numerical or percentage-based measure of client performance (e.g., completed 70% of trials independently) |
| 8 | Performance compared to prior session | Current performance is contextualized against a previous data point (e.g., up from 65% last session) |

Step 2: Deciding on Measurement for Each Component

Our measurement can take different forms, and there are many options. Research on human evaluation of LLMs in healthcare has used rating scales ranging from simple binary judgments to more nuanced 5- or 7-point Likert scales (e.g., "how accurately does this note reflect session data?" rated 1–5), categorical judgments such as presence or absence of a feature (e.g., "operationally defined" vs. "mentalistic language"), percentage correct scores calculated by dividing the number of components met by the total number of components, and comparative ratings against a gold-standard note authored by a clinician. For our example with session notes, we are going to use the simplest of those: true or false, yes or no. Does the note describe behavior in observable terms? Does it reference the relevant treatment goal? Does it accurately reflect the data collected? In the aggregate, these binary scores give us metrics we can plot across notes, benchmarked against the human-generated documentation that preceded LLM adoption.

Now that we have our measurement unit, let's create our data sheet.

| # | Component | Operational Definition | Duration (min) |
|---|---|---|---|
| 1 | Time spent on note | Duration from initiation to completion of the session note, recorded in minutes | |

| # | Component | Operational Definition | T / F |
|---|---|---|---|
| 1 | Date recorded | Note includes the date on which the service was rendered | |
| 2 | Service type recorded | Note specifies the type of service delivered (e.g., direct therapy, supervision) | |
| 3 | Location recorded | Note specifies where the session took place (e.g., home, clinic, school) | |
| 4 | Behavior described in observable terms | Client behavior is described using objective, observable language with no mentalistic terms (e.g., seemed, appeared, was anxious) | |
| 5 | Target goal referenced | Note explicitly references the treatment goal or objective addressed during the session | |
| 6 | Measurable progress reported | Note includes a specific numerical or percentage-based measure of client performance (e.g., completed 70% of trials independently) | |
| 7 | Performance compared to prior session | Current performance is contextualized against a previous data point (e.g., up from 65% last session) | |
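To make the data sheet concrete, here is a minimal sketch of how it could be represented in code. This is purely illustrative: the field and component names are our own shorthand, and a spreadsheet or documentation platform could serve the same purpose.

```python
from dataclasses import dataclass, field

# Illustrative component names mirroring the data sheet above.
BINARY_COMPONENTS = [
    "date_recorded",
    "service_type_recorded",
    "location_recorded",
    "behavior_observable_terms",
    "target_goal_referenced",
    "measurable_progress_reported",
    "performance_compared_to_prior",
]

@dataclass
class NoteScore:
    """One scored note: a duration measure plus seven True/False components."""
    session_id: int
    source: str                 # "human" or "llm"
    duration_minutes: float     # time spent on the note
    components: dict = field(default_factory=dict)  # component name -> True/False

# Example: one hypothetical LLM-generated note scored against the data sheet
note = NoteScore(
    session_id=1,
    source="llm",
    duration_minutes=0.5,
    components={name: True for name in BINARY_COMPONENTS},
)
print(note.source, note.duration_minutes, sum(note.components.values()), "of", len(BINARY_COMPONENTS), "components met")
```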

Step 3: Establishing a Ground Truth

Before any data are collected, you need to define what a correct score actually looks like for each component. These are your ground truth values, the reference against which every scored note will be compared. Without them, observers are making judgments against an implicit and unverified standard, which is the same problem we started with.

The components in our example fall into two categories that require different approaches. Mandatory components (e.g., date recorded, service type, location) should appear in every note, so their ground truth value is always True. Conditional components (e.g., measurable progress reported, performance compared to prior session) are only warranted when the session data support them.

For conditional components, the expert rater must consult the session data (trial-by-trial records, graphs, or clinical logs) before assigning a ground truth value for the note. The note is evaluated not against a universal expectation, but against what the session data actually support. This also means the person establishing the ground truth should be a qualified clinician reviewing the raw data, not the individual who wrote the note being evaluated.

Expert Rater: A trained clinician who independently scores both human and AI-generated notes against the data sheet criteria, serving as the source(s) of ground truth judgments used to build the confusion matrix.

LLM-Generated Note: A session note produced by a large language model, evaluated against the same operational definitions as the human note to determine where LLM documentation meets, exceeds, or falls short of clinician performance.

Human-Generated Note: A session note authored by the treating clinician or behavior technician. When scored, the human-generated note helps establish a pre-adoption baseline and serves as the reference condition in the confusion matrix.
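To illustrate the distinction between mandatory and conditional components, here is a minimal sketch, with hypothetical component and field names, of how a ground truth record might be assembled from a review of the raw session data.

```python
# Hypothetical component names; a real rubric would use your own definitions.
MANDATORY = ["date_recorded", "service_type_recorded", "location_recorded"]

def ground_truth_for_session(session_data: dict) -> dict:
    """Ground truth an expert rater might establish after reviewing raw session data."""
    truth = {name: True for name in MANDATORY}  # mandatory components are always expected
    # Conditional: only warranted if formal measurement actually occurred in the session.
    truth["measurable_progress_reported"] = bool(session_data.get("formal_measurement_occurred"))
    # Conditional: only warranted if a prior data point exists to compare against.
    truth["performance_compared_to_prior"] = bool(session_data.get("prior_data_point_exists"))
    return truth

# Example: a session with formal measurement but no prior data point to reference
print(ground_truth_for_session({"formal_measurement_occurred": True, "prior_data_point_exists": False}))
```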


Step 4: Collecting the Data

Once we have our measurement system and ground truth established, we need to collect data. This step involves scoring the same set of sessions twice, once for the human-generated note and once for the AI-generated note. For each session, the expert rater applies the data sheet to both notes independently, producing a True/False score for each component from each source. Those scores are then compared against the ground truth values you defined in the previous section. This gives you two types of records:

  1. How well your clinicians or direct implementors are meeting each criterion
  2. How well the LLM system is meeting each criterion

For behavior analysts beginning to evaluate LLM output, one practical question comes up before you start: how many notes do we actually need to evaluate? The honest answer is that we do not know yet. A systematic review of 142 studies evaluating LLM outputs in healthcare found that most used 100 or fewer samples, though the authors themselves acknowledged this as a limitation of current practice rather than a recommended standard. For our purposes, more is better. What we are really after is stability in the metric over time, and that is an empirical question worth investigating in our own field.
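One way to investigate that stability question is to plot a cumulative metric as notes accrue and watch for it to level off. The sketch below uses simulated agreement data purely for illustration; the 80% agreement rate and the sample of 100 notes are arbitrary assumptions.

```python
import random

random.seed(0)
# Simulated data: 1 if a scored note matched the ground truth on a component, else 0.
matches = [1 if random.random() < 0.8 else 0 for _ in range(100)]

def cumulative_accuracy(outcomes):
    """Proportion correct after each additional note is scored."""
    correct, curve = 0, []
    for i, outcome in enumerate(outcomes, start=1):
        correct += outcome
        curve.append(correct / i)
    return curve

curve = cumulative_accuracy(matches)
# If the value is still swinging over the last several notes, keep sampling.
print([round(v, 2) for v in curve[9::10]])  # cumulative accuracy at notes 10, 20, ..., 100
```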


Step 5: Analyzing the Data

A natural starting point is our primary dependent variable: time spent on note.

The figure below displays hypothetical data for note completion time across human-generated and LLM-generated conditions. Each data point is a single note. Even without applying any statistical analysis, the difference between the two distributions is visible. By visually inspecting these data just as we would for any other intervention, we can estimate how much time a system like this saves.

Note completion time: human vs. LLM-generated notes (hypothetical)
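For readers who want to reproduce a figure like this with their own duration data, a minimal plotting sketch follows. The numbers are simulated, the distributions (around 12 minutes per human note, 3 per LLM-assisted note) are assumptions for illustration only, and the sketch requires matplotlib.

```python
import random
import matplotlib.pyplot as plt

random.seed(1)
# Simulated note completion times (minutes) for 30 sessions in each condition.
human_minutes = [random.gauss(12, 3) for _ in range(30)]
llm_minutes = [random.gauss(3, 1) for _ in range(30)]

fig, ax = plt.subplots()
ax.plot(range(1, 31), human_minutes, "o-", label="Human-generated")
ax.plot(range(1, 31), llm_minutes, "s-", label="LLM-generated")
ax.set_xlabel("Session")
ax.set_ylabel("Note completion time (minutes)")
ax.set_title("Note completion time by condition (hypothetical)")
ax.legend()
plt.show()
```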

But what do we do with the rest of our data? Duration tells us whether the system saves us time. It does not tell us whether the output is good. To make use of our secondary dependent variable data, we need a tool for comparing what the LLM produces against what we defined as acceptable in our ground truth. The tool we will be using is a confusion matrix.

The confusion matrix: a tool for comparing scored notes to the ground truth

A confusion matrix is a common tool in the evaluation of classification systems, though it is less familiar in behavior analysis. It is a simple table that compares a set of predictions or classifications against a known reference, showing where they agree and where they do not. In behavior analysis, we use a similar logic when two observers record the same behavior independently and we compare the recorded events to calculate interobserver agreement. The confusion matrix works the same way, with one important distinction: one side of the comparison is always the ground truth established in the previous section, not another observer. It tells you precisely how well a given set of notes holds up against that fixed reference.

For each note, whether written by a clinician or generated by an LLM, the expert rater has produced a True/False score for each component. Those scores are then compared against the ground truth values for that component, component by component, session by session. For mandatory components, the ground truth is always True. For conditional components, the ground truth reflects what the session data supported. Either way, once the ground truth is defined, the comparison is the same.

Here is an example of a confusion matrix for a single component:

| | Note said: Present | Note said: Absent |
|---|---|---|
| Ground truth: Present | True Positive (TP): criterion was correctly included | False Negative (FN): criterion was missed |
| Ground truth: Absent | False Positive (FP): criterion was incorrectly added | True Negative (TN): criterion was correctly omitted |

This framing applies whether the note was written by a clinician or generated by an LLM system. Run it for both, and you have a direct, quantitative comparison between human and LLM documentation performance on the same criteria. From these four cells, we can calculate several summary metrics, and each tells us something different about the performance of the note.
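Those summary metrics are simple arithmetic on the four cells. Below is a minimal sketch using the standard formulas for accuracy, sensitivity, specificity, precision, and the Matthews correlation coefficient (MCC); the counts in the example call are hypothetical.

```python
import math

def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Summary metrics computed from the four cells of a confusion matrix."""
    total = tp + fp + fn + tn
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        "mcc": (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0,
    }

# Hypothetical counts for one component across 30 sessions
print(confusion_metrics(tp=10, fp=4, fn=2, tn=14))
# -> accuracy ~0.80, sensitivity ~0.83, specificity ~0.78, precision ~0.71, mcc ~0.60
```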

Putting it together: from data sheet to confusion matrix

The ground truth established in the previous section is your fixed reference. The expert rater then scores each note (human-generated or AI-generated) against the data sheet, producing a True/False value for each component. Those scored values are then compared against the ground truth, component by component for each session. You run this for both note sources, which gives you two sets of confusion matrices: one showing how well human notes hold up against the ground truth, and one showing how well LLM notes do.

To make this concrete, let's trace one component from our data sheet. We will use Measurable progress reported as our example. Because this is a conditional component, the ground truth for each session is determined first by the expert rater reviewing the raw session data: did formal measurement actually occur? That decision is the reference. The rater then scores both the human note and the LLM note against that same reference independently.

| Session | Ground Truth | Human Note | Human Outcome | LLM Note | LLM Outcome |
|---|---|---|---|---|---|
| 1 | T | T | True Positive | T | True Positive |
| 2 | F | F | True Negative | F | True Negative |
| 3 | T | T | True Positive | T | True Positive |
| 4 | F | F | True Negative | F | True Negative |
| 5 | F | F | True Negative | F | True Negative |
| 6 | T | T | True Positive | T | True Positive |
| 7 | F | F | True Negative | T | False Positive |
| 8 | F | F | True Negative | F | True Negative |
| ... | ... | ... | ... | ... | ... |

Each row produces two independent outcomes: one for the human note and one for the LLM note. Both are compared against the same ground truth. You can see in session 7 that the ground truth was False (no formal measurement occurred), the human note correctly omitted a progress measure, but the LLM note included one anyway. That is a false positive specific to the LLM: for the same session, the human note earns a true negative while the LLM note earns a false positive.
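Tallying those per-session comparisons is straightforward to automate. The sketch below classifies each session's outcome for one component and counts them; the session values are the eight hypothetical rows from the table above.

```python
from collections import Counter

def classify(ground_truth: bool, note_value: bool) -> str:
    """Compare one note's score to the ground truth for a single component."""
    if ground_truth and note_value:
        return "TP"   # correctly included
    if ground_truth and not note_value:
        return "FN"   # missed
    if not ground_truth and note_value:
        return "FP"   # incorrectly added
    return "TN"       # correctly omitted

# (ground truth, human note, LLM note) for the eight sessions shown above
sessions = [
    (True, True, True), (False, False, False), (True, True, True), (False, False, False),
    (False, False, False), (True, True, True), (False, False, True), (False, False, False),
]

human_counts = Counter(classify(gt, human) for gt, human, _ in sessions)
llm_counts = Counter(classify(gt, llm) for gt, _, llm in sessions)
print("Human:", dict(human_counts))  # {'TP': 3, 'TN': 5}
print("LLM:  ", dict(llm_counts))    # {'TP': 3, 'TN': 4, 'FP': 1}
```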

Summing the LLM note outcomes across all 30 sessions produces the confusion matrix below. A separate matrix can be built for the human notes using the same approach, giving you a direct benchmark for comparison.

LLM-Generated Notes: Measurable progress reported vs. ground truth (30 sessions)

| | LLM note: Progress Reported | LLM note: No Progress Reported |
|---|---|---|
| Ground truth: Progress Reported | 10 (True Positive) | 2 (False Negative) |
| Ground truth: No Progress Reported | 4 (False Positive) | 14 (True Negative) |

| Metric | Formula | Value |
|---|---|---|
| Accuracy | (TP + TN) / Total | 80.0% |
| Sensitivity | TP / (TP + FN) | 83.3% |
| Specificity | TN / (TN + FP) | 77.8% |
| Precision | TP / (TP + FP) | 71.4% |
| MCC | ranges from -1 to +1 | 0.60 |

A few things are worth noting from these results. Sensitivity of 83% tells us the LLM correctly included a measurable data point in most sessions where the ground truth required one. Specificity of 78% means the system generated a false positive in roughly one out of five sessions where the ground truth indicated no formal measure occurred. In a documentation context, that matters. A note that reports a specific performance percentage for a skill that was not formally measured is not just inaccurate; it is a fabrication that could not survive scrutiny. Running the same matrix for the human notes against the same ground truth gives you the baseline: how often your clinicians meet this criterion when the session data say they should.

Human-Generated Notes: Measurable progress reported vs. ground truth (30 sessions)

| | Human note: Progress Reported | Human note: No Progress Reported |
|---|---|---|
| Ground truth: Progress Reported | 12 (True Positive) | 0 (False Negative) |
| Ground truth: No Progress Reported | 0 (False Positive) | 18 (True Negative) |

| Metric | Formula | Value |
|---|---|---|
| Accuracy | (TP + TN) / Total | 100.0% |
| Sensitivity | TP / (TP + FN) | 100.0% |
| Specificity | TN / (TN + FP) | 100.0% |
| Precision | TP / (TP + FP) | 100.0% |
| MCC | ranges from -1 to +1 | 1.00 |

In this example, the human notes are outperforming the LLM-generated notes on this specific metric. That finding cuts both ways. It confirms that your clinicians are meeting the documentation standard when the session data say they should, which is exactly what you want to see in your baseline. It also tells you that adopting this system, as configured, would represent a step backwards in documentation quality for this particular criterion. That is a meaningful clinical finding, and it is the kind of finding that should inform an adoption decision.

Now, these data are hypothetical, and your results will vary depending on your setting, your clinicians, the system you are evaluating, and how it has been configured. The point is not the numbers. The point is that this kind of comparison is possible, it is not technically difficult, and it is exactly the kind of evidence that should be driving these decisions in our field rather than promises and anecdotal reports.

Step 6: Interpreting the Results and Making a Decision

Having data is not the same as making a decision. The confusion matrix gives you numbers; what you do with them requires clinical judgment informed by the context in which your organization operates.

A useful starting point is to examine your results by component type, because not all data elements carry equal weight. Some components are low stakes and may be well suited for LLM assistance, potentially even outperforming your human baseline. Others carry meaningful clinical, billing, or compliance risk where a false positive or false negative has consequences that extend well beyond the note itself. Furthermore, engaging in this kind of analysis may surface something unexpected: that human performance on certain components is not as strong as assumed, and that the LLM is actually the more consistent documenter. That is not a failure of the analysis. That is exactly what the analysis is for.
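One way to operationalize that component-by-component review is to compute each component's accuracy separately for human and LLM notes and flag any high-risk component where the LLM falls below the human baseline. The sketch below is illustrative only; the component names, accuracy values, and risk designations are assumptions your team would supply from its own results.

```python
# Components your organization has designated high risk (illustrative choices).
HIGH_RISK = {"measurable_progress_reported", "service_type_recorded"}

# Hypothetical per-component accuracy taken from the two confusion matrices.
human_accuracy = {"measurable_progress_reported": 1.00, "location_recorded": 0.90}
llm_accuracy = {"measurable_progress_reported": 0.80, "location_recorded": 0.97}

for component, human_score in human_accuracy.items():
    llm_score = llm_accuracy[component]
    delta = llm_score - human_score
    flag = "  <-- high-risk regression: review before adoption" if delta < 0 and component in HIGH_RISK else ""
    print(f"{component}: human {human_score:.2f}, LLM {llm_score:.2f}, delta {delta:+.2f}{flag}")
```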

Step 7: Monitoring Over Time

Our final step is to monitor the performance of the system over time. Adoption is not the end of the evaluation process. LLM systems are updated, changed, modified, or abandoned by vendors all the time. This can occur without notice. A model that performed well during your initial evaluation may produce meaningfully different output after an update. Treating adoption as a one-time decision leaves your organization exposed to drift you may not detect until it surfaces in a chart audit or payer review.

Continuous monitoring does not require repeating the full evaluation indefinitely. A practical approach borrows from process control: establish a baseline level of performance for each component during your initial evaluation, then sample a smaller number of notes on an ongoing basis to monitor for meaningful deviation. If performance on a high-risk component drops below your acceptable threshold, that is a signal to pause and investigate before continued use.
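A minimal sketch of that kind of ongoing check appears below. The sampling cadence, thresholds, and component names are assumptions; your own baselines and risk tolerance would determine the real values.

```python
# Acceptable performance thresholds per component, set from your baseline evaluation (assumed values).
THRESHOLDS = {"measurable_progress_reported": 0.80, "date_recorded": 0.95}

def check_sample(component: str, outcomes: list) -> None:
    """outcomes: True for each sampled note that matched the ground truth on this component."""
    rate = sum(outcomes) / len(outcomes)
    if rate < THRESHOLDS[component]:
        print(f"ALERT: {component} at {rate:.0%}, below threshold {THRESHOLDS[component]:.0%}. Pause and investigate.")
    else:
        print(f"OK: {component} at {rate:.0%}.")

# Example: a monthly sample of 10 LLM-generated notes for one component
check_sample("measurable_progress_reported", [True] * 7 + [False] * 3)
```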

Any monitoring plan should specify, at a minimum, how many notes will be sampled and how often, the acceptable performance threshold for each component, and what happens when performance drops below that threshold.

The question this post opened with was: does any of this actually work? The answer behavior analysts should give is the same one we give to any intervention claim: show me the data. We would not accept a testimonial as evidence that a treatment produces behavior change. We wouldn't adjust a client's program based on promises made. The standards we hold our clinical work to do not disappear because the product being evaluated runs on a language model. If anything, the complexity of these systems makes rigorous measurement more necessary, not less.

Operational definitions, permanent product recording, baseline logic, interobserver agreement, and simple arithmetic. Every behavior analyst already has these tools. The only question is whether we use them.

Have questions about this framework or want to discuss how to apply it in your organization? Get in touch.

References

Cox, D. J., Weil, L., Sosine, J., Jennings, A. M., & Santos, C. (2025). Getting more from your IOA data: Alternative measures to total, occurrence, and non-occurrence agreement. Behavioral Interventions, e70031. https://doi.org/10.1002/bin.70031

Tam, T. Y. C., Sivarajkumar, S., Kapoor, S., Stolyar, A. V., Polanska, K., McCarthy, K. R., Osterhoudt, H., Wu, X., Visweswaran, S., Fu, S., Mathur, P., Cacciamani, G. E., Sun, C., Peng, Y., & Wang, Y. (2024). A framework for human evaluation of large language models in healthcare derived from literature review. npj Digital Medicine, 7, 258. https://doi.org/10.1038/s41746-024-01258-7