Evaluating What Large Language Models Produce

Behavior Analysts are likely already being asked to adopt Large Language Model (LLM) tooling. The promises made by vendors may sound familiar: faster documentation, reduced administrative burden, and more time back in your day. These promises are often enough to generate enthusiasm from organizational leaders eager for a more efficient workforce, increased operational output, and improved decision-making. But the question that needs to be asked, and one behavior analysts are uniquely positioned to answer, is: does any of this actually work? Does it reduce documentation time, and to what degree? Does the quality of the output hold up under scrutiny? And does the time saved justify what may be lost in accuracy, consistency, or clinical utility?

Many practitioners in Applied Behavior Analysis work within healthcare contexts, which demands that the outputs of any system we use be defensible. That means both understanding what a system is producing and being able to explain it to clients, caregivers, supervisors, and payers.

So, as behavior analysts, what should we do? Fortunately, we are well trained in measurement techniques. We can define behaviors operationally, select measurement systems, collect data across conditions, and establish interobserver agreement before drawing conclusions. Measurement is one of our strongest tools for understanding behavior, and evaluating LLMs is no different.

When a large language model generates a response, it is producing a permanent product. That permanent product is an output that we can and should measure. The output can fall into many different categories. We can use labels such as accurate or inaccurate, complete or incomplete, consistent or variable across similar instructions, and even harmful or safe.

But the current state of evaluation may leave practitioners at risk of evaluating LLM outputs the way untrained observers evaluate behavior: informally and without agreed-upon definitions. A response that appears correct may be labeled "good enough," whereas one that looks a bit "off" may be discarded. This is the equivalent of measuring behavior change by anecdote and intuition rather than by what the data do and don't support.

The purpose of this post is to challenge behavior analysts to hold LLM systems to the same standard we hold our own interventions. If we would not accept anecdotal evidence as proof of behavior change, we should not accept a fluent, well-formatted response as proof of clinical utility.

This post isn't aimed at solving all of your technical woes with adopting AI systems (and more specifically LLM based offerings), and it's not going to give you the answer on how to evaluate every AI system that exists (there are many different types of AI systems, and all of them have different levels of explainability). Rather, it is aimed at giving you a way to think about how you approach adoption, and a framework for making defensible decisions about the systems you are likely already being asked to adopt.


Before You Can Evaluate: Narrow the Scope

The capabilities of current large language models are no longer the limiting factor in this conversation. Sit with that for a moment. You can automate almost anything text-based with an LLM. For example, with a free-tier ChatGPT account and five seconds of back and forth, you can have a system conducting an unstructured interview with you, parsing clinical and ethical scenarios, and returning fluent, well-structured responses. And it feels credible. Every idea you input seems well supported, good and bad ideas alike are validated, all wrapped up in a believable stream of text. That isn't a risk of using these systems; it's a feature of them. In this context, the question isn't whether you can automate something, it's whether you should.

But should you? And more importantly: could you evaluate that automation if you did?

When a system can do almost anything, evaluating "the system" as a whole is almost meaningless. What would it mean to evaluate whether an LLM is clinically useful when the same system can write a session note, draft a treatment plan, summarize a research article, and, without missing a step, settle the debate about whether the Giants are the most storied franchise in MLB history? You cannot measure a tool that has no well-defined scope. Before any LLM implementation in a healthcare setting can be evaluated, we MUST make the scope explicit. This means defining the problem you are solving, the specific task the model will execute, who will be affected by its output, and what success looks like before any data are collected.

For our purposes here, that work is already done. The example task we will use to illustrate the evaluation process is session note generation. The users are clinicians and behavior technicians. The stakeholders include clients, caregivers, supervisors, and payers. Your task may differ, but being laser-focused on the specific task under evaluation is a prerequisite to evaluating the system that produces the output.


Step 1: Identifying the Dependent Variable(s)

The first step is breaking the problem down into smaller, more measurable units. Doing so not only gives us something more concrete to track, it also forces us to get explicit about what we expect these systems to produce and what we are willing to accept as evidence that they are working.

Primary Dependent Variable: Time Spent on Session Notes

One dependent variable that may be actionable with documentation summaries is time spent on session notes. This is likely our most straightforward measure: a direct continuous measure if your documentation platform logs timestamps automatically, or a self-reported measure if practitioners record their own time. We could collect these data before and after implementation of the LLM system to monitor its effect on how behavior professionals allocate their time, but duration alone is not sufficient. A note produced in 30 seconds is only useful if it contains valuable information about the session, accurately represents what occurred, and can hold up to professional scrutiny. If those conditions are not met, the time savings are meaningless.

Secondary Dependent Variables: Note Quality Components

So, we need to add additional data elements to the mix so that we can monitor the impact of our adoption decision across multiple quantifiable variables. To do this, we need to break down the components of a session note into observable and measurable characteristics. Think of this like you would a treatment fidelity measurement system: small, specific, and scorable. The characteristics you want to monitor and measure will likely depend on your setting and clinical context.

What makes a good session note is largely outside the scope of this post. For our purposes, we need to identify a handful of components that are observable, measurable, and clinically meaningful enough to tell us something useful about whether the system is producing work we can actually stand behind.

| # | Component | Operational Definition |
|---|---|---|
| 1 | Time spent on note | Duration from initiation to completion of the session note, measured in minutes |
| 2 | Date recorded | Note includes the date on which the service was rendered |
| 3 | Service type recorded | Note specifies the type of service delivered (e.g., direct therapy, supervision) |
| 4 | Location recorded | Note specifies where the session took place (e.g., home, clinic, school) |
| 5 | Behavior described in observable terms | Client behavior is described using objective, observable language with no mentalistic terms (e.g., seemed, appeared, was anxious) |
| 6 | Target goal referenced | Note explicitly references the treatment goal or objective addressed during the session |
| 7 | Measurable progress reported | Note includes a specific numerical or percentage-based measure of client performance (e.g., completed 70% of trials independently) |
| 8 | Performance compared to prior session | Current performance is contextualized against a previous data point (e.g., up from 65% last session) |

Step 2: Deciding on Measurement for Each Component

Our measurement can take different forms, and there are many options. Research on human evaluation of LLMs in healthcare has used rating scales ranging from simple binary judgments to more nuanced 5- or 7-point Likert scales (e.g., "how accurately does this note reflect session data?" rated 1–5), categorical judgments such as presence or absence of a feature (e.g., "operationally defined" vs. "mentalistic language"), percentage correct scores calculated by dividing the number of components met by the total number of components, and comparative ratings against a gold-standard note authored by a clinician. For our example with session notes, we are going to use the simplest of those: true or false, yes or no. Does the note describe behavior in observable terms? Does it reference the relevant treatment goal? Does it accurately reflect the data collected? In the aggregate, these binary scores give us metrics we can plot across notes, benchmarked against the human-generated documentation that preceded LLM adoption.

Now that we have our measurement unit, let's create our data sheet.

| # | Component | Operational Definition | Duration (min) |
|---|---|---|---|
| 1 | Time spent on note | Duration from initiation to completion of the session note, recorded in minutes | |

| # | Component | Operational Definition | T / F |
|---|---|---|---|
| 1 | Date recorded | Note includes the date on which the service was rendered | |
| 2 | Service type recorded | Note specifies the type of service delivered (e.g., direct therapy, supervision) | |
| 3 | Location recorded | Note specifies where the session took place (e.g., home, clinic, school) | |
| 4 | Behavior described in observable terms | Client behavior is described using objective, observable language with no mentalistic terms (e.g., seemed, appeared, was anxious) | |
| 5 | Target goal referenced | Note explicitly references the treatment goal or objective addressed during the session | |
| 6 | Measurable progress reported | Note includes a specific numerical or percentage-based measure of client performance (e.g., completed 70% of trials independently) | |
| 7 | Performance compared to prior session | Current performance is contextualized against a previous data point (e.g., up from 65% last session) | |
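To make the data sheet concrete, here is a minimal sketch of how it could be represented in code. This is purely illustrative: the field and component names are our own shorthand, and a spreadsheet or documentation platform could serve the same purpose.

```python
from dataclasses import dataclass, field

# Illustrative component names mirroring the data sheet above.
BINARY_COMPONENTS = [
    "date_recorded",
    "service_type_recorded",
    "location_recorded",
    "behavior_observable_terms",
    "target_goal_referenced",
    "measurable_progress_reported",
    "performance_compared_to_prior",
]

@dataclass
class NoteScore:
    """One scored note: a duration measure plus seven True/False components."""
    session_id: int
    source: str                 # "human" or "llm"
    duration_minutes: float     # time spent on the note
    components: dict = field(default_factory=dict)  # component name -> True/False

# Example: one hypothetical LLM-generated note scored against the data sheet
note = NoteScore(
    session_id=1,
    source="llm",
    duration_minutes=0.5,
    components={name: True for name in BINARY_COMPONENTS},
)
print(note.source, note.duration_minutes, sum(note.components.values()), "of", len(BINARY_COMPONENTS), "components met")
```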

Step 3: Establishing a Ground Truth

Before any data are collected, you need to define what a correct score actually looks like for each component. These are your ground truth values, the reference against which every scored note will be compared. Without them, observers are making judgments against an implicit and unverified standard, which is the same problem we started with.

The components in our example fall into two categories that require different approaches. Mandatory components (e.g., date recorded, service type, location) should appear in every note, so their ground truth value is always True. Conditional components (e.g., measurable progress reported, performance compared to prior session) are only warranted when the session data support them.

For conditional components, the expert rater must consult the session data (trial-by-trial records, graphs, or clinical logs) before assigning a ground truth value for the note. The note is evaluated not against a universal expectation, but against what the session data actually support. This also means the person establishing the ground truth should be a qualified clinician reviewing the raw data, not the individual who wrote the note being evaluated.

Expert Rater: A trained clinician who independently scores both human and AI-generated notes against the data sheet criteria, serving as the source(s) of ground truth judgments used to build the confusion matrix.

LLM-Generated Note: A session note produced by a large language model, evaluated against the same operational definitions as the human note to determine where LLM documentation meets, exceeds, or falls short of clinician performance.

Human-Generated Note: A session note authored by the treating clinician or behavior technician. When scored, the human-generated note helps establish a pre-adoption baseline and serves as the reference condition in the confusion matrix.
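To illustrate the distinction between mandatory and conditional components, here is a minimal sketch, with hypothetical component and field names, of how a ground truth record might be assembled from a review of the raw session data.

```python
# Hypothetical component names; a real rubric would use your own definitions.
MANDATORY = ["date_recorded", "service_type_recorded", "location_recorded"]

def ground_truth_for_session(session_data: dict) -> dict:
    """Ground truth an expert rater might establish after reviewing raw session data."""
    truth = {name: True for name in MANDATORY}  # mandatory components are always expected
    # Conditional: only warranted if formal measurement actually occurred in the session.
    truth["measurable_progress_reported"] = bool(session_data.get("formal_measurement_occurred"))
    # Conditional: only warranted if a prior data point exists to compare against.
    truth["performance_compared_to_prior"] = bool(session_data.get("prior_data_point_exists"))
    return truth

# Example: a session with formal measurement but no prior data point to reference
print(ground_truth_for_session({"formal_measurement_occurred": True, "prior_data_point_exists": False}))
```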


Step 4: Collecting the Data

Once we have our measurement system and ground truth established, we need to collect data. This step involves scoring the same set of sessions twice, once for the human-generated note and once for the AI-generated note. For each session, the expert rater applies the data sheet to both notes independently, producing a True/False score for each component from each source. Those scores are then compared against the ground truth values you defined in the previous section. This gives you two types of records:

  1. How well your clinicians or direct implementors are meeting each criterion
  2. How well the LLM system is meeting each criterion

For behavior analysts beginning to evaluate LLM output, one practical question comes up before you start: how many notes do we actually need to evaluate? The honest answer is that we do not know yet. A systematic review of 142 studies evaluating LLM outputs in healthcare found that most used 100 or fewer samples, though the authors themselves acknowledged this as a limitation of current practice rather than a recommended standard. For our purposes, more is better. What we are really after is stability in the metric over time, and that is an empirical question worth investigating in our own field.
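One way to investigate that stability question is to plot a cumulative metric as notes accrue and watch for it to level off. The sketch below uses simulated agreement data purely for illustration; the 80% agreement rate and the sample of 100 notes are arbitrary assumptions.

```python
import random

random.seed(0)
# Simulated data: 1 if a scored note matched the ground truth on a component, else 0.
matches = [1 if random.random() < 0.8 else 0 for _ in range(100)]

def cumulative_accuracy(outcomes):
    """Proportion correct after each additional note is scored."""
    correct, curve = 0, []
    for i, outcome in enumerate(outcomes, start=1):
        correct += outcome
        curve.append(correct / i)
    return curve

curve = cumulative_accuracy(matches)
# If the value is still swinging over the last several notes, keep sampling.
print([round(v, 2) for v in curve[9::10]])  # cumulative accuracy at notes 10, 20, ..., 100
```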


Step 5: Analyzing the Data

A natural starting point is our primary dependent variable: time spent on note.

The figure below displays hypothetical data for note completion time across human-generated and LLM-generated conditions. Each data point is a single note. Even without applying any statistical analysis, the difference between the two distributions is visible. By visually inspecting these data just as we would for any other intervention, we can estimate how much time a system like this saves.

Note completion time: human vs. LLM-generated notes (hypothetical)
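For readers who want to reproduce a figure like this with their own duration data, a minimal plotting sketch follows. The numbers are simulated, the distributions (around 12 minutes per human note, 3 per LLM-assisted note) are assumptions for illustration only, and the sketch requires matplotlib.

```python
import random
import matplotlib.pyplot as plt

random.seed(1)
# Simulated note completion times (minutes) for 30 sessions in each condition.
human_minutes = [random.gauss(12, 3) for _ in range(30)]
llm_minutes = [random.gauss(3, 1) for _ in range(30)]

fig, ax = plt.subplots()
ax.plot(range(1, 31), human_minutes, "o-", label="Human-generated")
ax.plot(range(1, 31), llm_minutes, "s-", label="LLM-generated")
ax.set_xlabel("Session")
ax.set_ylabel("Note completion time (minutes)")
ax.set_title("Note completion time by condition (hypothetical)")
ax.legend()
plt.show()
```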

But what do we do with the rest of our data? Duration tells us whether the system saves us time. It does not tell us whether the output is good. To make use of our secondary dependent variable data, we need a tool for comparing what the LLM produces against what we defined as acceptable in our ground truth. The tool we will be using is a confusion matrix.

The confusion matrix: a tool for comparing scored notes to the ground truth

A confusion matrix is a common tool in the evaluation of classification systems, though it is less familiar in behavior analysis. It is a simple table that compares a set of predictions or classifications against a known reference, showing where they agree and where they do not. In behavior analysis, we use a similar logic when two observers record the same behavior independently and we compare the recorded events to calculate interobserver agreement. The confusion matrix works the same way, with one important distinction: one side of the comparison is always the ground truth established in the previous section, not another observer. It tells you precisely how well a given set of notes holds up against that fixed reference.

For each note, whether written by a clinician or generated by an LLM, the expert rater has produced a True/False score for each component. Those scores are then compared against the ground truth values for that component, component by component, session by session. For mandatory components, the ground truth is always True. For conditional components, the ground truth reflects what the session data supported. Either way, once the ground truth is defined, the comparison is the same.

Here is an example of a confusion matrix for a single component:

| | Note said: Present | Note said: Absent |
|---|---|---|
| Ground truth: Present | True Positive (TP): criterion was correctly included | False Negative (FN): criterion was missed |
| Ground truth: Absent | False Positive (FP): criterion was incorrectly added | True Negative (TN): criterion was correctly omitted |

This framing applies whether the note was written by a clinician or generated by an LLM system. Run it for both, and you have a direct, quantitative comparison between human and LLM documentation performance on the same criteria. From these four cells, we can calculate several summary metrics, and each tells us something different about the performance of the note.
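Those summary metrics are simple arithmetic on the four cells. Below is a minimal sketch using the standard formulas for accuracy, sensitivity, specificity, precision, and the Matthews correlation coefficient (MCC); the counts in the example call are hypothetical.

```python
import math

def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Summary metrics computed from the four cells of a confusion matrix."""
    total = tp + fp + fn + tn
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        "mcc": (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0,
    }

# Hypothetical counts for one component across 30 sessions
print(confusion_metrics(tp=10, fp=4, fn=2, tn=14))
# -> accuracy ~0.80, sensitivity ~0.83, specificity ~0.78, precision ~0.71, mcc ~0.60
```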

Putting it together: from data sheet to confusion matrix

The ground truth established in the previous section is your fixed reference. The expert rater then scores each note (human-generated or AI-generated) against the data sheet, producing a True/False value for each component. Those scored values are then compared against the ground truth, component by component for each session. You run this for both note sources, which gives you two sets of confusion matrices: one showing how well human notes hold up against the ground truth, and one showing how well LLM notes do.

To make this concrete, let's trace one component from our data sheet. We will use Measurable progress reported as our example. Because this is a conditional component, the ground truth for each session is determined first by the expert rater reviewing the raw session data: did formal measurement actually occur? That decision is the reference. The rater then scores both the human note and the LLM note against that same reference independently.

| Session | Ground Truth | Human Note | Human Outcome | LLM Note | LLM Outcome |
|---|---|---|---|---|---|
| 1 | T | T | True Positive | T | True Positive |
| 2 | F | F | True Negative | F | True Negative |
| 3 | T | T | True Positive | T | True Positive |
| 4 | F | F | True Negative | F | True Negative |
| 5 | F | F | True Negative | F | True Negative |
| 6 | T | T | True Positive | T | True Positive |
| 7 | F | F | True Negative | T | False Positive |
| 8 | F | F | True Negative | F | True Negative |
| ... | ... | ... | ... | ... | ... |

Each row produces two independent outcomes: one for the human note and one for the LLM note. Both are compared against the same ground truth. You can see in session 7 that the ground truth was False (no formal measurement occurred), the human note correctly omitted a progress measure, but the LLM note included one anyway. That is a false positive specific to the LLM: for the same session, the human note earns a true negative while the LLM note earns a false positive.
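Tallying those per-session comparisons is straightforward to automate. The sketch below classifies each session's outcome for one component and counts them; the session values are the eight hypothetical rows from the table above.

```python
from collections import Counter

def classify(ground_truth: bool, note_value: bool) -> str:
    """Compare one note's score to the ground truth for a single component."""
    if ground_truth and note_value:
        return "TP"   # correctly included
    if ground_truth and not note_value:
        return "FN"   # missed
    if not ground_truth and note_value:
        return "FP"   # incorrectly added
    return "TN"       # correctly omitted

# (ground truth, human note, LLM note) for the eight sessions shown above
sessions = [
    (True, True, True), (False, False, False), (True, True, True), (False, False, False),
    (False, False, False), (True, True, True), (False, False, True), (False, False, False),
]

human_counts = Counter(classify(gt, human) for gt, human, _ in sessions)
llm_counts = Counter(classify(gt, llm) for gt, _, llm in sessions)
print("Human:", dict(human_counts))  # {'TP': 3, 'TN': 5}
print("LLM:  ", dict(llm_counts))    # {'TP': 3, 'TN': 4, 'FP': 1}
```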

Summing the LLM note outcomes across all 30 sessions produces the confusion matrix below. A separate matrix can be built for the human notes using the same approach, giving you a direct benchmark for comparison.

LLM-Generated Notes: Measurable progress reported vs. ground truth (30 sessions)

| | LLM note: Progress Reported | LLM note: No Progress Reported |
|---|---|---|
| Ground truth: Progress Reported | 10 (True Positive) | 2 (False Negative) |
| Ground truth: No Progress Reported | 4 (False Positive) | 14 (True Negative) |

| Metric | Formula | Value |
|---|---|---|
| Accuracy | (TP + TN) / Total | 80.0% |
| Sensitivity | TP / (TP + FN) | 83.3% |
| Specificity | TN / (TN + FP) | 77.8% |
| Precision | TP / (TP + FP) | 71.4% |
| MCC | ranges from -1 to +1 | 0.60 |

A few things are worth noting from these results. Sensitivity of 83% tells us the LLM correctly included a measurable data point in most sessions where the ground truth required one. Specificity of 78% means the system generated a false positive in roughly one out of five sessions where the ground truth indicated no formal measure occurred. In a documentation context, that matters. A note that reports a specific performance percentage for a skill that was not formally measured is not just inaccurate; it is a fabrication that could not survive scrutiny. Running the same matrix for the human notes against the same ground truth gives you the baseline: how often your clinicians meet this criterion when the session data say they should.

Human-Generated Notes: Measurable progress reported vs. ground truth (30 sessions)

| | Human note: Progress Reported | Human note: No Progress Reported |
|---|---|---|
| Ground truth: Progress Reported | 12 (True Positive) | 0 (False Negative) |
| Ground truth: No Progress Reported | 0 (False Positive) | 18 (True Negative) |

| Metric | Formula | Value |
|---|---|---|
| Accuracy | (TP + TN) / Total | 100.0% |
| Sensitivity | TP / (TP + FN) | 100.0% |
| Specificity | TN / (TN + FP) | 100.0% |
| Precision | TP / (TP + FP) | 100.0% |
| MCC | ranges from -1 to +1 | 1.00 |

In this example, the human notes are outperforming the LLM-generated notes on this specific metric. That finding cuts both ways. It confirms that your clinicians are meeting the documentation standard when the session data say they should, which is exactly what you want to see in your baseline. It also tells you that adopting this system, as configured, would represent a step backwards in documentation quality for this particular criterion. That is a meaningful clinical finding, and it is the kind of finding that should inform an adoption decision.

Now, these data are hypothetical, and your results will vary depending on your setting, your clinicians, the system you are evaluating, and how it has been configured. The point is not the numbers. The point is that this kind of comparison is possible, it is not technically difficult, and it is exactly the kind of evidence that should be driving these decisions in our field rather than promises and anecdotal reports.

Step 6: Interpreting the Results and Making a Decision

Having data is not the same as making a decision. The confusion matrix gives you numbers; what you do with them requires clinical judgment informed by the context in which your organization operates.

A useful starting point is to examine your results by component type, because not all data elements carry equal weight. Some components are low stakes and may be well suited for LLM assistance, potentially even outperforming your human baseline. Others carry meaningful clinical, billing, or compliance risk where a false positive or false negative has consequences that extend well beyond the note itself. Furthermore, engaging in this kind of analysis may surface something unexpected: that human performance on certain components is not as strong as assumed, and that the LLM is actually the more consistent documenter. That is not a failure of the analysis. That is exactly what the analysis is for.
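One way to operationalize that component-by-component review is to compute each component's accuracy separately for human and LLM notes and flag any high-risk component where the LLM falls below the human baseline. The sketch below is illustrative only; the component names, accuracy values, and risk designations are assumptions your team would supply from its own results.

```python
# Components your organization has designated high risk (illustrative choices).
HIGH_RISK = {"measurable_progress_reported", "service_type_recorded"}

# Hypothetical per-component accuracy taken from the two confusion matrices.
human_accuracy = {"measurable_progress_reported": 1.00, "location_recorded": 0.90}
llm_accuracy = {"measurable_progress_reported": 0.80, "location_recorded": 0.97}

for component, human_score in human_accuracy.items():
    llm_score = llm_accuracy[component]
    delta = llm_score - human_score
    flag = "  <-- high-risk regression: review before adoption" if delta < 0 and component in HIGH_RISK else ""
    print(f"{component}: human {human_score:.2f}, LLM {llm_score:.2f}, delta {delta:+.2f}{flag}")
```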

Step 7: Monitoring Over Time

Our final step is to monitor the performance of the system over time. Adoption is not the end of the evaluation process. LLM systems are updated, changed, modified, or abandoned by vendors all the time. This can occur without notice. A model that performed well during your initial evaluation may produce meaningfully different output after an update. Treating adoption as a one-time decision leaves your organization exposed to drift you may not detect until it surfaces in a chart audit or payer review.

Continuous monitoring does not require repeating the full evaluation indefinitely. A practical approach borrows from process control: establish a baseline level of performance for each component during your initial evaluation, then sample a smaller number of notes on an ongoing basis to monitor for meaningful deviation. If performance on a high-risk component drops below your acceptable threshold, that is a signal to pause and investigate before continued use.
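A minimal sketch of that kind of ongoing check appears below. The sampling cadence, thresholds, and component names are assumptions; your own baselines and risk tolerance would determine the real values.

```python
# Acceptable performance thresholds per component, set from your baseline evaluation (assumed values).
THRESHOLDS = {"measurable_progress_reported": 0.80, "date_recorded": 0.95}

def check_sample(component: str, outcomes: list) -> None:
    """outcomes: True for each sampled note that matched the ground truth on this component."""
    rate = sum(outcomes) / len(outcomes)
    if rate < THRESHOLDS[component]:
        print(f"ALERT: {component} at {rate:.0%}, below threshold {THRESHOLDS[component]:.0%}. Pause and investigate.")
    else:
        print(f"OK: {component} at {rate:.0%}.")

# Example: a monthly sample of 10 LLM-generated notes for one component
check_sample("measurable_progress_reported", [True] * 7 + [False] * 3)
```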

Any monitoring plan should specify, at a minimum, how many notes will be sampled and how often, the acceptable performance threshold for each component, and what happens when performance drops below that threshold.

The question this post opened with was: does any of this actually work? The answer behavior analysts should give is the same one we give to any intervention claim: show me the data. We would not accept a testimonial as evidence that a treatment produces behavior change. We wouldn't adjust a client's program based on promises made. The standards we hold our clinical work to do not disappear because the product being evaluated runs on a language model. If anything, the complexity of these systems makes rigorous measurement more necessary, not less.

Operational definitions, permanent product recording, baseline logic, interobserver agreement, and simple arithmetic. Every behavior analyst already has these tools. The only question is whether we use them.

Have questions about this framework or want to discuss how to apply it in your organization? Get in touch.

References

Cox, D. J., Weil, L., Sosine, J., Jennings, A. M., & Santos, C. (2025). Getting more from your IOA data: Alternative measures to total, occurrence, and non-occurrence agreement. Behavioral Interventions, e70031. https://doi.org/10.1002/bin.70031

Tam, T. Y. C., Sivarajkumar, S., Kapoor, S., Stolyar, A. V., Polanska, K., McCarthy, K. R., Osterhoudt, H., Wu, X., Visweswaran, S., Fu, S., Mathur, P., Cacciamani, G. E., Sun, C., Peng, Y., & Wang, Y. (2024). A framework for human evaluation of large language models in healthcare derived from literature review. npj Digital Medicine, 7, 258. https://doi.org/10.1038/s41746-024-01258-7