Karen O’Brien

Senior Principal Data Scientist
Modern Technology Solutions, Inc.
Bio

Karen O’Brien is a senior principal data scientist and Artificial Intelligence and Machine Learning (AI/ML) practice lead at Modern Technology Solutions, Inc. In this capacity, she leverages her 20-year Army civilian career as a scientist, tester and evaluator, Operations Research/Systems Analyst (ORSA), and analytics leader to aid Department of Defense (DoD) agencies in implementing AI/ML and advanced analytics solutions. Her Army analytics career ranged ‘from ballistics to logistics,’ with a preponderance of time supporting the U.S. Army Test and Evaluation Command where she was known for designing scientific test and evaluation for emerging technologies. She was a physics and chemistry nerd in the early days, but now uses her M.S. in Predictive Analytics from Northwestern University to help her DoD clients tackle the toughest analytics in support of national defense. She is co-lead of the Women in Data Huntsville Chapter, a guest lecturer in data and analytics graduate programs, and an ad hoc study committee member at the National Academy of Sciences.


Advancing the Test Science of LLM-enabled Systems: A Survey of Factors and Conditions that Matter Most

Testing is an essential step in the AI/ML lifecycle, and a well-designed test provides insight into how well an AI-enabled system will perform under operational conditions. Regardless of the test design method, a scientifically rigorous experiment requires understanding, managing, and controlling the variables that impact test outcomes. For most scientific fields, this is settled science with decades of formalism and honed methodology. For the emerging field of Large Language Models (LLMs) – a type of generative AI – especially as used in business, scientific, creative, and military applications, it is the Wild West. This presentation will survey the factors and conditions that impact LLM test outcomes, along with supporting literature and practical methods, models, and measures for use in your testing. The presentation will also highlight: 1) the statistical assumptions that underlie common LLM performance metrics and how to test those assumptions; 2) how to evaluate a benchmark for its utility in addressing measures of performance, as well as how to check the benchmark’s statistical validity; 3) practical models, with supporting literature, for binning factors into levels of severity (conditions); 4) resources for ensuring a user-centered test design; and 5) how to incorporate selected adversarial techniques. These resources and techniques are immediately actionable (you can even try them out on your own device and favorite LLM during the session) and will equip you to navigate the complexity of scientific test design for LLM-enabled systems.
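
As a hedged illustration of point 1 (not material from the presentation), the short Python sketch below compares the normal-approximation confidence interval for benchmark accuracy, which rests on an i.i.d. Bernoulli assumption, against a nonparametric bootstrap over items. The per-item scores here are placeholder data; substitute real scored benchmark results.

# Minimal sketch: checking the statistical assumption behind a reported
# accuracy metric. The Wald interval assumes independent, identically
# distributed Bernoulli item outcomes; a bootstrap over items does not.
import numpy as np

rng = np.random.default_rng(seed=0)

# Placeholder data: 1 = the model answered the benchmark item correctly.
outcomes = rng.integers(0, 2, size=500)  # replace with real per-item scores

n = outcomes.size
p_hat = outcomes.mean()

# Wald (normal-approximation) interval under the i.i.d. binomial assumption.
se = np.sqrt(p_hat * (1 - p_hat) / n)
wald_ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)

# Percentile bootstrap: resample items with replacement.
boot_means = np.array([
    rng.choice(outcomes, size=n, replace=True).mean()
    for _ in range(10_000)
])
boot_ci = tuple(np.percentile(boot_means, [2.5, 97.5]))

print(f"accuracy = {p_hat:.3f}")
print(f"Wald 95% CI      = ({wald_ci[0]:.3f}, {wald_ci[1]:.3f})")
print(f"bootstrap 95% CI = ({boot_ci[0]:.3f}, {boot_ci[1]:.3f})")
# If the bootstrap interval is markedly wider than the Wald interval, the
# independence / equal-difficulty assumptions behind the metric deserve
# scrutiny (e.g., clustered items or heterogeneous subtasks).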
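
Similarly, as one hypothetical reading of point 5, the sketch below perturbs a prompt with small character-level edits and measures how often the system's answer changes. The function query_llm is a placeholder for whatever model interface is under test, not a real API.

# Hypothetical sketch of a lightweight adversarial check: answer stability
# under small character-level prompt perturbations.
import random
import string
from typing import Optional

def query_llm(prompt: str) -> str:
    # Stand-in for a real model call (e.g., an API client or local model).
    return "placeholder answer"

def perturb(prompt: str, n_edits: int = 2, rng: Optional[random.Random] = None) -> str:
    """Apply a few random character substitutions to the prompt."""
    rng = rng or random.Random(0)
    chars = list(prompt)
    for _ in range(n_edits):
        i = rng.randrange(len(chars))
        chars[i] = rng.choice(string.ascii_lowercase)
    return "".join(chars)

def robustness_rate(prompt: str, trials: int = 20) -> float:
    """Fraction of perturbed prompts whose answer matches the clean answer."""
    rng = random.Random(42)
    baseline = query_llm(prompt)
    matches = sum(
        query_llm(perturb(prompt, rng=rng)) == baseline for _ in range(trials)
    )
    return matches / trials

if __name__ == "__main__":
    print(robustness_rate("What is the boiling point of water at sea level?"))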
