Author: Milan Toma, PhD, SMIEEE
Available Now:
» Amazon (ISBN: 979-8-9998324-8-1)
Book Overview:
The arrival of large language models into public consciousness has transformed the discourse surrounding artificial intelligence in medicine. Systems that converse with remarkable fluency, that deploy technical terminology with apparent precision, that generate text indistinguishable in style from expert analysis: these have captured imaginations and, more dangerously, inspired misplaced confidence. The temptation to mistake linguistic sophistication for diagnostic competence has never been greater, nor the consequences of such confusion more severe.
This book is a warning and a guide. It warns against the conflation of chatting with diagnosing, of eloquence with accuracy, of confident prose with genuine understanding. It guides the reader toward rigorous evaluation of machine learning systems intended for clinical deployment, providing frameworks that distinguish models that have truly learned from those that have merely memorized, and that translate clinical priorities into mathematical form.
Throughout, this book emphasizes a distinction that current enthusiasm for language models tends to obscure: the distinction between general-purpose systems that generate plausible text and task-specific systems that have been trained, validated, and in many cases granted regulatory clearance for defined diagnostic applications. For medical image interpretation requiring diagnostic accuracy, the appropriate technology remains purpose-built machine learning models; not chatbots, however eloquent.
The reader completing this book should possess a framework for evaluating clinical machine learning systems that extends from initial training dynamics through economic analysis. They should understand why learning curves matter more than final metrics, why class imbalance demands specialized treatment, why clinical priorities must shape evaluation criteria, and why the confident prose of a language model provides no guarantee of accuracy. In medicine, where the stakes are measured in human welfare, such understanding is not merely desirable but essential.
The chapters proceed as follows:
Chapters 1 and 2 confront the illusion of expertise that large language models create. Chapter 1 examines how these systems exploit the confidence heuristic (our natural tendency to trust confident-sounding sources), generating authoritative medical prose regardless of underlying accuracy. Chapter 2 presents empirical evidence of this unreliability through a systematic evaluation of leading multimodal language models on radiological interpretation, revealing fundamental diagnostic errors, irreconcilable disagreements about basic findings, and the categorical unsuitability of these systems for autonomous medical image interpretation.
Chapters 3 and 4 survey the broader landscape of machine learning, distinguishing the transformer architectures underlying language models from the task-specific systems appropriate for clinical diagnostics. Chapter 3 maps the taxonomy of approaches (from traditional tree-based methods through neural networks to modern transformers), clarifying which paradigms suit which problems. Chapter 4 examines core algorithms as they are actually employed in cardiovascular medicine, demonstrating how decision trees, support vector machines, ensemble methods, and convolutional neural networks have been validated for specific diagnostic tasks.
Chapter 5 addresses the fundamental challenge of class imbalance that pervades medical data: the healthy are many, the sick are few. Standard algorithms, optimizing for overall accuracy, may learn to predict that everyone is healthy while missing every case of disease. This chapter presents preprocessing and algorithmic approaches that restore the balance clinical reality denies.
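The accuracy trap described above can be made concrete with a minimal sketch. The numbers below are hypothetical (990 healthy patients, 10 diseased), chosen only to show how a model that always predicts "healthy" scores 99% accuracy while detecting no disease at all:

```python
# Hypothetical imbalanced dataset: 990 healthy (0), 10 diseased (1).
y_true = [0] * 990 + [1] * 10
# A degenerate model that predicts "healthy" for everyone.
y_pred = [0] * 1000

# Overall accuracy looks excellent.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Sensitivity (recall) for the disease class tells the real story.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.1%}")  # 99.0%
print(f"recall   = {recall:.1%}")    # 0.0% -- every case of disease missed
```

Optimizing for accuracy alone rewards exactly this degenerate behavior, which is why the chapter's resampling and algorithmic remedies are needed.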
Chapter 6 develops a clinically oriented evaluation protocol that translates medical priorities into mathematical form. The composite metrics introduced here (e.g., the Clinical Discriminative Performance Score, Clinical Predictive Utility Score, Weighted Endpoint Accuracy Score, and Clinical Endpoint Performance Metric) acknowledge that in medicine, not all errors carry equal weight. A false negative that allows disease to progress unchecked differs fundamentally from a false positive that triggers additional testing.
Chapter 7 establishes the primacy of learning dynamics over aggregate metrics. A model's final accuracy, however impressive, tells us the destination without revealing the journey. The learning curves that plot training and validation performance across epochs reveal whether a model has genuinely learned generalizable patterns or has merely memorized its training data. The training-validation gap, the shape of convergence, the expected performance cascade from internal to external validation; these dynamics predict deployment success in ways that final metrics cannot.
Chapter 8 demonstrates that even the interpretation of learning curves cannot be outsourced to language models. Just as these systems fail at medical image interpretation, they fail at interpreting the diagnostic plots that reveal model quality. Four language models, presented with identical learning curves, arrive at contradictory conclusions about fundamental questions: whether overfitting is present, whether the training pipeline is valid, whether results merit publication. The oracle, it turns out, cannot read its own tea leaves.
Chapter 9 completes the evaluation framework by addressing economics. Technical validation, however rigorous, answers only whether a system works; not whether it is worth the investment. Cost-effectiveness analysis integrates operational costs, error consequences, and implementation expenses into a comprehensive assessment of whether deployment makes sense.
+1-516-686-7955
College of Osteopathic Medicine
New York Institute of Technology
Old Westbury Campus
Northern Boulevard
Old Westbury NY 11568-8000
This work is licensed under a Creative Commons Attribution 4.0 International License.