Scoping review of deep learning research illuminates artificial intelligence chasm in otolaryngology-head and neck surgery

npj Digital Medicine volume 8, Article number: 265 (2025)
Clinical validation studies are important for translating artificial intelligence (AI) technology into healthcare but may be performed too rarely in Otolaryngology–Head and Neck Surgery (OHNS). This scoping review examined deep learning publications in OHNS between 1996 and 2023. Searches of the MEDLINE, EMBASE, and Web of Science databases identified 3236 articles, of which 444 met inclusion criteria. Publications increased exponentially from 2012 to 2022 across 48 countries; they were most concentrated in otology and neurotology (28%), most often targeted extending health care provider capabilities (56%), and most often used image input data (55%) and convolutional neural network models (63%). Strikingly, nearly all studies (99.3%) were in silico, proof-of-concept, early-stage studies. Three (0.7%) studies conducted offline validation and zero (0%) conducted clinical validation, illuminating the “AI chasm” in OHNS. Recommendations to cross this chasm include focusing on low-complexity and low-risk tasks, adhering to reporting guidelines, and prioritizing clinical translation studies.
Artificial intelligence (AI) technology is poised to enhance care delivery by physicians in otolaryngology—head and neck surgery (OHNS). Clinical practice in OHNS is rich with image, audio, video, genetic, and neurophysiologic semi-structured data that offer abundant opportunities for AI analysis. Indeed, an increasing number of publications have explored proof-of-concept applications of AI in OHNS1,2,3.
Despite these opportunities, otolaryngologists today use a small number of AI tools for clinical work, most of which are general tools not specific to OHNS (e.g., AI-powered voice dictation and ambient scribing of clinic notes). Part of the reason for this practice is that there are few OHNS-specific AI tools that are ready for clinical deployment. As of August 7, 2024, among 950 FDA-approved AI/machine learning-enabled medical devices4 only two were developed specifically for OHNS5,6. Clearly, there is a discrepancy between the opportunity for developing clinically useful AI tools in OHNS and the realization of this opportunity; this trend has been more broadly termed the “AI chasm” in healthcare7.
Clinical studies are important to translate AI tools by validating their utility and readiness for deployment in the healthcare setting. We hypothesize that clinical validation is an understudied area of research in development of AI applications in OHNS. If true, the finding would identify an area needing attention in clinically targeted AI research. To evaluate this hypothesis, we conducted a scoping review of deep learning research in OHNS with attention to the stages of AI model development (i.e., proof-of-concept, offline validation, clinical validation phase) and approaches for model validation used.
We identified 3236 records in database searches. After de-duplication, we screened 2946 titles and abstracts and 973 full-text articles. Ultimately, 444 publications met inclusion criteria (Fig. 1; Supplementary Data). Hereafter, the included studies are referred to as the deep learning publications in OHNS, and the term AI refers specifically to deep learning methods.
A total of 444 studies were included in the scoping review.
The temporal and geospatial distributions of deep learning publications in OHNS are shown in Fig. 2. In a 10-year period between 2012 and 2022, there was an exponential increase in publications per year from 0 publications in 2012 to 105 publications in 2022, the last full calendar year included in this review (Fig. 2a). Author affiliations spanned 48 countries across six continents, demonstrating the global scale of AI research (Fig. 2b). The countries with the highest numbers of publications were the United States (139 publications), China (95 publications), and South Korea (38 publications) (Supplementary Fig. 1).
a Publications per year on deep learning applications in OHNS, as of October 25, 2023. b Geographic bubble chart of country affiliations of publications. For each country, the area of the circle indicates the number of publications with at least one author affiliated with an institution in that country. The map was created using Natural Earth.
Additional descriptive characteristics of deep learning publications in OHNS are shown in Fig. 3. Deep learning research spanned all OHNS subspecialties with the highest number of publications in otology and neurotology (including audiology) (28%) (Fig. 3a). The most common goal of AI applications was to extend the capabilities of health care providers (56%), and the second most common goal was to screen for medical conditions (30%) (Fig. 3b). Non-radiology (36%) and radiology (19%) images were the most common data types that AI models analyzed. The most common non-radiology images analyzed were otoscopy, laryngoscopy, clinical photography, histology, hyperspectral, and nasal endoscopy images (Supplementary Table 1 and Supplementary Fig. 2). Convolutional neural network (CNN) models were the most used deep learning model type for analyzing image, audio, video, and electrophysiology data (Fig. 3c).
a Primary OHNS sub-specialties of AI applications in publications. b Application type of deep learning models in publications. c Data type of inputs to deep learning models in publications, sub-categorized by the types of deep learning models used to analyze the data. Abbreviations: ANN artificial neural network, CNN convolutional neural network, GAN generative adversarial network, LLM large language model, LSTM long short-term memory, RNN recurrent neural network.
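Because CNN analysis of clinical images dominates the included proof-of-concept studies, a minimal sketch of that typical in silico pipeline follows. It is purely illustrative and not code from any included study; the otoscopy folder path, class count, and hyperparameters are assumptions, and synthetic images stand in for clinical data so the sketch runs on its own.

```python
# Illustrative sketch (assumptions noted in comments): fine-tune a pretrained
# CNN on labeled clinical images, the pattern most included studies follow.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# In a real study the images would come from a labeled clinical set, e.g.
# datasets.ImageFolder("otoscopy/train", transform=transform)  # hypothetical path
# FakeData stands in here so the sketch runs without clinical data.
train_set = datasets.FakeData(size=64, image_size=(3, 224, 224), num_classes=3,
                              transform=transform)
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)

# Transfer learning: reuse ImageNet features, replace the classification head.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 3)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(2):                      # a few epochs for illustration only
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```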
The stages of AI models, adherence to reporting guidelines, and validation methods in included publications are shown in Fig. 4. The stages for AI in healthcare provide a useful framework for assessing the maturity of validation studies of an AI tool and its readiness for clinical deployment8. Nearly all studies (99.3%) were in the in silico proof-of-concept stage. Three (0.7%) studies moved beyond in silico development to offline validation (Fig. 4a). These three studies assessed speech denoising9, visual speech recognition10, and speaker separation11 AI models in human subjects in an experimental setting. Strikingly, there were zero (0%) clinical validation studies among the 444 deep learning publications in OHNS.
a Stages of AI model development in publications. b Reporting guidelines used by publications. c Evaluation methods used by publications. Abbreviations: TRIPOD Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis, TREND Transparent Reporting of Evaluations with Nonrandomized Designs, STROBE Strengthening the Reporting of Observational Studies in Epidemiology, STARD Standards for Reporting of Diagnostic Accuracy Studies, CONSORT-AI Consolidated Standards of Reporting Trials–Artificial Intelligence.
Reporting guidelines provide recommendations, often in the form of checklists, for essential information to include when disseminating research results. They are particularly important for AI research in healthcare, a rapidly evolving field, and can improve reproducibility, transparency, and standardization. Despite their importance, reporting guidelines were used by only 24 studies (5.4%) (Fig. 4b). The reporting guidelines used were, in decreasing order of frequency, the Standards for Reporting of Diagnostic Accuracy Studies (STARD)12 (2.9%), Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD)13 (1.3%), Transparent Reporting of Evaluations with Nonrandomized Designs (TREND)14 (0.7%), Strengthening the Reporting of Observational Studies in Epidemiology (STROBE)15 (0.2%), and Consolidated Standards of Reporting Trials–Artificial Intelligence (CONSORT-AI)16 (0.2%).
Single-institution and retrospective validation methods were most used (Fig. 4c), with only 10 studies (2.3%) using prospective evaluation for either single-institution (1.8%) or multi-institution (0.5%) datasets. Notably, 4 studies (0.9%) did not describe their validation method, and 73 studies (16%) reported retrospective validation results with neither an independent test set nor cross-validation, which is noteworthy because it limits assessment of a model’s generalizability17.
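To make the distinction concrete, the brief sketch below contrasts the two evaluation patterns tallied in Fig. 4c, a held-out test set and k-fold cross-validation. It uses synthetic features and scikit-learn for brevity; neither is tied to any included study.

```python
# Illustrative sketch (synthetic data): hold-out testing vs. cross-validation,
# the two ways a model's generalizability to unseen data can be assessed.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                          # stand-in features
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)    # stand-in labels

# Hold-out test set: fit on the training split, report performance on unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("hold-out accuracy:", clf.score(X_test, y_test))

# 5-fold cross-validation: every sample is held out exactly once for evaluation.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("cross-validation accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```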
Explainability of AI models is important for trust and reproducibility. Forty-one studies (9.2%) described use of a method to attempt to explain the AI model. Two of the most used methods for explainability were Gradient-weighted Class Activation Mapping (Grad-CAM) (3%)18 and Class Activation Maps (CAM)19 (2%).
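For readers unfamiliar with these methods, the following is a minimal Grad-CAM sketch in PyTorch, assuming a torchvision ResNet-18 and a placeholder input rather than code from any included study: gradients of the predicted class score with respect to the last convolutional feature maps are pooled into per-channel weights, and the weighted, rectified sum yields a coarse localization heatmap.

```python
# Minimal Grad-CAM sketch (assumed ResNet-18 backbone and random input image).
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
store = {}

def hook(module, inputs, output):
    # Save the last convolutional feature maps and capture their gradient
    # during the backward pass via a tensor hook.
    store["activation"] = output
    output.register_hook(lambda grad: store.update({"gradient": grad}))

model.layer4[-1].register_forward_hook(hook)     # last conv block of ResNet-18

image = torch.rand(1, 3, 224, 224)               # placeholder input image
scores = model(image)
class_idx = int(scores.argmax(dim=1))
scores[0, class_idx].backward()                  # gradient of predicted class score

# Pool gradients into one weight per feature map, combine, rectify, and upsample.
weights = store["gradient"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * store["activation"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # [0, 1] heatmap
```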
Despite the promise of AI technology to transform precision medicine and healthcare, there has been a chasm between the potential benefit of AI tools and their variable performance when deployed clinically7. We demonstrate that the AI chasm is particularly deep in the field of OHNS, as there is a striking absence of clinical validation studies among over 440 deep learning publications in OHNS between 1996 and 2023. Increasing clinical validation studies, as well as moving beyond proof-of-concept studies, is important to advance the development of clinically useful AI applications in OHNS. To the best of our knowledge, this is the first scoping review of deep learning applications that addresses the entire field of OHNS.
The AI chasm in OHNS mirrors challenges in translating AI technology to healthcare more broadly. Clinical validation studies have demonstrated mixed performance of AI models for sepsis prediction using electronic health record data20, real-time diabetic retinopathy screening21, and chest x-ray screening22. Even if AI models were highly accurate and surpassed human diagnostic capabilities, they would not necessarily enable better care because that would also depend on the healthcare system’s ability to take appropriate actions based on the AI model’s output23. Three practical aspects of model design—actionability, safety, and utility—have been proposed to help bridge the AI chasm24. Additional issues that are likely to contribute to the AI chasm include the substantial cost of developing, implementing, and monitoring AI models, as well as liability concerns. Healthcare is a highly regulated industry, and compliance with regulations contributes to the cost and challenges of deploying and clinically validating AI models25.
Our review informs the following suggestions for advancing the development of clinically useful AI models in OHNS. First, we suggest increasing effort and funding for AI research addressing low-complexity, low-risk tasks because these applications have the potential to provide real-world benefit sooner and in a more clearly defined way than higher-complexity, higher-risk tasks. This could involve refocusing AI research on non-diagnostic applications in OHNS, such as automating routine tasks and triaging workflows, rather than diagnostic applications, such as extending capabilities of healthcare providers and screening for medical conditions, which together constitute 86% of deep learning publications in OHNS (Fig. 3b). Even automating routine tasks carries risk, however, as major health insurers have faced class action lawsuits for using AI models to automate insurance preauthorization.
Second, we suggest adherence to reporting guidelines for AI prediction tools. These guidelines are increasing in number26 and provide frameworks for standardizing reporting and transparency that can improve the fairness, usefulness, and reliability of AI models7,27. Guidelines can help researchers anticipate challenges for clinical translation of an AI model prior to its design and development and improve the quality and reporting of validation methods, which were potentially inadequate (lacking a hold-out test set or use of cross-validation) in 77 studies (17%) (Fig. 4c). The fact that only 5.4% of included studies used a reporting guideline highlights the opportunity for improvement in this regard (Fig. 4b). There remains a challenge in deciding which among the several reporting guidelines to use, and more work is needed to shape understanding of what an accepted standard reporting guideline should be.
Third, we suggest that researchers prioritize clinical validation of AI applications. From our experience in conducting proof-of-concept AI research in OHNS28,29,30,31,32,33,34,35, we recognize the hurdles to translating AI research. One perceived challenge is the assumption that external validation on test datasets from (often multiple) outside institutions is required to assess an AI model’s generalizability. Recurring local validation has been proposed as a potentially advantageous alternative to external validation36. Adopting a standard of site-specific, local validation could encourage researchers to pilot small-scale clinical validation studies at their institution as part of an iterative process to improve clinical validation and lower the barrier to initiating clinical studies. At the same time, it may protect against model drift36,37. To ensure that validation still includes a diverse patient population, we would nonetheless encourage the use of multi-institutional data for initial model training. The need to validate AI models at scale will reinforce previous efforts to create multicenter data-sharing collaborations in OHNS38. The use of frameworks such as federated39 and swarm learning40 can help maintain the confidentiality of patient data and address data-sharing barriers for building large, representative datasets and accelerating model development and clinical validation.
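As a conceptual illustration of how federated learning keeps patient data local, the sketch below implements one round of federated averaging (FedAvg). It is not the API of any framework cited above; the per-institution data loaders and model are hypothetical placeholders.

```python
# Conceptual FedAvg sketch: each institution trains on its own data locally and
# only model weights, never patient data, are shared and averaged centrally.
import copy
import torch
import torch.nn as nn

def local_update(model, data_loader, epochs=1, lr=1e-3):
    """Train a copy of the global model on one institution's local data."""
    local_model = copy.deepcopy(model)
    optimizer = torch.optim.SGD(local_model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    local_model.train()
    for _ in range(epochs):
        for x, y in data_loader:
            optimizer.zero_grad()
            criterion(local_model(x), y).backward()
            optimizer.step()
    return local_model.state_dict()

def federated_average(state_dicts):
    """Average the weights returned by each institution (equal weighting)."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

# One hypothetical round across site_loaders (a list of DataLoaders, one per
# institution); the central server never sees raw patient records.
global_model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
# local_weights = [local_update(global_model, loader) for loader in site_loaders]
# global_model.load_state_dict(federated_average(local_weights))
```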
Finally, we suggest careful attention to the accuracy of ground truth labels in OHNS datasets to promote the success of predictive AI models. Clinical diagnoses do not always serve as the most accurate or precise “ground truth” labels for AI training datasets. For example, in laryngology-based AI research related to voice disorders, participants categorized with “hyperfunctional” voice disorders may include those in the early phases of vocal pathophysiology prior to lesion formation (i.e., vocal fold edema), or those who have already developed nodules, polyps, or contact ulcers32. Superordinate categorization of pathology may obscure the uniqueness of more refined diagnostic categories when applying AI models for predictive objectives.
Though interpretability can influence trust, we do not suggest that future studies must attempt to explain their AI models, a process that was done in only 41 studies (9.2%). The requirement that AI models be explainable to be safely used in medicine is under debate41,42, though lack of explainability likely contributes to distrust of AI models.
This study has several limitations. First, the search terms were initially designed to capture studies focused on developing rather than evaluating existing deep learning applications in OHNS. This resulted in the omission of some peer-reviewed research letters evaluating the application of ChatGPT to OHNS43. Second, although our team of seven reviewers independently screened and extracted data from included studies, our process relied on adjudication by a single reviewer with expertise in AI and clinical OHNS. This review process could be improved in the future, as suggested by the only fair interrater agreement of 73% (Cohen’s kappa 0.39) for title and abstract screening. Third, some parts of the data extraction relied on subjective judgment, such as assessing adherence to reporting guidelines, stages of AI model development, and validation methods. Use of more explicit instructions in the data extraction form, for example, listing all relevant reporting guidelines, could make the data extraction more consistent. Fourth, aside from publication years, the data were analyzed collectively across all years of publication, which may inadvertently hide trends in deep learning research across different time periods. Fifth, this review’s database searches were conducted in October 2023 and may miss important studies published since then. Future studies should update this review to include new studies and monitor the rapid progress of AI research in OHNS. Such updates will be important to surveil trends in large language model applications in OHNS44,45 and multimodal AI46, which offer cutting-edge approaches for developing holistic and clinically relevant models. Sixth, while this review conducted a broad survey of AI applications in OHNS, further descriptions of individual applications can be found in recent scoping reviews of AI in audiology47, laryngology3, and other OHNS sub-specialties.
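For context on the agreement statistic reported above, the short sketch below, using hypothetical screening votes, shows how Cohen’s kappa discounts raw percent agreement for the agreement expected by chance, so kappa is typically lower than raw agreement when one screening decision dominates.

```python
# Small sketch (hypothetical include/exclude votes, not the review's data):
# Cohen's kappa corrects raw agreement for chance agreement.
from sklearn.metrics import cohen_kappa_score

reviewer_a = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
reviewer_b = [1, 0, 0, 0, 0, 0, 1, 1, 0, 0]

raw_agreement = sum(a == b for a, b in zip(reviewer_a, reviewer_b)) / len(reviewer_a)
kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"raw agreement: {raw_agreement:.2f}, Cohen's kappa: {kappa:.2f}")
```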
The framework for “translational research” in AI applications for OHNS is still being established. This review identifies a clear gap in the AI literature in OHNS for clinical validation studies. Our recommendations to help fill this gap include focusing on low complexity and low risk tasks, adhering to reporting guidelines, and prioritizing clinical translation while keeping rigorous standards of diversity in our datasets. If successful, clinical translation of AI technology in OHNS might serve as a blueprint for the broader healthcare community to cross the AI chasm.
We conducted a scoping review of the literature on deep learning applications in OHNS following the PRISMA Extension for Scoping Reviews guidelines48 (Supplementary Table 2). Our search strategy was developed in collaboration with a research librarian (Supplementary Table 3). Search terms were chosen to capture studies that developed or evaluated deep learning models that were intended to be primarily applied to the field of OHNS. We executed the queries from October 16th to October 25th, 2023, to search the MEDLINE, EMBASE, and Web of Science databases.
Inclusion criteria were the development or evaluation of a deep learning model primarily intended to be applied to the field of OHNS. Deep learning models included neural networks (e.g., artificial, convolutional, recurrent, long short-term memory, generative adversarial) and large language models. Three categories of studies fell outside the inclusion criteria and, therefore, were excluded. First, studies primarily targeting another specialty (e.g., prediction of apnea-hypopnea index for sleep medicine or nasopharyngeal cancer patient survival after radiation treatment for radiation oncology) were excluded. The primary specialty targeted by an AI application was determined based on which physician specialty the AI application was intended to be used by clinically. For example, OHNS applications included analysis of radiology images to aid clinical decision making by otolaryngologists (e.g., prediction of inverted papilloma malignant transformation on magnetic resonance imaging scans28). Second, studies of machine learning methods (e.g., multilayer perceptron, logistic regression, and support vector machines) that did not involve deep learning were excluded. Third, general speech recognition tasks that did not directly relate to otolaryngology were excluded, even if these tasks involved the analysis of voice data (e.g., general-purpose speech-to-text prediction models). Finally, only peer-reviewed, original research articles published in the English language that had retrievable full texts were considered.
A team of seven reviewers (G.S.L., S.F., M.C.L., S.P., J.H., S.K., and M.M.S.) conducted the literature review using Covidence, a collaborative web-based platform49. References found by the search strategy were de-duplicated and screened using the inclusion criteria, first by title and abstract, and then by full text. Full texts were reviewed by one reviewer and checked by a second reviewer. Discrepancies between reviewers were decided by an adjudicator with expertise in deep learning and otolaryngology (G.S.L.). Studies that passed the full-text screening phase were included in the review.
We extracted the following data from included studies: article information (e.g., year of publication, countries of authors’ affiliated institutions), deep learning method, application (i.e., the goal of the application and target sub-specialty within OHNS), input data, method for model validation, stage of model development, use of reporting guidelines, and attempts to explain the model. Our data extraction form is available in Supplementary Table 4. If a study investigated multiple deep learning methods and data types, the primary model and data type in the study were chosen. We categorized model validation methods according to whether data were collected from single versus multiple institutions, obtained prospectively versus retrospectively, and/or partitioned into a hold-out test dataset or cross-validation folds. Omission of the use of either an independent test dataset or cross-validation limits assessment of the generalization performance of the model with future, unseen data17. Evaluation of validation methods erred on the more robust methodology in cases of uncertainty to provide an upper bound on the quality of validation. For example, a study that reported “test” results but did not explicitly describe the use of a hold-out test dataset was presumed to have used one.
We categorized the stages of AI model development according to the stages for AI in healthcare described in the DECIDE-AI reporting guidelines8: in silico evaluation (i.e., proof of concept), offline validation (i.e., silent/shadow evaluation), small-scale clinical validation, large-scale clinical validation, and post-market surveillance. For consistency, we considered in silico evaluation as application of an AI model to prepared data in a context removed from the context of the intended use (e.g., assessment of a speech denoising algorithm on audio recordings saved on a computer); offline validation as application to prospective data in a context similar to the intended use (e.g., assessment of a speech denoising algorithm in cochlear implant subjects in the laboratory); and clinical validation as application in the context of the intended use (e.g., assessment of a speech denoising algorithm in cochlear implant subjects outside the laboratory). Disagreements during data extraction were decided by the adjudicator (G.S.L.).
The complete dataset of studies included in the scoping review is provided in the Supplementary Data.
1. Crowson, M. G. et al. A contemporary review of machine learning in otolaryngology–head and neck surgery. Laryngoscope 130, 45–51 (2020).
2. Bur, A. M., Shew, M. & New, J. Artificial intelligence for the otolaryngologist: a state of the art review. Otolaryngol. Head Neck Surg. 160, 603–611 (2019).
3. Liu, G. S., Jovanovic, N., Sung, C. K. & Doyle, P. C. A scoping review of artificial intelligence detection of voice pathology: challenges and opportunities. Otolaryngol. Head Neck Surg. 171, 658–666 (2024).
4. Artificial Intelligence and Machine Learning (AI/ML)-Enabled Medical Devices. FDA https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-and-machine-learning-aiml-enabled-medical-devices (2024).
5. ENT Navigation Application with Kick EM from Brainlab. Brainlab https://www.brainlab.com/surgery-products/overview-ent-products/ent-navigation-application/.
6. Reimagining the Way Medical Devices Are Designed. PacificMD Biotech https://www.pmdbiotech.com/.
7. Lu, J. H. et al. Assessment of adherence to reporting guidelines by commonly used clinical prediction models from a single vendor: a systematic review. JAMA Netw. Open 5, e2227779 (2022).
8. Vasey, B. et al. Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nat. Med. 28, 924–933 (2022).
9. Gajecki, T., Zhang, Y. & Nogueira, W. A deep denoising sound coding strategy for cochlear implants. IEEE Trans. Biomed. Eng. 70, 2700–2709 (2023).
10. Raghavan, A. M., Lipschitz, N., Breen, J. T., Samy, R. N. & Kohlberg, G. D. Visual speech recognition: improving speech perception in noise through artificial intelligence. Otolaryngol. Head Neck Surg. 163, 771–777 (2020).
11. Healy, E. W., Taherian, H., Johnson, E. M. & Wang, D. A causal and talker-independent speaker separation/dereverberation deep learning algorithm: cost associated with conversion to real-time capable operation. J. Acoust. Soc. Am. 150, 3976 (2021).
12. Cohen, J. F. et al. STARD 2015 guidelines for reporting diagnostic accuracy studies: explanation and elaboration. BMJ Open 6, e012799 (2016).
13. Collins, G. S., Reitsma, J. B., Altman, D. G. & Moons, K. G. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMC Med. 13, 1 (2015).
14. Des Jarlais, D. C., Lyles, C. & Crepaz, N. Improving the reporting quality of nonrandomized evaluations of behavioral and public health interventions: the TREND statement. Am. J. Public Health 94, 361–366 (2004).
15. von Elm, E. et al. Strengthening the reporting of observational studies in epidemiology (STROBE) statement: guidelines for reporting observational studies. BMJ 335, 806–808 (2007).
16. Liu, X., Cruz Rivera, S., Moher, D., Calvert, M. J. & Denniston, A. K. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat. Med. 26, 1364–1374 (2020).
17. Liu, Y., Chen, P.-H. C., Krause, J. & Peng, L. How to read articles that use machine learning: users’ guides to the medical literature. JAMA 322, 1806–1816 (2019).
18. Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. Int. J. Comput. Vis. 128, 336–359 (2020).
19. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. & Torralba, A. Learning deep features for discriminative localization. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2921–2929 (IEEE, 2016). https://doi.org/10.1109/CVPR.2016.319.
20. Kamran, F. et al. Evaluation of sepsis prediction models before onset of treatment. NEJM AI 1, AIoa2300032 (2024).
21. Ruamviboonsuk, P. et al. Real-time diabetic retinopathy screening by deep learning in a multisite national screening programme: a prospective interventional cohort study. Lancet Digit. Health 4, e235–e244 (2022).
22. Lind Plesner, L. et al. Commercially available chest radiograph AI tools for detecting airspace disease, pneumothorax, and pleural effusion. Radiology 308, e231236 (2023).
23. Gomez Rossi, J., Rojas-Perilla, N., Krois, J. & Schwendicke, F. Cost-effectiveness of artificial intelligence as a decision-support system applied to the detection and grading of melanoma, dental caries, and diabetic retinopathy. JAMA Netw. Open 5, e220269 (2022).
24. Seneviratne, M. G., Shah, N. H. & Chu, L. Bridging the implementation gap of machine learning in healthcare. BMJ Innov. 6, 45–47 (2020).
25. Mennella, C., Maniscalco, U., De Pietro, G. & Esposito, M. Ethical and regulatory challenges of AI technologies in healthcare: a narrative review. Heliyon 10, e26297 (2024).
26. Crowson, M. G. & Rameau, A. Standardizing machine learning manuscript reporting in otolaryngology-head & neck surgery. Laryngoscope 132, 1698–1700 (2022).
27. Callahan, A. et al. Standing on FURM ground: a framework for evaluating fair, useful, and reliable AI models in health care systems. NEJM Catal. Innov. Care Deliv. 5, https://doi.org/10.1056/CAT.24.013 (2024).
28. Liu, G. S. et al. Deep learning classification of inverted papilloma malignant transformation using 3D convolutional neural networks and magnetic resonance imaging. Int. Forum Allergy Rhinol. 12, 1025–1033 (2022).
29. Liu, G. S. et al. ELHnet: a convolutional neural network for classifying cochlear endolymphatic hydrops imaged with optical coherence tomography. Biomed. Opt. Express 8, 4579–4594 (2017).
30. Liu, G. S., Shenson, J. A., Farrell, J. E. & Blevins, N. H. Signal to noise ratio quantifies the contribution of spectral channels to classification of human head and neck tissues ex vivo using deep learning and multispectral imaging. J. Biomed. Opt. 28, 016004 (2023).
31. Shenson, J. A., Liu, G. S., Farrell, J. & Blevins, N. H. Multispectral imaging for automated tissue identification of normal human surgical specimens. Otolaryngol. Head Neck Surg. 164, 328–335 (2021).
32. Liu, G. S. et al. End-to-end deep learning classification of vocal pathology using stacked vowels. Laryngoscope Investig. Otolaryngol. 8, 1312–1318 (2023).
33. Liu, G. S., Cooperman, S. P., Neves, C. A. & Blevins, N. H. Estimation of cochlear implant insertion depth using 2D-3D registration of postoperative X-ray and preoperative CT images. Otol. Neurotol. 45, e156 (2024).
34. Neves, C. A. et al. Automated radiomic analysis of vestibular schwannomas and inner ears using contrast-enhanced T1-weighted and T2-weighted magnetic resonance imaging sequences and artificial intelligence. Otol. Neurotol. 44, e602 (2023).
35. Liu, G. S. et al. Artificial intelligence tracking of otologic instruments in mastoidectomy videos. Otol. Neurotol. 45, 1192 (2024).
36. Youssef, A. et al. External validation of AI models in health should be replaced with recurring local validation. Nat. Med. 29, 2686–2687 (2023).
37. Granlund, T., Stirbu, V. & Mikkonen, T. Towards regulatory-compliant MLOps: Oravizio’s journey from a machine learning experiment to a deployed certified medical product. SN Comput. Sci. 2, 342 (2021).
38. Beswick, D. M. et al. Design and rationale of a prospective, multi-institutional registry for patients with sinonasal malignancy. Laryngoscope 126, 1977–1980 (2016).
39. Crowson, M. G. et al. A systematic review of federated learning applications for biomedical data. PLOS Digit. Health 1, e0000033 (2022).
40. Warnat-Herresthal, S. et al. Swarm Learning for decentralized and confidential clinical machine learning. Nature 594, 265–270 (2021).
41. Holm, E. A. In defense of the black box. Science 364, 26–27 (2019).
42. Ghassemi, M., Oakden-Rayner, L. & Beam, A. L. The false hope of current approaches to explainable artificial intelligence in health care. Lancet Digit. Health 3, e745–e750 (2021).
43. Ayoub, N. F., Lee, Y.-J., Grimm, D. & Balakrishnan, K. Comparison between ChatGPT and Google search as sources of postoperative patient instructions. JAMA Otolaryngol. Head Neck Surg. 149, 556–558 (2023).
44. Vaira, L. A. et al. Validation of the quality analysis of medical artificial intelligence (QAMAI) tool: a new tool to assess the quality of health information provided by AI platforms. Eur. Arch. Otorhinolaryngol. 281, 6123–6131 (2024).
45. Vaira, L. A. et al. Enhancing AI chatbot responses in health care: the SMART prompt structure in head and neck surgery. OTO Open 9, e70075 (2025).
46. Judge, C. S. et al. Multimodal artificial intelligence in medicine. Kidney360 5, 1771 (2024).
47. Frosolini, A. et al. Artificial intelligence in audiology: a scoping review of current applications and future directions. Sensors 24, 7126 (2024).
48. Tricco, A. C. et al. PRISMA extension for scoping reviews (PRISMA-ScR): checklist and explanation. Ann. Intern. Med. 169, 467–473 (2018).
49. Covidence systematic review software. Veritas Health Innovation.
We thank Christopher Stave, M.L.S. for assistance with developing search strategies and extracting references, and Nigam Shah for helpful discussion. K.M.S. gratefully acknowledges funding support from the Bertarelli Foundation Endowed Professorship and the Remondi Foundation.
Department of Otolaryngology–Head and Neck Surgery, Stanford University, Stanford, CA, USA
George S. Liu, Soraya Fereydooni, Melissa Chaehyun Lee, Srinidhi Polkampally, Jeffrey Huynh, Sravya Kuchibhotla, Mihir M. Shah, Noel F. Ayoub, Robson Capasso, Michael T. Chang, Philip C. Doyle, F. Christopher Holsinger, Zara M. Patel, Jon-Paul Pepper, C. Kwang Sung, Nikolas H. Blevins & Konstantina M. Stankovic
Department of Otolaryngology–Head and Neck Surgery, Johns Hopkins University, Baltimore, MD, USA
George S. Liu & Francis X. Creighton
G.S.L. contributed to the study’s conception, design, data acquisition, data analysis, data interpretation, and writing of the manuscript. S.F. contributed to the study’s design and data acquisition. M.C.L., S.P., J.H., S.K., and M.M.S. contributed to the data acquisition. N.F.A., R.C., M.T.C., P.C.D., F.C.H., Z.M.P., J.P.P., C.K.S., F.X.C., N.H.B., and K.M.S. contributed to the interpretation of data and revision of the manuscript. All authors reviewed and approved the submitted version of the manuscript and agreed to be accountable for their contributions.
Correspondence to George S. Liu.
The authors declare no competing interests.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Liu, G.S., Fereydooni, S., Lee, M.C. et al. Scoping review of deep learning research illuminates artificial intelligence chasm in otolaryngology-head and neck surgery. npj Digit. Med. 8, 265 (2025). https://doi.org/10.1038/s41746-025-01693-0