How Reliable Is AI for Infant Safe-Sleep Advice? Evaluating Accuracy Against National Guidelines

Johns Hopkins Medicine
Apr 8, 2026

Why It Matters

Inaccurate AI‑generated sleep advice could jeopardize infant safety, making clinician‑led validation essential as parents increasingly turn to chatbots for health information.

Key Takeaways

  • Large language models show variable accuracy on infant sleep advice.
  • Gemini outperformed ChatGPT in alignment with AAP guidelines.
  • Empathy scores were high across all models despite factual gaps.
  • Simple guideline questions were answered more accurately than nuanced, scenario‑specific ones.
  • Pediatric oversight is needed to ensure AI‑generated advice remains evidence‑based.

Summary

The study, presented by Johns Hopkins medical student Evin Rothschild, examined how reliably large language models (LLMs) provide infant safe‑sleep guidance compared with the 2022 American Academy of Pediatrics (AAP) recommendations.

Researchers extracted nine frequent caregiver questions from Reddit’s New Parents forum, covering topics such as sleep position, co‑sleeping, swaddling, and managing sick infants. Each query was submitted three times to ChatGPT, Google Gemini, and Anthropic Claude. Three independent reviewers scored the responses for accuracy, completeness, tone, and empathy, and readability was assessed with the Flesch‑Kincaid grade level. Statistical analysis (ANOVA with post‑hoc tests) revealed significant differences: Gemini achieved the highest mean accuracy (1.85) versus ChatGPT’s 1.30 (p = 0.01), while all models scored uniformly high on empathy.
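The readability measure cited here, the Flesch‑Kincaid grade level, is a simple formula over sentence length and syllable counts. A minimal sketch in Python is below; note that the vowel‑group syllable counter is a rough simplification for illustration (dedicated readability tools use more careful syllable counting), and this is not the scoring code the study itself used.

```python
import re

def count_syllables(word: str) -> int:
    """Crude syllable estimate: count runs of vowels in the word.
    This is an approximation; real readability tools count more carefully."""
    vowel_groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(vowel_groups))

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    Higher values correspond to harder, more complex text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)

# Example: a short, simple safe-sleep answer scores at a low grade level.
print(flesch_kincaid_grade("Place your baby on their back to sleep."))
```

A middle‑school reading level corresponds to roughly grades 6 to 8 on this scale, which is where most of the models’ responses landed.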

Only three of the nine questions received top accuracy ratings across all models, and answers to straightforward guideline‑based queries (e.g., “What position should I use when I put my baby down to sleep?”) were markedly more accurate than answers to nuanced, scenario‑specific questions. Readability hovered at middle‑school to high‑school levels, and ChatGPT’s responses were the least readable of the three, indicating more complex language.

The findings underscore that, although LLMs can deliver empathetic responses, their factual consistency is uneven, posing risks for parents who rely on AI for medical advice. The authors call for pediatric oversight and collaboration with AI developers to embed evidence‑based safe‑sleep protocols into these tools, safeguarding infant health as digital guidance becomes commonplace.

Original Description

Evin Rothschild, a medical student at the Johns Hopkins University School of Medicine, discusses her research on parents and caregivers seeking information online and whether large language models follow the 2022 American Academy of Pediatrics safe-sleep recommendations. In this research, Rothschild and team submitted common questions from online forums to large language models to assess the results. #infantsleep #aihealthcare
