Preclinical Design and Algorithm Testing
Apple conducted research studies to develop the Breathing Disturbances metric and the associated sleep
apnea notification algorithm. Adult participants from multiple research sites provided informed consent via
protocols approved by an institutional review board (IRB). To enhance performance generalizability, Apple
recruited a diverse population of research participants across various demographic factors (age, biological
sex, race, ethnicity, and BMI) and evaluated sleep in both at-home and in-laboratory sleeping environments.
In addition, the studies included a broad range of sleep apnea severity, from normal (fewer than five apnea
and hypopnea events per hour) to severe (more than 30 events per hour).
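
The severity range described above can be sketched as a small lookup. This is an illustration only, not part of Apple's algorithm: the function name `classify_severity` is hypothetical, the text states only the "normal" (fewer than five) and "severe" (more than 30) boundaries, and the mild/moderate split at 15 events per hour is an assumed, commonly used convention that does not appear in this section.

```python
def classify_severity(events_per_hour: float) -> str:
    """Map an apnea and hypopnea event rate to a severity label.

    Only the normal (< 5) and severe (> 30) boundaries come from the text.
    The mild/moderate boundary at 15 is an assumption for illustration.
    """
    if events_per_hour < 0:
        raise ValueError("events_per_hour must be non-negative")
    if events_per_hour < 5:
        return "normal"      # from the text: fewer than five events per hour
    if events_per_hour < 15:
        return "mild"        # assumed boundary, not stated in the text
    if events_per_hour <= 30:
        return "moderate"    # some conventions assign exactly 30 to severe
    return "severe"          # from the text: more than 30 events per hour
```

A value of exactly 30 is labeled moderate here because the text defines severe as more than 30; other conventions may place that boundary differently.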
Adult research participants wore an Apple Watch while reference recordings were conducted simultaneously
from in-laboratory PSG (one night) or at-home HSAT recordings (one to four nights). In each case, the
reference device recordings followed a standard clinical approach. Certified PSG technologists scored the
clinical recordings according to American Academy of Sleep Medicine (AASM) standards to provide reference
labels of each apnea and hypopnea event. It’s important to note that current AASM clinical standards allow
multiple definitions for scoring sleep apnea events. Both the training and validation studies used the strictest
scoring definition, which requires a 4% oxygen desaturation for hypopneas.
The design phase of algorithm development consisted of 3936 nights of at-home and in-lab PSG recordings
from 2160 participants. Some participants contributed multiple nights of recordings. The performance
testing phase included an additional 7220 nights from 2542 participants (see Table 1). None of the data
from the testing set was used in the design phase. That is, the testing set was sequestered and unseen
by the algorithm during the training phase. Participants self-reported race and ethnicity in the design and
testing sets, respectively, as White (70.7% and 69.1%), Black (9.0% and 8.9%), Asian (11.4% and 13.5%),
and Hispanic (7.2% and 7.6%).
The algorithm output of Breathing Disturbances is expressed as a continuous variable, which has units of
events per hour. The sleep apnea notification algorithm assesses the Breathing Disturbances data every
30 days (non-rolling), starting 30 days after onboarding. When at least 10 sleep recordings (not required
to be sequential) with Breathing Disturbances values occur within a given 30-day period, the notification
algorithm checks if at least 50% of these values are elevated. If so, the algorithm surfaces a notification of
possible sleep apnea to the user; otherwise, it remains silent. The algorithm also stays silent if there are fewer
than 10 nights with a Breathing Disturbances value in a 30-day window, as the data is insufficient for analysis.
Whether or not a notification is surfaced, the sleep apnea notification algorithm will remain active and attempt
to analyze data after each 30-day window.
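
The decision rule for a single 30-day window, as described above, can be sketched as follows. This is a minimal illustration under stated assumptions, not Apple's implementation: the function name `should_notify` and the `elevated_threshold` parameter are hypothetical, the text does not state what value counts as "elevated", and treating a value equal to the threshold as elevated is also an assumption.

```python
from typing import Optional, Sequence

MIN_NIGHTS = 10          # from the text: at least 10 recordings per window
ELEVATED_FRACTION = 0.5  # from the text: at least 50% of values elevated


def should_notify(
    window_values: Sequence[Optional[float]],
    elevated_threshold: float,
) -> bool:
    """Decide whether one non-rolling 30-day window triggers a notification.

    window_values holds one entry per night in the window: a Breathing
    Disturbances value in events per hour, or None when no value exists.
    elevated_threshold is a hypothetical cutoff; the text does not give it.
    """
    # Nights do not need to be sequential, so simply drop the missing ones.
    values = [v for v in window_values if v is not None]
    if len(values) < MIN_NIGHTS:
        return False  # insufficient data for analysis: stay silent
    elevated = sum(1 for v in values if v >= elevated_threshold)
    return elevated / len(values) >= ELEVATED_FRACTION
```

For example, a window with five nights above the cutoff, five below, and twenty nights without a value meets both conditions, while a window with only nine recorded nights stays silent regardless of the values.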
The operating point on a receiver operating characteristic (ROC) curve reflects the trade-offs incurred when
choosing a threshold for a binary classifier to balance sensitivity and specificity goals. Sensitivity, or the true
positive rate, refers to the percentage of participants with moderate to severe sleep apnea who are correctly
identified by the algorithm. Specificity refers to the percentage of those without moderate to severe sleep
apnea who wouldn’t receive a notification. The operating point on the ROC curve was intentionally chosen to
favor specificity, understanding that operating point positions with high specificity have lower sensitivity. This
choice supported the goal of minimizing the false positive risk, which is particularly important for a feature
designed to repeatedly check for signs of possible sleep apnea over time (and in consideration of the
cumulative false positive rate) while simultaneously maintaining an impactful true positive rate.
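
The two metrics defined above follow directly from participant-level counts. The sketch below shows the standard formulas; the function and parameter names are hypothetical, and the counts in the usage note are invented for illustration and are not study data.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of participants with the condition who are notified."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int) -> float:
    """Fraction of participants without the condition who are not notified."""
    return true_negatives / (true_negatives + false_positives)
```

With illustrative counts of 60 notified and 40 missed among participants with moderate to severe sleep apnea, `sensitivity(60, 40)` is 0.6. With 90 not notified and 10 notified among those without it, `specificity(90, 10)` is 0.9. Moving the threshold to reduce false positives raises specificity and typically lowers sensitivity, which is the trade-off the operating point reflects.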
In the sequestered algorithm testing data set, notification performance was 66.6% for sensitivity and 95.9%
for specificity.
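
To illustrate why the cumulative false positive rate matters for a feature that checks repeatedly, the sketch below computes the chance of at least one false notification across several windows. It assumes, purely for illustration, that every 30-day window is an independent check with the same specificity. Repeated checks on the same person are unlikely to be independent, so this is not an estimate of real-world behavior. The function name and the number of windows are hypothetical.

```python
def cumulative_false_positive_rate(specificity: float, windows: int) -> float:
    """Chance of at least one false notification over repeated windows.

    Simplifying assumption (not from the source): windows are independent
    and share the same specificity.
    """
    if not 0.0 <= specificity <= 1.0:
        raise ValueError("specificity must be between 0 and 1")
    if windows < 0:
        raise ValueError("windows must be non-negative")
    # P(no false positive in any window) = specificity ** windows
    return 1.0 - specificity ** windows


# Reported per-window specificity from the testing set, over a hypothetical
# number of consecutive 30-day windows.
rate = cumulative_false_positive_rate(0.959, windows=12)
```

Under this simplified model, even a small per-window false positive rate compounds as the number of windows grows, which is consistent with the stated preference for an operating point with high specificity.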