Automated sleep-scoring models, such as those used by Fitbit or Oura, report high accuracy on population-level datasets, but these aggregate metrics can mask critical failures on individual users. The core problem is that the models overfit to common polysomnography (PSG) patterns and so fail to account for the wide biological diversity in sleep architecture.
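To make the gap between aggregate and individual performance concrete, here is a minimal sketch that compares pooled epoch-level accuracy with per-subject accuracy. The data are an invented toy example, not results from any real device or dataset, and the function names (`pooled_accuracy`, `per_subject_accuracy`, `make_subject`) and subject IDs are hypothetical.

```python
from collections import defaultdict


def pooled_accuracy(records):
    """Fraction of correctly scored epochs across all subjects combined.

    records: list of (subject_id, true_stage, predicted_stage) tuples.
    """
    correct = sum(1 for _, true, pred in records if true == pred)
    return correct / len(records)


def per_subject_accuracy(records):
    """Fraction of correctly scored epochs, computed separately per subject."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for subject_id, true, pred in records:
        totals[subject_id] += 1
        hits[subject_id] += int(true == pred)
    return {sid: hits[sid] / totals[sid] for sid in totals}


def make_subject(subject_id, n_epochs, n_correct):
    """Build toy records: n_correct matching epochs, the rest mis-scored."""
    right = [(subject_id, "N2", "N2")] * n_correct
    wrong = [(subject_id, "N2", "REM")] * (n_epochs - n_correct)
    return right + wrong


# Assumed toy cohort: nine "typical" sleepers scored well, one atypical
# sleeper scored poorly. All numbers are made up for illustration.
records = []
for i in range(9):
    records += make_subject(f"typical_{i}", n_epochs=100, n_correct=92)
records += make_subject("atypical_0", n_epochs=100, n_correct=48)

overall = pooled_accuracy(records)
by_subject = per_subject_accuracy(records)
worst_id = min(by_subject, key=by_subject.get)

print(f"pooled accuracy: {overall:.3f}")
print(f"worst subject:   {worst_id} at {by_subject[worst_id]:.3f}")
```

In this toy cohort the pooled figure stays high because the nine well-scored subjects dominate the epoch count, while the one atypical subject is scored correctly less than half the time. Reporting the per-subject distribution (or at least its minimum) alongside the pooled number is one way to surface that kind of failure.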














