Developing an Objective Framework for Evaluating Mobile Health Applications


The proliferation of mobile health (mHealth) apps offers promising avenues for enhancing patient care and health management. However, the sheer number of available applications presents a significant challenge for clinicians seeking to identify high-quality, evidence-based tools. Recognizing this gap, researchers from the Defense Health Agency developed the App Rating Inventory—a standardized, objective method aimed at supporting clinical decision-making by systematically evaluating health apps. This development study outlines the rigorous process behind creating a reliable and comprehensive assessment tool that can be applied across various medical and behavioral conditions, ensuring both clinical relevance and usability.

Introduction

Background

The absence of universally accepted guidelines for assessing the quality of health-related applications often leads to confusion and hesitancy among healthcare providers. A 2019 Australian survey by Byambasuren et al. revealed that while many general practitioners incorporate mobile apps into their practice, their usage is primarily limited to medical reference purposes. Barriers to broader clinical integration include a lack of knowledge about effective app use and concerns over the trustworthiness of sources. As a result, clinicians face difficulties in vetting apps, which can hinder their adoption into routine care.

Beyond superficial descriptions, user ratings and testimonials on app distribution platforms fail to reliably indicate an app’s clinical validity or therapeutic value. Developer descriptions often lack accuracy, and while usability is important, it does not necessarily equate to clinical efficacy. User ratings tend to be only moderately correlated with objective quality measures, often reflecting limited personal experience rather than comprehensive app evaluation. This underscores the need for standardized, criteria-based appraisal tools that can objectively assess app features, evidence support, and content quality to facilitate safer and more effective use in clinical settings.

Existing App Rating Systems

Numerous frameworks and scales have been developed to evaluate mobile health applications. These often encompass multiple aspects such as usability, visual design, engagement, content accuracy, and privacy considerations. For example, the Mobile App Rating Scale (MARS) and PsyberGuide provide structured approaches to app evaluation, focusing on usability and evidence backing, respectively. The Enlight system further emphasizes therapeutic value, derived from extensive literature review.

Other notable models include assessments by the American Psychiatric Association (APA), the Anxiety and Depression Association of America, and the UK National Health Service (NHS). Each system incorporates criteria suited to specific health domains, such as mental health or nutrition, and often includes checklists for privacy, security, and technical support. Despite these efforts, no existing tool fully meets the needs of the Military Health System (MHS), which requires an all-encompassing, objective evaluation method applicable across diverse conditions and app types, including those developed by government agencies and commercial vendors.

Setting the Stage for the App Rating Inventory

The Defense Health Agency’s Connected Health branch spearheaded the creation of the App Rating Inventory to address this gap. This team recognized that existing tools did not satisfy the need for an objective, comprehensive, and versatile assessment method suitable for the MHS. The ideal system needed to be user-friendly, applicable across various clinical scenarios, and capable of evaluating apps developed by different entities, whether civilian or government.

A critical requirement was that the rating system be free from subjective scoring and personal bias, relying instead on clear, measurable criteria. It was also essential that the assessment encompass multiple dimensions, including evidence base, user experience, content quality, and privacy features, all within a unified framework. After reviewing current literature and rating models, the team concluded that no existing tool fully aligned with these needs, prompting the development of a new, tailored evaluation system.

Methods

App Selection Procedure

The evaluation process begins with a comprehensive market scan of app distribution platforms such as the Apple App Store and Google Play. This initial search aims to identify a broad pool of apps relevant to a specific health condition. Inclusion criteria cover factors such as being free to download, being patient-focused, containing educational content, and offering features such as mindfulness exercises, sleep tracking, or cognitive behavioral therapy components. Apps that meet these inclusion benchmarks are then ranked based on user reviews, number of downloads, and overall ratings.

If more than ten apps qualify, a top-10 list is generated to streamline review efforts. This ranking process is validated through statistical analysis, such as ANOVA, which confirms that higher-ranked apps based on user-generated data tend to score better when assessed with the App Rating Inventory. This approach ensures that the selection for detailed review is grounded in data-driven evidence, enhancing the reliability of subsequent evaluations.
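To make the ranking step concrete, the sketch below shows one way such a screen might be implemented. The field names, the composite ranking formula, and the score values are illustrative assumptions for this post rather than the team's actual procedure; the ANOVA call simply mirrors the kind of group comparison described above.

```python
# Illustrative ranking of candidate apps on user-generated signals; field names
# and the composite formula are assumptions for this sketch.
from dataclasses import dataclass
from scipy.stats import f_oneway  # one-way ANOVA

@dataclass
class AppListing:
    name: str
    avg_user_rating: float  # e.g., 0.0-5.0 star rating from the store listing
    review_count: int
    downloads: int

def rank_candidates(apps: list[AppListing], top_n: int = 10) -> list[AppListing]:
    """Rank apps on user-generated data and keep the top N for detailed review."""
    return sorted(
        apps,
        key=lambda a: (a.avg_user_rating * a.review_count, a.downloads),
        reverse=True,
    )[:top_n]

# Validation idea: compare App Rating Inventory scores (0-28) between rank tiers.
top_tier_scores = [22, 20, 19, 24]
lower_tier_scores = [12, 15, 9, 14]
f_stat, p_value = f_oneway(top_tier_scores, lower_tier_scores)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```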

Development of the App Rating Inventory

Initially, the research team comprised seven subject matter experts who identified key characteristics of high-quality clinical apps. These included empirical backing, educational content, interactive features, ease of use, and security considerations. This baseline list was refined over multiple iterations, with each version piloted and tested for clarity, operational definitions, and redundancy. The inventory was ultimately reduced from 40 to 28 items, focusing on core features that could be objectively assessed.

The development process incorporated extensive literature review, consultations with app evaluation authorities, and testing across various clinical domains. Key enhancements included the addition of privacy, peer support, and emerging technology features such as encryption and artificial intelligence, which are increasingly relevant for app safety and efficacy.

Results

Overview

The development process involved three rounds of testing to optimize the inventory’s reliability. Early tests showed low interrater reliability (around 0.48–0.50), but after item definitions were refined and additional rater training was provided, reliability improved to 0.62. A subsequent six-month pilot involving 96 apps across multiple health conditions yielded high interrater reliability (0.92–0.95), confirming the tool’s consistency.
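As an illustration of what an interrater reliability check might look like, the sketch below compares two raters' binary item judgments using Cohen's kappa. The choice of statistic and the sample ratings are assumptions for demonstration; the study's exact reliability metric is not detailed in this summary.

```python
# Example interrater agreement check on one app's 28 binary item ratings;
# Cohen's kappa is one common choice, not necessarily the study's statistic.
from sklearn.metrics import cohen_kappa_score

# 1 = feature judged present, 0 = absent; two raters, same app.
rater_a = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1,
           1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0]
rater_b = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1,
           1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
```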

Factor and commonality analyses identified redundancies and outliers, leading to a streamlined 28-item, three-category rating system. The categories cover evidence, content, and customizability, with each item scored on a binary basis—either the app possesses a feature or it does not—minimizing subjective judgment. The total score, summing the three categories, provides an overall measure of app quality.

Final Iteration of the App Rating Inventory

The finalized system evaluates each app based on 28 items, with equal weighting across categories. The scoring is straightforward: a feature’s presence earns one point; absence earns zero. A higher total score indicates a more comprehensive and potentially effective app. Clinicians can use these scores to inform their selections, considering individual app strengths and patient needs; for example, an app with high customizability may be favored for patients requiring tailored interventions.
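A minimal scoring sketch is shown below. The item names and the small number of items per category are placeholders for illustration; the real inventory distributes 28 items across evidence, content, and customizability.

```python
# Simplified scoring sketch; item names and category groupings are placeholders,
# not the inventory's actual 28 items.
def score_app(ratings: dict[str, dict[str, bool]]) -> dict[str, int]:
    """Sum binary item scores per category and overall (presence = 1, absence = 0)."""
    scores = {category: sum(items.values()) for category, items in ratings.items()}
    scores["total"] = sum(scores.values())
    return scores

example = {
    "evidence": {"peer_reviewed_support": True, "clinical_guideline_basis": False},
    "content": {"educational_material": True, "symptom_tracking": True},
    "customizability": {"reminders": True, "adjustable_goals": False},
}
print(score_app(example))
# {'evidence': 1, 'content': 2, 'customizability': 1, 'total': 4}
```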

The simplicity of binary scoring reduces the need for extensive training, allowing raters to complete assessments in approximately 15 to 40 minutes, depending on the app’s complexity. Although the inventory is primarily designed for mobile apps, its principles can guide evaluations of other digital health tools, such as web-based platforms or telehealth services.

Case Example

To demonstrate its application, the inventory was used to evaluate sleep-related apps. An initial search for sleep and insomnia apps yielded over 1,000 results, which were filtered based on criteria like cost, educational content, and interactivity, leading to a final selection of eight apps. These were rated using the inventory, with scores indicating their suitability for clinical use. Typically, apps meeting at least a 50% threshold (scoring positively on 14 or more items) are considered for integration into treatment plans or further research.
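The screening threshold can be expressed as a simple filter, as in the sketch below; the app names and scores are invented for illustration.

```python
# Applying the 50% screening threshold (14 of 28 items); names and scores invented.
candidate_scores = {"Sleep App A": 21, "Sleep App B": 17, "Sleep App C": 11}
THRESHOLD = 14  # at least half of the 28 items present

shortlist = [name for name, score in candidate_scores.items() if score >= THRESHOLD]
print(shortlist)  # ['Sleep App A', 'Sleep App B']
```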

Discussion

Lessons Learned

Over three years of practical use, the development team identified six key insights:

1. Popularity Reflects Quality: Apps with high download numbers and positive reviews tend to be better candidates, reflecting sustained user engagement. However, popularity alone does not guarantee evidence-based content.

2. Engaging, Dynamic Content Promotes Reuse: Features like reminders, adaptive content, and interactive elements foster ongoing engagement, which can improve treatment adherence.

3. Beware of the “Bait and Switch”: Many apps are marketed as free but contain hidden costs or locked features, emphasizing the importance of thorough evaluation before recommendation.

4. Market Saturation: The vast number of available apps means that initial environmental scans are time-consuming but necessary to find relevant, high-quality tools.

5. Content Accuracy Is Critical: Distribution platforms do not vet clinical content, risking the proliferation of inaccurate or harmful information. Rigorous vetting is essential before endorsement.

6. Apps Can Enhance Patient Care: When properly selected, apps can improve treatment fidelity, promote health literacy, and increase patient engagement, ultimately leading to better health outcomes.

Key Considerations

Deciding whether to recommend an app requires a comprehensive approach, balancing clinical judgment with objective evaluation scores. Factors such as patient engagement, technological literacy, accessibility, cost, and data security must be weighed. While scoring systems like the App Rating Inventory offer valuable guidance, they complement rather than replace clinical expertise.

A multi-step vetting process—including literature review, app store searches, social media analysis, pilot testing, and patient feedback—can optimize app selection. However, this process must be efficient enough to be feasible in busy clinical practice. The development of a centralized certification system or a trusted app clearinghouse could further streamline this process, but such systems are still evolving.

Clinicians should also consider that apps are dynamic; frequent updates and modifications may alter their features and effectiveness. Therefore, periodic re-evaluation is advisable, and app content should be monitored to ensure ongoing safety and relevance.

Acknowledgments

The authors express gratitude to colleagues who contributed to the development and validation of the App Rating Inventory, including Shaunesy Walden-Behrens, MPH and MBA; Danielle Sager, MPH and MHIIM; Renee Cavanagh, PsyD; Sarah Stewart, PhD; Christina Armstrong, PhD; Julie Kinn, PhD; David Bradshaw, PhD; and Sarah Avery-Leaf, PhD.

Conclusion

This structured, objective approach to app evaluation ensures that healthcare providers can confidently incorporate digital tools into patient care, ultimately enhancing outcomes and safety.