Introduction
Software testing is evolving rapidly alongside AI, automation, DevOps, and cybersecurity. As AI applications become more advanced, traditional testing methods must adapt to tackle challenges unique to AI, including unpredictability, explainability, bias, and continuous learning. Testing AI applications therefore combines traditional software testing methods with AI-specific techniques that address bias, explainability, security, and performance. As organizations adopt AI applications more widely, testing practices must evolve so that AI systems become more trustworthy, ethical, and efficient. Before we turn to our case study, it is important to understand general practices that apply to most AI systems. When testing AI-driven solutions, we focus on validating:
- Domain-Specific and Non-Domain Prompts: Testing is performed with a mix of highly technical, industry-specific queries and general, everyday questions. The purpose is to verify that the system distinguishes relevant questions from irrelevant ones.
- Relevant and Ambiguous Queries: AI often struggles when the input is incomplete or unclear. We test how well the system interprets vague or partially formed queries, and whether it asks for clarification instead of guessing.
- Accessibility Testing: AI-enabled apps should be checked against accessibility standards, including screen reader compatibility and keyboard navigation, to accommodate users with disabilities.
- Performance Testing: AI apps should be monitored for response time, scalability, and reliability under different loads to ensure optimal performance during peak usage.
- Consistency Checks: The same prompt presented in multiple sessions should return consistent outputs, and responses should remain coherent across multiple interactions (a minimal sketch follows this list).
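To make the consistency check concrete, the Python sketch below sends the same prompt in several fresh sessions and compares the answers for similarity. The ask_chatbot callable, session handling, and similarity threshold are illustrative assumptions, not part of any specific chatbot API.

```python
# Minimal consistency-check sketch. ask_chatbot(prompt, session_id) is a hypothetical
# wrapper around whatever chatbot API is under test.
from difflib import SequenceMatcher


def check_consistency(ask_chatbot, prompt, runs=5, min_similarity=0.8):
    """Send the same prompt in separate sessions and flag divergent answers."""
    responses = [ask_chatbot(prompt, session_id=i) for i in range(runs)]
    baseline = responses[0]
    results = []
    for run, resp in enumerate(responses[1:], start=1):
        similarity = SequenceMatcher(None, baseline, resp).ratio()
        results.append((run, similarity, similarity >= min_similarity))
    return results


if __name__ == "__main__":
    # Stand-in for the real chatbot call; replace with the actual client.
    def fake_chatbot(prompt, session_id):
        return "Check the power supply connections, then restart the unit."

    for run, score, ok in check_consistency(fake_chatbot, "Infusion pump shows error E-42"):
        print(f"run {run}: similarity={score:.2f} {'consistent' if ok else 'DIVERGENT'}")
```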
Case Study:
Ensuring Reliability in AI: Testing a GenAI Chatbot for Medical Equipment Troubleshooting
The rapid advancement of Generative AI (GenAI) has unlocked transformative potential across industries, but with great power comes great responsibility. In mission-critical domains such as healthcare and medical equipment maintenance, ensuring the reliability and accuracy of AI-driven applications is paramount. Here at Origin, we have successfully tested and validated several GenAI applications, and in this article, we share insights based on our practical experience.
Testing GenAI applications has given us excellent insight into how to validate their performance and handle the challenges they present. In this case, we tested an AI-based chatbot intended to support technicians in diagnosing and troubleshooting medical equipment problems.
Testing Approach
Because we were confined to manual testing, we focused on a range of scenarios to evaluate the chatbot's consistency, accuracy, and reliability. Here is how we set up our testing:
- Domain-Specific Functional Testing: Testing was performed with prompts to troubleshoot different types of medical equipment, validating that the step-by-step instructions provided matched the service manual.
- Accuracy Against Reference Data: We validated chatbot answers against service manuals and historical work orders, confirming that responses referenced real documents with no hallucinated part numbers, IDs, or manuals (see the grounding-check sketch after this list).
- Safety-Critical Response Testing: We verified that the bot did not suggest dangerous shortcuts and that its instructions included safety pre-checks such as isolating equipment and disconnecting power.
- Handling Ambiguous or Incomplete Technical Inputs: We validated that the bot asked clarifying questions instead of guessing and did not provide inaccurate steps when details were missing.
- Context Retention in Technical Conversations: We validated that the bot remembered previous steps in a multi-turn troubleshooting session and could continue where the technician left off (see the context-retention sketch after this list).
- Integration With Knowledge Sources: We verified that historical work orders were pulled correctly, checked parts inventory to confirm the bot suggested replacement parts still in stock, and confirmed correct retrieval of technical manuals.
- Integration Testing: Given that the chatbot was pulling answers from historical work orders and equipment manuals, we performed integration testing to confirm that data retrieval was accurate and queried in the right context.
- User Acceptance Testing (UAT): We invited end users, including technicians and engineers, to validate the usability, functionality, and effectiveness of the chatbot in real-world scenarios.
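As a concrete illustration of the "Accuracy Against Reference Data" item above, the sketch below checks that every work order ID or part number a response cites actually exists in the reference data. The ID formats and lookup sets are hypothetical; in practice they would be populated from the real work-order system and parts inventory.

```python
# Sketch of a reference-grounding check: every work order ID or part number the
# chatbot cites must exist in known reference data. The ID formats (WO-#####,
# PN-#####) and the reference sets below are illustrative assumptions.
import re

KNOWN_WORK_ORDERS = {"WO-10234", "WO-10987"}   # loaded from the work-order system in practice
KNOWN_PART_NUMBERS = {"PN-55821", "PN-60477"}  # loaded from the parts inventory in practice

WO_PATTERN = re.compile(r"\bWO-\d{5}\b")
PN_PATTERN = re.compile(r"\bPN-\d{5}\b")


def find_hallucinated_references(response_text):
    """Return any cited work orders or part numbers that are not in the reference data."""
    cited_wos = set(WO_PATTERN.findall(response_text))
    cited_pns = set(PN_PATTERN.findall(response_text))
    return {
        "unknown_work_orders": sorted(cited_wos - KNOWN_WORK_ORDERS),
        "unknown_part_numbers": sorted(cited_pns - KNOWN_PART_NUMBERS),
    }


if __name__ == "__main__":
    answer = "Per WO-10234, replace the pump seal (PN-99999) before recalibrating."
    print(find_hallucinated_references(answer))
    # -> {'unknown_work_orders': [], 'unknown_part_numbers': ['PN-99999']}
```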
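Similarly, a minimal context-retention check can script a short multi-turn conversation and confirm the final answer builds on earlier turns. The session wrapper, prompts, and expected behaviour below are illustrative assumptions, not the actual chatbot API.

```python
# Sketch of a multi-turn context-retention check. make_session() is assumed to return
# an object with a send(text) -> str method wrapping the real conversation API.
def check_context_retention(make_session):
    session = make_session()
    session.send("The infusion pump shows error E-42 after power-up.")
    session.send("I have already replaced the power supply.")
    reply = session.send("What should I check next?").lower()

    # The bot should build on earlier turns instead of restarting from scratch,
    # e.g. not re-suggest the step the technician said was already done.
    re_suggested_done_step = "replace the power supply" in reply
    references_error_code = "e-42" in reply or "error" in reply
    return (not re_suggested_done_step) and references_error_code


if __name__ == "__main__":
    class FakeSession:
        """Stand-in session object; replace with the real chatbot client."""
        def __init__(self):
            self.history = []

        def send(self, text):
            self.history.append(text)
            return "Since the power supply is new, check the E-42 sensor cable next."

    print(check_context_retention(FakeSession))  # -> True
```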
Challenges Faced
While the chatbot showed enormous potential, testing it surfaced several key challenges, many of which stem from the inherently complex task of validating and deploying GenAI applications:
- Inconsistent responses for the same prompts: In some cases, the chatbot generated varying answers for the same input, making it difficult to rely on its responses in real-world troubleshooting scenarios.
- Hallucination Issues: The chatbot sometimes invented work order IDs or referred to documents that did not exist, which raised red flags about misinformation and undermined trust.
- Contextual Misinterpretations: The AI generally understood the structure of the questions posed to it but occasionally failed to handle context switching or to recognize the technician's intent when a question was phrased differently.
- Bias and Other Ethical Issues: Because of limitations in the datasets used, the AI models were sometimes biased, favouring one troubleshooting action over another or making incorrect assumptions about user intent in their responses.
- Scalability and Latency Issues: During high-demand scenarios, the chatbot occasionally exhibited slower response times, impacting technician workflow efficiency.
- Poor Handling of Edge Cases: The AI often failed when faced with unusual or complex failures that prior work orders had not documented sufficiently, providing incomplete or generally unhelpful answers.
Solutions Implemented
To address these challenges, we applied several targeted solutions:
- Model Fine-tuning: We provided more training data and reinforced learning with real-world technician input to further enhance response consistency.
- Hallucination Mitigation: Implementing stricter data validation mechanisms reduced the frequency of fabricated responses, ensuring higher accuracy.
- Bias Reduction Measures: Diversifying our training datasets and employing post-processing measures made the responses less biased and increased fairness.
- Performance Enhancement: Backend architecture and response caching improvements were made to reduce latency during peak loads (a simple caching sketch follows this list).
- Enhanced Error Handling: Implementing fallback mechanisms meant that ambiguous or unanswerable queries could be redirected to human support, preserving user trust (see the escalation sketch after this list).
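A minimal sketch of the response-caching idea, assuming repeated prompts can be served from a short-lived in-memory cache; the normalization, TTL, and generate_answer hook are placeholders for the real implementation.

```python
# Sketch of response caching: repeated prompts within a short window are served from
# an in-memory cache instead of re-querying the model. All names here are illustrative.
import time

_CACHE = {}               # normalized prompt -> (timestamp, answer)
CACHE_TTL_SECONDS = 300


def cached_answer(prompt, generate_answer):
    """Return a cached answer for recently seen prompts, otherwise call the model."""
    key = " ".join(prompt.lower().split())  # trivial normalization for illustration
    now = time.time()
    if key in _CACHE and now - _CACHE[key][0] < CACHE_TTL_SECONDS:
        return _CACHE[key][1]
    answer = generate_answer(prompt)
    _CACHE[key] = (now, answer)
    return answer


if __name__ == "__main__":
    def fake_generate(prompt):
        return "Inspect the E-42 sensor cable."  # stand-in for the real model call

    print(cached_answer("Infusion pump error E-42", fake_generate))    # model call
    print(cached_answer("infusion  pump error  e-42", fake_generate))  # served from cache
```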
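And a sketch of the escalation idea behind the enhanced error handling: when the model's confidence is low or the query looks ambiguous, the answer is withheld and the request is routed to human support. The confidence score, threshold, and escalate_to_human hook are illustrative assumptions.

```python
# Sketch of a fallback/escalation mechanism. The ambiguity cues, confidence threshold,
# and escalate_to_human hook are illustrative, not the production implementation.
AMBIGUITY_CUES = ("not sure", "some error", "it just stopped")
CONFIDENCE_THRESHOLD = 0.7


def route_response(query, model_answer, confidence, escalate_to_human):
    """Return the chatbot answer, or hand off to human support when unsure."""
    looks_ambiguous = any(cue in query.lower() for cue in AMBIGUITY_CUES)
    if confidence < CONFIDENCE_THRESHOLD or looks_ambiguous:
        ticket_id = escalate_to_human(query)
        return f"I have forwarded this to a support engineer (ticket {ticket_id})."
    return model_answer


if __name__ == "__main__":
    def fake_escalate(query):
        return "TCK-001"  # stand-in for creating a real support ticket

    print(route_response("It just stopped, not sure why", "Try rebooting.", 0.4, fake_escalate))
```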
Conclusion
Testing AI-driven applications, especially in specialized fields like medical equipment maintenance, is crucial to ensuring their effectiveness and reliability. Our rigorous manual testing approach, combined with User Acceptance Testing, Accessibility Testing, and Performance Testing, provided valuable insights into the chatbot’s strengths and areas for improvement. As AI continues to evolve, robust testing methodologies will remain essential in mitigating risks and enhancing user trust in these powerful technologies.
At Origin, we specialize in testing GenAI applications to ensure they are robust, ethical, and scalable. As AI evolves, our testing methodologies continue to adapt, ensuring that businesses can deploy AI solutions with confidence. If you're developing a GenAI application, let’s connect to discuss how we can help validate and optimize your AI models.

