Your Model Insights
Real-Time Insights into your Model's Performance on a Per Run Basis
Trust should be the biggest concern with AI/ML. In Appsurify's case: how can we TRUST the ML model to select the right tests given developer changes?
Welcome to Your Model Insights page with Real-Time Confidence Curve
Whether you are already running Appsurify or still strengthening your Model, Your Model Insights allows any user to SIMULATE what would be chosen to run versus not chosen to run on a Per Run Basis. It works in both Learning and Active Mode. Note: Your Model Insights will only Build / Simulate once your Model is Trained (see the Dashboard for the Maturity Wheel).
Users can select a Percentage Test Selection, such as 10, 20, 30, ... 90% of the Tests in the Testsuite, and Appsurify will run the trained Model against the latest 20 test runs to display its current performance with that chosen Percentage Subset.
How the Model performs on an Individual Run Basis is shown lower on the page.
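As a rough illustration only (not Appsurify's actual implementation), the sketch below shows how an average "Regressions Caught Early" figure across the latest 20 runs could be computed from hypothetical per-run counts of Found vs. Deferred Regressions for a chosen Percentage Test Selection:

```python
# Illustrative sketch only -- not Appsurify's actual implementation.
# RunResult and its per-run counts are hypothetical data for a chosen
# Percentage Test Selection.
from dataclasses import dataclass

@dataclass
class RunResult:
    found_regressions: int      # real failures caught inside the smart subset
    deferred_regressions: int   # failures not selected (or flaky/broken/not run)

def average_caught_early(latest_runs: list[RunResult]) -> float:
    """Average % of regressions caught early across the latest runs."""
    per_run = []
    for run in latest_runs:
        total = run.found_regressions + run.deferred_regressions
        if total == 0:
            continue  # runs with no regressions do not affect the average
        per_run.append(run.found_regressions / total)
    return 100.0 * sum(per_run) / len(per_run) if per_run else 100.0

# Example: 20 recent runs, most regressions caught in the chosen subset
runs = [RunResult(found_regressions=19, deferred_regressions=1)] * 20
print(f"{average_caught_early(runs):.0f}% of regressions caught early")  # 95%
```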
Example:
In the picture above, the user has chosen "20%" from the Percentage Test Selection dropdown.
Appsurify will now run the Model against the most recent 20 Test Runs and display that, on average, it caught 95% of Regressions Early in its Smart Test Selection Subset.
Transparency of results is everything, both to Quality Teams and to the Appsurify Team, and Appsurify's Model Insights page displays an unbiased view into the performance of the AI Model working for any Team.
Further Example:
In the picture above, on the same project, the user has chosen "30%" from the Percentage Test Selection dropdown.
Appsurify's Model is run against the most recent 20 Test Runs and displays that, on average, it caught 97% of Regressions Early in its Smart Test Selection Subset.
If you find that the Model is not yet strong enough for your standards, that's easy: simply keep or update the command to "ALL TESTS" and Appsurify's Model will continue to strengthen with each test run.
The Model will recalibrate on a rolling basis every 25 test runs.
The Model Insights page has been designed for teams to see how their Model is performing at any given time and to allow informed decisions on when and how to implement the model for best results based on Test Strategy and Risk Tolerance.
Low Risk Tolerance:

- Perhaps let the Model train for a longer period of time, then start with a conservative test selection subset, such as 40% or 50%.
- You still halve your test execution time and CI builds for faster feedback and resource savings.

Medium Risk Tolerance:

- Start with a moderate subset around 30%.
- Optimize your test runs by roughly 70% while catching the vast majority of regressions.

High Risk Tolerance:

- Start with an aggressive subset around 20% or steeper.
- Optimize your test runs by 80-90%+ for RAPID test feedback (see the sketch after this list).
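As a back-of-the-envelope illustration only, the sketch below estimates execution-time savings for the subset sizes above. It assumes, as a simplification, that run time scales roughly linearly with the number of tests selected, and the 120-minute full-suite duration is hypothetical:

```python
# Rough illustration only: assumes execution time scales linearly with
# the number of tests selected, which is an approximation.
FULL_SUITE_MINUTES = 120  # hypothetical full-run duration

for label, subset_pct in [("Low risk (50%)", 50),
                          ("Medium risk (30%)", 30),
                          ("High risk (20%)", 20)]:
    smart_run_minutes = FULL_SUITE_MINUTES * subset_pct / 100
    saved_pct = 100 - subset_pct
    print(f"{label}: ~{smart_run_minutes:.0f} min per run, ~{saved_pct}% time saved")
```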
Transparency of results goes down to the most granular level: individual Test Runs.
Each run is dynamically updated per the user's Test Selection Criteria to show which tests would be Optimized or Prioritized, and which regressions would be Found or Deferred.
Below, Appsurify caught the Bug by only running 10% of Tests! 🚀🚀🚀
See the overview of the latest 20 runs, with the option to dive into each run as it is Dynamically Simulated to display the Performance of your Model on that Test Run.
The Model isn't designed to run every Failed Test.
The AI Model is designed to catch as many defects as possible while running as few tests as possible.
For example, if 1 bug causes 10 Tests to Fail, the AI-Model only needs to pick up 1 of those 10 Failed Tests to raise the underlying Defect. That leaves room in the smart subset selection to find other underlying bugs efficiently. This is explained in more detail in Smart Test Selection Explained.
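A conceptual sketch of this idea, using a hypothetical mapping of failed tests to underlying defects (illustrative names only, not real Appsurify data or APIs):

```python
# Conceptual sketch of the "one failed test per defect is enough" idea.
# The mapping of failed tests to underlying defects is hypothetical data.
failed_tests_by_defect = {
    "DEFECT-1": ["test_login_a", "test_login_b", "test_login_c",
                 "test_session_1", "test_session_2", "test_session_3",
                 "test_auth_w", "test_auth_x", "test_auth_y", "test_auth_z"],
    "DEFECT-2": ["test_checkout_total"],
}

# Selecting any single failing test per defect is enough to raise it,
# leaving room in the smart subset for tests covering other risky areas.
minimal_tests_to_raise_all_defects = [tests[0] for tests in failed_tests_by_defect.values()]
print(minimal_tests_to_raise_all_defects)                      # 2 tests surface both defects
print(sum(len(t) for t in failed_tests_by_defect.values()))    # 11 failing tests in total
```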
Additionally, the AI Model is trained to catch real bugs caused by Real Test Failures and to cut through the noise of Flaky Tests. So if you have a high degree of Flakiness in your testsuite, it may appear that the Model isn't running Failed Tests when in fact those tests aren't real Failures, they are Flaky. These Flaky Tests are lumped into the "Deferred Regression" category, and in a subsequent test run the user will see that these tests likely passed, as they are indeed Flaky.
The AI Model is designed to give Developers and Testers clean signals on their Builds and Test Runs, to raise Defects as quickly as possible, and to avoid Flaky Tests distracting the team with failures that are not real.
Below, the AI Model successfully Failed the Build early.
In this run, there were 5 newly introduced defects that caused 97 tests to fail. The AI Model caught all 5 bugs through smart test selection, only needing to run 18 of the Failed Tests that correspond to the 5 newly introduced bugs.
The other 79 failed tests were not executed because the AI Model had already caught Failed Tests that raised the 5 underlying defects in its smart test selection and Successfully Failed the Build/Run early.
Catching the bugs while running fewer tests allows the AI Model to catch more bugs when a larger number of real defects are introduced in a developer change.
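The arithmetic for the run described above, written out as a small illustrative snippet (the numbers come from the example; this is not Appsurify code):

```python
# Worked numbers from the run described above (illustrative arithmetic only).
total_failed_tests = 97   # failures caused by the 5 newly introduced defects
selected_failures  = 18   # failed tests run inside the smart subset
defects_raised     = 5    # all underlying defects surfaced early

deferred_failures = total_failed_tests - selected_failures
print(f"Deferred failed tests: {deferred_failures}")        # 79
print(f"Defects caught early: {defects_raised} of 5")       # all 5 -> build failed early
```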
For more information on Failed Tests that were not selected to be run in the Smart Subset and Why, please see Smart Test Selection Explained.
Every project is different, and so each Your Model Insights page may look different. Here are some examples:
| Term | Definition |
| --- | --- |
| Tests Optimized | Tests the AI chose NOT to run, for efficiency and because they are not relevant to recent Developer changes based on the incoming Commit Data. |
| Tests Prioritized | Tests chosen to be run in the Smart Subset based on recent Developer activity. |
| Found Regressions | REAL Regressions / Failures / Bugs found in the Smart Subset. |
| Deferred Regressions | Regressions / Failures / Bugs not yet found in the Smart Subset; they are subsequently picked up in the next Smart Subset or the next Full Run / Catch All. Why are there sometimes more Deferred Regressions than Found Regressions? The AI Model is designed to catch as many defects in as few tests as possible. For example, if 1 bug causes 10 Tests to Fail, the AI Model only needs to pick up 1 of those 10 Failed Tests to raise the underlying Defect. That leaves room in the smart subset to find other underlying bugs efficiently. This is explained in more detail in Smart Test Selection Explained. Additionally, Tests that are Broken, Flaky, or Not Run are lumped into the "Deferred Regressions" column. |
| Number of Failures | Number of Test Runs | Model Display |
| --- | --- | --- |
| High | High (5+ per day) | Your Model Curve will be round and full. |
| High | Low (fewer than 5 per day) | Your Model Curve will take longer to build, as there is a low flow of test runs and data. |
| Low | High | Your Model Curve will be subject to swings, as there aren't many Failures to build the rounded Curve. It will look sharper, and each Failure has a bigger impact on the Average Regressions Found. |
| Low | Low | Your Model will take longer to train and your Model Confidence Curve will be subject to swings. Because we average across the latest 20 Runs, even 1 Deferred Regression is enough to alter the Confidence Curve dramatically (even when Appsurify catches the underlying defect and fails the build early!). |