
Traditional software testing has been, in the image of the software itself, deterministic. From Wikipedia:
In mathematics, computer science and physics, a deterministic system is a system in which no randomness is involved in the development of future states of the system. A deterministic model will thus always produce the same output from a given starting condition or initial state.
Unit tests are an excellent example of deterministic testing - we create a test fixture to place the system into a predetermined state (Arrange), execute some code (Act), and check that it behaved as expected (Assert). If we haven’t made any changes to the code since the last time the tests were executed, we can rely on the tests giving the same result. We rely heavily on this, as does Salesforce, using successful test execution as a quality gate to control whether code can continue through a deployment pipeline (us) and whether it can be deployed to production (Salesforce). When we write code we want it to do the same thing, every time, forever, and we have the same expectation of our tests.
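To make that concrete, here’s a minimal Apex sketch of the Arrange/Act/Assert pattern - DiscountCalculator is a made-up example class, shown alongside its test for brevity rather than anything from a real codebase:

```apex
// Hypothetical class under test - in practice this lives in its own file.
public class DiscountCalculator {
    public static Decimal apply(Decimal amount, Decimal percent) {
        return amount - (amount * percent / 100);
    }
}

@IsTest
private class DiscountCalculatorTest {
    @IsTest
    static void appliesTenPercentDiscount() {
        // Arrange - put the system into a predetermined state
        Decimal amount = 1000;

        // Act - execute the code under test
        Decimal discounted = DiscountCalculator.apply(amount, 10);

        // Assert - the same starting state gives the same result, every run
        Assert.areEqual(900, discounted.intValue());
    }
}
```

Run it a thousand times and it gives the same answer a thousand times - that’s the property we’re about to lose.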
Artificial Intelligence changes how we must think about testing, as applications involving AI are probabilistic. This is particularly the case in Agentic AI applications like Agentforce - randomness in training, combined with reasoning engines, means that Agents continuously learn and refine their strategies rather than following a strict set of pre-defined rules.
Creating the same test fixture and executing the same test won’t give the same results every time, so our approach to asserting correctness has to change to reflect this. Instead of verifying that a specific record has been updated, or that a specific field exists, we have to look at the Agent’s response to a request and decide whether it is reasonable or not. Or, more likely, another Large Language Model will do this for us. This can be seen in the example Agentforce Testing Center CSV file, where the Expected Response column contains entries such as:
summary of Account details are shown
which, I think we can all agree, is discomfortingly vague for automated testing.
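For illustration, a row in a file like that might look something like this - the column names and values are my own sketch of the shape of the thing, not the exact Testing Center format, so check the current documentation before building your own:

```
Utterance,Expected Topic,Expected Actions,Expected Response
"Tell me about the Acme account",Account Management,Get Account Details,summary of Account details are shown
```

The Expected Topic and Expected Actions columns will come up again shortly, as they’re where the “how” of the Agent’s behaviour gets pinned down.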
We also need to take a different approach to evaluating the outcome of a test run. In deterministic testing, nothing less than a 100% success rate is acceptable - failures are ruthlessly hunted down and exterminated - but for AI agents that isn’t necessarily the case. A zero error rate is a no-brainer for an Agent that carries out financial transactions in response to detailed customer instructions, but for a customer support Agent responding to vague or inaccurate free text requests, an 80% success rate may be acceptable. The exact figures obviously depend on the nature of the business, appetite for risk, compliance and regulatory requirements, and so on, but a non-zero failure rate will be reasonable a lot of the time.
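Encoded as a quality gate, the difference is just a threshold that no longer has to be 1.0 - here’s a trivial Apex sketch, with made-up names, of what that decision might look like:

```apex
// Illustrative only - a run-level gate that tolerates a configurable failure rate.
public class AgentTestGate {
    // threshold is the minimum acceptable pass rate, e.g. 1.0 for a
    // transactional Agent, 0.8 for a free-text customer support Agent
    public static Boolean runPasses(Integer passed, Integer total, Decimal threshold) {
        return total > 0 && (passed * 1.0 / total) >= threshold;
    }
}
```

The hard part isn’t the code, of course - it’s agreeing what the threshold should be for each Agent.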
Another area where AI Agent testing differs from traditional testing is observation. When unit testing software we typically concern ourselves with what happened rather than how it happened; we check that the outcome was as expected, but the software is free to achieve this outcome however it likes. Avoiding tying unit tests to the implementation in this way is one of unit testing’s biggest benefits, allowing us to refactor at will to change the implementation, safe in the knowledge that if we inadvertently broke something and changed the external behaviour, the tests will catch it.
When testing Agentforce we take the opposite approach - we specify how we expect the Agent to handle the request we make, by defining the Topics and Actions it should choose, then after executing the request we verify that the Agent did indeed choose the ones we expected. The response may be reasonable, but if the Agent didn’t arrive at it using the route we expected, the test is a failure.
Accepting a success rate of less than 100% and verifying how rather than what require a mindset shift for developers, especially those (like me) who are big fans of unit tests. It’s a shift that has to happen though, as Agents need just as much (if not more) testing than other software before being unleashed on an unsuspecting customer base. Salesforce also doesn’t mandate any test “coverage” of an Agent before it can be deployed to production, so it’s all on us. We must take this seriously, as there’s already plenty of risk around things like hallucinations and data leakage without adding to it by chucking a poorly-tested Agent over the fence and hoping for the best.
One final point: while we can’t use deterministic techniques to test Agents or Prompt Template actions, we can and should use them to test Flow and Apex custom actions. Just because they are being used in a probabilistic Agent doesn’t mean we can skip the rigorous testing applied elsewhere.
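An Apex custom action is just an invocable method, after all, and its test is as deterministic as any other. A hedged sketch, with GetAccountSummary as an invented example action:

```apex
// Hypothetical custom action - an invocable method an Agent could call.
public with sharing class GetAccountSummary {
    @InvocableMethod(label='Get Account Summary')
    public static List<String> getSummary(List<Id> accountIds) {
        List<String> summaries = new List<String>();
        for (Account acc : [SELECT Name, Industry FROM Account WHERE Id IN :accountIds]) {
            summaries.add(acc.Name + ' (' + acc.Industry + ')');
        }
        return summaries;
    }
}

@IsTest
private class GetAccountSummaryTest {
    @IsTest
    static void returnsNameAndIndustry() {
        // Arrange
        Account acc = new Account(Name = 'Acme', Industry = 'Manufacturing');
        insert acc;

        // Act
        List<String> results = GetAccountSummary.getSummary(new List<Id>{ acc.Id });

        // Assert - no LLM judge required
        Assert.areEqual(1, results.size());
        Assert.areEqual('Acme (Manufacturing)', results[0]);
    }
}
```

The Agent wrapped around this action remains probabilistic, but the action itself has no excuse for being anything other than rock solid.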
Strong yes (and no marketing) until 1:59 - https://www.youtube.com/watch?v=uGUP5sXwDq0&t=4s :)