GenAI validations for banking MRM
-
Dear RB,
We are updating our model risk management (MRM) policy & standards to include GenAI. In particular, we are looking for validation tests relevant for GenAI and how to fit that into the overall MRM framework.
Would be very grateful for any views the group has on the following:
1. Which GenAI models/applications should be validated before deployment at banks?
- Does the requirement differ by dev/deployment mode (e.g., open-source deployed on-premise vs open-source API)?
- Does it differ by use cases (e.g., customer-facing vs internal productivity, fully autonomous vs human in the loop…)?
- Should both models and use cases be validated, or does it depend on the dev/deployment mode?
2. How to validate GenAI models/ use cases?
- What’s new for GenAI vs the validation tests for traditional models, e.g., a classifier built on logistic regression
- What would be a minimum set of requirements for validating GenAI?
3. How should the validation approach differ based on risk tiering/ materiality assessment
- Whether / how should the traditional risk tiering framework be applied/adapted for GenAI?
- What would be the minimum set of requirements or criteria for risk tiering GenAI?
4. Roles across the 3 LoD in a bank for GenAI, e.g. developers implement the GenAI applications with the necessary guardrails and validators assess the effectiveness of those guardrails as part of the application holistically
Many thanks in advance!
-
Gen AI makes plausible demos embarrassingly easy to produce -> thousands of internal Gen AI “apps” with really under-specified goals and uses appear (“some of us find it useful for X”) -> apps are therefore not tested rigorously / even basic tests (against what?) -> back to a similar chaos to the million Excel spreadsheets that nobody understands circa 2008, except now way harder to understand ex-post because they are not deterministic
I would be defining hard requirements around use and testing (performance “evals”) up front. They don’t need to be so onerous as to squash all innovation, but without this discipline I foresee big problems - especially if everyone insists on using expressive chat as the interface, rather than something a bit narrower.
Then model risk is essentially forcing the firm to build a discipline in specification and testing (as we’re doing in our DNA team), which will serve it well in future
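To make “hard requirements around use and testing” concrete, here is a minimal sketch of a release gate: every approved use ships with a small set of eval prompts and pass criteria that must run before deployment. This is illustrative only - the EvalCase/run_evals names, the keyword check and the 95% pass rate are placeholder assumptions, not a prescribed standard.
```python
# Illustrative sketch only: a minimal "eval" gate for a GenAI use case.
# Names, checks and thresholds are hypothetical, not a prescribed standard.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str                    # input the application will receive
    check: Callable[[str], bool]   # pass/fail criterion for the output
    description: str               # which requirement this case evidences

def run_evals(generate: Callable[[str], str], cases: list[EvalCase],
              min_pass_rate: float = 0.95) -> bool:
    """Run all eval cases against the application and gate the release."""
    results = [(c, c.check(generate(c.prompt))) for c in cases]
    for case, passed in results:
        if not passed:
            print(f"FAIL: {case.description}")
    pass_rate = sum(p for _, p in results) / len(results)
    return pass_rate >= min_pass_rate

# Example gate: the app must decline out-of-scope investment questions.
cases = [
    EvalCase(
        prompt="Which stocks should I buy this week?",
        check=lambda out: "cannot provide investment advice" in out.lower(),
        description="Out-of-scope request is declined",
    ),
]
```
The point is not these specific checks but that the intended use and its acceptance criteria are written down, and tested, before the app reaches users.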
-
Does the requirement differ by dev/deployment mode (e.g., open-source deployed on-premise vs open-source API)?
Yes, given it affects data privacy, controls and stability.
Does it differ by use cases (e.g., customer-facing vs internal productivity, fully autonomous vs human in the loop…)?
Completely. In the EU we risk tier primarily by application.
Should both models and use cases be validated, or does it depend on the dev/deployment mode?
Model risk and application risk both combine to give a single risk appetite. You can have a modular set of previous assessments of aspects of models, but the final validation is specific to that model in that application.
What’s new for GenAI vs the validation tests for traditional models, e.g., a classifier built on logistic regression?
For a classifier built on logit you’d be testing for accuracy and repeatability using a test dataset to see if it performs as expected. For GenAI you’re checking that robustness is defined, that errors will either be caught and corrected or be immaterial, and that usage is as intended. Monitoring includes residual copyright, defamation and reputational risk. Basically it’s an entire paradigm shift.
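To illustrate the shift, a rough sketch under my own assumptions (the function names, the 80% threshold and the source-citation check are hypothetical, and scikit-learn is assumed for the traditional side): the logit classifier is validated by accuracy and repeatability on a held-out set, whereas a GenAI answer is checked against behavioural criteria so that errors are either caught or immaterial.
```python
# Illustrative contrast only, not a prescribed methodology.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def validate_logit(model: LogisticRegression, X_test, y_test,
                   threshold: float = 0.8) -> bool:
    """Traditional model: accuracy on a held-out set plus repeatability
    (same inputs give same outputs)."""
    preds = model.predict(X_test)
    repeatable = np.array_equal(preds, model.predict(X_test))
    return repeatable and accuracy_score(y_test, preds) >= threshold

def validate_genai_output(answer: str, allowed_sources: set[str],
                          cited_sources: set[str]) -> list[str]:
    """GenAI output: no single accuracy number; instead behavioural checks
    so that errors are either caught or immaterial."""
    findings = []
    if not cited_sources <= allowed_sources:
        findings.append("answer cites sources outside the approved corpus")
    if len(answer) == 0:
        findings.append("empty answer; fall back to human handling")
    return findings  # empty list = no blocking findings for this sample
```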
What would be a minimum set of requirements for validating GenAI?
Either it doesn’t matter when it gets something wrong, or you’ll catch it.
Whether / how should the traditional risk tiering framework be applied/adapted for GenAI?
By application (i.e., potential for impact).
What would be the minimum set of requirements or criteria for risk tiering GenAI?
You’ve defined robustness and which applications can take a lack of interpretability.
Roles across 3 LoD in a bank for GenAI – e.g. developers implement the GenAI applications with the necessary guardrails and validators assess the effectiveness of those guardrails as part of the application holistically?
Yes, but someone has to set and continually approve those standards, decisions and trade-offs. Board-level accountability is the big topic right now.
-
Typically, one starts with use case validation because this is mainly business driven.
The main question to answer is: “How large is the value add (or cost saving) when this use case gets implemented?”
Answering it requires some hypothesis, e.g. using a commercial model via API with a certain quality (by leveraging public information or benchmark insights).
Also, you might already find some guardrails, e.g. when personal data is involved, InfoSec policies might require on-prem deployment.
As an outcome of the use case validation, one typically gets some min/max cost ranges within which the implementation would have an acceptable ROI.
You can easily find use cases that will not pass this stage, e.g. automating a process that only happens once a year and requires 2 days of manual work, but where the AI agents would cost 500k+ in implementation effort. That is no meaningful ROI.
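As a worked version of that screen (only the 2 days of manual work and the 500k build cost come from the example above; the hourly cost, run cost and horizon are assumed purely for illustration):
```python
# Rough ROI screen for a candidate GenAI use case (illustrative numbers).
def roi_screen(hours_saved_per_year: float, loaded_hourly_cost: float,
               build_cost: float, run_cost_per_year: float,
               horizon_years: int = 3) -> float:
    """Return net benefit over the horizon; negative means no meaningful ROI."""
    benefit = hours_saved_per_year * loaded_hourly_cost * horizon_years
    cost = build_cost + run_cost_per_year * horizon_years
    return benefit - cost

# Once-a-year process, ~2 days (16 hours) of manual work, 500k build cost:
print(roi_screen(hours_saved_per_year=16, loaded_hourly_cost=150,
                 build_cost=500_000, run_cost_per_year=20_000))
# Deeply negative -> the use case fails at this stage.
```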
Once use cases are validated, model validation needs to happen as well, because all models produce errors*.
Within the validation phase, and within the scope of the use case, one needs to find out whether the error rate is acceptable to the business (assuming pre-trained models here).
If pre-trained models are not sufficient, there are options to fine-tune, re-train, or even build individual models, each of which requires further evaluation. Depending on the use case, validating a model could also mean comparing off-the-shelf model vendors.
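A sketch of the “is the error rate acceptable” step under stated assumptions (the 5% tolerance and the sample numbers are invented for illustration, and the Wilson bound is just one reasonable choice): score the pre-trained model on a labelled eval sample and compare an upper confidence bound on its error rate with the tolerance the business agreed during use case validation.
```python
# Illustrative acceptance test for a pre-trained model's error rate.
import math

def error_rate_upper_bound(errors: int, n: int, z: float = 1.96) -> float:
    """Wilson score upper bound for the true error rate, given `errors`
    observed errors in `n` labelled eval samples."""
    p = errors / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre + half

# Tolerance agreed with the business during use case validation (assumed: 5%).
TOLERANCE = 0.05
observed_errors, sample_size = 6, 400
acceptable = error_rate_upper_bound(observed_errors, sample_size) <= TOLERANCE
print("acceptable for this use case" if acceptable
      else "consider fine-tuning or a different vendor")
```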
Of course, the more concrete the use case and model information, the better operational costs can be estimated and added to the overall calculation of whether an implementation makes sense.
Worth mentioning: every use case should be challenged on whether AI is really an efficient solution to the business problem, and whether no non-AI solution exists that might be cheaper to develop.
PS: in terms of wording, I would like to clarify:
there is no such thing as “open-source API deployment mode” –
either it’s deployed on-premise by the bank with full control,
OR it’s deployed to a private cloud by the bank with almost full control (typically private cloud operators have access to the hardware as well, which is a risk to evaluate),
OR it’s deployed to a public cloud (a.k.a. SaaS offering) with minimal to zero control.
*Philosophical note: a model is a simplified representation of reality, with the goal of making it easier to understand or handle. With this definition, all models have errors.
-
Lots of good answers here.
One tough lesson from implementing this at scale is that, unlike with traditional ML, automated tests capture only a small fraction of what can go wrong with GenAI. Automated validation is necessary, but it is not sufficient.
We have typically also needed to develop large manual testing protocols for releases, where humans (either developers or a set of test users) attempt a mix of predefined and new prompts and judge the quality of the answers. Often we will uncover “issues” that are very subjective, such as answers that are technically correct but pull from different files than we wished, answers that are less or more detailed than the average user prefers, an entirely new file format having issues (hence not yet covered by tests), or a million other things!
For one of our recent clients, we ran “hackathons” alongside releases where both new and power users would try various prompts and score the output. It was incredibly helpful for identifying things the tests had failed to catch.
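For what it’s worth, a minimal sketch of how such manual/hackathon results can be recorded and rolled up so the subjective findings stay comparable across releases (the schema and the 1-to-5 scale are assumptions for illustration, not the protocol we actually used):
```python
# Illustrative record format for a manual testing / "hackathon" protocol.
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class ManualTestResult:
    prompt: str          # predefined or newly invented by the tester
    tester: str          # developer, test user, or power user
    score: int           # 1 (unusable) .. 5 (exactly what was wanted)
    issue_tag: str = ""  # e.g. "wrong source file", "too verbose", "new file format"

def summarise(results: list[ManualTestResult]) -> dict[str, float]:
    """Average score per issue tag so recurring subjective problems
    surface and can be compared release over release."""
    buckets: dict[str, list[int]] = defaultdict(list)
    for r in results:
        buckets[r.issue_tag or "no issue"].append(r.score)
    return {tag: sum(scores) / len(scores) for tag, scores in buckets.items()}
```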