Lots of good answers here.
One hard lesson from implementing this at scale is that, unlike with traditional ML, automated tests only capture a small fraction of what can go wrong with GenAI. Automated validation is necessary, but it is not sufficient.
We have typically also needed to develop large manual testing protocols for releases, where humans (either developers or a set of test users) attempt a mix of predefined and new prompts and judge the quality of the answers. Often we will uncover “issues” that are very subjective, such as answers that are technically correct but pull from different files than we wished, answers that are less or more detailed than the average user prefers, an entirely new file format having problems (and hence not yet covered by tests), or a million other things!
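To make that a bit more concrete, here is a minimal sketch of what such a manual review pass can look like: walk a reviewer through predefined plus ad-hoc prompts, show the answer and its sources, and log a score plus free-text notes so the subjective stuff (wrong source file, too verbose, etc.) gets captured. This is not our actual tooling, and `generate_answer` is a hypothetical placeholder for whatever call your pipeline exposes.

```python
import csv
import datetime
import os

# Hypothetical entry point into your GenAI pipeline; swap in your own call.
def generate_answer(prompt: str) -> dict:
    raise NotImplementedError("Replace with your RAG / GenAI pipeline call")

# A small set of release-blocking prompts you always re-run.
PREDEFINED_PROMPTS = [
    "Summarise the Q3 sales report.",
    "Which contract clauses mention early termination?",
]

def run_review_session(reviewer: str, extra_prompts: list[str],
                       out_path: str = "manual_review.csv") -> None:
    """Walk a reviewer through predefined + ad-hoc prompts and log their judgements."""
    rows = []
    for prompt in PREDEFINED_PROMPTS + extra_prompts:
        result = generate_answer(prompt)
        print(f"\nPROMPT: {prompt}\nANSWER: {result.get('answer')}")
        print(f"SOURCES: {result.get('sources')}")
        score = input("Score 1-5: ")
        notes = input("Notes (wrong source file, too verbose, etc.): ")
        rows.append({
            "timestamp": datetime.datetime.now().isoformat(),
            "reviewer": reviewer,
            "prompt": prompt,
            "answer": result.get("answer"),
            "sources": result.get("sources"),
            "score": score,
            "notes": notes,
        })
    # Append to a shared CSV so results from several reviewers accumulate.
    write_header = not os.path.exists(out_path) or os.path.getsize(out_path) == 0
    with open(out_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
```

The free-text notes column ends up being the most valuable part, because that's where the "technically correct but not what we wanted" findings show up, and it doubles as a backlog of new automated test cases.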
For one of our recent clients, we ran “hackathons” alongside releases, where both new users and power users would try various prompts and score the output. It was incredibly helpful for identifying things the tests had failed to catch.