
“artificial intelligence chatbot Grok being used to create non-consensual sexualised deepfake images of women and girls” – BBC website
The Grok story would have the power to shock even if it hadn’t become almost routine – both for Elon Musk and for AI. It serves to demonstrate that AI systems need testing – and that the test results need acting on. Machines have always done unexpected things; that’s why we test. As they do more and get more powerful, they need more testing.
I learned long ago that just because something is syntactically correct, and may even compile, it does not mean it delivers the desired result. And even if something does deliver a result, who knows whether it is the correct one?
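As a toy illustration (the function below is invented, not taken from any real system): this is syntactically valid and runs without complaint, yet it quietly gives the wrong answer.

```python
# Hypothetical example: syntactically valid, runs without error,
# but '//' truncates, so average([1, 2]) returns 1 rather than 1.5.
def average(values):
    return sum(values) // len(values)  # should be '/' for a true mean

print(average([1, 2]))  # prints 1 – only a test expecting 1.5 would catch it
```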
AI systems, and AI generated code, still need testing. I don’t know how to be any clearer.
The Grok case is pretty extreme. In many ways the system does what it was designed to do, but a good tester would have noticed, and reported, that it went beyond expectations and delivered ethically dubious results.
Our previous generation of technology could mess up just as badly: look at the Post Office Horizon system, which put people in jail and led to suicides. And humans covered up.
Hopefully, once we understand AI and what it does, we can avoid these things. But just this morning I discovered the AI Incident Database.
Ethics
Some of these things – like autonomous cars hitting pedestrians – are just good old-fashioned failures. They are worse because we are asking the machines to do more, and there are many more variables which aren’t tested for. Other things, like Grok undressing people, are simply things humans know are wrong. Humans know it so obviously that we don’t expect it to be coded for, and we don’t expect to need to test for it. There is probably no law against a computer undressing people, but it is ethically wrong.
Testing computer systems for ethics isn’t something testers have had to spend much time on before. Complicating matters is that ethics are difficult to define and vary across people, countries and cultures. I’m pretty sure that what is ethically acceptable to Elon Musk isn’t acceptable to me. But then, gun ownership in the USA is ethically acceptable but not here in the UK. Whose ethics are we testing for?
But even at a more basic level how can you be sure your AI generated code is producing what you expect?
Imagine you have your AI generate code for an invoicing system. Did you ask it to include VAT? And if you did, does it apply it correctly? To the correct products? Does it work correctly across national boundaries? – VAT rates and exemptions differ across countries.
Even if you give your AI your national VAT rule book, can you be sure it produces the right results?
You still need to test it.
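As a minimal sketch of what that testing might look like – assuming a hypothetical calculate_vat(net, product, country) function the AI has generated; the module name, product categories and the exact numbers are illustrative, not a real API – a handful of explicit checks makes those questions concrete:

```python
# A minimal sketch, assuming a hypothetical AI-generated invoicing module.
# The module, function signature and product categories are assumptions for illustration.
from invoicing import calculate_vat  # hypothetical module under test

def test_uk_standard_rate():
    # UK standard rate is 20% at the time of writing
    assert calculate_vat(net=100.00, product="laptop", country="GB") == 20.00

def test_uk_zero_rated_childrens_clothes():
    # Children's clothing is zero-rated in the UK
    assert calculate_vat(net=50.00, product="childrens_clothes", country="GB") == 0.00

def test_rates_differ_across_borders():
    # The same product attracts a different rate elsewhere, e.g. Germany's 19% standard rate
    assert calculate_vat(net=100.00, product="laptop", country="DE") == 19.00
```

Even a short list like this forces the questions the prose asks: which products, which countries, which exemptions.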
Which means: there is testing work to be done. And since the system does more, there is more to test.
Sure, you can have an AI write tests, but are you confident in those tests?
Safe AI in regulated domains
My old friend Paul Massey published a video before Christmas, Safe AI Coding in Regulated Domains.
Paul fed a specification into an AI and generated some code. To test it, he fed the spec into an AI and asked it to generate tests. Not all the tests passed: the AI-generated code contained bugs. Fortunately the AI-generated tests found them, and Paul fixed them.
Paul then applied mutation testing to the code: >= became <=, == became != and so on. He ran the tests again: only 30% of the tests which should have failed did fail. Think about that: 70% of the tests passed when they should have failed.
This leaves us with two facts:
- AI can generate code with bugs
- AI-generated tests are not sufficient
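To make the mutation idea concrete, here is a minimal sketch (the function and test are invented for illustration, not Paul’s actual code): a single operator flip that a weak test fails to catch.

```python
# Original code: ship an order only when stock covers the quantity.
def can_ship(stock, quantity):
    return stock >= quantity

# Mutant: a mutation testing tool flips '>=' to '<='.
def can_ship_mutant(stock, quantity):
    return stock <= quantity

# A weak test: it only exercises the boundary case, where original and mutant agree,
# so the mutant survives. A case such as stock=5, quantity=3 would kill it.
def test_can_ship_boundary():
    assert can_ship(3, 3) is True
    assert can_ship_mutant(3, 3) is True  # the mutant also passes – the test is too weak
```

Tools such as mutmut automate this flip-and-rerun loop across a whole codebase; the mutants that survive are a measure of how little your tests actually check.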
Paul also pointed out that the specifications contained gaps. This fits with older work from Capers Jones, where he discusses defects in specifications. I can’t remember if it was Jones or Tom Gilb (another old friend) who claims that 30% of defects are defects in the specification.
Now, good specifications take time to write – even with AI assistance. If you are happy for the AI to make all your decisions then OK, but if you have ideas about how you want the system to be, you need humans in the loop. Anyone who has written a specification will tell you how often stakeholders don’t agree on what is wanted.
Do you test your spec?
Where do your tests come from?
AI may help but is not enough.
Again, AI may help with the writing but it will need humans in the loop.
In fact, even if AI helps write the spec, helps write the code and helps with the tests, things are going to get harder. There will be more systems created, more code created, more tests needed.
Jevons paradox is at work: when things get more efficient, we use more of them. The question is not so much “can AI write all the code?” as “how are we going to test everything?”
Enter ethical testing
When spec, code and test took time and many people, there were more opportunities for someone to raise the question of ethics. Having reduced the time and the people in all those earlier steps, there is now a new step that needs to be included: ethical testing.
The process of programming was never just about cutting code, nor was the writing of the code the limiting factor – typing is not the bottleneck. In the creation of a system – specification, coding, testing – lots of decisions were being made. Those decisions still need making. Ignoring them simply lets an AI decide, for better or worse.
Do you know all the decisions the AI silently made? Do all your stakeholders agree with those decisions? Are those decisions legal and ethical?