In July last year, the United Nations called for regulating the use of artificial intelligence, emphasizing in a statement that member states should establish mutually agreed rules "before it is too late."
The UN statement also calls for mechanisms to prevent the use of AI tools to spread hatred and disinformation and to mislead the public, all of which encourages extremism, exacerbates conflicts and reinforces stereotypes and prejudices in communities around the world.
Various professional organizations have long warned that the moment when artificial intelligence makes it impossible to distinguish fact from fiction has largely arrived, which creates additional challenges in the fight against fake news and malicious manipulation of audience attitudes.
When we think of media content modified or entirely created using artificial intelligence, the most common association is deepfake video, to which the Serbian audience has also been exposed on a national television channel. However, media and technology professionals draw attention to an even greater danger - deepfake audio content.
A deepfake audio recording appeared recently in which the voice of the current President of the United States, Joe Biden, delivered a message intended to discourage voters by suggesting that their turnout was not decisive. In October last year, the target of deepfake audio manipulation was Barack Obama. Given these and many other examples, it is understandable that there is growing concern that fake audio content is becoming a new, powerful weapon in the online disinformation war, taking the manipulation of citizens' attitudes to completely new levels ahead of numerous elections this year.
Creating deepfake audio content is relatively simple and can be done by anyone. Digital forensics experts say that only a few minutes of a person's authentic voice are needed, and that a cheap, widely available tool can then clone the voice from that reference. After that, one only has to type the desired sentences, and convincing speech is produced in a few seconds. Besides the text-to-speech approach, the same can be done with a speech-to-speech approach.
In contrast to its simple and cheap production, detecting deepfake audio content is much more complex and expensive, and requires highly developed digital tools and skills. Deepfake video offers far more opportunities to spot manipulation - from unusual facial expressions to blurred parts of the image - whereas with a fake voice, background noise, music or simply reduced recording quality can easily hide deviations from the authentic voice.
Barack Obama's fake voice was exposed by NewsGuard, which linked it to a network of 17 TikTok accounts that used hyperrealistic AI voice technology to spread misinformation. NewsGuard said the network has produced around 5,000 videos since May 2023, many of which contain apparently AI-generated voiceovers.
TikTok formally requires clear labeling of content produced using AI, but the aforementioned videos with the fake voice of Barack Obama were neither detected nor labeled. In parallel, Meta developed AudioSeal - the first audio "watermark" system specialized for localizing synthesized speech within audio clips, which is a big step, but not a final solution to the problem.
In the fight against disinformation spread through deepfake content, coordinated and simultaneous action by industry, legislators and the education system will be necessary. It is already certain that serious, systemic regulation lags far behind developments in practice and everyday life.
The author is a media theorist