On a recent night, a woman named Robin was asleep next to her husband, Steve, in their Brooklyn home, when her phone buzzed on the bedside table. Robin is in her mid-thirties with long, dirty-blond hair. She works as an interior designer, specializing in luxury homes. The couple had gone out to a natural-wine bar in Cobble Hill that evening, and had come home a few hours earlier and gone to bed. Their two young children were asleep in bedrooms down the hall. "I'm always, like, kind of one ear awake," Robin told me, recently. When her phone rang, she opened her eyes and looked at the caller I.D. It was her mother-in-law, Mona, who never called after midnight. "I'm, like, maybe it's a butt-dial," Robin said. "So I ignore it, and I try to roll over and go back to bed. But then I see it pop up again."
She picked up the phone, and, on the other end, she heard Mona's voice wailing and repeating the words "I can't do it, I can't do it." "I thought she was trying to tell me that some horrible tragic thing had happened," Robin told me. Mona and her husband, Bob, are in their seventies. She's a retired party planner, and he's a dentist. They spend the warm months in Bethesda, Maryland, and winters in Boca Raton, where they play pickleball and canasta. Robin's first thought was that there had been an accident. Robin's parents also winter in Florida, and she pictured the four of them in a car wreck. "Your brain does weird things in the middle of the night," she said. Robin then heard what sounded like Bob's voice on the phone. (The family members requested that their names be changed to protect their privacy.) "Mona, pass me the phone," Bob's voice said, then, "Get Steve. Get Steve." Robin took this (that they didn't want to tell her while she was alone) as another sign of the seriousness of the situation. She shook Steve awake. "I think it's your mom," she told him. "I think she's telling me something terrible happened."
Steve, who has close-cropped hair and an athletic build, works in law enforcement. When he opened his eyes, he found Robin in a state of panic. "She was screaming," he recalled. "I thought her whole family was dead." When he took the phone, he heard a relaxed male voice, possibly Southern, on the other end of the line. "You're not gonna call the police," the man said. "You're not gonna tell anybody. I've got a gun to your mom's head, and I'm gonna blow her brains out if you don't do exactly what I say."
Steve used his own phone to call a colleague with experience in hostage negotiations. The colleague was muted, so that he could hear the call but wouldn't be heard. "You hear this???" Steve texted him. "What should I do?" The colleague wrote back, "Taking notes. Keep talking." The idea, Steve said, was to continue the conversation, delaying violence and trying to learn any useful information.
"I want to hear her voice," Steve said to the man on the phone.
The man refused. "If you ask me that again, I'm gonna kill her," he said. "Are you fucking crazy?"
"O.K.," Steve said. "What do you want?"
The man demanded money for travel; he wanted five hundred dollars, sent through Venmo. "It was such an insanely small amount of money for a human being," Steve recalled. "But also: I'm obviously gonna pay this." Robin, listening in, reasoned that someone had broken into Steve's parents' home to hold them up for a little cash. On the phone, the man gave Steve a Venmo account to send the money to. It didn't work, so he tried a few more, and eventually found one that did. The app asked what the transaction was for.
"Put in a pizza emoji," the man said.
After Steve sent the five hundred dollars, the man patched in a female voice (a girlfriend, it seemed), who said that the money had come through, but that it wasn't enough. Steve asked if his mother would be released, and the man got upset that he was bringing this up with the woman listening. "Whoa, whoa, whoa," he said. "Baby, I'll call you later." The implication, to Steve, was that the woman didn't know about the hostage situation. "That made it even more real," Steve told me. The man then asked for an additional two hundred and fifty dollars to get a ticket for his girlfriend. "I've gotta get my baby mama down here to me," he said. Steve sent the additional sum, and, when it processed, the man hung up.
By this time, about twenty-five minutes had elapsed. Robin cried, and Steve spoke to his colleague. "You guys did great," the colleague said. He told them to call Bob, since Mona's phone was clearly compromised, to make sure that he and Mona were now safe. After a few tries, Bob picked up the phone and handed it to Mona. "Are you at home?" Steve and Robin asked her. "Are you O.K.?"
Mona sounded fine, but she was unsure of what they were talking about. "Yeah, I'm in bed," she replied. "Why?"
Artificial intelligence is revolutionizing seemingly every aspect of our lives: medical diagnosis, weather forecasting, space exploration, and even mundane tasks like writing e-mails and searching the Internet. But with increased efficiencies and computational accuracy has come a Pandora's box of trouble. Deepfake video content is proliferating across the Internet. The month after Russia invaded Ukraine, a video surfaced on social media in which Ukraine's President, Volodymyr Zelensky, appeared to tell his troops to surrender. (He had not done so.) In early February of this year, Hong Kong police announced that a finance worker had been tricked into paying out twenty-five million dollars after taking part in a video conference with people he believed were members of his firm's senior staff. (They were not.) Thanks to large language models like ChatGPT, phishing e-mails have grown increasingly sophisticated, too. Steve and Robin, meanwhile, fell victim to another new scam, which uses A.I. to replicate a loved one's voice. "We've now passed through the uncanny valley," Hany Farid, who studies generative A.I. and manipulated media at the University of California, Berkeley, told me. "I can now clone the voice of just about anybody and get them to say just about anything. And what you think would happen is exactly what's happening."
Robots aping human voices are not new, of course. In 1984, an Apple computer became one of the first that could read a text file in a tinny robotic voice of its own. "Hello, I'm Macintosh," a squat machine announced to a live audience, at an unveiling with Steve Jobs. "It sure is great to get out of that bag." The computer took potshots at Apple's main competitor at the time, saying, "I'd like to share with you a maxim I thought of the first time I met an I.B.M. mainframe: never trust a computer you can't lift." In 2011, Apple released Siri; inspired by the talking computers of "Star Trek," the program could interpret precise commands ("Play Steely Dan," say, or "Call Mom") and respond with a limited vocabulary. Three years later, Amazon released Alexa. Synthesized voices were cohabiting with us.
Still, until a few years ago, advances in synthetic voices had plateaued. They weren't entirely convincing. "If I'm trying to create a better version of Siri or G.P.S., what I care about is naturalness," Farid explained. "Does this sound like a human being and not like this creepy half-human, half-robot thing?" Replicating a specific voice is even harder. "Not only do I have to sound human," Farid went on. "I have to sound like you." In recent years, though, the problem began to benefit from more money, more data (importantly, troves of voice recordings online), and breakthroughs in the underlying software used for generating speech. In 2019, this bore fruit: a Toronto-based A.I. company called Dessa cloned the podcaster Joe Rogan's voice. (Rogan responded with "awe" and acceptance on Instagram at the time, adding, "The future is gonna be really fucking weird, kids.") But Dessa needed a lot of money and hundreds of hours of Rogan's very available voice to make its product. Its success was a one-off.
In 2022, though, a New York-based company called ElevenLabs unveiled a service that quickly produced impressive clones of virtually any voice; breathing sounds had been incorporated, and more than two dozen languages could be cloned. ElevenLabs's technology is now widely available. "You can just navigate to an app, pay five dollars a month, feed it forty-five seconds of someone's voice, and then clone that voice," Farid told me. The company is now valued at more than a billion dollars, and the rest of Big Tech is chasing close behind. The designers of Microsoft's Vall-E cloning program, which débuted last year, trained it on sixty thousand hours of English-language audiobook narration from more than seven thousand speakers. Vall-E, which is not available to the public, can reportedly replicate the voice and "acoustic environment" of a speaker with just a three-second sample.
Voice-cloning technology has undoubtedly improved some lives. The Voice Keeper is among a handful of companies that are now "banking" the voices of those suffering from voice-depriving diseases like A.L.S., Parkinson's, and throat cancer, so that, later, they can continue speaking with their own voice through text-to-speech software. A South Korean company recently launched what it describes as the first "AI memorial service," which allows people to "live in the cloud" after their deaths and "speak" to future generations. The company suggests that this can "alleviate the pain of the death of your loved ones." The technology has other legal, if less altruistic, applications. Celebrities can use voice-cloning programs to "loan" their voices to record advertisements and other content: the College Football Hall of Famer Keith Byars, for example, recently let a chicken chain in Ohio use a clone of his voice to take orders. The film industry has also benefitted. Actors in films can now "speak" other languages: English, say, when a foreign movie is released in the U.S. "That means no more subtitles, and no more dubbing," Farid said. "Everybody can speak whatever language you want." Multiple publications, including The New Yorker, use ElevenLabs to offer audio narrations of stories. Last year, New York's mayor, Eric Adams, sent out A.I.-enabled robocalls in Mandarin and Yiddish, languages he does not speak. (Privacy advocates called this a "creepy vanity project.")
But, more often, the technology seems to be used for nefarious purposes, like fraud. This has become easier now that TikTok, YouTube, and Instagram store endless videos of regular people talking. "It's simple," Farid explained. "You take thirty or sixty seconds of a kid's voice and log in to ElevenLabs, and pretty soon Grandma's getting a call in Grandson's voice saying, 'Grandma, I'm in trouble, I've been in an accident.'" A financial request is almost always the end game. Farid went on, "And here's the thing: the bad guy can fail ninety-nine per cent of the time, and they will still become very, very rich. It's a numbers game." The prevalence of these illegal efforts is difficult to measure, but, anecdotally, they've been on the rise for a few years. In 2020, a corporate attorney in Philadelphia took a call from what he thought was his son, who said he had been injured in a car wreck involving a pregnant woman and needed nine thousand dollars to post bail. (He found out it was a scam when his daughter-in-law called his son's office, where he was safely at work.) In January, voters in New Hampshire received a robocall in Joe Biden's voice telling them not to vote in the primary. (The man who admitted to generating the call said that he had used ElevenLabs software.) "I didn't think about it at the time that it wasn't his real voice," an elderly Democrat in New Hampshire told the Associated Press. "That's how convincing it was."