Computer Science: Just Say — Then Edit — the Word ...

By Carrie Compton

Published June 30, 2017

Marc Rosenthal ’71

Anyone who’s recorded a five-second voicemail greeting knows how easy it is to flub a line. Now imagine the many verbal stumbles inevitable in something longer, such as a 90-minute voice-over for a nature documentary.

Photo: Finkelstein. Courtesy Adam Finkelstein

While the technology for video has come a long way since reel-to-reel editing, remedying an unintelligible or inaccurate word in an audio track still entails, in some ways, the same laborious process that bedeviled editors 40 years ago. There are essentially three options for dealing with a problematic word in audio editing: painstakingly re-create it by cobbling together snippets of the speaker’s voice, rerecord the one word, or redo the entire track.

Enter Project VoCo, new audio-editing software being developed by computer science graduate student Zeyu Jin and his adviser, Professor Adam Finkelstein, whose joint paper on the subject appears in the July issue of the journal ACM Transactions on Graphics. VoCo uses an algorithm that identifies and then stitches together snippets of recorded material to reproduce a word in the speaker’s voice. For example, imagine an editor must create the word “purchase” for a voice-over. To do that, he or she might combine other parts of the recording — such as the beginning of the word “pursue,” with the middle of “speeches,” and the final sound in “this.” VoCo would search out these sounds and combine them in a fraction of the time it would take a human.
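The “purchase” example can be made concrete with a toy sketch. This is not VoCo’s algorithm, which the article does not describe in detail; it is a simple greedy search over phoneme labels, and the transcriptions, the function names, and the tie-breaking rule (prefer a word-final sound for a word-final slot) are all assumptions made for illustration. Real software would also have to match pitch and timing and smooth the joins between snippets.

```python
from typing import Dict, List, Optional, Tuple

# Hand-written, ARPAbet-style transcriptions (assumed for illustration only).
RECORDED: Dict[str, List[str]] = {
    "pursue": ["P", "ER", "S", "UW"],
    "speeches": ["S", "P", "IY", "CH", "AH", "Z"],
    "this": ["DH", "IH", "S"],
}
TARGET: List[str] = ["P", "ER", "CH", "AH", "S"]  # "purchase"

# (source word, start index, end index exclusive) into that word's phonemes
Snippet = Tuple[str, int, int]


def best_snippet(target: List[str], pos: int,
                 recorded: Dict[str, List[str]]) -> Optional[Snippet]:
    """Find the longest recorded run of phonemes matching target[pos:].

    Ties are broken in favor of snippets whose word-final status matches
    the target's (a hypothetical heuristic, not taken from the article).
    """
    best: Optional[Snippet] = None
    best_score = (0, False)
    for word, phones in recorded.items():
        for start in range(len(phones)):
            length = 0
            while (pos + length < len(target)
                   and start + length < len(phones)
                   and phones[start + length] == target[pos + length]):
                length += 1
            if length == 0:
                continue
            source_is_final = start + length == len(phones)
            target_is_final = pos + length == len(target)
            score = (length, source_is_final == target_is_final)
            if score > best_score:
                best_score = score
                best = (word, start, start + length)
    return best


def stitch_plan(target: List[str],
                recorded: Dict[str, List[str]]) -> List[Snippet]:
    """Greedily cover the target word with snippets from recorded words."""
    plan: List[Snippet] = []
    pos = 0
    while pos < len(target):
        snippet = best_snippet(target, pos, recorded)
        if snippet is None:
            raise ValueError(f"no recorded source for phoneme {target[pos]!r}")
        plan.append(snippet)
        pos += snippet[2] - snippet[1]  # advance past the phonemes just covered
    return plan


if __name__ == "__main__":
    for word, start, end in stitch_plan(TARGET, RECORDED):
        print(word, RECORDED[word][start:end])
```

With these assumed transcriptions, the plan takes the first two sounds of “pursue,” the middle two of “speeches,” and the last sound of “this,” mirroring the combination described above.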

The team also built the software to be more user-friendly than the highly esoteric editing software currently on the market. VoCo’s interface more closely resembles word-processing platforms, allowing for desired audio to be typed in or moved around in a project via copy and paste.

“Right now, audio processing is for experts,” and existing software is “very unintuitive to average users,” says Jin. “My interest is to … bridge human and digital audio and make editing a creative and enjoyable experience.”

Both Jin and Finkelstein acknowledge the ethical quandaries that await VoCo, which is jointly owned by Princeton and Adobe and is still in the development phase, but Finkelstein says that altering what someone says on an audio track has always been possible — just tedious.

“This discussion already happened with photographs when digital tools for editing photos became really powerful,” says Finkelstein, adding that just as readers must trust that photographs in newspapers and magazines are genuine, consumers will have to have faith in the integrity of audio editors, too.

Jin says he hopes to add some kind of verification process to determine if audio has been tampered with, much as watermarks bring authenticity to electronic photos and documents. He also believes that VoCo could lead to further advancements for those who rely on synthetic speech — like Stephen Hawking — that would enable them to produce words in their original voice.

Published in the July 12, 2017, Issue