AI Dubs Over Subs? Translating and Dubbing Videos with AI
Alongside cooking for myself and walking laps around the house, Japanese cartoons (or “anime” as the kids are calling it) are something I’ve learned to love during quarantine.
The problem with watching anime, though, is that short of learning Japanese, you become dependent on human translators and voice actors to port the content to your language. Sometimes you get the subtitles (“subs”) but not the voicing (“dubs”). Other times, entire seasons of shows aren’t translated at all, and you’re left on the edge of your seat with only Wikipedia summaries and 90s web forums to ferry you through the darkness.
So what are you supposed to do? The answer is obviously not to ask a computer to transcribe, translate, and voice-act entire episodes of a TV show from Japanese to English. Translation is a careful art that can’t be automated, and requires the loving touch of a human hand. Besides, even if you did use machine learning to translate a video, you couldn’t use a computer to dub… I mean, who would want to listen to machine voices for an entire season? It’d be awful. Only a real sicko would want that.
So in this post, I’ll show you how to use machine learning to transcribe, translate, and voice-act videos from one language to another, i.e. “AI-Powered Video Dubs.” It might not get you Netflix-quality results, but you can use it to localize online talks and YouTube videos in a pinch. We’ll start by transcribing audio to text using Google Cloud’s Speech-to-Text API. Next, we’ll translate that text with the Translate API. Finally, we’ll “voice act” the translations using the Text-to-Speech API, which produces voices that are, according to the docs, “humanlike.”
(By the way, before you flame-blast me in the comments, I should tell you that YouTube will automatically and for free transcribe and translate your videos for you. So you can treat this project like your new hobby of baking sourdough from scratch: a really inefficient use of 30 hours.)
AI-Dubbed Videos: Do they usually sound good?
Before you embark on this journey, you probably want to know what you have to look forward to. What quality can we realistically expect to achieve from an ML-video-dubbing pipeline?
Here’s one example dubbed automatically from English to Spanish (the subtitles are also automatically generated in English). I haven’t done any tuning or adjusting on it:
As you can see, the transcriptions are decent but not perfect, and the same for the translations. (Ignore the fact that the speaker sometimes speaks too fast–more on that later.) Overall, you can easily get the gist of what’s going on from this dubbed video, but it’s not exactly near human-quality.
What makes this project trickier (read: more fun) than most is that there are at least three possible points of failure:
- The video can be incorrectly transcribed from audio to text by the Speech-to-Text API
- That text can be incorrectly or awkwardly translated by the Translation API
- Those translations can be mispronounced by the Text-to-Speech API
In my experience, the most successful dubbed videos were those that featured a single speaker over a clear audio stream and that were dubbed from English to another language. This is largely because the quality of transcription (Speech-to-Text) was much higher in English than other source languages.
Dubbing from non-English languages proved substantially more challenging. Here’s one particularly unimpressive dub from Japanese to English of one of my favorite shows, Death Note:
If you want to leave translation/dubbing to humans, well–I can’t blame you. But if not, read on!
Building an AI Translating Dubber
As always, you can find all of the code for this project in the Making with Machine Learning Github repo. To run the code yourself, follow the README to configure your credentials and enable APIs. Here in this post, I’ll just walk through my findings at a high level.
First, here are the steps we’ll follow:
- Extract audio from video files (a quick sketch of this step follows the list)
- Convert audio to text using the Speech-to-Text API
- **Split transcribed text into sentences/segments for translation**
- Translate text
- Generate spoken audio versions of the translated text
- **Speed up the generated audio to align with the original speaker in the video**
- Stitch the new audio on top of the old audio/video
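Before any of the Cloud APIs come into play, we need that first step: pulling a clean audio track out of the video. Here's a minimal sketch of one way to do it (not necessarily the exact approach in the repo), shelling out to ffmpeg for a 16 kHz mono WAV, a format Speech-to-Text handles well:
import subprocess

def extract_audio(video_path, audio_path):
    # Pull a 16 kHz mono WAV track out of the video with ffmpeg
    # (assumes ffmpeg is installed and on your PATH).
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_path,
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit linear PCM
        "-ar", "16000",          # 16 kHz sample rate
        "-ac", "1",              # mono
        audio_path,
    ], check=True)

extract_audio("my_movie_file.mp4", "my_movie_file.wav")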
I admit that when I first set out to build this dubber, I was full of hubris–all I had to do was plug a few APIs together, what could be easier? But as a programmer, all hubris must be punished, and boy, was I punished.
The challenging bits are the ones I bolded above; they mainly come from having to align translations with video. But more on that in a bit.
Using the Google Cloud Speech-to-Text API
The first step in translating a video is transcribing its audio to words. To do this, I used Google Cloud’s Speech-to-Text API. This tool can recognize audio spoken in 125 languages, but as I mentioned above, the quality is highest in English. For our use case, we’ll want to enable a couple of special features, like:
- Enhanced models. These are Speech-to-Text models that have been trained on specific data types (“video,” “phone_call”) and are usually higher-quality. We’ll use the “video” model, of course.
- Profanity filters. This flag prevents the API from returning any naughty words.
- Word time offsets. This flag tells the API that we want transcribed words returned along with the times that the speaker said them. We’ll use these timestamps to help align our subtitles and dubs with the source video.
- Speech Adaptation. Typically, Speech-to-Text struggles most with uncommon words or phrases. If you know certain words or phrases are likely to appear in your video (e.g. "gradient descent," "support vector machine"), you can pass them to the API in an array that will make them more likely to be transcribed:
from google.cloud import speech

client = speech.SpeechClient()
# Audio must be uploaded to a GCS bucket if it's > 5 min
audio = speech.RecognitionAudio(uri="gs://path/to/my/audio.wav")
config = speech.RecognitionConfig(
    language_code="en-US",
    # Automatically transcribe punctuation
    enable_automatic_punctuation=True,
    enable_word_time_offsets=True,
    speech_contexts=[
        # Boost the likelihood of recognizing these words:
        {"phrases": ["gradient descent", "support vector machine"],
         "boost": 15}
    ],
    profanity_filter=True,
    use_enhanced=True,
    model="video")
res = client.long_running_recognize(config=config, audio=audio).result()
The API returns the transcribed text along with word-level timestamps as JSON. As an example, I transcribed this video. You can see the JSON returned by the API in this gist. The output also lets us do a quick quality sanity check:
What I actually said:
“Software Developers. We’re not known for our rockin’ style, are we? Or are we? Today, I’ll show you how I used ML to make me trendier, taking inspiration from influencers.”
What the API thought I said:
“Software developers. We’re not known for our Rock and style. Are we or are we today? I’ll show you how I use ml to make new trendier taking inspiration from influencers.”
In my experience, this is about the quality you can expect when transcribing high-quality English audio. Note that the punctuation is a little off. If you’re happy with viewers getting the gist of a video, this is probably good enough, although it’s easy to manually correct the transcripts yourself if you speak the source language.
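For subtitling and alignment, we also need those word-level timestamps. Here's a minimal sketch of pulling them out of the response, assuming the `res` object from the snippet above (with recent versions of the google-cloud-speech library, the offsets come back as timedeltas):
# Flatten the response into a list of (word, start, end) records.
words = []
for result in res.results:
    best = result.alternatives[0]  # the highest-confidence transcription
    for word_info in best.words:
        words.append({
            "word": word_info.word,
            "start": word_info.start_time.total_seconds(),
            "end": word_info.end_time.total_seconds(),
        })
These per-word times are what let us line the translated audio back up with the original video later.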
At this point, we can use the API output to generate (non-translated) subtitles. In fact, if you run my script with the `--srt` flag, it will do exactly that for you (SRT is a file type for closed captions):
python dubber.py my_movie_file.mp4 "en" outputDirectory --srt --targetLangs ["es"]
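If you're curious what the script is actually writing out, the SRT format itself is simple: a counter, a time range, and the caption text. Here's a minimal sketch of a writer (not the repo's actual code), where `segments` is a hypothetical list of dicts with start/end times in seconds and the caption text:
def to_srt_time(seconds):
    # Format seconds as an SRT timestamp: HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3600000)
    m, ms = divmod(ms, 60000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(segments, path):
    with open(path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(segments, start=1):
            f.write(f"{i}\n")
            f.write(f"{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n")
            f.write(seg["text"].strip() + "\n\n")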
Machine Translation
Now that we have the video transcripts, we can use the Translate API to… uh… translate them.
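At its simplest, that's one API call per chunk of text. Here's a minimal sketch using the basic (v2) Cloud Translation client; the actual dubber has to do more bookkeeping to keep sentences and timestamps together:
from google.cloud import translate_v2 as translate

translate_client = translate.Client()

def translate_text(text, target_lang="es", source_lang="en"):
    # Translate a single chunk of text and return the translated string.
    result = translate_client.translate(
        text, target_language=target_lang, source_language=source_lang)
    return result["translatedText"]

print(translate_text("Software developers. We're not known for our rockin' style, are we?"))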
This is where things start to get a little