Automatic speech recognition

Transcribe and translate audio and video files in Language Weaver Edge.

In Language Weaver Edge, you can use automatic speech recognition (ASR) to transcribe audio and video files. This feature uses the open-source Whisper v2 model from OpenAI. For more information, including the list of languages available for audio recognition and their word error rates, see https://openai.com/index/whisper/.

The Whisper model is integrated in Language Weaver Edge as is, which means that it is neither updated nor fine-tuned.

The ASR functionality is available as of Language Weaver Edge 8.6.5. To add it to your installation, obtain it as a separate installer. No licensing changes are required. If you have any questions, contact your Language Weaver Edge representative.

After installing the ASR functionality, you need to add a new audio transcription engine. Each engine uses one processing unit (PU). Multiple ASR engines can run in parallel, and each engine can transcribe any of the supported languages.

Audio transcription can run on either GPU or CPU. On GPU, the average transcription Real-Time Factor is 0.25 (on average, it takes 1 second to process 4 seconds of audio content). On CPU, the average transcription Real-Time Factor is 3.2 (on average, it takes 3.2 seconds to process 1 second of audio content). These figures are averages and may vary depending on the hardware, the language and the quality of the audio source file, so performance testing with your own source material is advised. See Recommended system requirements for the system requirements to perform audio transcription in Language Weaver Edge.
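As a rough illustration of the Real-Time Factor figures above, the following sketch multiplies the audio duration by the average factor to estimate processing time. The function and constant names are hypothetical and are not part of Language Weaver Edge. The result is only as reliable as the averages themselves.

```python
# Average Real-Time Factors quoted above. Actual values vary with the
# hardware, the language and the quality of the audio source file.
AVERAGE_RTF = {"gpu": 0.25, "cpu": 3.2}


def estimate_processing_seconds(audio_seconds: float, device: str) -> float:
    """Estimate processing time as audio duration multiplied by the RTF."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds * AVERAGE_RTF[device.lower()]


if __name__ == "__main__":
    one_hour = 60 * 60  # duration of the recording, in seconds
    for device in ("gpu", "cpu"):
        estimate = estimate_processing_seconds(one_hour, device)
        print(f"{device}: about {estimate / 60:.0f} minutes for 1 hour of audio")
```

With these averages, one hour of audio takes roughly 15 minutes on GPU and roughly 192 minutes on CPU.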

Multiple audio transcription engines can share the same GPU, but they cannot share it with a training engine, because a training engine requires a GPU dedicated to training. To use training engines and audio transcription engines in parallel, start them on different hosts, each with its own GPU.

Once an audio transcription engine has been started, an Audio Files option appears under Settings in the Translate tab. You can then upload .wav, .mp3 or .mp4 files in the same way as any other file type. The maximum file size is 100 MB.
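If you prepare files in bulk, it can help to check them against these limits before uploading. The sketch below is a local pre-check only and does not call any Language Weaver Edge API. The helper name is hypothetical, and reading 100 MB as 100,000,000 bytes is a conservative assumption, since the exact byte limit is not stated here.

```python
from pathlib import Path
from typing import List

ALLOWED_EXTENSIONS = {".wav", ".mp3", ".mp4"}
# Assumption: 100 MB is read as 100,000,000 bytes (the stricter reading).
MAX_FILE_BYTES = 100 * 1000 * 1000


def upload_problems(name: str, size_bytes: int) -> List[str]:
    """Return the reasons a file would be rejected; empty if none are found."""
    problems = []
    extension = Path(name).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file exceeds 100 MB: {size_bytes} bytes")
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_audio.py recording1.mp3 recording2.wav
    for argument in sys.argv[1:]:
        path = Path(argument)
        issues = upload_problems(path.name, path.stat().st_size)
        print(f"{path}: {'; '.join(issues) if issues else 'ok'}")
```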

You can either transcribe only, or transcribe and translate into any of the languages available in your setup. For the output format, you can choose between .txt and .vtt.
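The .vtt output keeps timing information for each segment, which is useful for subtitles, while .txt contains text only. As a minimal sketch of how timed segments might be read from a .vtt file, the following parser extracts cue start times, end times and text from standard WebVTT content. The sample content is invented for illustration, and the exact layout of the files produced by Language Weaver Edge may differ.

```python
import re
from typing import List, Tuple

# WebVTT timing line, for example "00:00:04.000 --> 00:00:07.500".
# The hours component is optional in the WebVTT format.
STAMP = r"(?:\d+:)?\d{2}:\d{2}\.\d{3}"
TIMING = re.compile(rf"({STAMP})\s+-->\s+({STAMP})")


def to_seconds(stamp: str) -> float:
    """Convert "hh:mm:ss.ttt" or "mm:ss.ttt" to seconds."""
    parts = stamp.split(":")
    hours = int(parts[-3]) if len(parts) == 3 else 0
    return hours * 3600 + int(parts[-2]) * 60 + float(parts[-1])


def parse_cues(vtt_text: str) -> List[Tuple[float, float, str]]:
    """Return (start_seconds, end_seconds, text) for each cue."""
    cues = []
    # Cues are separated by blank lines; blocks without a timing line
    # (such as the "WEBVTT" header) are skipped.
    for block in re.split(r"\n\s*\n", vtt_text.strip()):
        lines = block.splitlines()
        for index, line in enumerate(lines):
            match = TIMING.match(line.strip())
            if match:
                text = " ".join(part.strip() for part in lines[index + 1:])
                cues.append(
                    (to_seconds(match.group(1)), to_seconds(match.group(2)), text)
                )
                break
    return cues


# Invented sample content, not actual product output.
SAMPLE = """WEBVTT

00:00:00.000 --> 00:00:04.000
Welcome to the quarterly review.

00:00:04.000 --> 00:00:07.500
Let us start with the agenda.
"""

if __name__ == "__main__":
    for start, end, text in parse_cues(SAMPLE):
        print(f"[{start:.1f}s to {end:.1f}s] {text}")
```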