Whisper (OpenAI)

Whisper is an automatic speech recognition (ASR) system developed by OpenAI. It is an open-source neural net that has been trained on a large and diverse dataset consisting of 680,000 hours of multilingual and multitask supervised data collected from the web. This extensive training has resulted in improved robustness to accents, background noise, and technical language. Whisper is capable of transcribing speech in multiple languages and can also translate speech from other languages into English.

Whisper uses a simple end-to-end architecture, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and passed to the encoder. The decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
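The two ends of that pipeline can be sketched in plain Python: cutting audio into fixed 30-second windows, and building the special-token prefix that selects a task. This is only an illustration, not OpenAI's implementation. The function names `split_into_chunks` and `task_prompt` are hypothetical, the 16 kHz sample rate and zero-padding of the last chunk are assumptions about preprocessing, and the log-Mel spectrogram step is left out entirely.

```python
SAMPLE_RATE = 16_000   # assumed sample rate in Hz
CHUNK_SECONDS = 30     # chunk length stated in the text
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS


def split_into_chunks(samples):
    """Split a sequence of audio samples into 30-second chunks.

    The final chunk is zero-padded so every chunk has the same length
    (padding with silence is an assumption of this sketch).
    """
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = list(samples[start:start + CHUNK_SAMPLES])
        chunk.extend([0.0] * (CHUNK_SAMPLES - len(chunk)))
        chunks.append(chunk)
    return chunks


def task_prompt(language, task="transcribe", timestamps=False):
    """Build the special-token prefix that tells the decoder what to do."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Ask the decoder to skip timestamp prediction.
        tokens.append("<|notimestamps|>")
    return tokens
```

For example, `task_prompt("fr", "translate")` builds the prefix for translating French speech into English text, while `task_prompt("fr")` asks for a French transcript of the same audio.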

Compared with other existing approaches, Whisper stands out for its training on a large and diverse dataset without fine-tuning to any specific benchmark. As a result, it does not beat models specialized in LibriSpeech performance, but when its zero-shot performance is measured across many diverse datasets it is more robust and makes 50% fewer errors than those models. Approximately one-third of Whisper’s audio dataset is non-English, and the model is alternately given the task of transcribing in the original language or translating to English, which makes it particularly effective at speech-to-text translation.
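The text does not name the error metric, but speech recognition errors are commonly reported as word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. Assuming that metric, a minimal sketch looks like this (the function name `word_error_rate` is illustrative and not part of any Whisper API):

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Under this reading, "50% fewer errors" is a relative reduction. With purely illustrative numbers, a model scoring a WER of 0.10 on a test set makes 50% fewer errors than one scoring 0.20 on the same set.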

OpenAI hopes that Whisper’s high accuracy and ease of use will enable developers to incorporate voice interfaces into a wider range of applications. To learn more about Whisper, you can check out the paper, model card, and code provided on the OpenAI website.
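As a rough idea of what using the released code looks like, the sketch below wraps the open-source `whisper` Python package (installed as `openai-whisper`, which also needs `ffmpeg` on the system). The wrapper name `transcribe_file`, its defaults, and the choice of the `base` model size are assumptions made for illustration.

```python
def transcribe_file(path, task="transcribe", model_name="base"):
    """Return the text Whisper produces for an audio file.

    task="transcribe" keeps the spoken language;
    task="translate" produces English text.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")

    # Imported lazily so the argument check above works even when the
    # optional `openai-whisper` package is not installed.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(path, task=task)
    return result["text"]
```

Calling `transcribe_file("interview.mp3", task="translate")` on a non-English recording would return an English rendering of the speech. The file name here is a placeholder.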

In terms of pricing, OpenAI offers a simple and flexible payment model where users only pay for the resources they use. When signing up, users are granted an initial spend limit or quota, which can be increased over time based on their application’s track record. OpenAI also provides a free credit of $5 for users to start experimenting with the service during their first 3 months.

Features

  • Automatic speech recognition (ASR) system
  • Trained on 680,000 hours of multilingual and multitask supervised data
  • Improved robustness to accents, background noise, and technical language
  • Enables transcription in multiple languages
  • Translation from multiple languages into English
  • Simple end-to-end approach using encoder-decoder Transformer
  • Input audio split into 30-second chunks
  • Converts audio into log-Mel spectrogram
  • Predicts corresponding text caption
  • Supports tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation
  • More robust, making 50% fewer errors than specialized models in zero-shot evaluation across diverse datasets
  • About a third of the audio dataset is non-English
  • Effective at speech-to-text translation
  • High accuracy and ease of use
  • Flexible pricing based on usage
