Google Cloud Speech-to-Text Tutorial
Introduction
Google Cloud's Speech-to-Text service lets developers convert audio to text using neural network models. The API recognizes over 120 languages and variants, which helps you support a global user base. It is especially useful for transcribing audio files and for processing audio streams in real time.
Prerequisites
Before you begin, ensure you have the following:
- A Google Cloud Platform (GCP) account.
- Billing enabled for your GCP account.
- Google Cloud SDK installed.
- A project created in the Google Cloud Console.
Enable the Speech-to-Text API
First, you need to enable the Speech-to-Text API for your project:
- Go to the Speech-to-Text API page in the Google Cloud Console.
- Click "Enable" to enable the API.
Set Up Authentication
This tutorial authenticates to Google Cloud with a service account key. Follow these steps to set up your service account:
- In the Google Cloud Console, go to the Service accounts page.
- Click "Create Service Account".
- Enter a name and description for the service account, then click "Create".
- Assign the "Project > Editor" role to the service account, then click "Continue".
- Click "Done" to finish creating the service account.
- Click the "Actions" menu for your new service account, then select "Create Key".
- Select "JSON" as the key type and click "Create".
- Save the JSON key file to a secure location.
Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the file path of the JSON file:
export GOOGLE_APPLICATION_CREDENTIALS="path/to/your/service-account-file.json"
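Before making any API calls, it can help to confirm that the variable is set and points to a readable key file. The following is a minimal sketch using only the Python standard library. The helper name check_credentials_env is hypothetical and not part of any Google library, and the check on the file's "type" field assumes a standard service account key.

```python
import json
import os


def check_credentials_env(environ=None):
    """Return (ok, message) describing the GOOGLE_APPLICATION_CREDENTIALS setup.

    Hypothetical helper for illustration: it only inspects the local file
    and never contacts Google Cloud.
    """
    environ = os.environ if environ is None else environ
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return False, "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not os.path.isfile(path):
        return False, "no file at {}".format(path)
    try:
        with open(path, "r", encoding="utf-8") as key_file:
            key = json.load(key_file)
    except (OSError, ValueError) as exc:
        # ValueError covers malformed JSON and undecodable bytes.
        return False, "could not read key file: {}".format(exc)
    # Assumption: a service account key is a JSON object with this "type".
    if not isinstance(key, dict) or key.get("type") != "service_account":
        return False, "key file is not a service account key"
    email = key.get("client_email", "<unknown>")
    return True, "using service account {}".format(email)


if __name__ == "__main__":
    ok, message = check_credentials_env()
    print(("OK: " if ok else "Problem: ") + message)
```

Running this in the same shell where you exported the variable catches a wrong or relative path early, before the client library reports a less specific authentication error.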
Using the Speech-to-Text API
Now that you have set up authentication, you can start using the Speech-to-Text API.
Transcribing Audio Files
Here's an example of how to transcribe an audio file using the Speech-to-Text API with Python. First, install the client library:
pip install google-cloud-speech
Create a Python script with the following content:
from google.cloud import speech_v1p1beta1 as speech

# The client reads credentials from GOOGLE_APPLICATION_CREDENTIALS.
client = speech.SpeechClient()


def transcribe_audio(file_path):
    # Read the whole file into memory; synchronous recognition suits short clips.
    with open(file_path, "rb") as audio_file:
        content = audio_file.read()

    audio = speech.RecognitionAudio(content=content)
    # The encoding and sample rate must match the audio file.
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )

    response = client.recognize(config=config, audio=audio)

    # Each result covers one portion of the audio; print its top alternative.
    for result in response.results:
        print("Transcript: {}".format(result.alternatives[0].transcript))


transcribe_audio("path/to/your/audiofile.wav")
Replace "path/to/your/audiofile.wav" with the path to your audio file, then run the script. The output looks like this:
Transcript: Your transcribed text will appear here.
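The config in the script declares LINEAR16 audio at 16000 Hz, so the file has to match those values. As a rough pre-flight check, the sketch below reads the WAV header with Python's standard wave module. The helper name check_wav_format is hypothetical, and the mono requirement is an assumption based on the script not setting a channel count.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return a list of mismatches between a WAV file and the expected format.

    An empty list means the file is 16-bit PCM with the expected sample rate
    and channel count. Hypothetical helper for illustration only.
    """
    problems = []
    with wave.open(path, "rb") as wav_file:
        sample_width = wav_file.getsampwidth()
        frame_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()

    if sample_width != 2:  # LINEAR16 means 2 bytes (16 bits) per sample
        problems.append(
            "sample width is {} bytes, expected 2".format(sample_width)
        )
    if frame_rate != expected_rate:
        problems.append(
            "sample rate is {} Hz, expected {}".format(frame_rate, expected_rate)
        )
    if channels != expected_channels:
        problems.append(
            "file has {} channels, expected {}".format(channels, expected_channels)
        )
    return problems
```

Calling check_wav_format on your file before transcribe_audio returns an empty list when the header matches the config, and a list of human-readable mismatches otherwise.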
Streaming Transcription
For real-time transcription, you can use the streaming API. Here's an example that streams microphone audio captured with PyAudio (install it with pip install pyaudio):
import queue
import re
import sys

import pyaudio
from google.cloud import speech_v1p1beta1 as speech

# Audio recording parameters
RATE = 16000
CHUNK = int(RATE / 10)  # 100ms


class MicrophoneStream:
    """Opens a recording stream as a generator yielding the audio chunks."""

    def __init__(self, rate, chunk):
        self.rate = rate
        self.chunk = chunk
        # Thread-safe buffer filled by the PyAudio callback.
        self._buff = queue.Queue()
        self.closed = True

    def __enter__(self):
        self._audio_interface = pyaudio.PyAudio()
        self._audio_stream = self._audio_interface.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self.rate,
            input=True,
            frames_per_buffer=self.chunk,
            # Fill the buffer from a callback so the calling thread stays
            # free to send requests.
            stream_callback=self._fill_buffer,
        )
        self.closed = False
        return self

    def __exit__(self, type, value, traceback):
        self._audio_stream.stop_stream()
        self._audio_stream.close()
        self.closed = True
        # Signal the generator to terminate.
        self._buff.put(None)
        self._audio_interface.terminate()

    def _fill_buffer(self, in_data, frame_count, time_info, status_flags):
        """Collect data from the audio stream into the buffer."""
        self._buff.put(in_data)
        return None, pyaudio.paContinue

    def generator(self):
        while not self.closed:
            # Block until at least one chunk is available; None marks the end.
            chunk = self._buff.get()
            if chunk is None:
                return
            data = [chunk]

            # Drain whatever else is already buffered.
            while True:
                try:
                    chunk = self._buff.get(block=False)
                    if chunk is None:
                        return
                    data.append(chunk)
                except queue.Empty:
                    break

            yield b"".join(data)


def listen_print_loop(responses):
    """Print interim results on one line and final results on their own line."""
    num_chars_printed = 0
    for response in responses:
        if not response.results:
            continue

        # Only the first result is considered here.
        result = response.results[0]
        if not result.alternatives:
            continue

        transcript = result.alternatives[0].transcript

        # Pad with spaces so a shorter line fully overwrites the previous one.
        overwrite_chars = " " * (num_chars_printed - len(transcript))

        if not result.is_final:
            # "\r" returns to the start of the line so the next result overwrites it.
            sys.stdout.write(transcript + overwrite_chars + "\r")
            sys.stdout.flush()
            num_chars_printed = len(transcript)
        else:
            print(transcript + overwrite_chars)

            # Stop when the speaker says "exit" or "quit".
            if re.search(r"\b(exit|quit)\b", transcript, re.I):
                print("Exiting...")
                break

            num_chars_printed = 0


def main():
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=RATE,
        language_code="en-US",
    )
    streaming_config = speech.StreamingRecognitionConfig(
        config=config,
        interim_results=True,
    )

    with MicrophoneStream(RATE, CHUNK) as stream:
        audio_generator = stream.generator()
        requests = (
            speech.StreamingRecognizeRequest(audio_content=content)
            for content in audio_generator
        )
        responses = client.streaming_recognize(
            config=streaming_config, requests=requests
        )
        listen_print_loop(responses)


if __name__ == "__main__":
    main()
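The constant CHUNK = int(RATE / 10) is measured in frames, not bytes: at 16000 Hz with 16-bit mono samples, 100 ms of audio is 1600 frames, or 3200 bytes. The sketch below works through that arithmetic by slicing a raw PCM buffer into 100 ms pieces, the same granularity the microphone stream produces. The names chunk_audio and BYTES_PER_SAMPLE are hypothetical and used only for illustration, and the code assumes mono 16-bit audio.

```python
RATE = 16000          # samples per second, as in the streaming example
BYTES_PER_SAMPLE = 2  # 16-bit samples (paInt16 / LINEAR16), mono assumed


def chunk_audio(data, rate=RATE, chunk_ms=100):
    """Yield successive slices of raw mono 16-bit PCM, each about chunk_ms long.

    Hypothetical helper for illustration; the last slice may be shorter.
    """
    frames_per_chunk = rate * chunk_ms // 1000
    bytes_per_chunk = frames_per_chunk * BYTES_PER_SAMPLE
    if bytes_per_chunk <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(data), bytes_per_chunk):
        yield data[start:start + bytes_per_chunk]


if __name__ == "__main__":
    one_second = bytes(RATE * BYTES_PER_SAMPLE)  # one second of silence
    chunks = list(chunk_audio(one_second))
    print(len(chunks), len(chunks[0]))  # prints: 10 3200
```

Each slice could be wrapped in a StreamingRecognizeRequest(audio_content=...) in the same way main() wraps the microphone chunks, which is one way to try the streaming flow with recorded audio instead of a live microphone.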
Conclusion
In this tutorial, you learned how to set up and use Google Cloud's Speech-to-Text API for both transcribing audio files and real-time streaming transcription. With these tools, you can build powerful applications that convert speech to text, enhancing accessibility and user interaction.