
Google Cloud Speech-to-Text Tutorial

Introduction

Google Cloud's Speech-to-Text service converts audio to text using neural network models. The API recognizes over 120 languages and variants, which helps you support a global user base. It is useful both for transcribing audio files and for processing audio streams in real time.

Prerequisites

Before you begin, ensure you have the following:

  • A Google Cloud Platform (GCP) account.
  • Billing enabled for your GCP account.
  • Google Cloud SDK installed.
  • A project created in the Google Cloud Console.

Enable the Speech-to-Text API

First, you need to enable the Speech-to-Text API for your project:

  1. Go to the Speech-to-Text API page in the Google Cloud Console.
  2. Click "Enable" to enable the API.

Set Up Authentication

This tutorial authenticates to Google Cloud with a service account key. Follow these steps to set up your service account:

  1. In the Google Cloud Console, go to the Service accounts page.
  2. Click "Create Service Account".
  3. Enter a name and description for the service account, then click "Create".
  4. Assign the "Project > Editor" role to the service account, then click "Continue".
  5. Click "Done" to finish creating the service account.
  6. Click the "Actions" menu for your new service account, then select "Create Key".
  7. Select "JSON" as the key type and click "Create".
  8. Save the JSON key file to a secure location.

Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the file path of the JSON file:

export GOOGLE_APPLICATION_CREDENTIALS="path/to/your/service-account-file.json"
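
Before calling the API, it can help to confirm that the variable is visible to your Python process and points to a usable key file. The following is a minimal sketch that uses only the standard library. The helper name check_credentials is hypothetical and not part of any Google library, and the sketch assumes the key file contains the "type" and "client_email" fields found in service account keys downloaded from the console.

```python
import json
import os


def check_credentials(env=os.environ):
    """Return the service account email from the key file, or raise RuntimeError."""
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise RuntimeError(f"Key file not found: {path}")

    # The key file is plain JSON, so it can be inspected without any Google library.
    with open(path, "r", encoding="utf-8") as key_file:
        key = json.load(key_file)

    if key.get("type") != "service_account":
        raise RuntimeError("Key file does not look like a service account key")
    return key.get("client_email")
```

Calling check_credentials() from the same shell where you ran the export command should return the email address of the service account you created. A RuntimeError usually means the variable was set in a different shell session or the path is wrong.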

Using the Speech-to-Text API

Now that you have set up authentication, you can start using the Speech-to-Text API.

Transcribing Audio Files

The following example transcribes an audio file with Python. First, install the client library:

pip install google-cloud-speech

Create a Python script with the following content:

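from google.cloud import speech_v1p1beta1 as speech

# The client reads credentials from GOOGLE_APPLICATION_CREDENTIALS.
client = speech.SpeechClient()


def transcribe_audio(file_path):
    # Read the whole file into memory. Synchronous recognition is meant
    # for short audio (about one minute or less).
    with open(file_path, "rb") as audio_file:
        content = audio_file.read()

    audio = speech.RecognitionAudio(content=content)
    # The encoding and sample rate must match the audio file.
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )

    response = client.recognize(config=config, audio=audio)

    # Each result covers a consecutive portion of the audio;
    # the first alternative is the most likely transcript.
    for result in response.results:
        print("Transcript: {}".format(result.alternatives[0].transcript))


transcribe_audio("path/to/your/audiofile.wav")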

Replace "path/to/your/audiofile.wav" with the path to your audio file, then run the script. It prints one line per recognized segment, similar to:

Transcript: Your transcribed text will appear here.
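
The configuration above declares LINEAR16 audio at 16000 Hz, so the file you pass in needs to match. As a rough sketch, the standard library's wave module can check a WAV file before you send it. The helper name check_wav is hypothetical, and the mono check assumes the configuration leaves the channel count at its default, as the example above does.

```python
import wave


def check_wav(file_path, expected_rate=16000):
    """Return a list of mismatches between a WAV file and the config above."""
    problems = []
    with wave.open(file_path, "rb") as wav:
        # LINEAR16 means 16-bit samples, which is 2 bytes per sample.
        if wav.getsampwidth() != 2:
            problems.append(
                f"sample width is {wav.getsampwidth() * 8} bits, expected 16"
            )
        if wav.getframerate() != expected_rate:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {expected_rate}"
            )
        # Assumption: no channel count is set in the config, so mono is expected.
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels found, expected 1")
    return problems
```

An empty list means the file matches the configuration. Otherwise, either convert the audio or adjust sample_rate_hertz to the rate the file actually uses.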

Streaming Transcription

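For real-time transcription, use the streaming API. The following example captures microphone audio with PyAudio (a separate package, typically installed with pip install pyaudio) and prints transcripts as they arrive. Say "exit" or "quit" to stop: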

import queue
import re
import sys

import pyaudio
from google.cloud import speech_v1p1beta1 as speech

RATE = 16000
CHUNK = int(RATE / 10)  # 100ms


class MicrophoneStream:
    """Opens a recording stream as a generator yielding audio chunks."""

    def __init__(self, rate, chunk):
        self.rate = rate
        self.chunk = chunk

        # Thread-safe buffer filled by the PyAudio callback.
        self._buff = queue.Queue()
        self.closed = True

    def __enter__(self):
        self._audio_interface = pyaudio.PyAudio()
        self._audio_stream = self._audio_interface.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self.rate,
            input=True,
            frames_per_buffer=self.chunk,
            # Fill the buffer asynchronously so the input device does not
            # overflow while the calling thread makes network requests.
            stream_callback=self._fill_buffer,
        )

        self.closed = False

        return self

    def __exit__(self, type, value, traceback):
        self._audio_stream.stop_stream()
        self._audio_stream.close()
        self.closed = True
        # Signal the generator to terminate.
        self._buff.put(None)
        self._audio_interface.terminate()

    def _fill_buffer(self, in_data, frame_count, time_info, status_flags):
        """Collect data from the audio stream into the buffer."""
        self._buff.put(in_data)
        return None, pyaudio.paContinue

    def generator(self):
        while not self.closed:
            # Block until at least one chunk is available; None marks
            # the end of the stream.
            chunk = self._buff.get()
            if chunk is None:
                return
            data = [chunk]

            # Consume whatever else is already buffered.
            while True:
                try:
                    chunk = self._buff.get(block=False)
                    if chunk is None:
                        return
                    data.append(chunk)
                except queue.Empty:
                    break

            yield b"".join(data)


def listen_print_loop(responses):
    """Print interim results on one line, then each final result on its own line."""
    num_chars_printed = 0
    for response in responses:
        if not response.results:
            continue

        result = response.results[0]
        if not result.alternatives:
            continue

        transcript = result.alternatives[0].transcript
        # Pad with spaces so a shorter transcript fully overwrites the previous one.
        overwrite_chars = " " * (num_chars_printed - len(transcript))

        if not result.is_final:
            sys.stdout.write(transcript + overwrite_chars + "\r")
            sys.stdout.flush()

            num_chars_printed = len(transcript)
        else:
            print(transcript + overwrite_chars)
            # Stop when the speaker says "exit" or "quit".
            if re.search(r"\b(exit|quit)\b", transcript, re.I):
                print("Exiting...")
                break

            num_chars_printed = 0


def main():
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=RATE,
        language_code="en-US",
    )

    streaming_config = speech.StreamingRecognitionConfig(
        config=config,
        interim_results=True,
    )

    with MicrophoneStream(RATE, CHUNK) as stream:
        audio_generator = stream.generator()
        requests = (
            speech.StreamingRecognizeRequest(audio_content=content)
            for content in audio_generator
        )

        responses = client.streaming_recognize(
            config=streaming_config, requests=requests
        )
        listen_print_loop(responses)


if __name__ == "__main__":
    main()
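
The streaming example sends audio in chunks of RATE / 10 frames, which is about 100 ms of sound each. The same chunking can be tried without a microphone. The sketch below is an illustration rather than part of the client library: chunk_audio is a hypothetical helper, and it assumes the input is raw 16-bit mono PCM bytes with no WAV header.

```python
RATE = 16000            # samples per second, as in the streaming example
BYTES_PER_SAMPLE = 2    # 16-bit samples (paInt16 / LINEAR16)
CHUNK = int(RATE / 10)  # frames per chunk, about 100 ms


def chunk_audio(pcm_bytes, frames_per_chunk=CHUNK):
    """Yield consecutive slices of raw mono PCM, each at most frames_per_chunk frames."""
    step = frames_per_chunk * BYTES_PER_SAMPLE
    for start in range(0, len(pcm_bytes), step):
        yield pcm_bytes[start:start + step]


# One second of silence splits into ten chunks of 3200 bytes each.
one_second = b"\x00" * (RATE * BYTES_PER_SAMPLE)
sizes = [len(chunk) for chunk in chunk_audio(one_second)]
print(len(sizes), sizes[0])
```

Each yielded chunk could be wrapped in speech.StreamingRecognizeRequest(audio_content=chunk) in place of the microphone generator used in main(). Smaller chunks mean more frequent requests; larger chunks mean longer gaps between interim results.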

Conclusion

In this tutorial, you learned how to set up and use Google Cloud's Speech-to-Text API for both transcribing audio files and real-time streaming transcription. With these tools, you can build powerful applications that convert speech to text, enhancing accessibility and user interaction.