Google Cloud Speech-to-Text Tutorial

Introduction

Google Cloud's Speech-to-Text service converts audio to text using neural network models. The API recognizes over 120 languages and variants, and it handles both pre-recorded audio files and real-time audio streams.

Prerequisites

Before you begin, ensure you have the following:

A Google Cloud Platform (GCP) account.
Billing enabled for your GCP account.
Google Cloud SDK installed.
A project created in the Google Cloud Console.

Enable the Speech-to-Text API

First, you need to enable the Speech-to-Text API for your project:

Go to the Speech-to-Text API page in the Google Cloud Console.
Click "Enable".

Set Up Authentication

The client library authenticates to Google Cloud with a service account. Follow these steps to create one and download its key:

In the Google Cloud Console, go to the Service accounts page.
Click "Create Service Account".
Enter a name and description for the service account, then click "Create".
Assign the "Project > Editor" role to the service account, then click "Continue".
Click "Done" to finish creating the service account.
Click the "Actions" menu for your new service account, then select "Create Key".
Select "JSON" as the key type and click "Create".
Save the JSON key file to a secure location.

Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the file path of the JSON file:

export GOOGLE_APPLICATION_CREDENTIALS="path/to/your/service-account-file.json"
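
Before making any API calls, it can help to confirm that the variable is set and points at a usable key file. The sketch below is one way to do that with only the standard library. The helper name check_credentials_file is hypothetical, and the field list is an assumption based on the fields a service account JSON key normally contains.

```python
import json
import os

# Assumption: fields a service account JSON key is expected to contain.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_credentials_file(path):
    """Return a list of problems found with the key file (empty if none)."""
    if not path:
        return ["GOOGLE_APPLICATION_CREDENTIALS is not set"]
    if not os.path.isfile(path):
        return [f"no file found at {path}"]
    try:
        with open(path, "r", encoding="utf-8") as key_file:
            key = json.load(key_file)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"could not read key file: {exc}"]
    if not isinstance(key, dict):
        return ["key file is not a JSON object"]
    # Report every missing field rather than stopping at the first one.
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in key]
    if key.get("type") not in (None, "service_account"):
        problems.append(f"unexpected key type: {key['type']}")
    return problems


if __name__ == "__main__":
    issues = check_credentials_file(os.environ.get("GOOGLE_APPLICATION_CREDENTIALS"))
    print("Credentials look usable." if not issues else "\n".join(issues))
```

This check only inspects the file's shape. It does not contact Google Cloud, so a key that has been revoked or that lacks the required role would still pass.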

Using the Speech-to-Text API

Now that you have set up authentication, you can start using the Speech-to-Text API.

Transcribing Audio Files

The following example transcribes an audio file with the Python client library. First, install the library:

pip install google-cloud-speech

Create a Python script with the following content:


    from google.cloud import speech_v1p1beta1 as speech

    # The client reads credentials from GOOGLE_APPLICATION_CREDENTIALS.
    client = speech.SpeechClient()


    def transcribe_audio(file_path):
        # Read the whole file into memory as raw bytes.
        with open(file_path, "rb") as audio_file:
            content = audio_file.read()

        audio = speech.RecognitionAudio(content=content)
        # The encoding and sample rate must match the audio file.
        config = speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
            sample_rate_hertz=16000,
            language_code="en-US",
        )

        response = client.recognize(config=config, audio=audio)

        # Each result covers a consecutive portion of the audio;
        # the first alternative is the most likely transcript.
        for result in response.results:
            print(f"Transcript: {result.alternatives[0].transcript}")


    transcribe_audio("path/to/your/audiofile.wav")

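Replace "path/to/your/audiofile.wav" with the path to your audio file. To match the configuration above, the file should contain 16-bit linear PCM audio sampled at 16 kHz. Run the script, and the output looks like this: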

Transcript: Your transcribed text will appear here.
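
Because the encoding and sample rate in the configuration must match the file, it can be worth reading the WAV header before sending the audio. The sketch below uses Python's standard wave module. The helper names describe_wav and check_wav are hypothetical, and the 60-second threshold is an assumption about the limit for synchronous requests; check the current quota documentation before relying on it.

```python
import wave

# Assumption: synchronous recognition accepts about one minute of audio at most.
SYNC_LIMIT_SECONDS = 60


def describe_wav(file_path):
    """Read the WAV header and return the values the recognition config depends on."""
    with wave.open(file_path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        return {
            "sample_rate_hertz": sample_rate,
            "channels": wav_file.getnchannels(),
            "bits_per_sample": wav_file.getsampwidth() * 8,
            "duration_seconds": wav_file.getnframes() / sample_rate,
        }


def check_wav(file_path, expected_rate=16000):
    """Return warnings where the file differs from the tutorial's configuration."""
    info = describe_wav(file_path)
    warnings = []
    if info["bits_per_sample"] != 16:
        warnings.append("LINEAR16 expects 16-bit samples")
    if info["sample_rate_hertz"] != expected_rate:
        warnings.append(f"set sample_rate_hertz to {info['sample_rate_hertz']}")
    if info["channels"] != 1:
        warnings.append("audio is not mono; the config above assumes one channel")
    if info["duration_seconds"] > SYNC_LIMIT_SECONDS:
        warnings.append("audio may be too long for a synchronous request")
    return warnings
```

Calling check_wav("path/to/your/audiofile.wav") before transcribe_audio returns an empty list when the file matches the configuration, and a list of human-readable warnings otherwise.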

Streaming Transcription

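For real-time transcription, you can use the streaming API. The example below captures microphone audio with PyAudio, which you install separately with pip install pyaudio, and prints transcripts as they arrive. Saying "exit" or "quit" ends the session.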


    import queue
    import re
    import sys

    import pyaudio
    from google.cloud import speech_v1p1beta1 as speech

    # Audio recording parameters.
    RATE = 16000
    CHUNK = int(RATE / 10)  # 100ms


    class MicrophoneStream:
        """Opens a recording stream as a generator yielding audio chunks."""

        def __init__(self, rate, chunk):
            self.rate = rate
            self.chunk = chunk

            # Thread-safe buffer filled by the PyAudio callback.
            self._buff = queue.Queue()
            self.closed = True

        def __enter__(self):
            self._audio_interface = pyaudio.PyAudio()
            self._audio_stream = self._audio_interface.open(
                format=pyaudio.paInt16,
                channels=1,
                rate=self.rate,
                input=True,
                frames_per_buffer=self.chunk,
                # Fill the buffer asynchronously so no audio is lost
                # while the main thread waits on network requests.
                stream_callback=self._fill_buffer,
            )

            self.closed = False

            return self

        def __exit__(self, type, value, traceback):
            self._audio_stream.stop_stream()
            self._audio_stream.close()
            self.closed = True
            # Signal the generator to terminate.
            self._buff.put(None)
            self._audio_interface.terminate()

        def _fill_buffer(self, in_data, frame_count, time_info, status_flags):
            """Collect data from the audio stream into the buffer."""
            self._buff.put(in_data)
            return None, pyaudio.paContinue

        def generator(self):
            while not self.closed:
                # Block until at least one chunk is available.
                chunk = self._buff.get()
                if chunk is None:
                    return
                data = [chunk]

                # Drain whatever else is already buffered.
                while True:
                    try:
                        chunk = self._buff.get(block=False)
                        if chunk is None:
                            return
                        data.append(chunk)
                    except queue.Empty:
                        break

                yield b"".join(data)


    def listen_print_loop(responses):
        """Print interim results on one line, overwriting them until final."""
        num_chars_printed = 0
        for response in responses:
            if not response.results:
                continue

            # Only the first result is considered while streaming.
            result = response.results[0]
            if not result.alternatives:
                continue

            transcript = result.alternatives[0].transcript
            # Pad with spaces so a shorter transcript fully overwrites a longer one.
            overwrite_chars = " " * (num_chars_printed - len(transcript))

            if not result.is_final:
                sys.stdout.write(transcript + overwrite_chars + "\r")
                sys.stdout.flush()

                num_chars_printed = len(transcript)
            else:
                print(transcript + overwrite_chars)
                # Stop when the speaker says "exit" or "quit".
                if re.search(r"\b(exit|quit)\b", transcript, re.I):
                    print("Exiting...")
                    break

                num_chars_printed = 0


    def main():
        client = speech.SpeechClient()
        config = speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
            sample_rate_hertz=RATE,
            language_code="en-US",
        )

        streaming_config = speech.StreamingRecognitionConfig(
            config=config,
            interim_results=True,
        )

        with MicrophoneStream(RATE, CHUNK) as stream:
            audio_generator = stream.generator()
            requests = (
                speech.StreamingRecognizeRequest(audio_content=content)
                for content in audio_generator
            )

            responses = client.streaming_recognize(streaming_config, requests)
            listen_print_loop(responses)


    if __name__ == "__main__":
        main()
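
To exercise the streaming path without a microphone, pre-recorded audio can be split into chunks of the same size the microphone delivers. The sketch below is a minimal illustration of that chunking arithmetic, assuming raw 16-bit mono PCM bytes. The function name chunk_audio and its parameters are hypothetical and not part of the client library.

```python
def chunk_audio(raw_audio, rate=16000, chunk_ms=100, bytes_per_sample=2):
    """Yield successive chunks of raw mono PCM audio, each about chunk_ms long."""
    # At 16000 Hz, 100 ms is 1600 frames; at 2 bytes per frame that is 3200 bytes.
    frames_per_chunk = int(rate * chunk_ms / 1000)
    chunk_size = frames_per_chunk * bytes_per_sample
    if chunk_size <= 0:
        raise ValueError("chunk size must be positive")
    for start in range(0, len(raw_audio), chunk_size):
        # The final chunk may be shorter than chunk_size.
        yield raw_audio[start:start + chunk_size]


if __name__ == "__main__":
    one_second_of_silence = bytes(16000 * 2)
    chunks = list(chunk_audio(one_second_of_silence))
    print(len(chunks), len(chunks[0]))  # prints "10 3200"
```

Each chunk could then be wrapped in speech.StreamingRecognizeRequest(audio_content=chunk) in place of the microphone generator. Note that this generator sends audio as fast as it is read rather than at real-time pace.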

Conclusion

In this tutorial, you learned how to set up and use Google Cloud's Speech-to-Text API for both transcribing audio files and real-time streaming transcription. With these tools, you can build powerful applications that convert speech to text, enhancing accessibility and user interaction.