Companies that analyze only textual data are missing out on actionable insights. Customer feedback calls handled by call centers are a rich source of people's opinions and thoughts. To analyze these calls, speech-to-text transcription is needed to convert the audio call logs into text, which can then be analyzed with Visual Text Analytics (VTA). In this post, I will show you how to build a machine learning pipeline that covers both the speech-to-text and the text mining portions.
The speech-to-text pipeline comprises two models: an Acoustic Model and a Language Model. The Acoustic Model scores the computed audio features, and the Language Model decodes the scored features, turning the numeric scores into text. In other words, the Acoustic Model output shows how the machine interprets each part of the audio through numerical analysis of the audio wave, while the Language Model output provides the familiar text transcription, which can yield valuable, decision-ready insights once a VTA model is applied to it.
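To make the two stages concrete, here is a toy illustration (not the SAS models, just a conceptual sketch using NumPy): the Acoustic Model's output can be thought of as one row of scores per audio frame, and decoding collapses those per-frame scores into text. A real Language Model weighs whole word sequences rather than simply picking the best token per frame.
# Illustration only: a toy per-frame score matrix and a greedy "decode" step.
# This is NOT the SAS model; it just shows what "scored audio features" look like
# conceptually (one row of scores per frame, one column per unit of sound).
import numpy as np

tokens = [" ", "h", "i"]                      # toy token inventory
frame_scores = np.array([                     # 4 frames x 3 tokens (acoustic-model-style output)
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.2, 0.1, 0.7],
    [0.9, 0.05, 0.05],
])
# A greedy decoder keeps the best token per frame and collapses repeats;
# a real language model instead scores whole word sequences.
best = [tokens[i] for i in frame_scores.argmax(axis=1)]
collapsed = [t for i, t in enumerate(best) if i == 0 or t != best[i - 1]]
print("".join(collapsed).strip())             # -> "hi"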
For this project, the working data is the test-clean-100.tar.gz archive from the LibriSpeech ASR Corpus.
The SAS pre-trained speech-to-text model is the Models - English (ZIP) download from the SAS VDMML Support Documentation page, and it runs on SAS Viya.
The project is run in a Python environment connected to SAS Viya, using DLPy (SAS Viya Deep Learning API for Python).
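Before running the steps below, make sure the SWAT and DLPy packages are available in the Python environment. A quick check (the pip package names shown are the usual ones; adjust to your site's setup):
# Assumed installation (typical PyPI package names; adjust if your site packages them differently):
#   pip install swat sas-dlpy
import swat
import dlpy  # only needed if you use the DLPy helper APIs
print("SWAT version:", swat.__version__)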
import os
import sys
import getpass
from swat import *

# Point the client at the CA certificate used by the CAS server, then connect
os.environ['CAS_CLIENT_SSL_CA_LIST'] = r'<cert location>'
s = CAS("<server>", port=<port>, username='<log in username>',
        password=getpass.getpass("Password: "))

# Load the action sets used in this pipeline
s.loadactionset("audio")
s.loadactionset("searchanalytics")
s.loadactionset("deeplearn")
s.loadactionset("langmodel")

# Caslib for the models, transcripts, and output tables
s.addcaslib(name="myCasLib",
            path="<path to files>",
            activeOnAdd=True,
            dataSource={"srctype": "path"})

# Caslib for the raw audio files
s.addcaslib(name="audioCasLib",
            path="<path to audio>",
            activeOnAdd=False,
            dataSource={"srctype": "path"})
Two Cloud Analytic Services (CAS) libraries are added: myCasLib, which holds the models, transcripts, and output tables, and audioCasLib, which holds the audio files.
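As an optional sanity check (a quick sketch using the same session object s), you can confirm that the caslibs and action sets are in place before moving on:
# Confirm the caslibs and action sets loaded above are available
s.table.caslibInfo()        # should list myCasLib and audioCasLib
s.builtins.actionSetInfo()  # should include audio, deepLearn, langModel, and searchAnalytics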
# Import the pre-trained Language Model
s.langmodel.lmImport(table="language_model.sashdat",
                     casout={"name": "og_lm",
                             "caslib": "myCasLib",
                             "replace": True})

# Load the pre-trained Acoustic Model architecture, weights, and weight attributes
s.table.loadTable(path="acoustic_model_cpu.sashdat",
                  caslib="myCasLib",
                  casout={"name": "asr",
                          "caslib": "myCasLib",
                          "replace": True})
s.table.loadTable(path="acoustic_model_cpu_weights.sashdat",
                  caslib="myCasLib",
                  casout={"name": "pretrained_weights",
                          "caslib": "myCasLib",
                          "replace": True})
s.table.loadTable(path="acoustic_model_cpu_weights_attr.sashdat",
                  caslib="myCasLib",
                  casout={"name": "pretrained_weights_attr",
                          "caslib": "myCasLib",
                          "replace": True})

# Attach the attribute table to the weights table
s.table.attribute(task="ADD",
                  name="pretrained_weights",
                  attrtable="pretrained_weights_attr")

# Load the audio files listed in audio.txt into a CAS table
s.audio.loadAudio(path="audio.txt",
                  caslib="audioCasLib",
                  casout={"name": "audio",
                          "caslib": "myCasLib",
                          "replace": True})
# Compute MFCC-style audio features for each loaded audio file
vars_list = ["_path_"]
nFrames = 3500
nToken = 40

s.audio.computeFeatures(table="audio",
                        copyVars=vars_list,
                        casout={"name": "scoring_data",
                                "caslib": "myCasLib",
                                "replace": True},
                        audioColumn="_audio_",
                        frameExtractionOptions={"frameshift": 10,
                                                "framelength": 25,
                                                "dither": 0.0},
                        melBanksOptions={"nBins": nToken},
                        mfccOptions={"nCeps": nToken},
                        featureScalingMethod="STANDARDIZATION",
                        nOutputFrames=nFrames)
"audio.txt" contains the path to each individual audio file, with reference to the path of "myCasLib". The CAS table output "audio" contains the complete path to and binary form of each audio file. "audio" CAS table is then used as the input table, where the audio features are computed, with each frame being 25ms long, with each frame being a 10ms deviation from the previous.
nToken value is 40 which is the number of unique units of sounds for each frame. 40 is selected as the English Language has an estimated of 40 distinct sounds that are useful for speech recognition.
nFrames value is 3500 which will produce a maximum number of 3500 small portions of the audio to process, which is able to compute for a maximum audio length of 35 seconds.
The audio files are computed to obtain statistical representations of the unique units of sound in the language, as computed audio features.
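The arithmetic behind these settings is straightforward. The short snippet below (a rough calculation, since the exact padding behavior is internal to the action) shows why 3500 frames at a 10 ms frame shift covers about 35 seconds of audio:
# Rough frame arithmetic for the computeFeatures settings above
frameshift_ms = 10
framelength_ms = 25
nFrames = 3500

max_audio_ms = nFrames * frameshift_ms                           # 35,000 ms = 35 seconds
frames_for_10s = (10000 - framelength_ms) // frameshift_ms + 1   # frames needed for a 10-second clip
print(max_audio_ms / 1000, "seconds maximum")                    # 35.0
print(frames_for_10s, "frames for a 10-second clip")             # 998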
# Score the computed audio features with the Acoustic Model
s.dlscore(table="scoring_data",
          modelTable="asr",
          initWeights="pretrained_weights",
          nThreads=12,
          copyVars=["_path_"],
          casout={"name": "scoredData", "replace": True})
The Acoustic Model scores the audio features, producing a statistical representation for each frame that indicates how likely each unit of sound is. Adjust the nThreads value to match the computational capacity of the server so that scoring completes successfully.
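Optionally, you can peek at the scored table to confirm that scoring succeeded before decoding; for example (a sketch using standard table actions):
# One row per audio file; the score columns hold the per-frame outputs of the Acoustic Model
s.table.columnInfo(table="scoredData")
s.table.fetch(table="scoredData", fetchVars=["_path_"], to=5)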
# Decode the scored features into text with the Language Model
s.langModel.lmDecode(table="scoredData",
                     langModelTable="og_lm",
                     copyVars=["_path_"],
                     blankLabel=" ",
                     spaceLabel="&",
                     casout={"name": "results",
                             "replace": True})
The scored output from the Acoustic Model is fed through the Language Model, which decodes the frame-level scores into phonemes that are accumulated until there is a pause in the speaker's speech. The accumulated phonemes are compared against known phoneme sequences to identify the word that was spoken. The identified words are then subjected to grammar and sentence-structure checks by calculating the probability of word sequences, before being accepted as the textual transcription for each audio file.
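A few decoded transcriptions can be previewed directly from the results table; the decoded text is stored in the _audio_content_ column, which is also the column passed to calculateErrorRate below:
# Preview the first few transcriptions produced by lmDecode
s.table.fetch(table={"name": "results"},
              fetchVars=["_path_", "_audio_content_"],
              to=5)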
To calculate the error rate, the original transcripts of the audio files first have to be cleaned and loaded as a single file, as shown in the code snippet below.
wav_folder = "<input directory>"
audio_folder = "<myCasLib directory>"

file_out = open(wav_folder + "actual_transcription.txt", "w")
file_out.writelines("_path_,_transcript_\n")

for f in os.listdir(audio_folder):
    if f.endswith(".trans.txt"):
        file_in = open(audio_folder + f, "r")  # open the transcript file to read
        txt_flist = file_in.readlines()
        file_in.close()

        trans_list = []
        for line in txt_flist:
            line = line.strip()
            # find(): finds the first occurrence of a specified string
            audio_id = wav_folder + line[:line.find(" ")] + ".wav"  # text before the first space, plus ".wav"
            audio_text = line[line.find(" ") + 1:]  # everything after the first space
            trans = audio_id + "," + audio_text.strip() + "\n"
            trans_list.append(trans)

        trans_list.sort()
        file_out.writelines(trans_list)

file_out.close()
# Load the cleaned reference transcripts into CAS
s.table.loadtable(path="actual_transcription.txt",
                  caslib="myCasLib",
                  casout={"name": "reference_table",
                          "caslib": "myCasLib",
                          "replace": True})

# Compare the decoded transcriptions against the reference transcripts
s.calculateErrorRate(table={"name": "results",
                            "caslib": "myCasLib"},
                     tableId="_path_",
                     tableText="_audio_content_",
                     reference={"name": "reference_table",
                                "caslib": "myCasLib"},
                     referenceId="_path_",
                     referenceText="_transcript_")
After feeding the LibriSpeech audio through the speech-to-text model, the error rates are calculated, yielding a low character error rate of 6.69% and a word error rate of 13.61%.
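For reference, the word error rate is the word-level edit distance between the decoded and reference transcripts divided by the number of reference words: WER = (substitutions + deletions + insertions) / N. The calculateErrorRate action handles this for you; the snippet below is only a minimal illustration of the metric itself, not the action's implementation:
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # a reference word was deleted
                           dp[i][j - 1] + 1,         # an extra word was inserted
                           dp[i - 1][j - 1] + cost)  # substitution (or exact match)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 2 edits / 6 words = 0.33...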
All in all, the SAS pre-trained speech-to-text model is highly accurate, at roughly 95% accuracy, when transcribing audio with an American English accent and slang. With a highly accurate transcription in hand, further analytics can be performed on it to obtain valuable insights from the data.
A ZIP folder is attached to this article, containing template script files that can be used to complete this project, including a script for audio splitting and a simple sentiment analysis.
As mentioned earlier when computing audio features, the number of frames limits the audio length to a maximum of 35 seconds. The audio-splitting script, which uses the SAS GitLab package ASRLab, is therefore used to split the audio at pauses.
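The attached splitAudio.py handles this with ASRLab. If you just want a standalone way to experiment with pause-based splitting outside SAS, a pydub-based sketch like the following also works (pydub, the input file name, and the thresholds here are assumptions, not part of the attached scripts):
# Split a long recording on pauses using pydub (pip install pydub; requires ffmpeg for non-WAV input)
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_wav("long_recording.wav")          # hypothetical input file
chunks = split_on_silence(audio,
                          min_silence_len=700,               # treat pauses of at least 700 ms as breaks
                          silence_thresh=audio.dBFS - 16,    # "silence" relative to average loudness
                          keep_silence=200)                  # keep a little context around each chunk

os.makedirs("split_audio", exist_ok=True)
for i, chunk in enumerate(chunks):
    chunk.export(os.path.join("split_audio", f"chunk_{i:03d}.wav"), format="wav")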
Script | Description | Sequence to be run
calculateErrorRate.py | Compares the output transcript and the original transcript to get character, word, and sentence error rates. Not required if the aim is just to transcribe the audio files into text. | 8 (if applicable)
cleaningActualTranscript.py | Cleans the original transcripts of the audio files so they can be compared to the speech-to-text output to calculate the error rate. This script cleans the transcripts for the LibriSpeech audio, which is pre-split to match the split audio files stored in 'audio_folder'. Not required if the aim is just to transcribe the audio files into text. | 4 (if applicable; essential if the error rate is to be calculated)
dataPrep.py | Loads audio files and computes audio features. | 5
decode.py | Decodes scored data (from the Acoustic Model) using the Language Model. | 7
establishConnection.py | Establishes a connection to CAS and adds the CAS libraries. | 2
importModels.py | Imports the speech-to-text models (Acoustic Model and Language Model). | 3
mergeTranscripts.py | Merges the transcribed transcripts (since the audio was split originally) so that Visual Text Analytics can be done effectively. | 8
score.py | Scores audio features using the Acoustic Model. | 6
splitAudio.py | Splits audio files in the specified folder location and moves them to the working audio directory (for consistency). | 4
standardizeAudio.py | Moves pre-split audio files into the working audio file directory (for consistency). | 4
variablesToChange.py | Variables that need to be changed to match the user's environment; these variables are imported and called by the other scripts. | 1
vta.py | Performs a simple sentiment analysis on the transcribed output. | 9
writeAudioFilesintoTXT.py | Lists all WAV audio files in the working audio file directory into a TXT file. | 5