Companies that analyze only textual data are missing out on actionable insights. Customer feedback calls handled by call centers are a rich source of people's opinions and thoughts. To analyze these calls, speech-to-text transcription is needed to convert the audio call logs into text, which can then be analyzed with Visual Text Analytics (VTA). In this post, I will show you how to build a machine learning pipeline that covers both the speech-to-text and the text mining portions.
The speech-to-text pipeline comprises two models: an Acoustic Model and a Language Model. The Acoustic Model scores the computed audio features, and the Language Model decodes the scored features, turning the numeric scores into text. In other words, the Acoustic Model output shows how the machine interprets each part of the audio through numerical analysis of the audio wave, while the Language Model output provides the familiar text transcription, which can yield valuable, decision-ready insights once a VTA model is applied to it.
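To make the two stages concrete, here is a toy illustration (not the SAS models, just a conceptual sketch using NumPy): the Acoustic Model's output can be thought of as one row of scores per audio frame, and decoding collapses those per-frame scores into text. A real Language Model weighs whole word sequences rather than simply picking the best token per frame.
# Illustration only: a toy per-frame score matrix and a greedy "decode" step.
# This is NOT the SAS model; it just shows what "scored audio features" look like
# conceptually (one row of scores per frame, one column per unit of sound).
import numpy as np

tokens = [" ", "h", "i"]                      # toy token inventory
frame_scores = np.array([                     # 4 frames x 3 tokens (acoustic-model-style output)
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.2, 0.1, 0.7],
    [0.9, 0.05, 0.05],
])
# A greedy decoder keeps the best token per frame and collapses repeats;
# a real language model instead scores whole word sequences.
best = [tokens[i] for i in frame_scores.argmax(axis=1)]
collapsed = [t for i, t in enumerate(best) if i == 0 or t != best[i - 1]]
print("".join(collapsed).strip())             # -> "hi"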
For this project, the working data is the test-clean-100.tar.gz archive from the LibriSpeech ASR Corpus.
The SAS pre-trained speech-to-text model is the Models - English (ZIP) download from the SAS VDMML Support Documentation page, and it runs on SAS Viya.
The project is run in a Python environment connected to SAS Viya, using DLPy (SAS Viya Deep Learning API for Python).
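Before running the steps below, make sure the SWAT and DLPy packages are available in the Python environment. A quick check (the pip package names shown are the usual ones; adjust to your site's setup):
# Assumed installation (typical PyPI package names; adjust if your site packages them differently):
#   pip install swat sas-dlpy
import swat
import dlpy  # only needed if you use the DLPy helper APIs
print("SWAT version:", swat.__version__)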
import os
import sys
import getpass
from swat import *

# Point the client at the CA certificate used by the CAS server, then connect
os.environ['CAS_CLIENT_SSL_CA_LIST'] = r'<cert location>'
s = CAS("<server>", port=<port>, username='<log in username>',
        password=getpass.getpass("Password: "))

# Load the action sets used in this pipeline
s.loadactionset("audio")
s.loadactionset("searchanalytics")
s.loadactionset("deeplearn")
s.loadactionset("langmodel")

# Caslib for the models, transcripts, and output tables
s.addcaslib(name="myCasLib",
            path="<path to files>",
            activeOnAdd=True,
            dataSource={"srctype": "path"})

# Caslib for the raw audio files
s.addcaslib(name="audioCasLib",
            path="<path to audio>",
            activeOnAdd=False,
            dataSource={"srctype": "path"})
Two Cloud Analytic Services (CAS) libraries are added: myCasLib, which holds the models, transcripts, and output tables, and audioCasLib, which holds the audio files.
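As an optional sanity check (a quick sketch using the same session object s), you can confirm that the caslibs and action sets are in place before moving on:
# Confirm the caslibs and action sets loaded above are available
s.table.caslibInfo()        # should list myCasLib and audioCasLib
s.builtins.actionSetInfo()  # should include audio, deepLearn, langModel, and searchAnalytics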
# Import the pre-trained Language Model
s.langmodel.lmImport(table="language_model.sashdat",
                     casout={"name": "og_lm",
                             "caslib": "myCasLib",
                             "replace": True})

# Load the pre-trained Acoustic Model architecture, weights, and weight attributes
s.table.loadTable(path="acoustic_model_cpu.sashdat",
                  caslib="myCasLib",
                  casout={"name": "asr",
                          "caslib": "myCasLib",
                          "replace": True})
s.table.loadTable(path="acoustic_model_cpu_weights.sashdat",
                  caslib="myCasLib",
                  casout={"name": "pretrained_weights",
                          "caslib": "myCasLib",
                          "replace": True})
s.table.loadTable(path="acoustic_model_cpu_weights_attr.sashdat",
                  caslib="myCasLib",
                  casout={"name": "pretrained_weights_attr",
                          "caslib": "myCasLib",
                          "replace": True})

# Attach the attribute table to the weights table
s.table.attribute(task="ADD",
                  name="pretrained_weights",
                  attrtable="pretrained_weights_attr")

# Load the audio files listed in audio.txt into a CAS table
s.audio.loadAudio(path="audio.txt",
                  caslib="audioCasLib",
                  casout={"name": "audio",
                          "caslib": "myCasLib",
                          "replace": True})
# Compute MFCC-style audio features for each loaded audio file
vars_list = ["_path_"]
nFrames = 3500
nToken = 40

s.audio.computeFeatures(table="audio",
                        copyVars=vars_list,
                        casout={"name": "scoring_data",
                                "caslib": "myCasLib",
                                "replace": True},
                        audioColumn="_audio_",
                        frameExtractionOptions={"frameshift": 10,
                                                "framelength": 25,
                                                "dither": 0.0},
                        melBanksOptions={"nBins": nToken},
                        mfccOptions={"nCeps": nToken},
                        featureScalingMethod="STANDARDIZATION",
                        nOutputFrames=nFrames)
"audio.txt" contains the path to each individual audio file, with reference to the path of "myCasLib". The CAS table output "audio" contains the complete path to and binary form of each audio file. "audio" CAS table is then used as the input table, where the audio features are computed, with each frame being 25ms long, with each frame being a 10ms deviation from the previous.
nToken value is 40 which is the number of unique units of sounds for each frame. 40 is selected as the English Language has an estimated of 40 distinct sounds that are useful for speech recognition.
nFrames value is 3500 which will produce a maximum number of 3500 small portions of the audio to process, which is able to compute for a maximum audio length of 35 seconds.
The audio files are computed to obtain statistical representations of the unique units of sound in the language, as computed audio features.
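The arithmetic behind these settings is straightforward. The short snippet below (a rough calculation, since the exact padding behavior is internal to the action) shows why 3500 frames at a 10 ms frame shift covers about 35 seconds of audio:
# Rough frame arithmetic for the computeFeatures settings above
frameshift_ms = 10
framelength_ms = 25
nFrames = 3500

max_audio_ms = nFrames * frameshift_ms                           # 35,000 ms = 35 seconds
frames_for_10s = (10000 - framelength_ms) // frameshift_ms + 1   # frames needed for a 10-second clip
print(max_audio_ms / 1000, "seconds maximum")                    # 35.0
print(frames_for_10s, "frames for a 10-second clip")             # 998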
# Score the computed audio features with the Acoustic Model
s.dlscore(table="scoring_data",
          modelTable="asr",
          initWeights="pretrained_weights",
          nThreads=12,
          copyVars=["_path_"],
          casout={"name": "scoredData", "replace": True})
The Acoustic Model scores the audio features, producing a statistical representation for each frame that indicates how likely each unit of sound is. Adjust the nThreads value to match the computational capacity of the server so that scoring completes successfully.
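Optionally, you can peek at the scored table to confirm that scoring succeeded before decoding; for example (a sketch using standard table actions):
# One row per audio file; the score columns hold the per-frame outputs of the Acoustic Model
s.table.columnInfo(table="scoredData")
s.table.fetch(table="scoredData", fetchVars=["_path_"], to=5)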
# Decode the scored features into text with the Language Model
s.langModel.lmDecode(table="scoredData",
                     langModelTable="og_lm",
                     copyVars=["_path_"],
                     blankLabel=" ",
                     spaceLabel="&",
                     casout={"name": "results",
                             "replace": True})
The scored output from the Acoustic Model is fed through the Language Model, which decodes the frame-level scores into phonemes that are accumulated until there is a pause in the speaker's speech. The accumulated phonemes are compared against known phoneme sequences to identify the word that was spoken. The identified words are then subjected to grammar and sentence-structure checks by calculating the probability of word sequences, before being accepted as the textual transcription for each audio file.
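A few decoded transcriptions can be previewed directly from the results table; the decoded text is stored in the _audio_content_ column, which is also the column passed to calculateErrorRate below:
# Preview the first few transcriptions produced by lmDecode
s.table.fetch(table={"name": "results"},
              fetchVars=["_path_", "_audio_content_"],
              to=5)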
To calculate the error rate, the original transcripts of the audio files first have to be cleaned and loaded as a single file, as shown in the code snippet below.
wav_folder = "<input directory>"
audio_folder = "<myCasLib directory>"

file_out = open(wav_folder + "actual_transcription.txt", "w")
file_out.writelines("_path_,_transcript_\n")

for f in os.listdir(audio_folder):
    if f.endswith(".trans.txt"):
        file_in = open(audio_folder + f, "r")  # open the transcript file to read
        txt_flist = file_in.readlines()
        file_in.close()

        trans_list = []
        for line in txt_flist:
            line = line.strip()
            # find(): finds the first occurrence of a specified string
            audio_id = wav_folder + line[:line.find(" ")] + ".wav"  # text before the first space, plus ".wav"
            audio_text = line[line.find(" ") + 1:]  # everything after the first space
            trans = audio_id + "," + audio_text.strip() + "\n"
            trans_list.append(trans)

        trans_list.sort()
        file_out.writelines(trans_list)

file_out.close()
# Load the cleaned reference transcripts into CAS
s.table.loadtable(path="actual_transcription.txt",
                  caslib="myCasLib",
                  casout={"name": "reference_table",
                          "caslib": "myCasLib",
                          "replace": True})

# Compare the decoded transcriptions against the reference transcripts
s.calculateErrorRate(table={"name": "results",
                            "caslib": "myCasLib"},
                     tableId="_path_",
                     tableText="_audio_content_",
                     reference={"name": "reference_table",
                                "caslib": "myCasLib"},
                     referenceId="_path_",
                     referenceText="_transcript_")
After feeding the LibriSpeech audio through the speech-to-text model, the error rates are calculated, yielding a low character error rate of 6.69% and a word error rate of 13.61%.
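For reference, the word error rate is the word-level edit distance between the decoded and reference transcripts divided by the number of reference words: WER = (substitutions + deletions + insertions) / N. The calculateErrorRate action handles this for you; the snippet below is only a minimal illustration of the metric itself, not the action's implementation:
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # a reference word was deleted
                           dp[i][j - 1] + 1,         # an extra word was inserted
                           dp[i - 1][j - 1] + cost)  # substitution (or exact match)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 2 edits / 6 words = 0.33...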
All in all, the SAS pre-trained speech-to-text model is highly accurate, at roughly 95% accuracy, when transcribing audio with an American English accent and slang. With a highly accurate transcription in hand, further analytics can be performed on it to obtain valuable insights from the data.
A ZIP folder is attached to this article, containing template script files that can be used to complete this project, including a script for audio splitting and a simple sentiment analysis.
As mentioned earlier when computing audio features, the number of frames limits the audio length to a maximum of 35 seconds. The audio-splitting script, which uses the SAS GitLab package ASRLab, is therefore used to split the audio at pauses.
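The attached splitAudio.py handles this with ASRLab. If you just want a standalone way to experiment with pause-based splitting outside SAS, a pydub-based sketch like the following also works (pydub, the input file name, and the thresholds here are assumptions, not part of the attached scripts):
# Split a long recording on pauses using pydub (pip install pydub; requires ffmpeg for non-WAV input)
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_wav("long_recording.wav")          # hypothetical input file
chunks = split_on_silence(audio,
                          min_silence_len=700,               # treat pauses of at least 700 ms as breaks
                          silence_thresh=audio.dBFS - 16,    # "silence" relative to average loudness
                          keep_silence=200)                  # keep a little context around each chunk

os.makedirs("split_audio", exist_ok=True)
for i, chunk in enumerate(chunks):
    chunk.export(os.path.join("split_audio", f"chunk_{i:03d}.wav"), format="wav")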
Script | Description | Sequence to be run
calculateErrorRate.py | Compares the output transcript and the original transcript to get character, word, and sentence error rates. Not required if the aim is just to transcribe the audio files into text. | 8 (if applicable)
cleaningActualTranscript.py | Cleans the original transcripts of the audio files so they can be compared to the speech-to-text output to calculate the error rate. This script cleans the transcripts for the LibriSpeech audio, which is pre-split to match the split audio files stored in 'audio_folder'. Not required if the aim is just to transcribe the audio files into text. | 4 (if applicable; essential if the error rate is to be calculated)
dataPrep.py | Loads audio files and computes audio features. | 5
decode.py | Decodes scored data (from the Acoustic Model) using the Language Model. | 7
establishConnection.py | Establishes a connection to CAS and adds the CAS libraries. | 2
importModels.py | Imports the speech-to-text models (Acoustic Model and Language Model). | 3
mergeTranscripts.py | Merges the transcribed transcripts (since the audio was split originally) so that Visual Text Analytics can be done effectively. | 8
score.py | Scores audio features using the Acoustic Model. | 6
splitAudio.py | Splits audio files in the specified folder location and moves them to the working audio directory (for consistency). | 4
standardizeAudio.py | Moves pre-split audio files into the working audio file directory (for consistency). | 4
variablesToChange.py | Variables that need to be changed to match the user's environment; these variables are imported and called by the other scripts. | 1
vta.py | Performs a simple sentiment analysis on the transcribed output. | 9
writeAudioFilesintoTXT.py | Lists all WAV audio files in the working audio file directory into a TXT file. | 5