Create Your To-Do Lists Using Speech Recognition With Python

Last winter, I dove into speech recognition. If you recall, I used the SpeechRecognition library in Python (with PyAudio supplying the microphone input, to be more specific). However, that project came with its difficulties. Because I was trying to transcribe in real time, I needed my microphone as an input device.
On an Ubuntu Server, I couldn’t use my microphone because I couldn’t change the default input device. When I tried WSL, there were similar restrictions on changing the input device. Even on Fedora, after hours of research, the audio jack still wouldn’t accept my microphone, at least not without far more time and effort. Finally, I just used Ubuntu installed directly on my computer. Not on a server, just the OS.
Now that it’s been almost a year, I wanted to revisit speech recognition, as I need it now for a new project. If you’ve read some of my other articles, you may know that I’ve been trying to slowly build out my own smart home. The ultimate goal is to have a smart mirror to tell me the news and weather in the morning, maybe some interactivity, and all implemented with AR (Augmented Reality). That’s where I found inspiration for my newest project.
Currently, I use Trello cards to manage my to-do tasks so I can stay somewhat organized. Although I may keep these cards, there will come a time when I want to tell the smart mirror things I plan to do, lists I want written down, ideas I have for a new project, and so on. Once it works in real time, maybe it could even act on my input to complete certain tasks, such as researching something or giving me specific information. But I’m getting ahead of myself.
For this project, since I’m just starting back in, I want to be able to pre-record some audio to make a to-do list.
Starting out, this time I wanted to try using an existing API. Hopefully, that would smooth over some of the clunkiness from my last attempt and give me more accurate results than the few words my last speech recognition project could handle. After a little time searching on Google, I found AssemblyAI and decided to give it a try. I signed up for an account, got my API key, and went straight to the documentation to start coding. So, without any further delay, let’s take a look at what I needed to install for this project.

Installing packages

Because I’ll be making API calls, I had to install the requests library. That was done with a quick pip command:
pip3 install requests
Because I needed to store the to-do items, I also needed a database for this project. Because the data is small, I decided to go with an embedded, file-based database. TinyDB was my choice, as it is lightweight and easy to use:
pip3 install tinydb
The next installs have to do with audio input. To record the audio, the program needs to read from a microphone, which I did with the sounddevice library. Because sounddevice is built on PortAudio, I also needed to install libportaudio:
sudo apt-get install libportaudio2
pip3 install sounddevice
The final installation has to do with writing the recording to a file. To write the WAV file, I used scipy, which can also be installed with pip:
pip3 install scipy
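Before moving on, a quick optional sanity check can confirm the installs worked. This is just a sketch of my own, not part of the project code; the package list matches the installs above:

```python
import importlib

def check_installs(names=("requests", "tinydb", "sounddevice", "scipy")):
    """Return 'ok' or 'missing' for each package name."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = "ok"
        except ImportError:
            status[name] = "missing"
    return status

print(check_installs())
```

If anything prints as missing, rerun the corresponding install command above before continuing.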

Signing up for AssemblyAI

To get the API key, I first had to create an account. Luckily, there was a free option to get started, so after only a minute or so I had my API key and was ready to go. One feature I did find interesting was the Usage page. For now, I’ll only be transcribing a single file, but it’s good to have a manual check on how much I’ll be using down the road:
Another interesting page I found was the Processing Queue, which also shows the Completed and Error queues. The API calls interact with those queues, so it was good to have a visualization of what the API is working through:
But, now that I had my API key, I was ready to start coding.

Beginning the code

As a rundown of what the code does: first it creates an audio file. From there, the file is sent via the API, which in a few calls transcribes it and returns the text to be read back. I split this across two files to keep it somewhat modular. To help with reusability, all of my API calls are wrapped in a class, which will be the first file we write. The second file contains the more specific code that creates the audio file, makes the necessary calls to the API via the class, saves the result into the database, and reads the results back.
The first file will be titled speech_to_text.py. From there, two imports were needed:
import requests
import sys
Next, I declared the class. Following the declaration, I included a docstring as good practice, then moved on to the initializer:
class speech_to_text:
    """
    Wrapper around the AssemblyAI speech-to-text REST API.
    """

    def __init__(self, api_key):
        self.endpoint = "https://api.assemblyai.com/v2/"
        self.header = {
            "authorization": api_key
        }
All of the methods are created inside the class, so they are indented. Now, the way I had to send the audio file is a little involved, since the file is stored locally after it’s created. If it were already hosted at a URL, I wouldn’t need this next function at all; I could pass the URL straight into the function after it, which I’ll explain momentarily. So, first, the local file has to be uploaded and the response parsed as JSON. Recall that to “insert” in the API world we use the HTTP verb POST.
def upload_file(self, file):
    response = requests.post(f"{self.endpoint}upload", headers=self.header, data=self.__read_file(file))
    return response.json()
Breaking that down, it is posting, or uploading, to the API endpoint called “upload”. The URL is built with an f-string, which is just a cleaner way to concatenate. You may also notice the __read_file method. That’s something I wrote and will explain in a few minutes.
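For anyone unfamiliar with f-strings, here is a minimal sketch of how that URL gets built; the endpoint value matches the one set in the initializer:

```python
endpoint = "https://api.assemblyai.com/v2/"

# the expression inside the braces is evaluated and spliced into the string
url = f"{endpoint}upload"
print(url)  # → https://api.assemblyai.com/v2/upload
```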
Next, the audio is ready to be submitted using the uploaded file from the previous function. This means the code is ready to have the uploaded file transcribed. To do this, a JSON object describing the uploaded file has to be created, then sent to the “transcript” endpoint. Strictly speaking, this isn’t where the text is extracted: if you recall from the dashboard, there was a “processing” queue for audio files, and this endpoint places the job in that queue. This is also where I could have supplied a URL directly if I already had one, since the JSON object has an “audio_url” field. Because my file was saved locally, I needed the upload_file function to generate a CDN URL, which acts as a placeholder. If the audio were already hosted somewhere, I could have skipped that function and simply put its URL in the “audio_url” field.
def submit_audio(self, uploaded_file):
    self.header["content-type"] = "application/json"

    json = {
        "audio_url": uploaded_file,
        "punctuate": True,
        "format_text": True
    }

    url = f"{self.endpoint}transcript"

    response = requests.post(url, json=json, headers=self.header)

    return response.json()
The JSON response could be an error or a success. Catching the potential errors will be handled in the second file. This one will only need to return the response. But if the transcript is successful, I would see the progress in the queue.
Now that the audio has been submitted successfully, it’s time to get the text from the transcript that was just created. This requires another call, this time with the transcript ID, and, as usual, it returns the JSON response.
def get_text(self, transcript_id):
    url = f"{self.endpoint}transcript/{transcript_id}"

    response = requests.get(url, headers=self.header)

    return response.json()
Alright, almost there. Now it’s time to write the private method that reads the file. To upload it, the file needs to be broken into chunks. The file is opened, and as long as there is data left to read, the method yields it. Otherwise, it breaks out of the loop, which ends the reading.
def __read_file(self, filename, chunk_size=5242880):
    with open(filename, 'rb') as file:
        while True:
            data = file.read(chunk_size)
            if not data:
                break
            yield data
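The chunked-read pattern is easy to get wrong (the break only makes sense inside a loop), so here is a small standalone sketch of the same generator idea, using an in-memory file and a tiny chunk size so the behavior is visible:

```python
import io

def read_in_chunks(file_obj, chunk_size=4):
    """Yield fixed-size chunks until the file is exhausted."""
    while True:
        data = file_obj.read(chunk_size)
        if not data:
            break  # nothing left to read, end the generator
        yield data

chunks = list(read_in_chunks(io.BytesIO(b"hello world")))
print(chunks)  # → [b'hell', b'o wo', b'rld']
```

The real method does the same thing with a 5 MB chunk size, which keeps the upload’s memory use bounded no matter how large the audio file is.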
For my testing purposes, those were all the functions I needed to get it working. But, as I said earlier, I was hoping to write a more dynamic class, and the API has two more options. The first is listing transcripts. In the get_text function, we specified which transcript_id we wanted to view, but what if I wanted to list all the transcripts we have? To do this, all we need to do is specify a limit on how many to view and which status to search for. In this case, I’ll set the limit to 200 and the status to completed:
def list_transcripts(self, limit=200, status="completed"):
    self.header["content-type"] = "application/json"

    url = f"{self.endpoint}transcript?limit={limit}&status={status}"

    response = requests.get(url, headers=self.header)

    return response.json()
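Hand-building query strings is easy to get wrong, so as a small aside, the standard library can build them for us. This is just a sketch, assuming (as the method above does) that the endpoint takes limit and status as query parameters:

```python
from urllib.parse import urlencode

params = {"limit": 200, "status": "completed"}
# urlencode turns the dict into "limit=200&status=completed"
url = "https://api.assemblyai.com/v2/transcript?" + urlencode(params)
print(url)  # → https://api.assemblyai.com/v2/transcript?limit=200&status=completed
```

Alternatively, requests will do the same thing if you pass the dict as its params argument instead of concatenating it into the URL yourself.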
The final function grants the ability to delete transcripts. Should an error occur that we have since fixed, or if we simply no longer need a transcript, we can delete it by calling the same transcript endpoint (with the specified id) as the get_text function. However, instead of the call being a GET, it will be a DELETE.
def remove_transcript(self, transcript_id):
    url = f"{self.endpoint}transcript/{transcript_id}"

    response = requests.delete(url, headers=self.header)

    return response.json()
With those last two methods written, that’s all there is for the speech_to_text class.

Writing the test file

I called this file main.py. In this file, the audio file is created, sent, processed after waiting a small amount of time, then the transcript is picked up. This is also where the error handling is written in case the processing response is not successful. So, to get things started, the file begins with the imports. Because there will be a waiting period, I first needed to import time, so that the program will sleep to allow the file time to process:
import time
As mentioned earlier, I chose TinyDB as my means of storage, so the database was next on the import list.
from tinydb import TinyDB, Query
Next, there needs to be a way to record the audio and save it to a file. The next two imports handle the sound device and the ability to write the recording to a file:
import sounddevice as sd
from scipy.io.wavfile import write
After that, the class we created in the first file needs to be referenced. Recall, to import from another file in Python you simply use the name of the file (without the extension) and import whichever function, or in this case class, you want:
from speech_to_text import speech_to_text
With our imports done, it’s time to initialize the database in a variable. A variable should also be created to store the file location.
db = TinyDB('db.json')
FILE_LOCATION = YOUR_FILE_LOCATION
With the variables created, it’s time to declare the main method. Remember to indent all code that’s stored inside this method. To record, I also needed a few other variables to determine how long to record for:
def main():
     # record audio file
     fs = 44100  # sample rate in Hz
     seconds = 5  # length of the recording
With those variables created, it’s time to record the file. A variable captures the recording through the sounddevice import, the program then waits for the allotted recording time, and finally the recording is written to a file using the scipy import:
recording = sd.rec(int(seconds * fs), samplerate=fs, channels=2)
sd.wait()
write(FILE_LOCATION, fs, recording)
Now that the file is recorded and saved, it’s time to initialize the class using the API key:
converter = speech_to_text(YOUR_API_KEY)
With the class initialized, I made the first API call to upload the file that was just recorded:
upload = converter.upload_file(FILE_LOCATION)
Moving down the list of methods, now the file is ready to submit for transcription. Because this could take a little bit of time, I also wanted to wait for a short period. So, because the recording is only five seconds long, I only ask the system to wait 10 seconds, which should be more than enough. If later down the road I wanted to make the recording longer than five seconds, depending on how long I make it I would also need to update how long the system sleeps for. That way, there will be enough time to transcribe the audio.
audio_file = converter.submit_audio(upload["upload_url"])
 
time.sleep(10)
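A fixed sleep works for a five-second clip, but a more robust option would be to poll until the transcript reports a terminal status. Here is a sketch of a generic polling helper; it assumes, based on the Completed and Error queues we saw on the dashboard, that the status field ends up as either "completed" or "error":

```python
import time

def poll_until_done(fetch, interval=3.0, timeout=120.0):
    """Call fetch() repeatedly until the result reports a terminal status, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = fetch()
        if result.get("status") in ("completed", "error"):
            return result
        time.sleep(interval)
    raise TimeoutError("transcription did not finish in time")
```

With the class above, this could replace the fixed sleep, for example: text = poll_until_done(lambda: converter.get_text(audio_file["id"])). I’m sticking with the sleep here to keep the walkthrough simple.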
With the audio transcribed, the final call is required to get the text:
text = converter.get_text(audio_file["id"])
From here, you could print the results as long as there were no errors. However, I decided to store everything in a database for my project, so inserting it into the database will be the next step for me:
# insert into database
db.insert({"todo": text["text"]})
With the results in the database, I wanted to do a little error handling. If the status returned as an error, that would need to be printed separately. But as long as the status is not an error, the results are printed:
# print list of todos
if text["status"] != "error":
    print("Todo List")
    for item in db:
        print(item["todo"])
    return
else:
    print(f"Error getting list: {text['error']}")
    return
That was all the code within the main method. Now for the final code outside of the main block:
if __name__ == "__main__":
    main()
And finally, all that’s left to do is test the results.
python3 main.py

Conclusion

In today’s article, I found an easier way to do speech recognition in Python. This method used an API with an audio file that was also created within the code. There are ways to make this real-time as well, but that will be for next time.
With very little experience in speech recognition, I was able to make a few API calls to transform my recorded audio file into a to-do list. In the next one, I’ll work on making this project real-time. But until then, cheers!
