Handy: privacy-preserving open-source speech-to-text
I’ve recently discovered Handy, a small open-source speech-to-text (dictation) application that can run entirely on local AI models on Linux, preserving the privacy of what I say out loud by not sending it to some cloud server. I found it handy (lol), so I figured I could share a few hardships I ran into while setting it up on Arch Linux.
Handy is a speech-to-text application: you talk out loud and it renders the text on your screen once you’re done. This is especially nice if you are a quick talker but a slow typist; and while I can still type quickly enough, speech is so much faster and, as a lazy interface, much nicer. You can learn more about Handy on its official website, notably how to install it and so on.
Here’s a small list of all the steps I had to go through to set it up on Arch Linux:
- installation method: I installed it from the AUR, where it is packaged as `handy-bin`. For instance, with `yay`, that would be `yay -S handy-bin`.
- install a model: in Handy’s settings, I chose the `Parakeet V3` model, which seems to be the best combination of speed and accuracy.
- missing dependency: at first, when starting it from the command line, it complained about a missing GTK dependency, `gtk-layer-shell`, which I installed with `yay`, fixing the issue: `yay gtk-layer-shell`.
- allow remote interaction: the first time Handy is used, after recording a non-empty sentence, the GNOME remote desktop permission window automatically opens and asks whether to allow remote interaction. I need to toggle that permission on, then click Share. This seems to be needed every time I quit and restart the Handy application, so hopefully not on every single computer startup. But once I’ve done it when opening Handy for the first time, it keeps working for the rest of the session.
- GNOME global shortcuts: after a bit of time I realized that the global shortcuts defined in the application settings would not work, and that I needed to use GNOME’s global keyboard shortcuts configuration instead. So I went ahead and created a new global keyboard shortcut triggering the following invocation: `handy --toggle-transcription`. The trick is that it needs to be run once at the beginning of the speech-to-text transcription, and then once again when you want to stop, as it’s a toggle, but that’s fine.
- paste method: finally, the last issue I ran into was that the paste method was not working, because it relied on something X11-related while I’m using Wayland. As a result, it would drop individual letters from the transcription output, while the application’s logs showed that the transcription was in fact correct. So I had to change the paste method to clipboard.
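For reference, the package-side steps above boil down to two `yay` invocations (a sketch assuming `yay` as your AUR helper; any other helper would do, with the same package names):

```shell
# Install Handy from the AUR (pre-built binary package).
yay -S handy-bin

# Install the GTK dependency Handy complained about at startup.
yay -S gtk-layer-shell
```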
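The GNOME global shortcut can also be created from the command line rather than through the Settings UI; here is a minimal `gsettings` sketch, where the `custom0` slot and the `<Super>h` binding are my arbitrary choices (note that setting the `custom-keybindings` list this way overwrites any custom shortcuts you already have):

```shell
# dconf path for the new custom keybinding slot (arbitrary choice).
KEYPATH=/org/gnome/settings-daemon/plugins/media-keys/custom-keybindings/custom0/

# Register the slot; this replaces the existing custom shortcut list.
gsettings set org.gnome.settings-daemon.plugins.media-keys \
    custom-keybindings "['$KEYPATH']"

# Name, command, and key combination for the new shortcut.
gsettings set org.gnome.settings-daemon.plugins.media-keys.custom-keybinding:$KEYPATH \
    name 'Handy: toggle transcription'
gsettings set org.gnome.settings-daemon.plugins.media-keys.custom-keybinding:$KEYPATH \
    command 'handy --toggle-transcription'
gsettings set org.gnome.settings-daemon.plugins.media-keys.custom-keybinding:$KEYPATH \
    binding '<Super>h'
```

Pressing the shortcut once starts the transcription, and pressing it again stops it, since the command is a toggle.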
So I’m happy to report that after all these steps, this is working just fine 🥳 In fact, I’ve mostly used Handy to write this blog post, which is kind of a delight for me! I’m not quite sure it was faster than typing by hand, but it was more pleasant, for sure. It is still a bit of a pain compared to the experience I had on macOS, where I could install a single application called OpenWhispr and it just worked from the start, showing me the output in real time (whereas on Linux it’s displayed only after stop-toggling the transcription). Well, Linux on the desktop can be quite an adventure, am I right? At least I now have a properly working solution, which makes me happy, and will probably help me write ~~more emails~~ blog posts in the future.
Thanks to Jolivier, who told me about its existence on Mastodon!