Voice Input in Sway with Handy

Voice input is faster than typing text by hand. It just is. I speak faster than I can hunt for keys, and this matters more now that I am working with AI agents constantly. Most of what I type for them is prose, not code.

Here is how I set up instant voice input with Handy in Sway.

Setup

This is my setup on Wayland with Sway. If you are on another system, such as macOS or X11, the configuration will be different.

On Arch, install Handy from the AUR (handy or handy-bin). You will also need wtype for text injection.

On first launch, Handy asks which model to use and downloads it. You can download additional models later in the settings if needed.

Add this to your Sway config:

# Handy - Voice Input
exec handy
bindsym --no-repeat $mod+i exec handy --toggle-transcription
bindsym --release $mod+i exec handy --toggle-transcription

The exec handy line automatically starts Handy when Sway launches. The two keybindings create a push-to-talk flow: press to start recording, release to insert the text. The text appears in whatever window has focus.

Handy has its own default shortcut, Control+Space, but it did not work reliably for me in Sway. I did not investigate the exact reason. Instead, I let Sway handle the keybinding and call Handy’s CLI directly.

Models

Handy supports several ASR models. The most relevant ones for me are Parakeet, Whisper, and Canary. There are also more specialised options, such as Moonshine for lightweight English transcription and SenseVoice for Chinese, Japanese, Korean, and Cantonese. See the Handy model documentation for the full list.

I first tried Whisper Large V3 on my RTX 4070 Max-Q / Mobile. It worked well enough, but inference noticeably loaded the GPU and still felt slower than I wanted for push-to-talk input.

Parakeet V3 currently runs on the CPU in Handy, so I expected it to feel slower. It does not. I press Super+I, speak, release, and the text appears almost immediately.

It uses multiple CPU cores, and the response is fast enough that the CPU-only execution does not feel like a compromise on my AMD Ryzen 7 8845HS.

Language Support

The right model depends on your needs. Whisper handles accents and dialects better and likely works better with less common languages. Parakeet is faster but struggles with dialects like Bavarian.

For English and standard German, both models work well. I stick with Parakeet because I rarely use dialect, and speed matters more to me.

Post-Processing

Handy has experimental support for post-processing transcriptions with LLMs. It supports quite a few providers out of the box, including custom OpenAI-compatible APIs. It even auto-detects the list of available models.

If you want to stay fully local, you can use it with Ollama, llama.cpp, LM Studio, and so on.

I tried post-processing with Mistral Small 3.2, and it works quite well. I also tried reasoning models, but I do not recommend them. Reasoning is not useful for this task and just adds latency.

I still had to adjust the prompt a bit. Otherwise, the model would sometimes include opening phrases such as “Here is the cleaned transcript” instead of returning only the cleaned text.

The default prompt looks like this:

Clean this transcript:
1. Fix spelling, capitalization, and punctuation errors
2. Convert number words to digits (twenty-five → 25, ten percent → 10%, five dollars → $5)
3. Replace spoken punctuation with symbols (period → ., comma → ,, question mark → ?)
4. Remove filler words (um, uh, like as filler)
5. Keep the language in the original version (if it was french, keep it in french for example)

Preserve exact meaning and word order. Do not paraphrase or reorder content.

Return only the cleaned transcript.

Transcript:
${output}

My custom prompt is stricter about returning only the transcript:

Clean this transcript:
1. Fix spelling, capitalization, and punctuation errors
2. Convert number words to digits (twenty-five → 25, ten percent → 10%, five dollars → $5)
3. Replace spoken punctuation with symbols (period → ., comma → ,, question mark → ?)
4. Remove filler words (um, uh, like as filler)
5. Keep the language in the original version (if it was french, keep it in french for example)

Preserve exact meaning and word order. Do not paraphrase or reorder content.

Return only the cleaned transcript. You MUST NOT return anything else.

<bad_example>
Here is the cleaned transcript:
</bad_example>

<good_example>
the actual transcript...
</good_example>

Transcript:
${output}

In my experience, LLM post-processing improved the output quality quite a bit. However, it does add noticeable latency. Still acceptable, but noticeable.

I also tested local models and had a pretty good experience with Gemma 4 E4B IT and Qwen3.5-4B. Qwen3.5-9B and gpt-oss-20b already added too much latency for me. You really need a small model that runs fast to keep the latency low, but quality needs to be good enough, too. And it needs to support the languages you use. On my 8 GB VRAM system, Gemma 4 E4B IT and Qwen3.5-4B strike the right balance.

Overall, though, it is probably not worth it for me. Plain Parakeet V3 is good enough.

In Practice

Voice input is not new, but Handy makes it incredibly convenient. Press Super+I, speak, release, and the text is there. With AI agents, I am constantly entering prompts and other prose. Voice input handles this much faster than typing.

The combination of Handy’s simple interface and Parakeet’s responsiveness creates a workflow that fits how I actually work now.

Give it a try.