Embedding mp3s into your voice UI is one of the most compelling capabilities that SSML (Speech Synthesis Markup Language) has to offer voice designers. Intro music, sound effects, and recorded voices are underused yet powerful elements of voice design projects. And best of all, they’re easy to implement.

Get started including audio in your voice design

1. Encode your mp3 to make it compatible.

Use converter software to encode your audio as MP3 (MPEG version 2) at a bit rate of 48 kbps and a sample rate of 16000 Hz.

UPDATE: The team at Sayspring has created a much easier way to convert files to a format compatible with voice assistants. Just drag and drop your mp3 or wav file into our converter and you’ll have the compatible file. Try it out here.

Or you can convert files on your own with Audacity, which is free. Here are the instructions directly from the Amazon Dev Blog:

  1. Open the file to convert.
  2. Set the Project Rate in the lower-left corner to 16000.
  3. Click File > Export Audio and change the Save as type to MP3 Files.
  4. Click Options, set the Quality to 48 kbps and the Bit Rate Mode to Constant. This requires the Lame library, which can be found at: http://lame.buanzo.org/#lamewindl.

But you should just use our audio converter. It’s much simpler.
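If you would rather script the conversion, here is a minimal sketch that builds a command line for ffmpeg with the settings from this step (MP3, 48 kbps, 16000 Hz). It assumes ffmpeg is installed with the libmp3lame encoder; ffmpeg is not mentioned in the instructions above, and the function name `build_ffmpeg_command` is made up for illustration.

```python
from typing import List


def build_ffmpeg_command(source: str, target: str) -> List[str]:
    """Return an ffmpeg argument list for the encoding settings in step 1.

    Assumption: ffmpeg with the libmp3lame encoder is available on PATH.
    """
    return [
        "ffmpeg",
        "-i", source,              # input file (mp3 or wav)
        "-codec:a", "libmp3lame",  # encode the audio as MP3 via LAME
        "-b:a", "48k",             # bit rate: 48 kbps
        "-ar", "16000",            # sample rate: 16000 Hz
        target,                    # output file
    ]
```

To run the conversion, pass the list to `subprocess.run(build_ffmpeg_command("chime.wav", "chime.mp3"), check=True)`. The file names here are placeholders.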

2. Host your encoded mp3. Grab the link to the file.

You must host your mp3 at an internet-accessible HTTPS endpoint. The hosting domain MUST have a valid, trusted SSL certificate.

You can host on Amazon Simple Storage Service (Amazon S3). It’s part of Amazon Web Services and meets the above requirements.
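Before pasting a link into your design, a quick sanity check can catch the most common mistake, a plain http link. The sketch below only inspects the shape of the URL. It does not connect to the server, so it cannot confirm that the SSL certificate is valid and trusted. The function name `looks_like_hosted_mp3` is hypothetical.

```python
from urllib.parse import urlparse


def looks_like_hosted_mp3(url: str) -> bool:
    """Check that a URL uses HTTPS, names a host, and points at an .mp3 file.

    This is a shape check only; it does not verify the SSL certificate.
    """
    parsed = urlparse(url)
    return (
        parsed.scheme == "https"
        and bool(parsed.netloc)
        and parsed.path.lower().endswith(".mp3")
    )
```

For example, `looks_like_hosted_mp3("http://example.com/chime.mp3")` returns `False` because the link is not HTTPS.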

3. Insert your sound using simple SSML.

Write your speech inside the <speak> tags. Embed your mp3 with an <audio> tag. Follow the lead of the example below.


<speak> Thanks.

<audio src="https://s3-us-west-1.amazonaws.com/sayspring-prod/media/celtic-open-chime.mp3" />

Your deposit has been processed. What would you like to do next? </speak>
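If your responses are generated in code rather than typed by hand, a small helper can assemble the same structure. This is a sketch, not part of any Sayspring or Alexa SDK: the function name `speak_with_audio` is invented for illustration, and it relies only on Python's standard library to escape the text and quote the URL.

```python
from xml.sax.saxutils import escape, quoteattr


def speak_with_audio(before: str, audio_url: str, after: str) -> str:
    """Build an SSML response: speech, then an audio clip, then more speech."""
    return "<speak>{} <audio src={} /> {}</speak>".format(
        escape(before),        # escape &, < and > in the spoken text
        quoteattr(audio_url),  # wrap the URL in quotes, escaping as needed
        escape(after),
    )
```

Calling `speak_with_audio("Thanks.", url, "Your deposit has been processed.")` with your hosted file's link as `url` produces a response shaped like the example above.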

NOTE: If you’re prototyping in Sayspring, you don’t need to use <speak></speak> to insert audio. Only the <audio src="file_url" /> tag is necessary.

The audio clip will now play as part of the response in your project.

Some important limitations to note.

  • You can use up to 5 audio tags in a single response.
  • The combined length of all the audio files in a response can’t exceed 90 seconds.
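Both limits are easy to check before you ship a response. The sketch below counts the audio tags in a piece of SSML and adds up clip lengths that you supply yourself, since the durations cannot be read from the markup. The function name `check_audio_limits` and the constants are hypothetical, and the SSML must be wrapped in <speak> tags so that it parses as XML.

```python
import xml.etree.ElementTree as ET
from typing import List, Sequence

MAX_AUDIO_TAGS = 5         # limit on audio tags per response
MAX_TOTAL_SECONDS = 90.0   # limit on combined audio length per response


def check_audio_limits(ssml: str, clip_seconds: Sequence[float]) -> List[str]:
    """Return a list of limit violations; an empty list means none were found.

    clip_seconds holds the length of each clip, measured by the caller.
    """
    root = ET.fromstring(ssml)
    tag_count = sum(1 for _ in root.iter("audio"))
    total = sum(clip_seconds)

    problems = []
    if tag_count > MAX_AUDIO_TAGS:
        problems.append("too many audio tags: {}".format(tag_count))
    if total > MAX_TOTAL_SECONDS:
        problems.append("audio too long: {} seconds".format(total))
    return problems
```

A response with two short clips passes with an empty list, while six tags or 91 seconds of audio each add one entry.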

Play audio as your entire response, or as an accompaniment to a voice response. The audio tag lets you include sound effects, earcons and short music clips. If your brand has a particular voice, you can include recordings of it in your design.

Audio is a compelling and memorable way to brand a voice-first user experience.

Think of the NBC chimes, the McDonald’s “I’m Lovin’ it” jingle, or the Law & Order dun-dun. With this simple code, every voice design can build a more emotional, more delightful and more memorable brand.

Mark Webster

Founder and CEO of Sayspring. Designer. Developer. Follow me on Twitter at @markcwebster.
