You Talk Too Much: Limiting Privacy Exposure Via Voice Input

Abstract

Voice synthesis uses a voice model to synthesize arbitrary phrases. Advances in voice synthesis have made it possible to create an accurate voice model of a targeted individual, which can then be used to generate spoofed audio in that person's voice. Generating an accurate model of a target's voice requires a corpus of the target's speech. This paper observes that the increasing popularity of voice interfaces that use cloud-backed speech recognition (e.g., Siri, Google Assistant, Amazon Alexa) increases the public's vulnerability to voice synthesis attacks: our growing dependence on voice interfaces fosters the collection of our voices. As our main contribution, we show that voice recognition and voice accumulation (that is, the accumulation of users' voices) are separable. We introduce techniques for locally sanitizing voice inputs before they are transmitted to the cloud for processing. In essence, these methods use audio processing to remove distinctive voice characteristics, leaving only the information that cloud-based services need to perform speech recognition. Our preliminary experiments show that our defenses prevent state-of-the-art voice synthesis techniques from constructing convincing forgeries of a user's speech, while still permitting accurate voice recognition.
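
To make the idea of local sanitization concrete, the following is a minimal sketch of one possible audio perturbation applied on the device before upload. It is an illustrative assumption and not the method evaluated in this paper: the abstract does not specify which voice characteristics are removed or how. The sketch resamples the signal so that pitch and tempo are scaled by a fixed factor. The function names `perturb_pitch` and `dominant_frequency`, the factor 1.25, and the 200 Hz test tone are all hypothetical choices made for illustration.

```python
import numpy as np


def perturb_pitch(samples: np.ndarray, factor: float) -> np.ndarray:
    """Resample `samples` so that playback at the original sample rate
    scales both pitch and tempo by `factor` (hypothetical sanitizer step)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = int(round(len(samples) / factor))
    # Fractional positions in the input at which each output sample is read.
    src_positions = np.arange(n_out) * factor
    # Linear interpolation between neighbouring input samples.
    return np.interp(src_positions, np.arange(len(samples)), samples)


def dominant_frequency(samples: np.ndarray, sample_rate: int) -> float:
    """Return the frequency (Hz) of the strongest spectral component."""
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    return float(freqs[np.argmax(spectrum)])


# Stand-in for recorded speech: one second of a 200 Hz tone at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 200.0 * t)

# Perturb locally; only `shifted` would be sent to the cloud recognizer.
shifted = perturb_pitch(tone, 1.25)
print(dominant_frequency(tone, sample_rate), dominant_frequency(shifted, sample_rate))
```

A single fixed perturbation like this is unlikely to be a sufficient defense by itself. A practical sanitizer would also need to confirm that the transformed audio is still transcribed accurately by the recognizer and that a voice model trained on it no longer matches the speaker.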
