Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Pr𝜀𝜀ch: A System for Privacy-Preserving Speech Transcription
Shimaa Ahmed, Amrita Roy Chowdhury, Kassem Fawaz, and Parmesh Ramanathan
USENIX Security Symposium 2020
1
Speech Transcription*
Speech transcription applications:
2
• Scalable
• Privacy-preserving
• Accurate
* Transcription by IBM Speech-to-Text Demo
Speech Transcription Services
Cloud-based transcription
Deep Speech
+by
3
Open-source offline transcription
Performance Comparison
Standard datasets
4
0.00%
5.00%
10.00%
15.00%
20.00%
25.00%
LibriSpeech DAPS TIMIT
WER* (%)
GoogleAWSDeep Speech
* Word-Error-Rate (WER) = "#$#%&𝐷:𝐼:𝑆:𝑁:
# Deletion# Insertion# Substitution# Reference words
Performance Comparison
Real world use-cases• Facebook hearing before the US Senate
• Supreme Court case “Carpenter v. United States”
• VCTK: non-American accent dataset• Speaker p266 of an Irish accent• Speaker p262 of a Scottish accent
5
Standard vs Real World Performance
0.00%5.00%
10.00%15.00%20.00%25.00%30.00%35.00%40.00%45.00%
LibriS
peec
hDAPS
TIMIT
VCTK - p266
VCTK - p262
Faceb
ook1
Faceb
ook2
Faceb
ook3
Carpne
ter1
Carpen
ter2
WER (%)
GoogleAWSDeepSpeech
6
Standard Real World
Off-the-shelf offline transcribers are not reliable for real world applications
Speech is a Rich Source of Sensitive Information
Voice biometrics
• Personal attributes
• Identity
• Impersonation
7technology can clone a speaker’s voice from a short segment of their speech
Textual content
• Sensitive words
• Statistical analysis of the entire transcript• Topic model• Stylometry analysis• Document classification• Sentiment analysis
8
Speech is a Rich Source of Sensitive Information
Utility-Privacy Trade-off
9
Utility
Cloud service providers
Offline service providers
Privacy
Cloud service providers
Offline service providers
Goal: Design an end-to-end transcription system that provides an intermediate solution along the utility-privacy spectrum
• Obfuscates the users’ voice biometrics
• Protects the sensitive textual content
• Improves on the transcription accuracy compared to offline systems
• Provides control knobs to customize its utility and privacy levels
Original SpeechPrivacy-Preserving Operations
High-accuracy Transcription Final Transcript
De-noising
10
Pr𝜀𝜀ch: Privacy-Preserving Speech Transcription
• Obfuscates the users’ voice biometrics
• Protects the sensitive textual content
• Improves on the transcription accuracy compared to offline systems
• Provides control knobs to customize its utility and privacy levels
Original SpeechPrivacy-Preserving Operations
High-accuracy Transcription Final Transcript
De-noising
11
Pr𝜀𝜀ch: Privacy-Preserving Speech Transcription
Voice Biometrics
Many-to-One Voice Conversion
Voice Conversion
12
Senator Harris
Senator Thune
Mark Zuckerberg
0% accuracy in matching original speakers with their voice-converted speech using Azure speaker identification API
• Obfuscates the users’ voice biometrics
• Protects the sensitive textual content
• Improves on the transcription accuracy compared to offline systems
• Provides control knobs to customize its utility and privacy levels
Original SpeechPrivacy-Preserving Operations
High-accuracy Transcription Final Transcript
De-noising
13
Pr𝜀𝜀ch: Privacy-Preserving Speech Transcription
Break the context
• Segmentation
• Sensitive words scrubbing Original speech Segmentation
~3 non-stop words
Sensitive WordsScrubbing
14
Deep Speechtranscription
information about 87million Facebook users being obtained by the company CambridgeAnalytica
Break the context
• Segmentation
• Sensitive words scrubbing Original speech Segmentation
~3 non-stop words
Sensitive WordsScrubbing
15
Deep Speechtranscription
PocketSphinx
information about 87million Facebook users being obtained by the company CambridgeAnalytica
Break the context
• Segmentation
• Sensitive words scrubbing
• The textual content is transferred into a bag-of-words model
Original speech Segmentation
~3 non-stop words
Sensitive WordsScrubbing
16
Deep Speechtranscription
information about 87million Facebook users being obtained by the company CambridgeAnalytica
Differentially Private (DP) Words’ Histogram
• Bag-of-words: histogram of words
• Apply DP to the true histogram• A randomized mechanism 𝐴: ℕ|"| → ℕ|"| satisfies (𝜀, 𝛿)-DP, if for any pair of
histograms 𝐻# and 𝐻$ such that ||𝐻#, 𝐻$||# = 𝑑, and for any set 𝑂 ⊆ ℕ|"|,
Words
Coun
t
17
Pr 𝐴(𝐻# ∈ 𝑂] ≤ 𝑒% Pr 𝐴(𝐻$ ∈ 𝑂] + 𝛿
DP Challenges
• Pr𝜀𝜀ch has access only to the speech, but not the transcript• No access to the true histogram
• The noise ‘dummy words’ must be added in the speech domain
• The dummy words must be indistinguishable from the true speech• Segment length• Voice• Language model
18
Dummy Segments Generation
The Lanham Act's banon federal registrationof scandaloustrademarks is not arestriction on speechbut a valid conditionon participation in afederal program.
Dummy Text Generation TTS
Domain
Legal, courtcase, violation
Original Speech
Offline transcriber
19
ban on federal registration
DP Mechanism
trademarks is not a restriction
in a federal program
End-to-End System(1) Hide voice biometrics (2) Ensure noise indistinguishability
Protect the textual content
20
The Lanham Act's banon federal registration ofscandalous trademarksis not a restriction onspeech but a validcondition onparticipation in a federalprogram.
At issue in thiscase is thegovernment'swarrantlesscollection of 127days ofPetitioner's cellsite locationinformation
Original speech
DomainLegal, courtcase, supremecourt
1. Segmentation
3. Dummy Text Generation 4. TTS
5. Voice Conversion
7. Transcription
8. De-noising and de-shuffling
6. Partitioning,adding noise, and shuffling.
government's At on scandalous
cell issue of location7. Transcription
2. Sensitive wordscrubbing
Final Transcript
In Pr𝜀𝜀ch, the DP noise does NOT deteriorate the utility, instead it adds monetary cost overhead
speech not a registration
in condition a federal restriction
valid Petitioner's 127 Lanham on collection
Participation in this program
• Obfuscates the users’ voice biometrics
• Protects the sensitive textual content
• Improves on the transcription accuracy compared to offline systems
• Provides the users with control knobs to customize its utility and privacy levels
Original SpeechPrivacy-Preserving Operations
High-accuracy Transcription Final Transcript
De-noising
21
Pr𝜀𝜀ch: Privacy-Preserving Speech Transcription
End-to-End System
22
The Lanham Act's banon federal registration ofscandalous trademarksis not a restriction onspeech but a validcondition onparticipation in a federalprogram.
At issue in thiscase is thegovernment'swarrantlesscollection of 127days ofPetitioner's cellsite locationinformation
Original speech
DomainLegal, courtcase, supremecourt
1. Segmentation
3. Dummy Text Generation 4. TTS
5. Voice Conversion
7. Transcription
8. De-noising and de-shuffling
6. Partitioning,adding noise, and shuffling.
government's At on scandalous
cell issue of location7. Transcription
2. Sensitive wordscrubbing
Final Transcript
speech not a registration
in condition a federal restriction
valid Petitioner's 127 Lanham on collection
Participation in this program
Utility: Transcription Accuracy
Dataset No Voice Conversion
One-to-one Voice Conversion
Many-to-One Voice Conversion
Deep Speech
VCTK p266 5.15 16.55 21.92 26.72VCTK p262 4.53 7.39 10.82 15.97Facebook1 8.26 14.60 20.30 24.72Facebook2 9.75 18.27 19.44 26.61Facebook3 14.93 23.25 27.06 30.72Carpneter1 14.43 23.88 22.63 25.85Carpenter2 13.53 33.71 38.90 39.71
Table: WER(%) at different settings of Pr𝜀𝜀ch vs Deep Speech
23
2% − 32.5%44% − 80%
Textual Content Privacy
Noise Level
Original Word Cloud
24
Formal Privacy Guarantee
Post-processing of DP:
25
For a speech file 𝑆, Pr𝜀𝜀ch provides perfect voice privacy using many-to-one voice conversion and an (𝜀, 𝛿)-DP guarantee on the word histogram for the domain considered, under the assumption that the dummy segments are indistinguishable from the true segments.
Any statistical analysis on the noisy words’ histogram does not cause loss in privacy
TakeawaysPr𝜀𝜀ch as a privacy-preserving speech transcription system:• Provides an improved performance relative to offline transcription• 2% to 32.52% relative improvement in WER
• Obfuscates the speakers’ voice biometrics• 0% accuracy in matching real speakers with their voice-converted speech
• allows only a DP view of the textual content.
26
Contact us: [email protected]