Data Privacy in Biomedicine
Lecture 9: Availability of Data and
(time permitting) the Curse of the SSN
Bradley Malin, PhD (b.malin@vanderbilt.edu)
Professor of Biomedical Informatics, Biostatistics, & Computer Science
Vanderbilt University
February 10, 2020
© 2020 Bradley Malin, Lecture 9: Availability & Prediction
Overview
◼ Information Generation
◼ Models of Availability
◼ Some Resources
◼ A Look at Voter Registration
◼ Curse of the SSN
Information Explosion
[Figure: two time series, 1983-2003. Growth in available disk storage: GDSP (MB/person), axis 0 to 500. Growth in active web servers: servers (in millions), axis 0 to 35. Annotations: 1st WWW conference; 1991; 1996; 2001.]
L. Sweeney. Information explosion. In L. Zayatz, et al. (eds) Confidentiality, Disclosure, and Data Access: Theory and
Practical Applications for Statistical Agencies. Urban Institute, Washington, DC, 2001
Newer Data
◼ Exabyte = 1 billion gigabytes
Year   New Data Generated in the “World”   Source
2003   5 exabytes*                         UC Berkeley
2006   161 exabytes**                      IDC
* Includes analog data (radio communications, paper memos, etc.)
** Includes new and replicated data; some estimates put original information closer to 40 exabytes
[Figure: exabytes by year, 2002-2010 (axis 0 to 1,200); series: New, New + Replicated]
Latest Numbers
◼ On average, the US alone is now generating 2,657,700 GB (roughly 2.7 quadrillion bytes) of Internet data every minute
◼ https://www.domo.com/learn/data-never-sleeps-5
Birth Certificates (circa 1925)
Field# Field name
1 Child's first name
2 Child's middle name (sometimes or initial)
3 Child's last name
4 Day, month and year of birth
5 City and/or County of birth (sometimes hospital)
6 Father's name
7 Mother's name (including maiden name)
8 Place of birth (address and town/city)
9 Mother's age and address
10 Mother's birthplace (town/city, state, county)
11 Mother's occupation
12 Mother, number of previous children
13 Father's age and address
14 Father's birthplace (town/city, state, county)
15 Father's occupation
Electronic Birth Certificates (post 1999)
Field# Size Field name
1 1 File Status
2 50 Baby’s First Name
3 50 Baby’s Middle Name
4 50 Baby’s Last Name
5 1 Baby’s Suffix Code
6 3 Baby’s Suffix Text
7 8 Baby’s Date of Birth
8 5 Baby’s Time of Birth
9 1 AM/PM Indicator
10 1 Baby’s Sex
11 3 Blood Type
12 1 Born Here?
13 40 Place of Birth
14 1 Facility Type
15 20 City of Birth
Beyond the
Phenomenon
16 20 County of Birth
17 6 Certifier’s Code
18 30 Certifier’s Name
19 1 Certifier’s Title
20 30 Attendant’s Name
21 1 Attendant’s Title
22 23 Attendant’s Address
23 19 Attendant’s City
24 2 Attendant’s State
25 10 Attendant’s Zip Code
26 50 Mother’s First Name
27 50 Mother’s Middle Name
28 50 Mother’s Last Name
29 9 Mother’s Social Security Number
30 8 Mother’s Date of Birth
31 3 Mother’s State of Birth
32 7 Mother’s Residence Address
33 2 Mother’s Residence Direction
34 20 Residence Street Address
35 10 Residence Type
36 2 Residence Extension
37 10 Residence Apartment #
38 20 Mother’s Town of Residence
39 1 Mother’s Residence in City Limits
40 14 Mother’s County of Residence
41 3 Mother’s State of Residence
42 10 Mother’s Residence Zip Code
43 38 Mother’s Mailing Address
44 19 Mother’s Mailing City
45 2 Mother’s Mailing State
46 10 Mother’s Mailing Zip Code
47 1 Mother Married?
48 50 Father’s First Name
49 50 Father’s Middle Name
50 50 Father’s Last Name
51 1 Father’s Suffix Code
52 9 Father’s Suffix Text
53 9 Father’s Social Security Number
54 8 Father’s Date of Birth
55 3 Father’s State of Birth
56 14 Mother’s Origin
57 14 Mother’s Race
58 2 Mother’s Elementary Education
59 2 Mother’s College Education
60 11 Mother’s Occupation
61 11 Mother’s Industry
62 14 Father’s Origin
63 14 Father’s Race
64 2 Father’s Elementary Education
65 2 Father’s College Education
66 11 Father’s Occupation
67 11 Father’s Industry
68 1 Plurality
69 1 Birth Order
70 2 Live Births Still Living
71 2 Live Births Now Dead
72 4 Month/Year Last Live Birth
73 2 Number of Terminations
74 4 Month/Year Last Termination
75 1 Baby’s Weight Unit
76 5 Baby’s Weight
77 6 Date of Last Normal Menses
78 1 Month Prenatal Care Began
79 2 Total Number of Visits
80 2 Apgar Score – 1 Minute
81 2 Apgar Score – 5 Minute
82 2 Estimate of Gestation
83 6 Date of Blood Test
84 22 Laboratory
85 1 Mother Transferred In
86 30 Facility Mother Transferred From
87 1 Baby Transferred Out
88 30 Facility Baby Transferred To
89 1 Tobacco Use During Pregnancy
90 3 Number of Cigarettes/Day
91 1 Alcohol Use During Pregnancy
92 3 Number of Drinks/Week
93 3 Mother’s Weight Gain
94 1 Release Info For SSN
95 6 Operator Code
96 12 Hospital ID
97 1 Sent to Romans
98 1 Sent to APORS
99 16 Other Certifier Specify
100 12 Temporary Audit Number
101 16 Other Facility Specify
102 16 Other Attendant Specify
103 1 Mother’s Race
104 1 Father’s Race
105 2 Mother’s Origin
106 2 Father’s Origin
107 1 Attendant Same YN
108 1 Mailing Address Same YN
109 1 Capture Father’s Info YN
110 2 Mother’s Age
111 2 Father’s Age
112 12 Baby’s Hospital Med. Rec.
113 1 High Risk Pregnancy YN
114 1 Care Giver (For Chicago)
115 1 Record Selected For Download
116 1 Downloaded
117 1 Printed
118 12 Form Number
MEDICAL RISK FACTORS
119 1 Anemia
120 1 Cardiac Disease
121 1 Acute/Chronic Lung Disease
122 1 Diabetes
123 1 Genital Herpes
124 1 Hydramnios/Oligohydramnios
125 1 Hemoglobinopathy
126 1 Hypertension, Chronic
127 1 Hypertension, Preg. Assoc.
128 1 Eclampsia
129 1 Incompetent Cervix
130 1 Previous Infant 4000+ Grams
131 1 Previous Preterm or SGA Infant
132 1 Renal Disease
133 1 Rh Sensitization
134 1 Uterine Bleeding
135 1 No Medical Risk Factors
136 40 Other Medical Risk Factors
OBSTETRIC PROCEDURES
137 1 Amniocentesis
138 1 Electronic Fetal Monitoring
139 1 Induction of Labor
140 1 Stimulation of Labor
141 1 Tocolysis
142 1 Ultrasound
143 1 No Obstetric Procedures
144 40 Other Obstetric Procedures
COMPLICATIONS OF LABOR & DELIVERY
145 1 Febrile (>100 or 38C)
146 1 Meconium Moderate, Heavy
147 1 Premature Rupture (>12 Hrs)
METHOD OF DELIVERY
162 1 Vaginal
163 1 Vaginal After Previous C-Section
164 1 Primary C-Section
165 1 Repeat C-Section
166 1 Forceps
167 1 Vacuum
ABNORMAL CONDITIONS OF NEWBORN
168 1 Anemia
169 1 Birth Injury
170 1 Fetal Alcohol Syndrome
171 1 Hyaline Membrane Disease/RDS
172 1 Meconium Aspiration Syndrome
173 1 Assisted Ventilation <30
174 1 Assisted Ventilation >30
175 1 Seizures
176 1 No Abnormal Conditions of Newborn
177 40 Other Abnormal Condition of Newborn
CONGENITAL ANOMALIES OF CHILD
178 1 Anencephalus
179 1 Spina Bifida/Meningocele
180 1 Hydrocephalus
181 1 Microcephalus
182 40 Other CNS Anomalies
183 1 Heart Malformations
184 40 Other Circ./Resp. Anomalies
185 1 Rectal Atresia/Stenosis
186 1 Tracheo-Esophageal Fistula/Esophageal Atresia
187 1 Omphalocele/Gastroschisis
188 40 Other Gastrointestinal Ano.
189 1 Malformed Genitalia
190 1 Renal Agenesis
191 40 Other Urogenital Anomalies
192 1 Cleft Lip/Palate
193 1 Polydactyly/Syndactyly/Adactyly
194 1 Club Foot
195 1 Diaphragmatic Hernia
196 40 Other Musculoskeletal/Integumental Anomalies
197 1 Down’s Syndrome
198 40 Other Chromosomal Anomalies
199 1 No Congenital Anomalies
200 40 Other Congenital Anomalies
CODE STRIP
201 1 Record Complete YN
202 1 Record Type
203 4 Facility ID
204 4 City of Birth
205 3 County of Birth
206 2 Mother’s State of Birth
207 2 Mother’s State of Residence
208 4 Mother’s Town of Residence
209 3 Mother’s County of Residence
210 2 Father’s State of Birth
211 14 Certifier’s License Number
212 6 Laboratory ID Number
213 4 Mother Xfer Code
214 3 Mother Xfer County Code
215 4 Baby Xfer Code
216 3 Baby Xfer County Code
217 4 Year of Birth
218 7 Certificate #
219 1 Unique Code
220 8 File Date
221 2 Community Area
222 4 Census Tract
223 2 Century of Last Live Birth
224 2 Century of Last Termination
225 2 Century of Last Menses
Accessibility
◼ Characterization of datasets / data
◼ Meta-information
◼ Cost: Price per record or cost per dataset?
◼ Attribute: Type of data (e.g., name, birthdate, profession)
◼ Availability = Credentials needed to access the dataset
[Diagram: Semantics links Dataset to Attribute; Credentials links Dataset to Availability; Economics links Dataset to Cost]
Availability
◼ Public: Anyone can access the information with little, if any, constraint (e.g., Google / Public Records)
◼ Semi-Public: The data is there, but there are some barriers to entry (e.g., Money)
◼ Semi-Private: Requires certain credentials to access such information (e.g., Census researchers)
◼ Private: Only privileged individuals are privy to the information (e.g., Top Secret)
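The characterization of a dataset by attribute, cost, and availability can be sketched as a small data model. This is an illustrative Python sketch rather than anything from the lecture: the class names, the example catalog, and the prices are all hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Availability(IntEnum):
    """The four availability levels, ordered by increasing restriction."""
    PUBLIC = 0        # little, if any, constraint
    SEMI_PUBLIC = 1   # some barrier to entry, such as money
    SEMI_PRIVATE = 2  # specific credentials required
    PRIVATE = 3       # privileged individuals only


@dataclass(frozen=True)
class DatasetProfile:
    name: str
    attributes: frozenset       # semantics: types of data held
    availability: Availability  # credentials needed to access the dataset
    cost_per_record: float      # economics (hypothetical USD figure)


def reachable(datasets, max_level, budget_per_record):
    """Names of datasets an actor with this access level and budget could obtain."""
    return [d.name for d in datasets
            if d.availability <= max_level
            and d.cost_per_record <= budget_per_record]


# Hypothetical catalog; attribute lists and prices are made up.
CATALOG = [
    DatasetProfile("voter_list", frozenset({"name", "address", "birthdate"}),
                   Availability.SEMI_PUBLIC, 0.01),
    DatasetProfile("property_assessments", frozenset({"name", "address"}),
                   Availability.PUBLIC, 0.0),
    DatasetProfile("census_microdata", frozenset({"birthdate", "profession"}),
                   Availability.SEMI_PRIVATE, 0.0),
]
```

Ordering the levels as integers is a design choice made here so that "at most this restricted" becomes a simple comparison.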
https://data.census.gov/cedsci/
The Rise of Twitbookin
◼ …Do we really need to talk about it?
Intelius.com
Property Assessments
◼ Tennessee
http://www.assessment.cot.tn.gov/RE_Assessment/
◼ Davidson County
http://www.padctn.org/real-property-search/
Search by {Owner, Parcel, Street Address}
◼ Imagine combining with Google Maps
(http://maps.google.com) or Zillow
(http://www.zillow.com)
Reverse Lookups
◼ Phone
http://www.anywho.com/reverse-lookup
https://www.ussearch.com/reverse-phone-lookup
◼ DNS
http://remote.12dt.com/
http://psacake.com/web/eg.asp
http://www.dnsstuff.com/
Collections on Everything
◼ Bankruptcy
◼ Birth
◼ Criminal
◼ Death
◼ Divorce
◼ DNS
◼ Employment
◼ Financial (e.g., donations)
◼ Marriage
◼ Military
◼ Residential
◼ Social Security
◼ Phone
◼ Voting
◼ …
Brokers are Real
Brokers are Big Business
Take a look at IQVIA
Birthdays
◼ http://www.birthdatabase.com/
◼ Search by {First Name, Last Name, Expected Age}
◼ Where does this information come from?
◼ Why is this available?
◼ Imagine combining with Facebook’s place of birth to
reveal DOB
Combining Databases?
◼ How do you integrate these
databases?
Do you trust names?
Do you trust phone
numbers?
How much information would
you need until you’re
confident of a match?
(We’ll return to this next
lecture)
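One way to make the "how much information" question concrete is to count how many records share each combination of fields: a combination that occurs only once points to a single record. The Python sketch below is a minimal illustration on made-up records. The field names and data are hypothetical, and this is not the matching method covered in the next lecture.

```python
from collections import Counter


def unique_fraction(records, fields):
    """Fraction of records whose values on `fields` are shared by no other record."""
    if not records:
        return 0.0
    keys = [tuple(r[f] for f in fields) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(records)


# Made-up records for illustration only.
PEOPLE = [
    {"last": "Smith", "zip": "37203", "birth_year": 1980},
    {"last": "Smith", "zip": "37203", "birth_year": 1985},
    {"last": "Smith", "zip": "37212", "birth_year": 1980},
    {"last": "Jones", "zip": "37203", "birth_year": 1980},
]

for fields in (["last"], ["last", "zip"], ["last", "zip", "birth_year"]):
    print(fields, unique_fraction(PEOPLE, fields))
```

On this toy data, each added field raises the share of records that are unique, which is the intuition behind asking how many fields are needed before a match can be trusted.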
Registries
◼ National Sex Offender Registry
https://www.nsopw.gov/
Tennessee
◼ http://sor.tbi.tn.gov/SOMainpg.aspx
# Field
1 Photo
2 Date of Birth
3 Race
4 Sex
5 Home Address
6 County of Residence
7 Last Date Information Updated
8 Last Registration / Report Date
9 Status
10 Classification
11 TID
12 Supervision Site
13 Offenses
14 Aliases
Drug Offender Registries
◼ TN: https://apps.tn.gov/methor/
◼ Searchable by County
or Name + First Initial
Why?
# Field
1 Last Name
2 First Name
3 Type of Name
4 Date of Birth
5 County
6 Offense(s)
7 Date of Conviction
Remember the Voter Database?
◼ Public Information Sharing
◼ Example: Washington
If you are a voter, your name, address, political
jurisdiction, gender, date of birth, voting record, date
of registration, and registration number are public
information under state law.
(RCW 29A.08.710)
◼ This is public record by law and does not violate
security or privacy policy, however
Behind Closed Doors
◼ More than public information for registration of voters
◼ Goal: Enable sharing of information with other states securely and
accurately in fulfillment of HAVA (the Help America Vote Act)
◼ Example: Pennsylvania
Under the “Statewide Uniform Registry of Electors”, county election officials have
direct access to the centralized statewide database
The state uses “identifying number, name, and date of birth” for linking
to motor vehicle and/or Social Security records
◼ Uses a hybrid match: the number and first two characters of the last name
must match exactly, with discretion left to the county commission to
determine if the rest of the record is a match
◼ Currently uses the AAMVA (American Association of Motor Vehicle
Administrators) criteria to match information with SSN digits: exact match of
the last four digits of Social Security Number, first name, last name, month
of birth, and year of birth.
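The two matching rules described above can be sketched as predicates over a pair of records. This is an illustrative Python sketch, not the state's implementation: the dictionary keys, the case-insensitive normalization, and the sample values are assumptions made here.

```python
def _norm(value):
    """Normalize a field for comparison (assumption: case and padding are ignored)."""
    return str(value).strip().upper()


def hybrid_match(a, b):
    """Identifying number and first two characters of the last name match exactly.

    A True result only passes the screen; per the slide, the rest of the record
    is left to the county commission's discretion.
    """
    return (_norm(a["id_number"]) == _norm(b["id_number"])
            and _norm(a["last_name"])[:2] == _norm(b["last_name"])[:2])


def aamva_match(a, b):
    """Exact match on SSN last four, first name, last name, birth month and year."""
    fields = ("ssn_last4", "first_name", "last_name", "birth_month", "birth_year")
    return all(_norm(a[f]) == _norm(b[f]) for f in fields)


# Made-up records; none of these values refer to a real person.
VOTER = {"id_number": "D1234567", "first_name": "Ann", "last_name": "Smith",
         "ssn_last4": "0000", "birth_month": 7, "birth_year": 1987}
DMV = {"id_number": "D1234567", "first_name": "ANN", "last_name": "Smyth",
       "ssn_last4": "0000", "birth_month": 7, "birth_year": 1987}

print(hybrid_match(VOTER, DMV), aamva_match(VOTER, DMV))
```

The sample pair passes the hybrid screen (same number, last names both start with "SM") but fails the stricter criteria because the full last names differ.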
Washington
https://dl.ncsbe.gov/index.html?prefix=data/
North Carolina
Georgia
http://sos.ga.gov/index.php/elections/order_voter_registration_lists_and_files
Privacy Violations By Inference
Name Date Number of Voters
Catoosa 9/18/2007 1
Cobb 9/18/2007 1
Clayton 9/18/2007 1
Lee 11/06/2007 1
Gwinnett 9/18/2007 1
Dekalb 9/18/2007 3
Chattaho 11/06/2007 3
Sumter 11/06/2007 3
Seminole 11/06/2007 4
Charlton 11/06/2007 5
Dodge 9/18/2007 6
Mitchell 3/20/2007 7
Dawson 9/18/2007 7
Registration number: 76686
When it is revealed how Catoosa county voted in this election (aggregate results), we uniquely link this voter to their vote.
Georgia Voting 2007 History
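The inference above depends on finding (county, election date) groups that contain very few voters. A minimal Python sketch of that check follows. The function name and the registration numbers are hypothetical placeholders, and the rows only mimic the shape of a voting-history file.

```python
from collections import defaultdict


def small_cells(history, threshold=1):
    """Group voters by (county, election date) and keep groups of size <= threshold.

    A group of size 1 means the published aggregate result for that county
    and election is the vote of one identifiable registrant.
    """
    groups = defaultdict(list)
    for county, date, registration_number in history:
        groups[(county, date)].append(registration_number)
    return {key: regs for key, regs in groups.items() if len(regs) <= threshold}


# Placeholder rows; the registration numbers are invented.
HISTORY = [
    ("Catoosa", "9/18/2007", "R-0001"),
    ("Dekalb", "9/18/2007", "R-0002"),
    ("Dekalb", "9/18/2007", "R-0003"),
    ("Dekalb", "9/18/2007", "R-0004"),
]

print(small_cells(HISTORY))
```

A data publisher could run the same check before release and suppress or merge any group at or below the threshold.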
Policy & Usage Restrictions
RCW 29A.08.740
Violations of restricted use of registered voter data - Penalties - Liabilities. (Effective January 1, 2006.)
(1) Any person who uses registered voter data furnished under RCW 29A.08.720 for the purpose
of mailing or delivering any advertisement or offer for any property, establishment,
organization, product, or service or for the purpose of mailing or delivering any solicitation
for money, services, or anything of value is guilty of a class C felony punishable by imprisonment in
a state correctional facility for a period of not more than five years or a fine of not more than ten
thousand dollars or both such fine and imprisonment, and is liable to each person provided such
advertisement or solicitation, without the person's consent, for the nuisance value of such person
having to dispose of it, which value is herein established at five dollars for each item mailed or
delivered to the person's residence. …
(2) Each person furnished data under RCW 29A.08.720 shall take reasonable precautions designed to
assure that the data is not used for the purpose of mailing or delivering any advertisement or
offer for any property, establishment, organization, product, or service or for the purpose of
mailing or delivering any solicitation for money, services, or anything of value. However, the
data may be used for any political purpose. Where failure to exercise due care in carrying out this
responsibility results in the data being used for such purposes, then such person is jointly and
severally liable for damages under subsection (1) of this section along with any other person liable
under subsection (1) of this section for the misuse of such data.
[2005 c 246 § 19. Prior: 2003 c 111 § 249; 2003 c 53 § 176; 1999 c 298 § 2; 1992 c 7 § 32; 1974 ex.s. c
127 § 3; 1973 1st ex.s. c 111 § 4. Formerly RCW 29.04.120.]
SSNs – Who Cares?
◼ The Social Security Number is one of, if not the, most overloaded
numbers in the United States
◼ It binds records on finances, insurance, education, death, taxes…
◼ Two HUGE problems: Fraud & Identity Theft
https://www.ftc.gov/system/files/documents/reports/consumer-sentinel-network-data-book-january-december-2016/csn_cy-2016_data_book.pdf
Year   Total # of    % of Complaints Reporting   Reported      Avg. Amount   Median Amount
       Complaints    Amount Paid                 Amount Paid   Paid          Paid
2003   327,479       78%                         $459M         $1.8k         $222
2004   406,193       76%                         $567M         $1.8k         $267
2005   431,118       66%                         $682M         $2.4k         $350
2016   3,000,000     51%                         $744M         $1.1k         $450
◼ 63% Fraud: Internet Auction (12%), Foreign Money Offer (8%)
◼ 37% ID theft: Credit card (26%), Phone / Utilities (18%), Employment
(12%), Government documents / benefits (9%), Loan (5%)
A Brief Demonstration
Location and Age Matter (2005)
https://www.ftc.gov/system/files/documents/reports/consumer-sentinel-network-data-book-january-december-2016/csn_cy-2016_data_book.pdf
◼ Highest per capita rate of identity theft (metropolitan areas)
Rank   Metropolitan Area                        Complaints Per 100,000 People
1      Phoenix-Mesa-Scottsdale, AZ              178.3
2      Las Vegas-Paradise, NV                   158.5
3      Riverside-San Bernardino-Ontario, CA     145.7
43     Nashville-Davidson--Murfreesboro, TN     63.6
◼ Rate of victimization by age range
Location and Age Matter (2016)
https://www.ftc.gov/system/files/documents/reports/consumer-sentinel-network-data-book-january-december-2016/csn_cy-2016_data_book.pdf
◼ Highest per capita rate of identity theft (metropolitan areas)
◼ Rate of victimization by age range
SSNs
◼ Federal paternalism for “social insurance”
◼ Benefits based on payroll tax contributions →
Federal Old-Age Benefits
◼ Issuance began in late 1936
◼ Issued by the Social Security Administration
Permanent residents
Temporary / working residents
http://www.ssa.gov/history
◼ First number to John David Sweeney Jr. (Baltimore, MD)
◼ Lowest number issued: 001-01-0001
SSN Policy Chronology
◼ 1935: Social Security Act creates “social insurance” program
◼ 1943: Executive Order – All federal agencies use SSN when identification
needed
◼ 1950–1971: “Adult category” for state run Supplemental Security Income
◼ 1961:
Civil Service adopts SSN as Federal employee identifier
IRS requires taxpayers to use SSNs for tax reporting
◼ 1964: Treasury Dept. asks H bond purchasers for SSN
◼ 1966: VA adopts SSN as patient identifier
(1967 – Weed begins work on first EMR)
◼ 1969: DOD adopts SSN as Armed Forces personnel ID
◼ 1970: Bank Records & Foreign Transactions Act: all banks, savings & loans,
credit unions & securities brokers/dealers → obtain SSNs of all customers
◼ 1971: SSA Task Force warns against overuse
◼ 1972: SSA Amendment – all legal aliens get SSN
Look at http://www.ssa.gov/history/ssn/ssnchron.html
◼ 1974: Privacy Act – State & local gov’t cannot withhold benefit due to
failure of SSN presentation
◼ 1975: Social Services Amendment of ’74: Parent Locator Service can
collect SSN and whereabouts from SSA records
◼ 1976: Tax Reform Act of 1976: States can use SSN for tax, general
public assistance, driver's license or motor vehicle registration
◼ 1981: Omnibus Budget Reconciliation Act: requires SSN of each adult member
in household of child applying to school lunch program
◼ 1982: Debt Collection Act: Federal loan programs require SSN in application
◼ 1987: SSNs for infants
◼ 1998: Identity Theft & Assumption Deterrence Act: "means of
identification" includes SSN
◼ 2005: Real ID Act: States must confirm SSN (with SSA) for driver's license
or identity card issuance
Medicare
◼ http://my.medicare.gov/
◼ Medicare Identification
Number (MIN) is usually
SSN + an added letter
◼ Ex: 000-00-0000A
A = wage earner (primary)
If spouse becomes eligible
for Medicare benefits
through primary, they are
assigned a B
Many valid suffixes
◼ MIN may be different than
the SSN
Ferree Snafu
◼ 1938: Wallet manufacturer, E.H. Ferree, promoted new wallet
◼ Sample card was a real card
Hilda Schrader Whitcher, secretary of the company vice president
◼ Wallet sold by Woolworth department stores across the USA
◼ 1943: 5,755 people using the number
◼ SSA voided the number; issued Hilda new card
◼ > 40,000 people reported the Whitcher number as their own
◼ 1976: 40 people found using the number
◼ 1977: 12 people found using the number
◼ It’s known as “the Social Security Number issued by Woolworth”
◼ Many other cases
1940: The 219-09-9999 vs. “Provo, Utah” Case
SSN Assignment
◼ SSNs are almost a one-time shot
◼ You can get a new SSN only in extremely
rare circumstances
◼ You must prove
Someone has stolen your number
Someone is using it illegally
The misuse is causing you serious harm
http://www.socialsecurity.gov/ssnumber/ss5doc.htm
Modern Times: Restricted Use
◼ Some state laws restrict SSN use, display, and
transfer (e.g., CA in 2001)
◼ Michigan prohibits use of more than 4
consecutive digits of an SSN
Is that sufficient protection?
The SSN
XXX-YY-ZZZZ
XXX = Area (AN)
YY = Group (GN)
ZZZZ = Serial (SN)
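The three-part layout (a 3-digit area, a 2-digit group, and a 4-digit serial) can be sketched as a small parser. This Python sketch is illustrative only: the function and class names are hypothetical, and the plausibility check applies just the range rules stated in these slides (area 000 is never valid, groups run 01-99, serials run 0001-9999), not SSA's full validation rules.

```python
import re
from typing import NamedTuple


class SSNParts(NamedTuple):
    area: str    # XXX, area number (AN)
    group: str   # YY, group number (GN)
    serial: str  # ZZZZ, serial number (SN)


_SSN_PATTERN = re.compile(r"(\d{3})-(\d{2})-(\d{4})")


def split_ssn(text: str) -> SSNParts:
    """Split a string in XXX-YY-ZZZZ layout into its three parts."""
    match = _SSN_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError("expected the layout XXX-YY-ZZZZ")
    return SSNParts(*match.groups())


def is_structurally_plausible(parts: SSNParts) -> bool:
    """Check only the range rules given in the slides; this is not an issuance check."""
    return parts.area != "000" and parts.group != "00" and parts.serial != "0000"


# 001-01-0001 is cited in the slides as the lowest number issued.
print(split_ssn("001-01-0001"))
```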
Area Numbers: XXX
◼ Prior to 1972: represented the state from
which a person applied for their social
security card
◼ After 1972: based on zip code in the
mailing address provided on the
application form
Area Numbers: XXX
#                            STATE
001-003                      NH
004-007                      ME
008-009                      VT
010-034                      MA
035-039                      RI
040-049                      CT
050-134                      NY
135-158                      NJ
159-211                      PA
212-220                      MD
221-222                      DE
223-231, 691-699             VA
232-236                      WVA
232, 237-246, 681-690        NC
247-251, 654-658             SC
252-260, 667-675             GA
261-267, 589-595, 766-772    FL
268-302                      OH
303-317                      IN
318-361                      IL
362-386                      MI
387-399                      WI
400-407                      KY
408-415, 756-763             TN
416-424                      AL
425-428, 587-588, 752-755    MS
429-432, 676-679             AR
433-439, 659-665             LA
440-448                      OK
449-467, 627-645             TX
468-477                      MN
478-485                      IA
486-500                      MO
501-502                      ND
503-504                      SD
505-508                      NB
509-515                      KS
516-517                      MT
518-519                      ID
520                          WY
521-524, 650-653             CO
Area Numbers: XXX (continued)
#                            STATE
525, 585, 648-649            NM
526-527, 600-601, 764-765    AZ
528-529, 646-647             UT
530, 680                     NV
531-539                      WA
540-544                      OR
545-573, 602-626             CA
574                          AK
575-576, 750-751             HI
577-579                      DC
580                          Virgin Islands
580-584, 596-599             Puerto Rico
586                          Guam
586                          American Samoa
586                          Philippine Islands
700-728                      Railroad Board**
729-733                      Enumeration at Entry
◼ ** Discontinued 7/1/63
◼ 000 will NEVER be a valid XXX number
Group Numbers: YY
◼ Range from 01-99
◼ They’re not allocated
consecutively!
Order Type Range
1st Odd 01 through 09
2nd Even 10 through 98
3rd Even 02 through 08
4th Odd 11 through 99
◼ Highest group issued as of 1/2/08
http://www.ssa.gov/employer/highgroup.txt
◼ Can also trace the allocation of group numbers
over time:
http://www.ssa.gov/employer/ssnvhighgroup.htm
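The non-consecutive allocation order in the table above can be written out as a short generator. This is a Python sketch of the four ranges as listed; the function names are hypothetical, and the "high group" lookup is an illustration of how a published highest-group value could be read, not an SSA tool.

```python
def group_issue_order():
    """Group numbers 01-99 in the order the table lists them."""
    odd_low = range(1, 10, 2)      # 1st: odd, 01 through 09
    even_high = range(10, 99, 2)   # 2nd: even, 10 through 98
    even_low = range(2, 9, 2)      # 3rd: even, 02 through 08
    odd_high = range(11, 100, 2)   # 4th: odd, 11 through 99
    ordered = [*odd_low, *even_high, *even_low, *odd_high]
    return [f"{g:02d}" for g in ordered]


def groups_issued_through(high_group):
    """All groups at or before `high_group` in allocation order."""
    order = group_issue_order()
    return order[: order.index(high_group) + 1]


print(group_issue_order()[:8])
print(groups_issued_through("10"))
```

Under this ordering, a group such as 02 is allocated only after every even group from 10 to 98, so a numerically small group number does not imply an early issuance.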
Serial Numbers: ZZZZ
Last 4 digits
SNs have been issued “in monotonically increasing order” within
each State and within each GN
▪ From 0001 to 9999
However, SSA also writes:
“SSNs are assigned randomly by computer within the confines of
the area numbers allocated to a particular state based on data
keyed to the Modernized Enumeration System” (50, RM00201.060)
…reflecting SSA’s belief that idiosyncrasies in the SSN
application and vetting processes make the SN assignment
effectively random
From A. Acquisti
Abuse Registry
https://apps.health.tn.gov/abuseregistry/
Tennessee
◼ Committing Fraud
1. Birth records → Mother’s Maiden Name
https://sos.tn.gov/products/tsla/birth-records
2. Birthday → Remember those public
records databases?
3. Social Security Number
Ah…
Inside Knowledge?
◼ What about insiders’ information?
◼ Could you steal someone’s SSN?
How would you achieve this feat?
Do you think you would be caught?
Please, please, do not make an attempt.
◼ What if it was in plain sight?
http://www.mc.vanderbilt.edu/root/vumc.php?site=vanderbiltnursing&doc=9352
http://sitemason.vanderbilt.edu/newspub/crmQtG?id=21688
http://osfp.mc.vanderbilt.edu/Policies/Construction%20Identification%20Badge%20and%20Orientation%20Policy%20(12.28.06).pdf
Online Validation
◼ http://www.ssnvalidator.com
Chances of correctly matching SSN digits by random guess, under status quo knowledge
                                         Alaska, 1998           New York, 1998
                                         (a)        (b)         (a)        (b)
No auxiliary knowledge                   0.0014%    0.00014%    0.0014%    0.00014%
Knowledge of state of SSN application    1%         0.1%        0.012%     0.0012%
(a) First 5 digits with 1 guess; (b) All 9 digits with < 1,000 guesses
Adapted from A. Acquisti
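The "knowledge of state" row can be reproduced with simple counting. The Python sketch below is a back-of-the-envelope check, not the original authors' computation. It assumes every group (01-99) and serial (0001-9999) is equally likely, and it takes the area counts from the area number table in these slides (574 for Alaska, 050-134 for New York).

```python
GROUPS = 99      # group numbers 01-99
SERIALS = 9999   # serial numbers 0001-9999


def first5_one_guess(area_count):
    """Chance of guessing area + group in one try, uniform over the state's areas."""
    return 1 / (area_count * GROUPS)


def full_ssn(area_count, guesses=1000):
    """Chance of hitting all nine digits within `guesses` distinct uniform tries."""
    return guesses / (area_count * GROUPS * SERIALS)


AK_AREAS = 1              # 574
NY_AREAS = 134 - 50 + 1   # 050-134, i.e. 85 area numbers

for name, areas in (("Alaska", AK_AREAS), ("New York", NY_AREAS)):
    print(name, f"{first5_one_guess(areas):.4%}", f"{full_ssn(areas):.5%}")
```

The results round to the tabulated 1% and 0.1% for Alaska and 0.012% and 0.0012% for New York, which shows how much a single-area state narrows the search even before any pattern analysis.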
Or SSDI
◼ You could wait until someone dies…
◼ Social Security Death Index (SSDI) Database
http://search.ancestry.com/search/db.aspx?dbid=3693
◼ Death reported to the Social Security
Administration
Possibly submitted by “relation” requesting Social
Security benefits
http://www.ssa.gov/pubs/10084.html
or to stop benefits
Rise of the SSN
◼ Over 80 million records
◼ Data back to 1937, but the majority is after 1962
[Figure: population size by year, 1935-2005 (axis 0 to 3,500,000); series: SSDI, US Pop / 100]
Reasons to believe that the assignment
lacks sufficient randomness
◼ In the last 30 years, SSN issuance has become more
regular
Increasing computerization of the public administration,
including SSA and its various field offices
After 1972, SSN assignment centralized from Baltimore, MD
After 1989, Enumeration at Birth Process (EAB)
◼ Prior to 1989, only small percentage of people received SSN
when they were born
◼ Currently at least 90 percent of all newborns receive SSN via
EAB together with birth certificate
Adapted from A. Acquisti
Hence, two hypotheses
1. Expect SSN issuance patterns to have become
more regular over the years, i.e., increasingly
correlated with an individual’s birthday and
birthplace
This should be detected through analysis of available
SSN data
2. Expect these patterns to have become so regular
that it is possible to infer unknown SSNs based
on the patterns detected on available SSNs
This should be verified by contrasting estimated SSNs
against known SSNs
Adapted from A. Acquisti
Compared to previous knowledge
◼ The SSN assignment scheme follows geographical and chronological
patterns - this is well known
◼ Focused on the inverse, harder, and much more consequential
inference: exploiting the presumptive day and location of SSN
application to predict unknown SSNs
Discovered that the interpretation of the assignment scheme held
outside SSA was wrong, and SSA’s assumption of randomness was
wrong
Adapted from A. Acquisti
Predictions Based on Public Data
The Social Security Administration’s Death Master File
(DMF) is a publicly available database of the SSNs of
individuals who are deceased
▪ More recent and up-to-date than the SSDI
▪ One purpose of making this data available is to combat fraud!
▪ But it can be analyzed to find SSN issuance patterns
Used DMF to find patterns in the issuance of SSNs by
date of birth and State of SSN issuance for deceased
individuals
▪ Sorted records by reported DOB and grouped them by
reported State of issuance
Adapted from A. Acquisti
A DMF record (example)
Name         Birth         Death      Last Residence                   SSN           Issued
JOHN SMITH   21 Jun 1904   Oct 1979   33540 (Zephyrhills, Pasco, FL)   022-10-3459   Massachusetts
Adapted from A. Acquisti
SSN assignment patterns: Two representative States
SSN issuance sequence (MT)
Expected: 516 01 0001, 516 01 0002, …, 516 01 9999, 517 01 0001, 517 01 0002, …, 517 01 9999, 516 03 0001
Observed: 516 01 ????, 517 01 ????, …, 516 01 ????, 517 01 ????, …, 516 03 ????, 517 03 ????
Adapted from A. Acquisti
SSN predictions
1. TEST 1: Used > 500,000 DMF records to detect
patterns in SSN issuance based on birthplace and
state of issuance, and used those patterns to predict
(and verify) individual SSNs in the DMF
2. TEST 2: Mined data from an online social network to
retrieve individuals’ self reported birthdays and
birthplaces, and estimated their SSNs by
interpolating that data with DMF patterns.
1. Verified the estimates against official enrollment data using a protected
(and IRB approved) protocol
Adapted from A. Acquisti
Prediction Approach
◼ Area number
Mode AN in target’s state around target’s birthday
◼ Group number
Mode GN in target’s state around target’s birthday
◼ Serial number
Based on regression coefficients, inserting target’s
birthday as dd
Adapted from A. Acquisti
Success metrics
◼ Accuracy in prediction of the first 5 digits of an
individual’s SSN with 1, 10, 100, and 1000
attempts
Note: 1,000 attempts is equivalent to guessing a 3-digit PIN,
which is very insecure and vulnerable to brute-force attacks
Adapted from A. Acquisti
AN-GN predictability (first 5 digits)
[Figure: AN-GN predictability for CA and ME by year, 1973 to 2003; annotation: EAB starts here (1989)]
Adapted from A. Acquisti
Full SSN predictability with <1,000 attempts
Adapted from A. Acquisti
Test 1: Overall results for DMF records
▪ With a single attempt (first five digits only):
▪ 7% (1973-1988); 44% (1989-2003)
▪ With 10 attempts (complete 9-digit SSNs):
▪ 0.01% (1973-1988); 0.1% (1989-2003)
▪ With 1,000 attempts (complete 9-digit SSNs):
▪ 0.8% (1973-1988); 8.5% (1989-2003)
▪ These are weighted averages; for smaller states and recent
years, prediction rates are higher
▪ (e.g., 1 out of 20 SSNs in DE, 1996, are identifiable with 10 or fewer
attempts)
Adapted from A. Acquisti
Chances of correctly matching SSN digits by random guess, under our algorithm
                                         Alaska, 1998           New York, 1998
                                         (a)        (b)         (a)        (b)
No auxiliary knowledge                   0.0014%    0.00014%    0.0014%    0.00014%
Knowledge of state of SSN application    1%         0.1%        0.012%     0.0012%
Predictions based on the algorithm       94%        58%         30%        3%
(a) First 5 digits with 1 guess; (b) All 9 digits with < 1,000 guesses
Adapted from A. Acquisti
Test 2: From social networks data
to SSNs
▪ Used birthday data of 621 living individuals to predict
their SSNs, based on interpolation with DMF data
▪ Sample: born in 1986-1990 (i.e., mostly before EAB)
▪ In most populous states (i.e., worst case scenario)
▪ Birthday and birthplace data can be obtained from
several sources, but most easily, and in mass
amounts, from online social networks
Adapted from A. Acquisti
The approach, revisited
Name         Birth          Death      Last Residence   SSN           Issued
JOHN DOE     28 July 1987   Nov 2001   94720            022-12-6744   NJ
JOHN FBOOK   14 July 1987                               ???           NJ
JOHN SMITH   1 July 1987    Oct 2005   33540            022-10-4592   NJ
Adapted from A. Acquisti
Facebook estimations
◼ Test 2 results confirmed Test 1 predictions
Overall AN prediction accuracy: 8.5%
Overall GN prediction accuracy: 29.1%
Combined AN-GN prediction accuracy: 6.3%
◼ Compare to corresponding weighted sample in Test 1
(based on DMF data): 11.21%
Adapted from A. Acquisti
Results and extrapolations
◼ Confirms that interpolating SSN data for deceased individuals with birthday data for living individuals can lead to the prediction of the latter’s SSNs
◼ Extrapolating to the living US population, that would imply the identification of around 40 million SSNs’ first 5 digits and almost 8 million individuals’ complete SSNs
Assumes knowledge of birth data
Adapted from A. Acquisti