T5 B64 GAO Visa Docs 3 of 6 Fdr- 3-27-02 GAO- 2nd CLASS Briefing Re Namecheck System 564


Prepared by: Gabrielle Anderson
Date: 4/17/02
Job Code: 320087
Index: Type bundle index here
DOC Library: Type library name here
DOC Number: Type document number here

2nd CLASS Briefing
Record of Interview

Reviewed by: Type reviewer name here
Review Date: Type review date here

Purpose: To find out how the CLASS namecheck system operates
Contact Method: In-person meeting
Contact Place: State Department, Consular Affairs Bureau
Contact Date: March 27, 2002
Participants:
State: Dave Williams, Cathy Baskay

GAO: Judy McCloskey, Jody Woods, Kate Brentzel, Gabrielle Anderson, Richard Hung

We received an explanation of the principal techniques that are used in the CLASS namecheck system. We also learned that the reason for the failure of the original Al-Jiddi namecheck was most likely a country-relationship table that did not take into account a possible country association between Canada and Tunisia. Finally, we discussed the resource issues that would be involved if biometrics were introduced into the system.

Architectural Concepts of Name Searching

Mr. Williams informed us that there are five basic questions that need to be answered when constructing a namecheck system.

The first question is whether or not each namecheck query will consult all the records in the system. In the case of CLASS, there would clearly not be enough capacity for all 6 million records in the system to be scanned during one namecheck, especially with many namechecks being conducted each day.

Therefore, the second question involves determining the criteria for establishing which subset of these 6 million records is to be searched. This determination is referred to as Phase I.



The third question to be answered concerns what techniques (linguistic and logical) are to be used to evaluate that particular subset of records. This is considered Phase II.

The fourth question concerns the criteria required to constitute a "hit," i.e., what is considered a close enough match to be returned as a hit. As the subset of records is being evaluated, each record receives points based on how close a match it is. How many points must a record receive in order to be considered a legitimate hit?

The fifth question concerns the way in which the resulting hit list will be ordered, e.g., with exact name matches first, or with CLASS I hits first.
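The five questions above can be read as the stages of a search pipeline. The following is a minimal sketch under stated assumptions, not a description of how CLASS is implemented: the names `Record`, `namecheck`, `same_initial`, and `shared_letter_score` are hypothetical, and the subset rule and scoring rule are toy stand-ins chosen only to make the stages visible.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Record:
    surname: str


def namecheck(
    query: Record,
    records: Sequence[Record],
    in_subset: Callable[[Record, Record], bool],
    score: Callable[[Record, Record], float],
    threshold: float,
) -> List[Tuple[float, Record]]:
    # Questions 1-2 (Phase I): search a subset rather than every record.
    subset = [r for r in records if in_subset(query, r)]
    # Question 3 (Phase II): evaluate each record in the subset.
    scored = [(score(query, r), r) for r in subset]
    # Question 4: keep only records with enough points to count as a hit.
    hits = [(s, r) for s, r in scored if s >= threshold]
    # Question 5: order the hit list, here exact surname matches first,
    # then by descending score (one possible ordering among many).
    hits.sort(
        key=lambda h: (h[1].surname.upper() == query.surname.upper(), h[0]),
        reverse=True,
    )
    return hits


# Toy stand-ins (hypothetical, far simpler than any real rule).
def same_initial(query: Record, rec: Record) -> bool:
    return query.surname[:1].upper() == rec.surname[:1].upper()


def shared_letter_score(query: Record, rec: Record) -> float:
    a, b = set(query.surname.upper()), set(rec.surname.upper())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


records = [Record("Willson"), Record("Wang"), Record("Wilson"), Record("Smith")]
hits = namecheck(Record("Wilson"), records, same_initial, shared_letter_score, 0.5)
print([r.surname for _, r in hits])
```

Lowering `threshold` widens the hit list, which illustrates the trade-off raised by the fourth question: a looser threshold finds more near matches but returns more records to review.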

Namechecking Techniques

Mr. Williams ran through some of the techniques that may be used in order to run a namecheck system. He stressed that no system will use just one of these techniques and that each technique should be considered as a tool. Since each technique has its strengths and weaknesses, a good namecheck system will combine a variety of them in order to achieve the best possible results. He also stressed that visa adjudication involves a good deal of subjective decision-making on the part of the consular officer.

Also, the information that is required when performing a CLASS namecheck is surname and gender. Additional information for a namecheck is preferred, but not required, e.g., estimated date of birth, country of birth and first name.

Name Compression:
This technique takes the first letter of a surname, drops all its vowels and reduces any double consonants to a single consonant. The system will then return any surnames that fit this pattern. Its strength lies in the fact that it is fast and precise. However, this technique produces near misses if a surname is spelled slightly differently (e.g., reducing Gutierrez to GTRZ would miss Gutierres, which would be compressed to GTRS). If, in an attempt to account for these near misses, you were to require that the system return all matches within one character, you would pull in far too many hits (since there is a maximum of 6 characters in a compressed name). Another weakness of this technique is that it does not work well on short names, e.g., Lee.
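The compression rule described above can be sketched in a few lines. This is an illustration under assumptions rather than the CLASS implementation: the function name `compress_surname` is hypothetical, Y is treated as a consonant, doubled consonants are collapsed as they appear in the original spelling before vowels are dropped, and the result is truncated to the 6-character maximum mentioned in the text.

```python
VOWELS = set("AEIOU")  # assumption: Y counts as a consonant


def compress_surname(surname: str, max_len: int = 6) -> str:
    """Keep the first letter, drop later vowels, collapse doubled consonants."""
    letters = [c for c in surname.upper() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0]]          # the first letter is always kept
    prev = letters[0]
    for ch in letters[1:]:
        if ch == prev and ch not in VOWELS:
            continue            # second half of a doubled consonant
        prev = ch
        if ch in VOWELS:
            continue            # vowels after the first letter are dropped
        out.append(ch)
    return "".join(out)[:max_len]


print(compress_surname("Gutierrez"))  # GTRZ
print(compress_surname("Gutierres"))  # GTRS
print(compress_surname("Lee"))        # L
```

The last call shows the short-name weakness: Lee compresses to a single letter, which carries almost no discriminating information.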

Synonym Association:
This technique can be used with several of the namecheck fields. For example, synonym association can be used to establish a relationship between the name "Joe" and its derivations such as Joey, Jose, Joseph, Giuseppe, etc. Thus, a search for Joe would turn up not only persons with this exact first name but also those whose name was one of these derivatives. In the case of country, Russia has been equated with all of the former Soviet republics, so that a search for "Russia" will result in initial hits on any of the current independent republics, e.g., Azerbaijan, Belarus, Estonia, Georgia, etc. Additionally, the synonym association technique


ensures that surname qualifiers (e.g., Van, De, Al, etc.) are separated out when namechecks are performed.

N-Gram Analysis:
The bi-gram analysis breaks down a surname two letters at a time. For example, Gutierrez is broken down into "_G, GU, UT, TI, IE, ER, RR, RE, EZ, Z_." This particular technique compares the bi-grams for the desired name with the bi-grams for all other names in the data subset. At present in CLASS, if half of the bi-grams in a particular name match the bi-grams in the desired name, then this name is returned as a hit. However, the level required to return a hit based upon this bi-gram analysis can be changed. The same is true for tri-gram analysis, which is identical except that it breaks down a surname into three-letter components. The strength of the N-gram technique is that it is highly tunable, but its weakness lies in the fact that it has a low level of discrimination. Hence, the N-gram analysis is a coarse method, one that is used to develop subsections of data rather than to produce the desired "hit."

Position Discounting:
This technique allows you to determine how many of the bi-gram or tri-gram hits fall into the same position as they do in the desired name. For example, a namecheck on "Wilson," using a simple bi-gram analysis, would return "Sonils" as a hit (since 4 of the 7 bi-grams in these names match). However, when position discounting is used along with the bi-gram analysis, "Sonils" is rejected as a hit, since none of the matching bi-grams in "Sonils" occupy the same positions as they do in "Wilson."

Component Comparison:
This technique assigns a value to surname endings based on the likelihood that a surname with a particular ending belongs to someone from a particular country. For example, the Russian surname ending "-ichna" is assigned a value of 0.93, indicating that there is a 93% likelihood that a person whose surname ends in "-ichna" is from a Russian-speaking or Slavic country. It is then clear that the most appropriate algorithm to use is the Russian/Slavic algorithm.

Another component comparison technique to determine the appropriate algorithm is the tri-gram probability table. In this table, all the possible tri-gram combinations in the alphabet (from "_AA" to "ZZ_") are listed, along with percentages that indicate to which linguistic algorithm they are likely to belong. For example, with the tri-gram "MAS," there is a 38.5% likelihood that a name containing this tri-gram will be Russian/Slavic and a 46.9% likelihood that it will be Arabic. This is a tool to select which algorithms to apply in each namecheck case.

Cultural Regularization:
This technique involves transliterating a name from its foreign alphabet spelling into the many forms it could take using the Roman alphabet.

(e.g., Qadafi, Khadafi, Cadhafi, etc.) This ensures that one spelling of a name will turn up other versions of the same name, provided that all possible spellings for that Arabic name have been entered.

Letter Based Re-Write Rules:
This is an alternative way of addressing the issue of names with multiple transliterations. This technique tries to regularize all spellings of a name into a single entry. It does so by assigning a standard spelling to the phonetic sounds that make up the name. For example, the system will convert Mafouz, Mahfoudh, and Mehfouth into Mahfouz for searching purposes. Letter based re-write rules are currently being used for Arabic names. Both the strength and the weakness of this technique lie in its global reach. Although the technique prevents you from having to enter in every possible spelling of a name, it is also likely to pull in a vast number of hits (e.g., with Arabic or Hispanic names) precisely because the system recognizes only one version of the name.

Phonetic Transcription:
This particular technique assigns a phonetic spelling to every name, e.g., 'Stephen' becomes 'Steven.' This is useful because, when presented with unfamiliar names, people tend to spell phonetically. Many names received from the intelligence community are spelled phonetically since they are often names that are overheard. However, the use of phonetic transcription, which is tonal in nature, may require significant manual oversight.

Edit Distance Algorithm:
This technique measures how many edits are necessary to change a name in the system into the desired name, i.e., what it takes to make the two names equal. For example, if you enter "Waldmirr," the edit-distance algorithm will take this name and compare it to a name in the system such as Vladimir. It will determine how many edits need to be done in order to change Waldmirr into Vladimir. In this case, there are 4 edit operations that need to take place: substitution (of 'V' for 'W'); insertion (of the middle 'I' in Vladimir); deletion (of the extra 'R' in Waldmirr); and reversal (of the 'AL' to 'LA').
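The count of 4 edits in the Waldmirr example can be reproduced with a standard textbook method. The sketch below uses the optimal string alignment variant of Damerau-Levenshtein distance, which counts substitutions, insertions, deletions, and swaps of adjacent letters. This is a stand-in chosen for illustration: the briefing does not say which formula CLASS uses, and the function name `osa_distance` is hypothetical. The sketch counts edits only and does not weigh where in the name they occur.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit count allowing substitution, insertion, deletion, adjacent swap."""
    a, b = a.upper(), b.upper()
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = edits needed to turn the first i letters of a
    # into the first j letters of b.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Reversal of two adjacent letters, e.g. 'AL' -> 'LA'.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("Waldmirr", "Vladimir"))  # 4
```

The table has one cell per pair of letter positions, so the cost grows with the product of the two name lengths for every comparison.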
Next, the technique looks at the positions of these changes within the two names and assigns values to the distances between them. Using a formula to assess both the number of edits and the distances between them in the two names, the namecheck system will return Vladimir as a hit for Waldmirr. However, if the bi-gram method were used on this particular example, the name Vladimir would not have been returned as a hit. The edit-distance algorithm is a very strong technique; it is, in fact, the primary technique used in spellcheckers. Its weakness is that it is machine-intensive.

Al-Jiddi Namecheck

We asked Mr. Williams about the Al-Jiddi namecheck done earlier this year by the U.S. Consulate in Montreal. They ran a namecheck on Al-Jiddi, a known Al-Qaeda terrorist, entering his known name, country of birth, estimated date of birth, and current nationality. This did not result in a hit. Only after country of birth and nationality were left blank did the system return a CLASS II hit for Al-Jiddi.


Mr. Williams gave the likely reason for this. When setting up the namecheck system as a whole, one of the first problems that must be addressed is establishing the criteria that will determine which records (out of 6 million) will be checked. This is Phase I of the search, i.e., when CLASS establishes a searchable subset of the 6 million total names. One of the most important criteria used in Phase I is the country field. In Phase I, the country field is analyzed using country-relationship tables. These tables indicate the likelihood that a person from the country entered in the search will also possess biographical data from another country. The country-relationship tables in CLASS do not indicate that a person of Canadian citizenship is likely to have a Tunisian background. Hence, Al-Jiddi's record was thrown out in Phase I, i.e., it was not included in the subset of names that were then searched. Once the country fields were left blank, the country-relationship tables were not used to establish a subset and therefore Al-Jiddi's record was returned as a hit.

However, Mr. Williams mentioned that attempting to fix a problem, such as that posed by the Al-Jiddi namecheck, could have unintended consequences. Re-establishing the threshold for the subset may pull in Al-Jiddi's record but may very well pull in a great many more records that will also have to be examined.

Country-Relationship Tables

In terms of establishing these country-relationship tables in the first place, Mr. Williams stated that they rely on officers in the field to report back to the Visa Office on migration patterns (which determine country associations). Based on this new information, the Visa Office can adjust the table relationships. These country relationships do not have to be reciprocal. The last time such an adjustment took place was under John Brennan's predecessor.

CLASS

There are about 4 major CLASS releases each year, e.g., screen changes, table changes, or new algorithms. Posts have access to the same algorithms that exist at headquarters.
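Since table changes determine which records survive Phase I, a small sketch may help show why the original Al-Jiddi search failed and the blank-country search succeeded. Everything here is an assumption made for illustration: the names `ASSOCIATIONS` and `phase_one_subset`, the likelihood value, the cutoff, and the placeholder surnames are invented, and the actual table contents were not provided at the briefing.

```python
from typing import Dict, List, Optional

# Directed table: searched country -> {associated country: likelihood}.
# Values are invented placeholders, not CLASS data.
ASSOCIATIONS: Dict[str, Dict[str, float]] = {
    "Canada": {},                # no Tunisia entry, as in the case described
    "Tunisia": {"Canada": 0.2},  # relationships need not be reciprocal
}


def phase_one_subset(
    records: List[dict],
    searched_country: Optional[str],
    cutoff: float = 0.1,
) -> List[dict]:
    """Return the records that Phase II would go on to evaluate."""
    if not searched_country:
        # Country left blank: the table is not used to narrow the subset.
        return list(records)
    related = ASSOCIATIONS.get(searched_country, {})
    keep = {searched_country} | {c for c, p in related.items() if p >= cutoff}
    return [r for r in records if r["country"] in keep]


records = [
    {"surname": "SURNAME_A", "country": "Tunisia"},
    {"surname": "SURNAME_B", "country": "Canada"},
]

print(len(phase_one_subset(records, "Canada")))  # 1: Tunisian record dropped
print(len(phase_one_subset(records, None)))      # 2: blank country keeps both
```

Adding a Canada-to-Tunisia entry or lowering `cutoff` would keep the first record, at the price of a larger subset for every search that enters Canada, which is the unintended consequence Mr. Williams described.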
The algorithms currently running in CLASS are: Russian/Slavic; Arabic; Hispanic; generic; date of birth; and country of birth. Linguistics teams usually put together four groups of names to test the various algorithms, but it is important to note that they cannot test outliers.

Mr. Williams mentioned that on April 22nd, there would be a 4-day CLASS course for mid-level and senior consular officers and visa managers, though he admitted that the course might be of some interest to junior officers as well. The focus of the course would be on the Arabic language namecheck. Since this course was just starting up, there were still many questions surrounding it.

The CLASS back-up system is known as BNS. When BNS is in use, posts can make local updates on their local BNS system. But global changes to BNS, i.e., incorporating the changes made at individual locations


worldwide, are compiled at headquarters and sent out to posts once a month.

Mr. Williams also noted that there is a new NIV system that is currently in the beta-testing phase. It will be piloted in London.

Biometrics

Mr. Williams viewed biometrics as another tool to use in conducting a comprehensive security check. The use of biometrics would be a move toward the development of an identity system, rather than simply a namecheck system. An individual would have to be much more intelligent to foil an identity system.

Mr. Williams asserted that, despite vendor claims to the contrary, facial recognition techniques are not especially successful. At present, both facial recognition and fingerprinting run on very limited databases. If either of these techniques were to become part of a standard identity check, there would have to be a significant increase in resources to accommodate the millions of new records. In checking fingerprints, for example, a turn-around time of a few seconds would be needed. At present, a fingerprint inquiry sent to the FBI takes 24-48 hours. The introduction of biometrics would also have a significant impact on operations at post. Consular officers want to be able to adjudicate a visa application in the course of one day, or in as little time as possible.

Documents

We would like to obtain copies of the country-relationship tables used in CLASS.
