Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik

Presented by Bryan Wilhelm

Problem Description

A single entity may be referenced in separate records in textually dissimilar ways. For example, “Robert” and “Bob”.

Traditional text similarity functions such as edit distance and the Jaccard coefficient cannot handle these cases.
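To make the limitation concrete, the following minimal sketch computes both measures for “Robert” and “Bob”. The function names `levenshtein` and `jaccard` are illustrative, and the Jaccard coefficient is assumed here to be taken over whitespace-separated, lowercased tokens; other tokenizations are possible.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def jaccard(x: str, y: str) -> float:
    """Jaccard coefficient over lowercased whitespace tokens (an assumption)."""
    tx, ty = set(x.lower().split()), set(y.lower().split())
    if not tx and not ty:
        return 1.0
    return len(tx & ty) / len(tx | ty)


print(levenshtein("robert", "bob"))  # 4 edits to change a 6-letter string
print(jaccard("Robert", "Bob"))      # 0.0: the token sets share nothing
```

An edit distance of 4 on a six-letter name and a Jaccard coefficient of 0 both suggest a non-match, even though the two names can refer to the same person.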

Current research is looking at string transformation databases.

These databases can be extremely large.

Solution: Definitions

Rule Application

Example: {Olathe→Olathe, 7, 4}

Alignment

Rule applications cannot overlap.

Order does not matter.
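The two alignment properties can be sketched in code. This is only an illustration under stated assumptions: a rule application is modeled as a rule plus one position in each string, the positions are assumed to be 0-based token indices, and the names `RuleApplication`, `overlaps`, and `is_alignment` are hypothetical. Because order does not matter, an alignment is represented as a set.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class RuleApplication:
    lhs: tuple  # tokens of the rule's left side, matched in the first string
    rhs: tuple  # tokens of the rule's right side, matched in the second string
    i: int      # assumed 0-based token position in the first string
    j: int      # assumed 0-based token position in the second string

    def x_span(self) -> range:
        return range(self.i, self.i + len(self.lhs))

    def y_span(self) -> range:
        return range(self.j, self.j + len(self.rhs))


def overlaps(a: RuleApplication, b: RuleApplication) -> bool:
    """True if the two applications share a token position in either string."""
    x_shared = set(a.x_span()) & set(b.x_span())
    y_shared = set(a.y_span()) & set(b.y_span())
    return bool(x_shared or y_shared)


def is_alignment(apps: frozenset) -> bool:
    """A set of rule applications is an alignment if no two of them overlap."""
    return not any(overlaps(a, b) for a, b in combinations(apps, 2))


first = RuleApplication(("robert",), ("bob",), 0, 0)
second = RuleApplication(("street",), ("st",), 3, 3)
clash = RuleApplication(("robert", "smith"), ("r", "smith"), 0, 0)

print(is_alignment(frozenset({first, second})))  # True
print(is_alignment(frozenset({first, clash})))   # False
```

Using a `frozenset` means that `{first, second}` and `{second, first}` are the same alignment, which mirrors the statement that order does not matter.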

Coverage

Solution: Algorithm

Record Matching Application

Generating Example Pairs

Traditional text matching methods (such as the Jaccard coefficient) are used.

Input from domain experts could also be considered but this is expensive.

A few incorrect pairs will not affect the end result.
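One way to picture the pair-generation step is a threshold on token-level Jaccard similarity. The sketch below is an assumption-laden illustration, not the presented method: the function name `candidate_pairs`, the sample records, and the threshold of 0.5 are all hypothetical, and the quadratic loop is only suitable for small inputs.

```python
from itertools import combinations


def jaccard(x: str, y: str) -> float:
    """Jaccard coefficient over lowercased whitespace tokens (an assumption)."""
    tx, ty = set(x.lower().split()), set(y.lower().split())
    if not tx and not ty:
        return 1.0
    return len(tx & ty) / len(tx | ty)


def candidate_pairs(records: list, threshold: float) -> list:
    """Return index pairs of records whose Jaccard similarity meets the threshold."""
    return [
        (a, b)
        for a, b in combinations(range(len(records)), 2)
        if jaccard(records[a], records[b]) >= threshold
    ]


# Hypothetical sample records for illustration only.
records = [
    "Robert Smith 12 Main St Olathe",
    "Bob Smith 12 Main St Olathe",
    "Alice Jones 9 Elm Ave",
]

print(candidate_pairs(records, 0.5))  # [(0, 1)]
```

The first two records share five of seven distinct tokens, so they are paired even though “Robert” and “Bob” do not match. Pairs like this are the kind of example from which a transformation between the two names could be learned.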

Validation of Transformations

All approaches involve confirmation by a domain expert.

Analysis
