TITLE:
RECORD LINKAGE THEORY (No. 27)
DATE:
Friday, November 14th, 2003
TIME:
3:30 PM
LOCATION:
GMCS 214
SPEAKER:
Farid Mehovic, EntrePro Corporation, La Jolla, CA
ABSTRACT:
Current database systems allow only semi-deterministic search of data, which means that the information being searched – the supplied search keys – must exist in the database in the exact form specified. The “semi” part of this statement addresses (a) searches where parts of the string are missing, as in String LIKE ‘%JOE%’, and (b) searches based on a primitive form of sound mapping, as in SOUNDEX(String) = SOUNDEX(‘JOE’). If, for instance, there is a misspelling in either the input data or in what is stored in the database, a match is not possible. Transposed digits or letters, typos, missing characters, missing fields, and so on have the same effect: the data is not searchable.
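To make these failure modes concrete, here is a minimal sketch in Python (the sample names and the simplified Soundex implementation are illustrative assumptions, not material from the talk). A single wrong first letter defeats exact matching, LIKE-style substring matching, and Soundex alike, while Soundex does catch the spelling variants it was designed for – which is exactly why it is only a primitive form of sound mapping:

# Minimal illustrative sketch (hypothetical data): how deterministic and
# Soundex-based search behave on dirty data.

def like_contains(haystack: str, needle: str) -> bool:
    """Rough analogue of SQL: haystack LIKE '%needle%'."""
    return needle in haystack

def soundex(s: str) -> str:
    """Simplified 4-character Soundex code, e.g. soundex('JOE') == 'J000'."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    s = s.upper()
    out, prev = s[0], codes.get(s[0], "")
    for ch in s[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]

stored = "KATHERINE"                 # misspelling stored in the database
query = "CATHERINE"

print(stored == query)               # False: exact match fails
print(like_contains(stored, query))  # False: LIKE '%CATHERINE%' fails
print(soundex(stored), soundex(query))       # K365 vs C365: Soundex fails too
print(soundex("SMYTH") == soundex("SMITH"))  # True: it does catch this variant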
Record linkage theory addresses this problem. It allows what’s called a probabilistic record search, or probabilistic record matching. It uses relatively uncomplicated probability theory to find the best fit of the input search information to the existing residence database, and it then determines, based on thresholds, whether this best fit should be declared an actual match. The background parameters needed for this search consist primarily of information on the quality of the fields being used, such as the probability that a given field contains errors.
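The following Python sketch illustrates the kind of probability calculation involved, in the spirit of the classical Fellegi-Sunter formulation of record linkage; the field parameters and thresholds are invented for the example, not calibrated values, and the talk itself may use a different formulation:

import math

# For each field we need two background parameters:
#   m = P(fields agree | records truly match)     -- how error-prone the field is
#   u = P(fields agree | records do not match)    -- how often agreement is chance
# All numbers below are illustrative assumptions.

FIELD_PARAMS = {
    # field    (m,    u)
    "last":   (0.95, 0.01),
    "first":  (0.90, 0.02),
    "street": (0.85, 0.005),
    "zip":    (0.98, 0.05),
}

def match_weight(query: dict, candidate: dict) -> float:
    """Sum of per-field log-likelihood weights: log2(m/u) on agreement,
    log2((1-m)/(1-u)) on disagreement."""
    w = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        if query.get(field) == candidate.get(field):
            w += math.log2(m / u)
        else:
            w += math.log2((1 - m) / (1 - u))
    return w

def classify(weight: float, upper: float = 10.0, lower: float = 0.0) -> str:
    """Declare a match only above an upper threshold; weights between the
    thresholds would go to clerical review in a production system."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"

query     = {"last": "SMITH", "first": "JOSEPH", "street": "12 ELM", "zip": "92037"}
candidate = {"last": "SMITH", "first": "JOESPH", "street": "12 ELM", "zip": "92037"}
w = match_weight(query, candidate)
print(round(w, 2), classify(w))   # high weight despite the first-name typo

Note how the typo in the first name merely lowers the total weight instead of ruling the record out, which is precisely what the deterministic searches above cannot do.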
Although the mathematics in this approach is not complex, there is no elegant solution to this kind of search that is both maximally accurate and has adequate response times. Because of this, and because of a lack of funding – which I believe is mostly due to the relatively small market so far, as well as strict intellectual property protection policies – academia has done little work on this problem. Most of the work has been done by analysts and experts who work in the “trenches” – in various data consolidation and transformation projects with large databases and data warehouses.
The following is a critical question that is both mathematically much more challenging (worthy, perhaps, of doctoral dissertation research) and potentially very rewarding (it could introduce an extremely powerful search algorithm into the mainstream database industry): Can the probabilistic search be mapped into such a linear algorithm that it can use the underlying traditional database indexing schemes (such as B-trees), while maintaining accuracy at, or at least very close to, the optimum? The second question is: Could research into this feasibility lead to such an improvement in record linkage theory as to very significantly advance search accuracy at very fast performance? Both of these questions may yield quite a significant reward, first in the ‘data cleansing’ industry, which is growing rapidly as the Internet brings data quality to the forefront, and then in the relational database industry, should the technology prove feasible.
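For contrast with that open question, here is a sketch of the standard practical compromise known as blocking: derive a coarse, exactly-indexable key per record so that an ordinary index lookup narrows the candidate set, then score only those candidates probabilistically. It reuses soundex() and match_weight() from the earlier sketches; the key choice, the sample records, and the in-memory dict standing in for a B-tree are illustrative assumptions. Blocking trades accuracy for speed rather than resolving the trade-off the question asks about:

from collections import defaultdict

def blocking_key(rec: dict) -> str:
    """Coarse key: sound code of the last name plus a ZIP prefix."""
    return soundex(rec["last"]) + ":" + rec["zip"][:3]

# Hypothetical residence records standing in for the database.
database = [
    {"last": "SMITH", "first": "JOESPH", "street": "12 ELM", "zip": "92037"},
    {"last": "SMYTH", "first": "ANNE",   "street": "9 OAK",  "zip": "92122"},
]

index = defaultdict(list)            # stand-in for a B-tree index on the key
for rec in database:
    index[blocking_key(rec)].append(rec)

def probabilistic_search(query: dict):
    """One index lookup instead of a full scan; accuracy is lost whenever
    the true match falls outside the query's block."""
    candidates = index[blocking_key(query)]
    scored = [(match_weight(query, r), r) for r in candidates]
    return max(scored, key=lambda t: t[0], default=None)

query = {"last": "SMITH", "first": "JOSEPH", "street": "12 ELM", "zip": "92037"}
print(probabilistic_search(query))   # best fit; still subject to the thresholds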
HOST:
Teresa Larsen