Entity Resolution

Many databases contain imprecise references to real-world entities. For example, a social network database records the names of real people, but several people may share the same name, and different names may refer to the same person. In general, there may be many references to the same real-world entity. The goal of the entity resolution problem is to discover the unobserved entities and cluster the database references according to the entities they refer to. Traditionally, entities are resolved on the basis of the attributes of individual references. However, in many domains, such as social networks and academic circles, the underlying entities exhibit strong ties to each other, and as a result their references often co-occur in the data. We focus on using such co-occurrence relationships between references for collective entity resolution, in which the entities for related references are determined jointly.

We explore different techniques for solving the collective entity resolution problem. We have designed a relational clustering algorithm in which references are iteratively clustered into entities, taking into account the clusters of co-occurring references. We show that this approach locally minimizes a cut-based clustering cost that considers the co-occurrence relations in addition to the similarity between references. In addition, we have proposed a probabilistic generative model for co-occurring references that uses Latent Dirichlet Allocation to find hidden group structures among the domain entities as evidence for resolving entities. We have developed an efficient unsupervised inference algorithm for this model using Gibbs sampling techniques. We show that both of these approaches improve performance over attribute-only baselines on multiple real-world and synthetic datasets. Our algorithm also ranked among the top entries in the government-sponsored KDD Challenge Competition that was organized by IBM EAS in August 2005 and involved participants from companies and other top-tier universities.
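The idea behind relational clustering can be illustrated with a small sketch. This is not the project's code: it is a minimal greedy agglomerative loop, assuming a toy name similarity, a Jaccard overlap of neighbouring clusters as the relational similarity, and a linear combination weight `alpha`. The function names, the scoring values and the example references are all hypothetical.

```python
from itertools import combinations


def name_similarity(n1, n2):
    """Toy attribute similarity (an assumption, not the project's measure)."""
    f1, l1 = n1.split()[0].rstrip(".").lower(), n1.split()[-1].lower()
    f2, l2 = n2.split()[0].rstrip(".").lower(), n2.split()[-1].lower()
    if l1 != l2:
        return 0.0                      # last names must agree
    if len(f1) > 1 and f1 == f2:
        return 1.0                      # identical full first names
    if f1[0] == f2[0] and 1 in (len(f1), len(f2)):
        return 0.5                      # an initial compatible with the other name
    return 0.0


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relational_clustering(refs, alpha=0.5, threshold=0.4):
    """refs maps ref_id -> (name, paper_id). Returns a list of sets of ref ids."""
    cluster_of = {r: r for r in refs}   # every reference starts as its own cluster
    members = {r: {r} for r in refs}
    papers = {}
    for r, (_, paper) in refs.items():
        papers.setdefault(paper, set()).add(r)

    def neighbours(c):
        # Clusters of the references that co-occur with any reference in c.
        return {cluster_of[o] for r in members[c]
                for o in papers[refs[r][1]] if o != r}

    def similarity(c1, c2):
        attr = max(name_similarity(refs[a][0], refs[b][0])
                   for a in members[c1] for b in members[c2])
        rel = jaccard(neighbours(c1), neighbours(c2))
        return (1 - alpha) * attr + alpha * rel

    while True:
        scored = ((similarity(a, b), a, b)
                  for a, b in combinations(sorted(members), 2))
        best = max(scored, key=lambda t: t[0], default=None)
        if best is None or best[0] < threshold:
            break                       # closest pair is below the threshold
        _, a, b = best
        for r in members[b]:
            cluster_of[r] = a
        members[a] |= members.pop(b)    # merge b into a
    return list(members.values())


# Hypothetical references: two papers by Wang and Ansari, one by another W. Wang.
EXAMPLE_REFS = {
    "w1": ("Wei Wang", "p1"), "a1": ("Abdul Ansari", "p1"),
    "w2": ("W. Wang", "p2"), "a2": ("Abdul Ansari", "p2"),
    "w3": ("W. Wang", "p3"), "l3": ("Chen Li", "p3"),
}
```

On `EXAMPLE_REFS`, the two "Abdul Ansari" references merge first on attribute evidence alone. That merge gives "Wei Wang" and the "W. Wang" on paper p2 a shared neighbouring cluster, so they merge next, while the "W. Wang" on paper p3, which has no collaborator in common, stays separate.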

In addition to collective resolution over an entire database, we have investigated the problem of query-centric entity resolution. We have shown that queries can be collectively resolved by recursively exploring and resolving related references. However, collective resolution at query-time is computationally challenging since this recursive approach can span a very large number of references. We have proposed an unsupervised algorithm for adaptively selecting the most informative of the related references for a query. Using this adaptive strategy, queries that otherwise take several minutes to resolve can be answered in seconds, while still preserving the accuracy of collective resolution.
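The recursive exploration described above can be sketched as alternating two expansion steps: pull in the references that co-occur with the current ones, then pull in other references that share a name with those. The sketch below is an illustration under stated assumptions, not the project's algorithm. In particular, exact name equality stands in for name matching, and the adaptive step uses a hypothetical heuristic: it keeps the `k` co-occurring references whose names are rarest in the database, on the assumption that less ambiguous names are more informative.

```python
def expand_query(query_name, refs, depth=2, k=None):
    """Collect references relevant to a query name.

    refs maps ref_id -> (name, paper_id). With k=None every co-occurring
    reference is followed; with an integer k only the k references with
    the rarest names are followed at each level (hypothetical heuristic).
    """
    by_name, by_paper = {}, {}
    for r, (name, paper) in refs.items():
        by_name.setdefault(name, set()).add(r)
        by_paper.setdefault(paper, set()).add(r)

    relevant = set(by_name.get(query_name, ()))
    frontier = set(relevant)
    for _ in range(depth):
        # Step 1: references that co-occur with the current frontier.
        co = set()
        for r in frontier:
            co |= by_paper[refs[r][1]]
        co -= relevant
        if k is not None:
            ranked = sorted(co, key=lambda r: (len(by_name[refs[r][0]]), r))
            co = set(ranked[:k])
        # Step 2: other references that share a name with the selected ones.
        same = set()
        for r in co:
            same |= by_name[refs[r][0]]
        same -= relevant | co
        relevant |= co | same
        frontier = same
        if not frontier:
            break
    return relevant
```

Without a limit, the relevant set can grow quickly with each level of recursion, which is the cost the adaptive strategy is meant to contain. Restricting each level to a few references keeps the set that has to be resolved small.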



  • CiteSeer: The CiteSeer dataset contains 1,504 machine learning documents with 2,892 author references to 1,165 author entities. For this dataset, the only attribute information available for authors is the name. The full last name is always given; in some cases the author's full first and middle names are given, and in others only the initials. The dataset was originally created by Giles et al., and the version we use includes the author entity ground truth provided by Aron Culotta and Andrew McCallum, University of Massachusetts, Amherst. We have performed further cleaning on it.
  • arXiv: The arXiv dataset describes high energy physics publications. It was originally used in KDD Cup 2003. It contains 29,555 papers with 58,515 references to 9,200 authors. The only attribute information available for this dataset is again the author name, with the same variations in form as described above. The author entity ground truth for this dataset was provided by David Jensen, University of Massachusetts, Amherst. We have performed further cleaning on it, extracted the information relevant to entity resolution, and put it in the same format as the CiteSeer data.


  • Synthetic Data Generator: We designed a generator for noisy references with co-occurrence relationships between them. The generator allows the user to control several characteristics of the data in a systematic and flexible way, including the degree of collaboration between the underlying entities, the size of the co-occurrence relationships, and the ambiguity of entity attributes and relationships. Experiments on synthetic data enabled us to reason beyond specific datasets, to understand the impact of different structural properties of the data on collective resolution, and to empirically verify our performance analysis for relational clustering in general. The generated data is in the same format as CiteSeer.
  • Relational Clustering: The relational clustering code currently reads in reference data in the CiteSeer format described above, performs 'blocking' to identify potential duplicate references, initializes reference clusters using bootstrapping, and then iteratively merges clusters considering both attribute and relational similarity until the similarity of the closest pair drops below a threshold. All parameters, such as the termination threshold, the attribute and relational similarity measures to be used, and the combination weight, can be specified as command line arguments to the executable. This is a pre-alpha version of the code. Watch this page for updates!
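
The 'blocking' step mentioned above avoids comparing every pair of references by grouping them under a cheap key first. The following is a minimal sketch and not the released code: it assumes a key made of the last name and first initial, which suits author names where the last name is always given in full, and the names `blocking_key` and `candidate_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(name):
    """Cheap key: (last name, first initial), both lower-cased.

    Assumes a non-empty name whose last token is the full last name.
    """
    parts = name.replace(".", " ").split()
    return (parts[-1].lower(), parts[0][0].lower())


def candidate_pairs(refs):
    """refs maps ref_id -> name. Returns the set of pairs worth comparing."""
    blocks = defaultdict(list)
    for ref_id, name in refs.items():
        blocks[blocking_key(name)].append(ref_id)
    pairs = set()
    for ids in blocks.values():
        # Only references that fall in the same block are potential duplicates.
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Only the pairs returned by `candidate_pairs` would then be scored by the more expensive attribute and relational similarity measures.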

Entity Resolution Resources on the Web

  • RIDDLE, maintained by Misha Bilenko, is an excellent web directory listing people, papers and datasets.