Wednesday, January 9, 2013

Data mining and entity resolution: some tools

This is a review of some tools I came across while looking for a way to standardize and resolve a list of affiliations from Web of Science (WOS).

Some of the tools proved not that useful for my purposes, but maybe someone else can benefit from them.


Calais Web Service by Thomson Reuters. The web service is an API that accepts unstructured text (like news articles, blog postings, etc.), processes them using natural language processing and machine learning algorithms, and returns RDF-formatted entities, facts and events.
OpenCalais supports three types of entity disambiguation: Company disambiguation, Geographical disambiguation and Product (Electronics) disambiguation.
Company disambiguation determines, for example, whether the company Olympus refers to Olympus Optical Co. Ltd. or Olympus Life and Material Science Europa. The resolution output for a given company mention includes:
  • A URI that is unique and uniform across documents
  • The formal English legal name of the company
  • The company's ticker symbol (for public companies)
For company names that cannot be disambiguated, the returned results will include no resolution information.

To use OpenCalais without the API:

A test of the disambiguation feature on a sample of WOS affiliations showed insufficient performance, since semantic search works better on fuzzy data (i.e. web pages) than on lists of names.

NOTE: when run after installation, it may return this error:
Error parsing c:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\config\machine.config
Parser returned error 0xC00CE556
To fix it, rename the file C:\Windows\Microsoft.NET\Framework\v2.0.50727\CONFIG\machine.config
and recreate it by copying from C:\Windows\Microsoft.NET\Framework\v2.0.50727\CONFIG\machine.config.default
Also make sure you have full rights on the OpenCalais templates subfolders, in order to avoid other errors.
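The rename-and-copy fix for machine.config can also be scripted. The following is only a minimal sketch in Python: the function name and the ".broken" backup name are my own choices, and it assumes the script runs with enough rights to write in the CONFIG folder.

```python
import shutil
from pathlib import Path


def restore_machine_config(config_dir):
    """Set aside a damaged machine.config and recreate it from machine.config.default."""
    config_dir = Path(config_dir)
    current = config_dir / "machine.config"
    default = config_dir / "machine.config.default"
    backup = config_dir / "machine.config.broken"  # hypothetical backup name

    if not default.exists():
        raise FileNotFoundError(f"No default template found at {default}")
    if current.exists():
        # Rename the damaged file instead of deleting it, so nothing is lost.
        current.replace(backup)
    # Recreate machine.config as a copy of the default template.
    shutil.copyfile(default, current)
    return current
```

It would be called with the folder from the note, for example restore_machine_config(r"C:\Windows\Microsoft.NET\Framework\v2.0.50727\CONFIG"), most likely from an administrator prompt.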


OpenRefine (previously Google Refine) is a desktop application that helps to refine messy data in a few clicks, with very powerful clustering and cleaning algorithms.
It also allows data transformations and augmentation.
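To give an idea of what clustering does to a list of affiliations, here is a minimal sketch of key-collision clustering with a "fingerprint" key. It is written in the spirit of that family of methods and is not OpenRefine's own code. The function names and the sample values are assumptions made for illustration.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key: ASCII, lowercase, no punctuation, sorted unique tokens."""
    # Strip accents by decomposing characters and dropping the non-ASCII parts.
    text = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    text = text.strip().lower()
    # Remove punctuation so "Univ." and "Univ" produce the same token.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Sort and deduplicate tokens so word order does not matter.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values that share the same fingerprint; keep only groups with 2+ members."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


affiliations = ["Univ. of Milan", "Milan, Univ of", "CNR Rome"]
clusters = cluster(affiliations)
```

Values that differ only in case, punctuation, accents or word order end up in the same group. Abbreviations against full names (Univ against University) do not collide under this key, which is one reason such clustering alone is not enough for affiliation lists.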
Above all, it has a reconciliation feature: reconciliation is a semi-automated process of matching text names to database IDs (keys). It is semi-automated because in some cases the machine alone is not sufficient and human judgement is essential.
A test of the reconciliation feature on a sample of WOS affiliations showed insufficient performance, also because the reconciliation database seems more oriented to persons and geography than to organizations.
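The semi-automated idea behind reconciliation can be sketched as follows: each name is compared against a reference table of IDs, good matches are accepted automatically and doubtful ones are left for a human. This is only an illustration built on Python's difflib, not the way OpenRefine implements it. The thresholds, the reference table and the function name are assumptions.

```python
from difflib import SequenceMatcher


def reconcile(name, reference, auto_threshold=0.9, review_threshold=0.6):
    """Match a name against {id: canonical_name}; return (status, id, score)."""
    best_id, best_score = None, 0.0
    for ref_id, ref_name in reference.items():
        # Similarity ratio between 0.0 (nothing in common) and 1.0 (identical).
        score = SequenceMatcher(None, name.lower(), ref_name.lower()).ratio()
        if score > best_score:
            best_id, best_score = ref_id, score
    if best_score >= auto_threshold:
        return ("matched", best_id, best_score)
    if best_score >= review_threshold:
        # Not good enough to accept blindly: a person has to decide.
        return ("needs_review", best_id, best_score)
    return ("unmatched", None, best_score)


# Hypothetical reference table of organization IDs.
reference = {"org1": "University of Milan", "org2": "Politecnico di Torino"}
status, org_id, score = reconcile("university of milan", reference)
```

The "needs_review" band is where the human judgement mentioned above comes in. How wide to make it depends on how costly a wrong match is.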


Is a Python module that parses human names into individual components:
  • Title
  • First name
  • Middle names
  • Last names
  • Suffixes
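To make the five components concrete, here is a deliberately naive sketch of such a split. It is not the module's code or API. The title and suffix lists are small assumed samples, and it only handles names written in "First Last" order.

```python
TITLES = {"mr", "mrs", "ms", "dr", "prof"}        # assumed sample list
SUFFIXES = {"jr", "sr", "ii", "iii", "phd", "md"}  # assumed sample list


def parse_name(full_name):
    """Split a 'Title First Middle Last Suffix' string into its components."""
    tokens = full_name.replace(",", " ").split()

    def key(token):
        # Compare without trailing dots and case: "Dr." -> "dr".
        return token.rstrip(".").lower()

    titles = []
    while tokens and key(tokens[0]) in TITLES:
        titles.append(tokens.pop(0))
    suffixes = []
    while tokens and key(tokens[-1]) in SUFFIXES:
        suffixes.insert(0, tokens.pop())

    parts = {
        "title": " ".join(titles),
        "first": "",
        "middle": "",
        "last": "",
        "suffix": " ".join(suffixes),
    }
    if tokens:
        parts["first"] = tokens[0]
    if len(tokens) > 1:
        # Everything between the first and the last token counts as middle names.
        parts["last"] = tokens[-1]
        parts["middle"] = " ".join(tokens[1:-1])
    return parts


parsed = parse_name("Dr. Juan Q. Xavier Velasquez Jr.")
```

A real parser also has to deal with "Last, First" order, multi-word surnames and nicknames, which this sketch ignores.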
Download from:


Name Cleaver supports three major name types (politicians, individuals and organizations), with a specific class and special features for each.
The OrganizationNameCleaver class has methods to reduce a name to only the "kernel" of the name, and also to expand all abbreviations (that Name Cleaver knows of), useful for matching tasks.
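The two operations, reducing to a kernel and expanding abbreviations, can be illustrated with a small standalone sketch. It does not call Name Cleaver and does not reproduce its rules. The abbreviation table, the noise-word list and the function names are assumptions chosen for the example.

```python
import re

# Hypothetical lookup tables; a real tool would know many more entries.
ABBREVIATIONS = {
    "univ": "University",
    "dept": "Department",
    "inst": "Institute",
    "natl": "National",
    "assoc": "Association",
}
NOISE_WORDS = {"the", "of", "inc", "ltd", "co", "corp", "llc"}


def _tokens(name):
    # Split on anything that is not a word character, dropping punctuation.
    return re.findall(r"\w+", name)


def expand(name):
    """Replace each known abbreviation with its full form."""
    return " ".join(ABBREVIATIONS.get(token.lower(), token) for token in _tokens(name))


def kernel(name):
    """Expand abbreviations, then drop legal forms and filler words."""
    return " ".join(
        token for token in _tokens(expand(name)) if token.lower() not in NOISE_WORDS
    )


expanded = expand("Univ. of Milan, Dept. of Physics")
core = kernel("The Olympus Optical Co. Ltd.")
```

Comparing kernels instead of raw strings makes two spellings of the same organization more likely to match, at the price of sometimes merging organizations that differ only in the dropped words.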
The Python code of the program can be downloaded here:

OYSTER entity resolution

OYSTER (Open sYSTem Entity Resolution) is an entity resolution system that supports probabilistic direct matching, transitive linking, and asserted linking. To facilitate prospecting for match candidates (blocking), the system builds and maintains an in-memory index of attribute values to identities. Because OYSTER has an identity management system, it also supports persistent identity identifiers. OYSTER is unique among ER systems in that it is built to incorporate Entity Identity Information Management (EIIM). OYSTER supports EIIM by providing methods that enforce identifiers to be unique among identities, maintain persistent IDs over the life of an identity, and make it possible to fix false-positive and false-negative resolutions, which cannot be done with matching rules alone, through the use of assertion, traceability, and other features.
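How an attribute index, direct matching, transitive linking and asserted linking fit together can be shown with a toy sketch. This is Python rather than Java, it is not OYSTER's code, and it uses exact matching on normalized values instead of probabilistic matching. The class, the attribute names and the sample records are all assumptions.

```python
from collections import defaultdict


class IdentityResolver:
    """Toy resolver: an in-memory value index plus union-find for transitive links."""

    def __init__(self):
        self.parent = {}               # record id -> parent id (union-find forest)
        self.index = defaultdict(set)  # (attribute, value) -> record ids (blocking index)

    def _find(self, rid):
        while self.parent[rid] != rid:
            self.parent[rid] = self.parent[self.parent[rid]]  # path halving
            rid = self.parent[rid]
        return rid

    def _union(self, a, b):
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            # Keep the smallest id as the identity, a crude stand-in for persistent IDs.
            keep, drop = sorted((root_a, root_b))
            self.parent[drop] = keep

    def add(self, rid, attributes, match_on=("name", "address")):
        """Register a record and link it to every record sharing an attribute value."""
        self.parent[rid] = rid
        for attr in match_on:
            value = attributes.get(attr)
            if not value:
                continue
            key = (attr, value.strip().lower())
            for other in self.index[key]:  # candidates found through the index
                self._union(rid, other)    # direct match on this attribute
            self.index[key].add(rid)
        return self._find(rid)

    def assert_link(self, a, b):
        """Asserted linking: a human states that two records are the same entity."""
        self._union(a, b)

    def identity_of(self, rid):
        return self._find(rid)


resolver = IdentityResolver()
resolver.add("r1", {"name": "Univ Milan", "address": "Via Festa del Perdono 7"})
resolver.add("r2", {"name": "University of Milan", "address": "via festa del perdono 7"})
resolver.add("r3", {"name": "University of Milan", "address": "Via Celoria 16"})
resolver.add("r4", {"name": "Milan State Univ", "address": "Via Conservatorio 7"})
```

Here r1 and r3 share no value, but both link to r2, so all three get the same identity: that is transitive linking. The record r4 stays apart until someone calls resolver.assert_link("r4", "r1"), which corrects a false negative. Splitting a wrong merge (a false positive) is not possible with plain union-find and would need the kind of traceability described above.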
OYSTER is developed in Java and can be downloaded from: