I am publishing a review of some tools that I came across while looking for a way to standardize and resolve a list of affiliations from Web of Science.
Some of the tools proved not that useful for my purposes, but someone else may benefit from them.
OPENCALAIS
The Calais Web Service by Thomson Reuters is an API that accepts
unstructured text (news articles, blog postings, etc.), processes it
using natural language processing and machine learning algorithms, and
returns RDF-formatted entities, facts and events.
OpenCalais supports three types of entity
disambiguation: Company disambiguation, Geographical disambiguation and Product
(Electronics) disambiguation.
Company disambiguation means, for example, determining whether the company
Olympus refers to Olympus Optical Co. Ltd. or Olympus Life and Material Science
Europa. The resolution output for a given company mention includes:
- A URI that is unique and uniform across documents
- The formal English legal name of the company
- The company's ticker symbol (for public companies)
For company names that cannot be disambiguated, the returned results will
include no resolution information.
To use OpenCalais without the API:
A test of the disambiguation feature on a sample of
WOS affiliations showed insufficient performance, since semantic search
works better on fuzzy data (such as web pages) than on lists of names.
NOTE: when run after installation, it may return this error:
Error parsing c:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\config\machine.config
Parser returned error 0xC00CE556
To fix it, rename the file
C:\Windows\Microsoft.NET\Framework\v2.0.50727\CONFIG\machine.config
and recreate it by copying from:
C:\Windows\Microsoft.NET\Framework\v2.0.50727\CONFIG\machine.config.default
Make sure you have full rights on the OpenCalais
templates subfolders in order to avoid other errors.
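The rename-and-copy fix above can be scripted. This is only a minimal sketch: the function name `reset_machine_config` and the `.broken` backup suffix are my own choices (the steps above do not say what to rename the file to), and on a real machine it would need to run with administrator rights.

```python
import shutil
from pathlib import Path


def reset_machine_config(config_dir, backup_suffix=".broken"):
    """Set the damaged machine.config aside and restore it from the default copy.

    backup_suffix is an assumption made for this sketch, not part of the fix itself.
    """
    config_dir = Path(config_dir)
    current = config_dir / "machine.config"
    default = config_dir / "machine.config.default"
    if not default.exists():
        raise FileNotFoundError(default)
    backup = current.with_name(current.name + backup_suffix)
    if current.exists():
        # Step 1: rename the damaged file so it is kept for inspection.
        current.replace(backup)
    # Step 2: recreate machine.config from machine.config.default.
    shutil.copyfile(default, current)
    return backup
```

It would be called with the CONFIG folder named in the error message, for example `reset_machine_config(r"C:\Windows\Microsoft.NET\Framework\v2.0.50727\CONFIG")`.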
OPENREFINE
OpenRefine (https://github.com/OpenRefine/OpenRefine/wiki),
previously Google Refine, is a desktop application that helps to refine messy data
in a few clicks with very powerful clustering and cleaning algorithms.
It also allows data transformations and augmentation.
Most of all, it has a reconciliation feature: reconciliation is a semi-automated process of matching text names to
database IDs (keys). It is semi-automated because in some cases the machine
alone is not sufficient and human judgement is essential.
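To make the "semi-automated" idea concrete, here is a minimal sketch of a reconciliation step. It does not use OpenRefine or its reconciliation services: the reference table, the IDs and both thresholds are hypothetical, and the string similarity comes from Python's standard `difflib`. Confident matches are accepted automatically, borderline ones are flagged for a human, and the rest are left unmatched.

```python
from difflib import SequenceMatcher

# Hypothetical reference database: ID (key) -> canonical organization name.
REFERENCE = {
    "org:001": "University of Oxford",
    "org:002": "University of Cambridge",
    "org:003": "Massachusetts Institute of Technology",
}


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def reconcile(name, reference=REFERENCE, auto=0.9, review=0.6):
    """Return (status, id) for the best candidate; thresholds are illustrative."""
    scored = sorted(
        ((similarity(name, canon), key) for key, canon in reference.items()),
        reverse=True,
    )
    best_score, best_key = scored[0]
    if best_score >= auto:
        return ("matched", best_key)       # machine alone is sufficient
    if best_score >= review:
        return ("needs_review", best_key)  # human judgement required
    return ("unmatched", None)
```

With this table, `reconcile("University of Oxford")` is matched automatically, while an abbreviated form such as `reconcile("Univ Oxford")` falls into the review band.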
A test of the reconciliation feature on a sample of
WOS affiliations also showed insufficient performance, partly because the
reconciliation database seems oriented more towards persons and geography than
organizations.
PYTHON NAMEPARSER
A Python module that parses
human names into individual components:
- Title
- First name
- Middle names
- Last names
- Suffixes
Download from: http://code.google.com/p/python-nameparser/
NAME CLEAVER
Name Cleaver (http://sunlightlabs.com/blog/2011/name-standardization-name-cleaver/) supports three major name types (politicians, individuals and
organizations), with a specific class and special features for each.
The OrganizationNameCleaver class has methods
to reduce a name to only the "kernel" of the name, and also to expand
all abbreviations (that Name Cleaver knows of), useful for matching tasks.
The Python code of the program can be
downloaded here: https://github.com/sunlightlabs/name-cleaver
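The two ideas described for organization names, reducing a name to its "kernel" and expanding abbreviations, can be sketched in a few lines. This is not Name Cleaver's code and does not call its API: the function names and both word lists are hypothetical stand-ins, and the library ships its own, much longer tables.

```python
import re

# Hypothetical tables for illustration only.
ABBREVIATIONS = {
    "univ": "university",
    "dept": "department",
    "inst": "institute",
    "natl": "national",
    "assn": "association",
}
NOISE_WORDS = {"inc", "ltd", "llc", "corp", "co", "the", "of"}


def tokens(name):
    """Lower-case the name and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", name.lower())


def expand(name):
    """Replace every known abbreviation with its long form."""
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens(name))


def kernel(name):
    """Expand abbreviations, then drop legal suffixes and filler words."""
    return " ".join(t for t in tokens(expand(name)) if t not in NOISE_WORDS)
```

For matching tasks, two spellings can then be compared on their kernels: `kernel("The Univ of Oxford")` and `kernel("University of Oxford")` give the same string.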
OYSTER entity resolution
OYSTER (Open sYSTem Entity Resolution) is an
entity resolution system that supports probabilistic direct matching,
transitive linking, and asserted linking. To facilitate prospecting for match
candidates (blocking), the system builds and maintains an in-memory index of
attribute values to identities. Because OYSTER has an identity management
system, it also supports persistent identity identifiers. OYSTER is unique
among other ER systems in that it is built to incorporate Entity Identity
Information Management (EIIM). OYSTER supports EIIM by providing methods that
enforce identifiers to be unique among identities, maintain persistent IDs over
the life of an identity, and allow false-positive and false-negative
resolutions, which cannot be fixed with matching rules, to be corrected through
the use of assertion, traceability, and other features.
Developed in Java, it can be downloaded
from: http://sourceforge.net/projects/oysterer/
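Two of the ideas in the description, the in-memory index of attribute values used for blocking and transitive linking, can be illustrated with a small sketch. It is not OYSTER code (OYSTER is written in Java and configured differently): the class `IdentityIndex` is hypothetical, and it links records on exact, normalized attribute values rather than on probabilistic matching rules.

```python
from collections import defaultdict


class IdentityIndex:
    """Toy in-memory index: attribute value -> records, with transitive linking."""

    def __init__(self):
        self.parent = {}               # union-find structure over record ids
        self.index = defaultdict(set)  # normalized attribute value -> record ids

    def find(self, rid):
        """Return the representative record of rid's identity."""
        while self.parent[rid] != rid:
            self.parent[rid] = self.parent[self.parent[rid]]  # path halving
            rid = self.parent[rid]
        return rid

    def add(self, rid, values):
        """Register a record and link it to every record sharing a value."""
        self.parent[rid] = rid
        for value in values:
            key = value.strip().lower()
            for other in self.index[key]:
                # Merging identities makes links transitive: A~B and B~C give A~C.
                self.parent[self.find(other)] = self.find(rid)
            self.index[key].add(rid)

    def same_identity(self, a, b):
        return self.find(a) == self.find(b)
```

For example, if record r1 shares an address with r2, and r2 shares a name with r3, then r1 and r3 end up in the same identity even though they have no attribute value in common.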