Monday, February 25, 2013

Reclassifying PRS legal codes

Today I want to share this presentation about a reclassification of the PRS legal data in PATSTAT table TLS221. I made the reclassified data available at OST (you may ask them if you need the data).

Monday, February 18, 2013

Datasources for entity resolution with company names

The goal of this list is to review possible data sources for entity resolution on companies/institutions, including their history and changes of ownership, for use with PATSTAT / WoS.
Apart from US sources, mainly international lists have been searched.


From: NBER PDP Project User Documentation: Matching Patent Data to Compustat Firms
Compustat records identified by CUSIPs or GVKEYs refer to securities[1], not firms. A single organization may correspond to multiple entries within the Compustat data. Sometimes reorganization of the ownership structure generates a new GVKEY; sometimes accounting changes result in multiple GVKEYs for the same organization; sometimes a parent organization and a subsidiary will both have GVKEYs. In order to uniquely identify organizations, we introduce a variable named PDPCO. In most cases, the PDPCO equals the Compustat GVKEY; however, in some cases, multiple GVKEYs are associated with a single PDPCO. These associations are recorded in the PDPCOHDR file.
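
The many-to-one relation between GVKEYs and PDPCO described above can be sketched as a simple lookup. This is only an illustration: the two-column layout (gvkey, pdpco) and all the values below are assumptions invented for the example, not the actual content or column names of the PDPCOHDR file.

```python
from collections import defaultdict

# Hypothetical rows in the spirit of the PDPCOHDR file: (gvkey, pdpco).
# All identifiers are invented for illustration only.
PDPCOHDR_ROWS = [
    ("001001", "001001"),  # usual case: PDPCO equals the GVKEY
    ("001002", "001001"),  # a second GVKEY for the same organization
    ("002500", "002500"),
]


def build_gvkey_to_pdpco(rows):
    """Map every GVKEY to the PDPCO of the organization it belongs to."""
    return {gvkey: pdpco for gvkey, pdpco in rows}


def group_gvkeys_by_pdpco(rows):
    """Collect all GVKEYs that share the same PDPCO."""
    groups = defaultdict(list)
    for gvkey, pdpco in rows:
        groups[pdpco].append(gvkey)
    return dict(groups)


gvkey_to_pdpco = build_gvkey_to_pdpco(PDPCOHDR_ROWS)
pdpco_groups = group_gvkeys_by_pdpco(PDPCOHDR_ROWS)

print(gvkey_to_pdpco["001002"])
print(pdpco_groups["001001"])
```

Replacing each GVKEY with its PDPCO before aggregating is what lets patents matched to different securities be counted under one organization.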


This is the home page for the new NBER patent data project. US Patent data for 1976-2006 and assignee match to Compustat are now available on the downloads page. Watch this site for announcements of new releases, tools, supplemental files, fixes, etc.
Last updates are from 2010.

Contains the match of USPTO applicants with Compustat and the ownership history of applicants.

Contains the Stata programs used for cleaning, standardizing and assigning the type of applicant.

User forum about the Compustat match (currently empty)


TLS221 (PRS legal data) in PATSTAT contains legal events, among which are changes of name and changes of ownership (event type RAP*).
Such events also reflect changes in the company itself: certainly in the case of name changes, and massive changes of patent ownership may also reflect a change of ownership of the company.

Counts for the 4 events in the 2010 edition give the following figures:

RAP1     500492
RAP2     60191
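
Counts like these can be obtained with a simple GROUP BY on the event code. Below is a minimal sketch on an in-memory SQLite table with toy rows. The table name `tls221_inpadoc_prs` and the columns `appln_id` and `prs_code` are assumptions about the PATSTAT layout and should be checked against the data catalog of the edition in use. The rows are invented and do not reproduce the figures above.

```python
import sqlite3

# Toy stand-in for the PATSTAT legal status table.
# Table and column names are assumptions; rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tls221_inpadoc_prs (appln_id INTEGER, prs_code TEXT)")
conn.executemany(
    "INSERT INTO tls221_inpadoc_prs VALUES (?, ?)",
    [(1, "RAP1"), (1, "RAP2"), (2, "RAP1"), (3, "PGFP")],
)


def count_rap_events(connection):
    """Count legal events per code, restricted to codes starting with RAP."""
    rows = connection.execute(
        "SELECT prs_code, COUNT(*) FROM tls221_inpadoc_prs "
        "WHERE prs_code LIKE 'RAP%' "
        "GROUP BY prs_code ORDER BY prs_code"
    ).fetchall()
    return dict(rows)


rap_counts = count_rap_events(conn)
print(rap_counts)
```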


Full list of university names by country / continent


Indian companies names changes

A database specialized on company names for India, Change of Name, Mergers, Demergers


Greenhouse Gas Emissions Reporting Program parent companies

Standardized Parent Company Names for TRI Reporting

List of standardized names for the Toxics Release Inventory from the US Environmental Protection Agency

The USA grants recipients database has recently introduced Data Archives. The Archives feature provides the capability to download archived files by the following criteria:
Major Agency/ALL
Fiscal Year
Type of Spending (Contracts, Grants, Loans, Direct Payments, Insurance, Other)
Type of Output (CSV, TAB, XML, ATOM)

The archived data files would be available for download only in the compressed file format for each output type mentioned above.

For recipients of funds, full data (name and address) are given, as well as other data such as:
The original Federal Assistance Awards Data System recipient type code, modified into a set of broader categories (government, individual, nonprofit, for profit, higher ed, other).

The type of recipient (i.e., state government, local government, Indian tribe, individual, small business, for-profit, nonprofit, etc.)



CrocTail provides an interface for browsing information parsed from SEC filings about several hundred thousand U.S. publicly traded corporations and their foreign subsidiaries.
Information from company filings with the U.S. Securities and Exchange Commission (SEC) has been parsed and annotated by CorpWatch to provide a way for users to research and add issues related to corporate subsidiaries. CrocTail also serves as a demonstration of the features and data available through the CorpWatch API.


Crocodyl is a collaboration between nonprofit organizations such as Center for Corporate Policy, CorpWatch, Corporate Research Project, other contributing organizations and individual contributors from around the world.
It contains a lot of information, but coverage is not homogeneous: in many cases data come from human rights watches, and updates are mostly dated 2009 or 2010.


ORCID is an open, non-profit, community-based effort to create and maintain a registry of unique researcher identifiers and a transparent method of linking research activities and outputs to these identifiers. 
ORCID provides two core functions: (1) a registry to obtain a unique identifier and manage a record of activities, and (2) APIs that support system-to-system communication and authentication. ORCID makes its code available under an open source license, and will post an annual public data file under a CC0 waiver for free download.
ORCID identifiers will be 16 digit numbers, segmented into four-digit groups and including a checksum. They will be expressed as HTTP URIs. ORCID identifiers contain no semantic information, such as the year the identifier was minted or the country of origin, and they are issued out of sequence. The ORCID service not only will issue unique author identifiers but also will enable linking to existing author identifier services.
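
The checksum mentioned above can be verified locally before sending an identifier to any service. The text here does not say which algorithm is used. The sketch below assumes the ISO 7064 MOD 11-2 scheme that ORCID documents for its identifiers, where the last character may be `X` for a check value of 10. The function names are my own.

```python
def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    remainder = total % 11
    result = (12 - remainder) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check length, digits and checksum of an identifier like 0000-0002-1825-0097."""
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15]


print(is_valid_orcid("0000-0002-1825-0097"))
print(is_valid_orcid("0000-0002-1825-0098"))
```

A check like this only catches typing errors; it says nothing about whether the identifier has actually been issued.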

Data provided via API are
  • Bio - Given a contributor, return name and affiliation data.
  • Works - Given a contributor, return list of works he has contributed to.
  • Full - Given a contributor, return list of works he has contributed to, name and affiliation data.
  • Search - Given whatever metadata provided, return a ranked list of potential contributors identified by that metadata.
Supported formats are HTML, XML and JSON.
Recently a partnership with CROSSREF has been launched in order to assign a unique author identifier to all DOIs.

OPENCORPORATES is an open collection of information on over 45M companies from 58 jurisdictions.
Its distinctive features are a web link to the national register (the link is not always working) and a huge record of inactive / dissolved companies.
The service relies on APIs; there are currently two OpenCorporates APIs: the REST API and the Google Refine Reconciliation API.
The main API (the so-called REST API) allows access as data to all aspects of the OpenCorporates website (with the exception of being able to submit data). By default it returns data as JSON (but XML is also available). Access to all the data is free and under the same open licence conditions as the rest of OpenCorporates. An optional (free) API token is available which increases usage limits for the service.
If you are matching company names to legal entities from an existing dataset, you should investigate the rather excellent Google Refine, which makes this quick, easy and allows data to be filtered and cleansed quicker than any other tool we know of. OpenCorporates provides a highly popular reconciliation API for Google Refine, which allows matching company names to legal entities.
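
As a rough sketch of how the REST API could be used for name matching, the snippet below builds a company search URL and extracts candidates from a JSON answer. No request is sent. The base URL with its version prefix, the `q` and `api_token` parameters and the shape of the response are assumptions to be checked against the current OpenCorporates API documentation, and the sample payload is invented.

```python
import json
from urllib.parse import urlencode

# Assumed base URL and version prefix; verify against the API documentation.
API_BASE = "https://api.opencorporates.com/v0.4"


def build_search_url(name, api_token=None):
    """Build a company search URL for a given name (no request is sent)."""
    params = {"q": name}
    if api_token:
        params["api_token"] = api_token
    return f"{API_BASE}/companies/search?{urlencode(params)}"


def extract_companies(payload):
    """Return (name, jurisdiction, number) tuples from an assumed response shape."""
    companies = payload.get("results", {}).get("companies", [])
    return [
        (
            item["company"]["name"],
            item["company"]["jurisdiction_code"],
            item["company"]["company_number"],
        )
        for item in companies
    ]


# Invented payload mimicking the assumed JSON structure.
SAMPLE_RESPONSE = json.loads(
    """
    {"results": {"companies": [
        {"company": {"name": "ACME WIDGETS LTD",
                     "jurisdiction_code": "gb",
                     "company_number": "00000001"}}
    ]}}
    """
)

search_url = build_search_url("Acme Widgets")
candidates = extract_companies(SAMPLE_RESPONSE)
print(search_url)
print(candidates)
```

The jurisdiction code together with the company number is what identifies the legal entity, so keeping both for each candidate makes later deduplication easier.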

UCLA Anderson DB list

UCLA Anderson publishes and maintains a very comprehensive list of business databases, listed by name along with some downloadable guides.

Most of the resources listed are commercial.

World of Learning

The online version of The Europa World of Learning is available; the print edition of this internationally respected title was first published over 60 years ago and has become a primary source of information on the academic sphere world-wide.
A free sample for Finland can be explored online.
The data also cover research institutes; for each of them the information available includes, apart from name and address, the URL, phone number and publications edited by that institution.

The Carnegie Classification of Institutions of Higher Education

The Carnegie Classification™ has been the leading framework for recognizing and describing institutional diversity in U.S. higher education for the past four decades.
The website provides access to extensive documentation as well as tools for looking up specific institutions, listing all institutions in a particular classification category, aggregating categories within a classification.
Most importantly, it allows downloading in XLS format a list of 4635 US institutions along with 85 variables associated with each of them.
By filtering on the CCBASIC field you may also extract lists of institutions by type, in particular:
15           RU/VH: Research Universities (very high research activity)
16           RU/H: Research Universities (high research activity)
17           DRU: Doctoral/Research Universities
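
The CCBASIC filter above can be sketched in a few lines, assuming the XLS file has first been exported to CSV. Only the CCBASIC field and the three codes come from the list above. The `NAME` column and the institution names are assumptions made up for the example.

```python
import csv
import io

# Invented sample standing in for a CSV export of the Carnegie XLS file.
# Only the CCBASIC column name is taken from the text; NAME is an assumption.
SAMPLE_CSV = """NAME,CCBASIC
Example Research University,15
Example State University,16
Example Doctoral University,17
Example Community College,1
"""

# Codes for research-oriented institutions, as listed above.
RESEARCH_CODES = {15: "RU/VH", 16: "RU/H", 17: "DRU"}


def research_institutions(csv_text):
    """Return (name, category label) for rows whose CCBASIC is 15, 16 or 17."""
    reader = csv.DictReader(io.StringIO(csv_text))
    selected = []
    for row in reader:
        code = int(row["CCBASIC"])
        if code in RESEARCH_CODES:
            selected.append((row["NAME"], RESEARCH_CODES[code]))
    return selected


research_list = research_institutions(SAMPLE_CSV)
print(research_list)
```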

WoS Abbreviations list

One of the main problems in reconciling affiliations from WoS with other sources is the abbreviated form in which affiliation names and addresses are written.
For example, resolving 1 GEOL EXPLORAT INST HENAN PROV could be a problem unless we build a dictionary of abbreviations.
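
A dictionary-based expansion of this kind can be sketched as follows. The four entries are my own guesses at the expansions for this one example, not entries taken from any published abbreviation list, and the function name is hypothetical.

```python
# Illustrative dictionary; the expansions are guesses for this example only.
ABBREVIATIONS = {
    "GEOL": "GEOLOGICAL",
    "EXPLORAT": "EXPLORATION",
    "INST": "INSTITUTE",
    "PROV": "PROVINCE",
}


def expand_affiliation(text, abbreviations=ABBREVIATIONS):
    """Replace each known abbreviated token with its full form; keep the rest."""
    tokens = text.upper().split()
    return " ".join(abbreviations.get(token, token) for token in tokens)


expanded = expand_affiliation("1 GEOL EXPLORAT INST HENAN PROV")
print(expanded)
```

A token-by-token lookup cannot handle abbreviations with more than one expansion, so ambiguous entries would need context or manual review.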
The list compiled by Thomson Reuters is a good starting point.
If a list of full journal names vs. abbreviations is also needed, one can be found online, including in a format that is easier to manage.