The goal of this list is to review possible data sources for entity resolution
on companies/institutions and their history/changes of ownership, for use with
PATSTAT / WoS.
Aside from the US, mainly international lists have been searched.
COMPUSTAT
From: NBER PDP Project User Documentation:
Matching Patent Data to Compustat Firms
Compustat records identified by CUSIPs or GVKEYs refer to securities, not firms,
so a single organization may correspond to multiple entries within the
Compustat data. Sometimes reorganization of the ownership structure generates a
new GVKEY; sometimes accounting changes result in multiple GVKEYs for the same
organization; sometimes a parent organization and a subsidiary will both have
GVKEYs. In order to uniquely identify organizations, we introduce a variable
named PDPCO. In most cases the PDPCO equals the Compustat GVKEY; however, in
some cases, multiple GVKEYs are associated with a single PDPCO. These
associations are recorded in the PDPCOHDR file.
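The many-to-one relation between GVKEYs and PDPCO described above can be sketched in a few lines of Python. This is only an illustration: the rows below are invented, and the real layout of the PDPCOHDR file (column names, extra fields) is not shown in these notes and must be checked in the NBER documentation.

```python
# Hypothetical rows standing in for the PDPCOHDR file: (gvkey, pdpco).
# The identifiers are invented for illustration only.
PDPCOHDR_ROWS = [
    ("001000", "001000"),  # usual case: PDPCO equals the GVKEY
    ("002000", "002000"),
    ("002001", "002000"),  # a second GVKEY attached to the same organization
]


def build_gvkey_to_pdpco(rows):
    """Return a dict mapping each GVKEY to its PDPCO."""
    return {gvkey: pdpco for gvkey, pdpco in rows}


def group_by_pdpco(rows):
    """Return a dict mapping each PDPCO to the sorted list of its GVKEYs."""
    groups = {}
    for gvkey, pdpco in rows:
        groups.setdefault(pdpco, []).append(gvkey)
    return {pdpco: sorted(gvkeys) for pdpco, gvkeys in groups.items()}


gvkey_to_pdpco = build_gvkey_to_pdpco(PDPCOHDR_ROWS)
pdpco_groups = group_by_pdpco(PDPCOHDR_ROWS)
```

Collapsing on PDPCO rather than on GVKEY before matching avoids counting the same organization several times.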
NBER COMPUSTAT MATCH / OWNERSHIP HISTORY
This is the home page for the new NBER patent data project. US Patent data for
1976-2006 and assignee match to Compustat are now available on the downloads
page. Watch this site for announcements of new releases, tools, supplemental
files, fixes, etc.
Last updates are from 2010.
Contains the match of USPTO applicants with Compustat and the ownership history
of applicants.
Contains the Stata programs used for cleaning, standardizing and assigning the
type of applicant.
User forum about the Compustat match (currently empty).
FROM PATSTAT TLS221
TLS221 (PRS legal data) in PATSTAT contains legal status data, including
changes of name and changes of ownership (event type RAP*).
Such events also reflect changes in the company itself (certainly in the case
of name changes, but massive changes of patent ownership may also reflect a
change of ownership of the company).
Counts of the 4 events in the 2010 edition give the following figures:
RAP1 500492
RAP2 60191
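Counts of this kind can be reproduced with a simple tally over the event codes. The sketch below assumes the TLS221 rows have already been exported as (appln_id, event code) pairs; the sample rows and the code "PG25" are invented for illustration, and the actual column names should be taken from the PATSTAT documentation for the edition in use.

```python
from collections import Counter

# Invented sample rows (appln_id, event_code); real TLS221 rows carry
# more columns and the column names may differ between editions.
SAMPLE_EVENTS = [
    (1, "RAP1"),
    (1, "RAP2"),
    (2, "RAP1"),
    (3, "PG25"),  # unrelated event, ignored by the count
    (4, "RAP1"),
]


def count_rap_events(rows):
    """Count events whose code starts with 'RAP', grouped by code."""
    return Counter(code for _, code in rows if code.startswith("RAP"))


rap_counts = count_rap_events(SAMPLE_EVENTS)
for code, n in sorted(rap_counts.items()):
    print(code, n)
```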
UNIVERSITY OF BOLOGNA LIST OF UNIVERSITY NAMES
Full list of university names by country / continent
NASDAQ COMPANY NAMES
Indian company name changes
A database specialized in company names for India: changes of name, mergers,
demergers
SEC NAMES
Greenhouse Gas Emissions Reporting Program parent companies
Standardized Parent Company Names for TRI Reporting
List of standardized names for the Toxics Release Inventory from the US
Environmental Protection Agency
USA grants recipients database
USAspending.gov has recently introduced the availability of Data Archives. The
Archives feature will provide the capability to download archived files on
USAspending.gov by the following criteria:
• Major Agency/ALL
• Fiscal Year
• Type of Spending (Contracts, Grants, Loans, Direct Payments, Insurance, Other)
• Type of Output (CSV, TAB, XML, ATOM)
The archived data files would be available for download only in the compressed
file format for each output type mentioned above.
For recipients of funds, full data (name and address) are given, as well as
other fields such as:
recip_cat_type
The original Federal Assistance Awards Data System recipient type code,
modified by USAspending.gov into a set of broader categories (government,
individual, nonprofit, for profit, higher ed, other).
recipient_type
The type of recipient (i.e., state government, local government, Indian tribe,
individual, small business, for-profit, nonprofit, etc.)
DUNS_NUMBER
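As a rough sketch of how these fields could be used, the snippet below selects recipients in one broad category from a CSV archive. Only recip_cat_type, recipient_type and DUNS_NUMBER are names quoted above; the recipient_name column, the sample rows and the exact spelling of the category values are assumptions to check against a real archive file.

```python
import csv
import io

# Invented extract. "recipient_name" is a hypothetical column name and
# the DUNS numbers are placeholders, not real identifiers.
SAMPLE_CSV = """recipient_name,recip_cat_type,recipient_type,DUNS_NUMBER
Example State University,higher ed,state government,000000001
Example Widgets Inc,for profit,small business,000000002
Example Institute of Technology,higher ed,nonprofit,000000003
"""


def recipients_in_category(csv_text, category):
    """Return (name, DUNS) pairs for rows whose recip_cat_type matches."""
    reader = csv.DictReader(io.StringIO(csv_text))
    wanted = category.strip().lower()
    return [
        (row["recipient_name"], row["DUNS_NUMBER"])
        for row in reader
        if row["recip_cat_type"].strip().lower() == wanted
    ]


higher_ed = recipients_in_category(SAMPLE_CSV, "higher ed")
```

The DUNS number, where present, gives a stable key for linking the same recipient across files, which is more reliable than matching on the name string.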
CROCTAIL
CrocTail provides an interface for browsing information parsed from SEC filings
about several hundred thousand U.S. publicly traded corporations and their
foreign subsidiaries.
Information from company filings with the U.S. Securities and Exchange
Commission (SEC) has been parsed and annotated by CorpWatch to provide a way
for Crocodyl.org users to research and add issues related to corporate
subsidiaries. CrocTail also serves as a demonstration of the features and data
available through the CorpWatch API.
CROCODYL.ORG
Contains a lot of information, but coverage is not homogeneous; in many cases
data come from human rights watch organizations, and updates are mostly dated
2009 or 2010.
ORCID
ORCID is an open, non-profit, community-based effort to create and maintain a
registry of unique researcher identifiers and a transparent method of linking
research activities and outputs to these identifiers.
ORCID provides two core functions: (1) a registry to obtain a unique identifier
and manage a record of activities, and (2) APIs that support system-to-system
communication and authentication. ORCID makes its code available under an open
source license, and will post an annual public data file under a CC0 waiver for
free download.
ORCID identifiers will be 16 digit numbers, segmented into four-digit groups
and including a checksum. They will be expressed as an HTTP URI (such as
http://orcid.org/0137-1963-7688-2319). ORCID identifiers contain no semantic
information, such as the year the identifier was minted or the country of
origin, and they are issued out of sequence. The ORCID service not only will
issue unique author identifiers but also will enable linking to existing author
identifier services.
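The checksum can be verified locally before an identifier is used as a matching key. The excerpt above does not name the algorithm; the sketch below assumes the ISO 7064 MOD 11-2 scheme that ORCID documents for its identifiers, under which the last character may be "X" (standing for 10) rather than a digit. Function names are illustrative.

```python
def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(identifier):
    """Check a bare or URI-form ORCID identifier against its checksum."""
    # Keep only the part after the last "/" so that URIs are accepted too.
    compact = identifier.rsplit("/", 1)[-1].replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15].upper()


example_ok = is_valid_orcid("http://orcid.org/0137-1963-7688-2319")
```

A failed check signals a typing or extraction error in the identifier, not a wrong person: the checksum says nothing about whether the identifier belongs to the intended researcher.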
Data provided via the API are:
- Bio - Given a contributor, return name and affiliation data.
- Works - Given a contributor, return the list of works they have contributed to.
- Full - Given a contributor, return the list of works they have contributed
to, plus name and affiliation data.
- Search - Given whatever metadata is provided, return a ranked list of
potential contributors identified by that metadata.
Supported formats are HTML, XML and JSON.
OPENCORPORATES
Its distinctive features are a web link to the national register (the link is
not always working) and a huge record of inactive / dissolved companies.
The main API (the so-called REST API) allows access as data to all aspects of
the OpenCorporates website (with the exception of being able to submit data).
By default it returns data as JSON (but XML is also available). Access to all
the data is free and under the same open licence conditions as the rest of
OpenCorporates. An optional (free) API token is available which increases usage
limits for the service.
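A company-name search request against the REST API could be prepared as in the sketch below. The base URL, the /companies/search path and the q and api_token parameter names are assumptions based on the public API documentation at the time of writing and should be verified against the current version; the snippet only builds the URL and performs no network call.

```python
from urllib.parse import urlencode

# Assumed base URL; check the current API documentation (a version
# segment may be required in the path).
API_BASE = "https://api.opencorporates.com"


def company_search_url(name, api_token=None):
    """Build a company search URL for the given name and optional token."""
    params = {"q": name}
    if api_token:
        params["api_token"] = api_token
    return f"{API_BASE}/companies/search?{urlencode(params)}"


search_url = company_search_url("barclays bank")
```

The JSON response can then be fetched with any HTTP client and the candidate companies compared with the applicant or affiliation name being resolved.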
If you are matching company names to legal entities from an existing dataset,
you should investigate the rather excellent Google Refine, which makes this
quick, easy and allows data to be filtered and cleansed quicker than any other
tool we know of. OpenCorporates provides a highly popular reconciliation API
for Google Refine, which allows matching company names to legal entities.
UCLA Anderson DB list
Most of the resources listed are commercial.
World of Learning
The online version of The Europa World of Learning is available at
http://www.worldoflearning.com; the print edition of this internationally
respected title was first published over 60 years ago and has become a primary
source of information on the academic sphere world-wide.
A free sample for Finland can be explored here: http://www.worldoflearning.com/public/views/entry/FI
The data also cover research institutes; for each of them the information
available includes, apart from name and address, the URL, phone number and
publications edited by that institution.
The Carnegie Classification of Institutions of Higher Education
The Carnegie Classification™ has been the leading framework for recognizing and
describing institutional diversity in U.S. higher education for the past four
decades.
The website http://classifications.carnegiefoundation.org/ provides access to
extensive documentation as well as tools for looking up specific institutions,
listing all institutions in a particular classification category, aggregating
categories within a classification.
Most importantly, it allows downloading, in XLS format, a list of 4635 US
institutions along with 85 variables associated with each of them.
By filtering on the CCBASIC field you may also extract lists of institutions by
type, in particular:
15 RU/VH: Research Universities (very high research activity)
16 RU/H: Research Universities (high research activity)
17 DRU: Doctoral/Research Universities
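The CCBASIC filter can be sketched as follows, assuming the XLS file has first been saved as CSV. CCBASIC and the three codes come from the list above; the NAME column and the sample rows (including code 21) are invented for illustration.

```python
import csv
import io

# Codes and labels taken from the list above.
RESEARCH_CODES = {"15": "RU/VH", "16": "RU/H", "17": "DRU"}

# Invented sample; "NAME" is a hypothetical column name.
SAMPLE_CSV = """NAME,CCBASIC
Example University A,15
Example College B,21
Example University C,17
"""


def research_institutions(csv_text):
    """Return (name, category label) pairs for research universities."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["NAME"], RESEARCH_CODES[row["CCBASIC"].strip()])
        for row in reader
        if row["CCBASIC"].strip() in RESEARCH_CODES
    ]


research_list = research_institutions(SAMPLE_CSV)
```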
WoS Abbreviations list
One of the main problems in reconciling affiliations from WoS with other
sources is the abbreviated form in which affiliation names and addresses are
written.
For example, resolving 1 GEOL EXPLORAT INST HENAN PROV could be a problem
unless we build a dictionary of abbreviations.
At the links:
http://images.webofknowledge.com/WOK46/help/WOS/h_acrnabrv.html#acrnabrv_a
http://images.webofknowledge.com/WOK46/help/WOS/h_adabrv.html#state_country_abbreviations
it is possible to find the list compiled by Thomson Reuters, which is a good
starting point.
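A dictionary-based expansion of the kind mentioned above can be sketched as a token-by-token lookup. The four entries below are illustrative guesses for the example string, not values copied from the Thomson Reuters lists, which would have to be loaded in their place.

```python
# Illustrative entries only; a real dictionary would be built from the
# abbreviation lists linked above.
ABBREVIATIONS = {
    "GEOL": "GEOLOGICAL",
    "EXPLORAT": "EXPLORATION",
    "INST": "INSTITUTE",
    "PROV": "PROVINCE",
}


def expand_affiliation(text, abbreviations=ABBREVIATIONS):
    """Replace each known abbreviation token with its full form."""
    tokens = text.upper().split()
    return " ".join(abbreviations.get(token, token) for token in tokens)


expanded = expand_affiliation("1 GEOL EXPLORAT INST HENAN PROV")
print(expanded)
```

Tokens without an entry are left unchanged, so the output can still be compared with other sources using approximate string matching. Abbreviations with more than one possible expansion would need additional context to resolve.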
If a list of full journal names versus abbreviations is also needed, it can be
found here:
http://images.webofknowledge.com/WOK46/help/WOS/0-9_abrvjt.html
or, in a format that is easier to manage: http://people.su.se/~alau4517/jabref.wos.txt