Today I want to share a presentation I made about a reclassification of the PRS legal data in PATSTAT table TLS221, which I made available at OST (you may ask them if you need the data).
Monday, February 25, 2013
Monday, February 18, 2013
Datasources for entity resolution with company names
The goal of this list is to review possible data sources for entity resolution on companies/institutions and their history/changes of ownership, for use with PATSTAT / WoS.
Apart from US sources, mainly international lists have been searched.
COMPUSTAT
From: NBER PDP Project User Documentation:
Matching Patent Data to Compustat Firms
Compustat records identified by CUSIPs or GVKEYs refer to securities, not firms. A single organization may correspond to multiple entries within the Compustat data. Sometimes reorganization of the ownership structure generates a new GVKEY; sometimes accounting changes result in multiple GVKEYs for the same organization; sometimes a parent organization and a subsidiary will both have GVKEYs. In order to uniquely identify organizations, we introduce a variable named PDPCO. In most cases, the PDPCO equals the Compustat GVKEY; however, in some cases, multiple GVKEYs are associated with a single PDPCO. These associations are recorded in the PDPCOHDR file.
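The GVKEY-to-PDPCO association described above can be sketched as a simple lookup table. This is only an illustration: the rows below are invented identifiers standing in for the contents of the PDPCOHDR file, and the two-column layout is an assumption, not the documented file format.

```python
from collections import defaultdict

# Hypothetical (gvkey, pdpco) pairs standing in for rows of PDPCOHDR.
# The identifiers are invented for illustration only.
pdpcohdr_rows = [
    ("001000", "001000"),  # usual case: PDPCO equals the GVKEY
    ("002000", "002000"),
    ("002001", "002000"),  # a second GVKEY of the same organization
]

# Direct lookup: which organization does a given GVKEY belong to?
gvkey_to_pdpco = dict(pdpcohdr_rows)


def organizations(rows):
    """Group GVKEYs by the PDPCO (organization) they are associated with."""
    groups = defaultdict(list)
    for gvkey, pdpco in rows:
        groups[pdpco].append(gvkey)
    return dict(groups)


orgs = organizations(pdpcohdr_rows)
print(orgs)
```

Aggregating patent counts by PDPCO instead of GVKEY then avoids splitting one organization across several securities.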
NBER COMPUSTAT MATCH / OWNERSHIP HISTORY
This is the home page for the new NBER patent
data project. US Patent data for 1976-2006 and assignee match to Compustat are
now available on the downloads page. Watch this site for announcements of new
releases, tools, supplemental files, fixes, etc.
The latest updates are from 2010.
The site contains a match of USPTO applicants to Compustat and the ownership history of applicants.
It also contains the Stata programs used for cleaning, standardizing and assigning the type of applicant.
There is a user forum about the Compustat match (currently empty).
TLS221 (PRS legal data) in PATSTAT contains legal events, among which we may count changes of name / changes of ownership (event type RAP*).
Such events also reflect changes in the company itself (certainly in the case of name changes, but massive changes of patent ownership may also reflect a change of ownership at company level).
Counts of the 4 events in the 2010 edition give the following figures:
RAP1 500492
RAP2 60191
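Counts like the ones above can be obtained with a grouped query. The sketch below runs against a tiny in-memory SQLite table so that it is self-contained; the table name `tls221_inpadoc_prs` and the column `prs_code` are assumptions about the PATSTAT schema, and the rows are invented, so check both against your own edition before reusing the query.

```python
import sqlite3

# In-memory stand-in for the PATSTAT legal status table.
# Table and column names are assumptions; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tls221_inpadoc_prs (appln_id INTEGER, prs_code TEXT)")
conn.executemany(
    "INSERT INTO tls221_inpadoc_prs VALUES (?, ?)",
    [(1, "RAP1"), (1, "RAP2"), (2, "RAP1"), (3, "XXXX")],
)

# Count the events whose code starts with RAP, one row per code.
query = """
    SELECT prs_code, COUNT(*) AS n
    FROM tls221_inpadoc_prs
    WHERE prs_code LIKE 'RAP%'
    GROUP BY prs_code
    ORDER BY prs_code
"""
counts = conn.execute(query).fetchall()
for code, n in counts:
    print(code, n)
```

The same SELECT statement should work largely unchanged on a MySQL installation of PATSTAT, provided the table and column names match.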
UNIVERSITY OF BOLOGNA LIST OF UNIVERSITY NAMES
Full list of university names by country / continent
NASDAQ COMPANY NAMES
Indian company name changes
A database specialized in company names for India: changes of name, mergers, demergers
SEC NAMES
Greenhouse Gas Emissions Reporting Program parent companies
Standardized Parent Company Names for TRI Reporting
List of standardized names for the Toxics Release Inventory, from the US Environmental Protection Agency
USA grants recipients database
USAspending.gov has recently introduced the availability of Data Archives. The Archives feature will provide the capability to download archived files on USAspending.gov by the following criteria:
• Major Agency/ALL
• Fiscal Year
• Type of Spending (Contracts, Grants, Loans, Direct Payments, Insurance, Other)
• Type of Output (CSV, TAB, XML, ATOM)
The archived data files would be available for download only in the compressed file format for each output type mentioned above.
For recipients of funds, full data (name and address) are given, as well as other fields such as:
recip_cat_type: The original Federal Assistance Awards Data System recipient type code, modified by USAspending.gov into a set of broader categories (government, individual, nonprofit, for profit, higher ed, other).
recipient_type: The type of recipient (i.e., state government, local government, Indian tribe, individual, small business, for-profit, nonprofit, etc.)
DUNS_NUMBER
CROCTAIL
CrocTail provides an interface for browsing
information parsed from SEC filings about several hundred thousand U.S.
publicly traded corporations and their foreign subsidiaries.
Information from company filings with the U.S.
Securities and Exchange Commission (SEC) has been
parsed and annotated by CorpWatch to provide
a way for Crocodyl.org users to research and
add issues related to corporate subsidiaries. CrocTail also serves as a
demonstration of the features and data available through the CorpWatch API.
CROCODYL.ORG
Crocodyl (http://www.crocodyl.org/) is a collaboration between nonprofit organizations such as Center for Corporate Policy, CorpWatch, Corporate Research Project, other contributing organizations and
individual contributors from around the world.
It contains a lot of information, but coverage is not homogeneous: in many cases the data come from human rights watch organizations, and updates are mostly dated 2009 or 2010.
ORCID
ORCID is an open, non-profit, community-based
effort to create and maintain a registry of unique researcher identifiers and a
transparent method of linking research activities and outputs to these
identifiers.
ORCID provides two core functions: (1) a
registry to obtain a unique identifier and manage a record of activities, and
(2) APIs that support system-to-system communication and authentication.
ORCID makes its code available under an open source license, and will post an
annual public data file under a CC0 waiver for free download.
ORCID identifiers will be 16 digit numbers,
segmented into four-digit groups and including a checksum. They will be
expressed as HTTP URI (such as http://orcid.org/0137-1963-7688-2319). ORCID identifiers
contain no semantic information, such as the year the identifier was minted or the
country of origin, and they are issued out of sequence. The ORCID service not
only will issue unique author identifiers but also will enable linking to
existing author identifier services.
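The checksum mentioned above makes it possible to reject mistyped identifiers before matching. ORCID documents the check character as ISO 7064 MOD 11-2 computed over the first 15 digits, which means the last character can also be the letter X. The sketch below is a minimal validator along those lines; the function names are my own.

```python
def orcid_check_character(base_digits):
    """ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check the length, the digits and the check character of an ORCID iD."""
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_character(compact[:15]) == compact[15]


# The identifier quoted in the text passes the check.
print(is_valid_orcid("0137-1963-7688-2319"))
# Changing the last digit breaks it.
print(is_valid_orcid("0137-1963-7688-2318"))
```

Only the 16-character identifier is validated here, so strip the `http://orcid.org/` prefix first if you store the full URI.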
Data provided via the API are:
- Bio - Given a contributor, return name and affiliation data.
- Works - Given a contributor, return list of works he has contributed to.
- Full - Given a contributor, return list of works he has contributed to, name and affiliation data.
- Search - Given whatever metadata provided, return a ranked list of potential contributors identified by that metadata.
Supported formats are HTML, XML and JSON.
Recently a partnership with CROSSREF (http://www.crossref.org/) has been launched (http://www.crossref.org/01company/pr/news111412.html) in order to assign unique author identifiers to all DOIs.
OPENCORPORATES
http://opencorporates.com/ is an open-source collection of information on over 45M companies from 58 jurisdictions.
Its peculiarity is that it has a web link to the national register (the link does not always work) and also a huge record of inactive / dissolved companies.
The service relies on APIs; there are actually two OpenCorporates APIs: the REST API and the Google Refine Reconciliation API.
The main API (the so-called REST API) allows access as data to all aspects of the OpenCorporates website (with the exception of being able to submit data). By default it returns data as JSON (but XML is also available). Access to all the data is free and under the same open licence conditions as the rest of OpenCorporates. An optional (free) API token is available which increases usage limits for the service.
If you are matching company names to legal entities from an existing dataset, you should investigate the rather excellent Google Refine, which makes this quick, easy and allows data to be filtered and cleansed quicker than any other tool we know of. OpenCorporates provides a highly popular reconciliation API for Google Refine, which allows matching company names to legal entities.
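As a rough sketch of how a company name search against the REST API could be prepared, the function below only builds the request URL and does not call the service. The host `api.opencorporates.com`, the `/companies/search` path and the `q`, `jurisdiction_code` and `api_token` parameters are assumptions based on my reading of the public API documentation; verify them against the current documentation before use.

```python
from urllib.parse import urlencode

# Assumed base URL of the REST API; check the current documentation.
API_BASE = "https://api.opencorporates.com"


def company_search_url(name, jurisdiction_code=None, api_token=None):
    """Build a company search URL (no network access is performed)."""
    params = {"q": name}
    if jurisdiction_code:
        params["jurisdiction_code"] = jurisdiction_code
    if api_token:
        params["api_token"] = api_token
    return f"{API_BASE}/companies/search?{urlencode(params)}"


url = company_search_url("barclays bank", jurisdiction_code="gb")
print(url)
```

The JSON response could then be fetched with `urllib.request.urlopen(url)` and decoded with `json.load`, subject to the usage limits mentioned above.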
UCLA Anderson DB list
A very comprehensive list of business databases is published and maintained at http://www.anderson.ucla.edu/x14520.xml, listed by name along with some downloadable guides.
Most of the resources listed are commercial.
World of Learning
The online version of The Europa World of Learning is available at http://www.worldoflearning.com; the print edition of this internationally respected title was first published over 60 years ago and has become a primary source of information on the academic sphere world-wide.
A free sample for Finland can be explored here: http://www.worldoflearning.com/public/views/entry/FI
The data also cover research institutes; for each of them the information available includes, apart from name and address, the URL, phone number and publications edited by that institution.
The Carnegie Classification of Institutions of Higher Education
The Carnegie Classification™ has been the leading framework for recognizing and describing institutional diversity in U.S. higher education for the past four decades.
The website http://classifications.carnegiefoundation.org/ provides access to extensive documentation as well as tools for looking up specific institutions, listing all institutions in a particular classification category, and aggregating categories within a classification.
Most importantly, it allows downloading, in XLS format, a list of 4635 US institutions along with 85 variables associated with each of them.
By filtering on the CCBASIC field you may also extract lists of institutions by type, in particular:
15 RU/VH: Research Universities (very high research activity)
16 RU/H: Research Universities (high research activity)
17 DRU: Doctoral/Research Universities
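Filtering on CCBASIC can be sketched as follows, assuming the XLS file has first been exported to CSV. Only the CCBASIC field and the codes 15, 16 and 17 come from the text above; the `NAME` column, the sample rows and the code 21 are invented for illustration.

```python
import csv
import io

# CCBASIC codes of interest, as listed above.
RESEARCH_CODES = {"15": "RU/VH", "16": "RU/H", "17": "DRU"}

# Invented sample standing in for a CSV export of the downloaded XLS file.
sample_csv = io.StringIO(
    "NAME,CCBASIC\n"
    "Example University A,15\n"
    "Example College B,21\n"
    "Example University C,17\n"
)


def research_institutions(handle):
    """Return (name, category) pairs for rows with a research CCBASIC code."""
    selected = []
    for row in csv.DictReader(handle):
        code = row["CCBASIC"].strip()
        if code in RESEARCH_CODES:
            selected.append((row["NAME"], RESEARCH_CODES[code]))
    return selected


result = research_institutions(sample_csv)
print(result)
```

With a real export, replace `sample_csv` with an open file handle and adjust the name column to the one used in the file.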
WoS Abbreviations list
One of the main problems in reconciling affiliations from WoS with other sources is the abbreviated form in which affiliation names and addresses are written.
For example, resolving 1 GEOL EXPLORAT INST HENAN PROV could be a problem unless we build a dictionary of abbreviations.
At links:
http://images.webofknowledge.com/WOK46/help/WOS/h_acrnabrv.html#acrnabrv_a
http://images.webofknowledge.com/WOK46/help/WOS/h_adabrv.html#state_country_abbreviations
it is possible to find the list compiled by Thomson Reuters, which is a good starting point.
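A dictionary-based expansion of the kind described above can be sketched in a few lines. The four expansions below are my own illustrative guesses and are not taken from the Thomson Reuters list, which should be loaded in their place.

```python
# Illustrative abbreviation dictionary; these entries are guesses,
# to be replaced with the official abbreviation list.
ABBREVIATIONS = {
    "GEOL": "GEOLOGICAL",
    "EXPLORAT": "EXPLORATION",
    "INST": "INSTITUTE",
    "PROV": "PROVINCE",
}


def expand_affiliation(text, dictionary=ABBREVIATIONS):
    """Replace each known abbreviated token; leave unknown tokens as they are."""
    tokens = text.upper().split()
    return " ".join(dictionary.get(token, token) for token in tokens)


expanded = expand_affiliation("1 GEOL EXPLORAT INST HENAN PROV")
print(expanded)
```

Token-by-token replacement is deliberately naive: an abbreviation with several possible expansions would need context or manual review.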
If a list of full journal names vs. abbreviations is also needed, it can be found here:
http://images.webofknowledge.com/WOK46/help/WOS/0-9_abrvjt.html
or in a format easier to manage: http://people.su.se/~alau4517/jabref.wos.txt