Showing posts with label datamining. Show all posts

Sunday, June 7, 2015

PatentsView Inventor Disambiguation Technical Workshop

On behalf of the US Patent & Trademark Office, the American Institutes for Research (AIR) is hosting an inventor disambiguation technical workshop.  USPTO is seeking creative new approaches to get better information on innovators and the new technologies they develop by disambiguating inventor names.

AIR invites individual researchers or research teams to develop inventor disambiguation algorithms using US patent data. The top fifteen teams will be invited to present their results at the final workshop, which will be held at USPTO headquarters September 23 and 24.  The researcher or team that contributes the most effective algorithm will receive a $25,000 stipend.

July 1st is the deadline for prospective participants to submit a 1-page “intent to participate” document. This includes any requests to incorporate additional data, software, or hardware requirements. All teams with proposals deemed reasonable by the judges’ panel will be invited to participate.

Additional information is posted at www.dev.patentsview.org/workshop, together with the training datasets.

Monday, November 24, 2014

Using patstat in universities evaluation procedures

This work presents a methodology for matching PATSTAT inventor names to a full list of researchers working in Italian universities.
The goal is to favour recall, leaving institutions and researchers to validate the data.
The focus is not on results (evaluation is still in progress) but on data processing, selection and the matching algorithm, highlighting some difficulties and the related workarounds.


Monday, February 18, 2013

Datasources for entity resolution with company names



The goal of this list is to review possible data sources for entity resolution on companies/institutions and their history/changes of ownership, for use with PATSTAT / WoS.
Apart from US sources, mainly international lists have been searched.

COMPUSTAT

From: NBER PDP Project User Documentation: Matching Patent Data to Compustat Firms
Compustat records identified by CUSIPs or GVKEYs refer to securities[1], not firms; a single organization may correspond to multiple entries within the Compustat data. Sometimes reorganization of the ownership structure generates a new GVKEY; sometimes accounting changes result in multiple GVKEYs for the same organization; sometimes a parent organization and a subsidiary will both have GVKEYs. In order to uniquely identify organizations, we introduce a variable named PDPCO. In most cases, the PDPCO equals the Compustat GVKEY; however, in some cases, multiple GVKEYs are associated with a single PDPCO. These associations are recorded in the PDPCOHDR file.

NBER COMPUSTAT MATCH / OWNERSHIP HISTORY

This is the home page for the new NBER patent data project. US Patent data for 1976-2006 and assignee match to Compustat are now available on the downloads page. Watch this site for announcements of new releases, tools, supplemental files, fixes, etc.
Last updates are from 2010.

Contains the match of USPTO applicants with Compustat and the ownership history of applicants.

Contains the Stata programs used for cleaning, standardizing and assigning the type of applicant.

User forum about the Compustat match (currently empty).

FROM PATSTAT TLS221

TLS221 (PRS legal data) in PATSTAT contains legal status data, including changes of name / changes of ownership (event type RAP*).
Such events also reflect changes in the company (certainly in the case of name changes, but massive changes of patent ownership may also reflect a change of ownership in the company board).

Counts on the 4 events in the 2010 edition show the following figures:

RAP1     500492
RAP2     60191
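
Counts like these can be reproduced with a simple tally once the legal events have been exported from TLS221. The following is a minimal sketch, assuming the export has already been read into (application id, event code) pairs; the function name and the sample rows are invented for illustration and are not PATSTAT data.

```python
from collections import Counter

def count_rap_events(rows):
    """Tally legal event codes that start with 'RAP' (name/ownership changes)."""
    return Counter(code for _appln_id, code in rows if code.startswith("RAP"))

# Illustrative rows only: (application id, legal event code).
sample_rows = [
    (1, "RAP1"),
    (1, "RAP2"),
    (2, "RAP1"),
    (3, "PG25"),  # some other event code, ignored by the tally
]

rap_counts = count_rap_events(sample_rows)
for code, n in sorted(rap_counts.items()):
    print(code, n)
```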


UNIVERSITY OF BOLOGNA LIST OF UNIVERSITY NAMES


Full list of university names by country / continent

NASDAQ COMPANY NAMES



Indian companies names changes

A database specialized in company names for India: changes of name, mergers, demergers


SEC NAMES



Greenhouse Gas Emissions Reporting Program parent companies


Standardized Parent Company Names for TRI Reporting

List of standardized names for the Toxics Release Inventory from the US Environmental Protection Agency


USA grants recipients database

 USAspending.gov has recently introduced the availability of Data Archives. The Archives feature will provide the capability to download archived files on USAspending.gov by the following criteria:
Major Agency/ALL
Fiscal Year
Type of Spending (Contracts, Grants, Loans, Direct Payments, Insurance, Other)
Type of Output (CSV, TAB, XML, ATOM)

The archived data files would be available for download only in the compressed file format for each output type mentioned above.


For recipients of funds, full data (name and address) are given, as well as other fields such as:
recip_cat_type
The original Federal Assistance Awards Data System recipient type code, modified by USAspending.gov into a set of broader categories (government, individual, nonprofit, for profit, higher ed, other).

recipient_type
The type of recipient (i.e., state government, local government, Indian tribe, individual, small business, for-profit, nonprofit, etc.)

And DUNS_NUMBER

CROCTAIL

CrocTail provides an interface for browsing information parsed from SEC filings about several hundred thousand U.S. publicly traded corporations and their foreign subsidiaries.
Information from company filings with the U.S. Securities and Exchange Commission (SEC) has been parsed and annotated by CorpWatch to provide a way for Crocodyl.org users to research and add issues related to corporate subsidiaries. CrocTail also serves as a demonstration of the features and data available through the CorpWatch API.

CROCODYL.ORG

Crocodyl (http://www.crocodyl.org/) is a collaboration between nonprofit organizations such as Center for Corporate Policy, CorpWatch, Corporate Research Project, other contributing organizations and individual contributors from around the world.
It contains a lot of information, but coverage is not homogeneous: in many cases the data come from human rights watch organizations, and updates are mostly dated 2009 or 2010.

ORCID

ORCID is an open, non-profit, community-based effort to create and maintain a registry of unique researcher identifiers and a transparent method of linking research activities and outputs to these identifiers. 
ORCID provides two core functions:  (1) a registry to obtain a unique identifier and manage a record of activities, and (2) APIs that support system-to-system communication and authentication.  ORCID makes its code available under an open source license, and will post an annual public data file under a CC0 waiver for free download.
ORCID identifiers will be 16 digit numbers, segmented into four-digit groups and including a checksum. They will be expressed as HTTP URI (such as http://orcid.org/0137-1963-7688-2319). ORCID identifiers contain no semantic information, such as the year the identifier was minted or the country of origin, and they are issued out of sequence. The ORCID service not only will issue unique author identifiers but also will enable linking to existing author identifier services.
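
The checksum can be verified locally before an identifier is used as a match key. ORCID's public documentation of the identifier structure describes the last character as an ISO 7064 MOD 11-2 check character, which may be the letter X in place of a digit. The sketch below follows that description; the function names are my own and are not part of any ORCID library.

```python
def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def is_valid_orcid(identifier):
    """Check length, character set and checksum of an ORCID identifier."""
    compact = identifier.replace("-", "").upper()
    base = compact[:15]
    if len(compact) != 16 or not (base.isascii() and base.isdigit()):
        return False
    return orcid_check_digit(base) == compact[15]

# The identifier quoted above passes; changing its last digit makes it fail.
print(is_valid_orcid("0137-1963-7688-2319"))
print(is_valid_orcid("0137-1963-7688-2318"))
```

A check like this only catches typing or extraction errors. It says nothing about whether the identifier has actually been issued to a researcher.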

Data provided via API are
  • Bio - Given a contributor, return name and affiliation data.
  • Works - Given a contributor, return list of works he has contributed to.
  • Full - Given a contributor, return list of works he has contributed to, name and affiliation data.
  • Search - Given whatever metadata provided, return a ranked list of potential contributors identified by that metadata.
Supported formats are HTML, XML and JSON.
Recently a partnership with CROSSREF (http://www.crossref.org/) has been launched (http://www.crossref.org/01company/pr/news111412.html) in order to assign a unique author identifier to all DOIs.

OPENCORPORATES

http://opencorporates.com/ is an open collection of information on over 45M companies from 58 jurisdictions.
Its distinctive feature is a web link to the national register (the link does not always work) and also a huge record of inactive / dissolved companies.
The service relies on APIs; there are actually two OpenCorporates APIs: the REST API and the Google Refine Reconciliation API.
The main API (the so-called REST API) allows access as data to all aspects of the OpenCorporates website (with the exception of being able to submit data). By default it returns data as JSON (but XML is also available). Access to all the data is free and under the same open licence conditions as the rest of OpenCorporates. An optional (free) API token is available which increases usage limits for the service.
If you are matching company names to legal entities from an existing dataset, you should investigate the rather excellent Google Refine, which makes this quick, easy and allows data to be filtered and cleansed quicker than any other tool we know of. OpenCorporates provides a highly popular reconciliation API for Google Refine, which allows matching company names to legal entities.

UCLA Anderson DB list

At URL http://www.anderson.ucla.edu/x14520.xml a very comprehensive list of business databases is published and maintained, listed by name along with some downloadable guides.

Most of the resources listed are commercial.

World of Learning

The online version of The Europa World of Learning is available at http://www.worldoflearning.com; the print edition of this internationally respected title was first published over 60 years ago and has become a primary source of information on the academic sphere world-wide.
A free sample for Finland can be explored here: http://www.worldoflearning.com/public/views/entry/FI
The data also cover research institutes, and for each of them the information available includes, apart from name and address, the URL, phone number and publications edited by that institution.

The Carnegie Classification of Institutions of Higher Education

The Carnegie Classification™ has been the leading framework for recognizing and describing institutional diversity in U.S. higher education for the past four decades.
The website http://classifications.carnegiefoundation.org/ provides access to extensive documentation as well as tools for looking up specific institutions, listing all institutions in a particular classification category, aggregating categories within a classification.
Above all, it allows downloading, in XLS format, a list of 4635 US institutions along with 85 variables associated with each of them.
By filtering on the CCBASIC field you can also extract lists of institutions by type, in particular:
15           RU/VH: Research Universities (very high research activity)
16           RU/H: Research Universities (high research activity)
17           DRU: Doctoral/Research Universities
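
The CCBASIC filter described above can be sketched as follows, assuming the XLS file has first been exported to CSV. Only the CCBASIC field and the codes 15, 16 and 17 come from the list above; the NAME column and the sample rows are placeholders invented for illustration.

```python
import csv
import io

# Basic classification codes for the research/doctoral categories listed above.
RESEARCH_CODES = {"15", "16", "17"}

def research_universities(csv_text):
    """Return the rows whose CCBASIC code is one of the research categories."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["CCBASIC"].strip() in RESEARCH_CODES]

# Placeholder data standing in for the exported spreadsheet.
sample = """NAME,CCBASIC
Example University A,15
Example College B,21
Example University C,17
"""

selected = research_universities(sample)
for row in selected:
    print(row["NAME"], row["CCBASIC"])
```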

WoS Abbreviations list

One of the main problems in reconciling affiliations from WoS with other sources is the abbreviated form in which affiliation names and addresses are written.
E.g. resolving 1 GEOL EXPLORAT INST HENAN PROV could be a problem unless we build a dictionary of abbreviations.
At these links:
http://images.webofknowledge.com/WOK46/help/WOS/h_acrnabrv.html#acrnabrv_a
http://images.webofknowledge.com/WOK46/help/WOS/h_adabrv.html#state_country_abbreviations
it is possible to find the list compiled by Thomson Reuters, which is a good starting point.
If a list of full journal names vs abbreviations is also needed, it may be found here
http://images.webofknowledge.com/WOK46/help/WOS/0-9_abrvjt.html
or in a format easier to manage: http://people.su.se/~alau4517/jabref.wos.txt
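
Once such lists are loaded into a dictionary, expanding an abbreviated affiliation is a token-by-token lookup. A minimal sketch follows; the four expansions below are my own guesses for the example string and are not copied from the Thomson Reuters list, so they should be replaced with the real entries.

```python
# Illustrative entries only; a real dictionary would be built from the lists above.
ABBREVIATIONS = {
    "GEOL": "GEOLOGICAL",
    "EXPLORAT": "EXPLORATION",
    "INST": "INSTITUTE",
    "PROV": "PROVINCE",
}

def expand_affiliation(text, dictionary=ABBREVIATIONS):
    """Replace each known abbreviation with its long form, leaving other tokens as they are."""
    return " ".join(dictionary.get(token, token) for token in text.upper().split())

expanded = expand_affiliation("GEOL EXPLORAT INST HENAN PROV")
print(expanded)
```

Unknown tokens (here HENAN) pass through unchanged, so the function can be applied safely to affiliations that are only partly abbreviated.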


[1] http://en.wikipedia.org/wiki/Security_%28finance%29

Wednesday, January 9, 2013

Datamining and entity resolutions: some tools

Here is a review of some tools that I came across while looking for something useful to standardize and resolve a list of affiliations from Web of Science.

Some of the tools proved not that useful for my purposes, but maybe someone else may benefit from them...



OPENCALAIS

Calais Web Service by Thomson Reuters. The web service is an API that accepts unstructured text (like news articles, blog postings, etc.), processes them using natural language processing and machine learning algorithms, and returns RDF-formatted entities, facts and events.
OpenCalais supports three types of entity disambiguation: Company disambiguation, Geographical disambiguation and Product (Electronics) disambiguation.
Disambiguation of company names - such as determining whether the company Olympus refers to Olympus Optical Co. Ltd. or Olympus Life and Material Science Europa. The resolution output for a given company mention includes:
  • A URI that is unique and uniform across documents
  • The formal English legal name of the company
  • The company's ticker symbol (for public companies)
For company names that cannot be disambiguated, the returned results will include no resolution information.

For using opencalais without APIs:

A test of the disambiguation feature on a sample of WoS affiliations showed insufficient performance, since semantic search works better on fuzzy data (e.g. web pages) than on lists of names.

NOTE: running it after install may return the error:
Error parsing c:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\config\machine.config
Parser returned error 0xC00CE556
Rename file C:\Windows\Microsoft.NET\Framework\v2.0.50727\CONFIG\machine.config
And recreate it by copying from: C:\Windows\Microsoft.NET\Framework\v2.0.50727\CONFIG\machine.config.default
Be sure to have full rights on the opencalais templates subfolders in order to avoid other errors.

OPENREFINE

OpenRefine (https://github.com/OpenRefine/OpenRefine/wiki), previously Google Refine, is a desktop application that helps to refine messy data in a few clicks with very powerful clustering and cleaning algorithms.
It also allows data transformations and augmentation.
Above all, it has a reconciliation feature: reconciliation is a semi-automated process of matching text names to database IDs (keys). It is semi-automated because in some cases the machine alone is not sufficient and human judgement is essential.
A test of the reconciliation feature on a sample of WoS affiliations showed insufficient performance, partly because the reconciliation database seems oriented more towards persons / geography than towards organizations.
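
To give an idea of what the clustering step does, here is a minimal sketch in the spirit of the key-collision "fingerprint" method described in the OpenRefine documentation. It is not OpenRefine's own code, and the sample names are invented: values that reduce to the same key are proposed as one cluster.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Build a normalized key: lowercase, ASCII, no punctuation, sorted unique tokens."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))

def cluster(values):
    """Group values that share a fingerprint; keep only groups with more than one member."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]

names = ["Univ. Bologna", "bologna univ", "Università di Bologna", "Univ Bologna "]
clusters = cluster(names)
print(clusters)
```

The three short variants collapse to the key "bologna univ", while the full Italian name keeps its own key. This is the kind of case where an abbreviation dictionary or human judgement is still needed.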

PYTHON NAMEPARSER

Is a Python module that parses human names into individual components:
  • Title
  • First name
  • Middle names
  • Last names
  • Suffixes
Download from: http://code.google.com/p/python-nameparser/

NAME CLEAVER

Name Cleaver (http://sunlightlabs.com/blog/2011/name-standardization-name-cleaver/) supports three major name types, politicians, individuals and organizations, with a specific class and special features for each.
The OrganizationNameCleaver class has methods to reduce a name to only the "kernel" of the name, and also to expand all abbreviations (that Name Cleaver knows of), useful for matching tasks.
The Python code of the program can be downloaded here:  https://github.com/sunlightlabs/name-cleaver
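
The two ideas mentioned above (reducing a name to its kernel and expanding abbreviations) can be illustrated with the standard library alone. This is a rough sketch and not Name Cleaver's API or implementation: the function names, the suffix list and the expansion table are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately short lists; real ones would be much longer.
LEGAL_SUFFIXES = {"inc", "corp", "co", "ltd", "llc", "plc", "gmbh"}
EXPANSIONS = {"univ": "university", "inst": "institute", "intl": "international"}

def tokens(name):
    """Lowercase the name and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", name.lower())

def kernel(name):
    """Drop legal-form suffixes, keeping only the distinctive part of the name."""
    return " ".join(t for t in tokens(name) if t not in LEGAL_SUFFIXES)

def expand(name):
    """Replace known abbreviations with their long form."""
    return " ".join(EXPANSIONS.get(t, t) for t in tokens(name))

print(kernel("Olympus Optical Co. Ltd."))
print(expand("Intl Business Machines Corp"))
```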

OYSTER entity resolution

OYSTER (Open sYSTem Entity Resolution) is an entity resolution system that supports probabilistic direct matching, transitive linking, and asserted linking. To facilitate prospecting for match candidates (blocking), the system builds and maintains an in-memory index of attribute values to identities. Because OYSTER has an identity management system, it also supports persistent identity identifiers. OYSTER is unique among other ER systems in that it is built to incorporate Entity Identity Information Management (EIIM). OYSTER supports EIIM by providing methods that enforce identifiers to be unique among identities, maintain persistent IDs over the life of an identity, and allowing the ability to fix false-positive and false-negative resolutions, which cannot be done with matching rules, through the use of assertion, traceability, and other features.
Developed in Java, it can be downloaded from:  http://sourceforge.net/projects/oysterer/
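
Transitive linking, one of the features listed above, means that if record A matches B and B matches C, all three end up in the same identity even if A and C never matched directly. The following is a minimal union-find sketch of that idea; it is not OYSTER code, and the record ids and matched pairs are invented.

```python
def resolve(records, matched_pairs):
    """Merge records into identities by following match links transitively."""
    parent = {record: record for record in records}

    def find(x):
        # Walk up to the representative, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as the stable representative of the identity.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    identities = {}
    for record in records:
        identities.setdefault(find(record), []).append(record)
    return sorted(identities.values())

records = ["r1", "r2", "r3", "r4"]
matched_pairs = [("r1", "r2"), ("r2", "r3")]  # r1 and r3 never match directly
entities = resolve(records, matched_pairs)
print(entities)
```

Transitive closure also propagates errors: a single false-positive pair merges two whole identities. That is the situation the asserted linking and correction features described above are meant to address.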