Showing posts with label database. Show all posts

Friday, September 21, 2018

Google dataset search

Google recently launched a new service that aims to index local, public and national data repositories: Google Dataset Search.

Dataset Search lets you find datasets wherever they’re hosted, whether it’s a publisher's site, a digital library, or an author's personal web page.

Google also developed guidelines for dataset providers to describe their data in a way that search engines can better understand the content of their pages.

The approach is based on an open standard for describing this information (schema.org) and anybody who publishes data can describe their dataset this way.
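As a rough illustration of what such a description can look like, here is a minimal sketch that builds schema.org Dataset markup as JSON-LD. The dataset name, URL and keywords are invented for the example, and only a handful of common properties are shown; the provider guidelines should be checked for the properties actually expected.

```python
import json


def dataset_jsonld(name, description, url, keywords=(), license_url=None):
    """Build a schema.org Dataset description as a JSON-LD string."""
    doc = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
    }
    # Optional properties are only emitted when a value is supplied.
    if keywords:
        doc["keywords"] = list(keywords)
    if license_url:
        doc["license"] = license_url
    return json.dumps(doc, indent=2)


# Hypothetical dataset, used only to show the shape of the markup.
markup = dataset_jsonld(
    name="Example patent counts by country",
    description="Yearly counts of patent applications per country (illustrative).",
    url="https://example.org/datasets/patent-counts",
    keywords=["patents", "statistics"],
)

# JSON-LD is normally embedded in the dataset's web page inside a script tag.
print('<script type="application/ld+json">')
print(markup)
print("</script>")
```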

Where possible, the engine also links each dataset to the Google Scholar articles that use it.


Full story @ link
https://www.blog.google/products/search/making-it-easier-discover-datasets/

Monday, December 18, 2017

Maxmind dataset for geolocation

On this web page:

https://www.maxmind.com/en/open-source-data-and-api-for-ip-geolocation

MaxMind provides some datasets useful for geolocation:

GeoLite2 databases are free IP geolocation databases. The GeoLite2 Country and City databases are updated on the first Tuesday of each month. The GeoLite2 ASN database is updated every Tuesday.

IP Geolocation Usage

IP geolocation is inherently imprecise. Locations are often near the center of the population. Any location provided by a GeoIP database should not be used to identify a particular address or household.
Use the Accuracy Radius as an indication of geolocation accuracy for the latitude and longitude coordinates we return for an IP address. The actual location of the IP address is likely within the area defined by this radius and the latitude and longitude coordinates.
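To make the role of the accuracy radius concrete, here is a small sketch that checks whether a candidate point falls inside the circle defined by the returned coordinates and radius. It assumes the radius is expressed in kilometres; the coordinates and the 50 km radius are made-up values, not the output of a real lookup.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_accuracy_radius(geo_lat, geo_lon, radius_km, lat, lon):
    """True if (lat, lon) lies inside the circle around the geolocated point."""
    return haversine_km(geo_lat, geo_lon, lat, lon) <= radius_km


# Hypothetical lookup result: a point in Turin with a 50 km accuracy radius.
geo_lat, geo_lon, radius_km = 45.07, 7.69, 50

# Approximate coordinates of two candidate places (illustrative values).
near_turin = within_accuracy_radius(geo_lat, geo_lon, radius_km, 45.10, 7.77)
milan = within_accuracy_radius(geo_lat, geo_lon, radius_km, 45.46, 9.19)
print(near_turin, milan)
```

A check like this only says that a location is compatible with the lookup; it cannot pin the IP address to a particular address or household.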

Includes city, region, country, latitude and longitude. This product doesn't contain any IP addresses.

Includes the following fields (see Technical Details):

  • Country Code
  • ASCII City Name
  • City Name
  • Region
  • Population
  • Latitude (The approximate latitude of the postal code, city, subdivision or country associated with the IP address.*)
  • Longitude (The approximate longitude of the postal code, city, subdivision or country associated with the IP address.*)

Another interesting dataset, even though it is no longer maintained, is World Cities with Population:

City/state/country text as it appears in source files is algorithmically matched against a master geocode file from Google and MaxMind open source files.

Tuesday, December 12, 2017

Google Patents public dataset

Since October 31st, 2017, the patent data behind Google Patents has been available on Google Cloud and the BigQuery platform: worldwide bibliographic information on more than 90 million patent publications from 17 countries, plus US full text, provided by IFI CLAIMS Patent Services.

https://cloud.google.com/blog/big-data/2017/10/google-patents-public-datasets-connecting-public-paid-and-private-patent-data

The page below has more details and examples:

https://console.cloud.google.com/launcher/details/google_patents_public_datasets/google-patents-public-data

Wednesday, February 12, 2014

European Patent Register for PATSTAT

As previously announced, the EPO has released EP Register data as a plug-and-play extension of PATSTAT.

This new dataset, linkable by appln_id to the other tables, contains the full track of events an application goes through (the legal events table TLS221 may be considered a subset of this dataset), as well as the legal representative and other useful pieces of information.

An example of the data contained (for EP1000000):

https://register.epo.org/application?number=EP99203729&tab=main
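The appln_id link mentioned above can be sketched with an in-memory SQLite database. The table tls201_appln mimics the PATSTAT application table, while reg_events is a hypothetical stand-in for a register table (the real table and column names are in the Register data catalog); all rows are invented toy data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Toy version of the PATSTAT application table.
    CREATE TABLE tls201_appln (
        appln_id   INTEGER PRIMARY KEY,
        appln_auth TEXT,
        appln_nr   TEXT
    );
    -- Hypothetical register table: one row per procedural event.
    CREATE TABLE reg_events (
        appln_id   INTEGER,
        event_date TEXT,
        event_text TEXT
    );
    INSERT INTO tls201_appln VALUES (1, 'EP', '00000001'), (2, 'EP', '00000002');
    INSERT INTO reg_events VALUES
        (1, '2000-01-01', 'event A'),
        (1, '2001-06-15', 'event B'),
        (2, '2002-03-10', 'event A');
    """
)

# Join the two tables on appln_id to list the events of one application.
rows = conn.execute(
    """
    SELECT a.appln_auth || a.appln_nr AS application, e.event_date, e.event_text
    FROM tls201_appln AS a
    JOIN reg_events AS e ON e.appln_id = a.appln_id
    WHERE a.appln_id = 1
    ORDER BY e.event_date
    """
).fetchall()

for row in rows:
    print(row)
```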

Here is the full EPO letter announcing the release:




It is our pleasure to announce that we now have the “European Patent Register for PATSTAT” database ready for sale.  

What is the European Patent Register for PATSTAT ?

The European Patent Register for PATSTAT is a data base extracted from the European Patent Register.  In this database, the European Patent Office stores all the publicly available information it has on European patent applications (inclusive Euro-PCT) as they pass through the grant procedure.  The main purpose of the data base is to allow patent information users in companies, research institutes and public institutions to carry out in-depth statistical or business intelligence analysis of European patent applications.

What can the register do for you ?

Possible areas of research:

  • Broad analysis of the European patent system
  • Applicant filing behaviour at the EPO
  • Technology developments
  • Speed and quality of the granting process
  • Competition matters

Typical questions that can be answered with this data base:

  • In what stage of the procedure do we find European patent applications: published, refused, withdrawn, granted, opposed, revoked, amended, ...?
  • Oppositions to European patents: who is opposing whom, in what technological areas (IPC-based) is opposition customary, how long do the various procedural steps take, ...?
  • How do you find professional representatives active in a certain technological area?
  • What are the success rates of opposition and appeal proceedings?
  • How wide is the scope of the cited prior art? What is the impact of non-patent literature?

How does this new data base relate to PATSTAT ?

The “European Patent Register for PATSTAT” can be used as a standalone database or can be linked to PATSTAT via the application identification (appln_id). There are sufficient bibliographical attributes to carry out a broad range of analysis on European patent applications without the need of having PATSTAT. But it speaks for itself that linking PATSTAT to the “European Patent Register for PATSTAT” opens a whole new range of analysis which was previously not easily possible.



Where do you start ?

If you are interested in “European Patent Register for PATSTAT”, we would advise you to first download a sample data set so that you can check for yourself if the data base contains the necessary attributes for your purpose.
This sample is available in MS ACCESS format via this link: https://publication.epo.org/raw-data/product?productId=1
We also strongly recommend you to consult the Register Data Catalog if you want to learn about the details of the various tables and attributes.  Here is the link: http://documents.epo.org/projects/babylon/eponet.nsf/0/2A7D990B3A6ADDCCC1257C4000559878/$File/data_catalog_register_v1-1-2_en.pdf

For pricing, we kindly refer you to this document : http://documents.epo.org/projects/babylon/eponet.nsf/0/2A7D990B3A6ADDCCC1257C4000559878/$File/patstat_family_products_pricelist_en.pdf
You will see that we have made a very competitive package price for users who want to order the 3 products together:

  • PATSTAT
  • European Patent Register for PATSTAT
  • EPO worldwide legal status database for PATSTAT

If you have any further questions, don't hesitate to contact us via patstat@epo.org .

Monday, September 23, 2013

UK Business Structure Database (BSD)

The Business Structure Database (BSD) is a dataset made available without fee for research purposes, which contains a number of variables for almost all business organisations in the UK. The BSD is derived primarily from the Inter-Departmental Business Register (IDBR), which is a live register of data collected by HM Revenue and Customs via VAT and Pay As You Earn (PAYE) records. The IDBR data are complemented with data from ONS business surveys. The available timeframe is 1997-2011.

The following variables are available for enterprises and local units:

    employment (and employees)
    turnover
    Standard Industrial Classification (1992, 2003 and 2007 classifications are available)
    legal status (e.g. sole proprietor, partnership, public corporation, non-profit organisation etc)
    foreign ownership
    birth (company start date)
    death (termination date of trading)

'Employment' includes business owners, whereas 'employees' measures the number of staff, excluding owners.

The full list of variables is available here:

http://www.esds.ac.uk/doc/6697/mrdoc/excel/variables_in_idbr_1997_2005_with_ent_code_generator.xls

Registration is required and standard conditions of use apply.

http://discover.ukdataservice.ac.uk/catalogue?sn=6697

Tuesday, August 20, 2013

EPO documents in google patents

After USPTO documents, Google has recently added EPO documents to its patent search engine.
The data are based on Espacenet and the Register, and show all available data except patent families.

For instance, see:
http://www.google.com/patents/EP1000000A1?hl=it&cl=en


Google not only reorganizes the data but also introduces some original features, such as:
- find prior art: looks for previous applications using keywords extracted from the patent title;
- look for questions about the patent, if it is tagged on http://patents.stackexchange.com;
- browse by inventor or applicant.

Some of the data have also been made more user friendly, for example through standardization of inventor and applicant names; descriptions of legal event codes have also been added.

Monday, February 18, 2013

Datasources for entity resolution with company names



The goal of this list is to review possible data sources for entity resolution on companies/institutions and their history/changes of ownership, for use with PATSTAT / WoS.
Aside from US sources, mainly international lists have been searched.

COMPUSTAT

From: NBER PDP Project User Documentation: Matching Patent Data to Compustat Firms
Compustat records identified by CUSIPs or GVKEYs refer to securities[1], not firms; a single organization may correspond to multiple entries within the Compustat data. Sometimes reorganization of the ownership structure generates a new GVKEY; sometimes accounting changes result in multiple GVKEYs for the same organization; sometimes a parent organization and a subsidiary will both have GVKEYs. In order to uniquely identify organizations, we introduce a variable named PDPCO. In most cases, the PDPCO equals the Compustat GVKEY; however, in some cases, multiple GVKEYs are associated with a single PDPCO. These associations are recorded in the PDPCOHDR file.
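The many-to-one relation between GVKEYs and PDPCO can be pictured with a small lookup. This is only a sketch: the (gvkey, pdpco) pairs are invented, and the real PDPCOHDR file may use different column names and carry more fields.

```python
from collections import defaultdict

# Toy stand-in for the PDPCOHDR file: (gvkey, pdpco) pairs, invented values.
pdpcohdr = [
    ("001001", 1001),  # the usual case: one GVKEY for the organization
    ("002001", 2001),  # two GVKEYs that belong to the same organization
    ("002002", 2001),
]

# Forward lookup: which organization does a Compustat record belong to?
gvkey_to_pdpco = dict(pdpcohdr)

# Reverse lookup: all GVKEYs attached to one organization.
gvkeys_by_pdpco = defaultdict(list)
for gvkey, pdpco in pdpcohdr:
    gvkeys_by_pdpco[pdpco].append(gvkey)

print(gvkey_to_pdpco["002002"])
print(gvkeys_by_pdpco[2001])
```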

NBER COMPUSTAT MATCH / OWNERSHIP HISTORY

This is the home page for the new NBER patent data project. US Patent data for 1976-2006 and assignee match to Compustat are now available on the downloads page. Watch this site for announcements of new releases, tools, supplemental files, fixes, etc.
Last updates are from 2010.

It contains the match between USPTO applicants and Compustat, and the ownership history of applicants.

It also contains the Stata programs used for cleaning, standardizing and assigning the type of applicant.
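To give an idea of what such a cleaning and standardizing step does, here is a minimal Python sketch. It is not the NBER Stata code: the rules and the short list of legal forms are assumptions chosen for illustration, and a real routine would need a much longer list and more careful handling of abbreviations.

```python
import re
import unicodedata

# Illustrative, far from exhaustive, list of legal-form tokens to drop.
LEGAL_FORMS = {
    "INC", "CORP", "CORPORATION", "LTD", "LIMITED", "LLC", "PLC",
    "GMBH", "AG", "SA", "SPA", "SRL", "BV", "NV", "CO", "COMPANY",
}


def standardize_name(raw):
    """Return an upper-case, accent-free name without trailing legal forms."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Join dotted abbreviations (S.P.A. -> SPA), then blank out punctuation.
    text = text.upper().replace(".", "")
    tokens = re.sub(r"[^A-Z0-9]+", " ", text).split()
    # Strip legal forms from the end, but never empty the name completely.
    while len(tokens) > 1 and tokens[-1] in LEGAL_FORMS:
        tokens.pop()
    return " ".join(tokens)


for name in ["Siemens AG", "Société Générale S.A.", "International Business Machines Corp."]:
    print(name, "->", standardize_name(name))
```

Two spellings of the same applicant can then be compared on the standardized form rather than on the raw string.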

There is also a user forum about the Compustat match (currently empty).

FROM PATSTAT TLS221

TLS221 (PRS legal data) in PATSTAT contains legal data, including changes of name / changes of ownership (event type RAP*).
Such events also reflect changes in the company itself: certainly in the case of name changes, but massive changes of patent ownership may also reflect a change of ownership at the company level.

Counts for the 4 events in the 2010 edition give the following figures:

RAP1     500492
RAP2     60191


UNIVERSITY OF BOLOGNA LIST OF UNIVERSITY NAMES


Full list of university names by country / continent

NASDAQ COMPANY NAMES



Indian companies names changes

A database specialized in Indian company names: changes of name, mergers and demergers.


SEC NAMES



Greenhouse Gas Emissions Reporting Program parent companies


Standardized Parent Company Names for TRI Reporting

List of standardized names for the Toxics Release Inventory from the US Environmental Protection Agency.


USA grants recipients database

 USAspending.gov has recently introduced the availability of Data Archives. The Archives feature will provide the capability to download archived files on USAspending.gov by the following criteria:
Major Agency/ALL
Fiscal Year
Type of Spending (Contracts, Grants, Loans, Direct Payments, Insurance, Other)
Type of Output (CSV, TAB, XML, ATOM)

The archived data files would be available for download only in the compressed file format for each output type mentioned above.


For recipients of funds, full data (name and address) are given, as well as other fields such as:
recip_cat_type
The original Federal Assistance Awards Data System recipient type code, modified by USAspending.gov into a set of broader categories (government, individual, nonprofit, for profit, higher ed, other).

recipient_type
The type of recipient (i.e., state government, local government, Indian tribe, individual, small business, for-profit, nonprofit, etc.)

And DUNS_NUMBER
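As a sketch of how a downloaded CSV archive could be used to build a list of recipients, the snippet below filters rows by recip_cat_type. The field names recip_cat_type, recipient_type and DUNS_NUMBER come from the description above; the recipient_name column, the category labels as written in the file, and all rows are assumptions made for the example.

```python
import csv
import io

# Tiny in-memory stand-in for an extracted CSV archive (invented rows).
sample_csv = io.StringIO(
    "recipient_name,recip_cat_type,recipient_type,DUNS_NUMBER\n"
    "Example State University,higher ed,state government,000000001\n"
    "Example Widgets Inc,for profit,small business,000000002\n"
    "Example Institute of Technology,higher ed,nonprofit,000000003\n"
)


def recipients_in_category(handle, category):
    """Return the rows whose recip_cat_type matches the requested category."""
    return [
        row
        for row in csv.DictReader(handle)
        if row["recip_cat_type"].strip().lower() == category
    ]


higher_ed = recipients_in_category(sample_csv, "higher ed")

# Keep the DUNS number as text so that leading zeros are preserved.
duns_by_name = {row["recipient_name"]: row["DUNS_NUMBER"] for row in higher_ed}
print(duns_by_name)
```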

CROCTAIL

CrocTail provides an interface for browsing information parsed from SEC filings about several hundred thousand U.S. publicly traded corporations and their foreign subsidiaries.
Information from company filings with the U.S. Securities and Exchange Commission (SEC) has been parsed and annotated by CorpWatch to provide a way for Crocodyl.org users to research and add issues related to corporate subsidiaries. CrocTail also serves as a demonstration of the features and data available through the CorpWatch API.

CROCODYL.ORG

Crocodyl (http://www.crocodyl.org/) is a collaboration between nonprofit organizations such as Center for Corporate Policy, CorpWatch, Corporate Research Project, other contributing organizations and individual contributors from around the world.
It contains a lot of information, but coverage is not homogeneous: in many cases the data come from human rights watch groups, and updates are mostly dated 2009 and 2010.

ORCID

ORCID is an open, non-profit, community-based effort to create and maintain a registry of unique researcher identifiers and a transparent method of linking research activities and outputs to these identifiers. 
ORCID provides two core functions:  (1) a registry to obtain a unique identifier and manage a record of activities, and (2) APIs that support system-to-system communication and authentication.  ORCID makes its code available under an open source license, and will post an annual public data file under a CC0 waiver for free download.
ORCID identifiers will be 16 digit numbers, segmented into four-digit groups and including a checksum. They will be expressed as HTTP URI (such as http://orcid.org/0137-1963-7688-2319). ORCID identifiers contain no semantic information, such as the year the identifier was minted or the country of origin, and they are issued out of sequence. The ORCID service not only will issue unique author identifiers but also will enable linking to existing author identifier services.
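The checksum can be verified in a few lines. The text above does not say which algorithm is used; this sketch assumes the ISO 7064 MOD 11-2 scheme that ORCID documents for its identifiers, in which the last character may be the letter X (standing for 10) instead of a digit.

```python
def orcid_check_digit(base_digits):
    """Compute the check character for the first 15 digits (ISO 7064 MOD 11-2)."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check the format and checksum of an identifier such as 0137-1963-7688-2319."""
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15]


# The identifier quoted in the text passes the check; a one-digit change does not.
print(is_valid_orcid("0137-1963-7688-2319"))  # True
print(is_valid_orcid("0137-1963-7688-2318"))  # False
```

A check like this catches typing errors before an identifier is used as a join key, but it says nothing about whether the identifier has actually been issued.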

Data provided via the API are:
  • Bio - Given a contributor, return name and affiliation data.
  • Works - Given a contributor, return list of works he has contributed to.
  • Full - Given a contributor, return list of works he has contributed to, name and affiliation data.
  • Search - Given whatever metadata provided, return a ranked list of potential contributors identified by that metadata.
Supported formats are HTML, XML and JSON.
Recently a partnership with CROSSREF (http://www.crossref.org/) has been launched (http://www.crossref.org/01company/pr/news111412.html) in order to assign unique author identifiers to all DOIs.

OPENCORPORATES

http://opencorporates.com/ is an open-source collection of information on over 45 million companies from 58 jurisdictions.
Its distinctive features are a web link to the national register (the link does not always work) and a huge record of inactive / dissolved companies.
The service relies on APIs; there are actually two OpenCorporates APIs: the REST API and the Google Refine Reconciliation API.
The main API (the so-called REST API) allows access as data to all aspects of the OpenCorporates website (with the exception of being able to submit data). By default it returns data as JSON (but XML is also available). Access to all the data is free and under the same open licence conditions as the rest of OpenCorporates. An optional (free) API token is available which increases usage limits for the service.
If you are matching company names to legal entities from an existing dataset, you should investigate the rather excellent Google Refine, which makes this quick, easy and allows data to be filtered and cleansed quicker than any other tool we know of. OpenCorporates provides a highly popular reconciliation API for Google Refine, which allows matching company names to legal entities.

UCLA Anderson DB list

A very comprehensive list of business databases is published and maintained at http://www.anderson.ucla.edu/x14520.xml; databases are listed by name, along with some downloadable guides.

Most of the resources listed are commercial.

World of Learning

The online version of The Europa World of Learning is available at http://www.worldoflearning.com; the print edition of this internationally respected title was first published over 60 years ago and has become a primary source of information on the academic sphere world-wide.
A free sample for Finland can be explored here: http://www.worldoflearning.com/public/views/entry/FI
The data also cover research institutes; for each of them the available information includes, apart from name and address, the URL, phone number and publications edited by that institution.

The Carnegie Classification of Institutions of Higher Education

The Carnegie Classification™ has been the leading framework for recognizing and describing institutional diversity in U.S. higher education for the past four decades.
The website http://classifications.carnegiefoundation.org/ provides access to extensive documentation as well as tools for looking up specific institutions, listing all institutions in a particular classification category, aggregating categories within a classification.
Above all, it allows you to download, in XLS format, a list of 4635 US institutions along with 85 variables associated with each of them.
By filtering on the CCBASIC field you can also extract lists of institutions by type, in particular:
15           RU/VH: Research Universities (very high research activity)
16           RU/H: Research Universities (high research activity)
17           DRU: Doctoral/Research Universities
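The CCBASIC filter can be sketched as follows. The snippet assumes the XLS file has been exported to CSV first; apart from CCBASIC, the column name and the institutions are invented for the example.

```python
import csv
import io

# CCBASIC codes for research-oriented universities, as listed above.
RESEARCH_CODES = {"15": "RU/VH", "16": "RU/H", "17": "DRU"}

# Tiny in-memory stand-in for a CSV export of the download (invented rows).
sample_csv = io.StringIO(
    "NAME,CCBASIC\n"
    "Example University A,15\n"
    "Example College B,21\n"
    "Example University C,17\n"
)

# Keep only the institutions whose CCBASIC code is one of the three above.
research = [
    (row["NAME"], RESEARCH_CODES[row["CCBASIC"]])
    for row in csv.DictReader(sample_csv)
    if row["CCBASIC"] in RESEARCH_CODES
]

for name, label in research:
    print(label, name)
```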

WoS Abbreviations list

One of the main problems in reconciling affiliations from WoS with other sources is the abbreviated form of affiliation names and addresses.
For example, resolving 1 GEOL EXPLORAT INST HENAN PROV could be a problem unless we build a dictionary of abbreviations.
At the links:
http://images.webofknowledge.com/WOK46/help/WOS/h_acrnabrv.html#acrnabrv_a
http://images.webofknowledge.com/WOK46/help/WOS/h_adabrv.html#state_country_abbreviations
it is possible to find the list compiled by Thomson Reuters, which is a good starting point.
If a list of full journal names vs. abbreviations is also needed, it may be found here:
http://images.webofknowledge.com/WOK46/help/WOS/0-9_abrvjt.html
or in a format easier to manage: http://people.su.se/~alau4517/jabref.wos.txt
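A dictionary-based expansion of abbreviated affiliations could look like the sketch below. The four expansions are guesses made for this example and are not copied from the Thomson Reuters list; a real dictionary would be loaded from the pages linked above.

```python
# Illustrative abbreviation dictionary (guessed expansions, not the official list).
ABBREVIATIONS = {
    "GEOL": "Geological",
    "EXPLORAT": "Exploration",
    "INST": "Institute",
    "PROV": "Province",
}


def expand_affiliation(short):
    """Expand known abbreviations; title-case the tokens that are not in the dictionary."""
    return " ".join(ABBREVIATIONS.get(token, token.title()) for token in short.split())


print(expand_affiliation("1 GEOL EXPLORAT INST HENAN PROV"))
# 1 Geological Exploration Institute Henan Province
```

Tokens missing from the dictionary are passed through unchanged apart from capitalization, which makes it easy to spot the abbreviations that still need an entry.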


[1] http://en.wikipedia.org/wiki/Security_%28finance%29

Tuesday, December 4, 2012

UK IP office data interface


The UK IP Office is also a good source of already structured data.

Its IPSUM (Online Patent Information and Document Inspection Service) interface allows users to access (and easily parse) many data, including EP register data.


Data available are:

Application Number, Application Source, Application Language, Publication Number, Publication Language, Status (e.g. Granted), Filing Date, Publication Date, Grant Date, Last Renewal Date, Year of Last Renewal, Next Renewal Date, Designated States, Application Title, Grant Title, Address for Service + ADP, Applicant / Proprietor + ADP, Inventors + ADP, EPO Representative + ADP


For example: http://www.ipo.gov.uk/p-ipsum/Case/PublicationNumber/EP1000000

ADP stands for automated data processing number; it is similar to person_id in PATSTAT and, unfortunately, is not disambiguated (see the New Holland example below).



Applicant / Proprietor NEW HOLLAND KOBELCO CONSTRUCTION MACHINERY S.P.A. Strada di Settimo, 323 10099 San Mauro Torinese Italy [ADP Number 77324077001]

Applicant / Proprietor NEW HOLLAND KOBELCO CONSTRUCTION MACHINERY S.P.A. Strada di Settimo, 323 10099 San Mauro Torinese (TO) Italy [ADP Number 76169127001]
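The two records above differ only by "(TO)" in the address, yet carry different ADP numbers. A simple way to spot such cases is to group ADP numbers by a normalized applicant name, as in this sketch; the normalization rule is an assumption and is deliberately crude.

```python
import re
from collections import defaultdict

# The two records quoted above: (name, address, ADP number).
records = [
    (
        "NEW HOLLAND KOBELCO CONSTRUCTION MACHINERY S.P.A.",
        "Strada di Settimo, 323 10099 San Mauro Torinese Italy",
        "77324077001",
    ),
    (
        "NEW HOLLAND KOBELCO CONSTRUCTION MACHINERY S.P.A.",
        "Strada di Settimo, 323 10099 San Mauro Torinese (TO) Italy",
        "76169127001",
    ),
]


def name_key(name):
    """Upper-case the name, join dotted abbreviations and drop punctuation."""
    return re.sub(r"[^A-Z0-9]+", " ", name.upper().replace(".", "")).strip()


# Collect every ADP number seen for each normalized name.
adp_by_name = defaultdict(set)
for name, _address, adp in records:
    adp_by_name[name_key(name)].add(adp)

# Names attached to more than one ADP number are candidates for merging.
duplicates = {key: sorted(adps) for key, adps in adp_by_name.items() if len(adps) > 1}
print(duplicates)
```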