Wednesday, November 6, 2019

PATSTAT 2019b MySQL upload scripts


My new MySQL scripts for uploading PATSTAT 2019b data are available at this link.


The only change in this data edition is the removal of table TLS906_PERSON, which is now fully integrated into TLS206.
My scripts have been updated to drop MyISAM as the default engine and to make them fully compatible with MySQL 8.0.
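Just to illustrate the engine change, below is a minimal sketch (in Python, via mysql-connector-python) of the kind of DDL the scripts now emit: ENGINE = InnoDB instead of MyISAM. The table layout is heavily simplified and the connection parameters are placeholders; the linked scripts contain the real definitions.

# Minimal sketch: create a (simplified) PATSTAT table with InnoDB,
# the MySQL 8.0 default engine, instead of MyISAM.
# Connection parameters are placeholders.
import mysql.connector

DDL = """
CREATE TABLE IF NOT EXISTS tls206_person (
    person_id   INT NOT NULL,
    person_name VARCHAR(500),
    PRIMARY KEY (person_id)
) ENGINE = InnoDB DEFAULT CHARSET = utf8mb4;
"""

conn = mysql.connector.connect(user="patstat_user", password="secret",
                               host="localhost", database="patstat2019b")
cur = conn.cursor()
cur.execute(DDL)
cur.close()
conn.close()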

Monday, October 7, 2019

PATSTAT & Patentsview: complements or substitutes?

The match between PATSTAT (spring 2019) and PatentsView (January 2019) is discussed here, with a focus on how this match can enrich PATSTAT with information on USPTO patents that it does not contain (and the other way round).


Monday, September 30, 2019

Patents applicants: how to create the full time series

I am sharing a presentation I gave at the EPO & KUL summer school in Vienna last September; I hope it's useful to someone else.

Patent applicant data changes over time;
the main reasons for change are ownership changes, name/address changes, M&A …
The applicant names contained in TLS206 are the ‘last available’ data;
PATSTAT Global + the EP Register provide several sources to build a chain of names
and a timeline for the patents they contain; a minimal query to get started is sketched below.
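As that starting point, the sketch below pulls, for each application, the applicant names from TLS206 together with the filing date from TLS201, via the TLS207 link table. This only yields the ‘last available’ names; building the full chain requires the additional sources listed above. Connection parameters are placeholders.

# Sketch: applicant names per application with filing dates -- the raw
# material for a name timeline. Connection details are placeholders.
import mysql.connector

QUERY = """
SELECT p.person_id, p.person_name,
       a.appln_id, a.appln_filing_date
FROM   tls206_person     p
JOIN   tls207_pers_appln pa ON pa.person_id = p.person_id
JOIN   tls201_appln      a  ON a.appln_id   = pa.appln_id
WHERE  pa.applt_seq_nr > 0   -- applicants only, not inventors
ORDER  BY p.person_name, a.appln_filing_date
"""

conn = mysql.connector.connect(user="patstat_user", password="secret",
                               host="localhost", database="patstat2019b")
cur = conn.cursor()
cur.execute(QUERY)
for person_id, name, appln_id, filing_date in cur:
    print(person_id, name, appln_id, filing_date)
conn.close()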




Thursday, July 18, 2019

Patent citations to PubMed scientific publications



A recently released dataset from Matt Marx and Aaron Fuegi contains citations from USPTO patents granted 1947-2018 to articles captured by the Microsoft Academic Graph (MAG) from 1800-2018.

The tab-separated files are available at this link: https://zenodo.org/record/3338601

The main file, pcs.tsv, contains the resolved citations, matching patent number and MAG ID, along with the original citation text from the patent, an indicator of whether the citation was supplied by the applicant, the examiner, or unknown, and a confidence score (1-10) indicating how likely the match is correct.

There is also a PubMed-specific match in pcs-pubmed.tsv.
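A minimal pandas sketch for getting started with the main file; note that the column names below are my assumptions, so check the actual header of pcs.tsv before relying on them, and the confidence threshold is just an example.

# Sketch: load pcs.tsv and keep high-confidence matches only.
# Column names are assumptions -- inspect the real header first.
import pandas as pd

pcs = pd.read_csv("pcs.tsv", sep="\t")
print(pcs.columns.tolist())   # check the actual column names

# example: keep matches with confidence score >= 8 (scale is 1-10)
high_conf = pcs[pcs["confscore"] >= 8]
print(len(high_conf), "high-confidence citations kept")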

The authors also made available the source code for generating the patent citations to science in pcs.tsv, at https://github.com/mattmarx/reliance_on_science. The source code for generating jif.zip and jcif.zip (Journal Impact Factor and Journal Commercial Impact Factor) is at https://github.com/mattmarx/jcif.

Scripts and programs are mainly for Stata and Linux (DO files and sh scripts).

 

Monday, July 8, 2019

PATSTAT - PatentsView concordance update 2019

PatentsView is a platform built on data derived from USPTO bulk data files.

This dataset complements PATSTAT perfectly: the former has native disambiguation of inventors and applicants and a geocoding system applied to both, while the latter links US data to other offices, allowing the calculation of knowledge spillovers, family data, etc.

At this link it is possible to download a concordance table between patent_id (PatentsView's main key) and appln_id (from PATSTAT).

The overlap between the two datasets is not perfect: PatentsView contains only patents granted after 1975, while PATSTAT also has ungranted applications and covers pre-1975 data as well.
On the other hand, PATSTAT misses design patents before 2001, plant patents before 2001, and the 'statutory invention registration' patent type.

Data are from PATSTAT spring 2019 and PatentsView March 2019; thus some 2019 data in PatentsView are also missing from PATSTAT.
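Below, a sketch of how the concordance can be used with pandas to move between the two keys; the PatentsView file and column names are assumptions based on its bulk downloads, and the concordance file name is a placeholder.

# Sketch: attach PATSTAT appln_id to PatentsView patents through the
# concordance table. File and column names are assumptions/placeholders.
import pandas as pd

conc = pd.read_csv("concordance.csv", dtype=str)        # patent_id, appln_id
patents = pd.read_csv("patent.tsv", sep="\t", dtype=str,
                      usecols=["id", "date", "title"])  # PatentsView bulk file

merged = patents.merge(conc, left_on="id", right_on="patent_id", how="left")

# rows without an appln_id are in PatentsView but not in PATSTAT
missing = merged["appln_id"].isna().sum()
print(missing, "PatentsView patents without a PATSTAT match")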

Thursday, June 6, 2019

Patent familiarity calc scripts

I made available on GitHub a set of MySQL and Python scripts to create a familiarity indicator by IPC class on the NBER patents dataset.

https://github.com/gtarasconi/NBER-familiarity-indicators

An inventor's familiarity with the components of the invention is measured by the (a) recency and (b) frequency of usage of the focal patent's classes across all US patents. Thus we calculate a familiarity measure for each separate class of a focal patent: the more recently and frequently a class has been used, the higher its individual measure.

Based on Fleming 2001
https://funginstitute.berkeley.edu/wp-content/uploads/2012/10/Recombinant-Uncertainty-in-Technological-Search.pdf
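A toy sketch of the idea (not the actual scripts in the repository): for each focal patent and class, count prior usages of the class within a look-back window, weighting recent usage more. The window length and the linear decay are arbitrary assumptions here; the repository contains the real definitions.

# Toy sketch: per-class familiarity as a discounted count of prior usages.
# Window length and linear decay are assumptions, not the repo's exact spec.
import pandas as pd

# toy data: one row per (patent, grant year, IPC class)
data = pd.DataFrame({
    "patent_id": [1, 2, 3, 4, 5],
    "year":      [1990, 1992, 1994, 1994, 1995],
    "ipc":       ["A01B", "A01B", "C07D", "A01B", "A01B"],
})

WINDOW = 5  # look back 5 years

def familiarity(focal_year, ipc, df):
    """Sum of age-discounted prior usages of `ipc` within the window."""
    prior = df[(df["ipc"] == ipc)
               & (df["year"] < focal_year)
               & (df["year"] >= focal_year - WINDOW)]
    # the more recent the usage, the closer its weight is to 1
    return ((WINDOW - (focal_year - prior["year"])) / WINDOW).sum()

# familiarity of class A01B for a focal patent granted in 1995
print(familiarity(1995, "A01B", data))   # 0.0 + 0.4 + 0.8 = 1.2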


Wednesday, May 22, 2019

PATSTAT 2019a analysis of changes

As stated in the EPO documents, the new version of PATSTAT comes with two major changes:

Person names in the original language (PERSON_NAME_ORIG_LG)
This field has been added to tables TLS206, 226 and 906. It inflates the number of records and breaks backward compatibility with old person-wise data.

“RELEVANT_CLAIM” attribute in the TLS215_CITN_CATEG table
A new attribute “RELEVANT_CLAIM” has been added to the TLS215_CITN_CATEG table. It contains a single number identifying the claim to which the citation refers.

Regarding the first change, the table below shows, for offices with more than 100,000 person_ids, how they are affected; the last column shows the inflation rate.



APPLN_AUTH Count_person_id_2018b Count_person_id_2019a Inflation_rate
AR 150688 182005 121%
AT 1142798 1232406 108%
AU 1870980 2008555 107%
BE 196598 204115 104%
BR 1223470 1315337 108%
CA 2524565 2675965 106%
CH 650912 664626 102%
CN 6192040 9956730 161%
CS 166131 178125 107%
CZ 130967 233777 179%
DD 226783 234141 103%
DE 5095800 5258606 103%
DK 624763 687156 110%
EP 6624815 6625117 100.005%
ES 1291041 1389533 108%
FI 365453 384041 105%
FR 1471641 1588271 108%
GB 1757273 1797304 102%
GR 134008 173190 129%
HK 300604 426287 142%
HU 234866 303144 129%
IL 128159 167805 131%
IN 190763 213962 112%
IT 661777 719438 109%
JP 2802730 3787058 135%
KR 1906273 3031637 159%
MX 524748 620385 118%
MY 121991 158129 130%
NL 144408 155950 108%
NO 370993 444431 120%
NZ 261084 329085 126%
PL 370207 492642 133%
PT 218054 249693 115%
RO 131598 152849 116%
RU 979629 1831707 187%
SE 406973 420885 103%
SG 277368 363149 131%
SU 1536615 1718307 112%
TW 1004135 1143685 114%
UA 164487 339271 206%
US 12270415 12630422 103%
WO 4119901 4756241 115%
ZA 511180 621550 122%


Regarding the latter, the number of records in TLS215 has more than doubled, since one line per relevant claim is created for each category (categories can now be multiple, e.g. AD, AX ...).
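The multiplication is easy to see with a query like the sketch below, which counts, per cited reference, how many TLS215 rows it now produces and how many distinct categories and claims they span. Connection details are placeholders; check the column names against your edition's documentation.

# Sketch: row inflation in TLS215 -- one row per category x relevant claim.
# Connection details are placeholders; verify column names in your edition.
import mysql.connector

QUERY = """
SELECT pat_publn_id, citn_id,
       COUNT(*)                       AS n_rows,
       COUNT(DISTINCT citn_categ)     AS n_categories,
       COUNT(DISTINCT relevant_claim) AS n_claims
FROM   tls215_citn_categ
GROUP  BY pat_publn_id, citn_id
ORDER  BY n_rows DESC
LIMIT  10
"""

conn = mysql.connector.connect(user="patstat_user", password="secret",
                               host="localhost", database="patstat2019a")
cur = conn.cursor()
cur.execute(QUERY)
for row in cur.fetchall():
    print(row)
conn.close()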

Sunday, May 19, 2019

PATSTAT 2019a MySQL upload scripts


My new MySQL scripts for uploading PATSTAT 2019a data are available at this link.

New features in this edition:


Person names in the original language (PERSON_NAME_ORIG_LG)
This field has been added to tables TLS206, 226 and 906. It inflates the number of records and breaks backward compatibility with old person-wise data.

“RELEVANT_CLAIM” attribute in the TLS215_CITN_CATEG table
A new attribute “RELEVANT_CLAIM” has been added to the TLS215_CITN_CATEG table. It contains a single number identifying the claim to which the citation refers.

Wednesday, May 8, 2019

Web scraping: downloading the ANVUR list of journals in Python


Recently I had the task of creating a dataset of the scientific journals classified by ANVUR (the Italian agency for research evaluation).
Unfortunately the lists are split by research area and available only as PDFs at this URL:

http://www.anvur.it/attivita/classificazione-delle-riviste/classificazione-delle-riviste-ai-fini-dellabilitazione-scientifica-nazionale/elenchi-di-riviste-scientifiche-e-di-classe-a/

To make my life easier I created a Python 3 script that downloads all the PDFs and, via the Tabula library, transforms the PDF tables into CSVs.

The script is below (note that the URL is hardcoded; change it for future use).

The CSVs still need some work due to multiline titles.
To make life easier I made two XLS files, one with 'A' journals and one with all journals, for Areas 11, 12, 13 and 14.

XLS files can be downloaded here.




# python 3.5

# pdf downloader and table extractor

import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
import PyPDF2
from tabula import read_pdf
import pandas as pd

if __name__ == "__main__":

    istable = input('Pdf are tables?[N]') or 'N'
    dnl = input('download PDFs?[Y]') or 'Y'

    if dnl == "Y":
        archive_url = "http://www.anvur.it/attivita/classificazione-delle-riviste/classificazione-delle-riviste-ai-fini-dellabilitazione-scientifica-nazionale/elenchi-di-riviste-scientifiche-e-di-classe-a/"
        response = requests.get(archive_url)

        soup = BeautifulSoup(response.text, 'html.parser')

        # keep only anchors that actually carry a href ending in 'pdf'
        pdf_links = [link['href'] for link in soup.find_all('a', href=True)
                     if link['href'].endswith('pdf')]

        # iterate through all links and download them one by one
        for link in pdf_links:

            # resolve relative links against the archive page URL
            link = urljoin(archive_url, link)

            # obtain filename by splitting url and getting last string
            file_name = link.split('/')[-1]

            print("Downloading file: %s" % file_name)

            # stream the download to avoid loading whole PDFs in memory
            r = requests.get(link, stream=True)

            with open(file_name, 'wb') as f:
                for chunk in r.iter_content(chunk_size=1024 * 1024):
                    if chunk:
                        f.write(chunk)

            print("%s downloaded!\n" % file_name)

        print("All files downloaded!")

    pdfDir = ""
    txtDir = ""

    if pdfDir == "": pdfDir = os.getcwd()  # if no pdfDir passed in
    if txtDir == "": txtDir = os.getcwd()  # if no txtDir passed in

    for pdf_to_read in os.listdir(pdfDir):  # iterate through pdfs in pdf directory
        fileExtension = pdf_to_read.split(".")[-1]  # -1 always takes the last part
        if fileExtension == "pdf":
            pdfFilename = os.path.join(pdfDir, pdf_to_read)

            if istable == 'N':
                # plain text extraction, page by page
                # (PyPDF2 >= 3.0 renamed these to PdfReader / extract_text)
                textFilename = os.path.join(txtDir, pdf_to_read + ".txt")
                with open(textFilename, "a") as textFile:
                    pdf = PyPDF2.PdfFileReader(open(pdfFilename, "rb"))
                    for page in pdf.pages:
                        textFile.write(page.extractText())  # write text to text file
            else:
                # table extraction via tabula; multiple_tables=True always
                # returns a list of DataFrames, concatenated before saving
                csvFilename = os.path.join(txtDir, pdf_to_read + ".csv")
                dfs = read_pdf(pdfFilename, pages="all", multiple_tables=True)
                pd.concat(dfs).to_csv(csvFilename, index=False)

Monday, May 6, 2019

PATSTAT data coverage

When doing advanced statistical analysis, it is important to understand the coverage and content of the data you are working with. PATSTAT Global contains data coming from all over the world; quality, timeliness and completeness vary a great deal depending on the patent office.


The EPO compiled a Tableau dashboard that maps the content and coverage of PATSTAT Global.


This chart shows the percentage of patent applications having a certain data element (e.g. CPC classification) by patent authority and application year in PATSTAT Global.