Thursday, July 18, 2019

Patents citations from Pubmed Scientific publications

A recently released dataset from Marx, Matt; Aaron Fuegi contains citations from USPTO patents granted 1947-2018 to articles captured by the MS Academic Graph (ID) from 1800-2018.

Files, tab-separated, are available at link :

The main file, pcs.tsv, contains the resolved citations matching patent number, MAG ID, the original citation from the patent, an indicator for whether the citation was supplied by the applicant, examiner, or unknown, and a confidence score (1-10) indicating how likely this match is correct.

There is also a PubMed-specific match in pcs-pubmed.tsv.

Authors also made availabel source code for generating the patent citations to science in pcs.tsv is available at Source code for generating and (Journal Impact Factor and Journal Commercial Impact Factor) is at

Scripts and programs are mainly for stata & linux (DO files and sh scripts).


Monday, July 8, 2019

PATSTAT - patentsview concordance update 2019

PatentsView  is a platform built on data derived from USPTO bulk data files.

This dataset complements perfectly PATSTAT since the former has an native disambiguation of inventors and applicants and a geocoding system applied to inventors and applicants, while the latter links US data to other offices, allowing to calculate knowledge spillovers, family data etc.

At this link is possible to download a table of concordance between patent_id (Patentsview main key) and appln_id (from PATSTAT).

Overlapping of the two datasets is not perfect since Patentsview contains only granted patents after 1975, where PATSTAT has also application (ungranted) and timeframe covers also pre-1975 data.
On the other hand PATSTAT misses design patents before 2001,  plants before 2001 and 'statutory invention registration' type of patent.

Data are from PATSTAT spring 2019 and Patentsview march 2019, thus also 2019 data in PW are partially missing in PATSTAT.

Thursday, June 6, 2019

Patent familiarity calc scripts

I made available on Github a set of MySQL and python scripts to create familiarity indicator by IPC class, on NBER patents dataset.

Inventor’s familiarity with components of the invention measured by the (a) recent and (b) frequent usage of focal patent’s classes across all US patents. Thus, we calculate a measure of familiarity for each separate class of a focal patent. Therefore, the more recently and frequently a class has been used, its individual measure will be higher.

Based on Fleming 2001

Wednesday, May 22, 2019

PATSTAT 2019a analysis of changes

As Stated in EPO documents new version of PATSTAT comes with 2 mayor changes:

Person names in the original language (PERSON_NAME_ORIG_LG)
Field added in tables TLS206, 226, 906. This creates an inflation of records and lack of retrocompatibility with old personwise data.

“RELEVANT_CLAIM” attribute in the TLS215_CITN_CATEG table
A new attribute “RELEVANT_CLAIM” has been added to the TLS215_CITN_CATEG table. This attribute contains a single number referring to the claim to which the citation refers to.

About first change you can find below, for offices with more than 100.000 person_ids, how are they  most affected, where the last column shows inflation rate

APPLN_AUTH Count_person_id_2018b Count_person_id_2019a ratio
AR 150688 182005 121%
AT 1142798 1232406 108%
AU 1870980 2008555 107%
BE 196598 204115 104%
BR 1223470 1315337 108%
CA 2524565 2675965 106%
CH 650912 664626 102%
CN 6192040 9956730 161%
CS 166131 178125 107%
CZ 130967 233777 179%
DD 226783 234141 103%
DE 5095800 5258606 103%
DK 624763 687156 110%
EP 6624815 6625117 100.005%
ES 1291041 1389533 108%
FI 365453 384041 105%
FR 1471641 1588271 108%
GB 1757273 1797304 102%
GR 134008 173190 129%
HK 300604 426287 142%
HU 234866 303144 129%
IL 128159 167805 131%
IN 190763 213962 112%
IT 661777 719438 109%
JP 2802730 3787058 135%
KR 1906273 3031637 159%
MX 524748 620385 118%
MY 121991 158129 130%
NL 144408 155950 108%
NO 370993 444431 120%
NZ 261084 329085 126%
PL 370207 492642 133%
PT 218054 249693 115%
RO 131598 152849 116%
RU 979629 1831707 187%
SE 406973 420885 103%
SG 277368 363149 131%
SU 1536615 1718307 112%
TW 1004135 1143685 114%
UA 164487 339271 206%
US 12270415 12630422 103%
WO 4119901 4756241 115%
ZA 511180 621550 122%

About the latter, number of records in TLS215 are more than doubled since for each category (now categories can be multiple as AD, AX ...) one line for relevant claim is created.

Sunday, May 19, 2019

PATSTAT 2019a MySQL upload scripts

At this link are available my new scripts for MySQL to upload PATSTAT 2019a data.

New features in this ediction:

Person names in the original language (PERSON_NAME_ORIG_LG)
Field added in tables TLS206, 226, 906. This creates an inflation of records and lack of retrocompatibility with old personwise data.

“RELEVANT_CLAIM” attribute in the TLS215_CITN_CATEG table
A new attribute “RELEVANT_CLAIM” has been added to the TLS215_CITN_CATEG table. This attribute contains a single number referring to the claim to which the citation refers to.

Wednesday, May 8, 2019

webscraping: download of ANVUR list of journals in Python

Recently I had the task of creating a dataset of scientific journal classified by ANVUR (italian agency for research rating);
Unfortunately the lists are splitted by research area and available only in PDF at URL

In order to make my life easier I created a Python 3 script that downloads all PDFs and via Tabula library, transforms PDF tables into CSVs.

I put the script below (note URL is hardcoded, for future uses change it)

Still CSVs need some work due to multiline titles.
To make life easie I made 2 xls files with A journals and all journals, for Areas 11 12 13 and 14.

XLS files can be downloaded here.

# python 35

# pdf downloader code extractor

from bs4 import BeautifulSoup
import requests
import time
import codecs
import PyPDF2
import os
from tabula import read_pdf
import pandas as pd

if __name__ == "__main__":

    istable = input('Pdf are tables?[N]') or 'N'
    dnl = input('download PDFs?[Y]') or 'Y'

    if dnl=="Y":
        archive_url = ""
        response = requests.get(archive_url)

        soup = BeautifulSoup(response.text, 'html.parser')

        pdf_links = [link['href'] for link in soup.find_all('a') if link['href'].endswith('pdf')]

        for link in pdf_links:

            if link[:4]!='http':
                link = archive_url + link

            '''iterate through all links in and download them one by one'''

            # obtain filename by splitting url and getting
            # last string
            file_name = link.split('/')[-1]

            print ("Downloading file:%s" % file_name)

            # create response object
            r = requests.get(link, stream=True)

            # download started
            with open(file_name, 'wb') as f:
                for chunk in r.iter_content(chunk_size=1024 * 1024):
                    if chunk:

            print ("%s downloaded!\n" % file_name)

        print ("All file downloaded!")

    pdfDir = ""
    txtDir = ""

    if pdfDir == "": pdfDir = os.getcwd() + "\\"  # if no pdfDir passed in
    if txtDir == "": txtDir = os.getcwd() + "\\"  # if no txtDir passed in

    for pdf_to_read in os.listdir(pdfDir):  # iterate through pdfs in pdf directory
        fileExtension = pdf_to_read.split(".")[-1]  # -1 takes always last part
        if fileExtension == "pdf":
            pdfFilename = pdfDir + pdf_to_read
            textFilename = txtDir + pdf_to_read + ".txt"
            textFile = open(textFilename, "a")  # make text file

            if istable == 'N':

                pdf = PyPDF2.PdfFileReader(open(pdfFilename, "rb"))
                for page in pdf.pages:
                    textFile.write(page.extractText())  # write text to text file

                df= read_pdf(pdfFilename, pages="all")


Monday, May 6, 2019

PATSTAT data coverage

When doing advanced statistical analysis, it is important to understand the coverage and content of the data you are working with. PATSTAT Global contains data coming from all over the world. Quality, timeliness and completeness vary a great deal depending on patent office.

EPO compiled a Tableau  dashboard that maps the content and coverage of PATSTAT Global.

This chart shows the percentage of patent applications having a certain data element (e.g. CPC classification) by patent authority and application year in PATSTAT Global.