Wednesday, May 22, 2019

PATSTAT 2019a analysis of changes

As stated in the EPO documents, the new version of PATSTAT comes with two major changes:

Person names in the original language (PERSON_NAME_ORIG_LG)
A field added to tables TLS206, TLS226 and TLS906. This inflates the number of records and breaks backward compatibility with old person-level data.

“RELEVANT_CLAIM” attribute in the TLS215_CITN_CATEG table
A new attribute, “RELEVANT_CLAIM”, has been added to the TLS215_CITN_CATEG table. It contains a single number identifying the claim to which the citation refers.

Regarding the first change, the table below shows, for offices with more than 100,000 person_ids, how strongly each is affected; the last column shows the inflation rate (a query sketch follows the table).



APPLN_AUTH person_ids_2018b person_ids_2019a ratio (2019a/2018b)
AR 150688 182005 121%
AT 1142798 1232406 108%
AU 1870980 2008555 107%
BE 196598 204115 104%
BR 1223470 1315337 108%
CA 2524565 2675965 106%
CH 650912 664626 102%
CN 6192040 9956730 161%
CS 166131 178125 107%
CZ 130967 233777 179%
DD 226783 234141 103%
DE 5095800 5258606 103%
DK 624763 687156 110%
EP 6624815 6625117 100.005%
ES 1291041 1389533 108%
FI 365453 384041 105%
FR 1471641 1588271 108%
GB 1757273 1797304 102%
GR 134008 173190 129%
HK 300604 426287 142%
HU 234866 303144 129%
IL 128159 167805 131%
IN 190763 213962 112%
IT 661777 719438 109%
JP 2802730 3787058 135%
KR 1906273 3031637 159%
MX 524748 620385 118%
MY 121991 158129 130%
NL 144408 155950 108%
NO 370993 444431 120%
NZ 261084 329085 126%
PL 370207 492642 133%
PT 218054 249693 115%
RO 131598 152849 116%
RU 979629 1831707 187%
SE 406973 420885 103%
SG 277368 363149 131%
SU 1536615 1718307 112%
TW 1004135 1143685 114%
UA 164487 339271 206%
US 12270415 12630422 103%
WO 4119901 4756241 115%
ZA 511180 621550 122%
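
For reference, counts like those above can be obtained by grouping distinct person_ids by filing authority. Below is a minimal sketch of such a query, to be run once per edition (then take the ratio of the two results); the connection string is a placeholder, and I am assuming the standard TLS201_APPLN and TLS207_PERS_APPLN columns.

# sketch: distinct person_ids per filing authority, for one PATSTAT edition
# (run against both the 2018b and the 2019a database, then take the ratio)
import pandas as pd
from sqlalchemy import create_engine

# placeholder connection string: adapt user, password and schema name
engine = create_engine("mysql+pymysql://user:password@localhost/patstat2019a")

query = """
SELECT a.appln_auth,
       COUNT(DISTINCT p.person_id) AS n_person_ids
FROM tls201_appln a
JOIN tls207_pers_appln p ON p.appln_id = a.appln_id
GROUP BY a.appln_auth
HAVING COUNT(DISTINCT p.person_id) > 100000
ORDER BY a.appln_auth;
"""

print(pd.read_sql(query, engine))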


Regarding the second change, the number of records in TLS215 has more than doubled, since one line is created for each combination of category (categories can now be multiple, e.g. AD, AX, ...) and relevant claim.
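
To see this duplication at work, you can count how many TLS215 lines each citation now has. Below is a minimal sketch, assuming the standard TLS215_CITN_CATEG columns; the connection string is again a placeholder.

# sketch: lines per citation in TLS215 after the 2019a change
# (one row per category x relevant claim combination)
import pandas as pd
from sqlalchemy import create_engine

# placeholder connection string: adapt user, password and schema name
engine = create_engine("mysql+pymysql://user:password@localhost/patstat2019a")

query = """
SELECT pat_publn_id, citn_id, citn_origin,
       COUNT(*)                       AS n_lines,
       COUNT(DISTINCT citn_categ)     AS n_categories,
       COUNT(DISTINCT relevant_claim) AS n_claims
FROM tls215_citn_categ
GROUP BY pat_publn_id, citn_id, citn_origin
ORDER BY n_lines DESC
LIMIT 20;
"""

print(pd.read_sql(query, engine))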

Sunday, May 19, 2019

PATSTAT 2019a MySQL upload scripts


My new scripts for uploading PATSTAT 2019a data into MySQL are available at this link.

New features in this edition:


Person names in the original language (PERSON_NAME_ORIG_LG)
A field added to tables TLS206, TLS226 and TLS906. This inflates the number of records and breaks backward compatibility with old person-level data.

“RELEVANT_CLAIM” attribute in the TLS215_CITN_CATEG table
A new attribute, “RELEVANT_CLAIM”, has been added to the TLS215_CITN_CATEG table. It contains a single number identifying the claim to which the citation refers.

Wednesday, May 8, 2019

Web scraping: downloading the ANVUR list of journals in Python


Recently I had the task of creating a dataset of scientific journals classified by ANVUR (the Italian agency for research evaluation).
Unfortunately, the lists are split by research area and available only as PDFs at this URL:

http://www.anvur.it/attivita/classificazione-delle-riviste/classificazione-delle-riviste-ai-fini-dellabilitazione-scientifica-nazionale/elenchi-di-riviste-scientifiche-e-di-classe-a/

To make my life easier, I created a Python 3 script that downloads all the PDFs and, via the tabula library, transforms the PDF tables into CSVs.

The script is below (note that the URL is hardcoded; change it for future use).

The CSVs still need some work due to multiline titles (a cleanup sketch follows the script below).
To make life easier, I made two XLS files, one with class A journals and one with all journals, for Areas 11, 12, 13 and 14.

XLS files can be downloaded here.




# Python 3.5

# PDF downloader and table extractor

from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests
import PyPDF2
import os
from tabula import read_pdf

if __name__ == "__main__":


    istable = input('Are the PDFs tables? [N] ') or 'N'
    dnl = input('Download the PDFs? [Y] ') or 'Y'

    if dnl == "Y":
        archive_url = "http://www.anvur.it/attivita/classificazione-delle-riviste/classificazione-delle-riviste-ai-fini-dellabilitazione-scientifica-nazionale/elenchi-di-riviste-scientifiche-e-di-classe-a/"
        response = requests.get(archive_url)

        soup = BeautifulSoup(response.text, 'html.parser')

        # collect the href of every <a> tag that points to a PDF
        pdf_links = [link.get('href', '') for link in soup.find_all('a')
                     if link.get('href', '').endswith('pdf')]


        for link in pdf_links:

            # resolve relative links against the page URL
            if link[:4] != 'http':
                link = urljoin(archive_url, link)

            # download each linked PDF one by one

            # obtain filename by splitting url and getting
            # last string
            file_name = link.split('/')[-1]

            print ("Downloading file:%s" % file_name)

            # create response object
            r = requests.get(link, stream=True)

            # download started
            with open(file_name, 'wb') as f:
                for chunk in r.iter_content(chunk_size=1024 * 1024):
                    if chunk:
                        f.write(chunk)

            print ("%s downloaded!\n" % file_name)

        print ("All file downloaded!")

    pdfDir = ""
    txtDir = ""

    if pdfDir == "": pdfDir = os.getcwd() + "\\"  # if no pdfDir passed in
    if txtDir == "": txtDir = os.getcwd() + "\\"  # if no txtDir passed in

    for pdf_to_read in os.listdir(pdfDir):  # iterate through the PDFs in the pdf directory
        fileExtension = pdf_to_read.split(".")[-1]  # -1 always takes the last part
        if fileExtension == "pdf":
            pdfFilename = pdfDir + pdf_to_read
            textFilename = txtDir + pdf_to_read + ".txt"

            if istable == 'N':
                # plain-text extraction, page by page
                with open(textFilename, "a") as textFile, open(pdfFilename, "rb") as pdfFile:
                    pdf = PyPDF2.PdfFileReader(pdfFile)
                    for page in pdf.pages:
                        textFile.write(page.extractText())  # write the page text to the text file
            else:
                # table extraction; note that newer tabula-py versions return a
                # list of DataFrames here instead of a single DataFrame
                df = read_pdf(pdfFilename, pages="all")
                df.to_csv(textFilename)
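
As mentioned above, the resulting CSVs contain multiline journal titles. Below is a minimal cleanup sketch; it assumes a CSV whose first column is an ID that is empty on continuation rows and whose second column is the title, which may not match every area's layout (the file names are placeholders).

# sketch: merge continuation rows (empty first column) into the previous row's title
import pandas as pd

df = pd.read_csv("area13.csv", dtype=str)          # placeholder file name
id_col, title_col = df.columns[0], df.columns[1]   # assumed layout: ID, TITLE, ...

rows = []
for _, row in df.iterrows():
    if pd.isna(row[id_col]) and rows:
        # continuation line: glue its text onto the previous title
        if pd.notna(row[title_col]):
            rows[-1][title_col] = rows[-1][title_col] + " " + row[title_col]
    else:
        rows.append(row.copy())

pd.DataFrame(rows).to_csv("area13_clean.csv", index=False)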

Monday, May 6, 2019

PATSTAT data coverage

When doing advanced statistical analysis, it is important to understand the coverage and content of the data you are working with. PATSTAT Global contains data from all over the world, and quality, timeliness and completeness vary a great deal depending on the patent office.


The EPO compiled a Tableau dashboard that maps the content and coverage of PATSTAT Global.


This chart shows the percentage of patent applications having a certain data element (e.g. CPC classification) by patent authority and application year in PATSTAT Global.
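
If you want to reproduce this kind of coverage figure on a local PATSTAT database, below is a minimal sketch for one data element (CPC classification, assuming it is stored in TLS224_APPLN_CPC as in the standard schema); the connection string is a placeholder.

# sketch: share of applications with at least one CPC symbol,
# by filing authority and filing year
import pandas as pd
from sqlalchemy import create_engine

# placeholder connection string: adapt user, password and schema name
engine = create_engine("mysql+pymysql://user:password@localhost/patstat2019a")

query = """
SELECT a.appln_auth,
       a.appln_filing_year,
       AVG(c.appln_id IS NOT NULL) AS share_with_cpc
FROM tls201_appln a
LEFT JOIN (SELECT DISTINCT appln_id FROM tls224_appln_cpc) c
       ON c.appln_id = a.appln_id
GROUP BY a.appln_auth, a.appln_filing_year
ORDER BY a.appln_auth, a.appln_filing_year;
"""

coverage = pd.read_sql(query, engine)
print(coverage.head())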