Wednesday, May 22, 2019

PATSTAT 2019a analysis of changes

As stated in the EPO documents, the new version of PATSTAT comes with two major changes:

Person names in the original language (PERSON_NAME_ORIG_LG)
A field added to tables TLS206, TLS226 and TLS906. This inflates the number of records and breaks backward compatibility with old person-level data.

“RELEVANT_CLAIM” attribute in the TLS215_CITN_CATEG table
A new attribute, “RELEVANT_CLAIM”, has been added to the TLS215_CITN_CATEG table. It contains a single number identifying the claim to which the citation refers.

Regarding the first change, the table below shows, for offices with more than 100,000 person_ids, how strongly each is affected; the last column shows the inflation rate (a query sketch follows the table).



APPLN_AUTH person_ids_2018b person_ids_2019a ratio (2019a/2018b)
AR 150688 182005 121%
AT 1142798 1232406 108%
AU 1870980 2008555 107%
BE 196598 204115 104%
BR 1223470 1315337 108%
CA 2524565 2675965 106%
CH 650912 664626 102%
CN 6192040 9956730 161%
CS 166131 178125 107%
CZ 130967 233777 179%
DD 226783 234141 103%
DE 5095800 5258606 103%
DK 624763 687156 110%
EP 6624815 6625117 100.005%
ES 1291041 1389533 108%
FI 365453 384041 105%
FR 1471641 1588271 108%
GB 1757273 1797304 102%
GR 134008 173190 129%
HK 300604 426287 142%
HU 234866 303144 129%
IL 128159 167805 131%
IN 190763 213962 112%
IT 661777 719438 109%
JP 2802730 3787058 135%
KR 1906273 3031637 159%
MX 524748 620385 118%
MY 121991 158129 130%
NL 144408 155950 108%
NO 370993 444431 120%
NZ 261084 329085 126%
PL 370207 492642 133%
PT 218054 249693 115%
RO 131598 152849 116%
RU 979629 1831707 187%
SE 406973 420885 103%
SG 277368 363149 131%
SU 1536615 1718307 112%
TW 1004135 1143685 114%
UA 164487 339271 206%
US 12270415 12630422 103%
WO 4119901 4756241 115%
ZA 511180 621550 122%
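
For reference, counts like those above can be obtained by grouping distinct person_ids by filing authority. Below is a minimal sketch of such a query, to be run once per edition (then take the ratio of the two results); the connection string is a placeholder, and I am assuming the standard TLS201_APPLN and TLS207_PERS_APPLN columns.

# sketch: distinct person_ids per filing authority, for one PATSTAT edition
# (run against both the 2018b and the 2019a database, then take the ratio)
import pandas as pd
from sqlalchemy import create_engine

# placeholder connection string: adapt user, password and schema name
engine = create_engine("mysql+pymysql://user:password@localhost/patstat2019a")

query = """
SELECT a.appln_auth,
       COUNT(DISTINCT p.person_id) AS n_person_ids
FROM tls201_appln a
JOIN tls207_pers_appln p ON p.appln_id = a.appln_id
GROUP BY a.appln_auth
HAVING COUNT(DISTINCT p.person_id) > 100000
ORDER BY a.appln_auth;
"""

print(pd.read_sql(query, engine))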


Regarding the second change, the number of records in TLS215 has more than doubled, since one line is created for each combination of category (categories can now be multiple, e.g. AD, AX, ...) and relevant claim.
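
To see this duplication at work, you can count how many TLS215 lines each citation now has. Below is a minimal sketch, assuming the standard TLS215_CITN_CATEG columns; the connection string is again a placeholder.

# sketch: lines per citation in TLS215 after the 2019a change
# (one row per category x relevant claim combination)
import pandas as pd
from sqlalchemy import create_engine

# placeholder connection string: adapt user, password and schema name
engine = create_engine("mysql+pymysql://user:password@localhost/patstat2019a")

query = """
SELECT pat_publn_id, citn_id, citn_origin,
       COUNT(*)                       AS n_lines,
       COUNT(DISTINCT citn_categ)     AS n_categories,
       COUNT(DISTINCT relevant_claim) AS n_claims
FROM tls215_citn_categ
GROUP BY pat_publn_id, citn_id, citn_origin
ORDER BY n_lines DESC
LIMIT 20;
"""

print(pd.read_sql(query, engine))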

Sunday, May 19, 2019

PATSTAT 2019a MySQL upload scripts


My new scripts for uploading PATSTAT 2019a data into MySQL are available at this link.

New features in this edition:


Person names in the original language (PERSON_NAME_ORIG_LG)
A field added to tables TLS206, TLS226 and TLS906. This inflates the number of records and breaks backward compatibility with old person-level data.

“RELEVANT_CLAIM” attribute in the TLS215_CITN_CATEG table
A new attribute, “RELEVANT_CLAIM”, has been added to the TLS215_CITN_CATEG table. It contains a single number identifying the claim to which the citation refers.

Wednesday, May 8, 2019

Web scraping: downloading the ANVUR list of journals in Python


Recently I had the task of creating a dataset of scientific journals classified by ANVUR (the Italian agency for research evaluation).
Unfortunately, the lists are split by research area and available only as PDFs at this URL:

http://www.anvur.it/attivita/classificazione-delle-riviste/classificazione-delle-riviste-ai-fini-dellabilitazione-scientifica-nazionale/elenchi-di-riviste-scientifiche-e-di-classe-a/

To make my life easier, I created a Python 3 script that downloads all the PDFs and, via the tabula library, transforms the PDF tables into CSVs.

The script is below (note that the URL is hardcoded; change it for future use).

The CSVs still need some work due to multiline titles (a cleanup sketch follows the script below).
To make life easier, I made two XLS files, one with class A journals and one with all journals, for Areas 11, 12, 13 and 14.

XLS files can be downloaded here.




# Python 3.5

# PDF downloader and table extractor

from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests
import PyPDF2
import os
from tabula import read_pdf

if __name__ == "__main__":


    istable = input('Are the PDFs tables? [N] ') or 'N'
    dnl = input('Download the PDFs? [Y] ') or 'Y'

    if dnl == "Y":
        archive_url = "http://www.anvur.it/attivita/classificazione-delle-riviste/classificazione-delle-riviste-ai-fini-dellabilitazione-scientifica-nazionale/elenchi-di-riviste-scientifiche-e-di-classe-a/"
        response = requests.get(archive_url)

        soup = BeautifulSoup(response.text, 'html.parser')

        # collect the href of every <a> tag that points to a PDF
        pdf_links = [link.get('href', '') for link in soup.find_all('a')
                     if link.get('href', '').endswith('pdf')]


        for link in pdf_links:

            # resolve relative links against the page URL
            if link[:4] != 'http':
                link = urljoin(archive_url, link)

            # download each linked PDF one by one

            # obtain filename by splitting url and getting
            # last string
            file_name = link.split('/')[-1]

            print ("Downloading file:%s" % file_name)

            # create response object
            r = requests.get(link, stream=True)

            # download started
            with open(file_name, 'wb') as f:
                for chunk in r.iter_content(chunk_size=1024 * 1024):
                    if chunk:
                        f.write(chunk)

            print ("%s downloaded!\n" % file_name)

        print ("All file downloaded!")

    pdfDir = ""
    txtDir = ""

    if pdfDir == "": pdfDir = os.getcwd() + "\\"  # if no pdfDir passed in
    if txtDir == "": txtDir = os.getcwd() + "\\"  # if no txtDir passed in

    for pdf_to_read in os.listdir(pdfDir):  # iterate through the PDFs in the pdf directory
        fileExtension = pdf_to_read.split(".")[-1]  # -1 always takes the last part
        if fileExtension == "pdf":
            pdfFilename = pdfDir + pdf_to_read
            textFilename = txtDir + pdf_to_read + ".txt"

            if istable == 'N':
                # plain-text extraction, page by page
                with open(textFilename, "a") as textFile, open(pdfFilename, "rb") as pdfFile:
                    pdf = PyPDF2.PdfFileReader(pdfFile)
                    for page in pdf.pages:
                        textFile.write(page.extractText())  # write the page text to the text file
            else:
                # table extraction; note that newer tabula-py versions return a
                # list of DataFrames here instead of a single DataFrame
                df = read_pdf(pdfFilename, pages="all")
                df.to_csv(textFilename)
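
As mentioned above, the resulting CSVs contain multiline journal titles. Below is a minimal cleanup sketch; it assumes a CSV whose first column is an ID that is empty on continuation rows and whose second column is the title, which may not match every area's layout (the file names are placeholders).

# sketch: merge continuation rows (empty first column) into the previous row's title
import pandas as pd

df = pd.read_csv("area13.csv", dtype=str)          # placeholder file name
id_col, title_col = df.columns[0], df.columns[1]   # assumed layout: ID, TITLE, ...

rows = []
for _, row in df.iterrows():
    if pd.isna(row[id_col]) and rows:
        # continuation line: glue its text onto the previous title
        if pd.notna(row[title_col]):
            rows[-1][title_col] = rows[-1][title_col] + " " + row[title_col]
    else:
        rows.append(row.copy())

pd.DataFrame(rows).to_csv("area13_clean.csv", index=False)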

Monday, May 6, 2019

PATSTAT data coverage

When doing advanced statistical analysis, it is important to understand the coverage and content of the data you are working with. PATSTAT Global contains data from all over the world, and quality, timeliness and completeness vary a great deal depending on the patent office.


The EPO compiled a Tableau dashboard that maps the content and coverage of PATSTAT Global.


This chart shows the percentage of patent applications having a certain data element (e.g. CPC classification) by patent authority and application year in PATSTAT Global.
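
If you want to reproduce this kind of coverage figure on a local PATSTAT database, below is a minimal sketch for one data element (CPC classification, assuming it is stored in TLS224_APPLN_CPC as in the standard schema); the connection string is a placeholder.

# sketch: share of applications with at least one CPC symbol,
# by filing authority and filing year
import pandas as pd
from sqlalchemy import create_engine

# placeholder connection string: adapt user, password and schema name
engine = create_engine("mysql+pymysql://user:password@localhost/patstat2019a")

query = """
SELECT a.appln_auth,
       a.appln_filing_year,
       AVG(c.appln_id IS NOT NULL) AS share_with_cpc
FROM tls201_appln a
LEFT JOIN (SELECT DISTINCT appln_id FROM tls224_appln_cpc) c
       ON c.appln_id = a.appln_id
GROUP BY a.appln_auth, a.appln_filing_year
ORDER BY a.appln_auth, a.appln_filing_year;
"""

coverage = pd.read_sql(query, engine)
print(coverage.head())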