Sunday, May 19, 2019

PATSTAT 2019a MySQL upload scripts


At this link you can download my new MySQL scripts for uploading PATSTAT 2019a data.

New features in this edition:


Person names in the original language (PERSON_NAME_ORIG_LG)
This field has been added to tables TLS206, TLS226 and TLS906. It inflates the number of records and breaks backward compatibility with old person-level data.

“RELEVANT_CLAIM” attribute in the TLS215_CITN_CATEG table
A new attribute, “RELEVANT_CLAIM”, has been added to the TLS215_CITN_CATEG table. It contains a single number identifying the claim to which the citation refers.

Wednesday, May 8, 2019

webscraping: download of ANVUR list of journals in Python


Recently I had the task of creating a dataset of scientific journals classified by ANVUR (the Italian agency for research evaluation).
Unfortunately the lists are split by research area and available only as PDFs at this URL:

http://www.anvur.it/attivita/classificazione-delle-riviste/classificazione-delle-riviste-ai-fini-dellabilitazione-scientifica-nazionale/elenchi-di-riviste-scientifiche-e-di-classe-a/

To make my life easier I created a Python 3 script that downloads all the PDFs and, via the tabula-py library, transforms the PDF tables into CSVs.

The script is below (note that the URL is hardcoded; change it for future use).

The CSVs still need some work because of multiline journal titles.
To make life easier I prepared two XLS files, one with class A journals and one with all journals, for Areas 11, 12, 13 and 14.

XLS files can be downloaded here.
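The multiline-title cleanup can also be scripted. Below is a minimal pandas sketch, assuming a hypothetical layout where a continuation row (created when Tabula splits a multiline PDF cell) has an empty key column, e.g. ISSN; the column names and sample data are invented for illustration:

```python
import pandas as pd

def merge_multiline_titles(df, key_col, title_col):
    """Merge rows whose key column is empty into the previous row's title.

    Assumes a row with a missing key is the continuation of the journal
    title started on the row above.
    """
    rows = []
    for _, row in df.iterrows():
        if pd.isna(row[key_col]) and rows:
            # continuation line: append its text to the previous title
            rows[-1][title_col] = (str(rows[-1][title_col]) + " "
                                   + str(row[title_col])).strip()
        else:
            rows.append(row.copy())
    return pd.DataFrame(rows).reset_index(drop=True)

# toy example with hypothetical columns "ISSN" and "TITOLO"
raw = pd.DataFrame({
    "ISSN": ["0001-0001", None, "0002-0002"],
    "TITOLO": ["Rivista di studi", "storici e filologici", "Altra rivista"],
})
clean = merge_multiline_titles(raw, "ISSN", "TITOLO")
print(clean["TITOLO"].tolist())
# → ['Rivista di studi storici e filologici', 'Altra rivista']
```

The real ANVUR tables may split rows differently, so check which column is reliably empty on continuation lines before applying this.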




# Python 3.5+

# PDF downloader and table/text extractor

import os

from bs4 import BeautifulSoup
import requests
import PyPDF2
from tabula import read_pdf
import pandas as pd

if __name__ == "__main__":

    istable = input('Are the PDFs tables? [N] ') or 'N'
    dnl = input('Download PDFs? [Y] ') or 'Y'

    if dnl == "Y":
        archive_url = "http://www.anvur.it/attivita/classificazione-delle-riviste/classificazione-delle-riviste-ai-fini-dellabilitazione-scientifica-nazionale/elenchi-di-riviste-scientifiche-e-di-classe-a/"
        response = requests.get(archive_url)

        soup = BeautifulSoup(response.text, 'html.parser')

        # collect all anchors that point to PDF files
        pdf_links = [link['href'] for link in soup.find_all('a', href=True)
                     if link['href'].endswith('pdf')]

        # iterate through all links and download them one by one
        for link in pdf_links:

            # turn relative links into absolute ones
            if not link.startswith('http'):
                link = archive_url + link

            # obtain the filename from the last part of the URL
            file_name = link.split('/')[-1]

            print("Downloading file: %s" % file_name)

            # stream the response to disk in 1 MB chunks
            r = requests.get(link, stream=True)
            with open(file_name, 'wb') as f:
                for chunk in r.iter_content(chunk_size=1024 * 1024):
                    if chunk:
                        f.write(chunk)

            print("%s downloaded!\n" % file_name)

        print("All files downloaded!")

    pdfDir = ""
    txtDir = ""

    if pdfDir == "": pdfDir = os.getcwd() + os.sep  # default to current directory
    if txtDir == "": txtDir = os.getcwd() + os.sep  # default to current directory

    for pdf_to_read in os.listdir(pdfDir):  # iterate through PDFs in the directory
        fileExtension = pdf_to_read.split(".")[-1]  # -1 always takes the last part
        if fileExtension == "pdf":
            pdfFilename = pdfDir + pdf_to_read
            textFilename = txtDir + pdf_to_read + ".txt"

            if istable == 'N':
                # plain text extraction with PyPDF2
                with open(pdfFilename, "rb") as pdf_file, \
                        open(textFilename, "a") as textFile:
                    pdf = PyPDF2.PdfFileReader(pdf_file)
                    for page_num in range(pdf.numPages):
                        textFile.write(pdf.getPage(page_num).extractText())

            else:
                # table extraction with tabula-py; read_pdf may return
                # a list of DataFrames, one per detected table
                dfs = read_pdf(pdfFilename, pages="all", multiple_tables=True)
                pd.concat(dfs).to_csv(textFilename, index=False)

Monday, May 6, 2019

PATSTAT data coverage

When doing advanced statistical analysis, it is important to understand the coverage and content of the data you are working with. PATSTAT Global contains data coming from all over the world. Quality, timeliness and completeness vary a great deal depending on patent office.


EPO compiled a Tableau  dashboard that maps the content and coverage of PATSTAT Global.


This chart shows the percentage of patent applications having a certain data element (e.g. CPC classification) by patent authority and application year in PATSTAT Global.
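The same kind of coverage figure can be computed on your own PATSTAT extract. A minimal pandas sketch, assuming a hypothetical extract with one row per application and an invented boolean flag telling whether the application has at least one CPC class:

```python
import pandas as pd

# hypothetical extract: appln_auth and appln_filing_year come from
# TLS201_APPLN; has_cpc is an invented flag (e.g. built by checking
# whether the application appears in TLS224_APPLN_CPC)
appln = pd.DataFrame({
    "appln_auth":        ["EP", "EP", "US", "US", "US", "JP"],
    "appln_filing_year": [2010, 2010, 2010, 2011, 2011, 2010],
    "has_cpc":           [True, False, True, True, False, True],
})

# percentage of applications with a CPC class,
# by patent authority and filing year
coverage = (appln
            .groupby(["appln_auth", "appln_filing_year"])["has_cpc"]
            .mean()
            .mul(100)
            .rename("pct_with_cpc")
            .reset_index())
print(coverage)
```

Taking the mean of a boolean column per group gives the share of True values directly, so `.mul(100)` turns it into the percentage shown on the dashboard.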



Wednesday, October 17, 2018

PATSTAT autumn 2018 MySQL upload scripts

At this link you can download a batch of MySQL scripts that will allow you to upload the new PATSTAT edition autumn 2018.

This release includes some improvements, such as:

* Tables TLS201_APPLN and TLS211: the GRANTED attribute changed from 0/1 to Y/N.

* Table TLS212_CITATION: Euro-PCT applications did not have the citations from the international search report linked to the respective application (and publication). These are the so-called A0 publications. To remedy this, EPO simply duplicated the citations from the international search report and linked them to the respective EP publications.

* Table TLS803_LEGAL_EVENT_CODE: has been redesigned to match WIPO ST.27.

Tuesday, October 16, 2018

PATSTAT projects on github



Refilling PATSTAT addresses

This project contains a Docker container with Python and MySQL code to refill person records where the address is missing.

https://github.com/cortext/patstat/tree/master/parsed%20addresses

Classify Legal Entities And Individuals From Patent Applicants

A batch of MySQL scripts to discriminate the type of applicant.

https://github.com/cortext/patstat/tree/master/applicants%20classification

Add official name of patent office


https://github.com/cortext/patstat/tree/master/nomenclatures/offices_classification

Building descriptions for the International Patent Classification

An API embedded into a VM to get the full description of IPC codes


https://github.com/cortext/patstat/tree/master/nomenclatures/ipc_descriptions


PATSTAT loader

https://github.com/simonemainardi/load_patstat


psClean
Python library and associated code for preparing PATSTAT inventor-patent data for disambiguation with either the Torvik-Smalheiser or Open City Dedupe algorithms.

https://github.com/markhuberty/psClean

 
fuzzygeo
fuzzygeo provides a fuzzy geocoding routine for geocoding at the named entity (city or similar) level
https://github.com/markhuberty/fuzzygeo

psClassify
A simple supervised learning algorithm to classify PATSTAT records into two categories:
  • person names
  • not person names
https://github.com/mkln/psClassify


Friday, September 21, 2018

Google dataset search

Recently Google launched a new service aimed at indexing local, public and national data repositories: Google Dataset Search.

Dataset Search lets you find datasets wherever they’re hosted, whether it’s a publisher's site, a digital library, or an author's personal web page.

Google also developed guidelines for dataset providers to describe their data in a way that search engines can better understand the content of their pages.

The approach is based on an open standard for describing this information (schema.org) and anybody who publishes data can describe their dataset this way.

The engine also links, where possible, the dataset to Google Scholar articles using them.
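As a rough sketch of what the schema.org markup looks like, the snippet below builds a minimal Dataset description as JSON-LD; all field values here are invented for illustration:

```python
import json

# minimal schema.org Dataset description (values are made up)
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example patent statistics dataset",
    "description": "A hypothetical dataset used to illustrate the markup.",
    "url": "https://example.org/dataset",
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

# the JSON-LD is typically embedded in the page inside a
# <script type="application/ld+json"> tag
jsonld = json.dumps(dataset, indent=2)
print(jsonld)
```

Pages carrying this kind of markup are what Dataset Search crawls and indexes.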


Full story @ link
https://www.blog.google/products/search/making-it-easier-discover-datasets/

Wednesday, September 19, 2018

How to build indicators from PATSTAT step by step

A presentation I prepared on how to build a patent-based indicator (patent originality) step by step for EPO PATSTAT, avoiding the most common pitfalls, with commented SQL code included.
I hope you will find it useful.