Tuesday, November 30, 2010

ESF‐APE‐INV 2nd “Name Game” workshop

The 2nd “Name Game” workshop on patent data will be held in Madrid, on December 9‐10, 2010, as part of the APE‐INV project, sponsored by the European Science Foundation.

It builds upon the success of a similar initiative held in Paris in November 2009, also as part of the APE‐INV project, and it aims at convening researchers interested in building inventor‐based patent datasets, which requires solving a number of technical problems and exchanging data and expertise. It will be hosted by IPP, the Instituto de Politicas y Bienes Publicos, of CSIC (Consejo Superior de Investigaciones Científicas) and organized by Francesco Lissoni (KITES‐Bocconi) and Catalina Martinez (IPP‐CSIC).

APE-INV is a project funded by the European Science Foundation that aims at identifying inventions stemming from academic research through a reclassification by inventor of patents from PatStat, the EPO Worldwide Patent Statistical Database.

Such a reclassification effort requires inventors’ names, surnames, and addresses to be parsed, matched, and filtered, in order to identify synonyms (that is, names+surnames or addresses which are the same, although spelled differently) and to disambiguate homonyms (verify whether two inventors with the same name and surname are indeed the same person). Several algorithms have been produced in the recent past, either with reference to data from PatStat or from national patent offices.

One of the objectives of the APE-INV project is to compare the accuracy and efficiency of such algorithms, and to involve as many researchers as possible in a collective research effort aimed at producing a shared database of inventors’ names, surnames, and addresses, linked to PatStat.

In order to achieve this objective APE-INV produces a number of PatStat-based benchmark databases, and invites all interested parties to test their algorithms against them. The present document (to be updated periodically) describes such benchmark databases, their rules of access, and provides guidelines on how to conduct the tests and how to report their results, in order to ensure comparability. Information is also provided on workshops that will be organized in order to allow a discussion of the results.

Objectives and Programme Approach

The main results expected from APE-INV are the following:

Sharing experiences for the creation of INV Database
  to share expertise and methods among European (and US or Japanese) researchers for the creation of an inventors’ database, one that will identify all different spelling variations of the inventor’s name, as well as the inventor’s different addresses and patents;
  to share expertise and methods among European researchers for matching the inventors’ database with national databases of academic scientists, in order to produce comparable counts of academic patenting activity and to collect auxiliary information on academic inventors;

Producing a Database on Academic Patenting in Europe (APE-INV Database)
  to produce a freely available database on “academic patenting in Europe”, that will contain reliable and comparable information on the contribution of European academic scientists to technology transfer via patenting, and that researchers will be able to update in the future.

Editing joint publications using the Data-set
  to explore the opportunities offered by the database, editing one or more joint publications containing original applications of the newly created databases;

Designing a Method to allow users to correct data
  to devise a method for collecting the database users’ feedback on the quality of the data, one that will allow users to enter their own corrections to the identification errors. Users’ corrections are particularly important in this case, since the identification of inventors relies on algorithms that make use of information on each inventor’s social network: so users’ corrections to the identity of one inventor may lead to correcting the identity of others.

Cooperating with established institutions in the field of patent data
  A unique window of opportunity has been recently opened by the European Patent Office (EPO), and its collaboration with the OECD Patent Statistics Task Force for the creation of PATSTAT, a new database for statistical use (joint with WIPO, the World Intellectual Property Organization, and Eurostat).

The APE-INV programme will help create a community of PATSTAT users, and turn PATSTAT into a reference source for the worldwide community of social scientists engaged in science and technology studies. Besides making use of PATSTAT data, APE-INV will try to establish co-operation ties with all the institutions involved in its development.

Participation in APE-INV

Currently, the APE-INV programme is supported by 11 ESF member organizations, namely:

• Austrian Science Fund (FWF)
• National Fund for Scientific Research (FNRS), Belgium
• Research Foundation Flanders (FWO), Belgium
• The National Foundation of Science, Higher Education and Technological Development of the Republic of Croatia (NZZ)
• Danish Social Science Research Council
• German Research Foundation (DFG)
• National Research Council (CNR), Italy
• Netherlands Organisation for Scientific Research (NWO)
• Council for Scientific Research (CSIC), Spain
• Swedish Research Council (VR)
• Swiss National Science Foundation (SNSF)

However, APE-INV is willing to extend participation to other countries, in order to become a pan-European research networking programme. In addition, collaboration with scientists worldwide is encouraged, and will be pursued with the help of subsidiary funds.

Workshop on patenting in China

This one-day workshop deals with various issues related to Chinese patents and the patent system in China: evolution of the patent system in China over the last 20 years, quality of Chinese patents, enforceability of patents in China, availability of datasets on Chinese patents and patent citations, effects of the patent system on economic performance. There will be presentations by economists and lawyers.

Maastricht, the Netherlands
UNU-MERIT, Keizer Karelplein 19, conference room
December 10, 2010

more on: http://www.merit.unu.edu/patenting/

Wednesday, November 24, 2010

Deeper into PATSTAT IPCs in September 2010 data

Two issues with IPCs 
First of all, a new group of IPC codes has appeared.

We have 73195 of them in all PATSTAT, but we still have no reclassification (OST30 or NACE) for them.

See as an example: EP 2250533
Soon we will try to suggest a possible category for this IPC.

On the other hand, we must highlight that some IPCs falling under ipc_class_level 'S' are present in the Sept. 2009 data alongside the Advanced and Core IPCs. EPO's suggestion is to drop them (they are about 1.3 million).

'A', 160.067.275
'C', 140.155.212
'S',   1.294.288
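To put the three counts in proportion, here is a minimal sketch that parses the dotted figures above and prints each level's share. The variable and function names are my own, not anything from PATSTAT. As a side check, the three counts add up to 301.516.775, the same figure reported for TLS209_APPLN_IPC in the September 2010 load log in another post.

```python
# Counts by ipc_class_level as listed above ('.' is the thousands separator).
raw_counts = {"A": "160.067.275", "C": "140.155.212", "S": "1.294.288"}


def parse_dotted(number: str) -> int:
    """Convert a number written with '.' as thousands separator to an int."""
    return int(number.replace(".", ""))


counts = {level: parse_dotted(value) for level, value in raw_counts.items()}
total = sum(counts.values())

# Print each level with its count and its share of all classified records.
for level, count in counts.items():
    print(f"{level}: {count:>11,d}  {100 * count / total:6.2f}%")
print(f"total: {total:,d}")
```

Dropping the 'S' level, as EPO suggests, would remove well under 1% of the records.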

Friday, November 19, 2010

News from patstat user conference

Some news about the future development of PATSTAT was disclosed in Vienna at the Patent Statistics for Decision Makers conference.

First of all, as already mentioned, the ECLA classification has been added to the available information.
This classification is an extension of IPC containing 135.000 symbols vs 70.000 in IPC.

April 2010 edition: added complete ECLA (ECLA, ICO, ECNO, IDT codes)
September 2010 edition: added EST codes Y02 (ICO symbols) for patents in the area of sustainable energy technologies

About ECLA, you can find an e-learning module on the EPO website.

Another important piece of news is the possibility of querying PATSTAT online from the EPO website.
From the URL https://data.epo.org/expert-services/ you can run some queries (some of them, those with a lock icon, are accessible only to service subscribers) and also make charts, straight on EPO data.

It has also been announced what could be released in the next PATSTAT editions:

• EP number of claims
• US number of claims
• US drawings
• PCT addresses

• A stable application id
• Cited References - New Features
• IPC Classes - Core and Advanced

In detail:
The number of claims may be added to EP-B publications within a PATSTAT table (probably TLS211).
However, EP-B publications may have different numbers of claims depending on the designated countries and the language considered.
For instance, EP 1311379 B1: GB/DE: 19 claims
AT/BE/.../TR: 13 claims

For the unique and stable application identifier, published applications will use the DOCDB "R-id"; replenished applications will use a surrogate key, e.g. starting from value 900,000,000

Cited References:
In Sept 2010 - New origins have been added
 '5' - International Search Report
 '6' - Supplementary Search Report
 '7' - Chapter II

In April 2011 - International Search Authority will be added
Options:
 provide the data inside CITN_ORIGIN, e.g. CITN_ORIGIN = '5SE'
 or introduce a new field to contain the authority
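If the first option is chosen, anyone reading CITN_ORIGIN will have to split the value again. The sketch below shows one way to do it, assuming the first character is the origin code and whatever follows is the authority. That layout is my guess from the '5SE' example, and the function and class names are hypothetical.

```python
from typing import NamedTuple, Optional

# Origin codes added in September 2010, as listed above.
ORIGIN_LABELS = {
    "5": "International Search Report",
    "6": "Supplementary Search Report",
    "7": "Chapter II",
}


class CitationOrigin(NamedTuple):
    code: str
    label: str
    authority: Optional[str]  # e.g. "SE"; None when no authority is attached


def parse_citn_origin(value: str) -> CitationOrigin:
    """Split a combined value such as '5SE' into origin code and authority.

    Assumption: one-character origin code followed by an optional authority.
    """
    value = value.strip()
    if not value:
        raise ValueError("empty CITN_ORIGIN value")
    code = value[0]
    authority = value[1:] or None
    label = ORIGIN_LABELS.get(code, "unknown origin")
    return CitationOrigin(code, label, authority)


print(parse_citn_origin("5SE"))
print(parse_citn_origin("6"))
```

With the second option (a separate field for the authority) no parsing would be needed, and existing queries on CITN_ORIGIN would keep working unchanged.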

Cited References - Optional Features:
 Option: have cited applications included
  origin '1' - cited by the applicant on filing
  earlier application filed by same applicant
  application that has not been published

IPC Classes - Core and Advanced:
In Sept 2010: an application has at least "core" level, possibly "advanced" level
In April 2011: an application will have either "core" level or "advanced" level

Monday, November 15, 2010

Importing PATSTAT TLS221 (PRS INPADOC legal status) into MySQL

From September 2010, INPADOC legal status data are available as a separate dataset, linked to the rest of PATSTAT via the application id.
The data come with no documentation of their own, since they refer to the existing documents of the PRS 14.11 product (so I link my previous posts on the subject).

In comparison to the old PRS data, fields L001 - 500 are gone: application and publication data can be retrieved from the core PATSTAT tables TLS201 and TLS211, so those fields became redundant.

The total number of records to be loaded in this edition is 80.328.938, divided into 16 files containing 5M records each (header excluded), plus a last file containing 328.938.
Be aware that you need 20 GB of disk space for the whole process.

At this link [right-click and choose "save as"] you may download the MySQL scripts for TLS221. Note that there are 2 scripts because of some errors in the data: if you want to be sure, halfway through the first script you should run the second one, which creates a temp table for checking and correcting records by hand (but this step is really for maniacs, since only a few records need to be corrected).
Then you can do other optimizations, such as converting date fields from char to date format.
Be also aware that in this first edition records are terminated by {LF}, except in files #14 and #15, where they are terminated by {CR/LF}.
My scripts reflect the current situation, but be aware that EPO will soon fix this problem, so the scripts will need to be amended.
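Since the terminator differs between files and may change again, it can be safer to detect it per file than to hard-code it. This is only a sketch: the helper names are mine, and it looks at the first chunk of the file only. LINES TERMINATED BY is the standard clause of the MySQL LOAD DATA statement.

```python
import os
import tempfile


def detect_terminator(path: str, sample_size: int = 1 << 20) -> str:
    """Return 'CRLF', 'LF' or 'unknown' based on the first chunk of a file."""
    with open(path, "rb") as handle:  # binary mode: no newline translation
        sample = handle.read(sample_size)
    if b"\r\n" in sample:
        return "CRLF"
    if b"\n" in sample:
        return "LF"
    return "unknown"


def mysql_terminator_clause(path: str) -> str:
    """Build the LINES TERMINATED BY clause matching the detected terminator."""
    kind = detect_terminator(path)
    if kind == "unknown":
        raise ValueError(f"no line terminator found in {path}")
    if kind == "CRLF":
        return "LINES TERMINATED BY '\\r\\n'"
    return "LINES TERMINATED BY '\\n'"


if __name__ == "__main__":
    # Tiny demo on two throwaway files, one per terminator style.
    samples = [("lf.txt", b"a,b\nc,d\n"), ("crlf.txt", b"a,b\r\nc,d\r\n")]
    with tempfile.TemporaryDirectory() as folder:
        for name, payload in samples:
            path = os.path.join(folder, name)
            with open(path, "wb") as out:
                out.write(payload)
            print(name, "->", mysql_terminator_clause(path))
```

A text field that itself contains a CR/LF pair inside the first chunk would fool this check, so treat the result as a hint and not as a guarantee.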

Here is a detailed list of the 'problematic records', by number of the TLS221 txt file (they usually have a text field ending with a /, which escapes the field terminator so that the fields are shifted; this is why records need fixing with the procedure mentioned above):

Row 2117823 was truncated; it contained more data than there were input columns
( appln_id 010762977 progr #2)
Row 1243470 doesn't contain data for all columns
Row 3091645 doesn't contain data for all columns
Row 3504629 doesn't contain data for all columns
Row 3021693 doesn't contain data for all columns

PATSTAT Sept. 2010 MySQL load scripts

A few days ago the September 2010 version of PATSTAT data was released.
Along with the 3 DVDs comes a procedure for uploading the data into MS SQL Server, but only a very old procedure for MySQL users.

Clicking on the icon above you may get my scripts for uploading the data. They are released as they are, under a CC 3.0 licence (this means you can use them but not resell 'em).
Please post a comment if you used them and found them useful, or if you have suggestions to improve 'em.

In this post I also add some comments.

First of all, there is a 'problem' in comparison to previous versions, since records are no longer terminated with {CR/LF} but with {LF}.

Apart from that, here is the log of the record load: we always lose some records in tables 202, 203, 206 and 214, but oddly we 'earn' some records in 210 and 217, compared to the content declared by EPO.

table               date of files     declared     imported  delta
TLS201_APPLN        201009          66.226.956   66.226.956      0
TLS202_APPLN_TITLE  201009          48.303.269   48.303.256     13
TLS203_APPLN_ABSTR  201009          18.139.427   18.139.356     71
TLS204_APPLN_PRIOR  201009          28.823.857   28.823.857      0
TLS205_TECH_REL     201009           2.122.738    2.122.738      0
TLS206_ASCII        201009          37.428.136   37.428.107     29
TLS207_PERS_APPLN   201009         134.687.197  134.687.197      0
TLS208_DOC_STD_NMS  201009          16.864.577   16.864.577      0
TLS209_APPLN_IPC    201009         301.516.775  301.516.775      0
TLS210_APPLN_N_CLS  201009          25.289.374   25.289.379     -5
TLS211_PAT_PUBLN    201009          74.161.545   74.161.545      0
TLS212_CITATION     201009          97.111.948   97.111.948      0
TLS214_NPL_PUBLN    201009          14.826.883   14.826.881      2
TLS215_CITN_CATEG   201009          18.043.102   18.043.102      0
TLS216_APPLN_CONTN  201009           1.769.423    1.769.423      0
TLS217_APPLN_I_CLS  201009         101.894.277  101.894.406   -129
TLS218_DOCDB_FAM    201009          58.713.013   58.713.013      0
TLS219_INPADOC_FAM  201009          66.226.956   66.226.956      0
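The delta column is simply declared minus imported, so a positive value means records lost during the load and a negative one means extra records. A small sketch to recompute it from the figures in the log (only the rows with a non-zero delta are copied here, and the names are my own):

```python
# (table, declared, imported) copied from the log above; only rows with a
# non-zero delta are listed to keep the sketch short.
rows = [
    ("TLS202_APPLN_TITLE", "48.303.269", "48.303.256"),
    ("TLS203_APPLN_ABSTR", "18.139.427", "18.139.356"),
    ("TLS206_ASCII", "37.428.136", "37.428.107"),
    ("TLS210_APPLN_N_CLS", "25.289.374", "25.289.379"),
    ("TLS214_NPL_PUBLN", "14.826.883", "14.826.881"),
    ("TLS217_APPLN_I_CLS", "101.894.277", "101.894.406"),
]


def to_int(dotted: str) -> int:
    """Parse a number that uses '.' as thousands separator."""
    return int(dotted.replace(".", ""))


# delta = declared - imported, as in the last column of the log.
deltas = {
    table: to_int(declared) - to_int(imported)
    for table, declared, imported in rows
}

for table, delta in deltas.items():
    status = "records lost" if delta > 0 else "extra records"
    print(f"{table:<20} {delta:>5}  {status}")
```

The same check can be run after every load by replacing the imported figures with the result of a SELECT COUNT(*) on each table.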

To go into detail (most users will find this boring), some records in some tables have minor problems.

First of all, 3 EP applications have an odd application and publication number that gives an upload error in part 3 of the TLS201 and TLS211 files (this is due to a non-ASCII char in the application/publication number: its translation into 2 chars makes the field wider than it should be).

TLS201 - part03:
Data too long for column 'APPLN_NR' at row 21104523
Data too long for column 'APPLN_NR' at row 21145320
Data too long for column 'APPLN_NR' at row 21764445
TLS211 -PART03
Data too long for column 'PUBLN_NR' at row 19039112
Data too long for column 'PUBLN_NR' at row 19079909
Data too long for column 'PUBLN_NR' at row 19699034

This is the content of the data; the field giving the error is the application/publication number (third field):

73151912, 'EP', '      9600Ó7LI1', 'A', 66151912, '9999-12-31', '', 0
66151912, 'EP', '      9600Ó7LI1', 'D2', '9999-12-31', 'PI', '', '', 0
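The effect can be reproduced in a few lines. This is a sketch under two assumptions of mine: that the file is UTF-8 encoded while the load reads it with a single-byte character set, and that the column is 15 characters wide (the width of the value above).

```python
# Value from the record above: 6 leading spaces plus '9600Ó7LI1'.
appln_nr = " " * 6 + "9600Ó7LI1"
COLUMN_WIDTH = 15  # assumed width of APPLN_NR / PUBLN_NR

# In UTF-8 the character 'Ó' takes two bytes instead of one.
utf8_bytes = appln_nr.encode("utf-8")

# A reader that treats every byte as one character sees two chars for 'Ó'.
misread = utf8_bytes.decode("latin-1")

print("characters as intended:", len(appln_nr))
print("bytes in UTF-8:        ", len(utf8_bytes))
print("characters as misread: ", len(misread))
print("fits the column:       ", len(misread) <= COLUMN_WIDTH)
```

If this is what happens, declaring the right character set in the load statement should make the three records fit without widening the column.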

Then we have 2 titles longer than 3000 chars, again due to non-ASCII chars translated into 2 chars, which increases the size of the title field.

TLS202 - PART01
Data too long for column 'APPLN_TITLE' at row 3126639
Data too long for column 'APPLN_TITLE' at row 3136166
003544896,"mîtodo e composiúÃúo farmacêutica para a prevenúÃúo ou o tratamento de uma doenúa ou uma condiúÃúo autoimune ou infecciosa, mîtodo e composiúÃúo farmacêutica para a prevenúÃúo ou o tratamento de uma doenúa ou uma condiúÃúo do sangue, mîtodo e composiúÃúo farmacêutica para modular a formaúÃúo de cîlulas do sangue, mîtodo e composiúÃúo farmacêutica para intensificar a mobilizaúÃúo perifîrico da cîlula tronco, mîtodo e composiúÃúo farmacêutica para a prevenúÃúo ou o tratamento de uma doenúa ou uma condiúÃúo metabælica, mîtodo e composiúÃúo farmacêutica para a prevenúÃúo ou o tratamento das condiúÃÁes associadas com doses mieloablativas de quimioradioterapia suportadas pelo transplante autælogo de medula æssea ou de cîlulas tronco do sangue perifîrico (asct) ou pelo transplante alogenîico de medula æssea (bmt), mîtodo e composiúÃúo farmacêutica para aumentar o efeito de um fator estimulante de cîlulas do sangue, mîtodo e composiúÃúo farmacêutica para intensificar a colonizaúÃúo de cîlulas tronco do sangue doadas em um receptor mieloablatado, mîtodo e composiúÃúo farmacêutica para a prevenúÃúo ou o tratamento de uma doenúa ou uma condiúÃúo bacteriana, composiúÃúo farmacêutica para o tratamento ou a prevenúÃúo de uma indicaúÃúo selecionada do grupo que consiste em doenúa ou condiúÃúo autoimune, doenúa viral, infecúÃúo viral, doenúa hematolægica, deficiências hematolægicas, trombocitopenia, pancitopenia, granulopenia, hiperlipidemia, hipercolesterolemia, glucosuria, hiperglicemia, diabetes, aids, hiv-1, distúrbios de cîlulas t auxiliares, deficiências de cîlulas dendrüticas, deficiências de macrofagos, distúrbios de cîlulas tronco hematopoiîticas incluindo distúrbios com plaquetas, linfæcitos, cîlulas do plasma e neutræfilos, condiúÃÁes prî-leucêmicas, condiúÃÁes leucêmicas, distúrbios do sistema imunolægico resultantes da terapia de quimioterapia ou de radiaúÃúo, distúrbios do sistema imunolægico humano resultantes do tratamento das doenúas de deficiência 
imunolægica e infecúÃÁes bacterianas, composiúÃúo farmacêutica para o tratamento ou a prevenúÃúo de um indicaúÃúo selecionada do grupo que consiste em doenúa hematolægica, deficiências hematolægicas, trombocitopenia, pancitopenia, granulopenia, deficiências de cîlulas dendrüticas, deficiências de macrofagos, distúrbios de cîlulas tronco hematopoiîticas incluindo distúrbios com plaquetas, linfæcitos, cîlulas do plasma e neutræfilos, condiúÃÁes prî-leucêmicas, condiúÃÁes leucêmicas, sündrome mielodisplÃístiacas, malignidades nÃúo mielæides, anemia plÃística e insuficiência da medula æssea, peptüdeo purificado, peptüdeo quimîrico purificado, peptüdeo quimîrico, composiúÃúo farmacêutica, composiúÃúo farmacêutica para a prevenúÃúo ou o tratamento de uma condiúÃúo associada com um agente infeccioso de sars, mîtodo de processamento a baixa temperatura de hidrolisato proteolütico de caseüna e hidrolisato de proteüna de caseüna"

Same thing for 1 abstract:

tls203 - PART07
Data too long for column 'APPLN_ABSTRACT' at row 702251

In table 206 ASCII - part01 we have one record (#56) that gives some problems, since a text field ends with a / and this shifts some fields, giving the error 'Row 56 doesn't contain data for all columns'

Also in TLS206 - part05
the last 63 records don't contain data for all columns, but this time for real...
(Row 1428062 till 1428127)
037428066010830866054240706US       43659874A 3USPI0003        000000000000000000000

Enjoy your database...

Wednesday, November 10, 2010

Double publication numbers by application id in USPTO

As previously posted, several application authorities may have more than one publication number referring to the same application id.

In the case of USPTO, these are the main pairs of publication kinds for publication numbers referring to the same application id. It may be useful for creating a disambiguation table.

kind1 kind2 count
'A'  'B1' 3895
'A'  'B2' 124
'A'  'C1' 356
'A'  'E1' 12
'A1'  'A2' 521
'A1'  'A9' 1662
'A1'  'B1' 18410
'A1'  'B2' 845793
'A1'  'H1' 43
'A2'  'B2' 169
'A9'  'B2' 768
'B1'  'B2' 142
'E'  'F1' 59
'P1'  'P2' 424
'P1'  'P3' 1686

By the way, counting the difference in months between filing date and publication date for the first of the two publication kinds, we get the following result:

months # of applications
4 50501
5 53963
6 130103
7 111533
8 67594
9 48504
10 33086
11 26268
12 20754
13 15320
14 13855
15 11927
16 9346
17 7075
18 186961
19 32484
20 6354

Above 95% of the applications listed in the first table are published within 18 months (cases in month 19 may be a rounding error), supporting our hypothesis that the first of the 2 publications is due to the 18-month publication law.
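The share can be recomputed from the months table. Note that the sketch only covers the months actually listed (4 to 20), so the percentages are relative to those rows and not to every application with a double publication.

```python
# Months between filing and first publication -> number of applications,
# copied from the table above.
months_to_count = {
    4: 50501, 5: 53963, 6: 130103, 7: 111533, 8: 67594, 9: 48504,
    10: 33086, 11: 26268, 12: 20754, 13: 15320, 14: 13855, 15: 11927,
    16: 9346, 17: 7075, 18: 186961, 19: 32484, 20: 6354,
}

total = sum(months_to_count.values())


def cumulative_up_to(limit: int) -> int:
    """Number of applications published within `limit` months."""
    return sum(n for months, n in months_to_count.items() if months <= limit)


within_18 = cumulative_up_to(18)
within_19 = cumulative_up_to(19)

print(f"listed applications: {total}")
print(f"within 18 months: {within_18} ({100 * within_18 / total:.1f}%)")
print(f"within 19 months: {within_19} ({100 * within_19 / total:.1f}%)")
```

Counting month 19 as well (the possible rounding cases) brings the share above 99% of the listed rows.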

See also:
18-month publication provision:

About non-publication request:

Thanks to Francesco Lissoni for help in solving this issue.

Monday, November 8, 2010

Linkage among application and publication tables in PATSTAT

We could expect (from EPO documents) that every application has 0 to N publications, where every publication belongs to exactly 1 application.
In reality, it isn't so.

Let's check in depth the table containing applications (TLS201) and the one with publications (TLS211).

The linkage between them may be established via the APPLN_ID field, which is the main key in PATSTAT.

Referring to the Sept. 2009 edition, we start by checking the application table TLS201.
The first check, whether the same application_id can refer to distinct application authority / application number pairs, passes: it never does (it would have been real trouble otherwise!)

If we check, on the same table, application authority / application number pairs having more than one application id, we find 9% duplication.
Nothing strange: we already stated that the disambiguating data for applications are application authority, application number and APPLICATION KIND, so this percentage refers to offices where the same application number may be used (for instance) for a patent of invention and for a utility model, referring to different applications filed.
So this case too looks OK with our data model.

Let's see instead what happens with the publications table (TLS211).
Using the same criteria used on TLS201, we start by checking whether the same application_id can refer to distinct publication authority / publication number pairs.

We get about 10% double counting, where the top 5 by publication office is
'JP'    4202644
'US'    867657
'CN'    513398
'GB'    333662
'IT'    264060
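The check behind these counts can be sketched as a GROUP BY with a HAVING clause. The example below runs it on an in-memory SQLite table, so it is not my actual MySQL query: the table layout is simplified, the appln_id values are invented, and only the Italian pair of publication numbers comes from the example discussed in this post. It also assumes each application id has a single publication authority.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE tls211_pat_publn (
           appln_id   INTEGER,
           publn_auth TEXT,
           publn_nr   TEXT,
           publn_kind TEXT
       )"""
)
conn.executemany(
    "INSERT INTO tls211_pat_publn VALUES (?, ?, ?, ?)",
    [
        (1, "IT", "1243259", "B"),   # granted patent
        (1, "IT", "9001711", "D0"),  # same application, second number
        (2, "EP", "0000001", "A1"),  # same number, different kinds:
        (2, "EP", "0000001", "B1"),  # not counted as a duplicate
    ],
)

# Application ids linked to more than one publication number,
# counted by publication office.
query = """
    SELECT publn_auth, COUNT(*) AS duplicated_applications
    FROM (
        SELECT appln_id, publn_auth
        FROM tls211_pat_publn
        GROUP BY appln_id, publn_auth
        HAVING COUNT(DISTINCT publn_nr) > 1
    ) AS dup
    GROUP BY publn_auth
    ORDER BY duplicated_applications DESC
"""
duplicates = conn.execute(query).fetchall()
print(duplicates)
```

Swapping the roles of appln_id and publn_nr in the inner query gives the opposite check, publication numbers linked to more than one application id.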

The bad news here is that this duplication is due to issues tightly related to the application authority!

In the case of Japan, duplication is due to different publication numbers originating from the same application.
See this example for publication number JP53123578.

For China the same happens as for Japan (see example: CN 1274133)

About USPTO: a change of law took place in 2000 (the eighteen-month publication provisions of the American Inventors Protection Act of 1999), creating a duplication of the publications related to the same application: a publication number issued 18 months after filing, related to the application only, then later a final publication number issued when the patent is granted. An example here.

In the Italian patent office (I'm so proud that at last we are in a top 5!!!) we meet a lot of D0 publication kinds, which refer to a lot of non-existing publications, or better, publications we cannot find, for instance, in espacenet.
For instance, publication number IT1243259, kind B, shares its application id with IT9001711, kind D0, which in PATSTAT documents is listed as 'Filing application'.

The same issue with the D0 kind is found in the GB patent office, together with other duplicated kinds (the British always overdo it... see GB2358979)

So we cannot really find a general rule, and each country should be investigated separately; we can probably drop the D0 kind everywhere, and we should know that by counting by publication number we run the risk of overestimating the number of patents.
Good news: EPO data are not affected by this issue.

Finally, if we check, on the same table, publication authority / publication number pairs having more than one application id, we find 11% duplication.

Here we meet patents in different statuses and, in some cases, also a link to an unpublished priority application (see appln_id > 58.000.000 and date = 9999-12-31)

An example from EPO is patent EP122624, which has 4 application ids, one for each of the following kinds: A2, A3, B1, and also one for D1, which is an unpublished priority with date 31/12/9999.

Another example from JP patent office (we met this patent before) JP53123578.

Monday, November 1, 2010

PATSTAT September 2010: what's new

This is just a summary of the EPO letter enclosed with the DVD set of September 2010....

- First grant indication: the data for element 049 PUBLN_FIRST_GRANT has been replaced by source data from DOCDB XML in order to guarantee that the data in PATSTAT is in line with other EPO patent information products.

- PRS data: in cooperation with the INPADOC legal status data team, an extra table containing the PRS data is now available. The table TLS221_INPADOC_PRS is NOT part of the standard PATSTAT distribution. You will need to purchase product 14.11: INPADOC Worldwide legal status (PRS).

- Citation origin: in table TLS212_CITATION extra "origins" for the citations have been added. The list now also contains:
    ISR ==> 5 - citations from the International Search Report
    SUP ==> 6 - citations from the Supplementary Search Report
    CH2 ==> 7 - citations introduced during the Chapter 2 phase of the PCT.
Observant PATSTAT users will have seen that we also use the numerical presentation of the citation origin.

- Green technology: TLS217_APPLN_ECLA now also contains the Y02 tags covering "TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE". These tags describe technologies that hold the potential for reducing waste and emissions, including greenhouse gas emissions.
More information on EPO's role on this issue @link