Rawpatentdata: November 2010

Tuesday, November 30, 2010

ESF‐APE‐INV 2nd “Name Game” workshop

The 2nd “Name Game” workshop on patent data will be held in Madrid, on December 9‐10, 2010, as part of the APE‐INV project, sponsored by the European Science Foundation.

It builds upon the success of a similar initiative held in Paris in November 2009, also as part of the APE‐INV project, and it aims at convening researchers interested in building inventor‐based patent datasets, which requires solving a number of technical problems and exchanging data and expertise. It will be hosted by IPP, the Instituto de Politicas y Bienes Publicos, of CSIC (Consejo Superior de Investigaciones Científicas) and organized by Francesco Lissoni (KITES‐Bocconi) and Catalina Martinez (IPP‐CSIC).

APE-INV is a project funded by the European Science Foundation that aims at identifying inventions stemming from academic research through a reclassification by inventor of patents from PatStat, the EPO Worldwide Patent Statistical Database.

Such a reclassification effort requires inventors’ names, surnames, and addresses to be parsed, matched, and filtered, in order to identify synonyms (that is, names+surnames or addresses which are the same, although spelled differently) and to disambiguate homonyms (that is, to verify whether two inventors with the same name and surname are indeed the same person). Several algorithms have been produced in the recent past, either with reference to data from PatStat or from national patent offices.

One of the objectives of the APE-INV project is to compare the accuracy and efficiency of such algorithms, and to involve as many researchers as possible in a collective research effort aimed at producing a shared database of inventors’ names, surnames, and addresses, linked to PatStat.

In order to achieve this objective APE-INV produces a number of PatStat-based benchmark databases, and invites all interested parties to test their algorithms against them. The present document (to be updated periodically) describes such benchmark databases, their rules of access, and provides guidelines on how to conduct the tests and how to report their results, in order to ensure comparability. Information is also provided on workshops that will be organized in order to allow a discussion of the results.

Objectives and Programme Approach

The main results expected from APE-INV are these:

Sharing experiences for the creation of INV Database
to share expertise and methods among European (and US or Japanese) researchers for the creation of an inventors’ database, one that will identify all different spelling variations of the inventor’s name, as well as the inventor’s different addresses and patents;
to share expertise and methods among European researchers for matching the inventors’ database with national databases of academic scientists, in order to produce comparable counts of academic patenting activity and to collect auxiliary information on academic inventors;

Producing a Database on Academic Patenting in Europe (APE-INV Database)
to produce a freely available database on “academic patenting in Europe”, that will contain reliable and comparable information on the contribution of European academic scientists to technology transfer via patenting, and that researchers will be able to update in the future.

Editing joint publications using the Data-set
to explore the opportunities offered by the database, editing one or more joint publications containing original applications of the newly created databases;

Designing a Method to allow users to correct data
to devise a method for collecting the database users’ feedback on the quality of the data, one that will allow users to enter their own corrections to the identification errors. Users’ corrections are particularly important in this case, since the identification of inventors relies on algorithms that make use of information on each inventor’s social network: so users’ corrections to the identity of one inventor may lead to correcting the identity of others.

Cooperating with established institutions in the field of patent data
A unique window of opportunity has been recently opened by the European Patent Office (EPO), and its collaboration with the OECD Patent Statistics Task Force for the creation of PATSTAT, a new database for statistical use (joint with WIPO, the World Intellectual Property Organization, and Eurostat).

The APE-INV programme will contribute to creating a community of PATSTAT users, and to turning PATSTAT into a reference source for the worldwide community of social scientists engaged in science and technology studies. Besides making use of PATSTAT data, APE-INV will try to establish co-operation ties with all the institutions involved in its development.

Participation in APE-INV

Currently, the APE-INV programme is supported by 9 ESF member organizations, namely:

• Austrian Science Fund (FWF)
• National Fund for Scientific Research (FNRS), Belgium
• Research Foundation Flanders (FWO), Belgium
• The National Foundation of Science, Higher Education and Technological Development of the Republic of Croatia (NZZ)
• Danish Social Science Research Council
• German Research Foundation (DFG)
• National Research Council (CNR), Italy
• Netherlands Organisation for Scientific Research (NWO)
• Council for Scientific Research (CSIC), Spain
• Swedish Research Council (VR)
• Swiss National Science Foundation (SNSF)

However, APE-INV is willing to extend participation to other countries, in order to become a pan-European research networking programme. In addition, collaboration with scientists worldwide is encouraged, and will be pursued with the help of subsidiary funds.

Workshop on patenting in China

This one-day workshop deals with various issues related to Chinese patents and the patent system in China: evolution of the patent system in China over the last 20 years, quality of Chinese patents, enforceability of patents in China, availability of datasets on Chinese patents and patent citations, and effects of the patent system on economic performance. There will be presentations by economists and lawyers.

Maastricht, the Netherlands
UNU-MERIT, Keizer Karelplein 19, conference room
December 10, 2010

more on: http://www.merit.unu.edu/patenting/

Wednesday, November 24, 2010

Deeper into patstat ipcs in september 2010 data

Two issues with IPCs
First of all, a new group of IPC codes has appeared:

G01Q

We have 73195 of them in all patstat, but still no reclassification (OST30 or NACE) for them.

See for example: EP 2250533
Soon we will try to suggest a possible category for this IPC.

On the other hand, we must point out that some IPCs with ipc_class_level 'S' are present in the Sept. 2010 data alongside Advanced and Core IPCs. EPO's suggestion is to drop them (there are about 1.3 million):

ipc_class_level	count
'A'	160.067.275
'C'	140.155.212
'S'	1.294.288
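
As a rough check on the counts above, the short sketch below parses the figures (written with dots as thousands separators) and computes the share of level 'S' rows. The variable and function names are illustrative only, not part of any patstat script.

```python
# Counts by ipc_class_level as listed above (dots are thousands separators).
raw_counts = {"A": "160.067.275", "C": "140.155.212", "S": "1.294.288"}


def parse_count(text: str) -> int:
    """Turn a figure such as '1.294.288' into the integer 1294288."""
    return int(text.replace(".", ""))


counts = {level: parse_count(value) for level, value in raw_counts.items()}
total = sum(counts.values())
share_s = counts["S"] / total

print(f"total rows: {total}")
print(f"share of level 'S': {share_s:.2%}")
```

The total comes to 301.516.775, the same figure declared for TLS209_APPLN_IPC in the September 2010 load log, and level 'S' accounts for well under 1% of the rows.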

Friday, November 19, 2010

News from patstat user conference

Some news about future patstat development was disclosed in Vienna at the Patent Statistics for Decision Makers conference.

First of all, as already mentioned, the ECLA classification has been added to the available information.
This classification is an extension of IPC containing 135.000 symbols vs 70.000 in IPC.

April 2010 edition: added complete ECLA (ECLA, ICO, ECNO, IDT codes)
September 2010: added EST codes Y02 (ICO symbols), for patents in the area of sustainable energy technologies

About ECLA, you can find an e-learning module on the EPO website.

Another important issue is the possibility of querying patstat online from the EPO website.
From the URL https://data.epo.org/expert-services/ you can run some queries (some of them, those with a lock icon, are accessible only to service subscribers) and also make charts, straight from EPO data.

It has also been announced what could be released in the next patstat editions:

• EP number of claims
• US number of claims
• US drawings
• PCT addresses
• A stable application id
• Cited References - New Features
• IPC Classes - Core and Advanced

In detail:
The number of claims may be added to EP-B publications within a patstat table (probably TLS211).
However, EP-B publications may have different numbers of claims depending on the designated countries and the language considered.
For instance, EP 1311379 B1:
GB/DE: 19 claims
AT/BE/.../TR: 13 claims

For the unique and stable application identifier, published applications will use the DOCDB "R-id"; replenished applications will use a surrogate key, e.g. starting from value 900,000,000.

Cited References:
In Sept 2010 - New origins have been added
'5' - International Search Report
'6' - Supplementary Search Report
'7' - Chapter II

In April 2011 - International Search Authority will be added
Options:
provide the data inside CITN_ORIGIN, e.g. CITN_ORIGIN = '5SE',
or introduce a new field to contain the authority
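
To make the first option concrete, here is a minimal sketch of how a combined value such as '5SE' could be split back into origin and authority. It assumes a one-character origin code followed by an optional authority code; that layout is only one of the options announced, and the names `CitationOrigin` and `split_citn_origin` are hypothetical.

```python
from typing import NamedTuple, Optional

# Origin codes added in September 2010, as listed above.
NEW_ORIGINS = {
    "5": "International Search Report",
    "6": "Supplementary Search Report",
    "7": "Chapter II",
}


class CitationOrigin(NamedTuple):
    origin: str               # one-character origin code, e.g. '5'
    authority: Optional[str]  # search authority, e.g. 'SE', or None if absent


def split_citn_origin(value: str) -> CitationOrigin:
    """Split a combined value such as '5SE' into origin '5' and authority 'SE'.

    Assumption: the first character is the origin, the rest is the authority.
    """
    value = value.strip()
    if not value:
        raise ValueError("empty CITN_ORIGIN value")
    return CitationOrigin(value[0], value[1:] or None)


parsed = split_citn_origin("5SE")
print(parsed.origin, NEW_ORIGINS.get(parsed.origin), parsed.authority)
```

With a separate field for the authority (the second option) no such parsing would be needed, which is probably the main argument in its favour.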

Cited References - Optional Features:
Option: have cited applications included, with origin '1' - cited by the applicant on filing: an earlier application filed by the same applicant, i.e. an application that has not been published

IPC Classes - CORE and ADVANCED
In Sept 2010: an application would have at least "core" level, possibly "advanced" level
In April 2011: an application will have either "core" level or "advanced" level

Monday, November 15, 2010

Importing patstat TLS221 (PRS inpadoc legal status) into MYSql

Since September 2010, INPADOC legal status data are available as a separate dataset, linked to the rest of patstat via the application id.
The data come with no documentation of their own, since they refer to the existing documents for the PRS 14.11 product (so I link my previous posts on the subject).

In comparison to the old PRS data, the fields L001 - 500 are gone: application and publication data may be retrieved from the core patstat tables TLS201 and TLS211, so those fields became redundant.

The total number of records to be loaded in this edition is 80.328.938, divided into 16 files of 5M records each (header excluded) plus a last file containing 328.938.
Be aware that you will need 20 GB of disk space for the whole process.
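
The file layout can be double-checked with a line of arithmetic; the sketch below only restates the figures given above, and the constant names are made up for the example.

```python
TOTAL_RECORDS = 80_328_938     # records declared for this edition
RECORDS_PER_FILE = 5_000_000   # records in each full file, header excluded

# Number of full files and the size of the last, partial file.
full_files, remainder = divmod(TOTAL_RECORDS, RECORDS_PER_FILE)
n_files = full_files + (1 if remainder else 0)

print(full_files, remainder, n_files)
```

This gives 16 full files plus a last one of 328938 records, 17 files in all.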

At this link [right-click and choose save as] you may download the MySQL scripts for TLS221. Note that there are 2 scripts, because of some errors in the data: if you want to be sure, halfway through the first script you should run the second one, which creates a temp table to check and correct by hand (but this step is really for maniacs, since only a few records need correcting).
Then you can do other optimizations, like converting date fields from char to a date format, etc.
Be also aware that in this first edition records are terminated by {LF}, except in files #14 and #15, where they are terminated by {CR/LF}.
My scripts reflect the current situation, but be aware that EPO will soon fix this problem, so the scripts will need to be amended.
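
Since the line terminator differs between files, it may be worth checking each file before loading it. The following is a minimal sketch, not part of the downloadable scripts: the function name and the sample size are assumptions, and it only looks at the first newline it finds.

```python
def detect_line_terminator(path: str, sample_size: int = 1 << 20) -> str:
    """Return 'CRLF', 'LF' or 'unknown', based on the first newline in a sample."""
    # Read in binary mode so that Python does not translate line endings.
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)
    position = sample.find(b"\n")
    if position == -1:
        return "unknown"
    if position > 0 and sample[position - 1:position] == b"\r":
        return "CRLF"
    return "LF"
```

The result can then be used to choose the value of the LINES TERMINATED BY clause in each LOAD DATA INFILE statement, so that the scripts keep working once EPO changes the terminators.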

If you need it, here is a detailed list of 'problematic records' by number of the TLS221 txt file. They usually have a text field ending with a /, which escapes the field terminator so that the following fields are shifted; this is why these records need fixing with the above-mentioned procedure.

File 3:
Row 2117823 was truncated; it contained more data than there were input columns
(appln_id 010762977, progr #2)
File 14:
Row 1243470 doesn't contain data for all columns
File 15:
Row 3091645 doesn't contain data for all columns
Row 3504629 doesn't contain data for all columns
File 16:
Row 3021693 doesn't contain data for all columns
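
Rows of this kind can also be looked for before loading. The sketch below is only a heuristic under stated assumptions: fields are comma-separated, text fields are double-quoted, and the escape character is passed as a parameter (MySQL's default escape character for LOAD DATA is the backslash, while this post writes it as /). The function name is hypothetical and the sample rows are invented.

```python
from typing import Iterable, List


def rows_with_trailing_escape(lines: Iterable[str], escape: str = "\\",
                              delimiter: str = ",", quote: str = '"') -> List[int]:
    """Return 1-based numbers of rows where a field ends with the escape character."""
    # An escape right before a closing quote + delimiter, or right before a
    # bare delimiter, would make the loader swallow the field terminator.
    patterns = (escape + quote + delimiter, escape + delimiter)
    flagged = []
    for number, line in enumerate(lines, start=1):
        body = line.rstrip("\r\n")
        ends_badly = body.endswith(escape) or body.endswith(escape + quote)
        if ends_badly or any(pattern in body for pattern in patterns):
            flagged.append(number)
    return flagged


# Invented sample: only the second row has a text field ending with the escape.
sample = ['1,"ok",2010\n', '2,"ends with escape\\",2010\n', '3,"fine"\n']
print(rows_with_trailing_escape(sample))
```

It may flag a few legitimate rows too (for example a properly doubled escape character), so the hits still need a look by hand.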

patstat sept. 2010 MYSql load scripts

A few days ago the September 2010 version of PATSTAT data was released.
Along with the 3 DVDs comes a procedure for uploading the data into MS SQL Server, but only a very old procedure for MySQL users.

Clicking on the icon above you can get my scripts for uploading the data. They are released as they are, under a CC 3.0 licence (this means you can use them but not resell 'em).
Please post a comment if you used them and found them useful, or if you have suggestions to improve 'em.

In this post I also put some comments.

First of all there is a 'problem' in comparison to previous versions, since records are not terminated with {CR/LF} but with {LF}.

Apart from that, here is the log of the record load: we always lose some records in tables 202 and 203, but oddly we 'gain' some records in 210 and 217, compared to the content declared by EPO.

table	date of files	declared	imported	delta
TLS201_APPLN	201009	66.226.956	66.226.956	0
TLS202_APPLN_TITLE	201009	48.303.269	48.303.256	13
TLS203_APPLN_ABSTR	201009	18.139.427	18.139.356	71
TLS204_APPLN_PRIOR	201009	28.823.857	28.823.857	0
TLS205_TECH_REL	201009	2.122.738	2.122.738	0
TLS206_ASCII	201009	37.428.136	37.428.107	29
TLS207_PERS_APPLN	201009	134.687.197	134.687.197	0
TLS208_DOC_STD_NMS	201009	16.864.577	16.864.577	0
TLS209_APPLN_IPC	201009	301.516.775	301.516.775	0
TLS210_APPLN_N_CLS	201009	25.289.374	25.289.379	-5
TLS211_PAT_PUBLN	201009	74.161.545	74.161.545	0
TLS212_CITATION	201009	97.111.948	97.111.948	0
TLS214_NPL_PUBLN	201009	14.826.883	14.826.881	2
TLS215_CITN_CATEG	201009	18.043.102	18.043.102	0
TLS216_APPLN_CONTN	201009	1.769.423	1.769.423	0
TLS217_APPLN_I_CLS	201009	101.894.277	101.894.406	-129
TLS218_DOCDB_FAM	201009	58.713.013	58.713.013	0
TLS219_INPADOC_FAM	201009	66.226.956	66.226.956	0
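
The delta column is simply declared minus imported. As a small sketch, here is the same computation for a few rows copied from the table; the helper name `to_int` is made up for the example.

```python
# A few rows copied from the table above: (table, declared, imported).
rows = [
    ("TLS202_APPLN_TITLE", "48.303.269", "48.303.256"),
    ("TLS203_APPLN_ABSTR", "18.139.427", "18.139.356"),
    ("TLS210_APPLN_N_CLS", "25.289.374", "25.289.379"),
    ("TLS217_APPLN_I_CLS", "101.894.277", "101.894.406"),
]


def to_int(figure: str) -> int:
    """Parse a figure written with dots as thousands separators."""
    return int(figure.replace(".", ""))


# Positive delta: rows lost during the load; negative delta: rows gained.
deltas = {name: to_int(declared) - to_int(imported)
          for name, declared, imported in rows}

for name, delta in deltas.items():
    status = "rows missing" if delta > 0 else "extra rows" if delta < 0 else "ok"
    print(f"{name}: {delta:+d} ({status})")
```

Running the same comparison after each load is a cheap way to spot a table where the import went wrong.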

To go into detail (most users will find this boring), some records in some tables have minor problems.

First of all, 3 EP applications have an odd application and publication number that gives an upload error in part 3 of files TLS201 and TLS211 (this is due to a non-ASCII char in the app/pub number: its translation into 2 chars makes the field wider than it should be).

TLS201 - part03:
Data too long for column 'APPLN_NR' at row 21104523
Data too long for column 'APPLN_NR' at row 21145320
Data too long for column 'APPLN_NR' at row 21764445
TLS211 - part03:
Data too long for column 'PUBLN_NR' at row 19039112
Data too long for column 'PUBLN_NR' at row 19079909
Data too long for column 'PUBLN_NR' at row 19699034

This is the content of the data; the field giving the error is the third one:

73151912, 'EP', '      9600Ã“7LI1', 'A', 66151912, '9999-12-31', '', 0
66151912, 'EP', '      9600Ã“7LI1', 'D2', '9999-12-31', 'PI', '', '', 0
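
The widening effect can be reproduced in a few lines. This is a sketch under assumptions: it supposes that the original character was 'Ó' stored as UTF-8 and then read with a single-byte code page such as Windows-1252, and the column width of 15 is hypothetical, chosen only to show the overflow.

```python
# Assumed original number: '9600Ó7LI1' (9 characters).
original = "9600\u00d37LI1"

# UTF-8 stores 'Ó' as two bytes; reading them as cp1252 yields two characters.
garbled = original.encode("utf-8").decode("cp1252")

COLUMN_WIDTH = 15                          # hypothetical width of the number column
padded_original = original.rjust(COLUMN_WIDTH)
padded_garbled = " " * 6 + garbled         # six leading spaces, as in the dump above

print(repr(garbled), len(original), len(garbled))
print(len(padded_original) <= COLUMN_WIDTH, len(padded_garbled) <= COLUMN_WIDTH)
```

The padded value grows from 15 to 16 characters, which is the kind of overflow that triggers a 'Data too long for column' error.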

Then we have 2 titles longer than 3000 chars, again due to non-ASCII chars translated into 2 chars, which increases the size of the title field.

TLS202 - PART01
Data too long for column 'APPLN_TITLE' at row 3126639
Data too long for column 'APPLN_TITLE' at row 3136166
(longer than 3000)
003544896,"mÃ®todo e composiÃºÃúo farmacÃªutica para a prevenÃºÃúo ou o tratamento de uma doenÃºa ou uma condiÃºÃúo autoimune ou infecciosa, mÃ®todo e composiÃºÃúo farmacÃªutica para a prevenÃºÃúo ou o tratamento de uma doenÃºa ou uma condiÃºÃúo do sangue, mÃ®todo e composiÃºÃúo farmacÃªutica para modular a formaÃºÃúo de cÃ®lulas do sangue, mÃ®todo e composiÃºÃúo farmacÃªutica para intensificar a mobilizaÃºÃúo perifÃ®rico da cÃ®lula tronco, mÃ®todo e composiÃºÃúo farmacÃªutica para a prevenÃºÃúo ou o tratamento de uma doenÃºa ou uma condiÃºÃúo metabÃ¦lica, mÃ®todo e composiÃºÃúo farmacÃªutica para a prevenÃºÃúo ou o tratamento das condiÃºÃÁes associadas com doses mieloablativas de quimioradioterapia suportadas pelo transplante autÃ¦logo de medula Ã¦ssea ou de cÃ®lulas tronco do sangue perifÃ®rico (asct) ou pelo transplante alogenÃ®ico de medula Ã¦ssea (bmt), mÃ®todo e composiÃºÃúo farmacÃªutica para aumentar o efeito de um fator estimulante de cÃ®lulas do sangue, mÃ®todo e composiÃºÃúo farmacÃªutica para intensificar a colonizaÃºÃúo de cÃ®lulas tronco do sangue doadas em um receptor mieloablatado, mÃ®todo e composiÃºÃúo farmacÃªutica para a prevenÃºÃúo ou o tratamento de uma doenÃºa ou uma condiÃºÃúo bacteriana, composiÃºÃúo farmacÃªutica para o tratamento ou a prevenÃºÃúo de uma indicaÃºÃúo selecionada do grupo que consiste em doenÃºa ou condiÃºÃúo autoimune, doenÃºa viral, infecÃºÃúo viral, doenÃºa hematolÃ¦gica, deficiÃªncias hematolÃ¦gicas, trombocitopenia, pancitopenia, granulopenia, hiperlipidemia, hipercolesterolemia, glucosuria, hiperglicemia, diabetes, aids, hiv-1, distÃºrbios de cÃ®lulas t auxiliares, deficiÃªncias de cÃ®lulas dendrÃ¼ticas, deficiÃªncias de macrofagos, distÃºrbios de cÃ®lulas tronco hematopoiÃ®ticas incluindo distÃºrbios com plaquetas, linfÃ¦citos, cÃ®lulas do plasma e neutrÃ¦filos, condiÃºÃÁes prÃ®-leucÃªmicas, condiÃºÃÁes leucÃªmicas, distÃºrbios do sistema imunolÃ¦gico resultantes da terapia de quimioterapia ou de radiaÃºÃúo, 
distÃºrbios do sistema imunolÃ¦gico humano resultantes do tratamento das doenÃºas de deficiÃªncia imunolÃ¦gica e infecÃºÃÁes bacterianas, composiÃºÃúo farmacÃªutica para o tratamento ou a prevenÃºÃúo de um indicaÃºÃúo selecionada do grupo que consiste em doenÃºa hematolÃ¦gica, deficiÃªncias hematolÃ¦gicas, trombocitopenia, pancitopenia, granulopenia, deficiÃªncias de cÃ®lulas dendrÃ¼ticas, deficiÃªncias de macrofagos, distÃºrbios de cÃ®lulas tronco hematopoiÃ®ticas incluindo distÃºrbios com plaquetas, linfÃ¦citos, cÃ®lulas do plasma e neutrÃ¦filos, condiÃºÃÁes prÃ®-leucÃªmicas, condiÃºÃÁes leucÃªmicas, sÃ¼ndrome mielodisplÃístiacas, malignidades nÃúo mielÃ¦ides, anemia plÃística e insuficiÃªncia da medula Ã¦ssea, peptÃ¼deo purificado, peptÃ¼deo quimÃ®rico purificado, peptÃ¼deo quimÃ®rico, composiÃºÃúo farmacÃªutica, composiÃºÃúo farmacÃªutica para a prevenÃºÃúo ou o tratamento de uma condiÃºÃúo associada com um agente infeccioso de sars, mÃ®todo de processamento a baixa temperatura de hidrolisato proteolÃ¼tico de caseÃ¼na e hidrolisato de proteÃ¼na de caseÃ¼na"

Same thing for 1 abstract:

tls203 - PART07
Data too long for column 'APPLN_ABSTRACT' at row 702251

In table 206 ascii - part01 we have one record (#56) that gives some problems, since a text field ends with a / and this shifts some fields, giving the error 'Row 56 doesn't contain data for all columns'.

Also in TLS206-part05 the last 63 records don't contain data for all columns, but this time for real...
(Row 1428062 till 1428127)
For instance:
037428066010830866054240706US       43659874A 3USPI0003        000000000000000000000

Enjoy your database...

Wednesday, November 10, 2010

Double publication numbers by application id in USPTO

As previously posted, several application authorities may have more than one publication number referring to the same application id.

In the case of USPTO, these are the main pairs of publication kinds for publication numbers referring to the same application id. They may be useful for creating a disambiguation table.

kind1	kind2	count
'A'	'B1'	3895
'A'	'B2'	124
'A'	'C1'	356
'A'	'E1'	12
'A1'	'A2'	521
'A1'	'A9'	1662
'A1'	'B1'	18410
'A1'	'B2'	845793
'A1'	'H1'	43
'A2'	'B2'	169
'A9'	'B2'	768
'B1'	'B2'	142
'E'	'F1'	59
'P1'	'P2'	424
'P1'	'P3'	1686

By the way, counting the difference in months between filing date and publication date for the first of the two publication kinds, we get the following result:

months	# of applications
4	50501
5	53963
6	130103
7	111533
8	67594
9	48504
10	33086
11	26268
12	20754
13	15320
14	13855
15	11927
16	9346
17	7075
18	186961
19	32484
20	6354

Above 95% of applications listed in the first table are published within 18 months (cases in month 19 may be a rounding error), which supports our hypothesis that the first of the 2 publications is due to the 18-month publication law.
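
The share can be recomputed directly from the month table. The sketch below uses only the rows listed there (months 4 to 20), so the percentages are relative to those rows; the variable names are illustrative.

```python
# Months between filing and first publication -> number of applications,
# copied from the table above.
months_to_count = {
    4: 50501, 5: 53963, 6: 130103, 7: 111533, 8: 67594, 9: 48504,
    10: 33086, 11: 26268, 12: 20754, 13: 15320, 14: 13855, 15: 11927,
    16: 9346, 17: 7075, 18: 186961, 19: 32484, 20: 6354,
}

total = sum(months_to_count.values())
within_18 = sum(n for months, n in months_to_count.items() if months <= 18)
within_19 = sum(n for months, n in months_to_count.items() if months <= 19)

print(f"within 18 months: {within_18 / total:.1%}")
print(f"within 19 months: {within_19 / total:.1%}")
```

Of the 825628 applications listed, 786790 fall within 18 months (about 95.3%), and counting month 19 as well brings the share above 99%.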

See also:
18-month publication provision:
http://www.uspto.gov/web/offices/pac/mpep/documents/1100_1120.htm

About non-publication request:
http://www.uspto.gov/web/offices/pac/mpep/documents/appxr_1_213.htm#cfr37s1.213

Thanks to Francesco Lissoni for help in solving this issue.

Monday, November 8, 2010

Linkage among application and publication tables in PATSTAT

We could expect (from EPO documents) that every application has 0 to N publications, where every publication belongs to exactly 1 application.
In reality, it isn't so.

Let's look closely at the table containing applications (TLS201) and the one with publications (TLS211).

The link between them can be established via the APPLN_ID field, which is the main key in patstat.

Referring to the Sept. 2009 edition, we start by checking the application table TLS201.
The first check, whether the same application_id could refer to distinct application authority / application number pairs, comes out clean (it would have been real trouble otherwise!).

If we check, on the same table, application authority / application number pairs having more than one application id, we find about 9% duplication.
Nothing strange: we already stated that the disambiguating data for applications are application authority, application number and APPLICATION KIND, so this percentage refers to offices where the same application number may be used (for instance) for a patent of invention and for a utility model, that is, for different applications filed.
So this case also looks fine with our data model.

Let's see instead what happens with publications table (TLS211).
Using the same criteria as for TLS201, we start by checking whether the same application_id could refer to distinct publication authority / publication number pairs.

We get about 10% of double counting, where the top 5 by publication office is:

'JP'	4202644
'US'	867657
'CN'	513398
'GB'	333662
'IT'	264060
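
A check of this kind can be written as a GROUP BY ... HAVING query. The sketch below runs it with Python's sqlite3 module on a handful of made-up rows, so it is runnable without a patstat installation; the column names (appln_id, publn_auth, publn_nr, publn_kind) are assumed to follow the patstat naming, and this is not necessarily the exact query behind the figures above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tls211_pat_publn (
        pat_publn_id INTEGER PRIMARY KEY,
        appln_id INTEGER, publn_auth TEXT, publn_nr TEXT, publn_kind TEXT)
""")
# Made-up rows: applications 1 and 3 have two publication numbers each.
conn.executemany(
    "INSERT INTO tls211_pat_publn (appln_id, publn_auth, publn_nr, publn_kind) "
    "VALUES (?, ?, ?, ?)",
    [(1, "US", "20010000001", "A1"), (1, "US", "6500001", "B2"),
     (2, "EP", "1000001", "A1"),
     (3, "JP", "S5300001", "A"), (3, "JP", "S5300002", "B")],
)

# Applications with more than one distinct publication number, by office.
query = """
    SELECT publn_auth, COUNT(*) AS n_applications
    FROM (SELECT appln_id, publn_auth
          FROM tls211_pat_publn
          GROUP BY appln_id, publn_auth
          HAVING COUNT(DISTINCT publn_nr) > 1) AS dup
    GROUP BY publn_auth
    ORDER BY n_applications DESC, publn_auth
"""
result = conn.execute(query).fetchall()
print(result)
```

On the real table the same query is heavy, so an index on appln_id helps a lot.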

The bad news here is that this duplication is due to issues tightly related to the application authority!

In the case of Japan, the duplication is due to different publication numbers originating from the same application.
See this example for publication number JP53123578.

For China the same happens as for Japan (see example: CN 1274133).

As for USPTO, a change of law took place in Y2K (eighteen-month publication provisions of the American Inventors Protection Act of 1999), duplicating the publications related to the same application: a first publication number issued 18 months after filing, related to the application only, then later a final publication number issued when the application is granted. An example here.

In the Italian patent office (I'm so proud that we finally made a top 5!!!) we meet a lot of D0 publication kinds, which refer to a lot of non-existing publications, or better, publications we cannot find, for instance, in espacenet.
For instance, publication number IT1243259, kind B, shares its application id with IT9001711, kind D0, which in patstat documents is listed as 'Filing application'.

The same issue with the D0 kind is found in the GB patent office, together with other duplicated kinds (the British always overdo it... see GB2358979).

So we cannot really find a general rule, and each country should be investigated separately; probably we may drop the D0 kind everywhere, and we should know that by counting by publication number we run the risk of overestimating the number of patents.
Good news: EPO data are not affected by this issue.

Finally, if we check, on the same table, publication authority / publication number pairs having more than one application id, we find about 11% duplication.

Here we find patents in different states and, in some cases, also a link to an unpublished priority application (see appln_id > 58.000.000 and date = 9999-12-31).

An example from EPO is patent EP122624, having 4 application ids: one for each of following states: A2, A3, B1, and also one for D1 that is an unpublished priority with date 31/12/9999.
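
The reverse check (one publication number, several application ids) can be sketched in plain Python. The rows below mirror the EP122624 case just described, but the application id values are invented, as is the second publication used for contrast.

```python
from collections import defaultdict

# (publn_auth, publn_nr, publn_kind, appln_id); the appln_id values are made up.
publications = [
    ("EP", "122624", "A2", 101),
    ("EP", "122624", "A3", 102),
    ("EP", "122624", "B1", 103),
    ("EP", "122624", "D1", 104),   # unpublished priority, date 9999-12-31
    ("EP", "999999", "A1", 105),   # invented publication with a single id
]

# Collect the distinct application ids behind each authority/number pair.
ids_by_number = defaultdict(set)
for auth, number, kind, appln_id in publications:
    ids_by_number[(auth, number)].add(appln_id)

duplicated = {key: sorted(ids) for key, ids in ids_by_number.items() if len(ids) > 1}
print(duplicated)
```

Only the EP 122624 pair is reported, with its four application ids, one per publication kind.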

Another example from JP patent office (we met this patent before) JP53123578.

Monday, November 1, 2010

PATSTAT September 2010: what's new

This is just a summary of the EPO letter enclosed with the DVD set of September 2010...

- First grant indication: the data for element 049 PUBLN_FIRST_GRANT has been replaced by source data from DOCDB XML in order to guarantee that the data in PATSTAT is in line with other EPO patent information products.

- PRS data: in cooperation with the INPADOC legal status data team, an extra table containing the PRS data is now available. The table TLS221_INPADOC_PRS is NOT part of the standard PATSTAT distribution. You will need to purchase product 14.11: INPADOC Worldwide legal status (PRS).

- Citation origin: in table TLS212_CITATION extra "origins" for the citations have been added. The list now also contains:
ISR ==> 5 - citations from the International Search Report
SUP ==> 6 - citations from the Supplementary Search Report
CH2 ==> 7 - citations introduced during the Chapter 2 phase of the PCT.
Observant PATSTAT users have seen that we also use the numerical presentation of the citation origin.

- Green technology: TLS217_APPLN_ECLA now also contains the Y02 tags covering "TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE". These tags describe technologies that hold the potential for reducing waste and emissions, including greenhouse gas emissions.
More information on EPO's role on this issue @link