Thursday, September 23, 2010

Adding legal status to patents (inpadoc to mysql - part II)

3) CHECK AND CLEANUP

When the import procedure in Navicat has finished, we should first check the log to see whether some records were left out because of import errors.
Regardless of what the documentation says, some records will contain non-ASCII characters in the L510EP field (about a dozen in the 2010 files).
You can take the skipped records from the log, correct the offending characters, and insert them via SQL.
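As a rough aid for this step, the following Python sketch locates the non-ASCII characters in a text value, so a skipped record can be corrected before re-inserting it. The function names and the "?" replacement policy are illustrative assumptions, not part of the EPO data or of the import tool.

```python
def non_ascii_chars(text):
    """Return (position, character) pairs for every non-ASCII character."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 127]


def to_ascii(text, replacement="?"):
    """Replace each non-ASCII character with a placeholder (assumed policy)."""
    return "".join(ch if ord(ch) < 128 else replacement for ch in text)


# Example with a made-up L510EP value containing one accented character.
sample = "M\u00fcller AG"
print(non_ascii_chars(sample))
print(to_ascii(sample))
```

In practice you would probably choose the replacement character by hand for each record, since there are only a few of them.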

We should also cross-check the figures in the T12BFYYMM___STAT file against the content of our table (for T12BF) by using the following SQL:

select L001EP, count(l002ep) as c from test.patlegal
group by L001EP


If the figures match, we can assume the import was OK.


Up to this point the import procedure has also imported the header and trailer records, so we need to run a cleanup SQL statement:

delete from test.patlegal
where L001EP is null and L002EP is null and L003EP is null
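
The check and the cleanup can be tried out without touching the real table. The sketch below runs both statements against an in-memory SQLite database used as a stand-in for MySQL; the sample rows are invented, and the table is called plain patlegal because SQLite has no test schema here.

```python
import sqlite3


def check_and_cleanup(conn):
    """Count records per country, then delete rows with no document key."""
    counts = dict(conn.execute(
        "SELECT L001EP, COUNT(L002EP) AS c FROM patlegal GROUP BY L001EP"
    ).fetchall())
    deleted = conn.execute(
        "DELETE FROM patlegal "
        "WHERE L001EP IS NULL AND L002EP IS NULL AND L003EP IS NULL"
    ).rowcount
    conn.commit()
    return counts, deleted


conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
conn.execute("CREATE TABLE patlegal (L001EP TEXT, L002EP TEXT, L003EP TEXT)")
conn.executemany("INSERT INTO patlegal VALUES (?, ?, ?)", [
    ("AT", "F", "168"),
    ("AT", "P", "321347"),
    ("DE", "P", "1234567"),
    (None, None, None),  # imported header record
    (None, None, None),  # imported trailer record
])
counts, deleted = check_and_cleanup(conn)
remaining = conn.execute("SELECT COUNT(*) FROM patlegal").fetchone()[0]
print(counts, deleted, remaining)
```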



To make better use of the data, these two handbooks are very useful:

http://documents.epo.org/projects/babylon/eponet.nsf/0/85D8230D18F52DCDC12572440037552F/$File/T12EXT22_07-10_en.pdf

http://documents.epo.org/projects/babylon/eponet.nsf/0/21B2BAA84E866C40C12574CF00484D32/$File/ExtendedLegalEvents_Manual_update_June2010_en.pdf

A vocabulary of PRS codes can also be downloaded here:
http://documents.epo.org/projects/babylon/rawdata.nsf/0/BB6076C83F0196D3C125779D003D0970/$File/le-codes-en1037.txt

4) RESTRUCTURING DATA
In order to build a more consistent dataset, we need to look more closely at the field contents.
We will deal here with T12BF data. XLEV will be the subject of a later post.

4a) drop empty fields
Fields L006EP, L009EP, L010EP, L011EP, L015EP, L016EP, L019EP, L020EP and L514EP are empty and may be dropped.
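
A minimal sketch for generating the corresponding MySQL statement is shown below. It assumes the columns carry the full tag names with the EP suffix (for example L015EP) and that the table is test.patlegal; check the generated statement before running it.

```python
# Empty tags listed above, written with the EP suffix (an assumption).
EMPTY_FIELDS = [
    "L006EP", "L009EP", "L010EP", "L011EP", "L015EP",
    "L016EP", "L019EP", "L020EP", "L514EP",
]


def drop_columns_sql(table, columns):
    """Build a single ALTER TABLE statement dropping all given columns."""
    clauses = ", ".join(f"DROP COLUMN {col}" for col in columns)
    return f"ALTER TABLE {table} {clauses};"


print(drop_columns_sql("test.patlegal", EMPTY_FIELDS))
```

Dropping all columns in one statement means the table is rebuilt once rather than once per column, which matters with over 73M records.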


Georg Huber from the EPO very kindly explained the reason to me:
"During the implementation of the new PRS system (finished 2003) it was not evident that some of the data in the old system have been superfluous  (L006EP, L009EP).
For deletion of the data special information have not been implemented (=reasons for deletion of data) (L010EP, L011EP, L014EP).
The application numbers and the publication numbers had in the old PS system another number format (INPADOC number format) as in the new system (DocDB number format). These old number format was given to our customers until 2007 in the tags (L015EP, L016EP)
Tags L019EP, L020EP are foreseen as future possibilities that are not implemented yet.
We have not included any new legal events, the publication language is important or supplied by the national offices. therefore L015EP is always empty."


Be aware also that in a few hundred cases the COUNTRY/PRSCODE1 pair in the L001EP/L008EP fields will not match the vocabulary linked above.

These are some examples from the 2010 table:


L001EP L008EP count
DE C1 370
SU MM4A 73
SE WWW 19
SU PD4A 17
SU QB4A 11
SU QZ4A 10
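
A list like the one above can be produced with a small script. The sketch below counts the country/code pairs that are missing from the vocabulary; it assumes the vocabulary has already been loaded into a set of (country, code) tuples, since the layout of the downloadable code file is not described here, and the sample records are invented.

```python
from collections import Counter


def unmatched_pairs(records, vocabulary):
    """Count (country, code) pairs that do not appear in the vocabulary.

    records: iterable of (L001EP, L008EP) values, possibly space padded.
    vocabulary: set of (country, code) tuples (assumed to be pre-loaded).
    """
    counts = Counter(
        (country.strip(), code.strip()) for country, code in records
    )
    return {pair: n for pair, n in counts.items() if pair not in vocabulary}


# Toy data: "ELJ " is padded to 4 bytes, as in the raw records.
records = [("DE", "C1"), ("DE", "C1"), ("AT", "ELJ "), ("SE", "WWW")]
vocabulary = {("AT", "ELJ")}
print(unmatched_pairs(records, vocabulary))
```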



Again quoting Georg Huber, these are the reasons:


"For WO, I can see three reasons that these errors occurred.
1. There is not yet a publication number in the main database for these data and therefore the filing number was given instead of WO publication number (these is the reasons we send for WO publication numbers as in the main database the application numbers have the country code of the patent authority the PCT application has been filed).

2. A former existing publication number disappeared from the bibliographic main database

For SU and CS are have been similar problems, as these are applications of RU and CZ patents.
I assume that in the main bibliographic database the application number is still the former country. "


So, in the end, this may be the semi-final version of our data structure:


Corresp. TAG type Description
L001EP $2 Country code
L002EP $1 Format of document number following rules for either (F)iling applications or (P)ublications
L003EP $20 Document number
L004EP $2 Kind code for document number (if provided)
L005EP $2 IPR type (PI Patent of Invention / UM Utility Model)
L007EP DATE8 PRS date; DATE_GAZETTE; date of notification to the public
L008EP $4 4 bytes Legal Event code 1(lookup on table PRSCODE1)
L014EP DATE8 Publication or filing date (if provided) of DOCDB document in tags L001EP, L003EP, L004EP
L017EP $171 DOCDB publication ID; relates to the first publication level found in DOCDB
L018EP $8 DATE this event was last exchanged to subscribers
L501EP $2 Corresponding country code for PRS code "EP REG"
L502EP $4 Corresponding EP code 1 for PRS code "EP REG"
L503EP $20 Corresponding patent document
L504EP $2 Country code of corresponding patent document
L505EP DATE8 Publication date of corresponding patent
L506EP $2 Kind of corresponding patent document
L507EP $300 List of designated states 
L508EP $2 Extension state
L509EP $255 New owner name or address if name or address of owner changes; addresses are NOT stored in this tag
L510EP $700 Free format text
L511EP $20 SPC number 
L512EP  DATE8 Filing date
L513EP DATE8 Expiry date 
L515EP $255 Inventor name (separated by ;)
L516EP $50 International Patent Classification (comma separated)
L517EP $255 Representative's name(s)
L518EP DATE8 Payment date 
L519EP $50 Opponent name(s)
L520EP 2 Year of fee payment - contains the xxth year for which the payment was made 
L521EP $30 New kind of IPR, new number; e.g. Brazil utility model - code GA;"MI4601602-3"
L522EP $50 Name of requester 
L523EP DATE8 Extension date 
L524EP $100 List of countries concerned with an event; L507EP & L508EP have special significance.
L525EP DATE8 Effective date; DATE_IN_FORCE 
L526EP DATE8 Date of withdrawal 
L527EP $1 Indicator for format of attribute list document number following rules for either (F)iling applications or (P)ublications. If not known, this tag will not be present; refers to the document given in L503EP and L504EP
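
To turn a field list like this into an empty table, a small generator can help. The sketch below is only one possible mapping: it assumes that $n means a character field of length n (mapped to VARCHAR(n)) and that DATE8 is an 8-character YYYYMMDD string (mapped to CHAR(8)). The table name test.patlegal2 is hypothetical.

```python
def mysql_type(spec):
    """Map a type from the list above to a MySQL type (assumed mapping)."""
    if spec == "DATE8":
        return "CHAR(8)"  # dates kept as YYYYMMDD text
    if spec.startswith("$") and spec[1:].isdigit():
        return f"VARCHAR({spec[1:]})"
    raise ValueError(f"unknown type spec: {spec!r}")


def create_table_sql(table, fields):
    """Build a CREATE TABLE statement from (tag, type spec) pairs."""
    cols = ",\n".join(f"  {tag} {mysql_type(spec)}" for tag, spec in fields)
    return f"CREATE TABLE {table} (\n{cols}\n);"


# First few tags only; extend the list with the remaining rows as needed.
fields = [("L001EP", "$2"), ("L002EP", "$1"), ("L003EP", "$20"),
          ("L007EP", "DATE8")]
print(create_table_sql("test.patlegal2", fields))
```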


Some issues are left out of this post and will be tackled soon, such as:

- identify the macro types of PRSCODE1 and link them to the correct fields;
- transpose fields containing several occurrences of the same information (e.g. L507EP with designated states);
- link to patstat via document number.

Wednesday, September 15, 2010

Adding legal status to patents (inpadoc to mysql - part I)

For building a database of legal information about patents, the EPO provides a set of raw data called INPADOC database - legal status data (product 14.11). It contains legal status data, also known as PRS data (Patent Register Service), and includes records from over 40 international patent authorities.


The legal status of a patent or patent application refers to the entries and procedural steps occurring during the patent grant procedure and the subsequent life of a patent. These are normally published in the patent gazette of the patent-granting country or organisation concerned [from EPO website].


Legal status data are available in two formats: a back file (from 1978 up to the current year) and weekly updates (for the current year). The following instructions refer to the back file, but they also apply to the weekly updates.

INPADOC legal data come in two batches of files: legal data (files named like t12bfYYWW, where YY and WW are the year and week of issue) and extended legal data (xlevYYWW), which translate the weekly bibliographic file (DOCDB) into legal events.

Data come in an XML format, split into several files because of their size.
Each file (example for T12BF) has a header:

[iprevent cntevents=000656001 cntiprevents=000656003 date=20100129 record=START week=201004][/iprevent]

Records are structured like this:

[iprevent cy=AT date=20100129 record=DATA status=C][l001ep]AT[/l001ep][l002ep]F[/l002ep][l003ep]168[/l003ep][l004ep]A [/l004ep][l005ep]PI[/l005ep][l007ep]19910315[/l007ep][l008ep]ELJ [/l008ep][l013ep]C[/l013ep][l017ep] AT 321347B[/l017ep][l018ep]20030101[/l018ep][/iprevent]

[NOTE: I replaced < and > with [ and ], otherwise the HTML interpreter would not display it on the web page]
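
To get a feeling for the record layout, here is a minimal Python sketch that pulls the attributes and the tag values out of one record. It works on the bracket notation exactly as printed in this post; for the real files the pattern would have to use < and > instead, and whether the attributes are quoted there is not shown here, so treat the attribute handling as an assumption.

```python
import re

# Matches [l001ep]value[/l001ep] style pairs in the bracket notation.
TAG_RE = re.compile(r"\[(l\d{3}ep)\](.*?)\[/\1\]")
OPEN_RE = re.compile(r"\[iprevent\s+([^\]]*)\]")


def parse_record(line):
    """Return (attributes, fields) for one iprevent record."""
    match = OPEN_RE.search(line)
    items = match.group(1).split() if match else []
    attrs = dict(item.split("=", 1) for item in items if "=" in item)
    fields = {tag.upper(): value.strip() for tag, value in TAG_RE.findall(line)}
    return attrs, fields


record = (
    "[iprevent cy=AT date=20100129 record=DATA status=C]"
    "[l001ep]AT[/l001ep][l002ep]F[/l002ep][l003ep]168[/l003ep]"
    "[l004ep]A [/l004ep][l005ep]PI[/l005ep][l007ep]19910315[/l007ep]"
    "[l008ep]ELJ [/l008ep][l013ep]C[/l013ep][l017ep] AT 321347B[/l017ep]"
    "[l018ep]20030101[/l018ep][/iprevent]"
)
attrs, fields = parse_record(record)
print(attrs["record"], fields["L001EP"], fields["L008EP"])
```

Header records carry record=START and no tags at all, which is why they end up as rows with empty fields after the import.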

The 2010 back file DVD contained, in the T12BF group, 112 files and over 73M records, amounting to 24 GB of unzipped data, while the XLEV group had 155 files for 66.2 GB of data.
If you want to create both tables, make sure you have 250 GB of free disk space.

In order to create a mysql table containing the relevant data the following steps can be followed:

1) FILE MERGE
After moving the file named T12BFYYMM___STAT (which contains statistics about the number of records in each file) out of the directory, the remaining files should be merged into one big file in order to simplify the import process.
To make the import step easier, I also suggest creating a file with one record containing all the fields to be parsed, and naming it @header.

From the DOS command line, move to the directory containing the files and run the following command:

copy /b *.* patlegal.xml

This step may take 30 minutes.
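
If you prefer a script to the DOS command, the merge can be sketched in Python as below. The function name and the rule of skipping any file with STAT in its name are illustrative assumptions; the files are concatenated in name order, so a file called @header comes before the T12BF files.

```python
import shutil
from pathlib import Path


def merge_files(source_dir, target, skip_marker="STAT"):
    """Concatenate all files in source_dir into target, in name order.

    The target itself and any file whose name contains skip_marker
    (assumed to identify the statistics file) are left out.
    """
    source_dir = Path(source_dir)
    target = Path(target)
    parts = sorted(
        (p for p in source_dir.iterdir()
         if p.is_file()
         and p.resolve() != target.resolve()
         and skip_marker not in p.name),
        key=lambda p: p.name,
    )
    with target.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)  # binary copy, like copy /b
    return [p.name for p in parts]


# Usage (paths are examples only):
# merge_files(r"D:\t12bf", r"D:\t12bf\patlegal.xml")
```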

2) MYSQL IMPORT
The tool used for importing the file was Navicat Premium, which has very powerful import features for MySQL as well as for other databases.
By right-clicking on the chosen database and selecting IMPORT WIZARD, you can choose XML import; be sure to choose IPREVENT as the tag identifying table rows when requested at step 3; for the rest you can just click NEXT.

I suggest creating an empty table with the correct field sizes via SQL beforehand, so that Navicat only has to append the records to the right fields.

In our case I created the table TEST.PATLEGAL.

This step may take 3 hrs.

[to be continued]

Thursday, September 9, 2010

code for IPC reclassification in 35 classes

In some previous posts I made available some code about ISI-OST-INPI IPC reclassification in 30 classes.

Recently, in order to better balance the number of patents contained in the different classes, a reclassification into 35 classes has been proposed.

The official document by Ulrich Schmoch, published by WIPO, can be consulted at this link.

So, to provide an easier format, I made an Excel file containing the descriptions in one sheet and the IPCs in a second sheet; the file can be downloaded from this link.
Note that the reclassification sheet contains a column IPCNOT: if a patent contains any of the IPCs listed there, it should be discarded from the class, regardless of its other IPCs that would belong to it.

Unlike ISI-OST-INPI, this reclassification should be run on ALL IPCs (main and secondary).
As the author correctly points out, this will lead to about 20% of patents belonging to 2 or more classes; for such cases only the main IPC (if available) should be used for the reclassification.
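
The logic described above can be sketched in a few lines of Python. The rule table below uses made-up prefixes purely for illustration, not the real content of the Excel file; applying IPCNOT only to the main IPC in the fallback step, and keeping the multi-class result when the main IPC matches nothing, are my own assumptions.

```python
def classes_for(ipcs, rules):
    """Classes with at least one matching IPC and no IPCNOT match."""
    matched = []
    for class_id, (include, exclude) in rules.items():
        hit = any(ipc.startswith(p) for ipc in ipcs for p in include)
        vetoed = any(ipc.startswith(p) for ipc in ipcs for p in exclude)
        if hit and not vetoed:
            matched.append(class_id)
    return sorted(matched)


def reclassify(ipcs, rules, main_ipc=None):
    """Use all IPCs; if several classes match, fall back to the main IPC."""
    classes = classes_for(ipcs, rules)
    if len(classes) > 1 and main_ipc is not None:
        main_classes = classes_for([main_ipc], rules)
        if main_classes:
            return main_classes
    return classes


# Hypothetical rules: class id -> (IPC prefixes, IPCNOT prefixes).
RULES = {
    1: (("F21", "H01B"), ("H01L",)),
    2: (("H04N",), ()),
}
print(reclassify(["F21V 5/00", "H04N 7/18"], RULES, main_ipc="H04N 7/18"))
```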
Have fun!!!