Thursday, September 23, 2010

Adding legal status to patents (inpadoc to mysql - part II)

3) CHECK AND CLEANUP

When the import procedure in Navicat has finished, we should first check the log to see whether any records were left out because of import errors.
Regardless of what the documentation says, some records contain non-ASCII characters in the L510EP field (about a dozen in the 2010 files).
You can use the log to recover the skipped records: correct the offending characters and insert the records via SQL.
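
To spot the offending characters quickly in a record copied from the log, something like the following Python sketch may help. It is only an illustration: the function names are made up for this post, and the sample string is invented, not taken from the real files.

```python
def find_non_ascii(text):
    """Return (position, character) pairs for every non-ASCII character."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 127]


def to_ascii(text, replacement="?"):
    """Replace every non-ASCII character with a placeholder."""
    return "".join(ch if ord(ch) < 128 else replacement for ch in text)


# Invented sample value standing in for an L510EP field from the log.
sample = "OWNER: M\u00fcller GMBH"
print(find_non_ascii(sample))
print(to_ascii(sample))
```

Whether to replace the characters or re-encode them properly depends on the character set of your table, so treat the placeholder approach as a quick fix only.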

We should also cross-check the figures in the T12BFYYMM___STAT file (for T12BF) against the content of our table, using the following SQL:

select L001EP, count(l002ep) as c from test.patlegal
group by L001EP


If the figures match, we can assume the import was OK.
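
With many countries, comparing the figures by eye gets tedious. A minimal Python sketch of the comparison follows. It assumes you have already copied the per-country figures from the STAT file and the result of the query above into two dictionaries; the layout of the STAT file is not described here, so no parsing is attempted, and the function name is hypothetical.

```python
def compare_counts(stat_counts, table_counts):
    """Return {country: (expected, actual)} for every country whose counts differ.

    stat_counts  -- figures taken from the STAT file (assumed per country)
    table_counts -- figures returned by the GROUP BY query on our table
    """
    mismatches = {}
    for country in sorted(set(stat_counts) | set(table_counts)):
        expected = stat_counts.get(country, 0)
        actual = table_counts.get(country, 0)
        if expected != actual:
            mismatches[country] = (expected, actual)
    return mismatches


# Toy figures, for illustration only.
print(compare_counts({"DE": 3, "SE": 1}, {"DE": 2}))
```

An empty result means the two sources agree; anything else tells you which countries to look at in the import log.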


The import procedure so far has also imported the header and trailer records, so we need to run a cleanup SQL statement:

delete from test.patlegal
where L001EP is null and L002EP is null and L003EP is null



To make better use of the data, these two handbooks are very useful:

http://documents.epo.org/projects/babylon/eponet.nsf/0/85D8230D18F52DCDC12572440037552F/$File/T12EXT22_07-10_en.pdf

http://documents.epo.org/projects/babylon/eponet.nsf/0/21B2BAA84E866C40C12574CF00484D32/$File/ExtendedLegalEvents_Manual_update_June2010_en.pdf

A vocabulary of PRS codes can also be downloaded here:
http://documents.epo.org/projects/babylon/rawdata.nsf/0/BB6076C83F0196D3C125779D003D0970/$File/le-codes-en1037.txt

4) RESTRUCTURING DATA
In order to build a more consistent dataset, we need to look more closely at the field contents.
We will deal here with T12BF data. XLEV will be the subject of a later post.

4a) drop empty fields
Fields L006EP, L009EP, L010EP, L011EP, L015EP, L016EP, L019EP, L020EP and L514EP are empty and may be dropped.
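
Rather than typing the statement by hand, the DROP can be generated. The Python sketch below only builds the SQL text; it assumes the table is test.patlegal (as in the queries above) and that the columns are named after the tags with the EP suffix. The function name is made up for illustration. Check that the columns really are empty in your own import before running the result.

```python
def build_drop_statement(table, columns):
    """Build one MySQL ALTER TABLE statement that drops all given columns."""
    if not columns:
        raise ValueError("no columns to drop")
    clauses = ", ".join("DROP COLUMN {}".format(col) for col in columns)
    return "ALTER TABLE {} {};".format(table, clauses)


# Column names assumed to follow the tag names used in this post.
EMPTY_FIELDS = ["L006EP", "L009EP", "L010EP", "L011EP", "L015EP",
                "L016EP", "L019EP", "L020EP", "L514EP"]

print(build_drop_statement("test.patlegal", EMPTY_FIELDS))
```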


Georg Huber from the EPO very kindly explained the reason to me:
"During the implementation of the new PRS system (finished 2003) it was not evident that some of the data in the old system have been superfluous  (L006EP, L009EP).
For deletion of the data special information have not been implemented (=reasons for deletion of data) (L010EP, L011EP, L014EP).
The application numbers and the publication numbers had in the old PS system another number format (INPADOC number format) as in the new system (DocDB number format). These old number format was given to our customers until 2007 in the tags (L015EP, L016EP)
Tags L019EP, L020EP are foreseen as future possibilities that are not implemented yet.
We have not included any new legal events, the publication language is important or supplied by the national offices. therefore L015EP is always empty."


Also be aware that in a few hundred cases the COUNTRY/PRSCODE1 pair in the L001EP/L008EP fields will not match the vocabulary linked above.

These are some examples from the 2010 table:


L001EP L008EP count
DE C1 370
SU MM4A 73
SE WWW 19
SU PD4A 17
SU QB4A 11
SU QZ4A 10
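
A list like the one above can be produced by counting the country/code pairs that are missing from the vocabulary. Here is a minimal Python sketch, assuming both the records and the vocabulary have already been loaded as (country, code) tuples; the layout of the vocabulary file is not covered here, the function name is hypothetical, and the data below is a toy example.

```python
from collections import Counter


def unmatched_pairs(records, vocabulary):
    """Count the (country, code) pairs that do not appear in the vocabulary."""
    return Counter(pair for pair in records if pair not in vocabulary)


# Toy data: (L001EP, L008EP) pairs and a one-entry vocabulary.
records = [("DE", "C1"), ("DE", "C1"), ("EP", "REG")]
vocabulary = {("EP", "REG")}

for (country, code), count in unmatched_pairs(records, vocabulary).most_common():
    print(country, code, count)
```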



Again quoting Georg Huber, these are the reasons:


"For WO, I can see three reasons that these errors occurred.
1. There is not yet a publication number in the main database for these data and therefore the filing number was given instead of WO publication number (these is the reasons we send for WO publication numbers as in the main database the application numbers have the country code of the patent authority the PCT application has been filed).

2. A former existing publication number disappeared from the bibliographic main database

For SU and CS are have been similar problems, as these are applications of RU and CZ patents.
I assume that in the main bibliographic database the application number is still the former country. "


So, eventually, this may be the semi-final version of our data structure:


Corresp. TAG type Description
L001EP $2 Country code
L002EP $1 Format of document number following rules for either (F)iling applications or (P)ublications
L003EP $20 Document number
L004EP $2 Kind code for document number (if provided)
L005EP $2 IPR type (PI Patent of Invention / UM Utility Model)
L007EP DATE8 PRS date; DATE_GAZETTE; date of notification to the public
L008EP $4 Legal event code 1, 4 bytes (lookup on table PRSCODE1)
L014EP DATE8 Publication or filing date (if provided) of DOCDB document in tags L001EP, L003EP, L004EP
L017EP $171 DOCDB publication ID; relates to the first publication level found in DOCDB
L018EP $8 DATE this event was last exchanged to subscribers
L501EP $2 Corresponding country code for PRS code "EP REG"
L502EP $4 Corresponding EP code 1 for PRS code "EP REG"
L503EP 20 Corresponding patent document 
L504EP $2 Country code of corresponding patent document
L505EP DATE8 Publication date of corresponding patent
L506EP $2 Kind of corresponding patent document
L507EP $300 List of designated states 
L508EP $2 Extension state
L509EP $255 New owner name or address if name or address of owner changes; addresses are NOT stored in this tag
L510EP $700 Free format text
L511EP $20 SPC number 
L512EP  DATE8 Filing date
L513EP DATE8 Expiry date 
L515EP $255 Inventor name (separated by ;)
L516EP $50 International Patent Classification (comma separated)
L517EP $255 Representative's name(s)
L518EP DATE8 Payment date 
L519EP $50 Opponent name(s)
L520EP 2 Year of fee payment - contains the xxth year for which the payment was made 
L521EP $30 New kind of IPR, new number; e.g. Brazil utility model - code GA;"MI4601602-3"
L522EP $50 Name of requester 
L523EP DATE8 Extension date 
L524EP $100 List of countries concerned with an event; L507EP & L508EP have special significance
L525EP DATE8 Effective date; DATE_IN_FORCE 
L526EP DATE8 Date of withdrawal 
L527EP $1 Indicator for format of attribute list document number following rules for either (F)iling applications or (P)ublications. If not known, this tag will not be present; refers to the document given in L503EP and L504EP


Some issues are left out of this post and will be addressed soon, such as:

- identify the macro types of PRSCODE1 and link them to the correct fields;
- transpose fields containing multiple occurrences of the same information (e.g. L507EP with designated states);
- link to PATSTAT via document number.
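
As a preview of the transposition step, here is a rough Python sketch that turns one L507EP value into one row per designated state. It assumes the states are two-letter codes separated by spaces, commas or semicolons; the real separator should be checked against the data, and the function names and sample values are made up for illustration.

```python
import re


def split_states(value):
    """Split a designated-states string into a list of country codes."""
    if not value:
        return []
    return [code for code in re.split(r"[\s,;]+", value.strip()) if code]


def transpose_states(rows):
    """Turn (document, states) rows into one (document, state) row per state."""
    return [(doc, state) for doc, states in rows for state in split_states(states)]


# Invented sample rows: (document number, L507EP value).
for row in transpose_states([("DOC1", "AT BE CH"), ("DOC2", "")]):
    print(row)
```

The resulting pairs could then be loaded into a separate table keyed on the document number.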
