Wednesday, September 15, 2010

Adding legal status to patents (inpadoc to mysql - part I)

In order to create a database containing data regarding legal issues of patents, epo provides a set of raw data called inpadoc database – legal status data (product 14.11) containing legal status data also known as PRS data (Patent Register Service) and includes records from over 40 international patent authorities.


The legal status of a patent or patent application refers to the entries and procedural steps occurring during the patent grant procedure and the subsequent life of a patent. These are normally published in the patent gazette of the patent-granting country or organisation concerned [from EPO website].


Legal status data are available in two formats: back file (from 1978 upto current year) and weekly updates (for current year). Following instruction refer to backlog file, but are applicable also for weekly update.

Inpadoc legal data come in two batches of files containing legal data (t12bfYYWW like where YY and WW are year and week of issue) and extended legal data (xlevYYWW) which translates the weekly bibliographic file (DOCDB) into legal events.

Data come in a XML format, splitted into several files due to the size of data.
Each file (example for T12BF) has a header

[iprevent cntevents=000656001 cntiprevents=000656003 date=20100129 record=START week=201004][/iprevent]

And records are structured like this

[iprevent cy=AT date=20100129 record=DATA status=C][l001ep]AT[/l001ep][l002ep]F[/l002ep][l003ep]168[/l003ep][l004ep]A [/l004ep][l005ep]PI[/l005ep][l007ep]19910315[/l007ep][l008ep]ELJ [/l008ep][l013ep]C[/l013ep][l017ep] AT 321347B[/l017ep][l018ep]20030101[/l018ep][/iprevent]

[NOTE: I changed < and > with [ and ] otherwise html interpreter would not display it on web page]

2010 backlog DVD contained, in T12BF group, 112 files and over 73M records, counting 24 Gb of unzipped data, while for XLEV was 155 files for 66.2 Gb of data.
If you want to create both tables, be sure to have 250 Gb of free space on disk.

In order to create a mysql table containing the relevant data the following steps can be followed:

1) file merge
After moving the file named T12BFYYMM___STAT (containing statistics about number of records for each file) files should be merged into one big file in order to facilitate the import process.
In order to facilitate the import step I suggest to also create a file with a record containing all the fields to be parsed, naming it @header.

Via DOS command line, move to the directory containing the files and run the following command

copy /b *.* patlegal.xml

This step may take 30 minutes.

2) MYSQL Import
The tool used for importing the file was NAVYCAT PREMIUM that has very powerful import features for mysql as well as other databases.
By rightclicking in the chosen library and selecting IMPORT WIZARD, you may chose XML import; be aware to chose IPREVENT as tag identifying table rows, when requested at step 3; for the rest you can just click on NEXT.

I suggest to create before via SQL an empty table having the correct field dimensions, so Navycat should only append the reocrds in the right fields.

In our case I created the table TEST.PATLEGAL.

This step may take 3 hrs.

[to be continued]

2 comments:

Carl Auclair said...

Hi,

Interesting to see that you succeeded to import the legal status to SQL Server. For my part, it's not working. I do exactly what you did but I select two files instead of just one (the @header and one more). Navicat Premium 10.1.6 returns the error message:
"[2013-04-29 11:59:19] [Msg] Import start
[2013-04-29 11:59:19] [Msg] Import type - XML file
[2013-04-29 11:59:19] [Msg] Import from - C:\EPO\SGML_data\04-2013\T12BF_201304_FormattedFiles\@SGMLtoSQL_Template | C:\EPO\SGML_data\04-2013\T12BF_201304_FormattedFiles\t12bf1301_AR_AT_01
[2013-04-29 11:59:19] [Msg] Import table [Table_LEGAL_STATUS_Test]
[2013-04-29 11:59:19] [Msg] Import table [Table_LEGAL_STATUS_Test]
[2013-04-29 12:05:17] [Err] 22001 - [SQL Server]String or binary data would be truncated.
01000 - [SQL Server]The statement has been terminated."

Yet, every colums are varchar(255) and null value is allowed
Did you modify your SQL Server setting or did something special on your server?

Thanks for your help.

Carl Auclair
cauclair@globalip.com

GL said...

Dear Carl
I suggest you to put all fields to 300 and L510EP to 3000 length.
Bytheway now EPO devilers also a copy of the data in CSV containing also APPLN_ID that makes all easier to link to patstat, I'd suggest you next time to go for that.

Post a Comment