Monday, November 15, 2010

Importing patstat TLS221 (PRS inpadoc legal status) into MYSql

From september 2010 inpadoc legal status data are available as a separate dataset, but linked to the rest of patstat via application id.
Datacome with no documentation, since they link existing PRS 14.11 product documents (so I link my previous posts on the argument )

In comparison to the old PRS data, there are no more fields L001 - 500, since application and publication data may be retrieved from TLS201 and 211 from core patstat tables, so they became redundant.

Total number of records in this ediction to be loaded is 80.328.938, devided into 16 files containing 5M records (header excluded), and the last containing 328938.
Be aware you'd need 20Gb of disk space for the whole process.

At this link [rightclick and choose save as] you may download MYSql scripts for TLS221: care that they are 2 scripts since due to some errors, so if you want to be sure, at mid of first script you should run the second for creating a temp table to handcheck and correct (but this step is really for maniac since only a few records are to be corrected).
Then you can do other optimizations like setting date field to date formats from char etc.
Be also aware that in this first ediction records are terminated from {LF} but # 14 and 15 that are terminated by{CR/LF}.
My scripts reflect actual situation, but be aware EPO will soon fix this problem so scripts will need to be amended.


If  you need a detailed list of 'problematic records' by number of txt file of TLS221 (they usually have a text field terminated by / so it escapes field termination and fields are shifted... this is the reason of the need of fixing records with above mentined procedure)...


3
Row 2117823 was truncated; it contained more data than there were input columns
( appln_id 010762977 progr #2)
14
Row 1243470 doesn't contain data for all columns
15
Row 3091645 doesn't contain data for all columns
Row 3504629 doesn't contain data for all columns
16
Row 3021693 doesn't contain data for all columns

1 comment:

Antoine Dechezlepretre said...

Works perfectly.
To correct the errors before uploading the data just open tables 3, 14, 15, 16 with Textpad and replace occurences of \" with "
Then only the import script is necessary
Antoine

Post a Comment