Monday, November 15, 2010

patstat sept. 2010 MYSql load scripts

A few days ago was released september 2010 version of PATSTAT data.
Along with 3 DVD come a precedure for uploading data into MS Sql server, but a very old procedure for MYSql user.



Cliccking on above icon you may get my scripts for uplaoding the data. They are released as they are, under CC 3.0 licence (this means you can use them but not resell 'em).
Please post a comment if you used them and found them useful, or if you have suggestions to improve 'em.

In this post I also put some comments.

First of all there is a 'problem' in comparison to previous versions, since records are not terminated with {CR/LF} but with {LF}.

Apart from that here I list the log of record load: we always lose some records in tables 202, 203 but oddly we 'earn' some records in 210 and 217, compared to the content declared by EPO.



table date of files declared imported delta
TLS201_APPLN 201009 66.226.956 66.226.956 0
TLS202_APPLN_TITLE 201009 48.303.269 48.303.256 13
TLS203_APPLN_ABSTR 201009 18.139.427 18.139.356 71
TLS204_APPLN_PRIOR 201009 28.823.857 28.823.857 0
TLS205_TECH_REL 201009 2.122.738 2.122.738 0
TLS206_ASCII 201009 37.428.136 37.428.107 29
TLS207_PERS_APPLN 201009 134.687.197 134.687.197 0
TLS208_DOC_STD_NMS 201009 16.864.577 16.864.577 0
TLS209_APPLN_IPC 201009 301.516.775 301.516.775 0
TLS210_APPLN_N_CLS 201009 25.289.374 25.289.379 -5
TLS211_PAT_PUBLN 201009 74.161.545 74.161.545 0
TLS212_CITATION 201009 97.111.948 97.111.948 0
TLS214_NPL_PUBLN 201009 14.826.883 14.826.881 2
TLS215_CITN_CATEG 201009 18.043.102 18.043.102 0
TLS216_APPLN_CONTN 201009 1.769.423 1.769.423 0
TLS217_APPLN_I_CLS 201009 101.894.277 101.894.406 -129
TLS218_DOCDB_FAM 201009 58.713.013 58.713.013 0
TLS219_INPADOC_FAM 201009 66.226.956 66.226.956 0

To go in detail (most user will find this boring) some record in some tables have lesser problems.

First of all 3 EP applications have an odd application and pubblication number that gives an upload error in part 3 of files TLS201 and 211 (is due to a non ascii char in the app/pub number and it's tranlastion in 2 chars make the field wider than it should be).

TLS201 - part03:
Data too long for column 'APPLN_NR' at row 21104523
Data too long for column 'APPLN_NR' at row 21145320
Data too long for column 'APPLN_NR' at row 21764445
TLS211 -PART03
Data too long for column 'PUBLN_NR' at row 19039112
Data too long for column 'PUBLN_NR' at row 19079909
Data too long for column 'PUBLN_NR' at row 19699034



This is the content of data, in black the field giving error:

73151912, 'EP', '      9600Ó7LI1', 'A', 66151912, '9999-12-31', '', 0
66151912, 'EP', '      9600Ó7LI1', 'D2', '9999-12-31', 'PI', '', '', 0

Then we have 2 titles longer than 3000 chars, always due to non ascii chars translated into 2 chars, increasing so the size of the field title.

TLS202 - PART01
Data too long for column 'APPLN_TITLE' at row 3126639
Data too long for column 'APPLN_TITLE' at row 3136166
(LONGHER THAN 3000)
003544896,"mîtodo e composiúÃúo farmacêutica para a prevenúÃúo ou o tratamento de uma doenúa ou uma condiúÃúo autoimune ou infecciosa, mîtodo e composiúÃúo farmacêutica para a prevenúÃúo ou o tratamento de uma doenúa ou uma condiúÃúo do sangue, mîtodo e composiúÃúo farmacêutica para modular a formaúÃúo de cîlulas do sangue, mîtodo e composiúÃúo farmacêutica para intensificar a mobilizaúÃúo perifîrico da cîlula tronco, mîtodo e composiúÃúo farmacêutica para a prevenúÃúo ou o tratamento de uma doenúa ou uma condiúÃúo metabælica, mîtodo e composiúÃúo farmacêutica para a prevenúÃúo ou o tratamento das condiúÃÁes associadas com doses mieloablativas de quimioradioterapia suportadas pelo transplante autælogo de medula æssea ou de cîlulas tronco do sangue perifîrico (asct) ou pelo transplante alogenîico de medula æssea (bmt), mîtodo e composiúÃúo farmacêutica para aumentar o efeito de um fator estimulante de cîlulas do sangue, mîtodo e composiúÃúo farmacêutica para intensificar a colonizaúÃúo de cîlulas tronco do sangue doadas em um receptor mieloablatado, mîtodo e composiúÃúo farmacêutica para a prevenúÃúo ou o tratamento de uma doenúa ou uma condiúÃúo bacteriana, composiúÃúo farmacêutica para o tratamento ou a prevenúÃúo de uma indicaúÃúo selecionada do grupo que consiste em doenúa ou condiúÃúo autoimune, doenúa viral, infecúÃúo viral, doenúa hematolægica, deficiências hematolægicas, trombocitopenia, pancitopenia, granulopenia, hiperlipidemia, hipercolesterolemia, glucosuria, hiperglicemia, diabetes, aids, hiv-1, distúrbios de cîlulas t auxiliares, deficiências de cîlulas dendrüticas, deficiências de macrofagos, distúrbios de cîlulas tronco hematopoiîticas incluindo distúrbios com plaquetas, linfæcitos, cîlulas do plasma e neutræfilos, condiúÃÁes prî-leucêmicas, condiúÃÁes leucêmicas, distúrbios do sistema imunolægico resultantes da terapia de quimioterapia ou de radiaúÃúo, distúrbios do sistema imunolægico humano resultantes do tratamento das doenúas de deficiência imunolægica e infecúÃÁes bacterianas, composiúÃúo farmacêutica para o tratamento ou a prevenúÃúo de um indicaúÃúo selecionada do grupo que consiste em doenúa hematolægica, deficiências hematolægicas, trombocitopenia, pancitopenia, granulopenia, deficiências de cîlulas dendrüticas, deficiências de macrofagos, distúrbios de cîlulas tronco hematopoiîticas incluindo distúrbios com plaquetas, linfæcitos, cîlulas do plasma e neutræfilos, condiúÃÁes prî-leucêmicas, condiúÃÁes leucêmicas, sündrome mielodisplÃístiacas, malignidades nÃúo mielæides, anemia plÃística e insuficiência da medula æssea, peptüdeo purificado, peptüdeo quimîrico purificado, peptüdeo quimîrico, composiúÃúo farmacêutica, composiúÃúo farmacêutica para a prevenúÃúo ou o tratamento de uma condiúÃúo associada com um agente infeccioso de sars, mîtodo de processamento a baixa temperatura de hidrolisato proteolütico de caseüna e hidrolisato de proteüna de caseüna"

Same thing for 1 abstract:

tls203 - PART07
Data too long for column 'APPLN_ABSTRACT' at row 702251


In table 206 ascii - part01 we have one record (#56) that gives some problems since a text field finishes with a / and this shiftes some fields, giving error 'Row 56 doesn't contain data for all columns'

Also in TLS206-part05
last 63 records doesn't contain data for all columns, but this time for real...
(Row 1428062 till 1428127)
FI:
037428066010830866054240706US       43659874A 3USPI0003        000000000000000000000

Enjoy you database...

5 comments:

Unknown said...

Hey man,

Thanks so much for doing this, but the link to the scripts seems broken. Do you think you could post an alternate link?

Thanks!!
M

GL said...

Hi Matej
you were right, there was a problem in the file on skydrive. I fixed it, let me know if it's ok

Gtampu said...

GL,
Great! Thanks for putting this up. As you said, blogs like this restores our faith and confidence in God and humanity :)

G.
PS.I've just got back.

Unknown said...

Hi,
Thanks for sharing your scripts but the link is not working :(

GL said...

Hi, just leftclick and choose 'save as', it should be working; if you just doubleclick it will not work.

Post a Comment