Monday, October 10, 2011

Person ids concordance table

A lot of problems arise when you need to attach to your patstat version DB, data coming from other sources updated asincornally with yours.

Most problems come from appln_id and person_id.

Since from 201104 appln_id will be stabilized @ source, I produced a first concordance table for person_id.

@this link (thanks to EPO for guesting this big file on their servers) you may download a first version of a concordance table among differnet versions of person ids.

After un-raring the file you will have a txt tab separated table contains 4 fields:
personid1   first patstat version personid
personid2   second patstat version personid
pversion1   patstat version of personid1 (format YYYYMM)
pversion2    patstat version of personid2 (format YYYYMM)

This first version of the file contains only concordance among 200909 and 201009 versions.

The file succeds in  matching 34250823, out of 37428113, 201009 persons against 200909 ids.


In reality matches are a little less since about 38.000 matches are duplicated (12000 2010 ids match to more than one 2009 id and 26000 2009 ids match more than one 2010 ids), but this less than 1/1000 error is due to some data repetitions in the main db and is a very small error.



If you are willing to do the same exercise you may run this SQL after renaming to tls206_person_200909 the TLS206 table, version 200909.

Be aware you will need to use a collation that distinguish among upper and lower case (in mysql utf8_bin would do the job).


create table person_id_his
Select
  T10.PERSON_ID As person_id1,
  T9.PERSON_ID As person_id2,
  "201009" as pediction1,
  "200909" as pediction2
From
  tls206_person_200909 T9 Inner Join
  tls206_person T10 On
  T10.PERSON_CTRY_CODE  = T9.PERSON_CTRY_CODE
  And T10.PERSON_NAME  = T9.PERSON_NAME
  And T10.PERSON_ADDRESS  = T9.PERSON_ADDRESS ;

No comments:

Post a Comment