In the October 2010 edition of PATSTAT, table TLS206 contains 37,428,107 distinct person IDs (applicants or inventors). We would expect (or hope) that each of them has some geographic data besides the name but, as noted in some previous posts, many of them have no information at all apart from the name, which makes improving data quality a little harder.
Exactly 13,032,871 persons (28% of the total) have no country code (and, obviously, in most cases no city, address, etc. either).
With this post I'm starting a series on how to find clusters of records where the country data can be improved.
The first case I'll look at is applicants with no country code.
Let's take FI patent AP273, invented by BRUCE HOWARD DIXON [US] and applied for by HOWARD DIXON BRUCE (with no country).
If we look at the PDF of the patent, we see that Bruce, as applicant, is listed as living in Florida, US, while as inventor he is listed as "SEE ABOVE". So we may presume that many applicants with no country, when they have a homonym among the inventors of the same application, can inherit the country code from that homonymous inventor.
We must, however, exclude ambiguous cases, like this pair of US patents, A & B, where the same applicant (and person ID), Abate Riccardo, appears as inventor first with country US and later with IT.
So when creating the correction table we must remove persons that occur more than once, i.e. those that would be assigned more than one country.
Applying this procedure in a simple way (no standardization of names, just plain string matching), 472,668 persons with no country code (3.7% of those missing one) can be assigned a country code.
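To make the procedure concrete, here is a minimal Python sketch of it. It is only an illustration under stated assumptions, not the queries I actually ran: the function name `infer_applicant_countries`, the in-memory layout of `persons` and `links`, and the sample rows are all made up for the example and do not come from PATSTAT.

```python
from collections import defaultdict


def infer_applicant_countries(persons, links):
    """Propose a country for applicants that have none.

    persons: {person_id: (name, country_or_None)}          (hypothetical layout)
    links:   iterable of (appln_id, person_id, role), with
             role being "applicant" or "inventor"           (hypothetical layout)

    An applicant without a country inherits the country of an inventor
    with exactly the same name on the same application. Persons that
    would end up with more than one candidate country are dropped.
    """
    links = list(links)  # we iterate twice

    # Countries of inventors, indexed by (application, exact name).
    inventor_countries = defaultdict(set)
    for appln_id, person_id, role in links:
        name, country = persons[person_id]
        if role == "inventor" and country:
            inventor_countries[(appln_id, name)].add(country)

    # Candidate countries for every applicant that has no country.
    candidates = defaultdict(set)
    for appln_id, person_id, role in links:
        name, country = persons[person_id]
        if role == "applicant" and not country:
            candidates[person_id] |= inventor_countries.get((appln_id, name), set())

    # Keep only unambiguous persons: exactly one candidate country.
    return {pid: next(iter(c)) for pid, c in candidates.items() if len(c) == 1}


# Made-up sample data, for illustration only.
persons = {
    1: ("SMITH JOHN", "US"),   # inventor with a country
    2: ("SMITH JOHN", None),   # applicant, same name, no country
    3: ("ROSSI MARIO", "US"),  # inventor on one application
    4: ("ROSSI MARIO", "IT"),  # inventor on another application
    5: ("ROSSI MARIO", None),  # applicant on both, no country
}
links = [
    (100, 1, "inventor"), (100, 2, "applicant"),
    (200, 3, "inventor"), (200, 5, "applicant"),
    (201, 4, "inventor"), (201, 5, "applicant"),
]

corrections = infer_applicant_countries(persons, links)
print(corrections)  # {2: 'US'}
```

Person 2 gets US from the inventor with the same name on the same application, while person 5 is left out because the two applications suggest different countries (US and IT). The match is an exact string comparison, so names written in a different word order would not be paired without some name standardization first.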