Monday, June 13, 2011

Deeper into patstat family tables

As previously stated, Patstat contains 2 family data tables: tls218 (docdb family) and tls219 (inpadoc family).
In this post you may find the differences in their definitions.


Here we are going to analyse the data contained in the two tables, based on the October 2010 edition.


The first thing that stands out is the difference in the number of records:


tls218_docdb_fam    58713013
tls219_inpadoc_fam  66226956

This is easily explained: the inpadoc family has been built for every application_id, whereas docdb excludes PATSTAT applications created from unpublished DOCDB priorities (in this case appln_id > 59.000.000).
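The comparison above can be sketched with a small in-memory SQLite database. This is only an illustration on toy rows: the column names (appln_id, docdb_family_id, inpadoc_family_id) are assumed to match the PATSTAT schema, and a real PATSTAT installation would normally run the same SQL on its own database server.

```python
import sqlite3

# Toy stand-in for a PATSTAT database; column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tls218_docdb_fam (appln_id INTEGER PRIMARY KEY, docdb_family_id INTEGER);
    CREATE TABLE tls219_inpadoc_fam (appln_id INTEGER PRIMARY KEY, inpadoc_family_id INTEGER);
    INSERT INTO tls218_docdb_fam VALUES (1, 10), (2, 10), (3, 11);
    -- appln_id 59000001 plays the role of an unpublished priority
    INSERT INTO tls219_inpadoc_fam VALUES (1, 100), (2, 100), (3, 100), (59000001, 100);
""")

n_docdb = conn.execute("SELECT COUNT(*) FROM tls218_docdb_fam").fetchone()[0]
n_inpadoc = conn.execute("SELECT COUNT(*) FROM tls219_inpadoc_fam").fetchone()[0]

# Applications that have an inpadoc family but no docdb family.
missing = conn.execute("""
    SELECT i.appln_id
    FROM tls219_inpadoc_fam AS i
    LEFT JOIN tls218_docdb_fam AS d ON d.appln_id = i.appln_id
    WHERE d.appln_id IS NULL
    ORDER BY i.appln_id
""").fetchall()

print(n_docdb, n_inpadoc, missing)
```

On the toy data the only application missing from the docdb table is the one with appln_id above 59.000.000, which mirrors the explanation given here.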

Next, we can extract the number of distinct families from the two tables:

inpadoc: 39.301.955
docdb:    40.677.058

So docdb has more families even though it covers fewer records, and we expect the average docdb family to be smaller, in agreement with its definition.
Then we can calculate the family composition by number of applications.
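As a quick check of that expectation, the average family size can be computed from the record and family counts quoted in this post. Note that the two tables do not cover exactly the same set of applications, so this is only a rough comparison.

```python
# Record and distinct-family counts as quoted in this post (October 2010 edition).
records = {"inpadoc": 66226956, "docdb": 58713013}
families = {"inpadoc": 39301955, "docdb": 40677058}

# Average number of applications per family in each table.
avg_size = {name: records[name] / families[name] for name in records}

for name, value in avg_size.items():
    print(f"{name}: {value:.2f} applications per family")
# inpadoc: 1.69 applications per family
# docdb: 1.44 applications per family
```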



#apps | inpadoc #  | inpadoc % | docdb #    | docdb %
1     | 29923455   | 76,14%    | 35468031   | 87,19%
2     | 5374436    | 13,67%    | 1863423    | 4,58%
3     | 1048185    | 2,67%     | 898137     | 2,21%
4     | 718316     | 1,83%     | 656255     | 1,61%
5     | 552198     | 1,41%     | 514797     | 1,27%
6     | 424765     | 1,08%     | 359279     | 0,88%
7     | 308614     | 0,79%     | 243965     | 0,60%
8     | 222540     | 0,57%     | 170751     | 0,42%
9     | 161787     | 0,41%     | 122107     | 0,30%
>=10  | 567659     | 1,44%     | 380313     | 0,93%
TOT   | 39301955   | 100%      | 40677058   | 100%
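A composition table like this one can be produced by counting applications per family and then counting families per size, with everything from 10 upwards folded into one bucket. The sketch below does this for the inpadoc table on toy rows; the column names are assumed, and the same query shape would apply to the docdb table.

```python
import sqlite3

# Toy data: families of size 2, 1, 1, 3 and one family of size 12.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tls219_inpadoc_fam (appln_id INTEGER PRIMARY KEY, inpadoc_family_id INTEGER)"
)
small = [(1, 100), (2, 100), (3, 101), (4, 102), (5, 103), (6, 103), (7, 103)]
big = [(appln_id, 104) for appln_id in range(8, 20)]  # 12 applications
conn.executemany("INSERT INTO tls219_inpadoc_fam VALUES (?, ?)", small + big)

# Inner query: applications per family. Outer query: families per size bucket.
rows = conn.execute("""
    SELECT CASE WHEN n_apps >= 10 THEN '>=10' ELSE CAST(n_apps AS TEXT) END AS size_bucket,
           COUNT(*) AS n_families
    FROM (SELECT inpadoc_family_id, COUNT(*) AS n_apps
          FROM tls219_inpadoc_fam
          GROUP BY inpadoc_family_id) AS per_family
    GROUP BY size_bucket
    ORDER BY MIN(n_apps)
""").fetchall()

total = sum(n for _, n in rows)
distribution = [(bucket, n, round(100.0 * n / total, 2)) for bucket, n in rows]

for bucket, n, pct in distribution:
    print(bucket, n, f"{pct}%")
```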

In absolute numbers, the largest inpadoc family counts 4927 applications (20% of which are unpublished priorities), while docdb has 'only' 329 applications in its largest 'clan'.
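The largest family and its share of unpublished priorities could be found along these lines. This is a sketch on toy rows: the column names are assumed, and the appln_id > 59.000.000 test for unpublished priorities is taken from the explanation earlier in this post.

```python
import sqlite3

# Toy data: family 100 has 5 applications, one of them an unpublished priority.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tls219_inpadoc_fam (appln_id INTEGER PRIMARY KEY, inpadoc_family_id INTEGER);
    INSERT INTO tls219_inpadoc_fam VALUES
        (1, 100), (2, 100), (3, 100), (4, 100), (59000001, 100),
        (5, 101), (6, 101);
""")

# Largest family, with the number of members above the appln_id threshold.
family_id, n_apps, n_unpublished = conn.execute("""
    SELECT inpadoc_family_id,
           COUNT(*) AS n_apps,
           SUM(CASE WHEN appln_id > 59000000 THEN 1 ELSE 0 END) AS n_unpublished
    FROM tls219_inpadoc_fam
    GROUP BY inpadoc_family_id
    ORDER BY n_apps DESC
    LIMIT 1
""").fetchone()

share_unpublished = 100.0 * n_unpublished / n_apps
print(family_id, n_apps, f"{share_unpublished:.0f}% unpublished priorities")
```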

Here, in espacenet, you can find the inpadoc family members of this oversized 4927-application family (no surprise, it is a biotech application), but only 430 are displayed, along with the message "the system limits have been exceeded, therefore the family is incomplete".


