Because the data come from many different sources, the text fields in PATSTAT (especially in table TLS206, which contains the names and addresses of applicants and inventors) contain many non-ASCII characters.
In case anybody needs it, here is a simple table for standardizing them to plain ASCII characters. Be aware that applying this table to patent titles or abstracts could lead to unexpected results.
NONASCII | ASCII | DESCRIPTION
Ç | C | Ç CEDILLE
É | E | é
Ö | O | ö
Ü | U | ü
Ä | A | ä
À | A | à
Ú | U | ú
Á | A | á
Î | I | î
Å | A | å
È | E | è
Ã | A | Ã
Â | A | Â
Ë | E | Ë
Ø | O | Ø
Ó | O | Ó
Ñ | N | Ñ N TILDE
Ê | E | Ê
Ô | O | Ô
Æ | AE | Æ
ß | SS | ß
Í | I | I
Ï | I | Ï
Õ | O | Õ
¢ | C | ¢
Û | U | Û
† | A | Ä
š | U | ü
.O SLASHED. | O | Ø
.ANG. | A | å
{ACUTE OVER (M)} | M | M
{HACEK OVER (S)} | S | Š
{HACEK OVER (C)} | C | Č
{HACEK OVER (Z)} | Z | Ž
{UMLAUT OVER (S)} | S | S
{UMLAUT OVER (C)} | C | C
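As a rough illustration of how the table above could be applied, here is a minimal Python sketch. The names ASCII_MAP and to_plain_ascii are made up for this example, the dictionary holds only a subset of the rows, and the input is assumed to be upper case already, as in the NONASCII column.

```python
import re

# Subset of the table above: non-ASCII sequence -> plain ASCII replacement.
# ASCII_MAP and to_plain_ascii are hypothetical names used for illustration.
ASCII_MAP = {
    "Ç": "C", "É": "E", "Ö": "O", "Ü": "U", "Ä": "A", "Ñ": "N",
    "Ø": "O", "Æ": "AE", "ß": "SS",
    ".O SLASHED.": "O", ".ANG.": "A",
    "{HACEK OVER (S)}": "S", "{UMLAUT OVER (C)}": "C",
}

# Try the longest keys first so that multi-character markers such as
# "{HACEK OVER (S)}" are matched as a whole.
_PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(ASCII_MAP, key=len, reverse=True))
)


def to_plain_ascii(text: str) -> str:
    """Replace every sequence listed in ASCII_MAP with its ASCII equivalent.

    Assumes the text is already upper case; anything not in the map is kept.
    """
    return _PATTERN.sub(lambda match: ASCII_MAP[match.group(0)], text)
```

For example, to_plain_ascii("GÖTEBORG") gives "GOTEBORG". Characters that are not in the dictionary pass through unchanged, so the remaining rows of the table have to be added before real use.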
Obviously there is another problem: how are non-ASCII characters encoded in the text fields in the first place?
For table TLS206_PERSON this problem is highly relevant, but we can detect it by searching for the char  (for instance « = AE), and we may use a table like the one made by Julio Raffo at
http://wiki.epfl.ch/patstat/corrupted
If we use TLS206_ascii the problem does not arise.
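One possible way to find which characters deserve a closer look is to tally every non-ASCII character that occurs in the name field. The sketch below is only an assumption about how that search could be done: residual_non_ascii is a hypothetical helper, and the names are assumed to have been loaded into a list of Python strings beforehand.

```python
from collections import Counter


def residual_non_ascii(names):
    """Count every character outside 7-bit ASCII across the given strings.

    residual_non_ascii is a hypothetical helper, not part of PATSTAT.
    """
    counts = Counter()
    for name in names:
        counts.update(ch for ch in name if ord(ch) > 127)
    return counts
```

Calling residual_non_ascii(names).most_common() lists the most frequent characters first, which can then be checked by hand against a corruption table.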