Monday, May 10, 2010

Converting PATSTAT text fields into plain ASCII

Because PATSTAT draws on many different data sources, its text fields (especially in table TLS206, which contains the names and addresses of applicants and inventors) contain a lot of non-ASCII characters.

In case anybody needs it, here is a simple mapping table for standardizing them to plain ASCII characters. Be aware that applying this table to patent titles or abstracts could lead to unexpected results.

NONASCII             ASCII   DESCRIPTION
Ç                    C       Ç CEDILLE
É                    E       é
Ö                    O       ö
Ü                    U       ü
Ä                    A       ä
À                    A       à
Ú                    U       ú
Á                    A       á
Î                    I       î
Å                    A       å
È                    E       è
Ã                    A       Ã
Â                    A       Â
Ë                    E       Ë
Ø                    O       Ø
Ó                    O       Ó
Ñ                    N       Ñ N TILDE
Ê                    E       Ê
Ô                    O       Ô
Æ                    AE      Æ
ß                    SS      ß
Í                    I       I
Ï                    I       Ï
Õ                    O       Õ
¢                    C       ¢
Û                    U       Û
[unreadable]         A       Ä
š                    U       ü
.O SLASHED.          O       Ø
.ANG.                A       å
{ACUTE OVER (M)}     M       M
{HACEK OVER (S)}     S       Š
{HACEK OVER (C)}     C       Č
{HACEK OVER (Z)}     Z       Ž
{UMLAUT OVER (S)}    S       S
{UMLAUT OVER (C)}    C       C
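
For anybody who wants to apply the table outside the database, here is a minimal Python sketch. It is only an illustration, not the way the table was originally used: the names ASCII_MAP and to_plain_ascii are made up for this example, only a handful of rows are copied in, and replacing the longest codes first is my own choice so that multi-character codes such as {HACEK OVER (S)} are consumed whole.

```python
# Hypothetical helper: a subset of the NONASCII -> ASCII rows from the table.
# Extend the dictionary with the remaining rows as needed.
ASCII_MAP = {
    "Ç": "C", "É": "E", "Ö": "O", "Ü": "U", "Ä": "A", "Ñ": "N",
    "Æ": "AE", "ß": "SS", "Ø": "O",
    ".O SLASHED.": "O",
    ".ANG.": "A",
    "{HACEK OVER (S)}": "S",
    "{UMLAUT OVER (C)}": "C",
}

# Assumption: longer codes are replaced first, single characters last.
_KEYS_LONGEST_FIRST = sorted(ASCII_MAP, key=len, reverse=True)


def to_plain_ascii(text: str) -> str:
    """Replace every key of ASCII_MAP found in text by its ASCII value."""
    for key in _KEYS_LONGEST_FIRST:
        text = text.replace(key, ASCII_MAP[key])
    return text


if __name__ == "__main__":
    for name in ["MÜLLER & SØNNER", "{HACEK OVER (S)}KODA", ".ANG.STR.O SLASHED.M"]:
        print(name, "->", to_plain_ascii(name))
```

Characters that are not in the dictionary pass through unchanged, so the result is only plain ASCII once every character that occurs in the data has a row in the table.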

Obviously there is another problem: how are non-ASCII characters encoded in the text fields in the first place?
If we work with table TLS206_PERSON this problem is highly relevant, but we can detect it by searching for the char  (for instance, « = AE), and we can use a table like the one made by Julio Raffo at
http://wiki.epfl.ch/patstat/corrupted
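
A quick way to see which suspicious characters actually occur is to count every non-ASCII character in the name field. The sketch below assumes the names have already been exported from TLS206_PERSON into a Python list; the function name non_ascii_counts is made up for this example, and it is not the method behind the table linked above.

```python
from collections import Counter


def non_ascii_counts(names):
    """Count each character outside the ASCII range (code point > 127)."""
    counts = Counter()
    for name in names:
        counts.update(ch for ch in name if ord(ch) > 127)
    return counts


if __name__ == "__main__":
    sample = ["MÜLLER", "SØNNER", "SMITH", "MÜNCH"]
    # List the characters from most to least frequent.
    for char, n in non_ascii_counts(sample).most_common():
        print(repr(char), n)
```

Characters that show up here but have no row in the mapping table are good candidates for corrupted or differently encoded text.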

If we use TLS206_ascii instead, this problem does not arise.
