Monday, November 23, 2009

Patstat vs non patent citations (NPL)

For lovers of bibliometry patstat, along with patent citations, contains also non patent literature referenced by patents.

You con get those references from citations table (TLS212_CITATION) linking the table containing the fulltext of NPL citation (TLS214_NPL_PUBLN) by NPL_PUBLN_ID where NPL_CITN_SEQ_NR (the progressive for NPL citations) is different from 0

An issue on these tables is that TLS214 contains a lot of duplicates!!!

With an easy

select distinct trim(NPL_BIBLIO) from TLS214_NPL_PUBLN
I reduce the figures from 12.139.696 to 9.449.779 (a 23% less...))
[I added a trim cause many records start with a space]

maybe later on I'll post some SQL to deduplicate the data...

Wednesday, November 18, 2009

Removing non alfanumeric chars

Ok.
Let's figure out we have a list of addresses coming from a database (PATSTAT FI).
You can bet that two cities who are the same will be written in a different number of way, the longer the city name the highest the figure!
A part from already cited Levenshtein distance we may also need to get rid of all those interpunctions, slash and other wonderful chars that the keyboard gives us.

Let's go to Germany, for instance.
If you have

Frankfurt/Main
FRANKFURT MAIN
FRANKFURT-MAIN

How may you say they are the same city?
We would need a function (let's call it ALFANUM) who removes all chars except A-Z and 0-9 and puts the string in uppercase returning

FRANKFURTMAIN

for all.

Here it is my version that works with strings up to 500 chars (but you may change it...)

CREATE DEFINER=`root`@`localhost` FUNCTION `alfanum`(in1 varchar(500)) RETURNS varchar(500) CHARSET latin1
BEGIN
DECLARE a INT;
DECLARE b varCHAR(500);
DECLARE c varCHAR(500);
DECLARE d char(1);
SET b="";
SET c = in1 ;
ciclo: WHILE c <>"" DO
set d = left(c,1);
set d = ucase (d);
set c= right(c, length(c)-1);
if ( d >= "A" and d <= "Z") or (d>="0" and d <="9" ) then
set b = concat(b,d) ;
end if;
END WHILE ciclo;
return b;
END

Tuesday, November 17, 2009

Excel vs in cell line breaks

Recently, working on a EU Tender (Using performance indicators in monitoring the implementation of ICT research in FP6 and FP7) I had to transform into a DB a batch of reports (reporting publication titles, magazine and so on) whose original format was MS excel (may the hell swallow them!!!).
One (out of dozens) of the problems was that many cells were containing a line break (due to web cut & paste or in order to give 'em a pretty look) and I found a quick solution for removing them all from a given worksheet

STEP 1 select the area for substitution
STEP 2 press ALT+F11 (opens VB window)
STEP 3 press CTRL+G (opens IMMEDIATE frame)
STEP 4 paste the following: selection.replace chr(10)," " and press enter in the IMMEDIATE frame

Obviously sobstituting chr(10) or " " it works with anything you need to replace.
As a matter of fact I had also to replace doublespaces...!!!

STEP 5 selection.replace "  "," "

Friday, November 13, 2009

Leven-einstein

When having to compare 2 text strings (FI 2 addresses) and get a similarity score, Mr Levenshtein invented an algorithm that is very useful.

The bad news is that  MYSQL do not implement it.
Bytheway you may decide to implement it through a user define function (UDF)... but under windows it would mean a lot of compiling, storing and so on... (in reality I could not succeed but if you want to try see here at Sherlock software)

In LINUX it seems to be much easier (for ubuntu (in german) see http://www.teamarbyte.de/levenshtein.html)

So in my WIN environment I implemented, thanks to CODEJANITOR (see http://codejanitor.com/wp/2007/02/10/levenshtein-distance-as-a-mysql-stored-function/) this plain SQL function... much slower than a compiled one but that is life...

CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(255), s2 VARCHAR(255))
  RETURNS INT
    DETERMINISTIC
      BEGIN
        DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
        DECLARE s1_char CHAR;
        DECLARE cv0, cv1 VARBINARY(256);
        SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
        IF s1 = s2 THEN
          RETURN 0;
        ELSEIF s1_len = 0 THEN
          RETURN s2_len;
        ELSEIF s2_len = 0 THEN
          RETURN s1_len;
        ELSE
          WHILE j <= s2_len DO
            SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
          END WHILE;
          WHILE i <= s1_len DO
            SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
            WHILE j <= s2_len DO
                SET c = c + 1;
                IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
                IF c > c_temp THEN SET c = c_temp; END IF;
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
                IF c > c_temp THEN SET c = c_temp; END IF;
                SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
            END WHILE;
            SET cv1 = cv0, i = i + 1;
          END WHILE;
        END IF;
        RETURN c;
      END

To note:
    * Maximum length of input strings is 255 characters. The function should be edited to support more if needed.
    * I’ve tested it with international characters on a utf8_bin column and it seemed to work, but I’ve not tested that capability exstensively.
    * I’ve only tested it on MySQL 5.0+. No idea how it will work on versions less than that.

Thursday, November 12, 2009

free icons

ok. This is not much database or patent related... but who does not need some free icons for making nicer his beloved db oreinted applicantion?
Here you can find some good websites for downloading sets of icons for free...


http://www.freeiconsdownload.com/index.html  --> only PNG format
http://www.freeiconsweb.com/free_icons.html --> very very nice!
http://www.iconarchive.com/links/icon-tools.html --> icon tools (editors, creators, converters...)

Monday, November 9, 2009

From Patstat to mysql

Regardless the fact that EPO-Patstat DVDs contain a rich documentation about how to load the data in several formats (MS SQL, mainframe...) for MySQL the documentation is poor since it was written in 2006 and since that time many updates and new tables occurred.

I post here a link to some SQL I wrote in order to allow upload of tables into MYSQL. Especially some changes were made for:
- allowing new TLS219 and TLS218
- Updating data from TLS206ASCII (parsed person data) into a TLS206X

The data rely onto a library called PATSTAT and expect raw data to be loaded into a subdir called /RAW of PATSTAT Mysql library

Hope will be useful... you can download it from here

Let's start from...

the end.
Nowadays I work as a database administrator @ KITES (Knowledge, Internationalization and Technology Studies), a dept. caring of economics of innovation in Bocconi University.
One of my main contributes to Kites was creating a database about patents application at EPO and USPTO.
Mainly our db is based on PATSTAT data and developed in mysql. Soon I'm going to post (since it seems EPO do not provide them) some scripts useful for importing patstat in mysql.
Anyway, this blog is going to be a very technical one (so very boring for most of people) aiming first of all to help me to put order in the stuff I deal with. Eventually if some other people could advantage from it, that could be good.