Sunday, December 20, 2009

Underscore vs dash in excel sheet names

this could be funny, if it were not a big issue in the elaboration of a questionnaire.
As I previously mentioned I'm working on building a DB for DG-Infso (EU) from some thousands of answers made by project leaders, collected via Excel.
The bad thing was that apparently many people had changed some of the worksheets, changing dashes (-) to underscores (_) (that was like publications-papers becoming publications_papers).

Even if I understand that for somebody's taste underscore may look nicer than a sign that resemble a minus (so attracting negativity) I did not feel like dozens of people would be taking the effort of manual substitutions just to avoid bad karma.

So I suspected "maybe they are mac users?" (you know mac users are more estetic-oriented...) until I found this post on OPENOFFICE.ORG explaining how openoffice calc changes dashes to underscore in worksheets names.

So cool...

Thursday, December 17, 2009

IPC to OST reclassification

After last post I decided to add something of my own, by releasing a conversion table from IPC to OST (7 & 30) reclassification.
The table can be downloaded here : it's a CSV file (separated with semicolon) with an xls file containing class descriptions for 7 & 30 reclassifications.

You will find something like

M_class OST30 OST7
A01B% ; 20 ; 5
A01C% ; 20 ; 5

Where % means in SQL "anything after..."
So you may add your reclassification with an update where YOUR_IPC is like M_CLASS

At the end of the table you will find some IPC without reclassification

C99Z%
H04N101/00

But they are very few in the overall patents population (2 and 328 patents in EPO)

Wednesday, December 16, 2009

OST reclassification vs H04W

This post is the result of a joint effort of Emmanuelle Fortune (OST), Lorenzo Cassi (CES) and Francesco Lissoni (KITeS) who gave the solution. I just raised the question.

I asked those guys a help since OST reclassification of IPC codes is not containing H04W class, that comes with patstat 2009/10.

UPto now only @EPo ypou can find 9023 patents with H04W and more than 50.000 worldwide.

The group H04W has been introduced into the IPC-8 only with the version January 2009 (2009.01). As with the IPC-8 system also old documents are re-classified, by using a PATSTAT edition 2009 or later you will find out that also documents from 2008 backwards carry the classificaiton H04W in the databases (see FI EP1685687) , but when you look on the printed document on Espacenet you will not find the symbol there. Also, in case where a patent has been published before 2009 as an application and then granted after 2009, then indeed you will see a difference in the published classification, but in the database you should find that both have the more uptodate on (H04W).

The H04W used to be H04L, H04Q, H04B... H04Q, H04B are in the field Telecommunications and H04L in Digital communication.

So H04W could fall under Telecommunications or Digital communication.

Ulli Schmoch defines "Digital Communications" as follows:
"Digital communication: in the ISI-OST-INPI classification, this field was part of telecommunications. At present, it is a self-contained technology at the border between telecommunications and computer technology. A core application of this technology is the Internet"

While telecommunication is huge (G08C, H01P, H01Q, H04B, H04H, H04J, H04K, H04M, H04N-001, H04N-007, H04N-011, H04Q), Digital communication is made only of H04L

So, in the absence of a more accurate judgement, we decided to keep H04W in Telecommunications, both because Digital Communications is almost a "residual" class and because any mistake would be less noticeable if we dilute HO4W in a big class instead of a small one. Finally, by taking a look at examples of patents with H04W IPC code, their title do not seem to place them squarely into the Internet realm.

Sunday, December 6, 2009

Paris Reloaded

This is about Name Game workshop held in Paris last 25, 26 november.

The workshop has been organized by my roommate Francesco Lissoni in the frame of APE-INV whose goals are measuring the extent of academic patenting, and studying its determinants, in order to improve our understanding of university–industry relationships, and of their influence on academic researchers’ choice of scientific targets and norms of conduct.
So an important issue is building a database about inventors, and the first step consists in cleaning and standardizing the information on inventors we can derive from patent data.
The Name Game workshop aimed at convening as many researchers as possible with experience in this field, in order to allow them to share their methodologies and possibly reach a consensus on how to harmonize their future efforts.
It has been hosted by OST, the Observatoire des sciences et des techniques.

There eventually I had the chance of making a demi-decent presentation of the job done about EPO inventors in the last few years. If you want you can find it here

Monday, November 23, 2009

Patstat vs non patent citations (NPL)

For lovers of bibliometry patstat, along with patent citations, contains also non patent literature referenced by patents.

You con get those references from citations table (TLS212_CITATION) linking the table containing the fulltext of NPL citation (TLS214_NPL_PUBLN) by NPL_PUBLN_ID where NPL_CITN_SEQ_NR (the progressive for NPL citations) is different from 0

An issue on these tables is that TLS214 contains a lot of duplicates!!!

With an easy

select distinct trim(NPL_BIBLIO) from TLS214_NPL_PUBLN
I reduce the figures from 12.139.696 to 9.449.779 (a 23% less...))
[I added a trim cause many records start with a space]

maybe later on I'll post some SQL to deduplicate the data...

Wednesday, November 18, 2009

Removing non alfanumeric chars

Ok.
Let's figure out we have a list of addresses coming from a database (PATSTAT FI).
You can bet that two cities who are the same will be written in a different number of way, the longer the city name the highest the figure!
A part from already cited Levenshtein distance we may also need to get rid of all those interpunctions, slash and other wonderful chars that the keyboard gives us.

Let's go to Germany, for instance.
If you have

Frankfurt/Main
FRANKFURT MAIN
FRANKFURT-MAIN

How may you say they are the same city?
We would need a function (let's call it ALFANUM) who removes all chars except A-Z and 0-9 and puts the string in uppercase returning

FRANKFURTMAIN

for all.

Here it is my version that works with strings up to 500 chars (but you may change it...)

CREATE DEFINER=`root`@`localhost` FUNCTION `alfanum`(in1 varchar(500)) RETURNS varchar(500) CHARSET latin1
BEGIN
DECLARE a INT;
DECLARE b varCHAR(500);
DECLARE c varCHAR(500);
DECLARE d char(1);
SET b="";
SET c = in1 ;
ciclo: WHILE c <>"" DO
set d = left(c,1);
set d = ucase (d);
set c= right(c, length(c)-1);
if ( d >= "A" and d <= "Z") or (d>="0" and d <="9" ) then
set b = concat(b,d) ;
end if;
END WHILE ciclo;
return b;
END

Tuesday, November 17, 2009

Excel vs in cell line breaks

Recently, working on a EU Tender (Using performance indicators in monitoring the implementation of ICT research in FP6 and FP7) I had to transform into a DB a batch of reports (reporting publication titles, magazine and so on) whose original format was MS excel (may the hell swallow them!!!).
One (out of dozens) of the problems was that many cells were containing a line break (due to web cut & paste or in order to give 'em a pretty look) and I found a quick solution for removing them all from a given worksheet

STEP 1 select the area for substitution
STEP 2 press ALT+F11 (opens VB window)
STEP 3 press CTRL+G (opens IMMEDIATE frame)
STEP 4 paste the following: selection.replace chr(10)," " and press enter in the IMMEDIATE frame

Obviously sobstituting chr(10) or " " it works with anything you need to replace.
As a matter of fact I had also to replace doublespaces...!!!

STEP 5 selection.replace "  "," "

Friday, November 13, 2009

Leven-einstein

When having to compare 2 text strings (FI 2 addresses) and get a similarity score, Mr Levenshtein invented an algorithm that is very useful.

The bad news is that  MYSQL do not implement it.
Bytheway you may decide to implement it through a user define function (UDF)... but under windows it would mean a lot of compiling, storing and so on... (in reality I could not succeed but if you want to try see here at Sherlock software)

In LINUX it seems to be much easier (for ubuntu (in german) see http://www.teamarbyte.de/levenshtein.html)

So in my WIN environment I implemented, thanks to CODEJANITOR (see http://codejanitor.com/wp/2007/02/10/levenshtein-distance-as-a-mysql-stored-function/) this plain SQL function... much slower than a compiled one but that is life...

CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(255), s2 VARCHAR(255))
  RETURNS INT
    DETERMINISTIC
      BEGIN
        DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
        DECLARE s1_char CHAR;
        DECLARE cv0, cv1 VARBINARY(256);
        SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
        IF s1 = s2 THEN
          RETURN 0;
        ELSEIF s1_len = 0 THEN
          RETURN s2_len;
        ELSEIF s2_len = 0 THEN
          RETURN s1_len;
        ELSE
          WHILE j <= s2_len DO
            SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
          END WHILE;
          WHILE i <= s1_len DO
            SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
            WHILE j <= s2_len DO
                SET c = c + 1;
                IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
                IF c > c_temp THEN SET c = c_temp; END IF;
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
                IF c > c_temp THEN SET c = c_temp; END IF;
                SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
            END WHILE;
            SET cv1 = cv0, i = i + 1;
          END WHILE;
        END IF;
        RETURN c;
      END

To note:
    * Maximum length of input strings is 255 characters. The function should be edited to support more if needed.
    * I’ve tested it with international characters on a utf8_bin column and it seemed to work, but I’ve not tested that capability exstensively.
    * I’ve only tested it on MySQL 5.0+. No idea how it will work on versions less than that.

Thursday, November 12, 2009

free icons

ok. This is not much database or patent related... but who does not need some free icons for making nicer his beloved db oreinted applicantion?
Here you can find some good websites for downloading sets of icons for free...


http://www.freeiconsdownload.com/index.html  --> only PNG format
http://www.freeiconsweb.com/free_icons.html --> very very nice!
http://www.iconarchive.com/links/icon-tools.html --> icon tools (editors, creators, converters...)

Monday, November 9, 2009

From Patstat to mysql

Regardless the fact that EPO-Patstat DVDs contain a rich documentation about how to load the data in several formats (MS SQL, mainframe...) for MySQL the documentation is poor since it was written in 2006 and since that time many updates and new tables occurred.

I post here a link to some SQL I wrote in order to allow upload of tables into MYSQL. Especially some changes were made for:
- allowing new TLS219 and TLS218
- Updating data from TLS206ASCII (parsed person data) into a TLS206X

The data rely onto a library called PATSTAT and expect raw data to be loaded into a subdir called /RAW of PATSTAT Mysql library

Hope will be useful... you can download it from here

Let's start from...

the end.
Nowadays I work as a database administrator @ KITES (Knowledge, Internationalization and Technology Studies), a dept. caring of economics of innovation in Bocconi University.
One of my main contributes to Kites was creating a database about patents application at EPO and USPTO.
Mainly our db is based on PATSTAT data and developed in mysql. Soon I'm going to post (since it seems EPO do not provide them) some scripts useful for importing patstat in mysql.
Anyway, this blog is going to be a very technical one (so very boring for most of people) aiming first of all to help me to put order in the stuff I deal with. Eventually if some other people could advantage from it, that could be good.