Wednesday, November 18, 2009

Removing non alfanumeric chars

Ok.
Let's figure out we have a list of addresses coming from a database (PATSTAT FI).
You can bet that two cities who are the same will be written in a different number of way, the longer the city name the highest the figure!
A part from already cited Levenshtein distance we may also need to get rid of all those interpunctions, slash and other wonderful chars that the keyboard gives us.

Let's go to Germany, for instance.
If you have

Frankfurt/Main
FRANKFURT MAIN
FRANKFURT-MAIN

How may you say they are the same city?
We would need a function (let's call it ALFANUM) who removes all chars except A-Z and 0-9 and puts the string in uppercase returning

FRANKFURTMAIN

for all.

Here it is my version that works with strings up to 500 chars (but you may change it...)

CREATE DEFINER=`root`@`localhost` FUNCTION `alfanum`(in1 varchar(500)) RETURNS varchar(500) CHARSET latin1
BEGIN
DECLARE a INT;
DECLARE b varCHAR(500);
DECLARE c varCHAR(500);
DECLARE d char(1);
SET b="";
SET c = in1 ;
ciclo: WHILE c <>"" DO
set d = left(c,1);
set d = ucase (d);
set c= right(c, length(c)-1);
if ( d >= "A" and d <= "Z") or (d>="0" and d <="9" ) then
set b = concat(b,d) ;
end if;
END WHILE ciclo;
return b;
END

No comments:

Post a Comment