Showing posts with label strings. Show all posts
Showing posts with label strings. Show all posts

Thursday, August 30, 2018

Online regular expression validators

a quick note on two useful on line tools allowing test and validate regular expressions.

First one is

https://regex101.com

that allows tests in multiple dialects (php, javascript, python...) and gives a verbose but useful explanation of the steps.

Last but not least it allows to share links to tests prepared like this one:

https://regex101.com/r/9y9n85/20

also useful is:
https://regexper.com/


that comes with the comforting subtitle
You thought you only had two problems…
and allows crreating diagrams about how a given regexp works

example:

 





https://regexper.com/#Results%20%3D%20%28%5Cd%2B%29 

Friday, March 12, 2010

a VBA function counting chars

It may happen to find in some data sources more than one occurence of a data type inside the same field, separated by a special char.
But in many cases that  char may be not so special and be misguiding when trasposing the field.
Let's take as example Thomson SDC database, where in the alliance name you find the alliance participants names separated by the / char.
Obviously the / char may also be integral part of some participant name; so we need to count the occurrence of the / char in order to compare it to the number of partipants indicated in the same record in order to do futher cleaning, if needed.

So I wrote a small DBA function called CHARCOUNT, and I'm posting it here

Function charcount(inputstring As String, findstring As String)
' return the number of occurrences of findstring in inputstring

Dim a$, b$
Dim c, i As Integer

a$ = inputstring
b$ = findstring

c = 0
For i = 1 To Len(a$) - Len(b$) + 1
    If Mid(a$, i, Len(b$)) = b$ Then
        c = c + 1
    End If
Next

charcount = c

End Function
Function charcount(inputstring As String, findstring As String)
' return the number of occurrences of findstring in inputstring

Dim a$, b$
Dim c, i As Integer

a$ = inputstring
b$ = findstring

c = 0
For i = 1 To Len(a$) - Len(b$) + 1
    If Mid(a$, i, Len(b$)) = b$ Then
        c = c + 1
    End If
Next

charcount = c

End Function

Wednesday, November 18, 2009

Removing non alfanumeric chars

Ok.
Let's figure out we have a list of addresses coming from a database (PATSTAT FI).
You can bet that two cities who are the same will be written in a different number of way, the longer the city name the highest the figure!
A part from already cited Levenshtein distance we may also need to get rid of all those interpunctions, slash and other wonderful chars that the keyboard gives us.

Let's go to Germany, for instance.
If you have

Frankfurt/Main
FRANKFURT MAIN
FRANKFURT-MAIN

How may you say they are the same city?
We would need a function (let's call it ALFANUM) who removes all chars except A-Z and 0-9 and puts the string in uppercase returning

FRANKFURTMAIN

for all.

Here it is my version that works with strings up to 500 chars (but you may change it...)

CREATE DEFINER=`root`@`localhost` FUNCTION `alfanum`(in1 varchar(500)) RETURNS varchar(500) CHARSET latin1
BEGIN
DECLARE a INT;
DECLARE b varCHAR(500);
DECLARE c varCHAR(500);
DECLARE d char(1);
SET b="";
SET c = in1 ;
ciclo: WHILE c <>"" DO
set d = left(c,1);
set d = ucase (d);
set c= right(c, length(c)-1);
if ( d >= "A" and d <= "Z") or (d>="0" and d <="9" ) then
set b = concat(b,d) ;
end if;
END WHILE ciclo;
return b;
END

Friday, November 13, 2009

Leven-einstein

When having to compare 2 text strings (FI 2 addresses) and get a similarity score, Mr Levenshtein invented an algorithm that is very useful.

The bad news is that  MYSQL do not implement it.
Bytheway you may decide to implement it through a user define function (UDF)... but under windows it would mean a lot of compiling, storing and so on... (in reality I could not succeed but if you want to try see here at Sherlock software)

In LINUX it seems to be much easier (for ubuntu (in german) see http://www.teamarbyte.de/levenshtein.html)

So in my WIN environment I implemented, thanks to CODEJANITOR (see http://codejanitor.com/wp/2007/02/10/levenshtein-distance-as-a-mysql-stored-function/) this plain SQL function... much slower than a compiled one but that is life...

CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(255), s2 VARCHAR(255))
  RETURNS INT
    DETERMINISTIC
      BEGIN
        DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
        DECLARE s1_char CHAR;
        DECLARE cv0, cv1 VARBINARY(256);
        SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
        IF s1 = s2 THEN
          RETURN 0;
        ELSEIF s1_len = 0 THEN
          RETURN s2_len;
        ELSEIF s2_len = 0 THEN
          RETURN s1_len;
        ELSE
          WHILE j <= s2_len DO
            SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
          END WHILE;
          WHILE i <= s1_len DO
            SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
            WHILE j <= s2_len DO
                SET c = c + 1;
                IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
                IF c > c_temp THEN SET c = c_temp; END IF;
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
                IF c > c_temp THEN SET c = c_temp; END IF;
                SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
            END WHILE;
            SET cv1 = cv0, i = i + 1;
          END WHILE;
        END IF;
        RETURN c;
      END

To note:
    * Maximum length of input strings is 255 characters. The function should be edited to support more if needed.
    * I’ve tested it with international characters on a utf8_bin column and it seemed to work, but I’ve not tested that capability exstensively.
    * I’ve only tested it on MySQL 5.0+. No idea how it will work on versions less than that.