When you need to compare two text strings (for instance, two addresses) and get a similarity score, the algorithm invented by Mr Levenshtein is very useful. It counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other.
The bad news is that MySQL does not implement it.
By the way, you may decide to implement it through a user-defined function (UDF), but under Windows that would mean a lot of compiling, storing and so on (in fact I did not succeed, but if you want to try, see here at Sherlock software).
On Linux it seems to be much easier (for Ubuntu, in German, see http://www.teamarbyte.de/levenshtein.html).
So in my Windows environment I implemented, thanks to CODEJANITOR (see http://codejanitor.com/wp/2007/02/10/levenshtein-distance-as-a-mysql-stored-function/), this plain SQL function. It is much slower than a compiled one, but that is life.
-- DELIMITER is a mysql client command: it lets the semicolons inside the body pass through.
DELIMITER $$
CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(255), s2 VARCHAR(255))
RETURNS INT
DETERMINISTIC
BEGIN
  DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
  DECLARE s1_char CHAR;
  -- cv1 is the previous row of the distance matrix, cv0 the row being built (one byte per cell)
  DECLARE cv0, cv1 VARBINARY(256);
  SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
  IF s1 = s2 THEN
    RETURN 0;
  ELSEIF s1_len = 0 THEN
    RETURN s2_len;
  ELSEIF s2_len = 0 THEN
    RETURN s1_len;
  ELSE
    -- First row: distance from the empty string to each prefix of s2 (0, 1, 2, ...)
    WHILE j <= s2_len DO
      SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
    END WHILE;
    WHILE i <= s1_len DO
      SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
      WHILE j <= s2_len DO
        -- Start from the cell to the left plus one
        SET c = c + 1;
        IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
        -- Diagonal cell plus substitution cost
        SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
        IF c > c_temp THEN SET c = c_temp; END IF;
        -- Cell above plus one
        SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
        IF c > c_temp THEN SET c = c_temp; END IF;
        SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
      END WHILE;
      SET cv1 = cv0, i = i + 1;
    END WHILE;
  END IF;
  RETURN c;
END$$
DELIMITER ;
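Once the function is created, a quick sanity check is to call it on a well-known pair. The sketch below also turns the distance into a similarity score between 0 and 1 by dividing it by the length of the longer string. That normalization is my own assumption (one common convention), not something the function above provides, and the sample addresses are made up.

```sql
-- Classic test pair: kitten -> sitting needs 2 substitutions and 1 insertion.
SELECT LEVENSHTEIN('kitten', 'sitting') AS distance;  -- 3

-- Hypothetical normalization: 1 means identical, 0 means nothing in common.
-- If both strings are empty the division is 0/0 and MySQL returns NULL.
SELECT 1 - LEVENSHTEIN(a, b) / GREATEST(CHAR_LENGTH(a), CHAR_LENGTH(b)) AS similarity
FROM (SELECT '12 Main Street' AS a, '12 Main St.' AS b) AS t;
```

Remember that the result is a distance: the lower the number, the closer the strings.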
Notes:
* Maximum length of input strings is 255 characters. The function should be edited to support more if needed.
* I’ve tested it with international characters on a utf8_bin column and it seemed to work, but I’ve not tested that capability extensively.
* I’ve only tested it on MySQL 5.0+. No idea how it will work on earlier versions.
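Since the stored function is slow, it may help to call it on as few pairs as possible. One cheap trick relies on the fact that the Levenshtein distance can never be smaller than the difference in length between the two strings, so pairs whose lengths differ by more than the threshold can be skipped without calling the function. The sketch below assumes a hypothetical table addresses(id, addr) and an arbitrary threshold of 3; I have not benchmarked it.

```sql
-- Hypothetical table: addresses(id INT, addr VARCHAR(255)).
-- List pairs of addresses that are at most 3 edits apart.
SELECT a.id AS id_a,
       b.id AS id_b,
       LEVENSHTEIN(a.addr, b.addr) AS distance
FROM addresses AS a
JOIN addresses AS b
  ON a.id < b.id                                           -- each unordered pair once
 AND ABS(CHAR_LENGTH(a.addr) - CHAR_LENGTH(b.addr)) <= 3   -- cheap lower bound on the distance
HAVING distance <= 3                                       -- MySQL allows HAVING on a select alias
ORDER BY distance;
```

The join is still quadratic in the number of rows, so on a big table you would probably want a further restriction (same city or postcode, for instance) before comparing.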