For the visualization of votes in the Bundestag I had to read in handwritten protocols of the sessions. These are unfortunately studded with typos, which is why I had to deal with several versions of one name. Because I wanted a quick solution and the effort was reasonable, I just took care of it manually. But for larger projects this approach would not be feasible. So I had a look at what R offers for fuzzy string matching beyond good ol' Levenshtein distance and came across a rather new package answering to the name of "stringdist", maintained by Mark van der Loo. To my pleasant surprise it offers not two, not three, but a whole variety of configurable algorithms for that purpose. But I had no idea what, for example, the effective difference between a Jaccard distance and a cosine distance is. So I played around a bit with them and finally came up with the idea of something like a slope graph showing the distances for alterations of one string – in this case "Cosmo Kramer" – just to get started and to get an idea of what's going on and how different algorithms are affected by certain alterations.
The slope graph and a few observations
The colors serve the purpose of categorizing the alterations: typo, conventional variation, unconventional variation and totally different.
I introduced the red category to get an idea of where to expect the boundary between "could be considered the same" and "is definitely something different". An interesting observation is that all algorithms manage to keep the typos separate from the red zone, which is what you would intuitively expect from a reasonable string distance algorithm.
Also note how q-gram, Jaccard and cosine distance lead to virtually the same ordering for q in {2,3}, differing only in the scaled distance value. For q=1 those algorithms are obviously indifferent to permutations. Jaro-Winkler again seems to care little about characters being interspersed, placed randomly or missing, as long as the target word's characters are present in the correct order.
The different algorithms provided by stringdist
Hamming distance: Number of positions at which the symbols of both strings differ. Only defined for strings of equal length.
distance('abcdd', 'abbcd') = 2
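As a sanity check, the definition amounts to one line of Python (a from-scratch sketch, independent of the R package):

```python
def hamming(a, b):
    # Hamming distance is only defined for strings of equal length
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    # count the positions at which the symbols differ
    return sum(x != y for x, y in zip(a, b))

print(hamming("karolin", "kathrin"))  # -> 3 (positions 3, 4 and 5 differ)
```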
Levenshtein distance: Minimal number of insertions, deletions and replacements needed for transforming string a into string b.
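The textbook dynamic program for this minimal edit count can be sketched in a few lines of Python (for illustration only; stringdist uses its own C implementation):

```python
def levenshtein(a, b):
    # prev[j] holds the distance between a[:i-1] and b[:j];
    # each row of the DP table is built from the previous one
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (x != y),   # replacement (free on match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # -> 3
```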
(Full) Damerau-Levenshtein distance: Like Levenshtein distance, but transposition of adjacent symbols is allowed.
Optimal String Alignment / restricted Damerau-Levenshtein distance: Like (full) Damerau-Levenshtein distance but each substring may only be edited once.
> stringdist('ab', 'bxa', method = 'osa')
[1] 3
> stringdist('ab', 'bxa', method = 'dl')
[1] 2
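The difference stems from the restriction: OSA may transpose 'ab' to 'ba', but then may not insert the 'x' into the already-edited substring. A minimal Python sketch of the restricted (OSA) dynamic program – the extra recurrence case at the end handles the transposition:

```python
def osa(a, b):
    # restricted Damerau-Levenshtein: Levenshtein plus transposition of
    # adjacent symbols, but no substring may be edited twice
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # replacement
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

print(osa("ab", "bxa"))  # -> 3, matching the 'osa' result above
print(osa("ab", "ba"))   # -> 1, a single transposition
```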
Longest Common Substring distance: Minimum number of symbols that have to be removed in both strings until resulting substrings are identical.
distance('ABCvDEx', 'xABCyzDE') = 5
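This deletion count equals len(a) + len(b) minus twice the length of the longest common subsequence (the maximal part that may remain in both strings), which makes for a compact sketch – here in Python, for illustration:

```python
def lcs_distance(a, b):
    # longest common subsequence length via dynamic programming
    L = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    # symbols that must be removed from both strings until the
    # remaining (common) parts are identical
    return len(a) + len(b) - 2 * L[-1][-1]

print(lcs_distance("ABCvDEx", "xABCyzDE"))  # -> 5 (common part: "ABCDE")
```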
q-gram distance: Sum of absolute differences between N-gram vectors of both strings.
> qgrams('abcde', 'abdcde', q = 2)
   ab bc cd de dc bd
V1  1  1  1  1  0  0
V2  1  0  1  1  1  1
> stringdist('abcde', 'abdcde', method = 'qgram', q = 2)
[1] 3
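The same number can be reproduced from scratch; a minimal Python sketch (independent of the R package) using a counter of q-grams:

```python
from collections import Counter

def qgram_distance(a, b, q=2):
    # count the q-grams of each string and sum the absolute
    # differences of the counts over all observed q-grams
    ga = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    gb = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    return sum(abs(ga[g] - gb[g]) for g in set(ga) | set(gb))

print(qgram_distance("abcde", "abdcde", q=2))  # -> 3 (bc, dc and bd differ)
```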
Cosine distance: 1 minus the cosine similarity of both N-gram vectors.
> cos_sim <- function(a, b) {
+   sum(a*b) / (sqrt(sum(a*a)) * sqrt(sum(b*b)))
+ }
> a <- 'abcde'
> b <- 'abdcde'
> g <- qgrams(a, b, q = 2)
> 1 - cos_sim(g[1,], g[2,])
[1] 0.3291796
> stringdist(a, b, method = 'cosine', q = 2)
[1] 0.3291796
Jaccard distance: 1 minus the quotient of the number of shared N-grams and the number of all observed N-grams.
> qgrams('abcde', 'abdcde', q = 2)
   ab bc cd de dc bd
V1  1  1  1  1  0  0
V2  1  0  1  1  1  1
> stringdist('abcde', 'abdcde', method = 'jaccard', q = 2)
[1] 0.5     # = 1 - 3/6
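Because Jaccard compares the *sets* of q-grams – the counts do not matter – the 0.5 above can be reproduced with a couple of set operations; a minimal Python sketch:

```python
def jaccard_distance(a, b, q=2):
    # sets (not counts) of q-grams of both strings
    ga = {a[i:i + q] for i in range(len(a) - q + 1)}
    gb = {b[i:i + q] for i in range(len(b) - q + 1)}
    # 1 minus shared q-grams over all observed q-grams
    return 1 - len(ga & gb) / len(ga | gb)

print(jaccard_distance("abcde", "abdcde", q=2))  # -> 0.5 (3 shared of 6)
```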
Jaro distance: The Jaro distance is a formula of 4 values and effectively a special case of the Jaro-Winkler distance with p = 0.
Jaro-Winkler distance: This distance is a formula of five parameters determined by the two compared strings (A, B, m, t, l) and a scaling factor p chosen from [0, 0.25].
> a <- 'abcde'
> b <- 'abdcde'
A = 5 (length of a)
B = 6 (length of b)
m = 5 (number of shared symbols)
t = 1 (number of necessary transpositions of shared symbols)
> d <- function(A, B, m, t) {
+   1 - (1/3) * (m/A + m/B + (m-t)/m)
+ }
l = 2 (number of symbols at the beginning before the first mismatch; maximum value is 4)
> jw <- function(A, B, m, t, l, p) {
+   d(A, B, m, t) * (1 - l * p)
+ }
> jw(5, 6, 5, 1, 2, 0.1)
[1] 0.09777778
> stringdist(a, b, method = 'jw', p = 0.1)
[1] 0.09777778
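The parameters m and t can also be derived from the strings themselves. Here is a from-scratch Python sketch using one common way of determining matches within the search window (an illustration, not necessarily stringdist's exact implementation):

```python
def jaro_winkler(a, b, p=0.1):
    # characters match if equal and within the search window
    window = max(0, max(len(a), len(b)) // 2 - 1)
    used = [False] * len(b)
    matched_a, matched_b_idx = [], []
    for i, ch in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and b[j] == ch:
                used[j] = True
                matched_a.append(ch)
                matched_b_idx.append(j)
                break
    m = len(matched_a)
    if m == 0:
        return 1.0
    matched_b = [b[j] for j in sorted(matched_b_idx)]
    # t = half the number of matched characters that are out of order
    t = sum(x != y for x, y in zip(matched_a, matched_b)) // 2
    dj = 1 - (m / len(a) + m / len(b) + (m - t) / m) / 3
    # l = length of the common prefix, capped at 4
    l = 0
    while l < min(4, len(a), len(b)) and a[l] == b[l]:
        l += 1
    return dj * (1 - l * p)

print(round(jaro_winkler("abcde", "abdcde", p=0.1), 8))  # -> 0.09777778
```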
Meaningful quantification of the difference between two strings
Which string distance to use depends on the situation. If we want to compensate for typos, then the variations of the Levenshtein distance are of good use, because they take into account the three or four usual types of typos. The metric could be improved, for example, by factoring the keyboard layout into the calculation. On an English keyboard the distance between "test" and "rest" would then be smaller than the distance between "test" and "best", for obvious reasons. This would be a top-down assessment of a string metric. The bottom-up counterpart would be trying to quantify the question "What would a human being (me) consider similar?" and its answer. This is naturally tough to compute – but there is one case for which it is actually possible! Check this out:
Ins’t it fnnuy taht you can raed tihs steennce eevn tohguh leettsrs ecpext for the fsrit and lsat one are perumted?
So from a bottom-up perspective a good string metric would consider two strings very close if the first and last letters match and the letters in between are just permuted. You don't have to be a genius to tell from the above descriptions of the algorithms that none will perform exceptionally well here, and the ones that do are probably just indifferent to permutations altogether – but what the heck – I got curious how the metrics respond to permutations. One further aspect: even though human reading seems to be unimpressed by framed permutations, ambiguous cases might arise – "ecxept"/"except" versus "expcet"/"expect" – and then the Hamming distance would (maybe) determine the interpretation, which is why I chose it for the coloring in the following plot:
[Scatter plot: scaled string distances between "123456" and all of its permutations, colored by Hamming distance]
A few observations
I annotated some dots because they were sticking out. A Hamming distance of two but maximum distance – for q-gram, cosine and Jaccard distance with q=3 – that is interesting. Or the maximum distance for only one permutation, right next to the special case "abcdef" – for Jaro-Winkler. These cases can be considered something like "algorithmic blind spots".
Also worth noting is how for q-gram, cosine and Jaccard the number of permutations with the same Hamming distance per cluster is identical. I don't think this is obvious from the definition of the metrics.
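As an aside, the keyboard-layout idea from above could be prototyped as a weighted Levenshtein distance in which substituting neighboring keys is cheaper than substituting distant ones. A toy Python sketch – the adjacency table is a hand-picked QWERTY subset, purely for illustration:

```python
# toy subset of QWERTY adjacencies -- just enough for the example
ADJACENT = {("t", "r"), ("r", "t"), ("t", "y"), ("y", "t"),
            ("b", "v"), ("v", "b"), ("b", "n"), ("n", "b")}

def sub_cost(x, y):
    if x == y:
        return 0.0
    return 0.5 if (x, y) in ADJACENT else 1.0   # neighboring keys are cheaper

def weighted_levenshtein(a, b):
    # standard Levenshtein DP, but with a weighted substitution cost
    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        curr = [float(i)]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + sub_cost(x, y)))  # substitution
        prev = curr
    return prev[-1]

print(weighted_levenshtein("test", "rest"))  # -> 0.5 (t and r are neighbors)
print(weighted_levenshtein("test", "best"))  # -> 1.0 (t and b are not)
```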
The R code producing the distances for “Cosmo Kramer”
library(stringdist)

b <- c(
  "Cosmo Kramer", "Kosmo Kramer", "Comso Kramer", "Csmo Kramer",
  "Cosmo X. Kramer", "Kramer, Cosmo", "Jerry Seinfeld", " CKaemmoorrs",
  "Cosmer Kramo", "Kosmoo Karme", "George Costanza", "Elaine Benes",
  "Dr. Van Nostren", "remarK omsoC", "Mr. Kramer", "Sir Cosmo Kramer",
  "C.o.s.m.o. .K.r.a.m.e.r", "CsoKae", "Coso Kraer"
)
a <- rep("Cosmo Kramer", length(b))

M <- data.frame(
  m = c("osa", "lv", "dl", "lcs", "qgram", "qgram", "qgram",
        "cosine", "cosine", "cosine", "jaccard", "jaccard", "jaccard",
        "jw", "jw", "jw"),
  q = c(0, 0, 0, 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 0, 0, 0),
  p = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.2)
)

R <- apply(M, 1, function(m)
  stringdist(a, b, method = m["m"], q = m["q"], p = m["p"]))

R2 <- round(R, 3)
rownames(R2) <- paste(
  format(paste("'", a, "'", sep = ""), width = 14), " - ",
  format(paste("'", b, "'", sep = ""), width = 17), sep = ""
)
colnames(R2) <- M$m
write.table(R2, "clipboard", sep = "\t")
The R code for producing the permutations scatter plot
library(permute)
library(stringdist)
library(ggplot2)
library(reshape2)

M <- data.frame(
  m = c("hamming", "osa", "lv", "dl", "lcs", "qgram", "cosine", "jaccard",
        "qgram", "cosine", "jaccard", "qgram", "cosine", "jaccard",
        "jw", "jw", "jw"),
  q = c(0, 0, 0, 0, 0, 3, 3, 3, 2, 2, 2, 1, 1, 1, 0, 0, 0),
  p = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.2)
)

permsRaw <- allPerms(6, max = 10^7)
perms <- apply(permsRaw, 1, function(x) paste(c(x), collapse = ""))
perms <- c("123456", perms, "abcdef")

R <- apply(M, 1, function(m)
  stringdist("123456", perms, method = m["m"], q = m["q"], p = m["p"]))

# scale every column to [0,1] by its maximum; the Hamming column (used
# for the coloring) is kept unscaled
RColMax <- apply(R, 2, function(x) max(x))
RColMax[1] <- 1
R0 <- t(t(R) / RColMax)

dfR <- as.data.frame(R0)
colnames(dfR) <- paste(M$m, "q", M$q, "p", M$p, sep = "")
dfR$perm <- perms
dfR <- dfR[sample(nrow(dfR)), ]

dfR0 <- melt(dfR, id.vars = c('hammingq0p0', 'perm'))
dfR0 <- dfR0[order(-dfR0$hammingq0p0), ]

ggplot(dfR0) +
  geom_jitter(
    data = dfR0,
    aes(x = variable, y = value, col = factor(hammingq0p0)),
    alpha = .7, size = 1.7,
    position = position_jitter(height = .05)
  ) +
  scale_color_manual(values = c("#000000", "#ff0000", "#00ff00",
                                "#6BAED6", "#3182BD", "#08519C"))
(original article published on www.joyofdata.de)