For the visualization of votes in the Bundestag I had to read in handwritten protocols of the sessions. These are unfortunately studded with typos, which is why I had to deal with several versions of one name. Because I wanted a quick solution and the effort was reasonable, I just took care of it manually. But for larger projects this approach would not be feasible. So I had a look at what R offers for fuzzy string matching beyond good ol' Levenshtein distance and came across a rather new package answering to the name of "stringdist", maintained by Mark van der Loo. To my pleasant surprise it offers not two, not three, but a whole variety of configurable algorithms for that purpose. But I had no idea what, for example, the effective difference between a Jaccard distance and a cosine distance is. So I played around a bit with them and finally came up with the idea of something like a slope graph showing the distances for alterations of one string – in this case "Cosmo Kramer" – just to get started and to get an idea of what's going on and how different algorithms are affected by certain alterations.
The slope graph and a few observations
The colors serve the purpose of categorizing the alterations: typo, conventional variation, unconventional variation and totally different.
I introduced the red category to get an idea of where to expect the boundary between "could be considered the same" and "is definitely something different". An interesting observation is that all algorithms manage to keep the typos separate from the red zone, which is what you would intuitively expect from a reasonable string distance algorithm.
Also note how q-gram, Jaccard and cosine distance lead to virtually the same ordering for q in {2,3}, differing only in the scaled distance value. For q=1 those algorithms are obviously indifferent to permutations. Jaro-Winkler again seems to care little about characters being interspersed, placed randomly or missing, as long as the target word's characters are present in the correct order.
The different algorithms provided by stringdist
Hamming distance: Number of positions at which the symbols of both strings differ. Only defined for strings of equal length.
distance('abcdd', 'abbcd') = 2
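As a sanity check, the definition amounts to one line of Python (a from-scratch sketch, independent of the R package):

```python
def hamming(a, b):
    # Hamming distance is only defined for strings of equal length
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    # count the positions at which the symbols differ
    return sum(x != y for x, y in zip(a, b))

print(hamming("karolin", "kathrin"))  # -> 3 (positions 3, 4 and 5 differ)
```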
Levenshtein distance: Minimal number of insertions, deletions and replacements needed for transforming string a into string b.
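The textbook dynamic program for this minimal edit count can be sketched in a few lines of Python (for illustration only; stringdist uses its own C implementation):

```python
def levenshtein(a, b):
    # prev[j] holds the distance between a[:i-1] and b[:j];
    # each row of the DP table is built from the previous one
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (x != y),   # replacement (free on match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # -> 3
```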
(Full) Damerau-Levenshtein distance: Like Levenshtein distance, but transposition of adjacent symbols is allowed.
Optimal String Alignment / restricted Damerau-Levenshtein distance: Like (full) Damerau-Levenshtein distance but each substring may only be edited once.
> stringdist('ab', 'bxa', method = 'osa')
[1] 3
> stringdist('ab', 'bxa', method = 'dl')
[1] 2
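The difference stems from the restriction: OSA may transpose 'ab' to 'ba', but then may not insert the 'x' into the already-edited substring. A minimal Python sketch of the restricted (OSA) dynamic program – the extra recurrence case at the end handles the transposition:

```python
def osa(a, b):
    # restricted Damerau-Levenshtein: Levenshtein plus transposition of
    # adjacent symbols, but no substring may be edited twice
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # replacement
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

print(osa("ab", "bxa"))  # -> 3, matching the 'osa' result above
print(osa("ab", "ba"))   # -> 1, a single transposition
```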
Longest Common Substring distance: Minimum number of symbols that have to be removed in both strings until resulting substrings are identical.
distance('ABCvDEx', 'xABCyzDE') = 5
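This deletion count equals len(a) + len(b) minus twice the length of the longest common subsequence (the maximal part that may remain in both strings), which makes for a compact sketch – here in Python, for illustration:

```python
def lcs_distance(a, b):
    # longest common subsequence length via dynamic programming
    L = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    # symbols that must be removed from both strings until the
    # remaining (common) parts are identical
    return len(a) + len(b) - 2 * L[-1][-1]

print(lcs_distance("ABCvDEx", "xABCyzDE"))  # -> 5 (common part: "ABCDE")
```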
q-gram distance: Sum of absolute differences between N-gram vectors of both strings.
> qgrams('abcde', 'abdcde', q = 2)
   ab bc cd de dc bd
V1  1  1  1  1  0  0
V2  1  0  1  1  1  1
> stringdist('abcde', 'abdcde', method = 'qgram', q = 2)
[1] 3
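The same number can be reproduced from scratch; a minimal Python sketch (independent of the R package) using a counter of q-grams:

```python
from collections import Counter

def qgram_distance(a, b, q=2):
    # count the q-grams of each string and sum the absolute
    # differences of the counts over all observed q-grams
    ga = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    gb = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    return sum(abs(ga[g] - gb[g]) for g in set(ga) | set(gb))

print(qgram_distance("abcde", "abdcde", q=2))  # -> 3 (bc, dc and bd differ)
```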
Cosine distance: 1 minus the cosine similarity of both N-gram vectors.
> cos_sim <- function(a, b) {
+   sum(a*b) / (sqrt(sum(a*a)) * sqrt(sum(b*b)))
+ }
> a <- 'abcde'
> b <- 'abdcde'
> g <- qgrams(a, b, q = 2)
> 1 - cos_sim(g[1,], g[2,])
[1] 0.3291796
> stringdist(a, b, method = 'cosine', q = 2)
[1] 0.3291796
Jaccard distance: 1 minus the quotient of the number of shared N-grams and the number of all observed N-grams.
> qgrams('abcde', 'abdcde', q = 2)
   ab bc cd de dc bd
V1  1  1  1  1  0  0
V2  1  0  1  1  1  1
> stringdist('abcde', 'abdcde', method = 'jaccard', q = 2)
[1] 0.5     # = 1 - 3/6
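Because Jaccard compares the *sets* of q-grams – the counts do not matter – the 0.5 above can be reproduced with a couple of set operations; a minimal Python sketch:

```python
def jaccard_distance(a, b, q=2):
    # sets (not counts) of q-grams of both strings
    ga = {a[i:i + q] for i in range(len(a) - q + 1)}
    gb = {b[i:i + q] for i in range(len(b) - q + 1)}
    # 1 minus shared q-grams over all observed q-grams
    return 1 - len(ga & gb) / len(ga | gb)

print(jaccard_distance("abcde", "abdcde", q=2))  # -> 0.5 (3 shared of 6)
```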
Jaro distance: The Jaro distance is a formula of 4 values and effectively a special case of the Jaro-Winkler distance with p = 0.
Jaro-Winkler distance: This distance is a formula of five parameters determined by the two compared strings (A, B, m, t, l) and a scaling factor p chosen from [0, 0.25].
> a <- 'abcde'
> b <- 'abdcde'
A = 5 (length of a)
B = 6 (length of b)
m = 5 (number of shared symbols)
t = 1 (number of necessary transpositions of shared symbols)
> d <- function(A, B, m, t) {
+   1 - (1/3) * (m/A + m/B + (m-t)/m)
+ }
l = 2 (number of symbols at the beginning before the first mismatch; maximum value is 4)
> jw <- function(A, B, m, t, l, p) {
+   d(A, B, m, t) * (1 - l * p)
+ }
> jw(5, 6, 5, 1, 2, 0.1)
[1] 0.09777778
> stringdist(a, b, method = 'jw', p = 0.1)
[1] 0.09777778
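The parameters m and t can also be derived from the strings themselves. Here is a from-scratch Python sketch using one common way of determining matches within the search window (an illustration, not necessarily stringdist's exact implementation):

```python
def jaro_winkler(a, b, p=0.1):
    # characters match if equal and within the search window
    window = max(0, max(len(a), len(b)) // 2 - 1)
    used = [False] * len(b)
    matched_a, matched_b_idx = [], []
    for i, ch in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and b[j] == ch:
                used[j] = True
                matched_a.append(ch)
                matched_b_idx.append(j)
                break
    m = len(matched_a)
    if m == 0:
        return 1.0
    matched_b = [b[j] for j in sorted(matched_b_idx)]
    # t = half the number of matched characters that are out of order
    t = sum(x != y for x, y in zip(matched_a, matched_b)) // 2
    dj = 1 - (m / len(a) + m / len(b) + (m - t) / m) / 3
    # l = length of the common prefix, capped at 4
    l = 0
    while l < min(4, len(a), len(b)) and a[l] == b[l]:
        l += 1
    return dj * (1 - l * p)

print(round(jaro_winkler("abcde", "abdcde", p=0.1), 8))  # -> 0.09777778
```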
Meaningful quantification of the difference between two strings
Which string distance to use depends on the situation. If we want to compensate for typos, then the variations of the Levenshtein distance are of good use, because they take into account the three or four usual types of typos. The metric could be improved, for example, by factoring the keyboard layout into the calculation. On an English keyboard the distance between "test" and "rest" would then be smaller than the distance between "test" and "best", for obvious reasons. This would be a top-down assessment of a string metric. The bottom-up counterpart would be trying to quantify the question "What would a human being (me) consider similar?" and its answer. This is naturally tough to compute – but there is one case for which it is actually possible! Check this out:
Ins’t it fnnuy taht you can raed tihs steennce eevn tohguh leettsrs ecpext for the fsrit and lsat one are perumted?
So from a bottom-up perspective a good string metric would consider two strings very close if the first and last letters match and the letters in between are just permuted. You don't have to be a genius to tell from the above descriptions of the algorithms that none will perform exceptionally well here, and the ones that do are probably just indifferent to permutations altogether – but what the heck – I got curious how the metrics respond to permutations. One further aspect: even though human reading seems to be unimpressed by framed permutations, ambiguous cases might arise – "ecxept"/"except" versus "expcet"/"expect" – and then the Hamming distance would (maybe) determine the interpretation, which is why I chose it for the coloring in the following plot:
[Scatter plot: scaled string distances between "123456" and all of its permutations, colored by Hamming distance]
A few observations
I annotated some dots because they were sticking out. A Hamming distance of two but maximum distance – for q-gram, cosine and Jaccard distance with q=3 – that is interesting. Or the maximum distance for only one permutation, right next to the special case "abcdef" – for Jaro-Winkler. These cases can be considered something like "algorithmic blind spots".
Also worth noting is how for q-gram, cosine and Jaccard the number of permutations with the same Hamming distance per cluster is identical. I don't think this is obvious from the definition of the metrics.
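As an aside, the keyboard-layout idea from above could be prototyped as a weighted Levenshtein distance in which substituting neighboring keys is cheaper than substituting distant ones. A toy Python sketch – the adjacency table is a hand-picked QWERTY subset, purely for illustration:

```python
# toy subset of QWERTY adjacencies -- just enough for the example
ADJACENT = {("t", "r"), ("r", "t"), ("t", "y"), ("y", "t"),
            ("b", "v"), ("v", "b"), ("b", "n"), ("n", "b")}

def sub_cost(x, y):
    if x == y:
        return 0.0
    return 0.5 if (x, y) in ADJACENT else 1.0   # neighboring keys are cheaper

def weighted_levenshtein(a, b):
    # standard Levenshtein DP, but with a weighted substitution cost
    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        curr = [float(i)]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + sub_cost(x, y)))  # substitution
        prev = curr
    return prev[-1]

print(weighted_levenshtein("test", "rest"))  # -> 0.5 (t and r are neighbors)
print(weighted_levenshtein("test", "best"))  # -> 1.0 (t and b are not)
```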
The R code producing the distances for “Cosmo Kramer”
library(stringdist)

b <- c(
  "Cosmo Kramer", "Kosmo Kramer", "Comso Kramer", "Csmo Kramer",
  "Cosmo X. Kramer", "Kramer, Cosmo", "Jerry Seinfeld", " CKaemmoorrs",
  "Cosmer Kramo", "Kosmoo Karme", "George Costanza", "Elaine Benes",
  "Dr. Van Nostren", "remarK omsoC", "Mr. Kramer", "Sir Cosmo Kramer",
  "C.o.s.m.o. .K.r.a.m.e.r", "CsoKae", "Coso Kraer"
)
a <- rep("Cosmo Kramer", length(b))

M <- data.frame(
  m = c("osa", "lv", "dl", "lcs", "qgram", "qgram", "qgram",
        "cosine", "cosine", "cosine", "jaccard", "jaccard", "jaccard",
        "jw", "jw", "jw"),
  q = c(0, 0, 0, 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 0, 0, 0),
  p = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.2)
)

R <- apply(M, 1, function(m)
  stringdist(a, b, method = m["m"], q = m["q"], p = m["p"]))

R2 <- round(R, 3)
rownames(R2) <- paste(
  format(paste("'", a, "'", sep = ""), width = 14), " - ",
  format(paste("'", b, "'", sep = ""), width = 17), sep = ""
)
colnames(R2) <- M$m
write.table(R2, "clipboard", sep = "\t")
The R code for producing the permutations scatter plot
library(permute)
library(stringdist)
library(ggplot2)
library(reshape2)

M <- data.frame(
  m = c("hamming", "osa", "lv", "dl", "lcs", "qgram", "cosine", "jaccard",
        "qgram", "cosine", "jaccard", "qgram", "cosine", "jaccard",
        "jw", "jw", "jw"),
  q = c(0, 0, 0, 0, 0, 3, 3, 3, 2, 2, 2, 1, 1, 1, 0, 0, 0),
  p = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.2)
)

permsRaw <- allPerms(6, max = 10^7)
perms <- apply(permsRaw, 1, function(x) paste(c(x), collapse = ""))
perms <- c("123456", perms, "abcdef")

R <- apply(M, 1, function(m)
  stringdist("123456", perms, method = m["m"], q = m["q"], p = m["p"]))

# scale every column to [0,1] by its maximum; the Hamming column (used
# for the coloring) is kept unscaled
RColMax <- apply(R, 2, function(x) max(x))
RColMax[1] <- 1
R0 <- t(t(R) / RColMax)

dfR <- as.data.frame(R0)
colnames(dfR) <- paste(M$m, "q", M$q, "p", M$p, sep = "")
dfR$perm <- perms
dfR <- dfR[sample(nrow(dfR)), ]

dfR0 <- melt(dfR, id.vars = c('hammingq0p0', 'perm'))
dfR0 <- dfR0[order(-dfR0$hammingq0p0), ]

ggplot(dfR0) +
  geom_jitter(
    data = dfR0,
    aes(x = variable, y = value, col = factor(hammingq0p0)),
    alpha = .7, size = 1.7,
    position = position_jitter(height = .05)
  ) +
  scale_color_manual(values = c("#000000", "#ff0000", "#00ff00",
                                "#6BAED6", "#3182BD", "#08519C"))
(original article published on www.joyofdata.de)