Comparison of String Distance Algorithms
To visualize votes in the Bundestag, I had to read in handwritten protocols of the sessions. These are unfortunately studded with typos, which is why I had to deal with different versions...
Segmenting a Text Document using the Idea of a Cellular Automata
The German parliament publishes protocols for each of its sessions. That is a lot of data waiting to be processed. The protocols are published in the form of text files and PDFs. The published text files are...
The tf-idf-Statistic For Keyword Extraction
The tf-idf-statistic (“term frequency – inverse document frequency”) is a common tool for the purpose of extracting keywords from a document by not just considering a single document but all documents...
Using the Linux Shell for Web Scraping
Let’s assume we want to scrape the “Most Popular in News” box from bbc.com. What we need first is a CSS selector to locate what we are interested in. In this case it is simply a div tag with the ID...
A Guide on OCR with tesseract 3.03
Tesseract is tough … so tough indeed, even Chuck Norris would have to check the manual twice. Not kidding you. Okay, so this article aims at structuring what I needed to learn about tesseract to...