Channel: joy of data » text mining
Browsing all 5 articles

Comparison of String Distance Algorithms

To visualize votes in the Bundestag I had to read in handwritten protocols of the sessions. Unfortunately these are studded with typos, which is why I had to deal with different versions...
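
The teaser does not say which distance measures the article compares. As a rough illustration of what a string distance does with typo-ridden names, here is a minimal sketch of the Levenshtein edit distance, chosen here as an assumption because it is a widely known measure, not because the article confirms it. The function name `levenshtein` is made up for this example.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]
```

A small distance between two spellings suggests they are variants of the same name; where to set the threshold depends on the data.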

Segmenting a Text Document using the Idea of a Cellular Automata

The German parliament publishes protocols for each of its sessions. That is a lot of data waiting to be processed. The protocols are published in the form of text files and PDFs. The published text files are...

The tf-idf-Statistic For Keyword Extraction

The tf-idf-statistic (“term frequency – inverse document frequency”) is a common tool for extracting keywords from a document by considering not just that single document but all documents...
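
To make the idea concrete, here is a minimal sketch of one common tf-idf variant: relative term frequency multiplied by the natural log of (number of documents / number of documents containing the term). The article may use a different weighting; the function names `tf_idf` and `top_keywords` and the assumption that documents are already tokenized are choices made for this example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    documents is a list of token lists, e.g. [["the", "vote"], ...].
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_keywords(documents, doc_index, k=3):
    """The k highest-scoring terms of one document (ties broken alphabetically)."""
    ranked = sorted(tf_idf(documents)[doc_index].items(),
                    key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]
```

A term that occurs in every document scores zero under this weighting, which is why looking at the whole collection pushes common words out of the keyword list.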

Using the Linux Shell for Web Scraping

Let’s assume we want to scrape the “Most Popular in News” box from bbc.com. What we need first is a CSS selector to locate what we are interested in. In this case it is simply a div tag with the ID...

A Guide on OCR with tesseract 3.03

Tesseract is tough … so tough indeed, even Chuck Norris would have to check the manual twice. Not kidding you. Okay, so this article aims at structuring what I needed to learn about tesseract to...
