Free-text fields can be a very informative data point, but they are more difficult to analyze. Typically, someone will read the text, manually attempt to classify the topic, gauge the overall sentiment, and then forward it on to the proper channels. The first step people usually attempt with text mining is building a word cloud. This isn't very insightful because it only shows single-word frequencies. The next level of analysis is an n-gram analysis or sentiment analysis.
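As a quick illustration (a hypothetical Python sketch, not part of the KNIME workflow), an n-gram count is just a frequency table over sliding windows of tokens:

```python
from collections import Counter
import re

def ngrams(text, n=2):
    """Tokenize lowercase words and count n-grams (n=2 gives bigrams)."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

counts = ngrams("the cat sat on the mat and the cat slept", n=2)
print(counts.most_common(3))  # most frequent bigrams first
```

A bigram table like this already tells you more than a word cloud, since "not good" and "very good" stop collapsing into the same bucket.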
Another analytical method is called "Topic Discovery." This method attempts to discover which topics are being talked about in the text and then assigns each document a probability across all topics. KNIME uses Latent Dirichlet Allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation). I will not go into much detail on how it works (there is a good "layman's" explanation here: http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/), but this method provides a way for users to automatically classify documents into topics (or "subjects," "categories," etc.). This is a very useful procedure for business.
In KNIME, this method requires the user to specify two important parameters. The first is the number of topics. This is arbitrary, and it is a little unsettling for business users to just pluck a number out of the air ("Why did you choose 20 topics?" "Because 20 is my lucky number!"), so you should review the results and experiment with the data. There does exist a Bayesian way of determining an optimal number of topics (http://cpsievert.github.io/projects/615/xkcd/). Another aspect that is difficult in a business setting is that this method is backwards-looking and may not handle new responses well. Let's say you run the LDA model on past data; then something changes in the future and completely unseen topics start appearing. LDA will still apportion them to the topics it saw in the past and will not generate a new topic.
More topics => smaller scope => fewer documents per topic
Fewer topics => more overlap => more documents per topic
The second parameter is the number of keywords. You should vary this depending on the number of words in your documents. Are these one-sentence responses? Newspaper articles? Books? The size of the text in your documents should drive the number of keywords, and you will have to experiment with this. If you choose 10 keywords on a sentence that gets processed down to 5 words, you will end up with overlapping topics.
There are two other parameters, an alpha and a beta. Use the defaults unless you know what you're doing. More info is available here (http://stats.stackexchange.com/questions/37405/natural-interpretation-for-lda-hyperparameters). To quote a user from the thread linked above:
A low alpha value puts less such constraints on documents and means that it is more likely that a document may contain mixture of just a few, or even only one, of the topics. Likewise, a high beta-value means that each topic is likely to contain a mixture of most of the words, and not any word specifically, while a low value means that a topic may contain a mixture of just a few of the words.
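One rough way to build intuition for alpha (a stdlib-only Python sketch, nothing to do with the KNIME configuration itself): a document's topic mixture is a draw from a symmetric Dirichlet, which can be constructed by normalizing Gamma draws. A low alpha tends to concentrate the mass on a few topics; a high alpha spreads it out.

```python
import random

def dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics
    via the standard Gamma-normalization construction."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(42)
low = dirichlet(0.1, 5, rng)    # low alpha: mass tends to pile onto few topics
high = dirichlet(10.0, 5, rng)  # high alpha: mass tends to spread out evenly
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

Running this a few times with different seeds makes the sparse-versus-uniform behavior described in the quote above easy to see.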
Here is how to build a simple topic model using KNIME. A dump of the original data can be found here.
- Assuming you already have KNIME, the first step is to add the text mining module. Go to File and choose "Install KNIME Extensions." Then choose KNIME Labs and then Text Processing.
- Connect to your data. Just use a simple file reader for a text file. At minimum all you need is a text field, but an "Author" and a "Title" are also helpful.
- Convert the text into a document. Depending on your data, you will want to add "Title" and "Full Text" (of course) into the document. (In KNIME you can use the RowID module to create a new Row ID column and utilize that.)
- Pre-process your text document. There are six main processing modules.
- POS tagger. Assigns a Part-Of-Speech to a word.
- Case Converter. This is very important; it makes sure Cats/CATS/cats are all "CATS".
- N Chars Filter. Filters out short words. Set N to 1 to effectively disable it; using 2 will filter out "no", which is a very important word.
- Stop Word Filter. Very useful. KNIME ships with stop-word dictionaries for several languages; these filter out common words. The dictionaries live in \KNIME_x.x.x\plugins\org.knime.ext.textprocessing_x.x.x.x\resources\stopwordlists\ in case you want to add your own words. It may be useful to duplicate a dictionary and add words that occur constantly in your data but carry no signal. In this example, the word "Settled" appeared most of the time.
- Dictionary Replacer. Useful if the collection of documents has a bunch of similar keywords that should be mapped to a single term.
- Punctuation Erasure. Obvious, but otherwise the system will treat ". ? !" as individual words and possibly grab them as keywords. Note that it will also turn "model/actor" into "modelactor", which the later modules see as one word instead of two. If you want "model actor" instead, add a String Manipulation module at the beginning and replace the "/" with a " ". To me, it is more useful to keep "model/actor" as one word because it signifies a different event than either word alone. In each module, it is important to select "Append unchanged documents" so you can view the original text. To learn how each module works, move the two Document Viewers around and compare the before and after.
- Export the data back out. Export both the topics and the topic assignments via CSV.
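The pre-processing chain above can be sketched in plain Python to show roughly what each module does (a hypothetical illustration with a toy stop-word list; KNIME's actual modules are configurable and far richer):

```python
import re

# Tiny stand-in for KNIME's stop-word dictionaries
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

def preprocess(text, min_chars=1, stop_words=STOP_WORDS):
    """Roughly mirror the KNIME chain: case conversion, punctuation
    erasure, N-chars filter, stop-word filter."""
    text = text.lower()                                  # Case Converter
    # Punctuation erasure; unlike KNIME, this sketch inserts a space,
    # so "model/actor" becomes two words rather than "modelactor"
    text = re.sub(r"[^\w\s]", " ", text)
    words = text.split()
    words = [w for w in words if len(w) >= min_chars]    # N Chars Filter
    words = [w for w in words if w not in stop_words]    # Stop Word Filter
    return words

print(preprocess("The model/actor settled, and the case closed!"))
```

Seeing the token list shrink step by step is the same exercise as comparing the two Document Viewers in KNIME.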
If you're interested in automating this, or performing the analysis on a much larger dataset, then you can use R. R has an implementation of the same method via the MALLET package, and many of the text mining pre-processing methods can be found in the tm package.
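For intuition about what LDA itself is doing under the hood (a toy, stdlib-only Python sketch of a collapsed Gibbs sampler, not the MALLET or KNIME implementation), here is the core loop: each token's topic is resampled from a conditional proportional to (doc-topic count + alpha) times (topic-word count + beta):

```python
import random
from collections import defaultdict

def toy_lda(docs, k, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized docs.
    Returns per-document topic counts (rows sum to doc length)."""
    rng = random.Random(seed)
    v = len({w for d in docs for w in d})             # vocabulary size
    z = [[rng.randrange(k) for _ in d] for d in docs]  # token-topic assignments
    ndk = [[0] * k for _ in docs]                      # doc-topic counts
    nkw = [defaultdict(int) for _ in range(k)]         # topic-word counts
    nk = [0] * k                                       # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # p(topic j) ∝ (ndk + alpha) * (nkw + beta) / (nk + V*beta)
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + v * beta)
                           for j in range(k)]
                r = rng.random() * sum(weights)
                new_t, acc = k - 1, 0.0
                for j, wgt in enumerate(weights):
                    acc += wgt
                    if r < acc:
                        new_t = j
                        break
                z[d][i] = new_t
                ndk[d][new_t] += 1; nkw[new_t][w] += 1; nk[new_t] += 1
    return ndk

docs = [["cat", "dog", "pet"], ["stock", "bond", "market"],
        ["dog", "cat", "vet"], ["market", "stock", "price"]]
counts = toy_lda(docs, k=2)
print(counts)
```

On this tiny corpus the sampler will usually separate the pet documents from the finance documents, but as noted above, any new document is still forced into one of the k topics it learned.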