Text mining: Using KNIME for Topic Discovery.

To jump straight to the finished Tableau Dashboard, click here.

Free-text fields can be a very informative data point, but more difficult to analyze.  Typically what will happen is that someone will read the text and manually attempt to classify topic, find the overall sentiment, and then forward it on to the proper channels.  Usually the first step people attempt with text mining is to build a word-cloud.  This isn’t very insightful because it mostly consists of 1 word.  The next level of analysis is to do an Ngram analysis, or sentiment analysis.

Another analytical method is called “Topic Discovery.”  This method attempts to discover what topics are being talked about in the text and then assigns a probability across all topics.  More details about the algorithm can be found on the internet, but KNIME uses Latent Dirichlet Allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation).  I will not go into much detail on how it works, (there is a good “layman’s” explanation here http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/) but this method provides a way for users to automatically classify documents into topics (or ‘subjects’ or ‘categories’ etc.)  This is a very useful procedure for business.

With this method in KNIME, it requires the user to specify two important parameters.  The first one is the number of topics.  This is arbitrary, and a little unsettling to business users to just pluck a number out of the air, (“Why did you choose 20 topics?” “Because 20 is my lucky number!”) but you should review the results and experiment with the data.  There does exist a Bayesian way of determining an optimal number of topics (http://cpsievert.github.io/projects/615/xkcd/).  Another aspect that will be difficult to apply in a business setting, is that this method is backwards-looking, and may not be good for new responses.  Lets say you run the LDA model on past data; you screw up something new in the future and completely unseen topics start appearing.  However LDA will apportion them to the topics it had seen in the past and will not generate the new topic.

In general:

MoreTopics => Smaller scope => Fewer Documents per Topic

and Fewer Topics => More Overlap => More Documents per Topic

The second parameter is choosing the number of keywords. You should vary this depending on the number of words in your document. Are these 1-sentence responses? or newspaper articles? or books? The size of the text in your document should drive the number of works. You will have to experiment with this. If you choose 10 keywords on a sentence that gets processed down to 5 words, then you will have an overlap of topics.

There are two other parameters, an alpha and a beta. Use the defaults unless you know what you’re doing. More info is available here (http://stats.stackexchange.com/questions/37405/natural-interpretation-for-lda-hyperparameters ). To quote a user linked above:

A low alpha value puts less such constraints on documents and means that it is more likely that a document may contain mixture of just a few, or even only one, of the topics. Likewise, a high beta-value means that each topic is likely to contain a mixture of most of the words, and not any word specifically, while a low value means that a topic may contain a mixture of just a few of the words.

Here is how to build a simple topic model using KNIME. A dump of the original data can be found here.

  1. Assuming you already have KNIME, the first step is to add in their Text Mining module. Go to File and then choose “Install KNIME Extensions.”  Then choose KNIME Labs and then Text Processing.
  2. Connect to your data. Just do a simple file reader for a text file. At minimum all you need is a text field, but an “Author” and “Title” are also helpful.KNIME_Fileload
  3. Convert the text into a document.  Depending on your data, you will want to add “Title” and “Full Text” (of course) into the document. (In KNIME you can use the RowID module to create a new Row ID column and utilize that.
  4. Pre-Processed your text document. There are 5 main processing modules.
    1. POS tagger. Assigns a Part-Of-Speech to a word.
    2. Case Converter. This is very important, makes sure Cats/CATS/cats are all “CATS”.
    3. N Chars Filter. Filters out small words. Set to 1 to make it optional.  Using 2 will filter out “no” which is a very important word.
    4. Stop word Filter. Very useful. KNIME automatically comes with a stop-word dictionary with several languages. This filters out common words.  Dictionary is in \KNIME_x.x.x\plugins\org.knime.ext.textprocessing_x.x.x.x\resources\stopwordlists\ in case you want to add your own words.  It may be useful to duplicate this dictionary and add specific words commonly occurring during your data mining process.  For this example, the word “Settled” appeared most of the time.
    5. Dictionary Replacer. If the collection of documents have a bunch of similar keywords.
    6. Punctuation Erasure.  Obvious, but the system will treat “. ? !” as  individual words and possibly grab them as keywords.  In each module it is important that you select “Append unchanged documents” so you can view the original text. To learn how each module works. Move the two Document Viewers around and see.  It will replace “model/actor” with “modelactor” which the later modules see as 1 word instead of 2.  If you want “model actor” instead, then do a string manipulation module at the beginning and replace the “/” with a ” “.  To me, it is more useful to keep “model/actor” as one word because it signifies a different event than either of the two previous words


  5. Export the data back out.  Export out both the Topics and the Assignment of topics via CSV.

Below is a full view of what the completed workflow looks like:

To see the full Tableau Dashboard created using this dataset look here:

Additional info:

If you’re interested automating this, or performing the analysis on a much large dataset, then you can use R. R has an implementation of the same method using the MALLET package. Many of the text mining pre-processing methods can be found in the tm package.

Zebra Charts

Zebra charts are way to compare two dates for a single data point and graph the differences between these dates. For various reasons a person, or event, or transaction could begin at X time and have another event at Y (which doesn’t necessarily come after, it could be before). I have been using these charts for the last month at work and they have been very informative to diagram events. Of course these are just a type of Bar or Area chart, but the ‘Zebra’ work comes from the striping.
The inspiration for these charts comes from Cox Communications and their demonstration for Tableau. They looked at the difference between deal closing date and the installation date:  https://youtu.be/J7zMXBl0hJw?t=9m28s
In this image, the day the sale was booked is a color, and the bottom axis is install date:


We can see that sales for the month of August (right hand side) are influenced by many previous weeks decisions.

We can take a metric for this week, and visualize how events from past weeks have influenced this behavior. For example, suppose that the sales this week are $100. Then this zebra chart van show that $60 was collected from orders placed last week, $30 for two weeks prior, $10 3 weeks prior

Possible Business reasons:

  • This could be the difference between when the sale was booked and when the money was actually collected.
  • When a customer ordered vs when the order was canceled.
  • When a service actually occurred vs when it was scheduled to occur.
  • When a customer arrived vs when they said they were going to arrive (which could be before or after).

Full Video of Cox’s presentation: https://www.youtube.com/watch?v=J7zMXBl0hJw

Linked here is a full workbook showing the example I made.

Here’s a finished example we used here at work. Valentine’s day is highlighted. You can see that 77% of event #2 fell on the same day as event 1. There is a long tail going after, and a very small number of event #2 actually came before the day.


I suppose one could also call these type of charts “Tiger Charts” if you’re looking for a ferocious name.

Some Additional Color Palettes for Tableau

Last year, while attending one of the Tableau User groups,  I learned how to create custom color palettes.  I have accumulated quite a few over the last year and I thought I’d like to pass them on.   Several of them are useful for heatmaps.  This first set is called Spectral Color and this is based on the colors of the stars, which in turn is based on their temperature.  I acquired this from an ATUG meeting and I have found it useful for heatmaps.


<color-palette name="Spectral 3 Color Div Palette" type="ordered-diverging"> <color>#BC80BD</color> <color>#CCEBC5</color> <color>#FC8D59</color> <color>#FFED6F</color> <color>#FFFFBF</color> <color>#99D594</color> </color-palette>
<color-palette name="Spectral 4 Color Div Palette" type="ordered-diverging"> <color>#D7191C</color> <color>#FDAE61</color> <color>#ABDDA4</color> <color>#2B83BA</color> </color-palette>
<color-palette name="Spectral 5 Color Div Palette" type="ordered-diverging"> <color>#D7191C</color> <color>#FDAE61</color> <color>#FFFFBF</color> <color>#ABDDA4</color> <color>#2B83BA</color> </color-palette>
<color-palette name="Spectral 6 Color Div Palette" type="ordered-diverging"> <color>#D53E4F</color> <color>#FC8D59</color> <color>#FEE08B</color> <color>#E6F598</color> <color>#99D594</color> <color>#3288BD</color> </color-palette>
<color-palette name="Spectral 7 Color Div Palette" type="ordered-diverging"> <color>#D53E4F</color> <color>#FC8D59</color> <color>#FEE08B</color> <color>#FFFFBF</color> <color>#E6F598</color> <color>#99D594</color> <color>#3288BD</color></color-palette>
<color-palette name="Spectral 8 Color Div Palette" type="ordered-diverging"> <color>#D53E4F</color> <color>#F46D43</color> <color>#FDAE61</color> <color>#FEE08B</color> <color>#E6F598</color> <color>#ABDDA4</color> <color>#66C2A5</color> <color>#3288BD</color> </color-palette>
<color-palette name="Spectral 9 Color Div Palette" type="ordered-diverging"> <color>#D53E4F</color> <color>#F46D43</color> <color>#FDAE61</color> <color>#FEE08B</color> <color>#FFFFBF</color> <color>#E6F598</color> <color>#ABDDA4</color> <color>#66C2A5</color> <color>#3288BD</color> </color-palette>
<color-palette name="Spectral 10 Color Div Palette" type="ordered-diverging"> <color>#9E0142</color> <color>#D53E4F</color> <color>#F46D43</color> <color>#FDAE61</color> <color>#FEE08B</color> <color>#E6F598</color> <color>#ABDDA4</color> <color>#66C2A5</color> <color>#3288BD</color></color-palette>
<color-palette name="Spectral 11 Color Div Palette" type="ordered-diverging"> <color>#5E4FA2</color> <color>#9E0142</color> <color>#D53E4F</color> <color>#F46D43</color> <color>#FDAE61</color> <color>#FEE08B</color> <color>#FFFFBF</color> <color>#E6F598</color> <color>#ABDDA4</color> <color>#66C2A5</color></color-palette>

Kris Palette.  I have built these myself with inspiration from the iPhone app called Fractal Plus.  These palettes have an excessive number of colors (you really only need about 7 and Tableau can do the rest), but this allows for others the customize the amount of drift between the major colors.  So if these are black-heavy, just remove some of them to push the middle left, or remove some of the reds to push them right.

<color-palette name="Kpal Burnt Blk-Org-Red div" type="ordered-diverging"> <color>#242422</color> <color>#272624</color> <color>#322e25</color> <color>#3b3529</color> <color>#453c2b</color> <color>#4e412e</color> <color>#584831</color> <color>#604f35</color> <color>#695638</color> <color>#725b3a</color> <color>#84673f</color> <color>#9e7a49</color> <color>#a47f4b</color> <color>#b28952</color> <color>#c29556</color> <color>#d19e5c</color> <color>#e3ab60</color> <color>#e4ac5f</color> <color>#e6a95c</color> <color>#e6ab5e</color> <color>#e7a351</color> <color>#ec9b40</color> <color>#e74e24</color> <color>#e34b2a</color> <color>#d03a2a</color> <color>#de3224</color></color-palette>


<color-palette name="Kpal Burnt Blk-Org-Red-Gry div" type="ordered-diverging"> <color>#242422</color> <color>#453c2b</color> <color>#584831</color> <color>#695638</color> <color>#84673f</color> <color>#a47f4b</color> <color>#c29556</color> <color>#e3ab60</color> <color>#e6a95c</color> <color>#e7a351</color> <color>#e74e24</color> <color>#d03a2a</color> <color>#de3224</color> <color>#B45850</color> <color>#A3918F</color> <color>#B6A7A5</color></color-palette>

City Night

<color-palette name="Kpal City-Night cycl" type="ordered-diverging"> <color>#101825</color> <color>#222d43</color> <color>#333d60</color> <color>#465376</color> <color>#55658a</color> <color>#6d7693</color> <color>#908d86</color> <color>#b4a48c</color> <color>#d9bb88</color> <color>#f8cd76</color> <color>#f4b872</color> <color>#ee9a6b</color> <color>#a95e59</color> <color>#7c4f56</color> <color>#463d58</color> <color>#313250</color> <color>#202948</color> <color>#101d30</color> <color>#0b131e</color></color-palette>


<color-palette name="Kpal City-Night div" type="ordered-diverging"> <color>#101825</color> <color>#222d43</color> <color>#333d60</color> <color>#465376</color> <color>#55658a</color> <color>#6d7693</color> <color>#908d86</color> <color>#b4a48c</color> <color>#d9bb88</color> <color>#f8cd76</color> <color>#f4b872</color> <color>#ee9a6b</color> <color>#a95e59</color>


</color-palette> <color-palette name="Kpal City-Night Orng div" type="ordered-diverging"> <color>#101825</color> <color>#222d43</color> <color>#333d60</color> <color>#465376</color> <color>#55658a</color> <color>#6d7693</color> <color>#908d86</color> <color>#b4a48c</color> <color>#d9bb88</color> <color>#f8cd76</color> <color>#f4b872</color> <color>#ee9a6b</color> <color>#FF985C</color></color-palette>


<color-palette name="Kpal Fire cycl" type="ordered-diverging"> <color>#f9f9f9</color> <color>#e0e7fa</color> <color>#b4bad0</color> <color>#758295</color> <color>#323e49</color> <color>#2f3841</color> <color>#15160b</color> <color>#1a1100</color> <color>#3f0e00</color> <color>#7a0901</color> <color>#9a0601</color> <color>#ba0400</color> <color>#da0100</color> <color>#f80600</color> <color>#fa1c06</color> <color>#fc5212</color> <color>#f8e24e</color> <color>#f7fc66</color> <color>#f7fd81</color> <color>#ebe38e</color> <color>#debe79</color> <color>#c79058</color> <color>#b04f2f</color> <color>#830303</color> <color>#580902</color> <color>#320e02</color> <color>#1e1005</color>


</color-palette> <color-palette name="Kpal Fire Red div" type="ordered-diverging"> <color>#f80600</color> <color>#fa1c06</color> <color>#fc5212</color> <color>#f8e24e</color> <color>#f7fc66</color> <color>#f7fd81</color> <color>#ebe38e</color> <color>#debe79</color> <color>#b04f2f</color> <color>#580902</color> <color>#1e1005</color></color-palette>


<color-palette name="Kpal Fire Blue div" type="ordered-diverging"> <color>#41dfff</color> <color>#68b0c8</color> <color>#8291a3</color> <color>#8f8291</color> <color>#c34448</color> <color>#f80600</color> <color>#fa1c06</color> <color>#fc5212</color> <color>#f8e24e</color> <color>#f7fc66</color> <color>#f7fd81</color> <color>#ebe38e</color> <color>#debe79</color> <color>#b04f2f</color> <color>#580902</color> <color>#1e1005</color>


</color-palette> <color-palette name="Kpal Fire Blu-Wht div" type="ordered-diverging"> <color>#f1fcff</color> <color>#c8f5ff</color> <color>#92ecff</color> <color>#69e5ff</color> <color>#41dfff</color> <color>#68b0c8</color> <color>#8291a3</color> <color>#8f8291</color> <color>#c34448</color> <color>#f80600</color> <color>#fa1c06</color> <color>#fc5212</color> <color>#f8e24e</color> <color>#f7fc66</color> <color>#f7fd81</color> <color>#ebe38e</color> <color>#debe79</color> <color>#b04f2f</color> <color>#580902</color> <color>#1e1005</color></color-palette>

Starry Night

<color-palette name="Kpal Starry Night cycl" type="ordered-diverging"> <color>#fcfdf0</color> <color>#fbf4a3</color> <color>#fcbe27</color> <color>#8a401a</color> <color>#4a0927</color> <color>#27063f</color> <color>#0e0153</color> <color>#01025f</color> <color>#082487</color> <color>#0f3598</color> <color>#2a65c1</color> <color>#5282cc</color> <color>#99b4e4</color> <color>#e0e7f9</color></color-palette>


<color-palette name="Kpal Starry Night div" type="ordered-diverging"> <color>#fcfdf0</color> <color>#fbf4a3</color> <color>#fcbe27</color> <color>#8a401a</color> <color>#4a0927</color> <color>#27063f</color> <color>#0e0153</color> <color>#01025f</color> </color-palette>

Using Tableau’s Command Line Extract tool, some hangups

Recently I have been exploring how to use Tableau’s command line tool to update published extracts on Tableau Server.  “Productinizing” more and more aspects of Tableau.  This is an easy way to trigger an extract to update after the job has completed.  The problem with schedules in Tableau is that by definition they are schedules, and sometimes your data just doesn’t come in on a predictable schedule.  You could trigger a refresh manually by using “Add Full Refresh” option under scheduled tasks and pressing “(Run Now).”  However if you learn how to use the tableau command line, then you could put it into a batch file and run it automatically from a server.  SQL server can execute a batch file as a job step, and it can also execute it from a stored procedure.

Here is Tableau’s online documentation for this feature:

Additional features are listed here.  Apparently there is also way to print to a default printer!

Now there are some pitfalls that I have come across and wanted to share them so it could potentially decrease headaches.  Specifically I encountered this nice error:


Tableau Data Engine Error: 4: Invalid Command: database: Path does not exist

Possible Problem #1) The data source must me made with the current version of the server.

So lets say you build a cool report and you schedule it.  It runs perfectly for dozens of weeks.  You download 8.2 or 9.0 and forget about this report.  Ages later you come around to needed to update that extract via the command line program.  You will encounter that nice message above.  Um, yeah so which path is that?  The server? The Site? The Project? or the datasource? After searching the internet, I found a few things that could have been the problem.  Finally I read something that said my extract was made with a past version of Tableau.  I knew that had to be the case because I am very eager to update my Tableau desktop versions as soon as new ones are released.

Solution:  Create the extract from scratch. 

Now if you are like me and use fewer SQL formulas and more Tableau formulas then the thought of duplicating all the groups / sets / calculations / parameters into a new dataset is boring and lame.  Fortunately the amazing people at Tableau made copying and pasting calculated things extremely easy.  All you need to do is open up the new extract and the old.  Right click the calculated field and choose “copy”, (DO NOT CTRL+C YOU SLACKERS).  Then go into the other dataset, right-click and press “paste”.  Boom!  Simple.  Hold down control, select several fields, menu-copy, menu-paste and Boom “That was easy”, to quote Staples.

Possible Problem #2)  Datasource is live.  It must be an extract.

This is obvious and should be an easy fix.  I almost exclusively use and publish extracts.  This is because most data I work with is fairly large.  If however if you are publishing many data sources, you may occasionally forget which ones are extracts and which ones are live connections.  This tripped me up once.  I was certain that it couldn’t be problem #1 because I had just made and published the dataset, but I got the same error.  This makes sense, since the extension for a Tableau Data Extract (.tde) is not the same again as a data source connection ().

Solution: Make it an extract :)

This is what success looks like: