Convert Shapefiles to TDEs

I didn’t make this tool, but I have found it useful on more than one occasion and wanted to mention it. It is an Alteryx Gallery app, built by Craig Bloodworth, that converts Shapefiles to Tableau TDEs.

Currently the US Government releases shapefiles of every subdivision in the country. In order to use these easily in Tableau they have to be transformed into a TDE or CSV. This link pointed me to an Alteryx Gallery App located here (sign-in required, but it is displayed below). Note that I was only able to get it to work with 2010 shapefiles, not with more current ones.

The app also has the ability to generalize the polygons. Checking this box (Generalize Polygon (Advanced)) will reduce the number of points and help Tableau’s performance. Note that even with 0.0 mi chosen as the threshold, Alteryx still manages to simplify the polygon. Here is a visual example of what the app does.

For this example I’m using traffic corridors in Georgia.
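If you don’t have Alteryx handy, here is a rough Python sketch of the same two ideas: generalize the geometry, then flatten it into a point-per-row CSV that Tableau can draw as polygons. This is only a sketch of the concept, not what the app does internally; it assumes geopandas/shapely/pandas are installed, and the shapefile name is a hypothetical 2010 Census tract file.

```python
# Sketch only: simplify ("generalize") a shapefile and flatten it for Tableau.
# Assumes geopandas/shapely/pandas; the filename below is hypothetical.
import geopandas as gpd
import pandas as pd

gdf = gpd.read_file("tl_2010_13_tract10.shp")   # hypothetical 2010 Census tract shapefile

# Douglas-Peucker simplification; tolerance is in the layer's own units
# (degrees for unprojected Census files), so experiment with the value.
gdf["geometry"] = gdf.geometry.simplify(tolerance=0.001, preserve_topology=True)

# Flatten each polygon's exterior ring into shape / part / point-order / lon / lat rows.
rows = []
for shape_id, geom in zip(gdf.index, gdf.geometry):
    polygons = geom.geoms if geom.geom_type == "MultiPolygon" else [geom]
    for part_id, poly in enumerate(polygons):
        for point_order, (lon, lat) in enumerate(poly.exterior.coords):
            rows.append((shape_id, part_id, point_order, lon, lat))

pd.DataFrame(rows, columns=["shape_id", "part_id", "point_order", "lon", "lat"]) \
  .to_csv("shapes_for_tableau.csv", index=False)
```

The point-per-row CSV is the same shape of output you would point Tableau’s polygon mark type at; the TDE step in the app is just a more efficient container for it.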




Data is the Clay of the Modern World.

So I’ve packed up the family and we’re moving to a distant shore of this continent. Naturally, packing for a large move leads you to re-evaluate some of your ‘stuff’.  Unearthed from my early college era were these fine pieces of pottery.

Two plates, both wheel thrown and glazed in western style raku.



And a flowerpot, also constructed in the same fashion.


Pottery is similar to data visualization in that it has unlimited functional applications while also allowing unlimited artistic expression. I remember going in on Friday nights to throw pottery and have a good time; it was satisfying to see the work of your own hands. When I use certain data tools, I feel like I am making pottery. Sure, the techniques are different, and there are always tricks and hacks to learn, but they feel the same. Clay provides instant visual feedback and guides the potter into making additional changes. Nearly anything that can be imagined can be made, constrained only by gravity and how wet the clay is. Data visualization is the same way: nearly anything can be constructed.

First off, you start with a giant block of ugly clay that no one wants to look at. It’s functionless and useless until you shape it into something you want. Data sitting in a database is just as useless until someone or something looks at it. You find the clay and bring it into your studio; you bring data into your database. You typically have a goal: in pottery it is to make a plate, or vase, or cup; in business you want to find out how product X is selling, or who buys Y. Then you throw the clay (or carve it for a sculpture), which is similar to bringing data into Excel or shaping it in Tableau. Finally you add the finishing touches, glaze the pot, fire it in a kiln, and hope it is received as well as you thought. You email your report or put your ceramic piece on a pedestal, and each time hope someone will look at it and say ‘Wow’.

Data is the Clay of the modern world. Its impact is obvious. It is flexible, useful, expressive, and abundant. Just as past societies left a large number of clay fragments in their wake, future archaeologists will look at our data fragments and attempt to piece our world together.



Text mining: Using KNIME for Topic Discovery.

To jump straight to the finished Tableau Dashboard, click here.

Free-text fields can be a very informative data point, but they are more difficult to analyze. Typically what happens is that someone reads the text, manually attempts to classify the topic and find the overall sentiment, and then forwards it on to the proper channels. Usually the first step people attempt with text mining is to build a word cloud. This isn’t very insightful because it only looks at one word at a time. The next level of analysis is an N-gram analysis, or sentiment analysis.

Another analytical method is called “Topic Discovery.”  This method attempts to discover what topics are being talked about in the text and then assigns each document a probability across all topics. More details about the algorithm can be found on the internet; KNIME uses Latent Dirichlet Allocation (LDA). I will not go into much detail on how it works (there is a good “layman’s” explanation here), but this method provides a way to automatically classify documents into topics (or ‘subjects’ or ‘categories’, etc.). This is a very useful procedure for business.
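To make the idea concrete, here is a minimal sketch of LDA using scikit-learn in Python. This is an analogue, not what KNIME runs under the hood, and the toy documents are made up; the point is just that every document ends up with a probability for every topic.

```python
# Minimal LDA sketch with scikit-learn (an analogue, not KNIME's implementation).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the claim was settled after the adjuster reviewed the damage",
    "customer called about a billing error on the latest invoice",
    "settled the dispute once the billing department issued a credit",
]

# LDA works on raw term counts (bag of words), not TF-IDF weights.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topic = lda.fit_transform(X)        # each row is P(topic | document), summing to ~1

# Top keywords per topic, analogous to the keyword output in KNIME.
terms = vectorizer.get_feature_names_out()
for topic_id, weights in enumerate(lda.components_):
    top_words = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {topic_id}: {top_words}")
print(doc_topic)                        # topic probabilities for each document
```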

With this method in KNIME, the user has to specify two important parameters. The first is the number of topics. This is arbitrary, and it is a little unsettling to business users to just pluck a number out of the air (“Why did you choose 20 topics?” “Because 20 is my lucky number!”), but you should review the results and experiment with the data. There is also a Bayesian way of determining an optimal number of topics. Another aspect that makes this difficult to apply in a business setting is that the method is backwards-looking and may not be good for new responses. Let’s say you run the LDA model on past data; then you screw something up in the future and completely unseen topics start appearing. LDA will apportion those documents to the topics it has seen in the past and will not generate the new topic.
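If you want something more defensible than a lucky number, one cheap sanity check is to fit the model at several topic counts and compare them. The sketch below does a simple perplexity sweep in scikit-learn; this is not the Bayesian selection method mentioned above, and the toy documents are made up.

```python
# Not the Bayesian model-selection method mentioned above -- just a quick
# perplexity sweep (lower is generally better) to compare candidate topic counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "claim settled after the adjuster reviewed the water damage",
    "billing error on the latest invoice was corrected",
    "customer asked to reschedule the installation appointment",
    "the settled claim payment arrived two weeks late",
    "installation technician never showed up for the appointment",
    "invoice showed a duplicate charge for the same service",
]

X = CountVectorizer(stop_words="english").fit_transform(docs)

for k in (2, 3, 4, 5):
    lda = LatentDirichletAllocation(n_components=k, random_state=42).fit(X)
    print(f"{k} topics -> perplexity {lda.perplexity(X):.1f}")
```

On a real dataset you would hold out some documents for the perplexity calculation and still eyeball the keywords for each candidate; the number that “reads” best usually wins.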

In general:

More Topics => Smaller Scope => Fewer Documents per Topic

Fewer Topics => More Overlap => More Documents per Topic

The second parameter is the number of keywords. You should vary this depending on the length of your documents. Are these one-sentence responses, newspaper articles, or books? The amount of text in your documents should drive the number of keywords, and you will have to experiment with it. If you choose 10 keywords on a sentence that gets processed down to 5 words, then you will have overlap between topics.

There are two other parameters, an alpha and a beta. Use the defaults unless you know what you’re doing. More info is available here. To quote a user linked above:

A low alpha value puts less such constraints on documents and means that it is more likely that a document may contain mixture of just a few, or even only one, of the topics. Likewise, a high beta-value means that each topic is likely to contain a mixture of most of the words, and not any word specifically, while a low value means that a topic may contain a mixture of just a few of the words.
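For what it’s worth, in the scikit-learn analogue these two parameters show up as doc_topic_prior (alpha) and topic_word_prior (beta); KNIME’s node exposes its own alpha/beta fields. The values below are illustrative, not recommendations.

```python
# Where alpha and beta appear in scikit-learn's LDA (an analogue, not KNIME's node).
# doc_topic_prior  ~ alpha : lower values push each document toward fewer topics
# topic_word_prior ~ beta  : lower values push each topic toward fewer characteristic words
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(
    n_components=20,
    doc_topic_prior=0.1,     # "alpha"; scikit-learn defaults to 1/n_components if omitted
    topic_word_prior=0.01,   # "beta";  scikit-learn defaults to 1/n_components if omitted
    random_state=42,
)
```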

Here is how to build a simple topic model using KNIME. A dump of the original data can be found here.

  1. Assuming you already have KNIME, the first step is to add in their Text Mining module. Go to File and then choose “Install KNIME Extensions.”  Then choose KNIME Labs and then Text Processing.
  2. Connect to your data. Just use a simple File Reader for a text file. At minimum all you need is a text field, but an “Author” and “Title” are also helpful.
  3. Convert the text into a document.  Depending on your data, you will want to add “Title” and “Full Text” (of course) into the document. (In KNIME you can use the RowID module to create a new Row ID column and use that.)
  4. Pre-process your text documents. There are six main processing modules (a rough Python equivalent of these steps is sketched after this list):
    1. POS tagger. Assigns a Part-Of-Speech to a word.
    2. Case Converter. This is very important; it makes sure Cats/CATS/cats are all treated as the same word.
    3. N Chars Filter. Filters out short words. Set it to 1 to effectively turn it off; using 2 will filter out “no”, which is a very important word.
    4. Stop Word Filter. Very useful. This filters out common words, and KNIME comes with stop-word dictionaries for several languages. The dictionary is in \KNIME_x.x.x\plugins\org.knime.ext.textprocessing_x.x.x.x\resources\stopwordlists\ in case you want to add your own words. It may be useful to duplicate this dictionary and add specific words that occur very frequently in your data; for this example, the word “Settled” appeared most of the time.
    5. Dictionary Replacer. Useful if the collection of documents has a bunch of similar keywords that you want to map onto a single term.
    6. Punctuation Erasure. Obvious, but without it the system will treat “. ? !” as individual words and possibly grab them as keywords. Note that it will also replace “model/actor” with “modelactor”, which the later modules see as one word instead of two. If you want “model actor” instead, add a String Manipulation module at the beginning and replace the “/” with a ” “. To me it is more useful to keep “model/actor” as one word, because it signifies a different event than either of the two words on their own. In each module it is important to select “Append unchanged documents” so you can view the original text; to learn how each module works, move the two Document Viewers around and compare the output.


  5. Export the data back out. Export both the topics and the topic assignments via CSV.
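For reference, here is a rough Python equivalent of the pre-processing in step 4 and the export in step 5. It is only a sketch of the same ideas (lower-casing, punctuation and short-word removal, stop words, then topics and assignments written to CSV), not a translation of the KNIME workflow; the POS Tagger and Dictionary Replacer steps are skipped, and the sample responses and column names are made up.

```python
# Rough Python analogue of steps 4-5: clean the text, fit a topic model,
# then export topic keywords and per-document topic assignments to CSV.
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.decomposition import LatentDirichletAllocation

raw = pd.DataFrame({
    "title": ["Resp 1", "Resp 2", "Resp 3"],
    "full_text": [
        "The claim was SETTLED after the adjuster visited.",
        "Billing error: duplicate charge on my invoice!",
        "Installer never showed up; appointment was rescheduled.",
    ],
})

def preprocess(text):
    text = text.lower()                                      # Case Converter
    text = re.sub(r"[^\w\s]", " ", text)                     # Punctuation Erasure
    words = [w for w in text.split()
             if len(w) > 1 and w not in ENGLISH_STOP_WORDS]  # N Chars + Stop Word Filter
    return " ".join(words)

clean = raw["full_text"].map(preprocess)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(clean)
lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topic = lda.fit_transform(X)

# Export the topic keywords and the per-document topic assignments.
vocab = vectorizer.get_feature_names_out()
topics = pd.DataFrame(
    {f"topic_{i}": [vocab[j] for j in comp.argsort()[::-1][:5]]
     for i, comp in enumerate(lda.components_)}
)
topics.to_csv("topics.csv", index=False)

assignments = pd.DataFrame(doc_topic,
                           columns=[f"topic_{i}" for i in range(doc_topic.shape[1])])
assignments.insert(0, "title", raw["title"])
assignments.to_csv("topic_assignments.csv", index=False)
```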

Below is a full view of what the completed workflow looks like:

To see the full Tableau Dashboard created using this dataset, look here.

Additional info:

If you’re interested in automating this, or in performing the analysis on a much larger dataset, then you can use R. R has an implementation of the same method using the MALLET package, and many of the text-mining pre-processing methods can be found in the tm package.

Zebra Charts

Zebra charts are a way to compare two dates for a single data point and graph the differences between them. For various reasons a person, event, or transaction could begin at time X and have another event at time Y (which doesn’t necessarily come after; it could be before). I have been using these charts for the last month at work and they have been very informative for diagramming events. Of course these are just a type of bar or area chart, but the ‘Zebra’ name comes from the striping.
The inspiration for these charts comes from Cox Communications and their demonstration for Tableau. They looked at the difference between deal closing date and the installation date:
In this image, the day the sale was booked is shown as a color, and the bottom axis is the install date:


We can see that sales for the month of August (right-hand side) are influenced by decisions made in many previous weeks.

We can take a metric for this week and visualize how events from past weeks have influenced it. For example, suppose that the sales this week are $100. This zebra chart can then show that $60 was collected from orders placed last week, $30 from orders placed two weeks prior, and $10 from three weeks prior.
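The data shaping behind a chart like this is just a date difference and a pivot. Here is a pandas sketch using made-up column names and the toy $60/$30/$10 numbers from the example above; each “weeks prior” column becomes one stripe in the stacked bar or area chart.

```python
# Sketch of the data prep behind a zebra chart (made-up column names and numbers):
# bucket each collected dollar by how many weeks earlier its order was booked.
import pandas as pd

orders = pd.DataFrame({
    "order_date":   pd.to_datetime(["2015-02-05", "2015-01-29", "2015-01-22"]),
    "collect_date": pd.to_datetime(["2015-02-12", "2015-02-12", "2015-02-12"]),
    "amount":       [60, 30, 10],
})

# Weeks between booking and collection; this can be negative if event Y came first.
orders["weeks_prior"] = (orders["collect_date"] - orders["order_date"]).dt.days // 7

# Rows = the week the money was collected, columns = how many weeks earlier it was booked.
orders["collect_week"] = orders["collect_date"].dt.to_period("W").dt.start_time
zebra = orders.pivot_table(index="collect_week", columns="weeks_prior",
                           values="amount", aggfunc="sum", fill_value=0)
print(zebra)   # each weeks_prior column is one stripe of the zebra
```

In Tableau the equivalent is a calculated field for the week difference placed on color, with the collection week on the axis, which is exactly what the Cox example does with booking and install dates.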

Possible Business reasons:

  • This could be the difference between when the sale was booked and when the money was actually collected.
  • When a customer ordered vs when the order was canceled.
  • When a service actually occurred vs when it was scheduled to occur.
  • When a customer arrived vs when they said they were going to arrive (which could be before or after).

Full Video of Cox’s presentation:

Linked here is a full workbook showing the example I made.

Here’s a finished example we used at work. Valentine’s Day is highlighted. You can see that 77% of event #2 fell on the same day as event #1. There is a long tail going after, and a very small number of event #2 occurrences actually came before that day.


I suppose one could also call this type of chart a “Tiger Chart” if you’re looking for a more ferocious name.