Using Tableau’s Command Line Extract tool, some hangups

Recently I have been exploring how to use Tableau's command line tool to update published extracts on Tableau Server, "productionizing" more and more aspects of Tableau.  It is an easy way to trigger an extract refresh after a load job has completed.  The problem with schedules in Tableau is that, by definition, they are schedules, and sometimes your data just doesn't come in on a predictable schedule.  You could trigger a refresh manually by using the "Add Full Refresh" option under scheduled tasks and pressing "(Run Now)."  However, if you learn how to use the Tableau command line, you can put the command into a batch file and run it automatically from a server.  SQL Server can execute a batch file as a job step, and it can also execute it from a stored procedure.
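
To give a concrete picture, here is a minimal sketch of that last idea.  Everything in it is a placeholder (the batch file path, server URL, project, and data source names are made up), and the exact refreshextract flags are spelled out in the documentation linked below, so treat it as illustrative rather than copy-paste ready:

-- Hypothetical SQL Server stored procedure that kicks off the refresh once a load finishes.
-- refresh_extract.bat would contain a single command along the lines of:
--   tableau refreshextract --server https://myserver --username admin --password ****
--     --project "My Project" --datasource "My Extract"
-- (the tableau command typically runs from the Tableau Desktop bin directory; xp_cmdshell
--  must be enabled by a sysadmin, or use a SQL Agent "Operating system (CmdExec)" job step.)
CREATE PROCEDURE dbo.usp_RefreshMyExtract
AS
BEGIN
    EXEC master..xp_cmdshell 'C:\jobs\refresh_extract.bat';
END;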

Here is Tableau’s online documentation for this feature:
http://onlinehelp.tableau.com/current/pro/online/mac/en-us/extracting_TDE.html

Additional features are listed here.  Apparently there is also a way to print to a default printer!
http://community.tableau.com/servlet/JiveServlet/previewBody/5209-102-2-5623/Tableau%20Command-Line%20Parameters.pdf

Now, there are some pitfalls I have come across, and I wanted to share them to potentially save you some headaches.  Specifically, I encountered this nice error:

TableauDataEngineError4

Tableau Data Engine Error: 4: Invalid Command: database: Path does not exist

Possible Problem #1) The data source must be made with the current version of the server.

So let's say you build a cool report and schedule it.  It runs perfectly for dozens of weeks.  You download 8.2 or 9.0 and forget about this report.  Ages later you come around to needing to update that extract via the command line program.  You will encounter that nice message above.  Um, yeah, so which path is that?  The server?  The site?  The project?  Or the data source?  After searching the internet, I found a few things that could have been the problem.  Finally I read something that said my extract was made with a past version of Tableau.  I knew that had to be the case because I am very eager to update my Tableau Desktop version as soon as new ones are released.

Solution:  Create the extract from scratch. 

Now, if you are like me and use fewer SQL formulas and more Tableau formulas, then the thought of duplicating all the groups / sets / calculations / parameters into a new dataset is boring and lame.  Fortunately the amazing people at Tableau made copying and pasting calculated things extremely easy.  All you need to do is open up the new extract and the old one.  Right-click the calculated field and choose "Copy" (DO NOT CTRL+C, YOU SLACKERS).  Then go into the other dataset, right-click, and press "Paste".  Boom!  Simple.  Hold down Control, select several fields, menu-copy, menu-paste, and boom: "That was easy," to quote Staples.

Possible Problem #2)  Datasource is live.  It must be an extract.

This is obvious and should be an easy fix.  I almost exclusively use and publish extracts, because most of the data I work with is fairly large.  If, however, you are publishing many data sources, you may occasionally forget which ones are extracts and which ones are live connections.  This tripped me up once.  I was certain that it couldn't be problem #1 because I had just made and published the dataset, but I got the same error.  This makes sense, since the extension for a Tableau Data Extract (.tde) is not the same as the one for a data source connection (.tds).

Solution: Make it an extract :)

This is what success looks like:

TableauDataEngineSuccess

Native Hexbin support in Tableau 9

Yes! Yes! Yes!  I could type that many more times, but I’ll get to the point.  Tableau 9.0 has two new formulas that enable hexbin support!

HEXBINX(x,y)
HEXBINY(x,y)

Now, the thing is that these formulas have a unit length of "1", so you will have to scale them yourself.  For instance, this is what a basic map of the SE USA looks like without scaling.

TableauBeta9Hexbin

Although I can't put latitude lines on a Tableau map, it generally appears that a length of 1 roughly corresponds to one degree of latitude and longitude.  Here is the formula to scale the values up (by the ZoomLvl factor, e.g. 10x), do the hexbin magic, and scale back down.  First I created two different ZoomLvl parameters, one for X and one for Y, which gives you the ability to change the number of hexbins in either direction independently.  This could be useful.

HEXBINY([LONGITUD]*[ZoomLvl_Y],[LATITUDE]*[ZoomLvl_Y])/[ZoomLvl_Y]
HEXBINX([LONGITUD]*[ZoomLvl_X],[LATITUDE]*[ZoomLvl_X])/[ZoomLvl_X]

Now you can see that there is a much higher hexagon resolution.  It’s a heatmap of Accidents.  I will be posting a completed workbook later.

SE_USA_heatmap

Of special note: if you try to save steps by copying a field (say, latitude), replacing the formula with the other one, and renaming it "longitude", Tableau will keep the measure's geographic role assigned to "Latitude".  Change this to "Longitude"!

Geographic Role

Now here is an example of how hexbins can help with an ordinary scatterplot.  The data below was generated from these two functions.  The formulas would work in Tableau; however, since the random() function in Tableau has a fixed seed, it will not generate different random data for two columns.  Stringing multiple randoms together makes a pseudo-normal distribution (the sum of eight uniform randoms is roughly bell-shaped, and the ±3 offset pushes each axis into one of two bumps, which is why four clusters appear below).

x = iif(random()>.5,(random()+random()+random()+random()+random()+random()+random()+random()-3)/16  ,(random()+random()+random()+random()+random()+random()+random()+random()+3)/16 )
y = iif(random()>.5,(random()+random()+random()+random()+random()+random()+random()+random()-3)/16  ,(random()+random()+random()+random()+random()+random()+random()+random()+3)/16 )

You should get a cool pic like this:

4Cluster

Now, even with small mark sizes and 75% transparency, you will notice some of the points are obscured by the amount of data.  Hexbinning can solve this by bucketizing the data.  Here are the hexbin formulas I used:

HexbinX([F1]*[ZoomLvl_X],[F2]*[ZoomLvl_X])/[ZoomLvl_X]
HexbinY([F1]*[ZoomLvl_Y],[F2]*[ZoomLvl_Y])/[ZoomLvl_Y]

and placing the above formulas on the view gives you a nice, simple mapping like this:

4Cluster_Hexbin

Now you can clearly see the center.

Three things will help you in doing hexbins in Tableau; they are all custom shapes.  The first is a hollow hexagon with a thicker border, the second has a thin border, and the third is a filled hex.

Hexagon_M_Thicker Hexagon_M_Hollow Rot90 Hexagon_M_Filled Rot90

Update 3/5/2015:

I have decided to add in 6 hexagons so you can choose the level of fill that you need.

Hex6 Hex5 Hex4 Hex3 Hex2 Hex1

Loading data with KNIME, part 2

So in part 1 I grabbed data from an FTP site and saved it to my computer. Now in part 2 I will move it from one database (the .dbf files) into another (SQL Server).

Doing so involves three parts.

  1. Unzipping the files
  2. Connecting KNIME to the .dbf file
  3. Loading that data into SQL Server

Step 1:

knime_step1

Node 70:    Location:  wherever you saved the downloaded zip files from part 1.  Wildcard: F*.zip

Node 153:  This takes the last four characters of the file name (the ones just before ".zip"), which we need in order to figure out the year.  Append column "year".

reverse(substr(reverse(string($Location$)),4 ,4))

Node 154:  A switch statement in Java to account for the renaming of the files in the 90s.  Out column "result".

// Map the odd 1990s file names (BF94 through BF00) to four-digit years;
// anything else is already a year, so pass it through unchanged.
switch (c_year) {
    case "BF00": out_result = "2000"; break;
    case "BF94": out_result = "1994"; break;
    case "BF95": out_result = "1995"; break;
    case "BF96": out_result = "1996"; break;
    case "BF97": out_result = "1997"; break;
    case "BF98": out_result = "1998"; break;
    case "BF99": out_result = "1999"; break;
    default:     out_result = c_year;
}

Node 155:  Sort from most recent file (2013) to the first.  Use the “result” column from above and choose descending.  This is important for step 3.

Node 71:  Loop Start.  No changes

Node 74:  Include Location field only.

Node 75:  Creating the directory string. Append Column “NewPath”

join("C:\\Users\\kxe2732\\Desktop\\farsfiles\\",replaceChars(substr($${SLocation}$$,35 ),".zip","\\" ))

Node 79:  Creating the file string.  Append Column "NewDIR"

replaceChars(substr($${SLocation}$$,35 ),".zip","b" )

Node 83:  Making the directory.  Output the "resultz" variable if you want to check it (not critical).

// Create the target directory if it does not already exist
// (java.io.File needs to be available in the snippet's imports).
String resultz;
File file = new File(c_NewPath);
if (!file.exists()) {
    resultz = "new";
    file.mkdir();
} else {
    resultz = "else";
}
out_result = resultz;

Node 77:  No changes.

Unzip:  Show the variable ports and pass Node 77's red line to the input variable port of this node.  Change the source input to use the "Location" variable.  The output directory* needs to be configured to wherever you want to drop the unzipped data; this location is important for step 2.  It is also important that you choose the "Overwrite" radio button in the "If a file exists…" section.

Step 2

This will require some work outside of KNIME before setting up the nodes.  MANUALLY unzip the most recent year's file into the same output directory* location referenced immediately above.  You should see a number of files ending with .dbf.  Open up the ODBC manager/administrator in Windows (I usually just type in ODBC).  Create a new ODBC connection.  Use a User DSN (or a System DSN if you want).  Press "Add" and scroll down to the Microsoft dBase Driver (*.dbf).  There could be multiple versions and different ways of spelling it, so keep your eyes open for something similar.  For reference I am using version 6.01.7601.xxxxx.  Next you will see this screen:

ODBC_dbf

Now name it whatever you want, choose version 5.0, and then select the same output directory* listed previously.  If you followed the unzip instructions at the start of step 2, you should see some .dbf files.  Now we can get back to KNIME and the nodes in step 2:

knime_step2

odbc:  This is the node that uses the ODBC connection you just made.  Change the database driver to "sun.jdbc.odbc.JdbcOdbcDriver" and change the database URL to "jdbc:odbc:FARS2012".  You probably want to check "Allow spaces in column names".

accident (top database reader node): Connect the output variable port from Unzip to the input variable port of this node.  This serves no purpose other than to convince KNIME that this section is part of the loop.  Put "SELECT * FROM accident" into the SQL Statement.

vehicle (middle database reader node):  Same as above except change the SQL statement to “SELECT * FROM vehicle”

person (bottom database reader node):  Same as above except change the SQL statement to “SELECT * FROM person”

Node 141 & Node 148:  The vehicle and person databases do not contain the year.  If we are going to find any usable information, we most certainly need to inject the year into the data somehow.  You can just inject the "year" variable calculated previously in step 1, or copy the same formula: reverse(substr(reverse(string($${SLocation}$$)),4 ,4)).  Update:  Use "year" so you don't populate the bad 90s names and later have to run updates on your database.
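
If you do end up with the raw 90s file names in your year column, the database-side fix is simple.  Here is a rough sketch, assuming the appended column is called "year" and the table names from step 3 (adjust to whatever you actually created):

-- Hypothetical cleanup: map the 90s file names back to four-digit years.
UPDATE tbl_vehicle_A SET [year] = '1994' WHERE [year] = 'BF94';
UPDATE tbl_vehicle_A SET [year] = '2000' WHERE [year] = 'BF00';
-- ...and so on for BF95 through BF99, and again for tbl_person_A.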

Step 3

Unfortunately, here is where the loop fails to be a good loop and requires MANUAL work.  Not every year has the same fields to load, so this loop will fail eventually.  Thankfully KNIME doesn't crash and will politely handle these instances: it pauses the loop and waits for you to manually correct things.  This setup limits the manual work and makes it fairly easy.  First of all, make another ODBC connection, this time to the destination database.  Set up the nodes like so:

knime_step3

Node 145:  Table Name:  tbl_accident_A.  Under the "Append Data" heading, check both "… to existing table (if any!)" and "insert NULL for missing columns".  This is important, as is the "A" in the table name.

Node 143: Table Name: tbl_vehicle_A

Node 149: Table Name:  tbl_person_A

Node 72:  Just connect the output of unzip from step 1 to here.  No changes.

Column Filter Nodes 152, 150, & 151:  This is where you will need to make a decision: which columns to keep and which to eliminate?  Do you choose fewer fields and let the loop run longer unattended, or do you choose more fields and accept more manual intervention?  If you choose only fields that are present in all the files (there are quite a few; refer to the FARS documentation for details), then the loop will run correctly the whole time.  However, if you want every field for every year, then it will eventually fail.  You will need to manually come to these nodes, remove the non-existent columns, then change Nodes 145, 143, and 149 so the table names become "tbl_xxxx_B", incrementing the letter for each failure.  Then in SQL do a simple insert from, say, E to D with NULL for the missing columns (then tbl_xxxx_D to tbl_xxxx_C, then tbl_xxxx_C to tbl_xxxx_B, then tbl_xxxx_B to tbl_xxxx_A).  This was easy, but a little tedious.
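
For reference, the cleanup insert might look something like the sketch below; the table and column names are placeholders for whatever your Column Filter nodes actually kept, with NULL standing in for a column that only exists in the newer years:

-- Hypothetical example: fold the overflow "B" table back into the "A" table,
-- supplying NULL for a column the older years don't have.
INSERT INTO tbl_accident_A (ST_CASE, [year], LATITUDE, LONGITUD, NEWER_ONLY_FIELD)
SELECT ST_CASE, [year], LATITUDE, LONGITUD, NULL
FROM tbl_accident_B;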


A note for checking your database: it's important to use "with (nolock)" on your queries, because otherwise SQL Server may wait until KNIME is done with the table before running your 'what-you-thought-would-be-quick-select-top-10' query.
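
A minimal sketch of what I mean, assuming the table names from step 3:

-- Read without taking shared locks so KNIME's bulk load doesn't block the query.
SELECT TOP 10 * FROM tbl_accident_A WITH (NOLOCK);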

Grabbing tons of data via KNIME loop. US Traffic fatalities. Part 1

Andy Cotgreave did an amazing Tableau workbook about Seasonality in US road fatalities and it is located here:
https://public.tableausoftware.com/profile/digitalteam#!/vizhome/Seasonality-AndyCotgreave/SeasonalityinUSRoadFatalities

It is a wonderful visualization; that's obvious, as he's a Zen Master.  In addition, he went into detail about his design choices.  But how did he get all that data (or rather, how can you get it)?  Well, the US has several great government resources for various pieces of data.  The National Highway Traffic Safety Administration (NHTSA) has a reporting system called FARS, which contains a wealth of data.  FARS is short for exactly what it is: the Fatality Analysis Reporting System.  I would like to personally call out this administration for having an extraordinarily clear name for something.  I mean, even if you Googled something approximate to those words you would probably find it.

Here is an easy KNIME loop to download the data:
KNIME_ftp_loops

KNIME came out with some new features, several of which are related to grabbing files.  No longer do you have to program it in Java or Python; KNIME now offers an easy way to do it.  What I like most about KNIME is that it is essentially self-documenting.  This data flow clearly shows me what is happening, plus it allows for some annotation.

The settings of the nodes used:

Node 62:  The FTP Connection is simply "ftp.nhtsa.dot.gov" and the port is the default FTP port of 21.  Make sure Authentication is set to 'none'.  Then test your connection and BAZINGA:
KNIME_FTP_success

Node 63:  Change directory to /FARS/

Node 64:  This extracts the Uniform Resource Identifier (URI) information.  Just make sure "Path" is checked.

Node 65:  This String Manipulation node adds string($Path$+"/DBF/"); it simply appends a second subdirectory that, thankfully, is consistent across all the year folders.

Node 67:  No changes.

Node 61:  Download.  Show the variable ports, take the directory you constructed with the above nodes, and feed it into the source file or folder.  Change your target folder to wherever you want the data to be downloaded.  THIS IS VERY IMPORTANT:  do not download all the files, as it is a waste to grab things you don't need.  Make sure you check "Only download files that match" and then choose F*.zip as a wildcard.  This will limit the number of downloaded files.  Also, if you are only interested in a subset of the years, then add a second String Manipulation node using "substr($Path$,6 , 4)" and a Row Filter node to remove the unneeded folders from the list.

Node 68:  No changes.

"1 Row Test Case":  This node can replace Node 67 if you don't want a loop and just want to download one file at a time.