Data Engineering example. Java + SSIS to gather Macroeconomic data from the FED.

The Federal Reserve has a great data service called FRED (Federal Reserve Economic Data), which is maintained by the St. Louis Fed.  It is one of the best sources of economic data about the United States.  Beyond the national overview, it also provides more detailed slices of data by state and metro area. FRED also provides an easy way to download the data through a .txt link. So if I wanted to see the seasonally adjusted unemployment data for Alaska, I could just click on this link:  http://research.stlouisfed.org/fred2/data/AKUR.txt

Now if I wanted to gather that information for all 50-ish states (DC and PR may be there) then I ‘could’ click every link and then download every file. That would be a time-consuming endeavor, especially for data sets that may come out monthly.

Since FRED is very consistent, and we know the data exists in simple .txt files, all we have to do is gather it using code and then load it into the database.  Just by looking at the file name, I could see the pattern was <STATE CODE><METRIC>.txt.  Thankfully that was mercifully easy.  I then wrote a program in Java, using Eclipse, to download and save the files. I ran it twice, once for seasonally adjusted (SA) unemployment data and once for not seasonally adjusted (NSA) data.  Java Code: savefiles.java
The files are saved in a default location based on your Eclipse installation; for me they were saved here: C:\Users\<username>\workspace\savefiles\
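For anyone who wants to see the shape of that download step without opening savefiles.java, here is a minimal sketch. It is not the original program. The class name, the three-state list and the "savefiles" folder name are placeholders I made up for illustration; the base URL and the "UR" metric come from the Alaska link above. The post does not state the suffix for the NSA series, so pass in whatever FRED uses for it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

final class FredDownloadSketch {
    // Base URL taken from the Alaska example link.
    static final String BASE = "http://research.stlouisfed.org/fred2/data/";

    // Builds <STATE CODE><METRIC>.txt, e.g. "AK" + "UR" -> .../AKUR.txt
    static String buildUrl(String stateCode, String metric) {
        return BASE + stateCode + metric + ".txt";
    }

    // Downloads one series into targetDir and returns the saved path.
    static Path download(String stateCode, String metric, Path targetDir) throws IOException {
        Path target = targetDir.resolve(stateCode + metric + ".txt");
        try (InputStream in = new URL(buildUrl(stateCode, metric)).openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        String[] states = {"AK", "AL", "AR"}; // hypothetical subset; extend to every state you need
        Path dir = Files.createDirectories(Paths.get("savefiles")); // assumed output folder
        for (String state : states) {
            System.out.println("Saved " + download(state, "UR", dir));
        }
    }
}
```

Running main once per metric mirrors the "ran it twice" approach: one pass per suffix, same list of states.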

Once you execute that Java code, you have to go through the saved files and load them all.  You could manually open every file and load it, but that would also take a long time.  Open source options include Talend and KNIME, both of which have Java modules.  If you want to productionize this specific example, you will want to explore those first.  For this first attempt, though, I used SQL Server Integration Services (SSIS) to load the files, since I have it and I’m very familiar with it.  Here are the basic parts of the SSIS package:

  • The File-name variable: We need to know what state and what attribute we’re trying to save. Create one called ‘filename’ and default it to the first file in the directory.
    blog2_Pic1
  • The Container: Add a Foreach Loop Container and put a Data Flow Task (DFT) inside it.
    blog2_Pic2
  • Container properties: Click on Collection and ensure the ‘Enumerator’ is set to ‘Foreach File Enumerator’. Change the folder to wherever you saved the data previously, and change ‘Files’ to *.* (ensure no other files are in that folder).
    blog2_Pic3
  • Click on Variable Mappings and choose the ‘filename’ variable you made earlier.
    blog2_Pic4
  • Next create your source and destination connections. Your source will be a flat-file connection, and your destination will be your database.
  • Flat File Connection: In this instance, make sure you skip the first 11 rows for the FED data:
    blog2_Pic5
  • Next click on Preview and verify that everything looks okay:
    blog2_Pic6
  • Next go into the Data Flow Task (DFT). Add a ‘Flat File Source’, a ‘Derived Column’, and an ‘OLE DB Destination’. Connect the modules like this:
    blog2_Pic7
  • Click on the ‘Derived Column’ module (called Get File Name in the image above) and add in a derived column called ‘File’ and configure it like this:
    blog2_Pic8
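
If you do not have SSIS, the package steps above (loop over the files, skip the header rows, tag each row with its file name) can be roughed out in plain Java. This is only a sketch under assumptions, not a replacement for the package: I am assuming each data line is a date followed by a value separated by whitespace, the class and method names are made up, and the database insert (for example through JDBC) is left out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class FredLoadSketch {
    static final int HEADER_ROWS = 11; // rows to skip, as in the flat-file connection step

    // Returns one {file, date, value} row per data line (assumed layout: "DATE VALUE").
    static List<String[]> parse(List<String> lines, String fileName) {
        List<String[]> rows = new ArrayList<>();
        for (int i = HEADER_ROWS; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            if (line.isEmpty()) continue;
            String[] parts = line.split("\\s+");
            if (parts.length < 2) continue; // not a date/value pair
            // The file name plays the role of the 'File' derived column.
            rows.add(new String[] {fileName, parts[0], parts[1]});
        }
        return rows;
    }

    // Plays the role of the Foreach Loop Container: every .txt file in the folder.
    static List<String[]> parseDirectory(Path dir) throws IOException {
        List<String[]> all = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path f : files) {
                List<String> lines = Files.readAllLines(f, StandardCharsets.UTF_8);
                all.addAll(parse(lines, f.getFileName().toString()));
            }
        }
        return all;
    }
}
```

Keeping the file name on every row matters because the state code and metric only exist in the name, not inside the data lines.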

Choose a destination table and you will be able to load a bunch of FED data easily. Modify the file names and locations and you can then download & load a variety of state-level data sources fairly easily.

Why did I use Java and SSIS to do all of this? Well I had pulled files from the internet using Java in the past… and I had also used SSIS to load multiple files from a directory. So I just mashed together two easy things I had done before and it didn’t take much time. I knew Java had ways to interface with the internet, and I knew SSIS could loop and connect easily to a DB. Both of these are obvious. Unfortunately SSIS is not open and requires someone to have a SQL Server so this method is pretty restrictive for the part-time data engineer out there. Regardless, I was able to quickly capture 102 text files and load them into a database and build this visualization comparing the Seasonal vs Non-seasonal unemployment rates:

Final:
https://public.tableausoftware.com/views/Unemployment_33/AdjstmentDashboard?:embed=y&:display_count=no

 

Rules and Probabilities for Double Monopoly

Tableau workbook with complete probabilities based on 4 board orientations and 2 8-sided dice with the doubles rule: bit.ly/1qEXfnK

So Double Monopoly has 2 boards.  It’s twice the fun! The boards are set corner-to-corner, and people move in a figure 8.  Boards can be joined at any corner, but to keep things consistent, let’s try Go to Go (GO2GO), Jail to Jail (J2J), Free Parking to Free Parking (FP2FP), or Go to Jail to Go to Jail (GJ2GJ).  As players cross the shared space twice, nothing is diminished from that space’s probability.  If you were to join different corners, one corner would be hit only once.  Click on this:
4_DoubleMonop_boards_Small2


 

A complete monopoly can consist of the correctly corresponding properties from either reality, i.e. Park Place and Imperial Palace are a match, and Yoda’s Hut and Farmer Maggot’s are a match.  Boardwalk and Imperial Palace are not a match.  Nor are Hey Jude, Abbey Road, and Ganondorf, etc. Owning 2 complete monopolies of the same color doubles the rent.  Rent on any unimproved property in them is quadruple the base unimproved rent.  Having hotels on Boardwalk and Park Place, but nothing on Mt Doom and Barad Dur, still doubles the rent on BW and PP.

In order to speed things up, two 8-sided dice are used.  Alternatively, use three 6-sided dice, as long as 2 are the same color so you can still use the 3-doubles-go-to-jail rule.  Also, to ensure those hard-to-land-on properties are hit, we can modify the Free Parking rule: Free Parking allows you to move to the next unpurchased property.  After all properties are purchased, FP reverts to however you’d usually use it.  Another way to speed the game up is to shuffle and deal 3-5 properties at the beginning and have each person pay for theirs at the start.

Cards apply only to the board of their origin. Go-to cards move the player to the named space on that same board, enabling them to ‘bypass’ the 2nd board.  The Go-to-Jail card works the same way. The Go-to-Jail (G2J) space applies to the board the player has just left if the boards are joined at G2J.  Any special rules for an alternate reality board apply to that board only.

$3k is given out.  Due to inflation :), $1s are not used, but now become $1,000s.  All prices are rounded to the nearest 5; Mediterranean Ave must be landed on twice for any pay.  Thus the only change to the starting money is 1 – $1k, and an additional 1-$500 and 1-$5.

Currency options.  It is possible to keep both monies separate and require players to pay off debts and purchases in each reality’s own currency.  So if you need imperial credits, and all you have is ‘love’, then you would have to trade or use the bank as a currency exchanger of last resort.  Since ‘money can’t buy you love’, you could make it a rule that the bank cannot exchange Monopoly money for Beatles ‘love’ bucks.  If you let the bank charge a large fee, say 50% ($100 Monopoly money becomes 50 imperial credits), then other players can act as currency traders and arbitrageurs.  Forcing a very high fee, or the ‘money can’t buy you love’ rule, adds an additional level of screwage that other players can enact on the person.  However, the bank will need to use a previously agreed-upon exchange rate for bankruptcy proceedings.  Preferably with something like this line:  “Republic credits? Republic credits are no good out here. I need something more real.”

Railroad rents:   Several multiplier options; I haven’t decided on which.

RR Options
RRs owned    x2      Half    x1.90
1            25      13      25
2            50      25      48
3            100     50      90
4            200     100     171
5            400     200     326
6            800     400     619
7            1600    800     1,176
8            3200    1600    2,235
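
As a cross-check on the x2 and x1.90 columns, here is a small sketch that regenerates them. It assumes the rent starts at 25 and is multiplied once per additional railroad, with the result rounded half up (which is how I read the 48 for two railroads). The class and method names are made up. It uses integer arithmetic so the rounding does not depend on floating point.

```java
final class RailroadRentSketch {
    // Rent for `owned` railroads: base * (num/den)^(owned-1), rounded half up.
    static long rent(long base, long num, long den, int owned) {
        long n = base;
        long d = 1;
        for (int i = 1; i < owned; i++) {
            n *= num;
            d *= den;
        }
        return (2 * n + d) / (2 * d); // integer round-half-up of n/d
    }

    public static void main(String[] args) {
        System.out.println("owned  x2  half  x1.90");
        for (int owned = 1; owned <= 8; owned++) {
            long doubled = rent(25, 2, 1, owned);
            long half = (doubled + 1) / 2;         // half of the x2 column, rounded half up
            long gentler = rent(25, 19, 10, owned); // 1.90 written as 19/10
            System.out.println(owned + "  " + doubled + "  " + half + "  " + gentler);
        }
    }
}
```

Swapping in another multiplier, say 17/10, is a one-argument change if you want to compare more options before settling on one.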

Utility rents:

Utilities owned    Multiplier (x2.5)
1                  4
2                  10
3                  25
4                  62.5

With two 8-sided dice, a roll of 16 with all 4 utilities owned would yield 1000 (16 x 62.5), making the utilities pretty lucrative, if only rarely (1/64 chance with two 8-sided dice).
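
The 1/64 figure is easy to verify by counting. This is a minimal sketch, separate from dub.java, with made-up class and method names. It ignores the doubles rule and only counts plain sums of two fair dice.

```java
final class TwoD8Sketch {
    // Number of ways two fair dice with `sides` faces can sum to `total`.
    static int ways(int sides, int total) {
        int count = 0;
        for (int a = 1; a <= sides; a++) {
            int b = total - a;
            if (b >= 1 && b <= sides) count++;
        }
        return count;
    }

    // Probability of rolling `total` with two such dice.
    static double probability(int sides, int total) {
        return ways(sides, total) / (double) (sides * sides);
    }

    // Utility rent is the dice roll times the multiplier from the table.
    static double utilityRent(int roll, double multiplier) {
        return roll * multiplier;
    }

    public static void main(String[] args) {
        for (int total = 2; total <= 16; total++) {
            System.out.println(total + ": " + ways(8, total) + "/64");
        }
    }
}
```

Only one combination (8 and 8) produces a 16, while a 9 can be rolled eight different ways, so the big utility payout really is the rare case.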

 


Java code used to calculate the probabilities.  It should work in a standard Eclipse installation with Java 7: dub.java

Grade Inflation UGA vs GaTech

I have a quick chart here showing grade inflation by schools. It would seem that my Alma Mater experienced less grade inflation than UGA. UGA had a significant bump during the 1990s. Were professors just more liberal with their grades during that time? It could also be that Georgia experienced an increase in higher quality applicants.

Other GA schools are also available; play with the Tableau viz below:

GPA Inflation

Discrete and Continuous coloring in Tableau.

This has long bothered me in certain circumstances.  If you have a null value in a continuous pill then Tableau will color that null as if it were a zero.  Under most circumstances this is an okay solution, and I have to credit the programmers; it is better than the alternative.  When in doubt, show something and show it predictably.

So here is the problem I have faced.  Let’s say zero is good, 10 is bad, and NULL is indifferent: Tableau will display null in the same color as zero.  Now let’s say zero is good, 10 is bad, and NULL is BAD.  Or let’s say 10 is good, -10 is bad, and NULL is something else.  In every case, when you place the data into Tableau it will color nulls as zero.

However, Tableau is awesome and I was able to find an easy solution:

FinishedProduct

The secret is not really hard.  Basically all you need is two calculated fields and a “transparent” image placed in your custom icons.

Formula 1
Float:  FLOAT([Type1])

This takes a number and converts it into a float; a NULL or text value becomes NULL.

Formula 2&3
NullFill_1: IF ISNULL([Float]) THEN 'NULL'
ELSE 'FILL'
END

Now just duplicate NullFill_1 as NullFill_2.

Formula 4:
1: 1

This “1” is solely a placeholder; call it anything and set it to 1.  Now take “1” and place it on the column shelf twice.  Place NullFill_1 on one shelf and NullFill_2 on the other, make one transparent, and give the other a real shape.  If you need a transparent icon, just use this one:

NA blank trans

Now your Tableau workbook should look like this:

DualAxisThis

Now just dual axis those, adjust some of the legends and you will get this awesomeness!

 

Simple Hex Binning using R

R of course has numerous packages available.  One of them is hexbin.  Hexbinning gives the user a way to visualize high-density scatterplots.  There is a way to build it in Tableau without R, but it involves many more calculated fields.  A very simple way is to use hexbin() in R together with a Tableau custom shape.

Now this method relies on transparency to duplicate the ‘look’ of a hexbin by overlaying the hexagons on top of each other.  You can see that if you highlight one of the hexagons it will say ‘5 items selected.’  This method does not actually group things into a hexagonal bin.  Here is the problem with this simple solution: Tableau likes to receive exactly the same number of rows as it sends to R.  Thus it is not possible (yet…) to send 50 rows and receive back 5 rows with a count.  That is essential for a true hexbin implementation (I am hot on the trail of an idea around this).

Here are the formulas

Hexbin X:
SCRIPT_REAL('library(hexbin);hbin<-hexbin(.arg1,.arg2,xbins = .arg3,xbnds = c(-1,1),ybnds = c(-1,1));xys <- hcell2xy(hbin);xys$x',avg(randx),avg(randy),[Bins])

Hexbin Y:
SCRIPT_REAL('library(hexbin);hbin<-hexbin(.arg1,.arg2,xbins = .arg3,xbnds = c(-1,1),ybnds = c(-1,1));xys <- hcell2xy(hbin);xys$y',avg(randx),avg(randy),[Bins])

 (Notice the distinction here xys$x vs xys$y)

Here is what the script is doing.  library(hexbin) loads the library.  You may need to install hexbin on your R instance first.  Here is what the hexbin() function likes to see:

x: The x values
y: The y values
xbins: The number of bins
xbnds: The lower and upper bounds for x
ybnds: The lower and upper bounds for y

The next command, hcell2xy, simply returns the hexagon’s coordinates for each row so that Tableau can receive them back and display them.

Possible errors:

  • R error:  “xbnds[1] < xbnds[2] (or ybnds[1] < ybnds[2] )”

This specific error means that ‘xbnds’ (or ‘ybnds’) must be given as the lowest bound followed by the highest bound.

  • Hexagons do not match up and there is a star-looking negative space between the ‘bins’. You just need to swap the axes.

But luckily this is a 1 button fix :)  :

 

Oh one more thing, here are two good hexbin files that you can use for custom shapes.  Add these to the \My Tableau Repository\Shapes\My Custom Shapes\ directory.

Hexagon_M_Filled

Hexagon_M_Hollow

Here is the working Tableau Packaged workbook.  Note that due to the R integration, I cannot upload this to Tableau Public.

Using an R function and Tableau 8.1 to map custom areas

The R Project for Statistical Computing has a smörgåsbord of functions available to use.  This is obvious to anyone who has spent 30 minutes with the program.  Tableau is also equally awesome and really excels for displaying maps.  Again this should be manifest for anyone who’s spent 3 minutes with the product.
(Finished Product)
Convex_hull

Steps to Mapping a District using Tableau and R.

Tableau 8.1 now includes R integration.  This is my first major application of R and Tableau.  I had played around with it before, using seasonal decomposition to remove the seasonal swings, but that was mainly for a personal view of the data and not anything to display publicly.   This post will outline how to do something very cool by combining Tableau’s easy mapping features AND R’s powerful packages.   (Want to skip over all these boring instructions? Here is the finished workbook.)

Basic setup needed before you attempt anything in Tableau (hold your horses we’ll get there soon):

Step 1:  Install R and R Studio.
Step 2:  Within R Studio, install the Rserve package.

Tools -> Install Packages
blog_pic1

Type Rser and choose Rserve (and type the “ve” if you’re the OCD – intellisense hater type).
blog_pic2
Choose Rserve.
Install

Step 3:  Start up Rserve from the RStudio Console:

> library(Rserve);
> Rserve()

Step 4: Connect Tableau to your local Rserve and test the connection
blog_pic3

Use localhost and then test the connection:
blog_pic4

BAM! You’re ready to rock with the data rock stars!   \m/ (>.<) \m/
But now onto the hard stuff in Tableau (see that didn’t take much time):

Fields you need:

  • Points you want to map.
  • A Group or Grouping Hierarchy (if need be just create a calculated field Group = “group”)
  • Latitude
  • Longitude

Tableau calculated Fields:

Size: = SIZE()

R-Derived Calculated Fields (In Tableau):

Script_PolyOrder:
SCRIPT_REAL("X<- matrix(,nrow=.arg1, ncol=2 );X[,1]<-.arg2;X[,2]<-.arg3;"+
"Y <- (ifelse(X[,1] %in% X[chull(X)],1,0));"+
"Z <- matrix(0,ncol = 1,nrow =.arg1);Z[c(chull(X)),1] <- seq_along(chull(X));Z",[Size],attr([Lat]),attr([Long]) )

2nd Tableau calculated Field:

Script_In_Ext:
case [Script_PolyOrder] when 0.0 then 'aInterior'
else 'Exterior'
end

The first field, “Script_PolyOrder”, simply determines the polygon order of the points.  The second field then determines whether the point is interior or exterior.  I forced ‘Interior’ into ‘aInterior’ to make it sort before ‘Exterior’.  “Why didn’t you just rename ‘Exterior’ to ‘Outside’?”  Okay, I realize that now as I write this blog, but I was going for pure function when I wrote this Tableau file.  Below are two additional optional fields, which are useful for experimenting and determining how Tableau sees things.  I will go more into this later.

Script_Lat:
SCRIPT_REAL("X<- matrix(,nrow=.arg1, ncol=2 );X[,1]<-.arg2;X[,2]<-.arg3;"+
"Y <- (ifelse(X[,1] %in% X[chull(X)],1,0));"+
"Z <- X*Y;X[,1]",[Size],attr([Lat]),attr([Long]) )

Script_Lng:
SCRIPT_REAL("X<- matrix(,nrow=.arg1, ncol=2 );X[,1]<-.arg2;X[,2]<-.arg3;"+
"Y <- (ifelse(X[,1] %in% X[chull(X)],1,0));"+
"Z <- X*Y;X[,2]",[TotalItems],attr([Lat]),attr([Long]) )

Building the Map:

Drag the pills on the map like this: blog_pic5

And dual axis the 2nd Lat pill.  The dual axis part can be done later if you’d like.  It is somewhat nice for clarity to build the different parts separately and THEN see them come together.

Make the first Lat a Polygon.  Drag “Group” onto the detail shelf.  Uncheck Aggregate Measures in the Analysis Menu.  Drag Script_PolyOrder onto path and WHAM you have a polygon!

Now this is the important part for visualization.  Make the 2nd Lat pill the Circle (or shape if you want to) and make sure the first Lat pill is the polygon.  This is important because Tableau places the 2nd pill over the top of the first one.

Shelf Examples:

 BAD shelf / Bad Vis example:

blog_pic6

Good Shelf / Good Viz Example:

blog_pic7

 

Now you have an outline for the “Group”, and the interior points for that group do not matter to the polygon.  This prevents any sort of ugly, jagged border.

Last Step:

Drag the calculated field Script_In_Ext onto the page shelf.  Check “Show History”.  Move the page shelf to the last page, “Exterior”.  Adjust the “Fade” to fade out the previous page.   Now you are finished and it should look amazing.  Keep reading if you want to know the details of how it was achieved:

Alternate strategy:

There is another option I toyed with: http://community.tableausoftware.com/thread/140023

Basically, if you already know your polygon order and do not wish to have a dynamic polygon drawing, you can skip the page shelf and R entirely.  Create a UniqueID which separates the interior points from one another and puts the exterior points into the same group.

UniqueID

case [Script_In_Ext] when 'aInterior' then "Str"+str(Attr([LowerHierarchy]))
else "D" + str(attr([NextLevelUpHierarchy]))
end

Add this to the shelf in place of your lowest hierarchy.  In essence this calculated field dumps interior points into their own unique 1-point polygon and groups all the exterior points into 1 polygon.  This was the first way I solved it, but looking back this is slightly more complicated AND requires me to know the polygon order.  Adding in the polygon order in Excel is another post but it is possible and maybe preferable in some circumstances.

One benefit of this is that by integrating R somewhere earlier in the pipe (say with RExcel or KNIME) you don’t need to bog down the visualization server with calculations (and recalculations).  This is a good solution for many cases.
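
If the polygon order is going to be computed ahead of time anyway, it does not have to come from R. Below is a rough sketch in Java of the same idea as Script_PolyOrder: hull points get 1, 2, 3... in path order and interior points get 0. This is my own stand-in using Andrew's monotone chain algorithm, not R's chull(), so the hull is walked counter-clockwise rather than in the order chull() returns. It treats latitude and longitude as flat x/y coordinates, and it assumes at least 3 points with no duplicates. The class and method names are made up.

```java
import java.util.Arrays;
import java.util.Comparator;

final class HullOrderSketch {
    // Cross product of OA x OB; a positive value means a counter-clockwise turn.
    static double cross(double[] o, double[] a, double[] b) {
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0]);
    }

    // pts[i] = {x, y}. Returns order[i]: 0 for interior points,
    // 1..h for the h hull points in path order.
    static int[] polygonOrder(final double[][] pts) {
        int n = pts.length;
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        // Sort point indices by x, then by y.
        Arrays.sort(idx, new Comparator<Integer>() {
            public int compare(Integer p, Integer q) {
                int c = Double.compare(pts[p][0], pts[q][0]);
                return c != 0 ? c : Double.compare(pts[p][1], pts[q][1]);
            }
        });
        int[] hull = new int[2 * n];
        int k = 0;
        for (int i = 0; i < n; i++) { // lower hull, left to right
            while (k >= 2 && cross(pts[hull[k - 2]], pts[hull[k - 1]], pts[idx[i]]) <= 0) k--;
            hull[k++] = idx[i];
        }
        for (int i = n - 2, t = k + 1; i >= 0; i--) { // upper hull, right to left
            while (k >= t && cross(pts[hull[k - 2]], pts[hull[k - 1]], pts[idx[i]]) <= 0) k--;
            hull[k++] = idx[i];
        }
        int[] order = new int[n];
        // The last hull entry repeats the first, so stop one short.
        for (int i = 0; i < k - 1; i++) order[hull[i]] = i + 1;
        return order;
    }
}
```

For a unit-style square with one point in the middle, the four corners come back numbered 1 to 4 and the middle point comes back as 0, which is the shape of data the path shelf and the interior/exterior field need.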

Tableau and R

Tableau wants to see Matrices and vectors from R.  You can only pass 1 column of a matrix back to Tableau.

Let’s go over Script_Lat.  In this calculated field, you pass the latitude and longitude to R and get R to pass back the latitude.  This is purely illustrative and is not needed for the map to function.

SCRIPT_REAL("X<- matrix(,nrow=.arg1, ncol=2 );X[,1]<-.arg2;X[,2]<-.arg3;"+
"Y <- (ifelse(X[,1] %in% X[chull(X)],1,0));"+
"Z <- X*Y;X[,1]",[Size],attr([Lat]),attr([Long]) )

First function:

X<- matrix(,nrow=.arg1, ncol=2 );

This creates a matrix X with a variable number of rows and 2 columns.  The number of rows is based on the Size calculated field, which is the first argument after the R code in the Tableau calculated field.

2nd & 3rd Functions:

X[,1]<-.arg2;
X[,2]<-.arg3;

Load Column 1 of matrix X with the 2nd argument, which is latitude, and load longitude into the 2nd column.

4th Function, the confusing part:

Y <- (ifelse(X[,1] %in% X[chull(X)],1,0));

Basically, save a vector Y holding the results of an if-then test of which points of X are in the convex hull of X.  I know it’s hard to describe:

http://stackoverflow.com/questions/22072194/basic-r-how-to-populate-a-vector-with-results-from-a-function

http://stackoverflow.com/questions/22096182/r-retain-order-from-a-vector-apply-it-to-another-vector

5th function:

Z <- X*Y;

Save a matrix Z which is matrix X multiplied by vector Y.  Note that this is an “element-wise multiplication” and not true matrix multiplication.  This zeros out any interior points, leaving only exterior points.  The Tableau calculated field then uses this.  This is a screen grab of Z within RStudio using random data, showing zeros for the interior points.
blog_pic8
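
To make the element-wise point concrete outside of R, here is a tiny Java sketch of what X*Y does when Y has one 0/1 flag per row of X: every value in row i is multiplied by flag i. The class and method names are made up, and the coordinates in the usage note are just sample numbers.

```java
final class ElementWiseSketch {
    // Multiplies every value in row i of x by the 0/1 flag y[i].
    // This mirrors X*Y in R when length(Y) equals nrow(X).
    static double[][] maskRows(double[][] x, int[] y) {
        double[][] z = new double[x.length][];
        for (int i = 0; i < x.length; i++) {
            z[i] = new double[x[i].length];
            for (int j = 0; j < x[i].length; j++) {
                z[i][j] = x[i][j] * y[i];
            }
        }
        return z;
    }
}
```

With rows {33.7, -84.4} and {33.8, -84.3} and flags {1, 0}, the first row is kept as-is and the second row becomes zeros, which is the interior-points-zeroed picture in the screen grab.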

6th function:

X[,1]

Now return the latitude.  You could return X[,1], X[,2], or Z[,1].  This last expression is what Tableau expects to see.  Finished calc’d field in Tableau:
blog_pic9

Possible Errors:

You may get this error (possibly many times :D ): blog_pic10

Press details.  It could be that you need to start Rserve.  Or you may need to redo the Size field (configuring the Size field was difficult; play with Aggregate Measures, use Tableau’s Total fields, or generally adjust until you’re sending R the correct # of elements).  Or you made some sort of error within R.  If the problem is R syntax, I suggest following the trail within RStudio to determine the error.  Use this to generate random points and start debugging:
X <- matrix(stats::rnorm(100), ncol = 2)

 

Personal Traffic patterns

For the past two years I have been using OpenPaths to analyze my personal movement data.  I looked at the data through Tableau and came up with some cool views. I decided to slice the data in new ways by using longitude and latitude binning. This easy Excel map hack gave me ideas, and I then created a variable latitude/longitude binning method in Tableau.  One problem with the Excel approach is that the bin sizes are pretty static.  The variable zoom method uses a formula of Round([x]*[Zoom],0)/[Zoom] instead of just rounding to a magnitude of 10.  So for instance, placing the formula =ROUND(A1*4,0)/4 into Excel will round the values to the nearest quarter.  A formula of =ROUND(A1*100,0)/100 will round it to two digits.
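
Here is the same zoom formula as a small Java sketch, in case you want to pre-bin the points before they reach Tableau. The class and method names are made up. One caveat: Java's Math.round rounds exact halves up, while Excel's ROUND rounds them away from zero, so the two can differ on exact ties for negative values such as western longitudes.

```java
final class ZoomBinSketch {
    // Round([x]*[Zoom],0)/[Zoom]: snaps x to the nearest 1/zoom.
    static double bin(double x, double zoom) {
        return Math.round(x * zoom) / zoom;
    }

    public static void main(String[] args) {
        System.out.println(bin(2.3, 4));      // nearest quarter
        System.out.println(bin(2.3456, 100)); // two digits
    }
}
```

A larger zoom value gives smaller bins, so a single parameter controls how coarse or fine the map grid looks.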

I used that binning method to create a series of maps tracking my personal location.  In this first map I have a general record of two paths I took to work.  Each bin is colored by the number of records.  Red is high, green is low.

Atlanta_Metro_numberofRecords2

The next map shows those two routes.  I took the average velocity for each bin and colored the bin by that.  Lower velocity is in blue while higher velocity is in red.  What is interesting is that the route to job 1 was much faster, while route 2 had several sections of slowness (bottlenecks).

Routes_to_job

This third map is interesting, showing the effect of a traffic light on velocity.  I only analyzed this location on the afternoon commute (when I traveled north), so you can see a blue coloring leading to the intersection and immediately after, followed by a bright red (fast) bin.  As velocity is calculated using the distance from the previous measurement, and a congested intersection takes a little bit of time to clear out, it makes sense that this would lag the intersection some.

Intersection
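
For reference, velocity from consecutive location records can be sketched as distance from the previous point divided by the elapsed time. The post does not say which distance formula was used, so the haversine great-circle distance here is my assumption, as are the 6371 km Earth radius, the millisecond timestamps, and the class and method names.

```java
final class VelocitySketch {
    static final double EARTH_RADIUS_KM = 6371.0; // mean Earth radius, an approximation

    // Great-circle distance between two lat/long points, in kilometers.
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // Speed in km/h between the previous record (1) and the current record (2).
    static double speedKmh(double lat1, double lon1, long t1Millis,
                           double lat2, double lon2, long t2Millis) {
        double hours = (t2Millis - t1Millis) / 3_600_000.0;
        return haversineKm(lat1, lon1, lat2, lon2) / hours;
    }
}
```

Because each speed is attached to the later of the two records, a slow stretch shows up slightly past where it actually happened, which matches the lag after the intersection.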

Full listing of Obamacare Health Care Plans. Price visualization.

So here is a Tableau visualization of all the Obamacare health plans for states that elected to use the federal exchange (not sure about the states where the Feds and state Gov’t split some responsibility). Here is a breakout map. The values differ by county so I am showing the averages. Full visualization of all the health plans can be found here.

Clicking on any circle, or highlighting a section will bring up those plans below. You can then sort and filter as needed.