Learning Plan

I’ve been told at work to spend some time learning additional topics at home on my own time. My primary focus for this year will be Docker and Kubernetes. From what I have read so far, both will be a challenge. I have become accustomed to configuring and using virtual machines over the past five years. I even have a document that guides me through creating one, a process that takes about an hour.

Docker will eliminate the need for a virtual machine, while Kubernetes will eliminate the need for the crontab I have on the VMs.

On the personal side, I have been working through Python and Django. While I am okay at Python, Django will be new to me. I took a class three years ago on PHP and learned enough to create a web page that interacts with a MySQL database. I have been working through the Django class for far too long and need to complete it early this year.

I have also decided to add on two other technologies. The first is Flask. This will be more useful at work than at home, though it is not one of the suggested topics at work. I am interested in building some RESTful APIs for some common data collection routines. We’ll see if the class gets me to that objective.
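To make the Flask goal concrete, here is a minimal sketch of the kind of RESTful API I have in mind. The endpoint name and the in-memory data are hypothetical stand-ins for a real data collection routine.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical stand-in for the output of a data collection routine.
READINGS = [
    {"machine": "press-1", "utilization": 0.82},
    {"machine": "press-2", "utilization": 0.77},
]

@app.route("/api/readings")
def list_readings():
    # Expose the collected data as JSON for other tools to consume.
    return jsonify(READINGS)

if __name__ == "__main__":
    app.run(debug=True)
```

Even a route this small covers the core of what the class should teach: mapping a URL to a function and returning structured data.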

The second class is on D3. D3 (Data Driven Documents) is a JavaScript library for visualizing and interacting with data on the web. I attended a Data Jam session yesterday which was my first encounter with D3. The library is more complicated than I imagined and after a few hours, I decided to utilize the sale at Udemy and purchased a class.

That is quite a bit of learning to accomplish this year. I have come to terms, loosely, with how little free time I have during the week. It’s tough to accept since it puts pressure on the weekends. Dealing with my self-created deadlines will be my personal maturity achievement for the year.


Credit Growth and GDP Visualization

I thought about writing a post describing this new series I will follow. Instead, I decided to use it as a curriculum for data visualization. Let’s start with the graph, which will open in a new window or tab. Your eyes should immediately go to the red and green lines. Those are the important elements of the graph, and they should be what you notice first.

Let’s move beyond that and talk about why they are the most noticeable. They are a brighter color compared to the rest of the items on the graph. The right-hand side is clear of other items and that space gives us a clear view of the most recent data observations of the two series.

After that, we may look to the horizontal or vertical axis — the horizontal axis because we are interested in the time of the observation or the vertical because we are interested in the scale of the data. In both cases, there are numbers without text. I have been a strong advocate for labeling each axis until recently, when I noticed people using text within the graph to describe the axes. There are two advantages to this method: text is not rotated 90 degrees for the vertical axis and the text box doubles as the legend. In effect, we have replaced three text boxes (the horizontal axis label, the vertical axis label, and the legend) with a single text box that may not have any more words than the three combined.

In this particular case, we see what the horizontal and vertical axes are. We can also see a legend which shows the two data sets using the color of the line as part of the legend. This removes the graph type icon from the legend of most software and doubles the density within the new legend — a single line describes the series and defines which series it is.

Perhaps next our vision moves to the center of the graph which shows another text box with the most recent observation. I generally approve of this on all graphs since it shows how current the data is as well as the value of the most recent observation. It is not always easy to tell what each observation is especially as the line moves farther away from the scale which is generally on the left-hand side. This text box leaves no doubt on the value at the end point of the line. Notice that I also included the color of the line for each data observation. While this is not necessary for this graph, it provides reinforcement on the last observation for each data series to match it to the proper line.

By now, we have noticed the large paragraph of text near the left side of the graph. I generally do not like detailed text on graphs. I especially do not like it when I am giving a lecture — if I am truly speaking to the graph, there is no reason for additional text to distract from my comments. The counter to the lecture is the printed graph, which is what we are seeing now. For this type of presentation, it might be useful to include an explanatory paragraph to provide detail into how to read the graph and why it is meaningful. Since this graph will be updated at best quarterly, including background and detail will be helpful for any audience.

Finally, there is a text box in the lower-right which shows the source of the data. I am usually quite deficient at including the data source. It is best to include it because you will have skeptics who will not believe your conclusions and observations until they look at the data series themselves.

Okay, that is it for the text boxes. They add a lot to the visual stimulation of the graph, but I think they have their purpose and they do not detract from the data. There is one item to discuss that is not seen — the gridlines. I did not speak of them because they do not stand out. Tableau defaults to a light gray color and I highly approve. Microsoft Excel defaults to black and it can make it unnecessarily difficult to read a graph. In this case, they are subtle and come into visual acuity only if needed.
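The principles above — in-graph colored text instead of a rotated axis label and a separate legend, plus light-gray gridlines — can be sketched in a few lines of matplotlib. The series values here are invented purely for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; no display needed
import matplotlib.pyplot as plt

years = [2018, 2019, 2020, 2021]
credit = [5.0, 4.2, 3.1, 4.8]   # % y/y, invented numbers
gdp = [2.1, 2.4, -1.8, 3.0]     # % y/y, invented numbers

fig, ax = plt.subplots()
ax.plot(years, credit, color="red")
ax.plot(years, gdp, color="green")

# Light-gray gridlines stay subtle, much like Tableau's default.
ax.grid(color="0.85", linewidth=0.8)

# Colored in-graph text doubles as the legend, so no rotated
# vertical-axis label and no separate legend box are needed.
ax.text(2018.1, 5.1, "Credit growth, % y/y", color="red")
ax.text(2018.1, 1.4, "GDP growth, % y/y", color="green")

fig.savefig("credit_gdp.png")
```

Because the text carries the line's color, a single phrase names the series, states the units, and identifies which line is which.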

Graph Axis

One of the inspirations I have for wanting to conduct a class on data visualization is watching students attempt to show data. It is unfortunate they are not being guided to what makes an effective graph. I thought about this today when I saw a graph of the capacity utilization of a machine and the y-axis was scaled from 0 to 1.2.

There are two things wrong with the y-axis. First, the measure of capacity utilization is in percent, which means the y-axis should also show percent. Second, capacity utilization can never go above 100%, which means the y-axis should not extend above 100% either. I know that Excel 2007 and prior versions defaulted to a black box encasing the graph, which made a bar that reached the top of the chart difficult to read, but it is not hard to remove that black box.

I was thinking about a similar graph that I would update and show to students on how they were making progress through the homework. I made an example, a homework-completion bar chart, on the right. I’m not terribly thrilled with the data labels, but I can modify those later. The graph shows every student has completed homework 1 and 80% have completed homework 2. From there, it might be a reasonable conclusion that we haven’t reached the point where the topics in homework 3 have been covered in class. The structure of the class allows the students to complete the homework at their own pace. We can see there are some who are working ahead. That could be an issue if they become bored in class or lose track of what they should be learning on their own.

The point of displaying this graph is to show a proper way to scale the y-axis when displaying percentages. The data label helps confirm the first bar does not extend above the top of the graph (if an additional clue were needed).
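A matplotlib sketch of that scaling is below. The completion rates are hypothetical; the point is the axis: capped at 100% and labeled in percent rather than a 0-to-1.2 fraction.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen
import matplotlib.pyplot as plt

# Hypothetical completion rates, already expressed in percent.
homework = ["HW 1", "HW 2", "HW 3", "HW 4"]
completed = [100, 80, 35, 10]

fig, ax = plt.subplots()
bars = ax.bar(homework, completed)

# Cap the axis at 100%, since completion can never exceed it,
# and show the scale in percent.
ax.set_ylim(0, 100)
ax.set_yticks(range(0, 101, 20))
ax.set_yticklabels([f"{t}%" for t in range(0, 101, 20)])

# Data labels confirm the 100% bar reaches, but does not pass, the top.
ax.bar_label(bars, fmt="%d%%")

fig.savefig("homework_completion.png")
```

With the axis fixed at 0–100%, a full bar reads as exactly full, which is the whole complaint about the 0-to-1.2 version.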

Since this data will be shown every week, the next step might be to devise a 3-d bar graph. I am generally opposed to any 3-d graph when using 2-d media, but it could be constructed properly without being impossible to read. Showing it in combination with the current 2-d version would make for a good story.

Data Analytics and Data Visualization

Some may know that I recently got a second job as an adjunct professor in the University of Washington system. While the first course I will be teaching is on Operations and Project Management, which is part of the supply chain discipline of the School of Business, I have been encouraged to work on an emphasis of data analytics and visualization. Last month, I was invited to give a guest lecture to a class on visualization. Developing those slides gave me inspiration to consider developing a curriculum around analytics and visualization. To organize my thoughts, I decided to begin here on WordPress.

It seems to me a class should be organized around the steps of developing data to answer questions. The following list could be used for academic or industrial purposes. I don’t think this list is original, but it seems logical and fits with my own approach at work.

  1. Have a question without an answer
  2. Find a data source that may lead to an answer
  3. Look for incompleteness, inconsistencies, and possible errors
  4. Perform initial analysis and construct initial visuals
  5. Write a story describing the analysis and visuals and include a conclusion
  6. Has the initial question been answered? Are there new questions?

I have encountered instances where only steps 2, 4, and 5 are performed. That has resulted in dissatisfaction from the requestor and the analyst. It seems almost all of the dissatisfaction arises from missing step one. If you don’t know what you would like to solve, you don’t know when you have finished. It is the finish that is determined in step six. Sometimes an answer to a question brings up new questions since the conclusion may not be what was expected or the answer doesn’t perfectly answer the initial question.

I have also seen instances where step three is skipped. This can be fatal to any analysis since it can lead to incorrect conclusions or worse, no conclusion when there should have been one. There generally isn’t a way to easily verify the accuracy of a data set, but incompleteness can be easy to check. My background has been in time series analysis, which may be the easiest of all types of data sets to verify for completeness. My best advice is to spend time with the data and use statistics along with graphs to see if the data looks reasonable. Odd situations can be observed with a basic approach, and they can lead to questions about events that influenced the data.
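For a time series, the completeness check in step three can be as simple as differencing the dates you received against the dates you should have. Here is a small pandas sketch with an invented quarterly series that is missing one observation.

```python
import pandas as pd

# Hypothetical quarterly series with one observation missing.
observed = pd.to_datetime(["2020-03-31", "2020-06-30", "2020-12-31"])

# Build the full quarter-end index the series should have...
expected = pd.date_range(observed.min(), observed.max(), freq="Q")

# ...and difference it against what actually arrived.
missing = expected.difference(observed)
print(missing)  # the 2020-09-30 quarter is absent
```

A gap like this is exactly the kind of thing that, left unchecked in step three, quietly distorts the analysis in steps four and five.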

One of my goals will be to also examine the graphs I have constructed and maintained on the data site. I have some opinions on the construction of charts and how to do it well. With regard to software, I have been using Microsoft Excel for twenty years and still consider it to be the best general purpose software for data analysis. For this blog though, I have turned towards Tableau Public for visualization. I used a recent entry to discuss how that software has made it simpler for me to keep the visuals of this blog updated.

Okay, that is a simple introduction. There is no timetable for what I will put in this category, but it will keep my imagination active for a while.