Daily Workflow

I have created multiple Jupyter Notebooks, and keeping track of the order in which they run makes for an efficient workflow. Each day splits into two parts because the previous day's results are not available until later in the day.

The first part looks ahead to future days. The first step is to go to Fantasy Pros and collect the next three days' starters who are owned in less than 20% of leagues.

The second part uses only the streaming pitchers from two days ahead and collects data in five separate areas.

First is information around ERA. Each previous start has an ERA calculation and, in aggregate, VPR and Volatility are calculated. A scorecard color is assigned based on a minimum number of innings pitched per start as well as the ratio of excellent starts to poor starts.
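
As a rough illustration, here is how that color assignment might look in Python. The thresholds, column names, and colors below are placeholders of my choosing, not the actual rules:

```python
import pandas as pd

def era_scorecard_color(starts: pd.DataFrame,
                        min_ip_per_start: float = 5.0,
                        excellent_era: float = 2.5,
                        poor_era: float = 6.0) -> str:
    """Assign a scorecard color from a pitcher's game log.

    `starts` is assumed to have per-start 'IP' and 'ERA' columns.
    All thresholds are illustrative placeholders.
    """
    # Gate on workload first: short outings are an automatic red.
    if starts['IP'].mean() < min_ip_per_start:
        return 'red'

    excellent = (starts['ERA'] <= excellent_era).sum()
    poor = (starts['ERA'] >= poor_era).sum()

    # The ratio of excellent starts to poor starts decides the color.
    if excellent > 2 * poor:
        return 'green'
    if excellent >= poor:
        return 'yellow'
    return 'red'
```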

Second is information around WHIP. The process closely mirrors the ERA one: each previous start has a WHIP calculation and, in aggregate, VPR and Volatility are calculated. A scorecard color is assigned based on the ratio of excellent starts to poor starts; a minimum number of innings is not used since that gate is already applied in the ERA category.

Third is the calculation of luck. There are three measures that determine a pitcher's luck, and all three are available from Fangraphs. I don't assign scorecard colors to this measure because I doubt that luck persists from start to start.

Fourth is information regarding the opponent's ERA. This is complex because I need to determine what each team did against every opposing starting pitcher. To do that, I collect the result of every starting pitcher and who they faced, then reverse the collection so it is indexed by the team. A scorecard color is assigned based on the team's relative standing versus the other teams.
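
A minimal sketch of that reversal, assuming every starter's line has already been collected into a DataFrame with hypothetical Pitcher/Opponent/IP/ER columns:

```python
import pandas as pd

# One row per start, collected from every starting pitcher.
# Assumes IP is stored as true decimal innings (outs / 3),
# not the x.1/x.2 box-score notation.
starts = pd.read_csv('all_starter_results.csv')  # Pitcher, Opponent, IP, ER

# "Reversing" the collection: index by the team each starter faced,
# so the aggregate shows what each offense did against starting pitching.
by_opponent = starts.groupby('Opponent').agg(IP=('IP', 'sum'), ER=('ER', 'sum'))
by_opponent['ERA'] = 9 * by_opponent['ER'] / by_opponent['IP']

# Relative standing versus the other teams drives the scorecard color.
by_opponent['Rank'] = by_opponent['ERA'].rank()
```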

Fifth is information regarding the opponent’s WHIP. Using the data collected already, this calculation follows the same pattern. A scorecard color is assigned based on the relative standing versus the other teams.

The final output is an HTML file that has a little bit of CSS and a little bit of JavaScript. There is a large amount of data being displayed. To avoid a wall of text, the JavaScript only shows one of the five tables at a time. This is much easier to see and digest.
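
For flavor, here is a sketch of how such a file could be generated with pandas. The real report's CSS and JavaScript are certainly different; the toggle below is just the simplest thing that works:

```python
import pandas as pd

def write_report(tables, path='streamers.html'):
    """Write several DataFrames into one HTML file; a small piece of
    JavaScript shows only one table at a time."""
    buttons, divs = [], []
    for i, (name, df) in enumerate(tables.items()):
        display = 'block' if i == 0 else 'none'
        buttons.append(f'<button onclick="show({i})">{name}</button>')
        divs.append(f'<div class="tab" style="display:{display}">'
                    f'{df.to_html(index=False)}</div>')
    script = ('<script>function show(i){var t=document.getElementsByClassName("tab");'
              'for(var j=0;j<t.length;j++){t[j].style.display=(i==j)?"block":"none";}}'
              '</script>')
    html = '<html><body>' + ''.join(buttons) + ''.join(divs) + script + '</body></html>'
    with open(path, 'w') as f:
        f.write(html)
```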

Later in the day, Baseball Reference is updated with the previous day's results. There is a six-step process I follow.

Step One collects the results from the previous day’s streamers.

Step Two runs a comparison between the selected streamer and all streamers. Eventually I would like to construct a grade based on the daily rank.

Step Three looks at the next three days of streamers and flags those who are missing their Baseball Reference code or their Fangraphs code.
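
A sketch of that flagging step, with placeholder file and column names:

```python
import pandas as pd

# Column names are placeholders for however the IDs are stored.
upcoming = pd.read_csv('streamers_next_three_days.csv')
missing = upcoming[upcoming['bbref_id'].isna() | upcoming['fangraphs_id'].isna()]
print(missing[['Player', 'bbref_id', 'fangraphs_id']])
```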

Step Four collects the team schedule from Fangraphs. I use this page because it includes the starting pitchers.

Step Five captures the results from all starting pitchers.

Step Six captures the results from all teams. This is a new step that I recently added because I want to analyze a team’s offense through the entire game. It seems that wOBA is the preferred measure and I am developing a view over a few time periods.

You can see there are quite a few steps, and some are dependent on their predecessors. Having a published order makes everything work together smoothly, and I can complete all of this in under fifteen minutes.


Daily Streamers Results

Before I write about the automation of collecting the daily streamers, I want to take one step back and talk about the list of streamers on Fantasy Pros. The site lists all players owned in less than 50% of leagues. I realized another way to score the selection is to compare the player selected to all possible candidates. The site also lists the streaming options for the next seven days. I would need to make some changes to the data I would be scraping.

Between Requests, Beautiful Soup, and Pandas, the collection of table-based data in an HTML page is shockingly easy. Each day was a separate table, and my target was just the next three days. I pulled the first three tables into three separate DataFrames. I then limited the rows to those pitchers owned in less than 20% of leagues.

With those three tables, I concatenated them and sent the final table to a CSV file.
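
A sketch of that collection, with the URL and column names as assumptions about the Fantasy Pros page:

```python
import pandas as pd
import requests

# URL and 'Own %' column are assumptions about the page layout.
resp = requests.get('https://www.fantasypros.com/mlb/probable-pitchers.php')
tables = pd.read_html(resp.text)          # one table per day

frames = []
for df in tables[:3]:                     # only the next three days
    # Ownership may arrive as a string like '15%', so normalize first.
    owned = df['Own %'].astype(str).str.rstrip('%').astype(float)
    frames.append(df[owned < 20])         # streamer-eligible pitchers

pd.concat(frames, ignore_index=True).to_csv('streamers.csv', index=False)
```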

In order to collect each pitching line, I needed to create a translation table between Fantasy Pros and Baseball Reference. There isn’t any way to do that except manually. Fortunately, there are only a few pitchers who qualify as streamers each day.

The data collection script reads in the list of streamers using a specified target date (entered as an argument — the script isn’t fully automated). The pitching list has names from Fantasy Pros. I read in the translation table and, using the map() function, convert the name to the Baseball Reference ID.
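
A sketch of the conversion, with placeholder file and column names:

```python
import pandas as pd

streamers = pd.read_csv('streamers.csv')
xref = pd.read_csv('name_translation.csv')   # columns: fp_name, bbref_id

# map() converts the Fantasy Pros name into the Baseball Reference ID.
lookup = xref.set_index('fp_name')['bbref_id']
streamers['bbref_id'] = streamers['Player'].map(lookup)
```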

There is a loop for each pitcher which pulls in the appropriate page, finds the first table tag on the page, looks for the next-to-last row (the last row summarizes the year), and concatenates it to the daily results table. The final table is exported to a CSV file.
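
Continuing the sketch above, the loop might look like this; the game-log URL pattern is my assumption about Baseball Reference:

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

daily = []
for pid in streamers['bbref_id'].dropna():
    # Assumed pitching game-log URL pattern on Baseball Reference.
    url = f'https://www.baseball-reference.com/players/gl.fcgi?id={pid}&t=p&year=2018'
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    table = soup.find('table')             # first table tag on the page
    rows = table.find_all('tr')
    latest = rows[-2]                      # last row summarizes the year
    daily.append([pid] + [td.get_text() for td in latest.find_all('td')])

pd.DataFrame(daily).to_csv('daily_results.csv', index=False)
```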

All of this executes in less than ten seconds. I have not scheduled the execution because I get up before Baseball Reference is updated. This is one of the reasons I wanted the target date to be entered as an argument rather than guessed at based on the current date.

Next will be the discussion on organization and the daily workflow.

Summarizing Results

There are two extra fields of data I decided to capture with each starter: who they were facing and where. After several months, it was clear that Nick preferred pitchers who were facing a small subset of teams. These included Baltimore, the Chicago White Sox, Detroit, Miami, and San Diego. Most teams were in the opponent list except for Colorado at home and Boston anywhere. He talks frequently about avoiding those conditions.

Pandas makes it quite easy to pull in an Excel table and convert the data into a DataFrame. I then added a column calculating the Game Score. Here I applied a function to each row of the DataFrame because I would be using data from several columns.
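
A sketch of that step, assuming Tango's Game Score v2 formula and placeholder column names (with 'Outs' holding three times the innings pitched):

```python
import pandas as pd

df = pd.read_excel('streamers.xlsx')

def game_score(row):
    """Tom Tango's Game Score v2: six inputs per start.
    Column names are assumptions about the workbook layout."""
    return (40 + 2 * row['Outs'] + row['SO']
            - 2 * row['BB'] - 2 * row['H']
            - 3 * row['R'] - 6 * row['HR'])

# apply() over rows, since the function uses several columns at once.
df['GameScore'] = df.apply(game_score, axis=1)
```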

Next I created a column to determine whether the selection was a win or a loss. I had thought to only count whether the Game Score was above the baseline of 38, but then I decided to include a minimum of five innings pitched. My thought was the pitcher can only get a win if he pitches five innings, so why not include that in this determination. Again, I applied a function to each row since I was using more than one column of the DataFrame.
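
Continuing the sketch, the win/loss column follows the same apply() pattern:

```python
def win_or_loss(row):
    # A "Win": Game Score above the 38 baseline AND at least five
    # innings (15 outs), since a real win requires five innings pitched.
    if row['GameScore'] > 38 and row['Outs'] >= 15:
        return 'W'
    return 'L'

df['Result'] = df.apply(win_or_loss, axis=1)
```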

Collecting the win and loss total becomes easy at this point since I am counting the frequency of the two values. I realized that I had Nick with a much better winning percentage than he was mentioning on the daily podcast. I also broke down the winning percentage by the home and road split. The groupby() function in Pandas made quick work of that 2×2 table.
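
In code, both counts might look like this, with 'Home' as a placeholder column marking where the start took place:

```python
# Overall record is just the frequency of the two values.
print(df['Result'].value_counts())

# Home/road split: groupby() builds the 2x2 table in one line.
print(df.groupby(['Home', 'Result']).size().unstack(fill_value=0))
```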

Next I created some composite statistics on the selected pitchers. ERA and WHIP were obvious choices, but I also calculated innings per start (IPS), strikeouts per nine innings, and strikeouts per walk. These were all collected with the sum() and count() functions on each relevant column of the DataFrame.
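
A sketch of those composites from the same DataFrame ('Outs' again holding three times the innings pitched):

```python
# Composite statistics from column sums and counts.
ip = df['Outs'].sum() / 3
era = 9 * df['ER'].sum() / ip
whip = (df['BB'].sum() + df['H'].sum()) / ip
ips = ip / df['Outs'].count()            # innings per start
k9 = 9 * df['SO'].sum() / ip             # strikeouts per nine innings
k_bb = df['SO'].sum() / df['BB'].sum()   # strikeouts per walk
```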

Next I wanted to see which teams had the best records against the selections. The sort_values() function by default sorts in ascending order. With a 1-3 record, the only one below .500, it was clear that facing Kansas City was not a winning idea.
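
A sketch of that per-opponent sort:

```python
# Winning percentage by opponent, worst matchups first.
by_opp = (df.groupby('Opponent')['Result']
            .apply(lambda s: (s == 'W').mean())
            .sort_values())              # ascending by default
print(by_opp.head())
```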

The next obvious idea was to perform the same sort for each pitcher. There are plenty of pitchers who were selected only once and did not get a win. Similar to that, I wanted to see who was picked most often. Nick Kingham was selected eight times. This seems odd to me. The result of selecting these players and seeing them be successful should mean they are likely to be added in leagues and no longer meet the criteria for streaming. Yet, there is Nick Kingham.

I tried combining the best and worst lists into crosstabs to find some correlation, but there is simply too little data for anything to come of it. Still, there were fun things to find. Nick has twelve wins when his selections face San Diego, and those twelve wins come from twelve different pitchers. I like that consistency.

The next post will focus on the automation of getting the daily streamers. There is also some bigger analysis that can be done by grabbing all of the potential streamers.

Fantasy Pitching Streamers

One of the podcasts I have been listening to this year is Pitcher List. I have enjoyed all of the different subjects from that channel. One of the features on the daily podcast is the pitcher to stream for the day. I have been thinking about how to follow along with the selections and determine an algorithm for selection and verification.

Nick Pollack hosts the show during the week and I’m guessing he is the sole person picking the pitcher to stream. Eligible pitchers must be owned in less than twenty percent of leagues. By June 18th, he was 42-24 in his selections. I had thought about creating a list of the daily streamers but never followed up.

After several months of ideas bouncing around in my head, Nick mentioned he gets the percentage ownership from Fantasy Pros. That site is very well organized, and it was easy to find the list of pitchers with low ownership. I also noticed the page is served as plain PHP-rendered HTML, which would make screen scraping quite easy.

I decided I was on my way. This will become a series of blog posts about my thinking, the data gathering techniques and analysis I have done, and finally my plans.

The first part of the data gathering was to collect all of the daily pitchers. That involved listening to every podcast for the season. I was lucky that Pitcher List has a single page listing their podcasts in chronological order, with links to each episode on SoundCloud. Hosts Nick Pollack and Alex Fast always have a longer-than-normal pause before they provide their streaming picks, so I could guess where in the podcast they would begin talking about the pick, capture that player, and move on to the next day. Six hours later, I had the full list.

The next step was to collect the results from each pitcher. I still did not know what constitutes a "Win" for Nick, so I made the decision to follow Game Score from Tom Tango. A "Win" would be a final score above the baseline of 38. Computing Game Score involves only six measures from each start. My favorite location for getting game results is Baseball Reference. I sorted the players and began going through each of their pages. This took about two hours since there were so few numbers to collect for each game.

Everything I have accomplished is manual and collected in an Excel workbook. I knew at some point I would need to add in some automation or I would grow tired of the daily grind. Next, I’ll talk about the Python script to read the Excel document and provide some summary statistics.

Investment Crisis

I’ve been having a crisis over the past two months in my main investment account. Performance has been lagging for a year and it isn’t one stock. I have exactly one strategy that now appears to not be working. I really didn’t have a backup plan so I have decided to take on some risk.

Part of the portfolio has been dedicated to FAANG and BAT stocks. This consists of Facebook, Amazon, Apple, Netflix, and Google along with Baidu, Alibaba, and Tencent. It has been interesting to watch these stocks rise aggressively (in aggregate) since my purchase. Some are flat, which is more than offset by the rapid rise in Netflix.

The rest of the portfolio got trimmed down, and I started to notice a trend. Some stocks would get hit with a bad news report. Looking back through their history, I noticed a lack of positive news articles. It seems my technical indicators are finding companies with management teams that are not top tier.

It occurs to me that I should be looking for the best management teams in selected sectors. Yet while I can pick the sectors based on macroeconomic situations, I do not have the ability to select management teams. Okay, scrap that idea.

Today, I rejoined the American Association of Individual Investors (AAII) after an absence of several decades. They have a model portfolio that looks really good on a ten-year basis but is below average on an eleven-year basis. The portfolio had a 50% loss in 2007; that is a nasty performance to overcome. Recently the portfolio has severely lagged as several stocks have had negative surprises. Technically, I call these oval events because of the oval pattern that forms in their Bollinger Bands.

I am going to allocate a modest percentage to selected companies in the portfolio. This will be a test year.

It is interesting trying to figure out how to invest in a market where the Federal Reserve is raising interest rates, reducing its balance sheet, and the Federal Debt is rising at an accelerated rate. Add in an administration that is starting a trade war and the result is a situation that I am unfamiliar with.

Learning Plan

I’ve been told at work to spend some time learning additional topics at home on my own time. My primary focus for this year will be Docker and Kubernetes. From what I have read so far, both will be a challenge. I have become accustomed to configuring and using virtual machines over the past five years. I even have a document to guide me through creating one, which takes about an hour.

Docker will eliminate the need for a virtual machine while Kubernetes will eliminate the need for the crontab I have on the VMs.

On the personal side, I have been working through Python and Django. While I am okay at Python, Django will be new to me. I took a class three years ago on PHP and learned enough to create a web page that interacts with a MySQL database. I have been working through the Django class for far too long and need to complete it early this year.

I have also decided to add two other technologies. The first is Flask. This will be more useful at work than at home, though it is not one of the suggested topics at work. I am interested in making some RESTful APIs for some common data collection routines. We'll see if the class gets me to my objective.

The second class is on D3. D3 (Data Driven Documents) is a JavaScript library for visualizing and interacting with data on the web. I attended a Data Jam session yesterday which was my first encounter with D3. The library is more complicated than I imagined and after a few hours, I decided to utilize the sale at Udemy and purchased a class.

That is quite a bit of learning to accomplish this year. I have loosely come to grips with the lack of free time I have during the week. It's tough to accept since it puts pressure on the weekends. Dealing with my self-created deadlines will be my personal maturity achievement for the year.

Portfolio Status 2018

As we start 2018, it’s time for me to reset where my portfolio is. This follows from the idea of each part of the portfolio having a job. You will also see a trend of moving away from institutions and towards a people-focused perspective.

At the top of the portfolio are two retirement accounts: a self-directed IRA and a 401(k). The IRA is split between trend-reversal dividend stocks and FAANG (Facebook, Apple, Amazon, Netflix, and Google) and BAT (Baidu, Alibaba, and Tencent) stocks. The 401(k) is all index funds. My choices there are high-fee sector funds or low-fee index funds; I'll choose the low-fee option almost every time.

Both of the equity funds have done exceptionally well this year. Since I am a macro-driven investor, I benchmark against a flat number.

I cannot reach my savings goal in those two funds alone, so I direct additional savings into another equity account. This account is guided by an investment writer whom I have been following for years. He recently started an ETF guidance newsletter that seems promising.

The bond part of the portfolio has changed into two accounts. I have been allowing my Treasury bonds to mature and moving the proceeds into Lending Club and Fundrise. At Lending Club, I am focusing on well-capitalized, short-term, credit-card-payoff loans, but in the medium risk profile. It has been hard to stay fully invested. At Fundrise, I am invested in the long-term growth portfolio of real estate.

The cash part of the portfolio has changed from bank CDs to Prosper. Prosper is very similar to Lending Club, but I am not happy with the level of detail I have been receiving. Thus, my investment there is quite small and limited to well-capitalized, short-term, low-risk loans.

There are tax benefits to saving in the 401(k) and the IRA. Those two funds get the majority of my savings and that is causing an imbalance in the overall portfolio with respect to percent allocation by fund. The only way to resolve that is by increasing my saving percentage. I hope to have a quiet year regarding expenses and have plans to reach that savings goal.

The two retirement funds are designed to provide income at two phases of retirement — the 401(k) will be early and the IRA will be late. The other funds are all taxable and will be used as necessary.