Estimating the proportion of peanuts in a bag of trail mix
Yesterday, I was eating trail mix and it got me thinking about the importance of representative samples. If I drew a handful of trail mix, got back only peanuts, analyzed their contents, and tried to generalize to a true population that included peanuts and raisins, I would have a bad time. While this is a simplistic example, it underscores the importance of a representative sample when it comes to inference in our everyday lives (e.g. surveys, poll estimates). I'm a big fan of the bootstrap, so I decided to simulate the estimation of the proportion of peanuts using the hypergeometric distribution (where the 'bag' is full of raisins and peanuts, and I draw a handful of size 4 without replacement). I modeled the true numbers of raisins and peanuts as Poisson random variables.
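A minimal Python sketch of this simulation (the Poisson rates, handful count, and seed below are my own illustrative choices, not the post's):

```python
import numpy as np

rng = np.random.default_rng(42)

# True bag contents: Poisson-distributed counts of peanuts and raisins.
n_peanuts = rng.poisson(lam=30)
n_raisins = rng.poisson(lam=70)
true_prop = n_peanuts / (n_peanuts + n_raisins)

# Each handful of size 4 is a hypergeometric draw (without replacement).
handfuls = rng.hypergeometric(ngood=n_peanuts, nbad=n_raisins, nsample=4, size=10_000)

# Running estimate of the proportion of peanuts converges to true_prop.
running_mean = np.cumsum(handfuls / 4) / np.arange(1, handfuls.size + 1)
print(f"true: {true_prop:.3f}  estimated: {running_mean[-1]:.3f}")
```

Plotting `running_mean` against the sample index would reproduce the blue-line-converging-to-red-line picture described below.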
As you would expect, as the number of samples increases, the estimated mean (the blue line) converges to the true mean (the red line). I love to see the law of large numbers in action! The results are in the image below.
Disneyland! gganimate! ggmap! oh my.
In a few weeks, my family and I are going on vacation to Disneyland and California Adventure in California. I started to think about how the trip could be optimized based on distance between rides. The first step in this was to get the coordinates of rides (e.g. Splash Mountain, Star Wars Rise of the Resistance) and plot those using ggmap.
Then, I created a random ordering of rides and used gganimate to simulate traveling between them. I currently have this set up for 12 rides whose coordinates I manually pulled from Google Maps and saved to a csv file.
The starting place is the entrance to the park. Then I loop through each 'candidate ride', calculate the distance to it (using distGeo), walk to the nearest one, remove that ride from consideration, and repeat with the remaining rides.
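A sketch of this greedy loop in Python, using the haversine formula in place of distGeo (the coordinates below are illustrative placeholders, not the real ride locations):

```python
import math

def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6_371_000 * math.asin(math.sqrt(h))

def greedy_route(start, rides):
    """Repeatedly walk to the nearest remaining ride (the post's approach)."""
    order, here, remaining = [], start, dict(rides)
    while remaining:
        name = min(remaining, key=lambda r: haversine_m(here, remaining[r]))
        here = remaining.pop(name)
        order.append(name)
    return order

# Illustrative coordinates only (NOT the real ride locations).
entrance = (33.8101, -117.9190)
rides = {
    "Space Mountain": (33.8113, -117.9174),
    "Splash Mountain": (33.8117, -117.9225),
    "It's a Small World": (33.8146, -117.9181),
}
print(greedy_route(entrance, rides))
```

Note that greedy nearest-neighbor is a heuristic; it gives a reasonable walking order but not necessarily the shortest possible tour.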
The final ride order, based on this approach and the ride coordinates I have recorded, is below.
 "Space Mountain" "Buzz Lightyear Astro Blasters" "Finding Nemo"
 "Mr Toad Wild Ride" "Sleeping Beauty Castle Walkthrough" "Dumbo"
 "It's a Small World" "Big Thunder Railroad" "Splash Mountain"
 "Pirates of the Caribbean" "Main Street" "Buzz Lightyear Astro Blasters"
 "Star Wars: Rise of the Resistance"
Limitations and potential next steps:
The distance calculation does not take into account paths (such as not being able to walk through a ride). A more sophisticated approach would be to get the shape of the paths between rides and use that to calculate distance (a reinforcement agent with the paths as the 'track' could be a fun way to do this!)
Some rides have longer wait times than others. Ideally, I want to optimize subject to constraints on both distance traveled and time waited (e.g. I might be OK walking farther to a ride with a lower wait time). I was able to pull wait time data from the `Thrilldata.com` website, but I haven't yet integrated it into the ride ranking.
More rides! There are more than 12 rides at Disneyland, but to get started I picked the 12 I cared about the most.
Code can be found here:
A few weeks ago I had the opportunity to attend and present at rstudio::conf(2022) in Washington, D.C. It was easily the most inclusive conference I have ever attended. One of my favorite things was the "Pac-Man" style of talking - where you leave an open space in a circle of people so others can jump in. The inclusivity was also evident in the food choices, as there was a mixture of meat, vegetarian, and vegan options! Including the best root vegetable casserole I've ever tasted.
I also presented at the conference! My talk was about query optimization, and the recording can be found here. The highlights: use an explain plan, and use distribution keys and sort keys.
I highly recommend checking out the keynote talks - specifically about Quarto and the past and future of Shiny! As someone who loves to build Shiny apps, it's exciting that Python users will also be able to get in on the fun.
After three years, I finished my Master's degree in Applied and Computational Mathematics from Johns Hopkins University (while working full time!). The graduation ceremony in Baltimore was wonderful. It is a surreal feeling. It's bittersweet as I love being a student and learning new things, but it will be nice to have a lot more free time (and maybe finish some side projects?). I'll also miss my institutional access to research papers.
I had the opportunity to work on some fun projects through my coursework that I uploaded to my GitHub. The implementations use a mixture of R and Python.
Diabetes Prediction Using Probabilistic Graphical Models
Mortality Prediction in Heart Failure Patients Admitted to the Intensive Care Unit (ICU)
Kalman Filters in Remote Patient Monitoring: A Review and Application of Literature
RMarkdown is my favorite mechanism for writing papers as it makes beautiful documents. There's a lot of R in my life lately, as at the end of July I'll be speaking at rstudio::conf(2022) in Washington DC. It will be my first in-person conference since COVID.
Until next time.
Book Topics using Project Gutenberg
There has been a lot of discussion in the media recently about banned books. I won't pretend that I'm educated on this topic; however, I do believe that it's important to learn from the past (especially the ugliest parts of history), and literature is a great way to achieve this. With some books being banned in some areas, I wanted a way to quickly summarize a given book by extracting meaning/topics/key terms.
Building upon code by Andrea Perlato, I created a Shiny app that takes a book title as an input and returns topics and terms. TW: some terms may be culturally explicit.
Try it out! The book must be available on Project Gutenberg because of how the script sources text data. An improvement on this project could be leveraging a new data source to be able to model more books. The topic modeling is done using Latent Dirichlet Allocation (LDA) and implemented in R.
PS - if you've never built a Shiny app (and like programming), I highly recommend it. They are so fun with lots of code examples online!
Book Ranking - 2021 Edition
In 2021, I tried to keep data as the year progressed on books I read and how I felt about them at the time. Emphasis on try - I'm disappointed that I didn't capture start/end dates very well. I suppose I will have to give in to syncing my Kindle with GoodReads so that type of information is more easily accessible. With the exception of LOTR, this list contains net new reads.
Here it is.
The Champions (Top 5)
The Shining - Stephen King
War and Peace - Leo Tolstoy
The Brothers Karamazov - Fyodor Dostoevsky
Think Again - Adam Grant (Audiobook)
Anna Karenina - Leo Tolstoy
Lord of the Rings Trilogy - JRR Tolkien
The Count of Monte Cristo - Alexandre Dumas
The Hunchback of Notre Dame - Victor Hugo
How Not to Die - Dr. Michael Greger
Remains of the Day - Kazuo Ishiguro
Klara and the Sun - Kazuo Ishiguro
Questions are the Answer - Dr Hal Gregersen (Audiobook)
Phantom of the Opera - Gaston Leroux
Wuthering Heights - Emily Brontë
The Picture of Dorian Gray - Oscar Wilde
Elevating Child Care - Janet Lansbury
The Hands Off Manager - Steve Chandler, Duane Black
Buddhism for Beginners - Thubten Chodron, His Holiness the Dalai Lama
The Art of War - Sun Tzu (Audiobook)
An American Sickness - Elisabeth Rosenthal (Audiobook)
How Not to Diet - Dr. Michael Greger
Unf*ck your Boundaries - Faith G. Harper (Audiobook)
Cheers to another year of life, literature, and better data collection.
Our House is on Fire
What action will you take today to reduce your carbon footprint?
Like many others, I regularly experience climate related anxiety. It feels overwhelming and hopeless. I worry about my son's future. I worry about suffering on a global scale. Worry often turns into spiraling and hyperventilating panic and it can be difficult to have hope for humanity.
Today, the Fridays for Future organization organized a global climate strike. I am inspired by this group of young people. They motivate me to be an agent of change instead of a powerless victim.
Lacking the courage to protest in person, I took today off work to research tangible ways I can make a difference, and take action based on that research. I'm grateful to work at a company where I am able to take the time to do this - a luxury I realize many do not have. In light of my privilege, I'm writing this post to share my findings.
Take action today by:
Talking to a friend or family member about your climate concerns.
Writing to representatives about your climate concerns (Science moms makes this extremely easy).
Taking a day off from eating meat (1).
Purchasing a carbon offset (e.g. terrapass) (2).
Take action over time by:
Flying less (and/or purchasing carbon offsets for your travel) (3).
Voting in local elections.
Maintaining your car so that it can run efficiently (or buying hybrid/electric). (4)
Installing a smart thermostat in your home (https://www.pse.com/rebates/smart-thermostat) - many energy companies provide rebates.
Eating more plants and less meat (e.g. Meatless Monday, only eating meat on the weekend). (1)
We can do this.
The Risk of Christmas During a Pandemic
As we continue to find ourselves in the midst of a terrible pandemic, it can be difficult to navigate holiday plans. So far in the US, ~215K people have died from Covid-19. It is imperative that we do everything in our power to prevent the spread of the virus, inclusive of making the difficult choice to limit exposure during traditional gathering times (e.g. Thanksgiving, Christmas) and wearing masks.
One way to mitigate the risk of gatherings is to have attendees get tested. We know that the tests in place today are not perfect, with a sensitivity ranging from 80%-90% depending on the type of test (rapid antigen vs. RT-PCR). My inner statistician had the urge to estimate the likelihood of exposure, given that all event attendees are tested for the virus, using a mixture of Bayes' Theorem and the binomial distribution.
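A sketch of the kind of calculation described (the parameter values here are my own illustrative numbers, not the post's): if prevalence is p and test sensitivity is s, an infected attendee slips through screening with probability p(1 - s), so across n independent attendees the chance that at least one does is 1 - (1 - p(1 - s))^n.

```python
def exposure_risk(n_guests, prevalence, sensitivity):
    """P(at least one infected guest passes a negative-test screen),
    treating guests as independent Bernoulli trials (binomial model)."""
    p_slip = prevalence * (1 - sensitivity)  # infected AND false-negative
    return 1 - (1 - p_slip) ** n_guests

# Illustrative numbers: 10 guests, 2% prevalence, 85% test sensitivity.
risk = exposure_risk(10, 0.02, 0.85)
print(f"{risk:.3%}")
```

Even with everyone tested, the risk is nonzero and grows with guest count, which is the point of the post.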
GitHub repo: https://github.com/bhadi26/covid-christmas
May 26, 2020
It's been a while since my last post. My focus for the past year has been on coursework toward my master's degree in Applied and Computational Mathematics at Johns Hopkins University. Between school, work, and family, I haven't had as much time to work on my side projects. In the past year I have taken courses related to statistical methods and data analytics (highly reminiscent of Actuarial Exam P), Linear Algebra, Statistical Models and Regression, and most recently, Neural Networks. One of the assignments in that class was to code a multi-layer perceptron feed-forward backpropagation (FFBP) network from scratch (using NumPy in Python).
My code can be found here: https://github.com/bhadi26/neural-net/blob/master/NodeLayerClass.py
An FFBP network performs supervised learning: there is a desired output for each input. The network takes the inputs and passes them through the hidden layer, with a weight on each edge, and then to the output node or nodes. FFBP networks are a good candidate for regression or classification problems, but they have other use cases as well.
Example Multi-Layer Perceptron Topology
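The forward and backward passes can be sketched in a few lines of NumPy. This is a minimal illustrative network (not the author's NodeLayerClass), trained on XOR with made-up hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: a classic task for a single-hidden-layer FFBP network.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # input -> hidden weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output weights

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

loss_init = np.mean((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) - y) ** 2)

lr = 1.0
for _ in range(5000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: chain rule on the squared error through each sigmoid
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

print("final MSE:", np.mean((out - y) ** 2))
```

The structure mirrors the topology pictured above: inputs, one hidden layer with a weight per edge, and an output node.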
Spotify Year in review: The API Version
January 23, 2019
This post has been a long time in the making. Back in November, I thought it would be fun to explore the Python package spotipy that could export data from Spotify on a variety of metrics (e.g. top artists, songs, recent listening history).
Work and family life became busy over the course of November and December, and other things took precedence (such as Elon getting back-to-back ear infections, poor bud!). In that time, Spotify released Spotify Wrapped, showcasing user data. I'd be very surprised if, under the hood, the Year in Review didn't leverage some of the API data. What a fun exercise in personalization that allows me to bask in a subtle and sharable narcissism.
This post is going to be much less interesting now that you can just go look at your own Spotify Year in Review. But if you want to see a version using Python, today is your lucky day. If I had more time, I would produce a word cloud visualization of the genre data I end up exporting. I'm starting my master's program in Applied and Computational Mathematics at Johns Hopkins University next week, so it may be a while before I pick this Spotify project back up. Go Blue Jays! I feel very fortunate to live in the age of virtual learning.
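A real spotipy call (e.g. `current_user_top_artists()`) requires OAuth credentials, so this sketch works on a hard-coded stand-in payload shaped like spotipy's response; the genre-counting step is the reusable part (artist names and genres below are invented):

```python
from collections import Counter

def top_genres(top_artists_response, k=3):
    """Count genre occurrences across a spotipy
    current_user_top_artists()-style response."""
    counts = Counter(
        g for artist in top_artists_response["items"] for g in artist["genres"]
    )
    return counts.most_common(k)

# Hard-coded stand-in for the API response (a real call needs OAuth).
sample = {
    "items": [
        {"name": "Artist A", "genres": ["indie rock", "lo-fi"]},
        {"name": "Artist B", "genres": ["indie rock", "dream pop"]},
        {"name": "Artist C", "genres": ["lo-fi"]},
    ]
}
print(top_genres(sample))
```

The resulting genre counts would feed directly into the word cloud idea mentioned above.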
It's about time
December 16, 2018
This time on the blog, I explore time series forecasting in R using the packages forecast and tseries (along with the tidyverse) on data from Washington DC's bike share program: the counts of bike shares over time. Given that it's almost Christmas, it felt right to do a post dealing with seasonality.
Click the link below to be directed to the hosted HTML file of my R markdown file.
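The full analysis uses R's forecast package; as an illustration of the simplest seasonal baseline it would be compared against, here is a seasonal-naive forecast (repeat the last full season) in Python on synthetic bike-share-like counts (all numbers invented):

```python
import numpy as np

def seasonal_naive(series, season_len, horizon):
    """Forecast by repeating the last full season's values."""
    last_season = series[-season_len:]
    reps = -(-horizon // season_len)  # ceiling division
    return np.tile(last_season, reps)[:horizon]

# Synthetic daily bike-share counts with weekly (7-day) seasonality.
rng = np.random.default_rng(1)
t = np.arange(28)
counts = 200 + 50 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 5, t.size)

forecast = seasonal_naive(counts, season_len=7, horizon=7)
print(np.round(forecast).astype(int))
```

Methods like those in the forecast package (e.g. seasonal ARIMA, ETS) aim to beat this naive baseline.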
In other news, I am working on another post using Spotipy, a Python library for accessing Spotify's API, to extract my top and most recent artists. It's a similar idea to Spotify's end-of-year summary, which I suspect is built on the same API data.
Code on gitlab: https://github.com/bhadi26/time-series
November 2, 2018
It's been a quiet summer on the blog. My first child, Elon, was born in June, and now that the newborn phase has ended and I have (eagerly) returned to work, it's time to get back to blogging!
Check out the post below on my analysis of some of the data we captured for Elon during his first few months of life. Data are everywhere! Even in your offspring.
Game of Thrones - You win or you die
May 10, 2018
While eagerly awaiting season 8 of Game of Thrones (or the release of The Winds of Winter), check out a quick analysis I did on the percentage of living members by house allegiance. Shout out to whoever gathered these data (link to Kaggle contained in the HTML file).
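The core of a percentage-living-by-house calculation is a one-line groupby; here is a pandas sketch on made-up stand-in data (the real analysis uses the Kaggle character data linked in the HTML file):

```python
import pandas as pd

# Illustrative stand-in for the Kaggle character data (values invented).
df = pd.DataFrame({
    "house": ["Stark", "Stark", "Stark", "Lannister", "Lannister", "Targaryen"],
    "is_alive": [1, 0, 1, 1, 1, 0],
})

# Percent of characters still living, per house allegiance.
pct_living = (df.groupby("house")["is_alive"].mean() * 100).round(1)
print(pct_living.sort_values(ascending=False))
```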
For now, my data watch has ended!
Code can be found at https://github.com/bhadi26/game-of-thrones
April 25, 2018
As a pet lover, I was excited when I found some data on pet licenses in Seattle. Shout out to Kaggle for serving as an excellent repository for fun data sets to play with. Lately I've been particularly interested in spatial analysis/playing with maps, so I was also looking for a data set that had some geographic attributes.
Click the link below to view!
I've heard good things about data.gov, but I personally haven't found data sets there that interest me; perhaps it would be a good source of data for a future post.
For this month's blog post, I am trying to find the balance between producing posts entirely in R Markdown (then attaching the PDF or HTML output), or manually copying the desired results and commentary into the post. Since I'm a big fan of reproducibility, I lean toward linking the R markdown output that I'm hosting on my github page so that if there are any changes/corrections, the blog post would be pointing to the most up to date version!
The downside of that approach is that I believe it's more user friendly from a blog perspective to only have to go to one page to read content (vs. clicking on another link). I imagine my preferences will continue to evolve as I continue to blog. I also want to play more with Github pages as this may meet my need. Until then, I've linked a hosted version of my HTML output file that's on my github.
Lights, Camera, Analysis
April 5, 2018
I recently completed a course on data analysis and modeling using R. I thought the final project for the class was a fun challenge in finding a data set, formulating a question that can be answered using data, then analyzing that question.
For my project, I found a data set on Kaggle about movies, with features such as genre, votes, budget, and revenue. I was curious to what extent these features could model the probability that a movie would be profitable. More detail on my project can be found below! The Shiny app is particularly fun to play around with. Even though this data set was published to Kaggle, a significant portion of the project involved cleaning and transforming the data before any modeling actually took place.
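The project itself was done in R; as a hedged sketch of the same idea, here is a logistic regression of profitability on movie features in Python, using synthetic stand-in data (every feature, coefficient, and threshold below is invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the cleaned Kaggle movie features.
n = 500
budget = rng.uniform(1, 200, n)   # millions USD (invented)
votes = rng.uniform(0, 10, n)     # average rating (invented)

# Toy ground truth: higher ratings and lower budgets profit more often.
p = 1 / (1 + np.exp(-(0.8 * votes - 0.01 * budget - 3)))
profitable = rng.random(n) < p

X = np.column_stack([budget, votes])
model = LogisticRegression().fit(X, profitable)
print(f"training accuracy: {model.score(X, profitable):.2f}")
```

With real data, the interesting work is exactly what the post notes: cleaning and transforming the features before any model is fit.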
The main principles I took away from this class were those of validity and reproducibility.
Validity seems obvious, but it's an important principle to keep in mind when using data to answer a question. Do these data accurately measure the question at hand? What limitations exist within the data set and what impact does that have on our ability to measure the question?
The second principle of reproducibility is one I feel can't be emphasized enough. Not only does creating reproducible code and processes make it easier to understand your work as time goes by, it improves collaboration since others can understand your assumptions and method of analysis. This is a principle I try to employ in my personal and professional work.
Pokemon Go - Gyms in Seattle (i.e. having fun with ggmap)
March 18, 2018
Remember the summer of 2016 when everyone had the urge to go outside, get some exercise, and observe wildlife? I remember something close to that. Except... the wildlife were Pokemon and the exercise was an unintended consequence of searching the city for their nests.
I recently completed an R course on data analysis and modeling that featured a lab on analysis of spatial data with the ggmap package. I was blown away by how simple it was to play with google maps (thanks Google API!) data and layer points on top of it. Now that the course has ended, I wanted to see what lat/long data I could easily find online to plot using ggmap.
That's where ggmap meets Pokemon Go in this blog post. I found this website that contained the latitude and longitude of Pokemon Go gyms in the Seattle area. Using the archaic method of copying/pasting, I created an Excel spreadsheet of the data on the site. Since I'm just playing around with these data, I'm OK with that approach, but ideally I would have found a more automated way to export the data or access it within R directly via an API or some other means.
Having missed out on the height of Pokemon Go's popularity because the smart phone I had at the time couldn't handle the UI, I am not sure whether the data I downloaded are an exhaustive set. I'm also not familiar with the locations of gyms vs. other popular sites in the Pokemon Go realm. Since my purpose here is just to play with adding data to a ggmap object, I'm not bothered by these shortcomings.
Here is the first map I created that contained all of the data points. I was surprised that a "Seattle" dataset had points closer to Renton (and seemingly none in between?).
To get a better glimpse of the downtown region, I altered the zoom in the ggmap.
In this view, there don't appear to be any gyms outside the Downtown/Queen Anne area. Ideally I would be able to zoom in more, but I'm not sure how to change the center of the map and increase the zoom without cutting off more points. It's surprising that there aren't any gyms in Fremont/U-District/Ballard neighborhoods (or any other neighborhoods of Seattle, really), so that makes me question the integrity of this data set. Perhaps the initial points in Renton were entered incorrectly? Perhaps there is another page that contains the data for other gyms in the city of Seattle?
The code and data I used to create these maps can be found here: https://github.com/bhadi26/pokemon-go