Category: Data

  • Simulating the Monty Hall Problem in R

    The Monty Hall Problem is famous in the world of statistics and probability. For those struggling with the intuition, simulating the problem is a great way to get at the answer. Randomly choose a door for the prize, randomly choose a door for the user to pick first, play out Monty’s role as host, and […]

  • Clustering in R

    Clustering is a useful technique for exploring your data. It groups records into clusters based on similar features. It’s also a key technique of unsupervised learning. The following is a simple example in R where I plotted the clusters and centroids. The example uses the mtcars dataset built into R, which contains auto data extracted […]

  • Installing pymc on OS X using homebrew

    I’ve been working through a book on Bayesian methods with an emphasis on the pymc library. However, installing pymc on OS X can be a bit of a pain. The issue comes down to Fortran… I know. The version of gfortran in newer gcc implementations doesn’t work well with the pymc build; you need […]

  • The Math of Machine Learning

    One of the challenges of data science in general is that it is a multi-disciplinary field. For any given problem, you may need skills in data extraction, data transformation, data cleaning, math, statistics, software engineering, data visualization, and the domain. And that list likely isn’t exhaustive. One of the first questions […]

  • An Overview of Machine Learning in R

    I presented at the Cleveland SciPy/Julia/R Data Science Group on 6/14. The talk is a fairly high-level introduction to some of the machine learning methods and packages available in R. The video, slides, and notebooks are available with the post.
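
For readers who want to try the Monty Hall simulation from the first post above, the steps in its excerpt can be sketched roughly as follows. This is a minimal sketch in base R, not the code from the post itself. The function name `simulate_monty_hall`, the trial count, and the seed are all made up for illustration, and the final step (comparing staying against switching) is an assumption, since the excerpt is cut off before it gets there.

```r
# Hypothetical helper: play the three-door game n_trials times and
# report how often staying and switching each win.
simulate_monty_hall <- function(n_trials = 10000) {
  stay_wins <- logical(n_trials)
  switch_wins <- logical(n_trials)

  for (i in seq_len(n_trials)) {
    prize <- sample(1:3, 1)        # randomly choose a door for the prize
    first_pick <- sample(1:3, 1)   # randomly choose the user's first door

    # Monty's role: open a door that is neither the pick nor the prize.
    # Indexing with sample.int() avoids the sample(x, 1) pitfall when
    # only one door is left to open.
    openable <- setdiff(1:3, c(prize, first_pick))
    opened <- openable[sample.int(length(openable), 1)]

    # Assumed final step: switching takes the one remaining closed door.
    switch_pick <- setdiff(1:3, c(first_pick, opened))

    stay_wins[i] <- first_pick == prize
    switch_wins[i] <- switch_pick == prize
  }

  c(stay = mean(stay_wins), switch = mean(switch_wins))
}

set.seed(42)  # arbitrary seed so repeated runs give the same numbers
result <- simulate_monty_hall()
print(result)
```

With enough trials, the switch strategy should win in roughly two thirds of games and the stay strategy in roughly one third, which is the counterintuitive answer the simulation is meant to make visible.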