Text Processing in R Talk With the TM Package

I gave a talk at my local Cleveland R User Group about text processing and document vectorization. You can view the talk here:

Note that I’m using the tm package, which is the traditional way to work with a document collection in R. Newer approaches such as tidytext are gaining popularity, and I may do a follow-up talk on that.

Feedback and More Videos

Enjoy, and feedback is welcome! And if you are interested in more video content on machine learning in R, check out this post.

Simulating the Monty Hall Problem in R

The Monty Hall Problem is famous in the world of statistics and probability. For those struggling with the intuition, simulating the problem is a great way to get at the answer. Randomly choose a door for the prize, randomly choose a door for the user to pick first, play out Monty’s role as host, and then show the results of both strategies.

Simulating the strategies of Monty Hall

The numeric output will vary from run to run, but will look something like:

> print(summary(games$strategy) / nrow(games))
stay switch
0.342 0.658

The following code does this in a rather short R example:
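A minimal sketch of such a simulation, assuming three doors and 10,000 games. The helper `play_game()` and the other names here are illustrative and may differ from the original code. Each game records which strategy would have won, so the final line matches the `summary(games$strategy)` call shown above.

```r
# Sketch of a Monty Hall simulation (names and game count are assumptions).
set.seed(1)          # fixed seed only so the sketch is reproducible
n_games <- 10000
doors <- 1:3

# Pick one element of x; avoids the sample() pitfall when length(x) == 1
pick_one <- function(x) x[sample.int(length(x), 1)]

play_game <- function() {
  prize <- pick_one(doors)                      # door hiding the prize
  pick  <- pick_one(doors)                      # contestant's first choice
  # Monty opens a door that is neither the contestant's pick nor the prize
  opened <- pick_one(setdiff(doors, c(pick, prize)))
  # Switching means taking the one remaining closed door
  switch_to <- setdiff(doors, c(pick, opened))
  if (switch_to == prize) "switch" else "stay"  # the strategy that wins
}

games <- data.frame(
  strategy = factor(replicate(n_games, play_game()),
                    levels = c("stay", "switch"))
)

# Share of games won by each strategy
print(summary(games$strategy) / nrow(games))
```

Because Monty never reveals the prize, switching wins whenever the first pick was wrong, which happens about two thirds of the time.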

Clustering in R

Clustering is a useful technique for exploring your data. It groups records into clusters based on similar features. It’s also a key technique of unsupervised learning. The following is a simple example in R where I plotted the clusters and centroids.

kmeans() car clusters with centroids

The example uses the mtcars dataset built into R, which contains data on 32 cars (1973-74 models) extracted from the 1974 Motor Trend US magazine.

Clustering is done with the kmeans() function. Note that the graph is 2-dimensional, and I cluster by 2 features, but you could cluster by more features and project down to a 2-dimensional plane.
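As a rough sketch of the approach described above: the choice of `mpg` and `hp` as the two features, the three clusters, and the scaling step are all assumptions made for illustration, not necessarily what the original plot used.

```r
# Sketch: k-means on two mtcars features, plotted with centroids.
# Assumptions (not from the original post): features mpg and hp, k = 3.
set.seed(42)                                  # k-means starts from random centers

# Standardize so the larger-valued feature (hp) does not dominate distances
features <- scale(mtcars[, c("mpg", "hp")])

fit <- kmeans(features, centers = 3, nstart = 25)

# Color each car by its assigned cluster
plot(features, col = fit$cluster, pch = 19,
     xlab = "mpg (scaled)", ylab = "hp (scaled)",
     main = "kmeans() car clusters with centroids")

# Overlay the cluster centroids as larger star markers
points(fit$centers, col = 1:3, pch = 8, cex = 2, lwd = 2)
```

To cluster on more than two features, one option is to run `kmeans()` on the wider feature matrix and plot the first two principal components from `prcomp()` instead of the raw columns.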

Feel free to make suggestions:

Interview and Upcoming Projects

Here is a recent interview I did for CLK Tech. CLK Tech is a newsletter based out of Northeast Ohio, run by a couple of tech recruiters in the area. Topics span general career questions and data science in particular.

In addition, I’m busy with a project that I look forward to announcing soon. It’s shaping up to be a busy year…

The Math of Machine Learning

Matrix multiplication diagram

One of the challenges of data science in general is that it is a multi-disciplinary field. For any given problem, you may need skills in data extraction, data transformation, data cleaning, math, statistics, software engineering, data visualization, and the problem domain. And that list likely isn’t exhaustive.

One of the first questions when it comes to machine learning specifically is “how much math do I need to know?”

This is where I would recommend you start, to get the most value for your time:

  • Matrix Multiplication (Subject: Linear Algebra)
  • Probability (Subject: Statistics)
  • Normal Distributions (Subject: Statistics)
  • Bayes’ Theorem (Subject: Statistics)
  • Linear Regression (Subject: Statistics)

Of course you will run across other math needs, but I think the above list represents the foundation.
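To show how two of the topics above connect, here is a small sketch in R. It multiplies two matrices with `%*%`, then recovers the coefficients of a simple linear regression using matrix multiplication (the normal equations) and compares them with `lm()`. The choice of `mpg ~ wt` from mtcars is only an illustrative assumption.

```r
# Matrix multiplication: a (2 x 3) matrix times a (3 x 2) matrix gives (2 x 2)
A <- matrix(1:6, nrow = 2)   # filled by column: rows are (1, 3, 5) and (2, 4, 6)
B <- matrix(1:6, nrow = 3)   # filled by column: columns are (1, 2, 3) and (4, 5, 6)
AB <- A %*% B
print(AB)

# Linear regression via the normal equations: solve (X'X) beta = X'y
X <- cbind(1, mtcars$wt)     # column of ones for the intercept, then the predictor
y <- mtcars$mpg
beta <- solve(t(X) %*% X, t(X) %*% y)

# The same model fitted with lm() for comparison
fit <- lm(mpg ~ wt, data = mtcars)
print(all.equal(as.numeric(beta), unname(coef(fit))))
```

In practice `lm()` uses a more numerically stable decomposition rather than the normal equations directly, but the comparison shows why matrix multiplication sits at the top of the list.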

If you need places to get started with those topics, check out Khan Academy, Coursera, or your local library.

For more on machine learning, check out other posts such as ML in R, Linear Algebra in R, and ML w/XGBoost.