Category: Data

  • Hadoop: Accessing Google Cloud Storage

    First, go here to choose the Hadoop Google Cloud Storage connector for your version of Hadoop, likely Hadoop 2. Copy that file to $HADOOP_HOME/share/hadoop/tools/lib/. If you followed the instructions in the prior post, that directory is already on your classpath. If not, add the following to your hadoop-env.sh file (found in the $HADOOP_CONF directory): #GS […]

  • Hadoop: Accessing S3

    This post is part of a series on setting up Hadoop locally on macOS for development and learning purposes. In the first post, we installed Hadoop. If you get stuck or need more detail, feel free to check out the Apache docs on S3 support. First, we have to add the directory with the necessary jar […]

  • Hadoop: Installing on macOS

    Hadoop is traditionally run on a Linux-based system. For learning and development purposes, you may want to install Hadoop on macOS. This is the first in a series of posts that will walk through working with Hadoop and cloud-based storage. First, you’ll want to use Homebrew to install Hadoop and any related tools you would like. […]

  • Files and Pipes in R Video Demo

    I’ve worked with various alternate file handlers in Python before and wanted to explore the options in R. I was pleasantly surprised to find prebuilt handlers for tasks like compressing data. In addition, a pipe function lets you run less common commands on your file, like gpg for encryption. I put […]

  • Text Processing in R Talk With the TM Package

    I gave a talk at my local Cleveland R User Group about text processing and document vectorization. You can view the talk here: Note that I’m using the tm package, which is the traditional way to work with a document collection in R. Newer approaches, such as tidytext, are gaining popularity. I may […]