Hadoop: Accessing Google Cloud Storage

First, choose and download the Hadoop Google Cloud Storage connector that matches your version of Hadoop, likely Hadoop 2.

Copy that file to $HADOOP_HOME/share/hadoop/tools/lib/. If you followed the instructions in the prior post, that directory is already on your classpath. If not, add the following to your hadoop-env.sh file (found in the $HADOOP_CONF directory):

#GS / AWS S3 Support
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*
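For reference, copying the downloaded connector jar into place looks something like this (the exact jar name depends on the version you downloaded):

cp ~/Downloads/gcs-connector-hadoop2-latest.jar $HADOOP_HOME/share/hadoop/tools/lib/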

Create a service account in Google Cloud that has the necessary Storage permissions. Download the JSON credentials and save them somewhere; in my case, I renamed the file and saved it as .config/gcloud/hadoop.json.
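If you prefer the command line, a rough sketch of setting up such an account with the gcloud CLI might look like the following; the account name, project ID, and key path are placeholders, and the storage.admin role may be broader than you need:

# Create the service account (name is a placeholder)
gcloud iam service-accounts create hadoop-gcs --display-name "Hadoop GCS access"

# Grant it storage access on the project
gcloud projects add-iam-policy-binding someproject-123 \
  --member "serviceAccount:hadoop-gcs@someproject-123.iam.gserviceaccount.com" \
  --role "roles/storage.admin"

# Download a JSON key for the account
gcloud iam service-accounts keys create ~/.config/gcloud/hadoop.json \
  --iam-account hadoop-gcs@someproject-123.iam.gserviceaccount.com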

Add the following properties in your core-site.xml:
<configuration>
<property>
<name>fs.gs.project.id</name>
<value>someproject-123</value>
<description>
Required. Google Cloud Project ID with access to configured GCS buckets.
</description>
</property>

<property>
<name>google.cloud.auth.service.account.enable</name>
<value>true</value>
<description>
Whether to use a service account for GCS authorization.
</description>
</property>

<property>
<name>google.cloud.auth.service.account.json.keyfile</name>
<value>/Users/tim/.config/gcloud/hadoop.json</value>
</property>

<property>
<name>fs.gs.impl</name>
<value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
<description>The implementation class of the GS Filesystem</description>
</property>

<property>
<name>fs.AbstractFileSystem.gs.impl</name>
<value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
<description>The implementation class of the GS AbstractFileSystem.</description>
</property>

</configuration>

Note that you need to change someproject-123 to your actual project ID, which can be found in the Google Cloud dashboard.

Now test this setup with:

hdfs dfs -ls gs://somebucket

Of course, you'll need to replace somebucket with an actual bucket/directory in your Google Cloud Storage account.
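Once listing works, you can read and write objects with the same commands you would use for HDFS paths, for example (again with placeholder bucket and file names):

hdfs dfs -put localfile.txt gs://somebucket/
hdfs dfs -cat gs://somebucket/localfile.txt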

You should now be set up to use both S3 and Google Cloud Storage with your local Hadoop setup.

Hadoop: Accessing S3

This post continues a series on setting up Hadoop locally on macOS for development and learning purposes. In the first post, we installed Hadoop.

If you get stuck or need more detail, feel free to check out the Apache docs on S3 support.

First, we have to add the directory with the necessary jar files to the Hadoop classpath. In hadoop-env.sh (which is in the $HADOOP_CONF directory), add the following lines to the end of the file:

#AWS S3 Support
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*

Make sure you have the following properties set in your core-site.xml file (which is in the $HADOOP_CONF directory).

<configuration>

<property>
<name>fs.s3a.access.key</name>
<value>KEY_HERE</value>
<description>AWS access key ID.
Omit for IAM role-based or provider-based authentication.</description>
</property>

<property>
<name>fs.s3a.secret.key</name>
<value>SECRET_KEY_HERE</value>
<description>AWS secret key.
Omit for IAM role-based or provider-based authentication.</description>
</property>

<property>
<name>fs.s3a.impl</name>
<value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
<description>The implementation class of the S3A Filesystem</description>
</property>

<property>
<name>fs.AbstractFileSystem.s3a.impl</name>
<value>org.apache.hadoop.fs.s3a.S3A</value>
<description>The implementation class of the S3A AbstractFileSystem.</description>
</property>

</configuration>

Note that you will need to replace KEY_HERE and SECRET_KEY_HERE with your actual S3 access keys. You can also set the appropriate environment variables with your keys. I put them in this file because I use multiple AWS profiles via the AWS configuration files, which Hadoop does not pick up.
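For example, if your Hadoop version's S3A credential chain checks the standard AWS environment variables, something like this in your shell rc file should also work (the values are placeholders):

export AWS_ACCESS_KEY_ID=KEY_HERE
export AWS_SECRET_ACCESS_KEY=SECRET_KEY_HERE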

You can test access by using a public data set. For example, I tested with:

hdfs dfs -ls s3a://nasanex/NEX-DCP30

You should see the contents of that bucket, which includes 5 files.

Note the use of s3a as the protocol; it is the preferred connector over s3n and the deprecated s3.

In the next post, we'll look at setting up Google Cloud Storage in a similar manner.

Hadoop: Installing on macOS

Hadoop is traditionally run on a Linux-based system. For learning and development purposes, you may want to install Hadoop on macOS.

This is the first in a series of posts that will walk through working with Hadoop and cloud-based storage.

First, you'll want to use Homebrew to install Hadoop and any related tools you would like:
brew install hadoop apache-spark pig hbase

Next, you'll want to set up some environment variables. They can go in your shell rc file (.bashrc, .zshrc), or elsewhere if you use a shell config tool like oh-my-zsh.

Make sure you have set JAVA_HOME, which may differ from my setup below.
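On macOS, one common way to set it is:

export JAVA_HOME="$(/usr/libexec/java_home)"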

export HADOOP_INSTALL=/usr/local/opt
export HADOOP_HOME=$HADOOP_INSTALL/hadoop/libexec
export HADOOP_CONF=$HADOOP_HOME/etc/hadoop
PATH="$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$PATH"

Then test your install with the following:

hdfs dfs -ls ~

You should see the contents of your home directory.

You can also run a hadoop example with:
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 10 100

You should see a (poor) estimate of pi.

You should now be set to use Hadoop. In future posts we will look at using the S3 filesystem from AWS as well as Google Cloud Storage.

Files and Pipes in R Video Demo

I've worked with various alternate file handlers in Python before and wanted to explore the options in R. I was pleasantly surprised to find prebuilt handlers for tasks like compressing data. In addition, a pipe() function is available that lets you run less common commands on your file, like gpg for encryption.
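As a rough sketch of what these functions look like in practice (the file names and the gpg recipient are placeholders, and the pipe() example assumes gpg is installed with a usable key):

# Write a data frame to a gzip-compressed CSV via a gzfile() connection
con <- gzfile("mtcars.csv.gz", "w")
write.csv(mtcars, con, row.names = FALSE)
close(con)

# Read it back; the connection transparently decompresses
df <- read.csv(gzfile("mtcars.csv.gz"))

# pipe() streams data through an external command, e.g. gpg for encryption
out <- pipe("gpg --encrypt --recipient me@example.com -o mtcars.csv.gpg", "w")
write.csv(mtcars, out, row.names = FALSE)
close(out)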

I put together a quick video demo of how to use these functions, and it's available on YouTube:

If you are having a hard time reading the text, click here to view the video directly on YouTube.

Comment here or on the video with any feedback or questions.

Markov Chain Simulation

I've been reading up on Markov chains and related concepts. On the Wikipedia page there is an example of a two-state Markov process. I decided to simulate it in R and plot the mean of the means.

Quick Code example here:
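Below is a minimal sketch of what that simulation might look like, assuming the transition probabilities from the Wikipedia example (P(E -> E) = 0.3, P(A -> E) = 0.4); the number of steps and chains is arbitrary:

set.seed(42)

n_steps  <- 1000   # steps per chain
n_chains <- 100    # number of simulated chains

# Simulate one chain and return the fraction of steps spent in state E
simulate_chain <- function(n) {
  state <- character(n)
  state[1] <- "E"
  for (i in 2:n) {
    p_to_e <- if (state[i - 1] == "E") 0.3 else 0.4
    state[i] <- if (runif(1) < p_to_e) "E" else "A"
  }
  mean(state == "E")
}

chain_means <- replicate(n_chains, simulate_chain(n_steps))

# Running mean of the per-chain means; it should settle near 4/11 ~ 0.3636
plot(cumsum(chain_means) / seq_along(chain_means), type = "l",
     xlab = "chain", ylab = "mean of means (state E)")
mean(chain_means)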

The mean of means (of state E) is close to 0.36. If you take 0.3 * 0.36 + 0.4 * (1 - 0.36), you get about 0.364, so this seems to make sense: I'm weighting the probability of switching to (or staying in) E by the probability of already being in each state, which is just the stationarity condition p = 0.3p + 0.4(1 - p), whose solution is p = 4/11 ≈ 0.364.