First, go here to choose the Hadoop Google Cloud Storage connector for your version of Hadoop (most likely Hadoop 2).
Copy that file to $HADOOP_HOME/share/hadoop/tools/lib/. If you followed the instructions in the prior post, that directory is already on your classpath. If not, add the following to your hadoop-env.sh file (found in the $HADOOP_CONF directory):
#GS / AWS S3 Support
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*
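For concreteness, the copy step might look something like this; the jar name below is only an example, so use whatever connector version you actually downloaded, and the ls is just a quick check that the file landed in the right place:
cp ~/Downloads/gcs-connector-latest-hadoop2.jar $HADOOP_HOME/share/hadoop/tools/lib/
ls $HADOOP_HOME/share/hadoop/tools/lib/ | grep gcs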
Create a service account in Google Cloud that has the necessary Storage permissions. Download the credentials and save them somewhere; in my case I renamed the file and saved it as ~/.config/gcloud/hadoop.json.
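If you prefer the command line over the console, something like the following gcloud commands will do it. The service account name (hadoop-gcs), the project ID (someproject-123), and the roles/storage.objectAdmin role are only examples; grant whatever level of Storage access your jobs actually need:
gcloud iam service-accounts create hadoop-gcs --display-name "Hadoop GCS access"
gcloud projects add-iam-policy-binding someproject-123 \
  --member serviceAccount:hadoop-gcs@someproject-123.iam.gserviceaccount.com \
  --role roles/storage.objectAdmin
gcloud iam service-accounts keys create ~/.config/gcloud/hadoop.json \
  --iam-account hadoop-gcs@someproject-123.iam.gserviceaccount.com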
Add the following properties to your core-site.xml:
<configuration>
  <property>
    <name>fs.gs.project.id</name>
    <value>someproject-123</value>
    <description>
      Required. Google Cloud Project ID with access to the configured GCS buckets.
    </description>
  </property>
  <property>
    <name>google.cloud.auth.service.account.enable</name>
    <value>true</value>
    <description>
      Whether to use a service account for GCS authorization.
    </description>
  </property>
  <property>
    <name>google.cloud.auth.service.account.json.keyfile</name>
    <value>/Users/tim/.config/gcloud/hadoop.json</value>
  </property>
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
    <description>The implementation class of the GCS FileSystem.</description>
  </property>
  <property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
    <description>The implementation class of the GCS AbstractFileSystem.</description>
  </property>
</configuration>
Note: change someproject-123 to your actual project ID, which can be found in the Google Cloud dashboard.
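If you have the gcloud CLI installed, you can also list your project IDs from the terminal:
gcloud projects list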
Now test this setup with:
hdfs dfs -ls gs://somebucket
Of course you’ll need to replace somebucket with an actual bucket (or bucket path) in your Google Cloud Storage account.
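To confirm that writes work as well, a few more commands are worth trying; again, somebucket is a placeholder and localfile.txt is just some local file you have lying around:
hdfs dfs -mkdir gs://somebucket/hadoop-test
hdfs dfs -put localfile.txt gs://somebucket/hadoop-test/
hdfs dfs -cat gs://somebucket/hadoop-test/localfile.txt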
Now you should be set up to use both S3 and Google Cloud Storage with your local Hadoop installation.
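As a final sanity check, the same kind of listing should work against S3, assuming you completed the S3 configuration from the prior post. The s3a:// scheme shown here is the one used by recent Hadoop 2.x releases; adjust it if you configured a different scheme, and replace some-s3-bucket with one of your own buckets:
hdfs dfs -ls s3a://some-s3-bucket/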