Hadoop: Accessing S3 – Tim Hoolihan

This post is part of a series on setting up Hadoop locally on macOS for development and learning purposes. In the first post, we installed Hadoop.

If you get stuck or need more detail, feel free to check out the Apache docs on S3 support.

First, we have to add the directory with the necessary jar files to the Hadoop classpath. In hadoop-env.sh (which is in the $HADOOP_CONF directory), add the following lines to the end of the file:

#AWS S3 Support
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*
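If you would rather script this step, here is a minimal sketch. The helper name add_s3_classpath is made up for illustration, and the example call assumes $HADOOP_CONF points at your configuration directory as described above. It appends the lines only when the export is not already present, so running it twice should not duplicate the entry.

```shell
# add_s3_classpath is a hypothetical helper, not part of Hadoop.
# Usage: add_s3_classpath /path/to/hadoop-env.sh
add_s3_classpath() {
  env_file="$1"
  # Single quotes keep the variables and the * literal; they are expanded
  # later, when Hadoop sources hadoop-env.sh.
  line='export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*'
  # -F: fixed string, -x: match the whole line, -q: no output
  if ! grep -qxF "$line" "$env_file" 2>/dev/null; then
    printf '\n#AWS S3 Support\n%s\n' "$line" >> "$env_file"
  fi
}

# Example call (assumes $HADOOP_CONF is set in your shell):
# add_s3_classpath "$HADOOP_CONF/hadoop-env.sh"
```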

Make sure you have the following properties set in your core-site.xml file (which is in the $HADOOP_CONF directory).

<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>KEY_HERE</value>
    <description>AWS access key ID.
    Omit for IAM role-based or provider-based authentication.</description>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>SECRET_KEY_HERE</value>
    <description>AWS secret key.
    Omit for IAM role-based or provider-based authentication.</description>
  </property>
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
    <description>The implementation class of the S3A Filesystem</description>
  </property>
  <property>
    <name>fs.AbstractFileSystem.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3A</value>
    <description>The implementation class of the S3A AbstractFileSystem.</description>
  </property>
</configuration>
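To confirm that Hadoop is actually reading these properties, you can ask it for the resolved values. The sketch below wraps `hdfs getconf -confKey` in a small function; the name check_s3a_conf is made up for illustration. It deliberately prints only the two implementation classes, so your keys never end up in the terminal history or logs.

```shell
# check_s3a_conf is a hypothetical helper, not part of Hadoop.
# It prints the resolved value of each non-secret S3A property.
check_s3a_conf() {
  for key in fs.s3a.impl fs.AbstractFileSystem.s3a.impl; do
    printf '%s = %s\n' "$key" "$(hdfs getconf -confKey "$key")"
  done
}

# Example call (requires hdfs on your PATH):
# check_s3a_conf
```

If a value comes back empty or with an error, the most likely cause is that core-site.xml was edited in a directory other than the one Hadoop is reading.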

Note that you will need to replace KEY_HERE and SECRET_KEY_HERE with your actual AWS access key ID and secret key. You can also set the appropriate environment variables with your keys instead. I put them in this file because I use multiple AWS profiles defined in the AWS configuration files, which Hadoop does not pick up.
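For the environment variable route, one possible sketch is to copy the keys out of a named AWS profile into the standard AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY variables. This assumes the AWS CLI is installed, that your Hadoop version's S3A connector reads those variables, and that the two key properties are left out of core-site.xml. The function name use_aws_profile and the profile name in the example call are made up for illustration.

```shell
# use_aws_profile is a hypothetical helper, not part of Hadoop or the AWS CLI.
# It reads the keys of one named profile and exports them for this shell.
use_aws_profile() {
  profile="$1"
  AWS_ACCESS_KEY_ID="$(aws configure get aws_access_key_id --profile "$profile")" || return 1
  AWS_SECRET_ACCESS_KEY="$(aws configure get aws_secret_access_key --profile "$profile")" || return 1
  export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
}

# Example call (the profile name "dev" is an assumption):
# use_aws_profile dev
```

The exports only last for the current shell session, which keeps the keys out of any file on disk beyond the AWS configuration itself.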

You can test access by using a public data set. For example, I tested with:

hdfs dfs -ls s3a://nasanex/NEX-DCP30

You should see the contents of that bucket, which includes 5 files.
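If the listing fails, it can help to separate classpath problems from credential problems. Because the bucket is public, one way to do that is to list it with anonymous credentials: if this works but the plain command does not, the keys are the likely culprit. This is a sketch that assumes a Hadoop version that supports the fs.s3a.aws.credentials.provider property; the function name s3a_ls_anonymous is made up for illustration.

```shell
# s3a_ls_anonymous is a hypothetical helper, not part of Hadoop.
# It lists an S3A path without sending any AWS credentials.
s3a_ls_anonymous() {
  hdfs dfs \
    -D fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider \
    -ls "$1"
}

# Example call:
# s3a_ls_anonymous s3a://nasanex/NEX-DCP30
```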

Note the use of s3a in the protocol. This is the preferred provider over s3n and the deprecated s3.

In the next post, we’ll look at setting up Google Cloud Storage in a similar manner.
