{"id":4098,"date":"2017-12-23T11:33:05","date_gmt":"2017-12-23T16:33:05","guid":{"rendered":"http:\/\/hoolihan.net\/blog-tim\/?p=4098"},"modified":"2017-12-23T11:33:05","modified_gmt":"2017-12-23T16:33:05","slug":"hadoop-accessing-google-cloud-storage","status":"publish","type":"post","link":"http:\/\/hoolihan.net\/blog-tim\/2017\/12\/23\/hadoop-accessing-google-cloud-storage\/","title":{"rendered":"Hadoop: Accessing Google Cloud Storage"},"content":{"rendered":"<p>First, go <a href=\"https:\/\/cloud.google.com\/dataproc\/docs\/concepts\/connectors\/cloud-storage\" rel=\"noopener\" target=\"_blank\">here<\/a> to choose the hadoop google cloud storage connector for your version of hadoop, likely hadoop 2.<\/p>\n<p>Copy that file to $HADOOP_HOME\/share\/hadoop\/tools\/lib\/. If you followed the instruction in the <a href=\"http:\/\/hoolihan.net\/blog-tim\/2017\/12\/23\/hadoop-accessing-s3\/\" rel=\"noopener\" target=\"_blank\">prior post<\/a>, that directory is already in your class path. If not, add the following to your hadoop-env.sh file (found in $HADOOP_CONF directory):<\/p>\n<p><code>#GS \/ AWS S3 Support<br \/>\nexport HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME\/share\/hadoop\/tools\/lib\/*<\/code><\/p>\n<p>Create a service account in google cloud that has the necessary Storage permissions. Download the credentials and save somewhere, in my case I renamed the file and saved it in .config\/gcloud\/hadoop.json.<\/p>\n<p>Add the following properties in your core-site.xml:<br \/>\n<code>&lt;configuration&gt;<br \/>\n  &lt;property&gt;<br \/>\n    &lt;name&gt;fs.gs.project.id&lt;\/name&gt;<br \/>\n    &lt;value&gt;someproject-123&lt;\/value&gt;<br \/>\n    &lt;description&gt;<br \/>\n      Required. 
Google Cloud Project ID with access to configured GCS buckets.<br \/>\n    &lt;\/description&gt;<br \/>\n  &lt;\/property&gt;<\/p>\n<p>  &lt;property&gt;<br \/>\n    &lt;name&gt;google.cloud.auth.service.account.enable&lt;\/name&gt;<br \/>\n    &lt;value&gt;true&lt;\/value&gt;<br \/>\n    &lt;description&gt;<br \/>\n      Whether to use a service account for GCS authorization.<br \/>\n    &lt;\/description&gt;<br \/>\n  &lt;\/property&gt;<\/p>\n<p>  &lt;property&gt;<br \/>\n    &lt;name&gt;google.cloud.auth.service.account.json.keyfile&lt;\/name&gt;<br \/>\n    &lt;value&gt;\/Users\/tim\/.config\/gcloud\/hadoop.json&lt;\/value&gt;<br \/>\n  &lt;\/property&gt;<\/p>\n<p>  &lt;property&gt;<br \/>\n    &lt;name&gt;fs.gs.impl&lt;\/name&gt;<br \/>\n    &lt;value&gt;com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem&lt;\/value&gt;<br \/>\n    &lt;description&gt;The implementation class of the GS Filesystem.&lt;\/description&gt;<br \/>\n  &lt;\/property&gt;<\/p>\n<p>  &lt;property&gt;<br \/>\n    &lt;name&gt;fs.AbstractFileSystem.gs.impl&lt;\/name&gt;<br \/>\n    &lt;value&gt;com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS&lt;\/value&gt;<br \/>\n    &lt;description&gt;The implementation class of the GS AbstractFileSystem.&lt;\/description&gt;<br \/>\n  &lt;\/property&gt;<\/p>\n<p>&lt;\/configuration&gt;<\/code><\/p>\n<p>Note: change someproject-123 to your actual project ID, which can be found in the Google Cloud dashboard.<\/p>\n<p>Now test this setup with:<\/p>\n<p><code>hdfs dfs -ls gs:\/\/somebucket<\/code><\/p>\n<p>Of course, you&#8217;ll need to replace somebucket with an actual bucket\/directory in your Google Cloud Storage account.<\/p>\n<p>Now you should be set up to use S3 and Google Cloud Storage with your local Hadoop setup.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>First, go here to choose the Hadoop Google Cloud Storage connector for your version of Hadoop, likely Hadoop 2. Copy that file to $HADOOP_HOME\/share\/hadoop\/tools\/lib\/. 
If you followed the instructions in the prior post, that directory is already in your classpath. If not, add the following to your hadoop-env.sh file (found in the $HADOOP_CONF directory): #GS [&hellip;]<\/p>\n","protected":false,"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[55,306,332,19],"tags":[340,337,335],"class_list":["post-4098","post","type-post","status-publish","format-standard","hentry","category-data","category-data-science","category-hadoop","category-os-x","tag-google-cloud","tag-hadoop","tag-macos"],"_links":{"self":[{"href":"http:\/\/hoolihan.net\/blog-tim\/wp-json\/wp\/v2\/posts\/4098","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/hoolihan.net\/blog-tim\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/hoolihan.net\/blog-tim\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/hoolihan.net\/blog-tim\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/hoolihan.net\/blog-tim\/wp-json\/wp\/v2\/comments?post=4098"}],"version-history":[{"count":0,"href":"http:\/\/hoolihan.net\/blog-tim\/wp-json\/wp\/v2\/posts\/4098\/revisions"}],"wp:attachment":[{"href":"http:\/\/hoolihan.net\/blog-tim\/wp-json\/wp\/v2\/media?parent=4098"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/hoolihan.net\/blog-tim\/wp-json\/wp\/v2\/categories?post=4098"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/hoolihan.net\/blog-tim\/wp-json\/wp\/v2\/tags?post=4098"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}