Wednesday, May 25, 2016

Loading CSV data into DynamoDB using Hive on EMR

To complete this recipe, you will need:

* An AWS account and an EMR cluster with Hive installed (Hive is installed by default when launching an EMR 4.6 cluster)
* Some data in CSV format that you can load into an S3 bucket (2)
* A DynamoDB table into which you plan to load this data (note, you need not specify all the attributes up front - just the primary key, i.e. the hash key and a range key if you use one; see the CLI sketch below)
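
For reference, creating such a table from the AWS CLI looks roughly like this - a minimal sketch that assumes a placeholder table name of *my-ddb-table* keyed on a numeric id; adjust the name, key and throughput to suit:

  # Create a DynamoDB table with only the hash key defined;
  # the non-key attributes (fortune, randomnum) need no schema.
  aws dynamodb create-table \
      --table-name *my-ddb-table* \
      --attribute-definitions AttributeName=id,AttributeType=N \
      --key-schema AttributeName=id,KeyType=HASH \
      --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5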

The data I'm going to use is a set of data that looks as follows:
   id; fortune; randomNumber

I generated this using the fortune(1) program on Linux; the random number comes from $RANDOM in bash, and the id is simply a running count.

   1;Q:Why did the germ cross the microscope? A:To get to the other slide.;22913
   2;Do not sleep in a eucalyptus tree tonight.;2448
   ...
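
If you want to generate similar data yourself, a rough sketch along these lines would do it (this assumes fortune(1) is installed; I'm not claiming this is the exact loop I used):

  # Rough sketch: 1000 rows of id;fortune;randomNumber.
  # fortune -s keeps the quotes short; tr strips newlines and any stray
  # semicolons so they don't break the field delimiter.
  for i in $(seq 1 1000); do
    echo "${i};$(fortune -s | tr -d '\n' | tr ';' ',');${RANDOM}"
  done > fortunes.csv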

The general gist is to load the data into S3, create an external "meta-table" in Hive over the S3 data, and then "select" from that table and "insert" into a second external table that maps to DynamoDB.

Upload the data to S3 using the AWS CLI:

 $ aws s3 cp fortunes.csv s3://*my-s3-bucket*/mydata/
  
Once you've created the EMR cluster, log into the master node using ssh:

  ssh hadoop@ec2-52-3-247-254.compute-1.amazonaws.com

Now it's time to create your meta-table in Hive for the S3 data. Type

  $ hive

(to get into the hive prompt)
 
  hive> CREATE EXTERNAL TABLE fortunes_from_s3 (id bigint, fortune string, randomNum bigint) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\;" LOCATION 's3://*my-s3-bucket*/mydata';  

At this point, you should be able to "select" from this meta-table. This is a good test:
  select * from fortunes_from_s3;

If you get the results you expect from your data in S3, then we're half-way there.

Now it's time to create a second external meta-table in Hive (hive_test_table) that maps to the DynamoDB table; this is what we'll load the data through. Set dynamodb.table.name to the name of your DynamoDB table.

  hive> CREATE EXTERNAL TABLE hive_test_table (ID bigint, Fortune string, RandomNum bigint) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES ("dynamodb.table.name" = "*my-ddb-table*", "dynamodb.column.mapping" = "ID:id,Fortune:fortune,RandomNum:randomnum"); (1)


Note that one needs to map the fields: here ID, Fortune and RandomNum are the Hive meta-table columns, and they map to the DynamoDB attributes id, fortune and randomnum respectively.


We've now reached the last stage: loading the data into DynamoDB. To do this, one "inserts" from the "select":

  hive> INSERT OVERWRITE TABLE hive_test_table SELECT * FROM fortunes_from_s3;

There will be a load of stuff that'll fly by as it loads the data:

  Query ID = hadoop_20160525155656_2ad0b238-f350-4805-8d8a-ebe9646f276c
  Total jobs = 1
  Launching Job 1 out of 1
  Number of reduce tasks is set to 0 since there's no reduce operator
  Starting Job = job_1463217542741_0016, Tracking URL = http://ip-192-168-0-122.ec2.internal:20888/proxy/application_1463217542741_0016/
  Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1463217542741_0016
  Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
  2016-05-25 15:56:34,246 Stage-0 map = 0%,  reduce = 0%
  2016-05-25 15:56:51,858 Stage-0 map = 100%,  reduce = 0%, Cumulative CPU 4.31 sec
  MapReduce Total cumulative CPU time: 4 seconds 310 msec
  Ended Job = job_1463217542741_0016
  MapReduce Jobs Launched:
  Stage-Stage-0: Map: 1   Cumulative CPU: 4.31 sec   HDFS Read: 200 HDFS Write: 0   SUCCESS
  Total MapReduce CPU Time Spent: 4 seconds 310 msec
  OK
  Time taken: 32.273 seconds



It's time to look in DynamoDB to check whether the data are loaded. If they're there, then the job is done.
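
If you'd rather check from the command line, a quick scan works too (again, *my-ddb-table* is a placeholder for whatever you put in dynamodb.table.name):

  # Pull back a handful of items to confirm the load worked.
  aws dynamodb scan --table-name *my-ddb-table* --max-items 5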

Permissions

IAM may well stop you from doing this job. In order to make sure the data can be loaded, create a new role in IAM. This role should have, at minimum, the following policies attached:

  AmazonElasticMapReduceRole
  AmazonElasticMapReduceforEC2Role


This will ensure that EMR can get the data from S3. Attach this role to the DynamoDB table under the Access Control policies for the table.
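
If you prefer to do this from the CLI, attaching the managed policies looks roughly like the following - a sketch that assumes the default EMR role names (EMR_DefaultRole and EMR_EC2_DefaultRole); substitute your own role names as needed:

  # Attach the EMR service policy to the service role...
  aws iam attach-role-policy \
      --role-name EMR_DefaultRole \
      --policy-arn arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole
  # ...and the EC2 instance policy to the role the cluster nodes run under.
  aws iam attach-role-policy \
      --role-name EMR_EC2_DefaultRole \
      --policy-arn arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role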

Gotchas

(1) The dynamodb.column.mapping value can catch one out: there must be no spaces between the individual column mappings in "ID:id,..." (i.e. after each "," and before the next mapping).
(2) Make sure that the CSV is the only thing under the S3 location and don't specify the actual file name in LOCATION; Hive will try to read everything directly under that path.
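
A quick way to confirm (2) is to list the prefix and check that the CSV is the only object under it:

  aws s3 ls s3://*my-s3-bucket*/mydata/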