Thursday, August 18, 2016

Quick 1..2 on how to create an EMR cluster with a custom bootstrap action (BA)

This is more a reference for me than for those reading this. I know I'm going to have to do it again in the future and I really don't want to run into the problems I had in getting this to work.

So, the AWS cli command to create the cluster:

myBA='--bootstrap-action Path="s3://< my-s3-bucket >/shell/install_profile.sh"'
aws emr create-cluster --release-label emr-5.0.0 --name testBA-10 \
    --applications Name=Hive Name=Spark Name=Zeppelin Name=Ganglia \
    --ec2-attributes KeyName=my-Keypair --region us-east-1 --use-default-roles \
    --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
                      InstanceGroupType=CORE,InstanceCount=1,InstanceType=m3.xlarge \
    ${myBA}

The shell variable myBA holds the --bootstrap-action flag and the path to the S3 bucket that stores my script for the BA. It's worth noting that the script should exit with a 0 exit value if you don't want your BA to fail. Here's my script:

   #!/usr/bin/env bash
   set -x
   set -e
   HADOOP_HOME=/home/hadoop

   # My profile
   aws s3 cp s3://< my S3 bucket >/AWSStuff/std/myfuncs ${HADOOP_HOME}
   echo "About to modify my .bashrc login"
   echo -e "\n. ~/myfuncs\ngetNewProfile\n. .hbw" >> ${HADOOP_HOME}/.bashrc

   # Install byobu
   sudo /usr/bin/yum -y --enablerepo=epel install byobu

   # And we're ready to rock
   exit 0

Bingo. We're ready to roll.
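If you want to confirm the BA actually ran, here's the quick check I'd reach for (a sketch; j-XXXXXXXXXXXXX is a placeholder cluster id and the log path is the usual EMR location, so verify it on your release):

   # List the bootstrap actions registered against the cluster
   aws emr list-bootstrap-actions --cluster-id j-XXXXXXXXXXXXX --region us-east-1

   # On the master node, the BA's stdout/stderr normally end up under
   # /var/log/bootstrap-actions/ (one numbered directory per action)
   sudo ls -R /var/log/bootstrap-actions/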

Sunday, July 17, 2016

s3-dist-cp, regexes and the groupBy switch


I spent an inordinate amount of time trying to work out how s3-dist-cp [and more specifically, the groupBy flag] works, because you would think that the documentation's reference to the Wikipedia page on regexes would mean that Perl-compatible regular expressions would work.

Not so fast there Tonto.....

It took me a good deal of experimenting to find this out. Hopefully this blog post helps you save some time....Please post your saved time to me at the address at the bottom of this post!

test-split/split-1/full
                  /full.1
                  /full.10
                  /full.11
                  /full.12
test-split/split-2/full.14
                  /full.15
                  /full.16
                  /full.17
                  /full.18
test-split/split-3/full.2
                  /full.20
                  /full.21
                  /full.22
                  /full.23
test-split/split-4/full.25
                  /full.26
                  /full.27
                  /full.28
                  /full.3
test-split/split-5/full.5
                  /full.6
                  /full.7
                  /full.8
                  /full.9

Anything captured within the '()' will be 'saved' and used to form the name of the output file.

I tried this regex:
--groupBy '(full).1[5-7]'

expecting that it would groupBy
full.15, full.16 and full.17

Nope, it didn't work until I did this:
--groupBy '.*(full).1[5-7]*'

So it seems the regex is looking for anything (.*) prior to 'full' too - the full path!

The --groupBy is not a RE as I was expecting it to be!
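A quick way to convince yourself of this (a sketch using grep -Ex to force a whole-string match, which appears to mirror how groupBy treats the full path; the key below is just an example from the listing above):

   # The pattern on its own does not match the whole key...
   echo "s3://s3-lab/test-split/split-2/full.15" | grep -Ex '(full)\.1[5-7]'     # no output
   # ...but with a leading .* it does
   echo "s3://s3-lab/test-split/split-2/full.15" | grep -Ex '.*(full)\.1[5-7]'   # prints the key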

Once I'd figured out that these regexes are not really regexes, I tried this:

s3-dist-cp --src s3://s3-lab/test-split/ --srcPattern='.*split-[12].*' --dest hdfs:///user/hadoop/PBXfull/ --groupBy ".*(full).1[0-9]*"

In the above, I expected s3-dist-cp would use a source pattern of .*split-[12].* for the directories and then the groupBy to select the files WITHIN those directories. Something was wrong though, because output ended up [seemingly] randomly in split-2 or split-1. I renamed the directories to foo, baz, bar, ipsum, lorum as this was my use-case.....

test-split/foo/full
                  /full.1
                  /full.10
                  /full.11
                  /full.12
test-split/bar/full.14
                  /full.15
                  /full.16
                  /full.17
                  /full.18
test-split/baz/full.2
                  /full.20
                  /full.21
                  /full.22
                  /full.23
test-split/ipsum/full.25
                  /full.26
                  /full.27
                  /full.28
                  /full.3
test-split/lorum/full.5
                  /full.6
                  /full.7
                  /full.8
                  /full.9

Then I ran this:

s3-dist-cp --src s3://s3-lab/test-split/ --srcPattern='.*(foo|bar|baz).*' --dest hdfs:///user/hadoop/PBXfull/s2/ --groupBy ".*(full).1[0-9]*"
     
This DIDN'T do what I expected (I expected it to grab foo, baz and bar, sifting out only the files full.10, full.11, etc).

So I tried this instead:

s3-dist-cp --src s3://s3-lab/test-split/ --srcPattern='.*baz.*' --dest hdfs:///user/hadoop/PBXfull/s2/ --groupBy ".*(full).1[0-9]*"

And now it worked. The main catch is that with srcPattern '.*baz.*' the output ends up in PBXfull/s2/baz, whereas with (foo|bar|baz) it put all the output into PBXfull/s2/foo....

But it doesn't end there....
What I would really like is that all the files in foo be concatenated, then all the files in baz, then all the files in bar, etc. Not one large file.

Thanks to Barbaros, who came up with the solution; I'm documenting it here.

So in trying to work this out, I tried:

   s3-dist-cp --src=s3://s3-lab/test-split/foo/ --dest=hdfs:///user/hadoop/PBXfull/ --groupBy=".*(full)\..*"

This joined the files in test-split/foo together into a file 'full', but clearly this is going to break if I try to join the files in bar, baz, etc...

Let's test a couple of options:

   s3-dist-cp --src=s3://s3-lab/test-split/ --dest=hdfs:///user/hadoop/PBXfull/ --groupBy=".*(full)\..*"

Yup. It creates a file (full) combining all the files and outputs it to one (presumably the last) directory (in my case, I still have split-5 knocking around).

   s3-dist-cp --src=s3://s3-lab/test-split/ --srcPattern=".*bar.*" --dest=hdfs:///user/hadoop/PBXfull/ --groupBy=".*(full)\..*"

Yup. This does what I want: bar is concatenated into a file full in PBXfull/bar/full, but there's still a problem. The main one is that I have 10000 of these foo, bar, baz directories and I really want every one in its own directory, so I can't specify the srcPattern every time. It'll drive me crazy....crazier...
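One brute-force workaround would be to script the per-directory runs rather than typing the srcPattern by hand. A rough, untested sketch (bucket and groupBy are from the examples above; the per-directory --dest is my own choice so each result lands in its own output directory):

   # Loop over the top-level prefixes under test-split/ and run one
   # s3-dist-cp per directory
   for dir in $(aws s3 ls s3://s3-lab/test-split/ | awk '/PRE/ {print $2}' | tr -d '/'); do
       s3-dist-cp --src=s3://s3-lab/test-split/${dir}/ \
                  --dest=hdfs:///user/hadoop/PBXfull/${dir}/ \
                  --groupBy=".*(full)\..*"
   done

That's one MapReduce job per directory though, which gets painful with 10000 of them, and it's exactly why the single-run solution below is so much nicer.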

In the end, I tested Barbaros' solution:

   s3-dist-cp --src=s3://s3-lab/test-split/ --dest=hdfs:///user/hadoop/PBXfull/ --groupBy=".*/(\\w+)/.*"

The groupBy here won't group by the filename full.*, but WILL group by the directory foo, bar, baz, etc....i.e. all the full.* files will be concatenated and grouped by the directory.
 
This produced:
   PBXfull/foo/foo [note the filename is now foo, not full]
   PBXfull/bar/bar [again, the filename is bar, not full]
etc.

The groupBy here is grouping by the directory and not the filename!

Wow! So nothing is quite what it appears to be.

A final example:
  s3-dist-cp --src=s3://s3-lab/test-split/ --dest=hdfs:///user/hadoop/PBXfull/ --groupBy=".*/(split)\-([0-9])/.*(full)\.1[0-9]*"

This, in theory, groups by the split-1, split-2, etc. Or not. The output is:

   PBXfull/split-2/split2full

And this makes sense because:
  • the first part of the groupBy, (split)\-([0-9]), will consider the split-1, split-2, etc. directories, and
  • the second part, .*(full)\.1[0-9]*, will only consider the files full.10, full.11, full.12, etc.

And there are only 2 directories with these files in them, namely split-1, and split-2. Which means the last 'split-' will be where the output of the files goes - hence split-2/split2full.

Ok. Don't worry about sending me the time you spent reading this post :-), but hopefully it helps someone (perhaps even me) sometime in the future.

Wednesday, May 25, 2016

Loading CSV data into DynamoDB using Hive on EMR

To complete this recipe, you will need:

* An AWS account and an EMR cluster with Hive installed (it's installed by default when launching an EMR 4.6 cluster)
* Some data in CSV format that you can load into an S3 bucket (2)
* A DynamoDB table into which you plan to load this data (note, you need not specify all the column names - just the primary key and perhaps extra hash keys)

The data I'm going to use is a set of data that looks as follows:
   id; fortune; randomNumber

I generated this using the fortune(1) program on Linux. The random number comes from $RANDOM in bash, and the id is of course a simple counter.

   1;Q:Why did the germ cross the microscope? A:To get to the other slide.;22913
   2;Do not sleep in a eucalyptus tree tonight.;2448
   ...
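For reference, here's a rough sketch of how such a file might be generated (untested here; it flattens each fortune onto a single line and strips semicolons so they don't collide with the field delimiter):

   # Produce 1000 rows of "id;fortune;randomNumber"
   for i in $(seq 1 1000); do
       f=$(fortune | tr '\n' ' ' | tr -d ';')
       echo "${i};${f};${RANDOM}"
   done > fortunes.csv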

The general gist of this is to load the data into S3, then create a meta-table in Hive that will allow Hive to "select" through this meta-table from S3 and "insert" into the DynamoDB table.

Upload the data to s3 using the AWS cli:

 $ aws s3 cp fortunes.csv s3://*my-s3-bucket*/mydata/
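A quick sanity check that the upload landed where you expect (same bucket placeholder as above):

 $ aws s3 ls s3://*my-s3-bucket*/mydata/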
  
Once you've created the EMR cluster, log into the master node using ssh:

  ssh hadoop@ec2-52-3-247-254.compute-1.amazonaws.com

Now it's time to create your meta-table in Hive for the S3 data. Type

  $ hive

(to get into the hive prompt)
 
  hive> CREATE EXTERNAL TABLE fortunes_from_s3 (id bigint, fortune string, randomNum bigint) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\;" LOCATION 's3://*my-s3-bucket*/mydata';  

At this point, you should be able to "select" from this meta-table. This is a good test:
  select * from fortunes_from_s3;

If you get results that you expect from your data in S3, then we're half-way there.
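You can also do a quick row count from the shell (hive -e just runs a single statement non-interactively) and compare it with the source file:

  $ hive -e 'select count(*) from fortunes_from_s3;'
  $ wc -l fortunes.csv    # the two counts should match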

Now it's time to create an external meta-table (hive_test_table) in Hive that maps onto the DynamoDB table, so we can load the data.

  hive> CREATE EXTERNAL TABLE hive_test_table (ID bigint, Fortune string, RandomNum bigint) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES ("dynamodb.table.name" = "", "dynamodb.column.mapping" = "ID:id,Fortune:fortune,RandomNum:randomnum"); (1)


Note that one needs to map the fields, in this case:
  ID maps to id (in DDB), Fortune maps to fortune (in DDB), etc., where ID and Fortune are the Hive meta-table columns and id and fortune are the DynamoDB attributes.


We've got to the last stage at this point: loading the data into DynamoDB. To do this, one "inserts" from the "select":

  hive> INSERT OVERWRITE TABLE hive_test_table SELECT * FROM fortunes_from_s3;

There will be a load of stuff that'll fly by as it loads the data:

  Query ID = hadoop_20160525155656_2ad0b238-f350-4805-8d8a-ebe9646f276c
  Total jobs = 1
  Launching Job 1 out of 1
  Number of reduce tasks is set to 0 since there's no reduce operator
  Starting Job = job_1463217542741_0016, Tracking URL = http://ip-192-168-0-122.ec2.internal:20888/proxy/application_1463217542741_0016/
  Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1463217542741_0016
  Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
  2016-05-25 15:56:34,246 Stage-0 map = 0%,  reduce = 0%
  2016-05-25 15:56:51,858 Stage-0 map = 100%,  reduce = 0%, Cumulative CPU 4.31 sec
  MapReduce Total cumulative CPU time: 4 seconds 310 msec
  Ended Job = job_1463217542741_0016
  MapReduce Jobs Launched:
  Stage-Stage-0: Map: 1   Cumulative CPU: 4.31 sec   HDFS Read: 200 HDFS Write: 0   SUCCESS
  Total MapReduce CPU Time Spent: 4 seconds 310 msec
  OK
  Time taken: 32.273 seconds



It's time to look in DynamoDB to check whether the data are loaded. If they're there, then the job is done.
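From the AWS CLI, a spot-check looks something like this (the table name is whatever you created; --limit just keeps the output short, and note that the item count reported by describe-table is only refreshed periodically):

  aws dynamodb scan --table-name <your-ddb-table> --limit 5 --region us-east-1
  aws dynamodb describe-table --table-name <your-ddb-table> --query 'Table.ItemCount'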

Permissions

IAM may well stop you doing this job. In order to make sure the data can be loaded, create a new role in IAM. This role should have, at minimum, the following policies attached:

  AmazonElasticMapReduceRole
  AmazonElasticMapReduceforEC2Role


This will ensure that EMR can get the data from S3. Attach this role to the DynamoDB under Access Control policies for the table.

Gotchas

(1) The dynamodb.column.mapping can catch one out. The ID:id,... mapping should not have any spaces between the column mappings (i.e. after the "," and before the next column mapping).
(2) Make sure that the CSV is the only thing under the S3 location and don't specify the actual file name; Hive will try to read every file directly under that location.

Tuesday, May 1, 2012

Human resources vs Personnel

Human Resources vs Personnel department

So something DaveK said to me the other day got me thinking about this. His comment went something like this: "the problem with HR was when they changed its name from personnel department to HR". I thought he was making too much of a fuss really, in that they were both pretty ineffective.

But something in me has made me ponder this point for longer, and the more I think about it, the more I realise that calling someone a "human resource" is a bit like calling someone a gogga, or some other form of un-PC hate-speech. What was it in the apartheid days - dehumanise them and then it's easier to persecute them. So now here we are, human resources. We're not people any longer. We're just resources that, well, are human, but can be thought of primarily as resources. And what's a resource anyhow? Merriam-Webster describes a resource as "a source of information or expertise". So if humans are resources, and we apply this definition, then one would expect that being a source of information and expertise would mean the humans in question get treated with some degree of value.


My experience of HR though, [ and not only in my present position, but in previous companies ], is that the humans which the HR department should be valuing are seen as of little value. IT used to have [ and in many places still does ] a reputation of being out for themselves. They couldn't care a hoot for the user or what they want and need. My experience is that HR have the same selfish attitude. So why is this the case? Why is it that HR people seem to wield such power in organisations and seem oblivious to the very people they should be serving - the people who are of value to the organisation?


"Great Human Resource professionals add value to any organization. Recruiting and retaining star performers, building a productive workforce, coaching managers to perform at higher levels, ensuring that the organization stays compliant, and raising the bar on performance are what HR brings to the table"

This may be true, but I've never seen it. Are we just extremely short of great HR people, or have the HR people I've ever dealt with lost touch with what's important - the people? So that brings me to the personnel department. Somehow, that name keeps the people who should be dealing with people mindful that they're dealing with people. It could just be me, but would a return to the name personnel department make a difference to how they deal with the valuable people in the organisation?

Sunday, January 3, 2010

Paperless Geocaching with my Nokia e71

Recently I've got into geocaching. Not that I've not been interested before, but I simply didn't want to lay out the heaps of cash they want for GPS's. But, thanks to cell phone technological improvements and an upgrade (not to an HTC with Virgin Mobile!), I have a GPS receiver on my phone.

This started me thinking about doing some geocaching, especially since the chillens don't like to go for a walk the way my wife and I do, and need some form of carrot to get them walking regularly. Enter geocaching.

This is a tutorial on how to do paperless geocaching primarily because I don't want to have to pre-prepare for this past-time. I want to simply do one if I have the time.

I run Linux, which can sometimes make things a little more challenging. Fortunately I'm not the first to attempt this, but after scouring the 'Net and with some help from a colleague, I've got this right now. So, here are my experiences and technical insights.

What you'll need:
Hardware:
  • A Symbian S60 phone (mine is a Nokia e71) with built-in GPS receiver.
  • A PC that can run ruby (pretty much all machines, but if you can't apt-get it, then you'll need to do some more reading!)
Software:
  • A program called geotoad (Geotoad) (the reason for this will be explained below)
  • SmartGPX (SmartGPX)
Account:
  • A geocaching.com account
And finally:
  • The co-ordinates of where you wish to start from.

Recipe:
Unless you decide to become a premium member of geocaching.com, you can't obtain caches in the GPX format (which is essentially an XML file of all the caches you select, their difficulty, description, the clue, and the last 5 logs), so you're stuffed and can't do paperless at all. This is where geotoad comes in. Geotoad is a neat ruby program that will allow you to define which geocaches you're interested in and then will download them as GPX files. While it has a command line interface (which some may not like), I think it's just grand as it consumes little memory/resources and is quick and easy to use.

To use it, you're going to have to give it your geocaching.com account details. You've signed up for an account already, right? If not, visit Geocaching.com and do so now.

Also, using Google Maps, you can install a GPS co-ordinate plugin, which will allow you to get the GPS co-ordinates of pretty much any place you wish to start your search (filter) from.

Now it's time to fire up geotoad. You'll use options (1) username, (2) search type (can be coords, or another type), (3) the co-ordinates from where you're likely to start your geocaching adventures, and (4) the distance (in miles, sadly) as a radius within which you wish to do paperless geocaching.

I personally chose the same co-ordinates and then varied the distance from 50 miles to 200 miles from this point and saved all those in separate GPX files (option 23). In addition, I varied the starting points, so now whether I'm going to the Cedarberg or the Garden Route, I can load up cache's in those areas. Nice!

Once you have set these settings (1-4 and 23), hit (s) for search and then go and make coffee. Geotoad, in order not to put too much load on the servers, backs off while downloading the individual caches. Depending on how many you're loading, it may take up to an hour.

Have GPX file...will geocache!

Not yet big daddy. Time to get some software for the phone. The phone software comes in the form of SmartGPX. It's a great piece of programming and it's not crashed once! Well done guys :-).

You'll need 2 pieces of software actually: the SmartGPX and the AddOns. Download both. You'll have to sign these using the Symbian Signed website (SymbianSigned). For this you're going to need the IMEI number of the phone. Easy if your phone responds to the *#06# keypad code....but mine didn't. Had to remove the battery. Upload both pieces of s/w to the webpage (one at a time) and then download them again. They'll be signed. Install both.

Now it's time to load the GPX file(s). Sterreman (waiting for a link from him) refers to not trashing your speed camera GPX files by overwriting them with this GPX file. I have no experience of this, so if this is a concern to you, do some more research. I can't see it being more complicated than joining the 2 GPX files (perhaps removing some header information from the 2nd one) since the GPX file is only XML and human readable.
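For what it's worth, here's a rough, untested sketch of that join (the filenames caches.gpx and speedcams.gpx are placeholders, and it assumes both files are plain GPX with one <wpt> element per entry; check the result before trusting it):

   # Strip the closing tag from the caches file, splice in only the
   # <wpt> elements from the speed camera file, then re-close the document.
   sed -i 's|</gpx>||' caches.gpx
   sed -n '/<wpt/,/<\/wpt>/p' speedcams.gpx >> caches.gpx
   echo '</gpx>' >> caches.gpx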

Right. Start SmartGPX and Import the GPX file for the place(s) you're likely to be (you've transferred the GPX output from the PC running geotoad to the phone already have you?).

Now, the annoying thing about the inbuilt GPS is that it cannot change the way its co-ordinates are displayed. They're always in degrees, minutes, seconds and hundredths of a second. Not of any use for geocaching, since all the co-ordinates on the site are in degrees, minutes and hundredths of a minute. So, the navigation tool of the Nokia can't understand the GPX co-ordinates. Annoying, I know, but that's where the SmartGPX AddOns come in. They allow you to select a particular geocache, and then import those co-ordinates into the 'Navigation tool'. From there, it's simply a matter of using the navigation tool to find the cache.

But wait...there's more.

Once you've found the cache, you can log it on your phone immediately. Nice! There's a 'Log your visit' option. So, you can log a note, the time you found the cache, and of course what you took/left. That log is saved on the phone, but clearly, that's a problem if you do multiple caches in a day. You don't want to re-type what you typed into the phone when you actually logged it.

So, when you get back to your computer, you can Export these logs, and import them straight into geocaching.com using this link (Upload Field Notes). That won't publish them immediately but will allow you to modify them before publishing to the website. So....geocaching got easier.

Have fun being the search engine!

Tuesday, November 3, 2009

Telkom monopoly - their dominance continues (to break the rest of us)!

Telkom once again got bad press on the issue of the link between Sutherland (SALT) and the Observatory.


One would have thought, given the competitiveness of business, that Telkom would not wish such bad press on themselves. As I suspect though, they simply couldn't care! In effect, Telkom have responded to the stink these articles created by simply sticking to their guns and demanding over twice the original, signed contract price. They're wanting in excess of R20M for this link - the signed contract was for R10M.

It's estimated that to lay a new fibre, the cost would be approximately R40M, only ~40% more than Telkom are asking. How is this possible Telkom?

1) How do you explain this change in price?
2) How do you explain the fact that this bad press seems to be ignored by you?
3) And how do any of us continue to do business while excluding Telkom?

Telkom have been embarrassed more than once. The Winston pigeon stunt showed up Telkom for providing a less-than-broadband solution. They face a penalty for anti-competitive behaviour, and yet they continue to stifle innovation, science, education and the progress of research and business in South Africa. Where will it end?

I cannot tell you how frustrating this is. Apart from anything else, Telkom hold the whole of South Africa to ransom. As a colleague put it, this makes a mockery of their slogan 'Touching Tomorrow'. One is certainly left wondering whether we'll even be able to compete with other African countries as they speedily de-regulate their telecoms industries.

Officials at Telkom this week say they 'just want the SALT problem to go away' and yet they've dug their heels in, saying 'a mistake was made in the tender process'. From what I understand, this is not the only place they're reneging on their deals. So, how does one hold them to their contractual agreements? That a mistake was made points to either:
  • Incompetence in drawing up the response to the tender or
  • they're simply lying.
Given the current climate in South Africa, I suspect both!

This morning's Business Report headlines with an article "This decade is marked by march of mediocrity", and oh how we're seeing that in South Africa! Without getting side-tracked into why we've ended up in mediocrity, I do feel depressed about being held to ransom by Telkom; or perhaps now it's more important than ever to stand up and be counted because the 'better life for all' is slipping through our fingers - all because of a lack of decisive leadership.

Friday, July 10, 2009

Meeting people


I guess there are those people in the limelight (like, let's say, Michael Jackson) who you just know aren't going to be that great to meet 1-on-1. Watching Wimbledon the other day, I was rooting for Roger Federer for a number of reasons, but primarily because I think that if I caught him in an airport, he'd give me the time of day.

Today was my last day of 2 weeks looking after my 3 (or 4 or 5) children. It's been a rewarding, sometimes trying but generally interesting 2 weeks. Having 3 certainly helps! But that's not the point of this musing. Cara and I have been trying to build a xylophone for her school project. It was an ambitious project (though at the outset, I thought "how hard can this be!"). We're almost done now. The Internet, some planning on paper and then Cara and I tackled it. There was much sawing and glueing, and we got 90% of the way there. I'd made the keys, but not tuned them. Save my two earbones (and Jenny says those don't work very well; especially when it comes to 'have you washed the car lately?'), I have not a musical bone in my body and the Internet was either overly mathematical or generally vague on how to tune these things. So, true to dad fashion, I phoned a friend. Well; not actually a friend, but someone who makes their living playing the Marimbas.

Enter Ross Johnson.

So, having looked them up on Google, I got hold of amaAmbush productions. The guy I spoke to said Ross would call me back. I thought "ja, right!". But he called, and he offered to have us come to their factory so he could help me tune these beasties. We duly arrived and what an experience! Ross went out of his way to include the children. He explained what we were about to do, played a piece ('In the Jungle') on one of his Marimbas with them playing along, and kept them focused on their task, including them all the way.


It's funny. For such a long time I've been to see his concerts and enjoyed every minute. He looked like the Roger Federer type; but who's to know? Yet today, Ross showed why he's destined for great things. It was such a pleasure meeting someone of his caliber. If only I were that influential at his age! Sigh.

I guess he'll not remember me at future concerts, but I'll still believe that he's a really fantastic guy who is unbelievably talented and doing more good here in our funny South Africa than most other people out there. So there you go.