Sunday, July 17, 2016

s3-dist-cp, regexes and the groupBy switch


I spent an inordinate amount of time trying to work out how s3-dist-cp [and more specifically, the groupBy flag] works. You would think that the reference to the Wikipedia link on regexes would mean that Perl-compatible regular expressions just work.

Not so fast there Tonto.....

It took me a good deal of experimenting to find this out. Hopefully this blog post helps you save some time... Please post your saved time to me at the address at the bottom of this post!

Here's the test layout I used in S3:

test-split/split-1/full
                  /full.1
                  /full.10
                  /full.11
                  /full.12
test-split/split-2/full.14
                  /full.15
                  /full.16
                  /full.17
                  /full.18
test-split/split-3/full.2
                  /full.20
                  /full.21
                  /full.22
                  /full.23
test-split/split-4/full.25
                  /full.26
                  /full.27
                  /full.28
                  /full.3
test-split/split-5/full.5
                  /full.6
                  /full.7
                  /full.8
                  /full.9

Whatever is matched within the '()' [the capture groups] will be 'saved' and used for the output file name.

I tried this regex:
--groupBy '(full).1[5-7]'

expecting that it would groupBy
full.15, full.16 and full.17

Nope, it didn't work until I did this:
--groupBy '.*(full).1[5-7]*'

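So it seems the regex has to account for anything (.*) prior to 'full' too - it is matched against the full path, not just the file name!

The --groupBy is a RE, just not used the way I was expecting: it seems it has to match the whole path, rather than be found somewhere inside it.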
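
To make the difference concrete, here's a small sketch in Python. It's only a stand-in: Python's re module is not what s3-dist-cp runs, and I'm assuming (based on the experiments above) that groupBy has to match the whole path. The paths are made up to mirror my test layout. re.search finds a match anywhere in the string, which is what I was expecting; re.fullmatch only succeeds if the entire string matches.

```python
import re

# Hypothetical full paths, modelled on the test layout above.
paths = [
    "s3://s3-lab/test-split/split-1/full.1",
    "s3://s3-lab/test-split/split-2/full.14",
    "s3://s3-lab/test-split/split-2/full.15",
    "s3://s3-lab/test-split/split-2/full.16",
    "s3://s3-lab/test-split/split-2/full.17",
    "s3://s3-lab/test-split/split-2/full.18",
]

first_try = re.compile(r"(full).1[5-7]")       # my first groupBy attempt
second_try = re.compile(r".*(full).1[5-7]*")   # the one that worked

# What I expected: the pattern is searched for anywhere in the path.
search_hits = [p for p in paths if first_try.search(p)]

# Assumed s3-dist-cp behaviour: the pattern must match the whole path.
whole_hits_first = [p for p in paths if first_try.fullmatch(p)]
whole_hits_second = [p for p in paths if second_try.fullmatch(p)]

print("search, first try:   ", search_hits)
print("fullmatch, first try:", whole_hits_first)
print("fullmatch, second try:", whole_hits_second)
```

In this sketch the first pattern finds full.15, full.16 and full.17 when searched for, but matches nothing as a whole-path match. The second pattern matches those three files as a whole path, and also full.1, because the trailing * makes the [5-7] optional.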

Once I'd figured out that the regexes don't behave quite like the regexes I know, I tried this:

s3-dist-cp --src s3://s3-lab/test-split/ --srcPattern='.*split-[12].*' --dest hdfs:///user/hadoop/PBXfull/ --groupBy ".*(full).1[0-9]*"

In the above, I expected s3-dist-cp would use a source pattern of .*split-[12].* for the directories and then the groupBy to select the files WITHIN those directories. Something was wrong though, because output ended up [seemingly] randomly in split-2 or split-1. I renamed the directories to foo, baz, bar, ipsum, lorum as this was my use-case.....

test-split/foo/full
              /full.1
              /full.10
              /full.11
              /full.12
test-split/bar/full.14
              /full.15
              /full.16
              /full.17
              /full.18
test-split/baz/full.2
              /full.20
              /full.21
              /full.22
              /full.23
test-split/ipsum/full.25
                /full.26
                /full.27
                /full.28
                /full.3
test-split/lorum/full.5
                /full.6
                /full.7
                /full.8
                /full.9

Then I ran this:

s3-dist-cp --src s3://s3-lab/test-split/ --srcPattern='.*(foo|bar|baz).*' --dest hdfs:///user/hadoop/PBXfull/s2/ --groupBy ".*(full).1[0-9]*"
     
This DIDN'T do what I expected (I expected it to grab foo, baz and bar, sifting out only the files full.10, full.11, etc).

So I tried this instead:

s3-dist-cp --src s3://s3-lab/test-split/ --srcPattern='.*baz.*' --dest hdfs:///user/hadoop/PBXfull/s2/ --groupBy ".*(full).1[0-9]*"

And now it worked. The main problem is that the srcPattern affects where the output lands: with '.*baz.*' the output is in PBXfull/s2/baz, while with '.*(foo|bar|baz).*' all the output went into PBXfull/s2/foo....

But it doesn't end there....
What I would really like is for all the files in foo to be concatenated into one file, then all the files in baz into another, then all the files in bar, etc. Not one large file.

Thanks to Barbaros, who came up with the solution; I'm documenting it here.

So in trying to work this out, I tried:

   s3-dist-cp --src=s3://s3-lab/test-split/foo/ --dest=hdfs:///user/hadoop/PBXfull/ --groupBy=".*(full)\..*"

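This joined the files in test-split/foo together into a file 'full', but clearly this is going to break if I try to join the files in bar, baz, etc. as well, since the only thing captured is 'full' and every directory would end up with the same output name...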

Let's test a couple of options:

   s3-dist-cp --src=s3://s3-lab/test-split/ --dest=hdfs:///user/hadoop/PBXfull/ --groupBy=".*(full)\..*"

Yup. It creates a file (full) combining all the files and outputs it to one (presumably the last) directory (in my case, I still have split-5 knocking around).
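
A rough way to see why everything ends up in one file: the only thing this pattern captures is 'full', so every matching path gets the same key. The sketch below uses Python's re as a stand-in (it is not what s3-dist-cp runs), assumes whole-path matching, and uses made-up paths modelled on my layout. group_key is a helper name I invented for illustration.

```python
import re

# Same pattern as in the command above.
pattern = re.compile(r".*(full)\..*")

# Hypothetical full paths modelled on my test layout.
paths = [
    "s3://s3-lab/test-split/foo/full",
    "s3://s3-lab/test-split/foo/full.1",
    "s3://s3-lab/test-split/bar/full.14",
    "s3://s3-lab/test-split/baz/full.2",
]

def group_key(path):
    """Join the captured groups into one key, or return None if the path doesn't match."""
    m = pattern.fullmatch(path)
    return "".join(m.groups()) if m else None

keys = {path: group_key(path) for path in paths}
distinct_keys = {k for k in keys.values() if k is not None}

for path, key in keys.items():
    print(path, "->", key)
print("distinct keys:", distinct_keys)
```

Every path that matches maps to the single key 'full', regardless of the directory it lives in. Also note that in this sketch the plain file 'full' (no dot) doesn't match the pattern at all, because of the \. after the group.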

   s3-dist-cp --src=s3://s3-lab/test-split/ --srcPattern=".*bar.*" --dest=hdfs:///user/hadoop/PBXfull/ --groupBy=".*(full)\..*"

Yup. This does what I want. bar is concatenated into a file full in PBXfull/bar/full, but there's still a problem. The main one is that I have 10000 of these foo, bar, baz directories and I really want every one in its own directory, so I can't specify the srcPattern every time. It'll drive me crazy....crazier...

In the end, I tested Barbaros' solution:

   s3-dist-cp --src=s3://s3-lab/test-split/ --dest=hdfs:///user/hadoop/PBXfull/ --groupBy=".*/(\\w+)/.*"

The groupBy here won't group by the filename full.*, but WILL group by the directory foo, bar, baz, etc....i.e. all the full.* files will be concatenated and grouped by the directory.
 
This produced:
   PBXfull/foo/foo [note the filename is now foo, not full]
   PBXfull/bar/bar [again, the filename is bar, not full]
etc.

The groupBy here is grouping by the directory and not the filename!
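
Here's a sketch of that grouping, again in Python as a stand-in for whatever s3-dist-cp does internally, and again assuming the pattern has to match the whole path. The paths are hypothetical, modelled on my layout. The captured directory name becomes the key, and files sharing a key are the ones that would be concatenated together.

```python
import re
from collections import defaultdict

# Same pattern as in the command above (the shell turns \\w into \w).
pattern = re.compile(r".*/(\w+)/.*")

# Hypothetical full paths modelled on my test layout.
paths = [
    "s3://s3-lab/test-split/foo/full",
    "s3://s3-lab/test-split/foo/full.1",
    "s3://s3-lab/test-split/bar/full.14",
    "s3://s3-lab/test-split/bar/full.15",
    "s3://s3-lab/test-split/baz/full.2",
]

# key (the captured directory name) -> list of files sharing that key
groups = defaultdict(list)
for path in paths:
    m = pattern.fullmatch(path)
    if m:
        groups["".join(m.groups())].append(path)

for key in sorted(groups):
    print(key, groups[key])
```

The keys come out as foo, bar and baz, which lines up with the output file names I got.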

Wow! So nothing is quite what it appears to be.

A final example:
  s3-dist-cp --src=s3://s3-lab/test-split/ --dest=hdfs:///user/hadoop/PBXfull/ --groupBy=".*/(split)\-([0-9])/.*(full)\.1[0-9]*"

This, in theory, groups by split-1, split-2, etc. Or not. The output is:

   PBXfull/split-2/split2full

And this makes sense because:
   the first part of the groupBy (split)\-([0-9])
   will consider the split-1, split-2, etc.
And the second part .*(full)\.1[0-9]*
  will only consider files:
     full.10, full.11, full.12, etc.

And there are only 2 directories with these files in them, namely split-1, and split-2. Which means the last 'split-' will be where the output of the files goes - hence split-2/split2full.
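
To see where the name split2full comes from, here's a sketch that joins the three captured groups for a single path. It uses Python's re as a stand-in, assumes whole-path matching, and uses hypothetical paths; group_key is a helper name I made up for illustration.

```python
import re

# Same pattern as in the final example above.
pattern = re.compile(r".*/(split)\-([0-9])/.*(full)\.1[0-9]*")

def group_key(path):
    """Join all captured groups into one name, or return None if the path doesn't match."""
    m = pattern.fullmatch(path)
    return "".join(m.groups()) if m else None

# A file in split-2 whose name starts with full.1
key_split2 = group_key("s3://s3-lab/test-split/split-2/full.14")

# A file in split-3: full.2 does not start with full.1, so no match
key_split3 = group_key("s3://s3-lab/test-split/split-3/full.2")

print(key_split2)
print(key_split3)
```

The three groups (split), ([0-9]) and (full) are glued together without the '-' or the '/', since those sit outside the parentheses. The split-3 file isn't matched at all.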

Ok. Don't worry about sending me the time you spent reading this post :-), but hopefully it helps someone (perhaps even me) sometime in the future.