Simplifying the problem a bit: I have a set of text files with "records" delimited by double newline characters, like
'multiline text'
'empty line'
'multiline text'
'empty line'
and so forth.
I need to transform each multiline unit separately and then perform MapReduce on them.
However, as I'm aware from the default WordCount boilerplate in the Hadoop code, the input value variable in the following function is a single line, and there are no guarantees that the input is contiguous with the previous input line:
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException;
And I need the input value to be one unit of double-newline-delimited multiline text.
Some searching turned up the RecordReader class and the getSplits method, but no simple code examples that I could wrap my head around.
An alternative solution is to replace the newline characters in the multiline text with space characters and be done with it. I'd rather not, because there's quite a bit of text and it's time-consuming in terms of runtime. I'd also have to modify a lot of code if I went that route, so dealing with it through Hadoop is attractive to me.
If the files are small in size, they won't be split: each file will be one split, assigned to one mapper instance. In that case, I agree with Thomas. You can build the logical record in the mapper class by concatenating strings, and you can detect the record boundary by looking for an empty string coming in as the value to the mapper.
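A minimal sketch of that buffering approach, using the old mapred API from the question. The class name RecordBufferingMapper and the emit() helper are illustrative placeholders, and the flush in close() assumes the file may not end with a blank line:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class RecordBufferingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final StringBuilder record = new StringBuilder();
    private OutputCollector<Text, IntWritable> out;

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        out = output;
        String line = value.toString();
        if (line.isEmpty()) {
            // Empty line: everything buffered so far is one complete logical record.
            if (record.length() > 0) {
                emit(record.toString(), output);
                record.setLength(0);
            }
        } else {
            if (record.length() > 0) {
                record.append('\n');
            }
            record.append(line);
        }
    }

    @Override
    public void close() throws IOException {
        // Flush the last record if the file does not end with a blank line.
        if (record.length() > 0 && out != null) {
            emit(record.toString(), out);
        }
    }

    // Placeholder: transform the assembled record and emit key/value pairs here.
    private void emit(String rec, OutputCollector<Text, IntWritable> output)
            throws IOException {
        output.collect(new Text(rec), new IntWritable(1));
    }
}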
However, if the files are big and do get split, I don't see any option other than implementing your own text input format class. Clone the existing Hadoop LineRecordReader and LineReader Java classes, and make a small change in your version of the LineReader class so that the record delimiter is two newlines instead of one. Once that's done, your mapper will receive multiple lines as the input value.
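As a side note, newer Hadoop versions (the new mapreduce API) can make the clone unnecessary: the stock LineRecordReader honors the textinputformat.record.delimiter property, so you can set the delimiter to two newlines in the job driver. A minimal sketch, assuming the new API and a hypothetical ParagraphMapper class; check that your Hadoop version supports this property:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ParagraphDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Deliver each double-newline-delimited block as a single map() value.
        conf.set("textinputformat.record.delimiter", "\n\n");

        Job job = Job.getInstance(conf, "paragraph job");
        job.setJarByClass(ParagraphDriver.class);
        job.setMapperClass(ParagraphMapper.class); // hypothetical mapper class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}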