java - Processing paragraphs in text files as single records with Hadoop


Simplifying the problem a bit: I have a set of text files containing "records" delimited by double newline characters, like

'multiline text'

'empty line'

'multiline text'

'empty line'

and so forth.

I need to transform each multiline unit separately and then perform MapReduce on them.

However, I am aware that with the default WordCount setup in the Hadoop boilerplate code, the input value variable in the following function is a single line, and there is no guarantee that the input is contiguous with the previous input line.

public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException;

And I need the input value to be one unit of double-newline-delimited multiline text.

Some searching turned up the RecordReader class and the getSplits method, but no simple code examples I could wrap my head around.

An alternative solution is to replace the newline characters in the multiline text with space characters and be done with it. I'd rather not, because there's quite a bit of text and it's time-consuming in terms of runtime. I would also have to modify a lot of code to do it that way, so handling this through Hadoop is more attractive to me.

If the files are small in size, they won't be split: each file becomes one split, assigned to one mapper instance. In that case, I agree with Thomas: you can build the logical record in the mapper class by concatenating strings, and you can detect a record boundary by looking for an empty string coming into the mapper as the value.
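A minimal sketch of that buffering idea, with the grouping logic pulled out into a plain helper so it is easy to see (the Hadoop mapper plumbing is omitted; the class and method names here are illustrative, not from any Hadoop API):

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphBuffer {
    // Groups a stream of input lines into paragraph records:
    // an empty line marks the end of the current record.
    // Inside a mapper you would feed each incoming value here
    // and emit a record whenever one is completed.
    public static List<String> toRecords(List<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : lines) {
            if (line.isEmpty()) {
                // Boundary: flush the accumulated record, if any.
                if (current.length() > 0) {
                    records.add(current.toString());
                    current.setLength(0);
                }
            } else {
                if (current.length() > 0) {
                    current.append('\n');
                }
                current.append(line);
            }
        }
        // Flush the final record if the file doesn't end with a blank line.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }
}
```

Note that this only works because the whole file reaches a single mapper; if the file were split, a paragraph could straddle two splits and this buffering would silently break it in two.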

However, if the files are big and do get split, I don't see any option other than implementing your own text input format class. You can clone the existing Hadoop LineRecordReader and LineReader Java classes, then make one small change in your version of the LineReader class so that the record delimiter is two newlines instead of one. Once that is done, the mapper will receive multiple lines as the input value.
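As an aside, newer Hadoop releases expose the record delimiter of the stock TextInputFormat as a configuration property, `textinputformat.record.delimiter`, which can make the cloned-class approach unnecessary. A hedged sketch of the job setup (assuming the newer `org.apache.hadoop.mapreduce` API; mapper/reducer wiring elided):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ParagraphJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Treat a blank line (two consecutive newlines) as the record
        // delimiter, so each map() call sees one whole paragraph.
        conf.set("textinputformat.record.delimiter", "\n\n");
        Job job = Job.getInstance(conf, "paragraph job");
        // ... set mapper, reducer, input/output paths as usual ...
    }
}
```

Check whether your Hadoop version honors this property before relying on it; on older releases the clone-and-modify route described above is the way to go.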
