panther_ip
New Member
I'm new to Hadoop MapReduce (4 days, to be precise) and I've been asked to perform distributed XML parsing on a cluster. From my (re)search on the Internet, it should be fairly easy using Mahout's XmlInputFormat, but my task is to make sure the system works for huge (~5TB) XML files.

As far as I know, the file splits sent to the mappers cannot be larger than the HDFS block size (or the per-job split size). [Correct me if I'm mistaken.]

The issue I'm facing is that some XML elements are large (~200MB) while others are small (~1MB).

So my question is: what happens when the XML element chunk created by XmlInputFormat is bigger than the block size? Will it send the entire large element (say 200MB) to a single mapper, or will it break the element across several splits (64+64+64+8)?

I currently don't have access to the company's Hadoop cluster (and won't for some time), so I can't run a test and find out. Kindly help me out.
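For reference, this is roughly the driver I have in mind, pieced together from examples I found online. It's untested since I don't have cluster access yet, and the XmlInputFormat import path and the <record> start/end tags are just placeholders for my actual setup:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// XmlInputFormat ships with Mahout; the package differs between Mahout
// versions, so this import may need adjusting for my setup
import org.apache.mahout.text.wikipedia.XmlInputFormat;

public class XmlParseDriver {

    // Each call to map() should receive one complete <record>...</record>
    // element as the value
    public static class XmlElementMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // value holds the whole XML element, however large it is;
            // the real parsing logic would go here
            context.write(new Text("element"), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Start/end tags that XmlInputFormat scans for; "<record>" is a
        // placeholder for whatever my real element is called
        conf.set("xmlinput.start", "<record>");
        conf.set("xmlinput.end", "</record>");

        Job job = new Job(conf, "xml parse");
        job.setJarByClass(XmlParseDriver.class);
        job.setInputFormatClass(XmlInputFormat.class);
        job.setMapperClass(XmlElementMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

My worry is what happens inside XmlInputFormat when one of those <record> elements straddles (or is bigger than) a block boundary.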