I am working on a utility which reads multiple Parquet files at a time and writes them into a single output file.
The implementation is very straightforward: the utility reads Parquet files from a directory, reads the Groups from all the files, and puts them into a list. Then it uses ParquetWriter to write all these Groups into a single file. After reading about 600 MB it throws an OutOfMemoryError for Java heap space. It also takes 15-20 minutes to read and write 500 MB of data.
Is there a way to make this operation more efficient?
The read method looks like this:
ParquetFileReader reader = new ParquetFileReader(conf, path, ParquetMetadataConverter.NO_FILTER);
ParquetMetadata readFooter = reader.getFooter();
MessageType schema = readFooter.getFileMetaData().getSchema();
ParquetFileReader r = new ParquetFileReader(conf, path, readFooter);
reader.close();
PageReadStore pages = null;
try {
    while (null != (pages = r.readNextRowGroup())) {
        long rows = pages.getRowCount();
        System.out.println("Number of rows: " + rows);
        MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
        RecordReader<Group> recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema));
        for (int i = 0; i < rows; i++) {
            Group g = recordReader.read();
            groups.add(g);
        }
    }
} finally {
    System.out.println("close the reader");
    r.close();
}
The write method looks like this:
for (Path file : files) {
    groups.addAll(readData(file));
}
System.out.println("Number of groups from the parquet files: " + groups.size());

Configuration configuration = new Configuration();
Map<String, String> meta = new HashMap<String, String>();
meta.put("startkey", "1");
meta.put("endkey", "2");
GroupWriteSupport.setSchema(schema, configuration);
ParquetWriter<Group> writer = new ParquetWriter<Group>(
        new Path(outputFile),
        new GroupWriteSupport(),
        CompressionCodecName.SNAPPY,
        2147483647,    // block (row group) size: Integer.MAX_VALUE
        268435456,     // page size: 256 MB
        134217728,     // dictionary page size: 128 MB
        true,          // enable dictionary encoding
        false,         // disable validation
        ParquetProperties.WriterVersion.PARQUET_2_0,
        configuration);
System.out.println("Number of groups to write: " + groups.size());
for (Group g : groups) {
    writer.write(g);
}
writer.close();
I use these functions to merge Parquet files, but it is in Scala. Anyway, it may give you a good starting point.
import java.util
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.{ParquetFileReader, ParquetFileWriter}
import org.apache.parquet.hadoop.util.{HadoopInputFile, HadoopOutputFile}
import org.apache.parquet.schema.MessageType
import scala.collection.JavaConverters._

object ParquetFileMerger {
  def mergeFiles(inputFiles: Seq[Path], outputFile: Path): Unit = {
    val conf = new Configuration()
    val mergedMeta = ParquetFileWriter.mergeMetadataFiles(inputFiles.asJava, conf).getFileMetaData
    val writer = new ParquetFileWriter(conf, mergedMeta.getSchema, outputFile, ParquetFileWriter.Mode.OVERWRITE)

    writer.start()
    inputFiles.foreach(input => writer.appendFile(HadoopInputFile.fromPath(input, conf)))
    writer.end(mergedMeta.getKeyValueMetaData)
  }

  def mergeBlocks(inputFiles: Seq[Path], outputFile: Path): Unit = {
    val conf = new Configuration()
    val parquetFileReaders = inputFiles.map(getParquetFileReader)
    val mergedSchema: MessageType =
      parquetFileReaders
        .map(_.getFooter.getFileMetaData.getSchema)
        .reduce((a, b) => a.union(b))

    val writer = new ParquetFileWriter(HadoopOutputFile.fromPath(outputFile, conf), mergedSchema,
      ParquetFileWriter.Mode.OVERWRITE, 64 * 1024 * 1024, 8388608)

    writer.start()
    parquetFileReaders.foreach(_.appendTo(writer))
    writer.end(new util.HashMap[String, String]())
  }

  def getParquetFileReader(file: Path): ParquetFileReader = {
    ParquetFileReader.open(HadoopInputFile.fromPath(file, new Configuration()))
  }
}
I faced the very same problem. With not very big files (up to 100 MB), the writing time could be up to 20 minutes.
Try to use the kite-sdk API. I know it looks abandoned, but some things in it are done very efficiently. Also, if you like Spring, you can try spring-data-hadoop (which is a kind of wrapper over the kite-sdk API). In my case, using these libraries reduced the writing time to 2 minutes.
For example, you can write to Parquet in this manner (using spring-data-hadoop; writing with the kite-sdk API directly looks quite similar):
final DatasetRepositoryFactory repositoryFactory = new DatasetRepositoryFactory();
repositoryFactory.setBasePath(basePath);
repositoryFactory.setConf(configuration);
repositoryFactory.setNamespace("my-parquet-file");

DatasetDefinition datasetDefinition = new DatasetDefinition(targetClass, true, Formats.PARQUET.getName());
try (DataStoreWriter<T> writer = new ParquetDatasetStoreWriter<>(targetClass, repositoryFactory, datasetDefinition)) {
    for (T record : records) {
        writer.write(record);
    }
    writer.flush();
}
Of course you need to add some dependencies to your project (in my case this is spring-data-hadoop):
<dependency>
    <groupId>org.springframework.data</groupId>
    <artifactId>spring-data-hadoop</artifactId>
    <version>${spring.hadoop.version}</version>
</dependency>
<dependency>
    <groupId>org.springframework.data</groupId>
    <artifactId>spring-data-hadoop-boot</artifactId>
    <version>${spring.hadoop.version}</version>
</dependency>
<dependency>
    <groupId>org.springframework.data</groupId>
    <artifactId>spring-data-hadoop-config</artifactId>
    <version>${spring.hadoop.version}</version>
</dependency>
<dependency>
    <groupId>org.springframework.data</groupId>
    <artifactId>spring-data-hadoop-store</artifactId>
    <version>${spring.hadoop.version}</version>
</dependency>
If you absolutely want to do it using only the native Hadoop API, it will in any case be useful to take a look at the source code of these libraries to see how to write to Parquet files efficiently.
What you are trying to achieve is already possible using the merge command of parquet-tools. However, it is not recommended for merging small files, since it doesn't actually merge the row groups, only places them one after another (exactly how you describe it in your question). The resulting file will probably have bad performance characteristics.
If you would like to implement it yourself nevertheless, you can either increase the heap size, or modify the code so that it does not read all of the files into memory before writing the new one, but instead reads them one by one (or, even better, row group by row group) and immediately writes them to the new file. This way you will only need to keep a single file or row group in memory.
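For illustration, here is a minimal sketch of that streaming variant, reusing the classes from the code in the question. The mergeStreaming helper is hypothetical; it assumes all input files share the schema passed in, and it uses the library's default writer sizes rather than tuned values:
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.GroupWriteSupport;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.MessageType;

public static void mergeStreaming(List<Path> files, Path outputFile,
                                  MessageType schema, Configuration conf) throws Exception {
    GroupWriteSupport.setSchema(schema, conf);
    ParquetWriter<Group> writer = new ParquetWriter<Group>(outputFile, new GroupWriteSupport(),
            CompressionCodecName.SNAPPY,
            ParquetWriter.DEFAULT_BLOCK_SIZE, ParquetWriter.DEFAULT_PAGE_SIZE,
            ParquetWriter.DEFAULT_PAGE_SIZE, true, false,
            ParquetProperties.WriterVersion.PARQUET_2_0, conf);
    try {
        for (Path file : files) {
            ParquetFileReader reader = new ParquetFileReader(conf, file, ParquetMetadataConverter.NO_FILTER);
            try {
                PageReadStore pages;
                while ((pages = reader.readNextRowGroup()) != null) {
                    MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
                    RecordReader<Group> recordReader =
                            columnIO.getRecordReader(pages, new GroupRecordConverter(schema));
                    for (long i = 0, rows = pages.getRowCount(); i < rows; i++) {
                        // written immediately instead of being buffered in a list
                        writer.write(recordReader.read());
                    }
                }
            } finally {
                reader.close();
            }
        }
    } finally {
        writer.close();
    }
}
Since at most one row group's worth of records is in memory at any time, the heap footprint no longer grows with the total input size.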
I implemented a solution using Spark with a PySpark script; below is sample code for the same. It loads multiple Parquet files from a directory location, and if the files' schemas differ slightly, it merges them as well.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("App_name") \
    .getOrCreate()

dataset_DF = spark.read.option("mergeSchema", "true").load("/dir/parquet_files/")
dataset_DF.write.parquet("file_name.parquet")
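Note that dataset_DF.write.parquet(...) produces a directory of part files rather than one single file; if you really need exactly one output file, coalescing to a single partition first (dataset_DF.coalesce(1).write.parquet(...)) should do it, at the cost of writing through a single task.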
Hope this short solution helps.
Related
I need to create 3 separate files.
My batch job should read from Mongo, then parse the information and find the "business" column (3 types of business: RETAIL, HPP, SAX), then create a file for each respective business. The file name should be RETAIL + formattedDate, HPP + formattedDate, or SAX + formattedDate, and the information found in the DB should go inside a .txt file. I also need to turn the hard-coded .resource(new FileSystemResource("C:\\filewriter\\index.txt")) into something that will send the information to the right location; right now the hard-coded version works but only creates one .txt file.
example:
@Bean
public FlatFileItemWriter<PaymentAudit> writer() {
    LOG.debug("Mongo-writer");
    FlatFileItemWriter<PaymentAudit> flatFile = new FlatFileItemWriterBuilder<PaymentAudit>()
            .name("flatFileItemWriter")
            .resource(new FileSystemResource("C:\\filewriter\\index.txt"))
            // trying to create a path instead of hard coding it
            .lineAggregator(createPaymentPortalLineAggregator())
            .build();
    String exportFileHeader = "CREATE_DTTM";
    StringHeaderWriter headerWriter = new StringHeaderWriter(exportFileHeader);
    flatFile.setHeaderCallback(headerWriter);
    return flatFile;
}
My idea is something like the following, but I am not sure where to go from here:
public Map<String, List<PaymentAudit>> getPaymentPortalRecords() {
    List<PaymentAudit> recentlyCreated =
            paymentPortalRepository.findByCreateDttmBetween(yesterdayMidnight, yesterdayEndOfDay);
    List<PaymentAudit> retailList = new ArrayList<>();
    List<PaymentAudit> saxList = new ArrayList<>();
    List<PaymentAudit> hppList = new ArrayList<>();
    // String exportFilePath = "C://filewriter/"; ??????

    // Plain stream rather than parallelStream: ArrayList is not thread-safe
    recentlyCreated.stream().forEach(paymentAudit -> {
        if (paymentAudit.getBusiness().equalsIgnoreCase(RETAIL)) {
            retailList.add(paymentAudit);
        } else if (paymentAudit.getBusiness().equalsIgnoreCase(SAX)) {
            saxList.add(paymentAudit);
        } else if (paymentAudit.getBusiness().equalsIgnoreCase(HPP)) {
            hppList.add(paymentAudit);
        }
    });

    Map<String, List<PaymentAudit>> result = new HashMap<>();
    result.put(RETAIL, retailList);
    result.put(SAX, saxList);
    result.put(HPP, hppList);
    return result;
}
To create a file for each business object type, you can use the ClassifierCompositeItemWriter. In your case, you can create a writer for each type and add them as delegates in the composite item writer.
As for creating the file name dynamically, you need to use a step-scoped writer. There is an example in the Step Scope section of the reference documentation.
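Something like the following might work as a starting point (the writerFor helper, the date format, and the C:\filewriter output directory are illustrative; createPaymentPortalLineAggregator is the method from your existing configuration):
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.batch.item.file.builder.FlatFileItemWriterBuilder;
import org.springframework.batch.item.support.ClassifierCompositeItemWriter;
import org.springframework.core.io.FileSystemResource;

@Bean
public ClassifierCompositeItemWriter<PaymentAudit> compositeWriter() {
    // One delegate writer per business type, each with its own dated file name
    FlatFileItemWriter<PaymentAudit> retailWriter = writerFor("RETAIL");
    FlatFileItemWriter<PaymentAudit> hppWriter = writerFor("HPP");
    FlatFileItemWriter<PaymentAudit> saxWriter = writerFor("SAX");

    ClassifierCompositeItemWriter<PaymentAudit> writer = new ClassifierCompositeItemWriter<>();
    // Route each PaymentAudit to the delegate matching its business type
    writer.setClassifier(audit -> {
        switch (audit.getBusiness().toUpperCase()) {
            case "RETAIL": return retailWriter;
            case "SAX":    return saxWriter;
            default:       return hppWriter;
        }
    });
    return writer;
}

private FlatFileItemWriter<PaymentAudit> writerFor(String business) {
    String formattedDate = LocalDate.now().format(DateTimeFormatter.BASIC_ISO_DATE);
    return new FlatFileItemWriterBuilder<PaymentAudit>()
            .name(business.toLowerCase() + "Writer")
            .resource(new FileSystemResource("C:\\filewriter\\" + business + formattedDate + ".txt"))
            .lineAggregator(createPaymentPortalLineAggregator())
            .build();
}
Note that ClassifierCompositeItemWriter does not open or close its delegates itself, so each delegate writer has to be registered as a stream on the step in order to be opened and closed properly.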
Hope this helps.
My question is how to transform a DataStream into a List, for example in order to be able to iterate through it.
The code looks like this:
package flinkoracle;

//imports
//....

public class FlinkOracle {

    final static Logger LOG = LoggerFactory.getLogger(FlinkOracle.class);

    public static void main(String[] args) {
        LOG.info("Starting...");
        // get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        TypeInformation[] fieldTypes = new TypeInformation[]{
                BasicTypeInfo.STRING_TYPE_INFO,
                BasicTypeInfo.STRING_TYPE_INFO,
                BasicTypeInfo.STRING_TYPE_INFO,
                BasicTypeInfo.STRING_TYPE_INFO};
        RowTypeInfo rowTypeInfo = new RowTypeInfo(fieldTypes);

        // get the source from Oracle DB
        DataStream<?> source = env
                .createInput(JDBCInputFormat.buildJDBCInputFormat()
                        .setDrivername("oracle.jdbc.driver.OracleDriver")
                        .setDBUrl("jdbc:oracle:thin:@localhost:1521")
                        .setUsername("user")
                        .setPassword("password")
                        .setQuery("select * from table1")
                        .setRowTypeInfo(rowTypeInfo)
                        .finish());

        source.print().setParallelism(1);

        try {
            LOG.info("----------BEGIN----------");
            env.execute();
            LOG.info("----------END----------");
        } catch (Exception e) {
            e.printStackTrace();
        }
        LOG.info("End...");
    }
}
Thanks a lot in advance.
Br
Tamas
Flink provides an iterator sink to collect DataStream results for testing and debugging purposes. It can be used as follows:
import org.apache.flink.contrib.streaming.DataStreamUtils;

DataStream<Tuple2<String, Integer>> myResult = ...
Iterator<Tuple2<String, Integer>> myOutput = DataStreamUtils.collect(myResult);
You can copy an iterator to a new list like this:
while (iter.hasNext())
    list.add(iter.next());
Flink also provides a bunch of simple write*() methods on DataStream that are mainly intended for debugging purposes. When the data gets flushed to the target system depends on the implementation of the OutputFormat, which means that not all elements sent to the OutputFormat immediately show up in the target system. Note: these write*() methods do not participate in Flink's checkpointing, so in failure cases those records might be lost. A small usage example follows the list.
writeAsText() / TextOutputFormat
writeAsCsv(...) / CsvOutputFormat
print() / printToErr()
writeUsingOutputFormat() / FileOutputFormat
writeToSocket
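For example, a quick debugging dump of the stream to a single text file might look like this (the output path is illustrative):
import org.apache.flink.core.fs.FileSystem;

// parallelism 1 => a single output file instead of one file per subtask
myResult.writeAsText("file:///tmp/flink-output", FileSystem.WriteMode.OVERWRITE)
        .setParallelism(1);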
Source: link.
You may need to add the following dependency to use DataStreamUtils:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-contrib</artifactId>
    <version>0.10.2</version>
</dependency>
In newer versions, DataStreamUtils::collect has been deprecated. Instead you can use DataStream::executeAndCollect which, if given a limit, will return a List of at most that size.
var list = source.executeAndCollect(100);
If you do not know how many elements there are, or if you simply want to iterate through the results without loading them all into memory at once, you can use the no-arg version to get a CloseableIterator:
try (var iterator = source.executeAndCollect()) {
    // do something
}
I have the following problem: suppose I have a directory containing compressed directories, which in turn contain multiple files, stored on HDFS. I want to create an RDD consisting of some objects of type T, i.e.:
context = new JavaSparkContext(conf);
JavaPairRDD<String, String> filesRDD = context.wholeTextFiles(inputDataPath);
JavaRDD<T> processingFiles = filesRDD.map(fileNameContent -> {
    // The name of the file
    String fileName = fileNameContent._1();
    // The content of the file
    String content = fileNameContent._2();
    // Class T has a constructor taking the filename and the content of each
    // processed file (as two strings)
    T t = new T(content, fileName);
    return t;
});
Now, when inputDataPath is a directory containing files, this works perfectly fine, i.e. when it's something like:
String inputDataPath = "hdfs://some_path/*/*/"; // because it contains subfolders
But when there's a tgz containing multiple files, the file content (fileNameContent._2()) gives me a useless binary string (quite expected). I found a similar question on SO, but it's not the same case, because there the solution applies when each compressed archive consists of one file only, whereas in my case there are many other files which I want to read individually as whole files. I also found a question about wholeTextFiles, but this doesn't work in my case.
Any ideas how to do this?
EDIT:
I tried with the reader from here (testing it as in the function testTarballWithFolders()), but whenever I call
TarballReader tarballReader = new TarballReader(fileName);
I get a NullPointerException:
java.lang.NullPointerException
at java.util.zip.InflaterInputStream.<init>(InflaterInputStream.java:83)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:77)
at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
at utils.TarballReader.<init>(TarballReader.java:61)
at main.SparkMain.lambda$0(SparkMain.java:105)
at main.SparkMain$$Lambda$18/1667100242.call(Unknown Source)
at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1015)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Line 105 in SparkMain is the one I showed above in my edit of the post, and line 61 of TarballReader is
GZIPInputStream gzip = new GZIPInputStream(in);
and the input stream in passed to it is null; it comes from this line:
InputStream in = this.getClass().getResourceAsStream(tarball);
Am I on the right path here? If so, how do I continue? Why do I get this null value and how can I fix it?
One possible solution is to read the data with binaryFiles and extract the content manually.
Scala:
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.spark.input.PortableDataStream
import scala.util.Try
import java.nio.charset._

def extractFiles(ps: PortableDataStream, n: Int = 1024) = Try {
  val tar = new TarArchiveInputStream(new GzipCompressorInputStream(ps.open))
  Stream.continually(Option(tar.getNextTarEntry))
    // Read until the next entry is null
    .takeWhile(_.isDefined)
    // Flatten
    .flatMap(x => x)
    // Drop directories
    .filter(!_.isDirectory)
    .map(e => {
      Stream.continually {
        // Read n bytes
        val buffer = Array.fill[Byte](n)(-1)
        val i = tar.read(buffer, 0, n)
        (i, buffer.take(i))
      }
        // Take as long as we've read something
        .takeWhile(_._1 > 0)
        .map(_._2)
        .flatten
        .toArray
    })
    .toArray
}
def decode(charset: Charset = StandardCharsets.UTF_8)(bytes: Array[Byte]) =
  new String(bytes, charset)
sc.binaryFiles("somePath")
  .flatMapValues(x => extractFiles(x).toOption)
  .mapValues(_.map(decode()))
libraryDependencies += "org.apache.commons" % "commons-compress" % "1.11"
Full usage example with Java: https://bitbucket.org/zero323/spark-multifile-targz-extract/src
Python:
import tarfile
from io import BytesIO

def extractFiles(bytes):
    tar = tarfile.open(fileobj=BytesIO(bytes), mode="r:gz")
    return [tar.extractfile(x).read() for x in tar if x.isfile()]

(sc.binaryFiles("somePath")
    .mapValues(extractFiles)
    .mapValues(lambda xs: [x.decode("utf-8") for x in xs]))
A slight improvement on the accepted answer is to change
Option(tar.getNextTarEntry)
to
Try(tar.getNextTarEntry).toOption.filter( _ != null)
to contend with malformed or truncated .tar.gz files in a robust way.
BTW, is there anything special about the size of the buffer array? Would it be faster on average if it were closer to the average file size, maybe 500 KB in my case? Or is the slowdown I am seeing more likely the overhead of Stream relative to a more Java-ish while loop?
I'm using Nutch to crawl some websites (as a process that runs separate of everything else), while I want to use a Java (Scala) program to analyse the HTML data of websites using Jsoup.
I got Nutch to work by following the tutorial (the script didn't work for me, but executing the individual instructions did), and I think it's saving the websites' HTML in the crawl/segments/<time>/content/part-00000 directory.
The problem is that I cannot figure out how to actually read the website data (URLs and HTML) in a Java/Scala program. I read this document, but find it a bit overwhelming since I've never used Hadoop.
I tried to adapt the example code to my environment, and this is what I arrived at (mostly by guesswork):
val reader = new MapFile.Reader(FileSystem.getLocal(new Configuration()), ".../apache-nutch-1.8/crawl/segments/20140711115438/content/part-00000", new Configuration())
var key = null
var value = null
reader.next(key, value) // test for a single value
println(key)
println(value)
However, I am getting this exception when I run it:
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1873)
at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:517)
I am not sure how to work with a MapFile.Reader, specifically, what constructor parameters I am supposed to pass to it. What Configuration objects am I supposed to pass in? Is that the correct FileSystem? And is that the data file I'm interested in?
Scala:
val conf = NutchConfiguration.create()
val fs = FileSystem.get(conf)
val file = new Path(".../part-00000/data")
val reader = new SequenceFile.Reader(fs, file, conf)
val webdata = Stream.continually {
  val key = new Text()
  val content = new Content()
  reader.next(key, content)
  (key, content)
}
println(webdata.head)
Java:
public class ContentReader {
public static void main(String[] args) throws IOException {
Configuration conf = NutchConfiguration.create();
Options opts = new Options();
GenericOptionsParser parser = new GenericOptionsParser(conf, opts, args);
String[] remainingArgs = parser.getRemainingArgs();
FileSystem fs = FileSystem.get(conf);
String segment = remainingArgs[0];
Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
Text key = new Text();
Content content = new Content();
// Loop through sequence files
while (reader.next(key, content)) {
try {
System.out.write(content.getContent(), 0,
content.getContent().length);
} catch (Exception e) {
}
}
}
}
Alternatively, you can use org.apache.nutch.segment.SegmentReader (example).
I am writing a Java program using the SVNKit API, and I need to find the correct API class or call that will let me diff files stored in two separate locations.
1st file:
https://abc.edc.xyz.corp/svn/di-edc/tags/ab-cde-fgh-axsym-1.0.0/src/site/apt/releaseNotes.apt
2nd file:
https://abc.edc.xyz.corp/svn/di-edc/tags/ab-cde-fgh-axsym-1.1.0/src/site/apt/releaseNotes.apt
I have used the API calls listed below to generate the diff output, but I have been unsuccessful so far.
DefaultSVNDiffGenerator diffGenerator = new DefaultSVNDiffGenerator();
diffGenerator.displayFileDiff("", file1, file2, "10983", "8971", "text", "text/plain", output);
diffClient.doDiff(svnUrl1, SVNRevision.create(10868), svnUrl2, SVNRevision.create(8971), SVNDepth.IMMEDIATES, false, System.out);
Can anyone provide guidance on the correct way to do this?
Your code looks correct. But prefer using the new API:
final SvnOperationFactory svnOperationFactory = new SvnOperationFactory();
try {
    final ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
    final SvnDiffGenerator diffGenerator = new SvnDiffGenerator();
    diffGenerator.setBasePath(new File(""));

    final SvnDiff diff = svnOperationFactory.createDiff();
    diff.setSources(SvnTarget.fromURL(url1, svnRevision1), SvnTarget.fromURL(url2, svnRevision2));
    diff.setDiffGenerator(diffGenerator);
    diff.setOutput(byteArrayOutputStream);
    diff.run();
} finally {
    svnOperationFactory.dispose();
}
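Once diff.run() returns, the unified diff text can be read back from byteArrayOutputStream, for example with byteArrayOutputStream.toString().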