I need to process a large file and insert its contents into a DB without spending a lot of RAM doing so. I know we can read lines in streaming mode using the Apache Commons API or a BufferedReader, but I want to insert into the DB in batch mode, e.g. 1000 insertions in one go rather than one by one. Is reading the file line by line, adding to a list, counting its size, inserting, and then clearing the list the only way to achieve this?
Based on your description, Spring Batch fits very well.
Basically, it uses a chunk-oriented model to read/process/write the content, and it can also run concurrently for better performance.
@Bean
protected Step loadFeedDataToDbStep() {
    return stepBuilder.get("load new fincon feed").<com.xxx.Group, FinconFeed>chunk(250)
            .reader(itemReader(OVERRIDDEN_BY_EXPRESSION))
            .processor(itemProcessor(OVERRIDDEN_BY_EXPRESSION, OVERRIDDEN_BY_EXPRESSION_DATE, OVERRIDDEN_BY_EXPRESSION))
            .writer(itemWriter())
            .listener(archiveListener())
            .build();
}
You can refer to the Spring Batch documentation for more details.
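If you'd rather not pull in Spring Batch, the buffer-and-flush pattern the question describes maps directly onto JDBC batching. A minimal sketch, assuming a hypothetical table lines(content) and a placeholder connection URL:

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchLoader {
    public static void main(String[] args) throws Exception {
        final int batchSize = 1000;
        try (Connection conn = DriverManager.getConnection("jdbc:yourdb://host/db", "user", "pass"); // placeholder URL
             PreparedStatement ps = conn.prepareStatement("INSERT INTO lines(content) VALUES (?)");  // placeholder table
             BufferedReader reader = new BufferedReader(new FileReader("big-file.txt"))) {

            conn.setAutoCommit(false);
            String line;
            int count = 0;
            while ((line = reader.readLine()) != null) {
                ps.setString(1, line);
                ps.addBatch();
                if (++count % batchSize == 0) {
                    ps.executeBatch();    // send 1000 inserts in one go
                }
            }
            ps.executeBatch();            // flush the trailing partial batch
            conn.commit();
        }
    }
}

Calling executeBatch() every 1000 rows keeps only one batch worth of lines in memory at any time.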
I am trying to process a 6 GB CSV file (750 MB gzipped) using GCP Dataflow jobs. I am using machineType n1-standard-4, which has 15 GB of RAM and 4 vCPUs.
My Dataflow code:
PCollection<TableRow> tableRow = lines.apply("ToTableRow",
        ParDo.of(new StringToRowConverter()));

static class StringToRowConverter extends DoFn<String, TableRow> {
    private static final long serialVersionUID = 1L;

    @ProcessElement
    public void processElement(ProcessContext c) {
        String inputLine = c.element();
        String[] split = inputLine.split(",");
        TableRow output = new TableRow().set("id", split[0]).set("apppackage", split[1]);
        c.output(output);
    }
}
My job has been running for the last 2 hours and has still not finished.
When I manually split this large file into smaller parts, it works properly.
I have to process 400 GB of compressed files and load them into BigQuery. All zipped files are in GCP Storage.
My question is: if a single 6 GB file takes this long, how can I process 400 GB of zipped files?
Is there a way I can optimise this process so that I am able to insert this data into BigQuery?
6 GB of CSV is not much data. CSV is just a really inefficient way of storing numerical data, and even for string-like data it still carries significant overhead, is hard to parse, and is impossible to seek into at rest (it needs to be parsed first). So we can be pretty optimistic that this will actually work out, data-wise. It's an import problem.
Don't roll your own parser. For example: what about fields that contain a , in their text? There are enough CSV parsers out there.
You say you want to get that data into BigQuery, so go Google's way and follow:
https://cloud.google.com/bigquery/docs/loading-data-local#bigquery-import-file-java
since BigQuery already comes with its own builder that supports CSV.
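For reference, a minimal sketch of such a load job with the google-cloud-bigquery Java client, loading gzipped CSVs straight from GCS; the dataset, table and bucket names are placeholders:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class LoadCsvFromGcs {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId table = TableId.of("my_dataset", "my_table");          // placeholder names
        String sourceUri = "gs://my-bucket/path/*.csv.gz";             // gzipped CSVs already in GCS

        LoadJobConfiguration config = LoadJobConfiguration.newBuilder(table, sourceUri)
                .setFormatOptions(FormatOptions.csv())
                .build();

        Job job = bigquery.create(JobInfo.of(config));                 // submit the load job
        job = job.waitFor();                                           // block until BigQuery finishes it
        if (job != null) {
            System.out.println("Load finished with status: " + job.getStatus());
        }
    }
}

With a load job, BigQuery decompresses and parses the CSV itself, so the data never has to flow through a Dataflow worker at all.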
How can I use SparkContext (to create SparkSession or Cassandra Sessions) on executors?
If I pass it as a parameter to the foreach or foreachPartition, then it will have a null value. Shall I create a new SparkContext in each executor?
What I'm trying to do is as follows:
Read a dump directory with millions of XML files:
dumpFiles = Directory.listFiles(dumpDirectory)
dumpFilesRDD = sparkContext.parallelize(dumpFiles, numOfSlices)
dumpFilesRDD.foreachPartition(dumpFilePath->parse(dumpFilePath))
In parse(), every XML file is validated, parsed and inserted into several tables using Spark SQL. Only valid XML files will produce objects of the same type that can be saved. A portion of the data needs to be replaced by other keys before being inserted into one of the tables.
In order to do that, the SparkContext is needed inside parse() so it can call sparkContext.sql().
If I'd rephrase your question, what you want is to:
Read a directory with millions of XML files
Parse them
Insert them into a database
That's a typical Extract, Transform and Load (ETL) process that is terribly easy in Spark SQL.
Loading XML files can be done using a separate package spark-xml:
spark-xml A library for parsing and querying XML data with Apache Spark, for Spark SQL and DataFrames. The structure and test tools are mostly copied from CSV Data Source for Spark.
You can "install" the package using --packages command-line option:
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-xml_2.11:0.4.1
Quoting spark-xml's Scala API (with some changes to use SparkSession instead):
// Step 1. Loading XML files
val path = "the/path/to/millions/files/*.xml"
val spark: SparkSession = ???
val files = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "book")
.load(path)
That makes the first requirement almost a no-brainer. You've got your millions of XML files taken care of by Spark SQL.
Step 2 is about parsing the lines (from the XML files) and marking rows to be saved to appropriate tables.
// Step 2. Transform them (using parse)
def parse(line: String) = ???
val parseDF = files.map { line => parse(line) }
Your parse function could return something (as the main result) and the table that something should be saved to.
With the table markers, you split the parseDF into DataFrames per table.
val table1DF = parseDF.filter($"table" === "table1")
And so on (per table).
// Step 3. Insert into DB
table1DF.write.option(...).jdbc(...)
That's just a sketch of what you may really be after, but that's the general pattern to follow. Decompose your pipeline into digestible chunks and tackle one chunk at a time.
It is important to keep in mind that in Spark we are not supposed to program in terms of executors.
In the Spark programming model, your driver program is mostly a self-contained program in which certain sections are automatically converted to a physical execution plan, ultimately a bunch of tasks distributed across workers/executors.
When you need to execute something for each partition, you can use something like mapPartitions(). Refer to Spark: DB connection per Spark RDD partition and do mapPartition for further details. Pay attention to how the dbConnection object is enclosed in the function body.
It is not clear what you mean by a parameter. If it is just data (not a DB connection or similar), I think you need to use a broadcast variable.
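To make that concrete, here is a rough Java sketch of the per-partition pattern: the JDBC connection is created inside foreachPartition, so it is instantiated on the executor instead of being serialized from the driver. The connection URL and table are placeholders, and the insert stands in for whatever parse() really does:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PerPartitionWrite {
    public static void insertPaths(JavaSparkContext sc, List<String> dumpFiles, int numOfSlices) {
        JavaRDD<String> paths = sc.parallelize(dumpFiles, numOfSlices);
        paths.foreachPartition((Iterator<String> it) -> {
            // the connection is created inside the function body, so it lives on the executor:
            // one connection per partition instead of one per element
            try (Connection conn = DriverManager.getConnection("jdbc:yourdb://host/db");              // placeholder URL
                 PreparedStatement ps = conn.prepareStatement("INSERT INTO parsed(path) VALUES (?)")) { // placeholder table
                while (it.hasNext()) {
                    ps.setString(1, it.next());   // stand-in for the real parse-and-insert logic
                    ps.addBatch();
                }
                ps.executeBatch();
            }
        });
    }
}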
I am using the readCsvFile(path) function in the Apache Flink API to read a CSV file and store it in a list variable. How does it work using multiple threads?
For example, does it split the file based on some statistics? If yes, which statistics? Or does it read the file line by line and then send the lines to threads that process them?
Here is the sample code:
//default parallelism is 4
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
csvPath="data/weather.csv";
List<Tuple2<String, Double>> csv= env.readCsvFile(csvPath)
.types(String.class,Double.class)
.collect();
Suppose that we have an 800 MB CSV file on local disk. How does it distribute the work between those 4 threads?
The readCsvFile() API method internally creates a data source with a CsvInputFormat which is based on Flink's FileInputFormat. This InputFormat generates a list of so-called InputSplits. An InputSplit defines which range of a file should be scanned. The splits are then distributed to data source tasks.
So, each parallel task scans a certain region of a file and parses its content. This is very similar to how it is done by MapReduce / Hadoop.
This is the same as How does Hadoop process records split across block boundaries?
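If you want to see the splits for yourself, a rough sketch below asks the input format for them directly (it uses TextInputFormat, which shares FileInputFormat's split logic with the CSV format):

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.FileInputSplit;
import org.apache.flink.core.fs.Path;

public class ShowSplits {
    public static void main(String[] args) throws Exception {
        TextInputFormat format = new TextInputFormat(new Path("data/weather.csv"));
        format.configure(new Configuration());
        // ask for at least 4 splits, matching the default parallelism in the question
        FileInputSplit[] splits = format.createInputSplits(4);
        for (FileInputSplit split : splits) {
            System.out.println(split.getPath() + " @ " + split.getStart() + " len=" + split.getLength());
        }
    }
}

Each split is just a byte range of the file; every parallel source task then scans and parses its own range.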
I extracted some code from the flink-release-1.1.3 DelimitedInputFormat file.
// else ..
int toRead;
if (this.splitLength > 0) {
    // if we have more data, read that
    toRead = this.splitLength > this.readBuffer.length ? this.readBuffer.length : (int) this.splitLength;
}
else {
    // if we have exhausted our split, we need to complete the current record, or read one
    // more across the next split.
    // the reason is that the next split will skip over the beginning until it finds the first
    // delimiter, discarding it as an incomplete chunk of data that belongs to the last record in the
    // previous split.
    toRead = this.readBuffer.length;
    this.overLimit = true;
}
It's clear that if the format does not find the line delimiter within its own split, it will read into the next split to find it. (I haven't found the corresponding code for that part yet; I will keep looking.)
Plus: this is how I found the code, tracing from readCsvFile() down to DelimitedInputFormat.
I have more than 10 million JSON documents of the form:
["key": "val2", "key1" : "val", "{\"key\":\"val", \"key2\":\"val2"}"]
in one file.
Importing it using the Java driver API took around 3 hours with the following function (which imports one BSON document at a time):
public static void importJSONFileToDBUsingJavaDriver(String pathToFile, DB db, String collectionName) {
    // open file
    FileInputStream fstream = null;
    try {
        fstream = new FileInputStream(pathToFile);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        System.out.println("file does not exist, exiting");
        return;
    }
    BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
    // read it line by line
    String strLine;
    DBCollection newColl = db.getCollection(collectionName);
    try {
        while ((strLine = br.readLine()) != null) {
            // convert each line to BSON
            DBObject bson = (DBObject) JSON.parse(strLine);
            // insert the BSON document into the database
            try {
                newColl.insert(bson);
            } catch (MongoException e) {
                // duplicate key
                e.printStackTrace();
            }
        }
        br.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Is there a faster way? Maybe MongoDB settings influence the insertion speed? (For example, supplying the "_id" key, which functions as the index, so that MongoDB does not have to create an artificial key and index for each document, or disabling index creation at insertion time altogether.)
Thanks.
I'm sorry, but you're all picking at minor performance issues instead of the core one. Separating the file-reading logic from the inserting logic is a small gain. Loading the file in binary mode (via MMAP) is a small gain. Using Mongo's bulk inserts is a big gain, but still no dice.
The whole performance bottleneck is the DBObject bson = (DBObject) JSON.parse(line) call. In other words, the problem with the Java driver is that it needs a conversion from JSON to BSON, and this code seems to be awfully slow or badly implemented. A full JSON round trip (encode + decode) via JSON-simple, or especially via JSON-smart, is 100 times faster than the JSON.parse() command.
I know Stack Overflow is telling me right above this box that I should be answering the question, which I'm not, but rest assured that I'm still looking for an answer to this problem. I can't believe all the talk about Mongo's performance when this simple example code fails so miserably.
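If you want to test that claim, a hedged sketch that bypasses JSON.parse() by parsing each line with json-smart and wrapping the resulting map in a BasicDBObject (assuming the legacy DBObject API) looks roughly like this:

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import net.minidev.json.JSONObject;
import net.minidev.json.JSONValue;

public class FastParseInsert {
    // parse one line with json-smart instead of com.mongodb.util.JSON.parse()
    static DBObject toDbObject(String line) {
        JSONObject parsed = (JSONObject) JSONValue.parse(line); // JSONObject is a Map<String, Object>
        return new BasicDBObject(parsed);
    }

    static void insertLine(DBCollection coll, String line) {
        coll.insert(toDbObject(line));
    }
}

Whether this actually wins depends on your documents; it only replaces the parsing step, not the network round trips.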
I've imported a multi-line JSON file with ~250M records. I just used mongoimport < data.txt and it took 10 hours. Compared to your 10M in 3 hours, I think this is considerably faster.
Also, from my experience, writing your own multi-threaded parser speeds things up drastically. The procedure is simple (a rough sketch follows below):
Open the file as BINARY (not TEXT!)
Set markers (offsets) evenly across the file. The number of markers depends on the number of threads you want.
Search for '\n' near the markers and calibrate the markers so they are aligned to line boundaries.
Parse each chunk with a thread.
A reminder:
when you want performance, don't use a stream reader or any built-in line-based read method; they are slow. Just use a binary buffer, search for '\n' to identify a line, and (most preferably) do in-place parsing in the buffer without creating a String. Otherwise the garbage collector won't be so happy with this.
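A minimal sketch of steps 2 and 3, computing line-aligned chunk boundaries that each worker thread can then parse independently:

import java.io.IOException;
import java.io.RandomAccessFile;

public class ChunkBoundaries {
    // Compute thread-chunk boundaries aligned to '\n' (the markers described above).
    public static long[] lineAlignedOffsets(String path, int threads) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long length = file.length();
            long[] offsets = new long[threads + 1];
            offsets[threads] = length;                       // last chunk ends at EOF
            for (int i = 1; i < threads; i++) {
                file.seek(i * (length / threads));           // evenly spaced marker
                int b;
                while ((b = file.read()) != -1 && b != '\n') {
                    // slide forward to the next line break
                }
                offsets[i] = file.getFilePointer();          // chunk i starts right after the '\n'
            }
            return offsets;                                  // thread i parses [offsets[i], offsets[i + 1])
        }
    }
}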
You can parse the entire file at once and then insert the whole JSON into the Mongo collection. Avoid multiple loops; you need to separate the logic as follows:
1) Parse the file and retrieve the JSON objects.
2) Once the parsing is over, save the JSON objects in the Mongo collection.
I've got a slightly faster way (I'm also inserting millions at the moment): insert lists of documents instead of single documents with
insert(List<DBObject> list)
http://api.mongodb.org/java/current/com/mongodb/DBCollection.html#insert(java.util.List)
That said, it's not that much faster. I'm about to experiment with setting WriteConcerns other than ACKNOWLEDGED (mainly UNACKNOWLEDGED) to see if I can speed it up further. See http://docs.mongodb.org/manual/core/write-concern/ for info.
Another way to improve performance is to create the indexes after bulk inserting. However, this is rarely an option except for one-off jobs.
Apologies if this sounds slightly woolly, I'm still testing things myself. Good question.
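In code, the batching might look roughly like this (a sketch against the legacy DBCollection API; the batch size of 1000 is arbitrary):

import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.util.JSON;
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

public class BatchedImport {
    public static void importFile(String path, DBCollection coll) throws Exception {
        final int batchSize = 1000;                 // arbitrary; tune for your document size
        List<DBObject> batch = new ArrayList<>(batchSize);
        try (BufferedReader br = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = br.readLine()) != null) {
                batch.add((DBObject) JSON.parse(line));
                if (batch.size() == batchSize) {
                    coll.insert(batch);             // one round trip for the whole batch
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                coll.insert(batch);                 // flush the trailing partial batch
            }
        }
    }
}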
You can also remove all the indexes (except for the PK index, of course) and rebuild them after the import.
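A rough sketch of that, using the legacy DBCollection API and a hypothetical field name for the rebuilt index:

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;

public class ReindexAfterImport {
    // Drop secondary indexes before the bulk import and recreate them afterwards.
    public static void reindexAround(DBCollection coll, Runnable bulkImport) {
        coll.dropIndexes();                                   // drops everything except the _id index
        bulkImport.run();                                     // your import logic goes here
        coll.createIndex(new BasicDBObject("someField", 1));  // hypothetical field name
    }
}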
Use bulk operations for inserts/upserts. Since Mongo 2.6 you can do bulk updates/upserts. The example below does a bulk upsert using the C# driver.
MongoCollection<foo> collection = database.GetCollection<foo>(collectionName);
var bulk = collection.InitializeUnorderedBulkOperation();
foreach (FooDoc fooDoc in fooDocsList)
{
    var update = new UpdateDocument { { fooDoc.ToBsonDocument() } };
    bulk.Find(Query.EQ("_id", fooDoc.Id)).Upsert().UpdateOne(update);
}
BulkWriteResult bwr = bulk.Execute();
You can use a bulk insertion.
You can read the documentation on the MongoDB website, and you can also check this Java example on Stack Overflow.
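For a Java counterpart of the C# example above, a sketch of a bulk insert with the legacy driver's BulkWriteOperation (available since Java driver 2.12) might look like this:

import com.mongodb.BulkWriteOperation;
import com.mongodb.BulkWriteResult;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.util.JSON;
import java.util.List;

public class BulkInsertExample {
    public static BulkWriteResult bulkInsert(DBCollection coll, List<String> jsonLines) {
        BulkWriteOperation bulk = coll.initializeUnorderedBulkOperation();
        for (String line : jsonLines) {
            bulk.insert((DBObject) JSON.parse(line));   // queue the insert; nothing is sent yet
        }
        return bulk.execute();                          // one call executes the whole batch
    }
}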
I'm writing a program that modifies only the metadata (standard and custom) in DOC, XLS, PPT and VSD files. The program works correctly, but I wonder if there is a way to do this without loading the entire file into memory:
POIFSFileSystem poifs = new POIFSFileSystem(new FileInputStream("file.xls"));
The NPOIFSFileSystem approach is faster and consumes less memory, but it is read-only.
I'm using Apache POI 3.9.
You could map the desired part into memory and then work on it using java.nio.FileChannel.
In addition to the familiar read, write, and close operations of byte channels, this class defines the following file-specific operations:
Bytes may be read or written at an absolute position in a file in a way that does not affect the channel's current position.
A region of a file may be mapped directly into memory; for large files this is often much more efficient than invoking the usual read or write methods.
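A minimal sketch of those two operations with plain NIO; note that this doesn't understand the OLE2/POIFS container, it only illustrates the channel-level mapping the quote talks about:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MapRegion {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get("file.xls"),
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // map only the first 64 KB instead of reading the whole file into the heap
            MappedByteBuffer region = channel.map(FileChannel.MapMode.READ_WRITE, 0, 64 * 1024);
            byte first = region.get(0);        // absolute reads don't move the channel position
            region.put(0, first);              // in-place write through the mapping
            region.force();                    // flush changes back to disk
        }
    }
}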
At the time of your question there sadly wasn't a very low-memory way to do it. The good news is that as of 2014-04-28 it is possible! (This code should be in 3.11 when that's released, but for now it's too new.)
Now that NPOIFS supports writing, including in-place write, what you'll want to do is something like:
// Open the file, and grab the entries for the summary streams
NPOIFSFileSystem poifs = new NPOIFSFileSystem(file, false);
DirectoryNode root = poifs.getRoot();
DocumentNode sinfDoc =
     (DocumentNode)root.getEntry(SummaryInformation.DEFAULT_STREAM_NAME);
DocumentNode dinfDoc =
     (DocumentNode)root.getEntry(DocumentSummaryInformation.DEFAULT_STREAM_NAME);

// Open and parse the metadata
SummaryInformation sinf = (SummaryInformation)PropertySetFactory.create(
     new NDocumentInputStream(sinfDoc));
DocumentSummaryInformation dinf = (DocumentSummaryInformation)PropertySetFactory.create(
     new NDocumentInputStream(dinfDoc));

// Make some metadata changes
sinf.setAuthor("Changed Author");
sinf.setTitle("Le titre \u00e9tait chang\u00e9");
dinf.setManager("Changed Manager");

// Update the metadata streams in the file
sinf.write(new NDocumentOutputStream(sinfDoc));
dinf.write(new NDocumentOutputStream(dinfDoc));

// Write out our changes
poifs.writeFilesystem();
poifs.close();
You ought to be able to do all of that in under 20% of the memory of the size of your file, quite possibly less than that for larger files!
(If you want to see more on this, look at the ModifyDocumentSummaryInformation example and the HPSF TestWrite unit test)