DataFrames are slow to parse a small amount of data - Java

I have 2 classes doing a similar task in Apache Spark, but the one using DataFrames is many times (30x) slower than the "regular" one using RDDs.
I would like to use DataFrames since that would eliminate a lot of the code and classes we have, but obviously I can't have it be that much slower.
The data set is nothing big. We have some 30 files, each with JSON data about events triggered by activities in another piece of software. There are between 0 and 100 events in each file.
A data set with 82 events takes about 5 minutes to process with DataFrames.
Sample code:
public static void main(String[] args) throws ParseException, IOException {
    SparkConf sc = new SparkConf().setAppName("POC");
    JavaSparkContext jsc = new JavaSparkContext(sc);
    SQLContext sqlContext = new SQLContext(jsc);
    conf = new ConfImpl();
    HashSet<String> siteSet = new HashSet<>();

    // last month
    Date yesterday = monthDate(DateUtils.addDays(new Date(), -1)); // method that returns the date on the first of the month
    Date startTime = startofYear(new Date(yesterday.getTime()));   // method that returns the date on the first of the year

    // list all the sites with a metric file
    JavaPairRDD<String, String> allMetricFiles = jsc.wholeTextFiles("hdfs:///somePath/*/poc.json");
    for (Tuple2<String, String> each : allMetricFiles.toArray()) {
        logger.info("Reading from " + each._1);
        DataFrame metric = sqlContext.read().format("json").load(each._1).cache();
        metric.count();
        boolean siteNameDisplayed = false;
        boolean dateDisplayed = false;
        do {
            Date endTime = DateUtils.addMonths(startTime, 1);
            HashSet<Row> totalUsersForThisMonth = new HashSet<>();
            for (String dataPoint : Conf.DataPoints) { // a String[] with 4 elements for this specific case
                try {
                    if (siteNameDisplayed == false) {
                        String siteName = parseSiteFromPath(each._1); // method returning a parsed String
                        logger.info("Data for site: " + siteName);
                        siteSet.add(siteName);
                        siteNameDisplayed = true;
                    }
                    if (dateDisplayed == false) {
                        logger.info("Month: " + formatDate(startTime)); // SimpleFormatDate("yyyy-MM-dd")
                        dateDisplayed = true;
                    }
                    DataFrame lastMonth = metric.filter("event.eventId=\"" + dataPoint + "\"")
                            .filter("creationDate >= " + startTime.getTime())
                            .filter("creationDate < " + endTime.getTime())
                            .select("event.data.UserId")
                            .distinct();
                    logger.info("Distinct for last month for " + dataPoint + ": " + lastMonth.count());
                    totalUsersForThisMonth.addAll(lastMonth.collectAsList());
                } catch (Exception e) {
                    // data does not fit the expected model so there is nothing to print
                }
            }
            logger.info("Total Unique for the month: " + totalUsersForThisMonth.size());
            startTime = DateUtils.addMonths(startTime, 1);
            dateDisplayed = false;
        } while (startTime.getTime() < commonTmsMetric.monthDate(yesterday).getTime());
        // reset startTime for the next site
        startTime = commonTmsMetric.StartofYear(new Date(yesterday.getTime()));
    }
}
There are a few things in this code that are not efficient, but when I look at the logs they only add a few seconds to the whole processing.
I must be missing something big.
I have run this with 2 executors and with 1 executor, and the difference is 20 seconds out of 5 minutes.
This is running with Java 1.7 and Spark 1.4.1 on Hadoop 2.5.0.
Thank you!

So there are a few things, but it's hard to say without seeing the breakdown of the different tasks and their times. The short version is that you are doing way too much work in the driver and not taking advantage of Spark's distributed capabilities.
For example, you are collecting all of the data back to the driver program (toArray() and your for loop). Instead, you should just point Spark SQL at the files it needs to load.
As for the operators, it seems like you're doing many aggregations in the driver; instead, you could use the driver to generate the aggregations and have Spark SQL execute them.
Another big difference between your in-house code and the DataFrame code is going to be Schema inference. Since you've already created classes to represent your data, it seems likely that you know the schema of your JSON data. You can likely speed up your code by adding the schema information at read time so Spark SQL can skip inference.
I'd suggest re-visiting this approach and trying to build something using Spark SQL's distributed operators.
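For example, something along these lines. This is only a sketch, not a drop-in replacement: the field names (creationDate, event.eventId, event.data.UserId) and the path are taken from the question, and the per-site/per-month breakdown is omitted for brevity (it could be added as extra grouping columns).
// Explicit schema so Spark SQL can skip inference entirely.
StructType schema = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("creationDate", DataTypes.LongType, true),
        DataTypes.createStructField("event", DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("eventId", DataTypes.StringType, true),
                DataTypes.createStructField("data", DataTypes.createStructType(Arrays.asList(
                        DataTypes.createStructField("UserId", DataTypes.StringType, true))), true))), true)));

// One load over every site's file; the wildcard replaces the driver-side wholeTextFiles loop.
DataFrame allEvents = sqlContext.read().format("json").schema(schema)
        .load("hdfs:///somePath/*/poc.json");

// One distributed aggregation instead of a filter/count/collect per data point in the driver.
DataFrame distinctUsers = allEvents
        .filter("creationDate >= " + startTime.getTime() + " AND creationDate < " + endTime.getTime())
        .groupBy("event.eventId")
        .agg(org.apache.spark.sql.functions.countDistinct("event.data.UserId").alias("uniqueUsers"));
distinctUsers.show();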

Related

Last partition taking a very long time to save into S3 bucket (Spark RDD)

This is how we are currently persisting the RDD to S3:
private void saveResult(Config jobConfiguration, SparkContext sparkContext, JavaRDD<Row> rowJavaRDD) {
    final Persister persister = new PersisterBuilder()
            .withRdd(rowJavaRDD)
            .withSparkContext(sparkContext)
            .withAwsCredentialsFile(jobConfiguration.awsCredentialsFile)
            .withCacheEnabled(jobConfiguration.cacheDataBeforeSave)
            .withJsonOutput(jobConfiguration.saveAsJson)
            .withSaveMode(jobConfiguration.saveMode)
            .buildForOutputPath(jobConfiguration.outputPath);
    persister.save(jobConfiguration.outputPath);
    persister.clean();
}
I have 1000 tasks; 984 of them finish very quickly, within 20 minutes, but the last 16 take forever to complete or never complete.
To be very specific, this is how we write to S3:
dataFrameWriter.option("header", "true")
.partitionBy("year", "month","submitDate");
So it creates a folder for the year, then for the month, and then one for every date.
I am running the job between 1st Sept and 18th Sept, and that is why the last 16 tasks take such a long time.
Inside the 1st Sept folder I can see many files created, which is good, but I cannot see folders created for 2nd Sept or any of the other dates.
Is there any way I can improve this?
Shuffle Read: 1258.5 GB / 61804285
How can we improve the task so that writing into S3 will be faster?
Update:
This is how we save the partitioned output (see the code below).
Do you suggest increasing the partitioning by adding the date field to it?
Will that help distribute the load?
final StructType schema = DataTypes.createStructType(fields);
DataFrame dataFrame = SQLContext.getOrCreate(sparkContext)
        .createDataFrame(rdd, schema);
if (cacheDataBeforeSave)
    dataFrame = dataFrame.persist();
DataFrameWriter dataFrameWriter = dataFrame.write();
if (saveMode == Config.SAVEMODE.OVERWRITE) {
    System.out.printf("Overwriting data in '%s'%n", path);
    dataFrameWriter = dataFrameWriter.mode(SaveMode.Overwrite);
} else {
    System.out.printf("Appending data to '%s'%n", path);
    dataFrameWriter = dataFrameWriter.mode(SaveMode.Append);
}
dataFrameWriter.option("header", "true")
        .partitionBy("year", "month", "submitDate");
if (saveAsJson)
    dataFrameWriter.json(path);
else
    dataFrameWriter.parquet(path);
}

@Override
public void clean() {
    persisterHelper.clean();
}
}
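One pattern sometimes used for this kind of skewed partitioned write is to repartition by the partitionBy columns plus a random salt before writing, so that no single task ends up holding one date's entire data. This is only a sketch, not a confirmed fix for the job above: it assumes a Spark version with repartition(numPartitions, Column...) (1.6+), and the salt range of 16 and partition count of 200 are arbitrary.
// Adding a random salt to the repartition keys spreads each date's rows over several tasks,
// so no single task has to write one huge submitDate directory on its own.
DataFrame salted = dataFrame.withColumn("salt",
        org.apache.spark.sql.functions.floor(org.apache.spark.sql.functions.rand().multiply(16)));
DataFrame distributed = salted.repartition(200,
        salted.col("year"), salted.col("month"), salted.col("submitDate"), salted.col("salt"));
distributed.drop("salt")
        .write()
        .mode(SaveMode.Append)
        .option("header", "true")
        .partitionBy("year", "month", "submitDate")
        .parquet(path);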

How can I run a function multiple times with multiple data sets or values?

I'm kinda new to programming and got this as an assignment at work. I need to run a method that sends a message (FIX format) multiple times with multiple data sets. Here's how I build the message with its respective data and send it:
private void testCaseAttempt(String testCaseName) throws Exception {
StringBuilder errorBuilder = new StringBuilder();
// Read test case arguments
new Arguments(testCaseName);
QuoteRequestBuilder builder = app.builders().quoteRequest();
BigDecimal b1;
b1 = new BigDecimal(10000);
//Date transactTime;
//transactTime = new Date(0);
//expireTime 10 minutes from now
Calendar now = Calendar.getInstance();
now.add(Calendar.MINUTE, 10);
Date expireTime = now.getTime();
//BUILD THE MESSAGE
builder
.setField(131, "5EB26EAAC074000D0000")
.symbol("DANBNK")
.securityID("SE0011116474")
.currency("SEK")
.securityIDSource("4")
.setField(54, "2")
.expireTime(expireTime)
.orderQty(b1)
.setField(64, "20200508")
.setField(1629, "10")
.setField(1916, "0")
.setField(60, "20200526-15:48:53.006")
.setField(761, "1")
.partyID("13585922", PartyIDSource.PROPRIETARY_CUSTOM_CODE, 11, null)
.partyID("1270", PartyIDSource.PROPRIETARY_CUSTOM_CODE, 13, null)
.partyID("SEB", PartyIDSource.PROPRIETARY_CUSTOM_CODE, 1, null)
.partyID("1786343", PartyIDSource.PROPRIETARY_CUSTOM_CODE, 117, null);
Message quoteRequestMessage = builder.getMessage();
//SEND THE MESSAGE
app.sendMessage(quoteRequestMessage, app.getSession(session));
long timeout = Properties.getLong(0L, "waitForMessage", "FIX");
Message responseMessage;
}
I build the FIX message with the "setField" instructions and then I just send it. This works just fine, except I need to do it 20-30 times (so 20-30 messages) and I need to slightly change the values or parameters each time.
I have an idea of how to do this with Cucumber, using a feature file with an "Examples" table containing my desired data so it calls this method, but that feels like overkill at the moment. I was thinking of using an Excel file with a table so I can comfortably change the values in each row and just feed it to this function somehow.
By the way, I didn't copy all the code in the function, I just copied the lines in which the message is built and sent.
Any idea how I can do this? Your replies are much appreciated!
Thanks in advance.
Create your field map and just iterate over it as below:
Map<Integer, String> fieldMap = new HashMap<>();
fieldMap.put(131, "5EB26EAAC074000D0000");
fieldMap.put(54, "2");
fieldMap.put(64, "20200508");
fieldMap.put(1629, "10");

fieldMap.forEach((k, v) -> builder.setField(k, v));
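To run this for 20-30 data sets, you could keep one map per test case and loop over them. Here is a sketch building on the idea above; it reuses the builder and send calls from the question and assumes a fresh builder can be created per message.
// One Map per test case; copy a base case and change only the fields that differ.
List<Map<Integer, String>> testCases = new ArrayList<>();

Map<Integer, String> case1 = new HashMap<>();
case1.put(131, "5EB26EAAC074000D0000");
case1.put(54, "2");
case1.put(64, "20200508");
case1.put(1629, "10");
testCases.add(case1);

Map<Integer, String> case2 = new HashMap<>(case1);
case2.put(54, "1"); // change only the values that differ between cases
testCases.add(case2);

for (Map<Integer, String> fields : testCases) {
    QuoteRequestBuilder builder = app.builders().quoteRequest();
    fields.forEach(builder::setField); // assumes setField(int, String), as used in the question
    app.sendMessage(builder.getMessage(), app.getSession(session));
}
The same loop works if the per-case maps are read from an Excel or CSV file instead of being hard-coded.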

How do I save and retrieve specific data from a CSV file without headers in Java?

I am writing an application which needs to load a large CSV file that is pure data and doesn't contain any headers.
I am using the FastCSV library to parse the file; however, the data needs to be stored and specific fields need to be retrieved. Since not all of the data is necessary, I am skipping every third line.
Is there a way to set the headers after the file has been parsed and save the result in a data structure such as an ArrayList?
Here is the function which loads the file:
public void fastCsv(String filePath) {
    File file = new File(filePath);
    CsvReader csvReader = new CsvReader();
    int linecounter = 1;
    long startTime = System.currentTimeMillis(); // for the execution-time printout below
    try (CsvParser csvParser = csvReader.parse(file, StandardCharsets.UTF_8)) {
        CsvRow row;
        while ((row = csvParser.nextRow()) != null) {
            if ((linecounter % 3) > 0) {
                // System.out.println("Read line: " + row);
                // System.out.println("First column of line: " + row.getField(0));
                System.out.println(row);
            }
            linecounter++;
        }
        long elapsedTime = System.currentTimeMillis() - startTime;
        System.out.println("Execution Time in ms: " + elapsedTime);
        // csvParser is closed automatically by try-with-resources
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Any insight would be greatly appreciated.
univocity-parsers supports field selection and can do this very easily. It's also faster than the library you are using.
Here's how you can use it to select columns of interest:
Input
String input = "X, X2, Symbol, Date, Open, High, Low, Close, Volume\n" +
" 5, 9, AAPL, 01-Jan-2015, 110.38, 110.38, 110.38, 110.38, 0\n" +
" 2710, 289, AAPL, 01-Jan-2015, 110.38, 110.38, 110.38, 110.38, 0\n" +
" 5415, 6500, AAPL, 02-Jan-2015, 111.39, 111.44, 107.35, 109.33, 53204600";
Configure
CsvParserSettings settings = new CsvParserSettings(); //many options here, check the tutorial
settings.setHeaderExtractionEnabled(true); //tells the parser to use the first row as the header row
settings.selectFields("X", "X2"); //selects the fields
Parse and print results
CsvParser parser = new CsvParser(settings);
for (String[] row : parser.iterate(new StringReader(input))) {
    System.out.println(Arrays.toString(row));
}
Output
[5, 9]
[2710, 289]
[5415, 6500]
With field selection you can use any sequence of fields and have rows with different column counts; the parser will handle this just fine, so there is no need to write complex logic for that.
To process the File in your code, change the example above to do this:
for(String[] row : parser.iterate(new File(filePath))){
... //your logic goes here.
}
If you want a more usable record (with typed values), use this instead:
for(Record record : parser.iterateRecords(new File(filePath))){
... //your logic goes here.
}
Speeding up
The fastest way of processing the file is through a RowProcessor. That's a callback that receives the rows parsed from the input:
settings.setProcessor(new AbstractRowProcessor() {
    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        System.out.println(Arrays.toString(row));
        context.skipLines(3); // use the context object to control the parser
    }
});
CsvParser parser = new CsvParser(settings);
//`parse` doesn't return anything. Rows go to the `rowProcessed` method.
parser.parse(new StringReader(input));
You should be able to parse very large files pretty quickly. If things are slowing down, look at your own code first (avoid adding values to lists or collections in memory, or at least pre-allocate the collections to a good size, and give the JVM a large amount of memory to work with using the -Xms and -Xmx flags).
Right now this parser is the fastest you can find. I made this performance comparison a while ago, which you can use for reference.
Hope this helps
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license)
Do you know which fields/columns you want to keep, and what you'd like the "header" value to be? I.e., do you want the first and third columns, and do you want them called "first" and "third"? If so, you could build a HashMap of String/Object pairs (or another appropriate type, depending on your actual data and needs) and add each HashMap to an ArrayList. This should get you going; just be sure to change the HashMap types as needed.
ArrayList<HashMap<String, String>> arr = new ArrayList<>();
while ((row = csvParser.nextRow()) != null) {
    if ((linecounter % 3) > 0) {
        // keep col1 and col3; use a fresh map per row, otherwise every list entry
        // would end up pointing at the same (last) row
        HashMap<String, String> hm = new HashMap<>();
        hm.put("first", row.getField(0));
        hm.put("third", row.getField(2));
        arr.add(hm);
    }
    linecounter++;
}
If you want to capture all columns, you can use a similar technique, but I'd build a mapping data structure so that you can match field indexes to column header names in a loop, adding each column to the HashMap that is then stored in the ArrayList. For example:
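A sketch of that technique follows; the header names are placeholders you define yourself (the file has no header row), and it assumes each row has at least as many fields as there are headers.
// Map each field index to a header name defined up front.
String[] headers = {"first", "second", "third"};

List<Map<String, String>> records = new ArrayList<>();
CsvRow row;
int linecounter = 1;
while ((row = csvParser.nextRow()) != null) {
    if ((linecounter % 3) > 0) {
        Map<String, String> record = new HashMap<>();
        for (int i = 0; i < headers.length; i++) { // assumes the row has at least headers.length fields
            record.put(headers[i], row.getField(i));
        }
        records.add(record);
    }
    linecounter++;
}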

Filtering fields in RFC_READ_TABLE with SAP JCo

I am attempting to write a simple Java utility that extracts data from SAP into a MySQL database, using JCo. I have gone through the JCo documentation and tried out the relevant examples mentioned in the SAP help portal, and I am able to retrieve data from a table and insert it into the MySQL DB.
What I would like to have is a facility to filter data in the following two ways:
I would like to fetch only the required fields.
I would like to fetch rows only if the value of a particular field matches a certain pattern.
After doing some research I didn't find any way to specify query parameters so that only the filtered data is retrieved; it basically queries all the fields of a table, so I think I will have to filter out the data I don't want in my Java client layer. Please let me know if I am missing something here.
Here is a code example :
public static void readTables() throws JCoException, IOException {
    final JCoDestination destination = JCoDestinationManager.getDestination(DESTINATION_NAME2);
    final JCoFunction function = destination.getRepository().getFunction("RFC_READ_TABLE");
    if (function == null) {
        throw new RuntimeException("BAPI RFC_READ_TABLE not found in SAP.");
    }
    function.getImportParameterList().setValue("QUERY_TABLE", "DD02L");
    function.getImportParameterList().setValue("DELIMITER", ",");
    try {
        function.execute(destination);
    } catch (final AbapException e) {
        System.out.println(e.toString());
        return;
    }
    final JCoTable codes = function.getTableParameterList().getTable("FIELDS");
    String header = "SN";
    for (int i = 0; i < codes.getNumRows(); i++) {
        codes.setRow(i);
        header += "," + codes.getString("FIELDNAME");
    }
    final FileWriter outFile = new FileWriter("out.csv");
    outFile.write(header + "\n");
    final JCoTable rows = function.getTableParameterList().getTable("DATA");
    for (int i = 0; i < rows.getNumRows(); i++) {
        rows.setRow(i);
        outFile.write(i + "," + rows.getString("WA") + "\n");
        outFile.flush();
    }
    outFile.close();
}
This method reads the table where SAP stores metadata (the data dictionary) and writes the output to a CSV file. It works fine but takes 30-40 seconds and returns around 400,000 records with 32 columns. My intention was to ask whether there is a way to restrict the query to return only particular fields, instead of reading all the fields and discarding them in the client layer.
Thanks.
This works fine: appending rows to the FIELDS table parameter restricts the columns that RFC_READ_TABLE returns in DATA:
JCoTable table = function.getTableParameterList().getTable("FIELDS");
table.appendRow();
table.setValue("FIELDNAME", "TABNAME");
table.appendRow();
table.setValue("FIELDNAME", "TABCLASS");
Please check this Thread
Thanks.
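For the second requirement (fetching rows only when a field matches a condition), RFC_READ_TABLE also accepts an OPTIONS table parameter whose TEXT lines are combined into a WHERE clause that is evaluated on the server. A short sketch; the condition itself is just an example:
// Each TEXT line (max 72 characters) becomes part of the WHERE clause applied server-side.
JCoTable options = function.getTableParameterList().getTable("OPTIONS");
options.appendRow();
options.setValue("TEXT", "TABCLASS = 'TRANSP'");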

Calculate client-server time difference in Borland Starteam server 8

Problem. I need a way to find the StarTeam server time through the StarTeam Java SDK 8.0. The server version is 8.0.172, so the method Server.getCurrentTime() is not available, since it was added only in server version 9.0.
Motivation. My application needs to use views at specific dates. So if there is some difference in system time between the client (where the app is running) and the server, the obtained views are not accurate. In the worst case, the client's requested date is in the future for the server, so the operation results in an exception.
After some investigation I haven't found any cleaner solution than using a temporary item. My app requests the item's creation time and compares it with the local time. Here's the method I use to get the server time:
public Date getCurrentServerTime() {
    Folder rootFolder = project.getDefaultView().getRootFolder();
    Topic newItem = (Topic) Item.createItem(project.getTypeNames().TOPIC, rootFolder);
    newItem.update(); // persist the item so the server stamps a creation time
    newItem.remove();
    newItem.update(); // persist the removal so no temporary item is left behind
    return newItem.getCreatedTime().createDate();
}
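A short usage sketch building on the method above: the difference between the returned creation time and the local clock gives the client-to-server offset to apply when requesting views at specific dates.
// Estimate the client-to-server clock offset once, then shift requested view dates by it.
long offsetMillis = getCurrentServerTime().getTime() - System.currentTimeMillis();
Date localDate = new Date();                                        // the date you would have requested
Date serverSafeDate = new Date(localDate.getTime() + offsetMillis); // the same moment on the server's clock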
If your StarTeam server is on a Windows box and your code will be executing on a Windows box, you could shell out and execute the NET time command to fetch the time on that machine and then compare it to the local time.
net time \\my_starteam_server_machine_name
which should return:
"Current time at \\my_starteam_server_machine_name is 10/28/2008 2:19 PM"
"The command completed successfully."
We needed to come up with a way of finding the server time for use with CodeCollab. Here is a (longish) C# code sample of how to do it without creating a temporary file. Resolution is 1 second.
static void Main(string[] args)
{
// ServerTime replacement for pre-2006 StarTeam servers.
// Picks a date in the future.
// Gets a view, sets the configuration to the date, and tries to get a property from the root folder.
// If it cannot retrieve the property, the date is too far in the future. Roll back the date to an earlier time.
DateTime StartTime = DateTime.Now;
Server s = new Server("serverAddress", 49201);
s.LogOn("User", "Password");
// Getting a view - doesn't matter which, as long as it is not deleted.
Project p = s.Projects[0];
View v = p.AccessibleViews[0]; // AccessibleViews saves checking permissions.
// Timestep to use when searching. One hour is fairly quick for resolution.
TimeSpan deltaTime = new TimeSpan(1, 0, 0);
deltaTime = new TimeSpan(24 * 365, 0, 0);
// Invalid calls return faster - start a ways in the future.
TimeSpan offset = new TimeSpan(24, 0, 0);
// Times before the view was created are invalid.
DateTime minTime = v.CreatedTime;
DateTime localTime = DateTime.Now;
if (localTime < minTime)
{
System.Console.WriteLine("Current time is older than view creation time: " + minTime);
// If the dates are so dissimilar that the current date is before the creation date,
// it is probably a good idea to use a bigger delta.
deltaTime = new TimeSpan(24 * 365, 0, 0);
// Set the offset to the minimum time and work up from there.
offset = minTime - localTime;
}
// Storage for calculated date.
DateTime testTime;
// Larger divisors converge quicker, but might take longer depending on offset.
const float stepDivisor = 10.0f;
bool foundValid = false;
while (true)
{
localTime = DateTime.Now;
testTime = localTime.Add(offset);
ViewConfiguration vc = ViewConfiguration.CreateFromTime(testTime);
View tempView = new View(v, vc);
System.Console.Write("Testing " + testTime + " (Offset " + (int)offset.TotalSeconds + ") (Delta " + deltaTime.TotalSeconds + "): ");
// Unfortunately, there is no isValid operation. Attempting to
// read a property from an invalid date configuration will
// throw an exception.
// An alternate to this would be proferred.
bool valid = true;
try
{
string testname = tempView.RootFolder.Name;
}
catch (ServerException)
{
System.Console.WriteLine(" InValid");
valid = false;
}
if (valid)
{
System.Console.WriteLine(" Valid");
// If the last check was invalid, the current check is valid, and
// If the change is this small, the time is very close to the server time.
if (foundValid == false && deltaTime.TotalSeconds <= 1)
{
break;
}
foundValid = true;
offset = offset.Add(deltaTime);
}
else
{
offset = offset.Subtract(deltaTime);
// Once a valid time is found, start reducing the timestep.
if (foundValid)
{
foundValid = false;
deltaTime = new TimeSpan(0,0,Math.Max((int)(deltaTime.TotalSeconds / stepDivisor), 1));
}
}
}
System.Console.WriteLine("Run time: " + (DateTime.Now - StartTime).TotalSeconds + " seconds.");
System.Console.WriteLine("The local time is " + localTime);
System.Console.WriteLine("The server time is " + testTime);
System.Console.WriteLine("The server time is offset from the local time by " + offset.TotalSeconds + " seconds.");
}
Output:
Testing 4/9/2009 3:05:40 PM (Offset 86400) (Delta 31536000): InValid
Testing 4/9/2008 3:05:40 PM (Offset -31449600) (Delta 31536000): Valid
...
Testing 4/8/2009 10:05:41 PM (Offset 25200) (Delta 3): InValid
Testing 4/8/2009 10:05:38 PM (Offset 25197) (Delta 1): Valid
Run time: 9.0933426 seconds.
The local time is 4/8/2009 3:05:41 PM
The server time is 4/8/2009 10:05:38 PM
The server time is offset from the local time by 25197 seconds.
<stab_in_the_dark>
I'm not familiar with that SDK, but from looking at the API, if the server is in a known timezone, why not create an OLEDate object whose date is the client's time rolled appropriately according to the server's timezone?
</stab_in_the_dark>
