I'm testing a prototype application. We have JSON data with nested fields. I'm trying to pull a field using the following JSON and code:
Feed: {name: "test", Record: [{id: 1, AllColumns: {ColA: "1", ColB: "2"}}, ...]}
Dataset<Row> completeRecord = sparkSession.read().json(inputPath);
final Dataset<Row> feed = completeRecord.select(completeRecord.col("Feed.Record.AllColumns"));
I have around 2000 files with such records. I have tested some files individually and they work fine. But for some files I am getting the error below on the second line:
org.apache.spark.sql.AnalysisException: Can't extract value from
Feed#8.Record: need struct type but got string;
I'm not sure what is going on here, but I would like to handle this error gracefully and log which file has that record. Also, is there any way to ignore this and continue with the rest of the files?
Answering my own question based on what I have learned. There are a couple of ways to solve it: Spark provides options to ignore corrupt files and corrupt records.
To ignore corrupt files, one can set the following flag to true:
spark.sql.files.ignoreCorruptFiles=true
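For example, a minimal sketch of setting it on the session before triggering the read (it can equally be passed via spark-defaults.conf or spark-submit):
// Skip files that cannot be read at all
sparkSession.conf().set("spark.sql.files.ignoreCorruptFiles", "true");
Dataset<Row> completeRecord = sparkSession.read().json(inputPath);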
For more fine-grained control, and to ignore bad records instead of the complete file, you can use one of the three modes that the Spark API provides.
According to the DataFrameReader API:
mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing.
PERMISSIVE: sets other fields to null when it meets a corrupted record, and puts the malformed string into a new field configured by columnNameOfCorruptRecord. When a schema is set by user, it sets null for extra fields.
DROPMALFORMED: ignores the whole corrupted records.
FAILFAST: throws an exception when it meets corrupted records.
PERMISSIVE mode worked really well for me, but when I provided my own schema, Spark filled missing attributes with null instead of marking them as corrupt records.
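For illustration, a minimal sketch of passing the mode and the corrupt-record column to the JSON reader (_corrupt_record is just the common default column name, not something from the original post):
Dataset<Row> completeRecord = sparkSession.read()
        .option("mode", "PERMISSIVE")
        .option("columnNameOfCorruptRecord", "_corrupt_record")
        .json(inputPath);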
The exception says that one of the JSON files differs in its structure and that the path Feed.Record.AllColumns does not exist in this specific file.
Based on this method
private boolean pathExists(Dataset<Row> df, String path) {
    try {
        df.apply(path);
        return true;
    }
    catch (Exception ex) {
        return false;
    }
}
you can decide if you execute the select or log an error message:
if (pathExists(completeRecord, "Feed.Record.AllColumns")) {
    final Dataset<Row> feed = completeRecord.select(completeRecord.col("Feed.Record.AllColumns"));
    //continue with processing
}
else {
    //log error message
}
Related
I am processing an Avro file with a list of records and doing a client.put for each record to my local Aerospike store.
For some reason, the put succeeds for a certain number of records and fails for the rest. I am doing this -
client.put(writePolicy, recordKey, bins);
The relevant values for the failed call are -
namespace = test
setname = test_set
userkey = some_string
write policy = null
Bins -
is_user:1
prof_loc:530049,530046,530032,530031,530017,530016,500046
rfm:Platinum
store_browsed:some_string
store_purch:some_string
city_id:null
Log Snippet -
com.aerospike.client.AerospikeException: Error Code 4: Parameter error
at com.aerospike.client.command.WriteCommand.parseResult(WriteCommand.java:72)
at com.aerospike.client.command.SyncCommand.execute(SyncCommand.java:56)
at com.aerospike.client.AerospikeClient.put(AerospikeClient.java:338)
What could possibly be the issue?
Finally. Resolved!
I was using the REPLACE RecordExistsAction in this case. Any bin with a null value will fail in this configuration. Aerospike treats a null value in a bin as equivalent to removing that bin from the record for a key. Thus the REPLACE configuration doesn't make sense for such an operation, hence the parameter error (invalid DB operation).
The UPDATE configuration, on the other hand, works perfectly fine.
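For illustration, a minimal sketch of switching the write policy to UPDATE with the Aerospike Java client (WritePolicy, RecordExistsAction, Key and Bin come from com.aerospike.client / com.aerospike.client.policy; the namespace, set, key and bin values are taken from the question above):
WritePolicy writePolicy = new WritePolicy();
writePolicy.recordExistsAction = RecordExistsAction.UPDATE; // REPLACE rejects records that contain null bins

Key recordKey = new Key("test", "test_set", "some_string");
client.put(writePolicy, recordKey, new Bin("is_user", 1), new Bin("rfm", "Platinum"));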
Aerospike allows reading and writing with great flexibility. For developers to harness this functionality, Aerospike exposes a great number of variables on both Policy and WritePolicy, which at times can be intimidating and error-prone for beginners. A parameter error simply means that some of these configuration values are not coherent with each other. An easy start is to use the default read or write policy, which can be obtained by passing null as the policy.
Eg:
aeroClient.put(null, key, new Bin("binName", object));
Below is the Aerospike put method code snippet:
public final void put(WritePolicy policy, Key key, Bin... bins) throws AerospikeException {
    if (policy == null) {
        policy = writePolicyDefault;
    }
    WriteCommand command = new WriteCommand(cluster, policy, key, bins, Operation.Type.WRITE);
    command.execute();
}
I recently got this error because the expiration value that I was using in writePolicy was greater than the default expiration time for the cache.
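As a sketch, the expiration is just a public field on WritePolicy (3600 seconds is only an example value; recordKey and bins are assumed to be the same as in the question):
WritePolicy writePolicy = new WritePolicy();
writePolicy.expiration = 3600; // TTL in seconds; must be compatible with the server-side configuration
client.put(writePolicy, recordKey, bins);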
I've written a program to aid the user in configuring 'mechs for a game. I'm dealing with loading the user's saved data. This data can (and sometimes does) become partially corrupt (either due to bugs on my side or due to changes in the game data/rules from upstream).
I need to be able to handle this corruption and load as much as possible. To be more specific, the contents of the save file are syntactically correct but semantically corrupt. I can safely parse the file and drop whatever entries are not semantically OK.
Currently my data parser just shows a modal dialog with an appropriate warning message. However, displaying the warning is not the job of the parser, and I'm looking for a way of passing this information to the caller.
Some code to show approximately what is going on (in reality there is a bit more going on than this, but this highlights the problem):
class Parser {
    public void parse(XMLNode aNode) {
        ...
        if (corrupted) {
            JOptionPane.showMessageDialog(null, "Corrupted data found",
                    "error!", JOptionPane.WARNING_MESSAGE);
            // Keep calm and carry on
        }
    }
}
class UserData {
    static UserData loadFromFile(File aFile) {
        UserData data = new UserData();
        Parser parser = new Parser();
        XMLDoc doc = fromXml(aFile);
        for (XMLNode entry : doc.allEntries()) {
            data.append(parser.parse(entry));
        }
        return data;
    }
}
The thing here is that, barring an IOException or a syntax error in the XML, loadFromFile will always succeed in loading something, and this is the desired behavior. Somehow I just need to pass the information about what (if anything) went wrong to the caller. I could return a Pair<UserData, String>, but that doesn't look very pretty, and throwing an exception obviously will not work in this case.
Does anyone have any ideas on how to solve this?
Depending on what you are trying to represent, you can use a class, like SQLWarning from the java.sql package. When you have a java.sql.Statement and call executeQuery, you get a java.sql.ResultSet, and you can then call getWarnings on the result set directly, or even on the statement itself.
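For illustration, a minimal sketch of walking the warning chain on a result set (the connection variable and the query are made up for the example):
Statement stmt = connection.createStatement();
ResultSet rs = stmt.executeQuery("SELECT * FROM some_table");
SQLWarning warning = rs.getWarnings();          // also available via stmt.getWarnings()
while (warning != null) {
    System.out.println(warning.getMessage());   // log or collect instead of aborting
    warning = warning.getNextWarning();
}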
You can use an enum, like RefUpdate.Result from the JGit project. When you have an org.eclipse.jgit.api.Git, you can create a FetchCommand, which will provide you with a FetchResult, which will provide you with a collection of TrackingRefUpdates, each of which contains a RefUpdate.Result enum that can be one of the following (a small sketch of this call chain follows the list):
FAST_FORWARD
FORCED
IO_FAILURE
LOCK_FAILURE
NEW
NO_CHANGE
NOT_ATTEMPTED
REJECTED
REJECTED_CURRENT_BRANCH
RENAMED
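Roughly, the chain looks like this (a sketch from memory of the JGit API, with a made-up repository path; not code from the original answer):
Git git = Git.open(new File("/path/to/repo"));
FetchResult fetchResult = git.fetch().call();
for (TrackingRefUpdate update : fetchResult.getTrackingRefUpdates()) {
    RefUpdate.Result result = update.getResult(); // e.g. FAST_FORWARD, REJECTED, ...
}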
In your case, you could even use a boolean flag:
class UserData {
public boolean isCorrupt();
}
But since you mentioned there is a bit more than that going on in reality, it really depends on your model of "corrupt". However, you will probably have more options if you have a UserDataReader that you can instantiate, instead of a static utility method.
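For example, a hypothetical UserDataReader could collect warnings while loading and expose them afterwards (the Entry type, the null-on-corruption convention, and getWarnings are invented here to illustrate the idea, assuming parse returns the parsed entry or null when it had to drop one):
class UserDataReader {
    private final List<String> warnings = new ArrayList<>();

    UserData loadFromFile(File aFile) {
        UserData data = new UserData();
        Parser parser = new Parser();
        XMLDoc doc = fromXml(aFile);
        for (XMLNode entry : doc.allEntries()) {
            Entry parsed = parser.parse(entry);
            if (parsed != null) {
                data.append(parsed);
            } else {
                warnings.add("Dropped corrupt entry in " + aFile.getName());
            }
        }
        return data;
    }

    List<String> getWarnings() {
        return warnings; // the caller decides whether to show a dialog, log, or ignore
    }
}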
Performing a test with BIRT, I was able to create a report and render it in PDF, but unfortunately I'm not getting the expected result.
For my DataSource I created a Scripted DataSource, and no code was needed in there (as far as I could tell from the documentation for what I'm trying to achieve).
For my DataSet I created a Scripted DataSet using my Scripted DataSource as the source. In there I defined the script for open like:
importPackage(Packages.org.springframework.context);
importPackage(Packages.org.springframework.web.context.support);
var sc = reportContext.getHttpServletRequest().getSession().getServletContext();
var spring = WebApplicationContextUtils.getWebApplicationContext(sc);
myPojo = spring.getBean("myDao").findById(params["pojoId"]);
And script for fetch like:
if (myPojo != null) {
    row["title"] = myPojo.getTitle();
    myPojo = null;
    return true;
}
return false;
As the population of row is done at runtime, I wasn't able to automatically get the DataSet columns, so I created one with the following configuration: name: columnTitle (as this is the name used to populate the row object in the fetch code).
Afterwards I edited the layout of my report and added the column to my layout.
I was able to confirm that spring.getBean("myDao").findById(params["pojoId"]); is executed. But my rendered report is not showing the title. If I double-click on my column label in the report layout, I can see that the expression there is dataSetRow["columnTitle"]. Is that right, even though I'm using row in my fetch script? What am I missing here?
Well, what is contractVersion?
It is obviously not initialized in the open event.
Should this read myPojo.contractVersion or perhaps myPojo.getContractVersion()?
Another point: Is the DS with the column "columnTitle" bound to the layout?
You should also run your report as HTML or in the previewer to check for script errors.
Unfortunately, these are silently ignored when generating the report as PDF...
The problem was the use of Batik twice (two different versions), one dependency coming from BIRT and the other from docx4j.
The issue is quite difficult to identify because there is no log output when rendering PDF files.
Rendering HTML, I could see an error message which I could investigate to find the problem.
In my case I could remove docx4j from the Maven POM.
I have a problem with Apache Jena TDB. Basically I create a new Dataset and load data from an RDF/XML file into a named model with the name "http://example.com/model/filename", where filename is the name of the RDF/XML file. After loading the data, all statements from the named model are inserted into the default model. The named model is kept in the dataset for backup reasons.
When I now try to query the named models in the Dataset, TDB freezes and the application seems to run in an infinite loop: it neither terminates nor throws an exception.
What is causing that freeze and how can I prevent it?
Example code:
Dataset ds = TDBFactory.createDataset("tdb");
Model mod = ds.getDefaultModel();
File f = new File("example.rdf");
FileInputStream fis = new FileInputStream(f);
ds.begin(ReadWrite.WRITE);
// Get a new named model to load the data into
Model nm = ds.getNamedModel("http://example.com/model/example.rdf");
nm.read(fis, null);
// Do some queries on the Model using the utility methods of Model, no SPARQL used
// Add all statements from the named model to the default model
mod.add(nm);
ds.commit();
ds.end();
// So far everything works as expected, but the following line causes the freeze
Iterator<String> it = ds.listNames();
Any method call that accesses the existing named models causes the same freeze reaction, so this is the same for getNamedModel("http://example.com/model/example.rdf"); for example. Adding new named models by calling getNamedModel("http://example.com/model/example123.rdf"); works fine, so only access to existing models is broken.
Used environment: Linux 64bit, Oracle Java 1.7.0_09, Jena 2.7.4 (incl. TDB 0.9.4)
Thanks in advance for any help!
Edit: Fixed mistake in code fragment
Edit2: Solution (my last comment to AndyS answer)
Ok, I went through the whole program and added all missing transactions. Now it is working as expected. I suspect Jena was throwing an exception during the shutdown sequence of my program, but that exception was not reported properly and the "freeze" was caused by other threads not terminating correctly. Thanks for pointing out the faulty transaction usage.
Could you turn this into a test case and send it to the jena users mailing list please?
You should get the default model inside the transaction - you got it outside.
Also, if you have used a dataset transactionally, you can't use it non-transactionally as you do at ds.listNames(). It shouldn't freeze; you should get some kind of warning.
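For illustration, a sketch of the corrected pattern following those two points (the default model obtained inside the write transaction, and listNames wrapped in a read transaction), based on the code from the question:
Dataset ds = TDBFactory.createDataset("tdb");

ds.begin(ReadWrite.WRITE);
try {
    Model mod = ds.getDefaultModel();   // obtained inside the transaction
    Model nm = ds.getNamedModel("http://example.com/model/example.rdf");
    nm.read(new FileInputStream(new File("example.rdf")), null);
    mod.add(nm);
    ds.commit();
} finally {
    ds.end();
}

ds.begin(ReadWrite.READ);
try {
    Iterator<String> it = ds.listNames(); // now accessed transactionally
    while (it.hasNext()) {
        System.out.println(it.next());
    }
} finally {
    ds.end();
}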
I am getting this weird case when querying Postgres 8.4 for some records with Blobs (of type OID) with Hibernate. The query does return all right, but when my code wants to read the content of the Blob with the simple code below, it gets 0 bytes back.
public static byte[] readBlob(Blob blob) throws Exception {
    InputStream is = null;
    try {
        is = blob.getBinaryStream();
        return org.apache.commons.io.IOUtils.toByteArray(is);
    }
    finally {
        if (is != null) {
            try {
                is.close();
            }
            catch (Exception e) {}
        }
    }
}
The funny thing is that I am getting this behavior only since I started adding more than one such record to the table.
The underlying JDBC library is type 3 (postgresql 8.4-701).
Can someone give me a hint as to how to solve this issue?
Thanks
Peter
Looks like you may have found this bug:
http://opensource.atlassian.com/projects/hibernate/browse/HHH-4876
It has been a while since I ran into a similar issue, and since I've refreshed my memory on this topic I am sharing the results. The problem is that Postgres (and, a few versions back, Oracle too) will not handle the Blob content at record creation time in the same transaction. The funny thing is that one needs to pass the content after the external file (where the content eventually gets stored) has been created and reserved for the operation. Yes, the record gets created, but the Blob is blank. To have the Blob filled with whatever you need to put in, you need to do that operation in a second transaction (a sort of record update). That's a funny business (maybe a major bug), ehe.
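A rough, hedged sketch of the two-transaction pattern described above, using plain Hibernate (the Document entity, its fields, and the sessionFactory/bytes variables are all invented for illustration; the original post contains no code):
// Transaction 1: create the record; the Blob column is left empty for now
Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();
Document doc = new Document();                 // hypothetical entity with an OID-backed Blob field
doc.setName("report.pdf");
session.save(doc);
tx.commit();

// Transaction 2: attach the Blob content in a separate update
tx = session.beginTransaction();
doc.setContent(session.getLobHelper().createBlob(bytes)); // bytes = payload to store
session.update(doc);
tx.commit();
session.close();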