XML validation in Spark Java

We have a 3 GB XML file that we have to validate and then flatten. We are expected to use Spark (Java) for both the validation and the flattening. The flattened data will be ingested into a Hive table.
The validation should also surface any bad record in the XML (so that we can write it back to a Kafka topic to make the source system aware of it), and bad records shouldn't get stored in the Hive table.
Flattening based on com.databricks.spark.xml is not recommended by the client.
Kindly help. If not code, an algorithm would also help.

You can use javax.xml.validation.Validator. This API will help you validate the XML.
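This is only a minimal sketch of that API, assuming an XSD is available; record.xsd and record.xml below are placeholder paths. In a Spark job the same pattern could be applied per record or per partition, and anything that throws SAXException routed to the Kafka topic instead of the Hive table:

```java
import java.io.File;

import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

import org.xml.sax.SAXException;

public class XmlValidation {
    public static void main(String[] args) throws Exception {
        // Build a reusable Schema from the XSD (Schema is thread-safe and can be
        // shared; Validator instances are not, so create one per thread/partition).
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new File("record.xsd")); // placeholder XSD path

        Validator validator = schema.newValidator();
        try {
            validator.validate(new StreamSource(new File("record.xml"))); // placeholder record
            System.out.println("Record is valid");
        } catch (SAXException e) {
            // A SAXException marks a bad record; this is where it could be sent
            // to Kafka instead of being written to Hive.
            System.err.println("Bad record: " + e.getMessage());
        }
    }
}
```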

Related

Elasticsearch: plain text file instead of JSON

I am interested in Elasticsearch and I am working with txt files, not JSON. Can Elasticsearch support plain text files? If yes, is there any Java API I can use? (I tested CRUD operations with Postman on a JSON document and it works fine.)
Thanks for the help.
No, the Elasticsearch document API supports only JSON.
But there is a workaround for this problem using ingest pipelines running on ingest nodes in your cluster: https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html. By default, each Elasticsearch server instance is an ingest node.
Please have a look at this very well-described approach for CSV, https://www.elastic.co/de/blog/indexing-csv-elasticsearch-ingest-node, which is easily adaptable to flat files.
Another option is to use a second tool like Filebeat or Logstash for file ingestion. Have a look here: https://www.elastic.co/products/beats or here: https://www.elastic.co/products/logstash
Having Filebeat in place will solve many problems with minimal effort. Give it a chance ;)
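If you would rather stay in Java than set up a pipeline or Filebeat, the usual workaround is to wrap each line of the text file in a small JSON document yourself before indexing it. A rough sketch, assuming the 7.x Java High Level REST Client, a local node, and a hypothetical index called lines:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class TxtToElasticsearch {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            List<String> lines = Files.readAllLines(Paths.get("data.txt")); // placeholder flat file
            for (String line : lines) {
                // Wrap the plain-text line in a JSON document, since the
                // document API only accepts JSON.
                IndexRequest request = new IndexRequest("lines")
                        .source(Collections.singletonMap("message", line));
                client.index(request, RequestOptions.DEFAULT);
            }
        }
    }
}
```

For large files you would batch these into bulk requests rather than indexing line by line, which is essentially what Filebeat does for you.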

Apache Solr, SolrJ vs Data Import Handler for parsing XML

I'm hoping to use Solr to run searches from info parsed from XML files.
These XML files are not in Solr's document format, so I have to parse them and pull out the fields I need myself.
I am familiar with Java programming and was wondering if SolrJ would be an easier method than using the Data Import Handler. I'm considering running through each XML file I have and parsing out the fields I need from each. Is there any downside to one method over the other? I imagine that, since I have familiarity with Java, it may be easier to parse the XML that way?
I will probably need multiple conditions and regular expressions; if anything, I need a reliable way to get my fields out of relatively unstructured XML.
How would SolrJ work with the interface? That is, if I index using SolrJ, can I do my queries through the interface still?
DIH was designed for prototyping, though some people do use it in production. You can start from it, but be ready to jump to SolrJ or other methods if you hit its limitations. And if you have very complex mappings, you may be better off starting with SolrJ.
You can also apply an XSLT transform to an incoming XML document to map it to the Solr format.
And, as said elsewhere, search is a separate issue from indexing.
How you index your content into Solr is orthogonal to how you query it. You can index any way you want, as long as it produces the right docs in the index.
Now, regarding indexing: if DIH gets you what you need without much tweaking, go for it. But if you need to do a lot of tweaking of the data, in the end you might finish faster if you just write some Java with SolrJ. With SolrJ you have all the flexibility; with DIH you are more constrained (think of the 80/20 rule).
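To give a feel for the SolrJ route, here is a rough sketch that pulls a couple of fields out of an XML file with XPath and pushes a document to Solr. The collection name, file path, field names and XPath expressions are all placeholders:

```java
import java.io.File;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.w3c.dom.Document;

public class XmlToSolr {
    public static void main(String[] args) throws Exception {
        // Parse the source XML (which is not in Solr's document format).
        Document xml = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("input.xml")); // placeholder input file
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Pull out the fields of interest; apply conditions/regexes here as needed.
        String id = xpath.evaluate("/record/@id", xml);       // placeholder expression
        String title = xpath.evaluate("/record/title", xml);  // placeholder expression

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("title", title);

        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) { // placeholder collection
            solr.add(doc);
            solr.commit();
        }
    }
}
```

Documents indexed this way land in the normal index, so queries through the admin UI or any other client work exactly as they would with DIH.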

AWS Lambda for dumping Excel into a database

I have a requirement to copy data from one Oracle table to another table on a daily basis. Currently, I am fetching data from the database and writing it to an Excel file through Java code, so I have a list of POJOs ready to insert. But I am open to an approach where I can directly dump data from my Oracle table into the second table (and again, I am open to whatever database is appropriate for this, e.g. Oracle or Amazon DynamoDB). Below are the approaches I could think of. I am still looking into other approaches and will update the post accordingly.
1) The naive approach is to just fire insert queries from the Java code itself. I am using Hibernate, so I can do that a little more easily.
2) Second, I thought about using AWS Lambda. I have not read about it completely; I just have a basic idea of it. But I am opening this question because I am a novice and I want to select an efficient approach for this.
Will you please shed some light on my approaches or suggest a completely different one?
As Lambda has different triggers, you can use one of those to load the Excel data. One solution would be to set up an API through API Gateway which triggers the Lambda: call API Gateway with the serialised Excel data, which in turn calls the Lambda; deserialise the data in the Lambda and save it to DynamoDB. Another solution is S3, which you have mentioned in the comments.
The best approach is to trigger a Lambda function from a CloudWatch schedule on a daily basis, which can copy the data from one table to another in Oracle, or from Oracle to DynamoDB. There is no need for S3 or API Gateway, which is more complex and will cost you more.
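A rough sketch of that scheduled-Lambda approach in Java, assuming a plain JDBC connection to Oracle. The environment variable names and the table names SOURCE_TABLE / TARGET_TABLE are placeholders; in practice the credentials would come from environment variables or Secrets Manager, and the handler would be wired to a daily CloudWatch Events / EventBridge rule:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

// Invoked once a day by a CloudWatch Events / EventBridge schedule rule.
public class DailyCopyHandler implements RequestHandler<Object, String> {

    @Override
    public String handleRequest(Object event, Context context) {
        String url = System.getenv("ORACLE_JDBC_URL");   // placeholder env var
        String user = System.getenv("DB_USER");          // placeholder env var
        String password = System.getenv("DB_PASSWORD");  // placeholder env var

        try (Connection conn = DriverManager.getConnection(url, user, password);
             Statement stmt = conn.createStatement()) {
            // Copy everything server-side; no rows are pulled into the Lambda.
            int copied = stmt.executeUpdate(
                    "INSERT INTO TARGET_TABLE SELECT * FROM SOURCE_TABLE"); // placeholder tables
            return "Copied " + copied + " rows";
        } catch (Exception e) {
            throw new RuntimeException("Daily copy failed", e);
        }
    }
}
```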

Batch indexing to Solr

I have a Java class that sends HTTP POST requests to a Solr instance to index JSON files. It is implemented in a multithreaded manner. However, I have realized that sending so many HTTP requests (close to 20,000) is causing the network to be a bottleneck. I read online that I can do batch indexing, but I can't find any clear examples. Is there any advice on how to batch index in Solr?
Thank you.
For generic JSON, you must have a configuration somewhere in solrconfig.xml that defines how it is treated.
One of the parameters is split. You might be able to use it to combine your JSON documents into one bigger one that Solr would then split up and process separately. Note that the specific format may differ a little between Solr versions; get the reference guide PDF for your version if something is not working.
Or, if you can generate it, use the JSON format Solr understands directly, which has full support for multiple documents.
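As a rough illustration of the "one bigger request" idea, the sketch below concatenates many documents into a single JSON array and posts them to Solr's /update handler in one HTTP call; the URL and collection name are placeholders, and with SolrJ the equivalent would be passing a whole collection of SolrInputDocument objects to a single add() call:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class SolrBatchPost {
    public static void main(String[] args) throws Exception {
        // Each element is one Solr JSON document; in practice these would be
        // the ~20,000 files, read and grouped into reasonably sized batches.
        List<String> docs = Arrays.asList(
                "{\"id\":\"1\",\"title\":\"first\"}",
                "{\"id\":\"2\",\"title\":\"second\"}");

        // Solr's native JSON update format accepts an array of documents,
        // so one request can carry the whole batch.
        String body = "[" + String.join(",", docs) + "]";

        URL url = new URL("http://localhost:8983/solr/mycollection/update?commit=true"); // placeholder
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Solr responded with HTTP " + conn.getResponseCode());
    }
}
```

Committing once per batch (or relying on autoCommit) rather than per document also helps a lot with indexing throughput.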

Is it feasible to translate table definitions used by Spring Batch?

We are going to use Spring Batch in a project that needs to read, convert and write large amounts of data. So far, everything is fine.
But there is a non-functional requirement that says we can't create DB objects using English words, so the original schema used by Spring Batch will not be approved by the client's DBA unless we translate it.
In the docs, I don't see any way to configure or extend the API to achieve this, so it seems that we'll have to customize the source code to make it work with the equivalent, translated model. Is that a correct/feasible assumption, or am I missing something?
That is an unusual requirement. However, in order to completely rename the tables and columns in the batch schema, you'll need to re-implement the JDBC-based repository DAOs to use your own SQL.
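For completeness, the one piece Spring Batch does expose out of the box is the table prefix: JobRepositoryFactoryBean lets you swap BATCH_ for a translated prefix, but fully translated table and column names still require re-implementing the JDBC DAOs as described above. A minimal sketch, with a hypothetical prefix value:

```java
import javax.sql.DataSource;

import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.repository.support.JobRepositoryFactoryBean;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class TranslatedBatchConfig {

    // Only the table *prefix* is configurable here (e.g. BATCH_ -> LOTE_);
    // renaming individual tables and columns needs custom DAO implementations.
    @Bean
    public JobRepository jobRepository(DataSource dataSource,
                                       PlatformTransactionManager transactionManager) throws Exception {
        JobRepositoryFactoryBean factory = new JobRepositoryFactoryBean();
        factory.setDataSource(dataSource);
        factory.setTransactionManager(transactionManager);
        factory.setTablePrefix("LOTE_"); // hypothetical translated prefix
        factory.afterPropertiesSet();
        return factory.getObject();
    }
}
```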
