S3 Implementation for org.apache.parquet.io.InputFile? - java

I am trying to write a Scala-based AWS Lambda to read Snappy-compressed Parquet files stored in S3. The process will write them back out as partitioned JSON files.
I have been trying to use the org.apache.parquet.hadoop.ParquetFileReader class to read the files... the non-deprecated way to do this appears to be to pass it an implementation of the org.apache.parquet.io.InputFile interface. There is one for Hadoop (HadoopInputFile)... but I cannot find one for S3. I also tried some of the deprecated ways for this class, but could not get them to work with S3 either.
Any solution to this dilemma?
Just in case anyone is interested in why I am doing this in Scala: well, I cannot figure out another way to do it. The Python implementations for Parquet (pyarrow and fastparquet) both seem to struggle with complicated list/struct-based schemas.
Also, I have seen some AvroParquetReader-based code (Read parquet data from AWS s3 bucket) that might be a different solution, but I could not get it to work without a known schema. Maybe I am missing something there.
I'd really like to get the ParquetFileReader class to work, as it seems clean.
Appreciate any ideas.

Hadoop uses its own filesystem abstraction layer, which has an implementation for S3 (https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#S3A).
The setup should look something like the following (Java, but the same should work in Scala):
import java.net.URI;
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.s3a.Constants;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

Configuration conf = new Configuration();
conf.set(Constants.ENDPOINT, "https://s3.eu-central-1.amazonaws.com/");
conf.set(Constants.AWS_CREDENTIALS_PROVIDER,
    DefaultAWSCredentialsProviderChain.class.getName());
// maybe additional configuration properties depending on the credential provider
URI uri = URI.create("s3a://bucketname/path");
Path path = new Path(uri);
ParquetFileReader pfr = ParquetFileReader.open(HadoopInputFile.fromPath(path, conf));
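Once the reader is open, the records can be materialized without supplying a schema, because the schema is read from the file footer. Below is a minimal sketch using parquet-mr's example Group API (pfr is the reader opened above; the JSON conversion step is only indicated by a comment):
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.MessageType;

MessageType schema = pfr.getFooter().getFileMetaData().getSchema();
PageReadStore rowGroup;
while ((rowGroup = pfr.readNextRowGroup()) != null) {
    long rows = rowGroup.getRowCount();
    MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
    RecordReader<Group> recordReader =
        columnIO.getRecordReader(rowGroup, new GroupRecordConverter(schema));
    for (long i = 0; i < rows; i++) {
        Group record = recordReader.read();
        // convert the record to JSON here (record.toString() is handy for debugging)
    }
}
pfr.close();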

Related

Read a toml file in java 2022

After some quick research, I found the following three libraries for parsing TOML files in Java:
toml4j
tomlj
jackson-dataformats-text
What I am looking for is a library that can parse TOML files without a corresponding POJO class. While both toml4j and tomlj can achieve that, they do not seem to be maintained.
jackson-dataformats-text, on the other hand, is actively maintained, but I cannot parse a TOML file without the corresponding POJO class.
Is there a way to create a dynamic class in Java that I can use to parse any TOML file?
If you just need to read a TOML file without a POJO, FasterXML Jackson libraries are a great choice. The simplest method is to just read the content as a java.util.Map:
final var tomlMapper = new TomlMapper();
final var data = tomlMapper.readValue(new File("config.toml"), Map.class);
After that, the content of the file will be available in data.
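Put together, a minimal self-contained version might look like the following sketch (the file name config.toml and the database.port key are assumptions for illustration):
import com.fasterxml.jackson.dataformat.toml.TomlMapper;
import java.io.File;
import java.util.Map;

public class ReadToml {
    public static void main(String[] args) throws Exception {
        TomlMapper tomlMapper = new TomlMapper();
        // Read the whole TOML document into a generic Map.
        Map<?, ?> data = tomlMapper.readValue(new File("config.toml"), Map.class);
        // Nested TOML tables come back as nested maps.
        Map<?, ?> database = (Map<?, ?>) data.get("database");
        System.out.println("database.port = " + database.get("port"));
    }
}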
If you need even lower-level parsing, all formats supported by FasterXML Jackson can be read using the streaming API. In that case, read about the streaming API in the core module (FasterXML/jackson-core); just make sure you use the right factory class (TomlFactory instead of JsonFactory).
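For illustration, a hedged sketch of token-level streaming over a TOML file with TomlFactory (the file name is again an assumption):
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.dataformat.toml.TomlFactory;
import java.io.File;

TomlFactory factory = new TomlFactory();
try (JsonParser parser = factory.createParser(new File("config.toml"))) {
    JsonToken token;
    while ((token = parser.nextToken()) != null) {
        if (token == JsonToken.FIELD_NAME) {
            System.out.println("key: " + parser.currentName());
        } else if (token.isScalarValue()) {
            System.out.println("value: " + parser.getText());
        }
    }
}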

How to run analytic in spark?

I'm new to Spark and still learning it. I have some questions I would like opinions on.
I have to prepare a jar file for an analytic method that should be suitable to run as a Spark job.
Is it necessary for the jar to be executable/runnable?
Can I prepare the jar as a library with a few methods?
In my case, the analytic has an input and an output.
Can I pass input JSON and get output JSON in Spark?
What are the steps?
Any help or links to read would be appreciated.
Your first question basically asks how to run Spark with the Java API. Here is some code I think you'll find useful:
import org.apache.spark.launcher.SparkLauncher;

SparkLauncher launcher = new SparkLauncher()
    .setAppName(config.getString("appName"))
    .setSparkHome(sparkHomePath)
    .setAppResource(pathToYourJar)
    .setMaster(masterUrl)
    .setMainClass(fullNameOfMainClass);
You might also need to add your dependencies with launcher.addJar(...).
Create an instance of SparkAppHandle.Listener and pass it when starting the application:
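A minimal sketch of such a listener (the variable name sparkJobListener matches the call below; the log output is just illustrative):
import org.apache.spark.launcher.SparkAppHandle;

SparkAppHandle.Listener sparkJobListener = new SparkAppHandle.Listener() {
    @Override
    public void stateChanged(SparkAppHandle handle) {
        System.out.println("Spark job state changed to: " + handle.getState());
    }

    @Override
    public void infoChanged(SparkAppHandle handle) {
        // called when, e.g., the application ID becomes available
        System.out.println("Spark app id: " + handle.getAppId());
    }
};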
SparkAppHandle handle = launcher.startApplication(sparkJobListener);
"can I pass input json and get output json in the spark?"
If you wish to read JSON as the input, you can follow the instructions in this link.
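As for that second question, here is a minimal, hedged sketch of a job main class that reads JSON and writes JSON with the Spark SQL API (the class name, the paths, and the omitted transformation are placeholders):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AnalyticJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("analytic-job")
            .getOrCreate();

        // Read the input JSON into a DataFrame (one JSON object per line).
        Dataset<Row> input = spark.read().json("s3a://bucket/input.json");

        // ... apply the analytic transformations here ...

        // Write the result back out as JSON.
        input.write().json("s3a://bucket/output");

        spark.stop();
    }
}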

How to use Weka JSONLoader in Java IDE?

I want to use Weka in order to parse an existing JSON file in Java (Eclipse). I believe this can be done using the JSONLoader class. After reading the class's documentation (http://weka.sourceforge.net/doc.dev/weka/core/converters/JSONLoader.html#JSONLoader--) I thought that this could easily be done like this:
JSONLoader jsonLoader = new JSONLoader(jsonFile);
Then I thought that just calling jsonLoader.getFileDescription() or jsonLoader.getSource() would give me results. That is not how it's done, though, and I can't find anywhere how to use the JSONLoader class in my Java code. So, in order not to make this question too broad: how can I create a JSONLoader object that reads a source that is in JSON format?
First of all, this has nothing to do with Eclipse, so you should edit your question.
A brief look at the documentation of JSONLoader (in the link you provided) shows that you need to set the data source you want to parse using setSource (the constructor takes no arguments):
import weka.core.Instances;
import weka.core.converters.JSONLoader;
import java.io.File;

JSONLoader jsonLoader = new JSONLoader();
File f = new File("PATH_TO_YOUR_JSON_FILE");
jsonLoader.setSource(f); // you can also use an InputStream instead of a File
After doing that, you can use the other methods that work with the parsed JSON:
Instances dataset = jsonLoader.getDataSet();
jsonLoader.getFileDescription();
...
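Once loaded, the Instances object behaves like any other Weka dataset. A small follow-up sketch (assuming the last attribute should be treated as the class attribute):
dataset.setClassIndex(dataset.numAttributes() - 1); // last attribute as class
System.out.println("Loaded " + dataset.numInstances() + " instances");
System.out.println(dataset.toSummaryString());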

saveAsHadoopFile - files extension

I'm using saveAsHadoopFile of JavaPairRDD to save an RDD as Avro files with Snappy compression. Is it possible to force the extension of the output files to be .snappy?
AvroOutputFormat has a hardcoded .avro extension and does not allow changing it.
I've uploaded a patch to the Avro JIRA with the proper changes.
If you have a similar problem, you have to (for now) simply subclass AvroOutputFormat and use it in the saveAsHadoopFile method. For example, in Scala:
rdd.saveAsHadoopFile("output/path",
  classOf[AvroWrapper[GenericRecord]],
  classOf[NullWritable],
  classOf[YourOutputFormatClassName[GenericRecord]])
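Since the question mentions JavaPairRDD, the equivalent call from Java might look like the following sketch (CustomAvroOutputFormat stands in for your AvroOutputFormat subclass; the name is hypothetical):
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroWrapper;
import org.apache.hadoop.io.NullWritable;

// Assumes rdd is a JavaPairRDD<AvroWrapper<GenericRecord>, NullWritable> and
// CustomAvroOutputFormat is your AvroOutputFormat subclass (hypothetical name).
rdd.saveAsHadoopFile(
    "output/path",
    AvroWrapper.class,
    NullWritable.class,
    CustomAvroOutputFormat.class);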

Validate an XML using Apache Spark

I know that Apache Spark was primarily developed to analyze unstructured data. However, I have to read and process a huge XML file (greater than 1 GB) and I have to use Apache Spark as a requirement.
Googling a little, I found how an XML file can be read by a Spark process with proper partitioning. As described here, the hadoop-streaming library can be used, like this:
val jobConf = new JobConf()
jobConf.set("stream.recordreader.class",
  "org.apache.hadoop.streaming.StreamXmlRecordReader")
jobConf.set("stream.recordreader.begin", "<page")
jobConf.set("stream.recordreader.end", "</page>")
org.apache.hadoop.mapred.FileInputFormat.addInputPaths(jobConf, s"hdfs://$master:9000/data.xml")
// Load documents, splitting wrt <page> tag.
val documents = sparkContext.hadoopRDD(
  jobConf,
  classOf[org.apache.hadoop.streaming.StreamInputFormat],
  classOf[org.apache.hadoop.io.Text],
  classOf[org.apache.hadoop.io.Text])
Every chunk of information can then be processed in a Scala / Java object using dom4j or JAXB (more complex).
Now, the problem is the following: the XML file should be validated before processing it. How can I do this in a way that conforms to Spark? As far as I know, the StreamXmlRecordReader used to split the file does not perform any validation.
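As an illustration of one possible direction (validating each extracted <page> fragment against an XSD inside a map step, rather than validating the whole file up front), here is a hedged Java sketch using JAXP; the schema file page.xsd and the class name are hypothetical:
import java.io.File;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public final class PageValidator {
    private static final Schema SCHEMA;

    static {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            SCHEMA = factory.newSchema(new File("page.xsd")); // hypothetical XSD
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Returns true if a single "<page>...</page>" fragment validates against the XSD.
    public static boolean isValid(String pageXml) {
        try {
            Validator validator = SCHEMA.newValidator();
            validator.validate(new StreamSource(new StringReader(pageXml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}
Each Text value produced by StreamXmlRecordReader could then be filtered (or flagged) with PageValidator.isValid inside a map or filter step.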
