I'm using saveAsHadoopFile of JavaPairRDD to save an RDD as an Avro file with Snappy compression. Is it possible to force the extension of the output files to be .snappy?
AvroOutputFormat has a hardcoded .avro extension and does not allow changing it.
I've uploaded a patch to the Avro JIRA with proper changes.
If you have a similar problem, you have to (as of now) simply subclass AvroOutputFormat and use it in the saveAsHadoopFile method. For example, in Scala:
rdd.saveAsHadoopFile("output/path",
  classOf[AvroWrapper[GenericRecord]],
  classOf[NullWritable],
  classOf[YourOutputFormatClassName[GenericRecord]])
After some quick research, I found the following three libraries for parsing TOML files in Java:
toml4j
tomlj
jackson-dataformats-text
What I am looking for is a library that can parse TOML files without a corresponding POJO class. While both toml4j and tomlj can achieve that, they do not seem to be maintained.
jackson-dataformats-text, on the other hand, is actively maintained, but I cannot parse a TOML file without the corresponding POJO class.
Is there a way to create a dynamic class in Java that I can use to parse any TOML file?
If you just need to read a TOML file without a POJO, FasterXML Jackson libraries are a great choice. The simplest method is to just read the content as a java.util.Map:
final var tomlMapper = new TomlMapper();
final var data = tomlMapper.readValue(new File("config.toml"), Map.class);
After that, the content of the file will be available in data.
If you need even lower-level parsing, all formats supported by FasterXML Jackson can be read using the streaming API. If you need that, read about the streaming API in the core module (FasterXML/jackson-core); just make sure you use the right factory class (TomlFactory instead of JsonFactory).
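Since TOML tables come back as nested Maps, reaching a nested key means walking Map by Map. Here is a small hypothetical helper (plain JDK, no Jackson required) that follows a dotted path; the hand-built map below stands in for what a parser would return:

```java
import java.util.Map;

public class TomlPath {
    // Walk a nested Map (as produced by e.g. readValue(..., Map.class))
    // following a dotted path like "server.port".
    @SuppressWarnings("unchecked")
    static Object get(Map<String, Object> data, String dottedPath) {
        Object current = data;
        for (String key : dottedPath.split("\\.")) {
            if (!(current instanceof Map)) {
                return null; // path descends into a non-table value
            }
            current = ((Map<String, Object>) current).get(key);
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in for the Map a TOML parser would return for:
        // [server]
        // host = "localhost"
        // port = 8080
        Map<String, Object> data = Map.of(
            "title", "Example",
            "server", Map.of("host", "localhost", "port", 8080)
        );
        System.out.println(get(data, "server.host")); // localhost
        System.out.println(get(data, "server.port")); // 8080
    }
}
```

The same walk works for any depth of TOML table nesting, since the mapper represents every table as another Map.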
I am trying to write a Scala-based AWS Lambda to read Snappy-compressed Parquet files stored in S3. The process will write them back out as partitioned JSON files.
I have been trying to use the org.apache.parquet.hadoop.ParquetFileReader class to read the files... the non-deprecated way to do this appears to be to pass it an implementation of the org.apache.parquet.io.InputFile interface. There is one for Hadoop (HadoopInputFile)... but I cannot find one for S3. I also tried some of the deprecated ways for this class, but could not get them to work with S3 either.
Any solution to this dilemma?
Just in case anyone is interested... why am I doing this in Scala? Well... I cannot figure out another way to do it. The Python implementations for Parquet (pyarrow and fastparquet) both seem to struggle with complicated list/struct-based schemas.
Also, I have seen some AvroParquetReader-based code (Read parquet data from AWS s3 bucket) that might be a different solution, but I could not get it to work without a known schema. Maybe I am missing something there.
I'd really like to get the ParquetFileReader class to work, as it seems clean.
I appreciate any ideas.
Hadoop uses its own filesystem abstraction layer, which has an implementation for s3 (https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#S3A).
The setup should look something like the following (Java, but the same should work with Scala):
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.s3a.Constants;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;

Configuration conf = new Configuration();
conf.set(Constants.ENDPOINT, "https://s3.eu-central-1.amazonaws.com/");
conf.set(Constants.AWS_CREDENTIALS_PROVIDER,
    DefaultAWSCredentialsProviderChain.class.getName());
// maybe additional configuration properties depending on the credential provider
URI uri = URI.create("s3a://bucketname/path");
Path path = new Path(uri);
ParquetFileReader pfr = ParquetFileReader.open(HadoopInputFile.fromPath(path, conf));
I know that Apache Spark was primarily developed to analyze unstructured data. However, I have to read and process a huge XML file (greater than 1 GB), and I have to use Apache Spark as a requirement.
Googling a little, I found out how an XML file can be read by a Spark process with proper partitioning. As described here, the hadoop-streaming library can be used, like this:
val jobConf = new JobConf()
jobConf.set("stream.recordreader.class",
"org.apache.hadoop.streaming.StreamXmlRecordReader")
jobConf.set("stream.recordreader.begin", "<page")
jobConf.set("stream.recordreader.end", "</page>")
org.apache.hadoop.mapred.FileInputFormat.addInputPaths(jobConf, s"hdfs://$master:9000/data.xml")
// Load documents, splitting wrt <page> tag.
val documents = sparkContext.hadoopRDD(
  jobConf,
  classOf[org.apache.hadoop.streaming.StreamInputFormat],
  classOf[org.apache.hadoop.io.Text],
  classOf[org.apache.hadoop.io.Text])
Every chunk of information can then be processed in a Scala/Java object using dom4j or JAXB (more complex).
Now, the problem is the following: the XML file should be validated before processing it. How can I do that in a way that conforms to Spark? As far as I know, the StreamXmlRecordReader used to split the file does not perform any validation.
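Regarding the validation concern: one pattern that fits Spark is to check each <page> chunk independently inside a map or filter over the records. Below is a minimal JDK-only sketch (my own illustration, not part of hadoop-streaming) that tests a chunk for well-formedness; full schema validation would swap in javax.xml.validation with an XSD:

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class XmlCheck {
    // Returns true if the chunk parses as well-formed XML. With Spark,
    // you would call this inside a map/filter over the records produced
    // by StreamXmlRecordReader. This only covers well-formedness, not
    // validation against a DTD or XSD.
    static boolean isWellFormed(String xmlChunk) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Harden the parser against external entity tricks.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            factory.newDocumentBuilder()
                   .parse(new InputSource(new StringReader(xmlChunk)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed XML
        } catch (Exception e) {
            throw new IllegalStateException(e); // parser setup/IO failure
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<page><title>ok</title></page>")); // true
        System.out.println(isWellFormed("<page><title>broken</page>"));     // false
    }
}
```

A chunk that fails the check can be routed to a side output for inspection instead of aborting the whole job.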
I'm trying to read an N-Quads file with Jena, but all I get is an empty model. The file I'm trying to read is taken from the example in N-Quads documentation:
<http://example.org/#spiderman> <http://www.perceive.net/schemas/relationship/enemyOf> <http://example.org/#green-goblin> <http://example.org/graphs/spiderman> .
(I saved it as a file named file.nq).
I'm loading the model using RDFDataMgr, but it didn't work with Model.read either.
RDFDataMgr.loadModel("file.nq", Lang.NQUADS)
yields an empty model.
What am I missing? Doesn't Jena support N-Quads out-of-the-box?
Yes, Jena supports N-Quads. Try loadDataset.
N-Quads is for multiple graphs, and you have read it into a single graph. What you get is just the default graph's triples, which in this case is none.
There is a warning emitted:
WARN riot :: Only triples or default graph data expected : named graph data ignored
If you didn't get that warning, then (1) you are running an old copy, (2) you have turned logging off, or (3) the file is empty.
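To make the role of the fourth term concrete, here is a toy JDK-only splitter (my own illustration, not a Jena API; it handles only the simple all-IRI case, with no literals, blank nodes, or escaping) applied to the example line:

```java
import java.util.List;

public class NQuadsToy {
    // Toy splitter for the simplest N-Quads case: angle-bracketed IRIs
    // separated by whitespace, terminated by " .". Real N-Quads needs a
    // proper parser such as Jena's RIOT.
    static List<String> terms(String line) {
        String body = line.trim();
        if (body.endsWith(".")) {
            body = body.substring(0, body.length() - 1).trim();
        }
        return List.of(body.split("\\s+"));
    }

    public static void main(String[] args) {
        String line = "<http://example.org/#spiderman> "
            + "<http://www.perceive.net/schemas/relationship/enemyOf> "
            + "<http://example.org/#green-goblin> "
            + "<http://example.org/graphs/spiderman> .";
        List<String> t = terms(line);
        // The first three terms form the triple; the fourth names the
        // graph, which is why the triple never lands in the default graph.
        System.out.println(t.size());  // 4
        System.out.println(t.get(3));  // <http://example.org/graphs/spiderman>
    }
}
```

With loadDataset, that fourth term becomes the name under which the triple's graph is stored in the Dataset.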
I have been working on a subtitling system in Java.
A normal .srt file can be saved, and the subtitles display fine.
I want the subtitles to have different properties, like different fonts/colors/sizes. All these properties are not encoded in a normal .srt; the file has to be saved as .ssa (SubStation Alpha) with extra fields like [V4+ Styles] and [Events].
I want to know whether there are any libraries I can use to export directly to .ssa, or do I have to write a method which produces the [V4+ Styles] section?
Thank you.
Jubler is an open-source subtitle tool, written in Java, that seems to support the SubStation Alpha format.
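If no library fits, writing the exporter by hand is manageable: .ssa is plain text with [Script Info], [V4+ Styles], and [Events] sections. The sketch below is a hypothetical, minimal writer; the Format lines are trimmed to a few columns for illustration, and a real player will expect the full SSA/ASS column set and style fields:

```java
import java.util.List;

public class SsaWriter {
    // One subtitle cue: times in milliseconds plus the text.
    record Cue(long startMs, long endMs, String text) {}

    // SSA/ASS timestamps look like H:MM:SS.cc (centiseconds).
    static String time(long ms) {
        long cs = (ms % 1000) / 10;
        long s = (ms / 1000) % 60;
        long m = (ms / 60000) % 60;
        long h = ms / 3600000;
        return String.format("%d:%02d:%02d.%02d", h, m, s, cs);
    }

    // Emit a minimal .ssa document with a single style. The Style line's
    // values (font, size, colour) are illustrative placeholders.
    static String toSsa(List<Cue> cues) {
        StringBuilder sb = new StringBuilder();
        sb.append("[Script Info]\n");
        sb.append("ScriptType: v4.00+\n\n");
        sb.append("[V4+ Styles]\n");
        sb.append("Format: Name, Fontname, Fontsize, PrimaryColour\n");
        sb.append("Style: Default,Arial,24,&H00FFFFFF\n\n");
        sb.append("[Events]\n");
        sb.append("Format: Layer, Start, End, Style, Text\n");
        for (Cue c : cues) {
            sb.append(String.format("Dialogue: 0,%s,%s,Default,%s\n",
                time(c.startMs()), time(c.endMs()), c.text()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(toSsa(List.of(new Cue(1000, 2500, "Hello there"))));
    }
}
```

Converting from .srt then reduces to re-parsing its HH:MM:SS,mmm timestamps into milliseconds and feeding the cues through toSsa.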