I have some Parquet files written in Python using PyArrow. Now I want to read them using a Java program. I tried the following, using Apache Avro:
import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class Main {

    private static Path path = new Path("D:\\pathToFile\\review.parquet");

    public static void main(String[] args) throws IllegalArgumentException {
        try {
            Configuration conf = new Configuration();
            Schema schema = SchemaBuilder.record("lineitem")
                    .fields()
                    .name("reviewID")
                    .aliases("review_id$str")
                    .type().stringType()
                    .noDefault()
                    .endRecord();
            conf.set(AvroReadSupport.AVRO_REQUESTED_PROJECTION, schema.toString());

            ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(path)
                    .withConf(conf)
                    .build();

            GenericRecord r;
            while (null != (r = reader.read())) {
                r.getSchema().getField("reviewID").addAlias("review_id$str");
                Object review_id = r.get("review_id$str");
                String review_id_str = review_id != null ? ("'" + review_id.toString() + "'") : "-";
                System.out.println("review_id: " + review_id_str);
            }
        } catch (IOException e) {
            System.out.println("Error reading parquet file.");
            e.printStackTrace();
        }
    }
}
My Parquet file contains columns whose names contain the symbols [, ], ., \ and $. (In this case, the Parquet file contains a column review_id$str, whose values I want to read.) However, these characters are invalid in Avro names (see: https://avro.apache.org/docs/current/spec.html#names). Therefore, I tried to use aliases (see: http://avro.apache.org/docs/current/spec.html#Aliases). Even though I no longer get any "invalid character" errors, I am still unable to get the values, i.e. no values are printed even though the column contains data.
It only prints:
review_id: -
review_id: -
review_id: -
review_id: -
...
And expected would be:
review_id: Q1sbwvVQXV2734tPgoKj4Q
review_id: GJXCdrto3ASJOqKeVWPi6Q
review_id: 2TzJjDVDEuAW6MR5Vuc1ug
review_id: yi0R0Ugj_xUx_Nek0-_Qig
...
Am I using the aliases wrong? Is it even possible to use aliases in this situation? If so, please explain how I can fix it. Thank you.
Update 2021:
In the end, I decided not to use Java for this task. I stuck to my solution in Python using PyArrow which works perfectly fine.
I'm having an issue with figuring out how to use WEKA filters in Java code. I've looked up help, but it seems a little dated, as I'm using WEKA 3.8.5. I'm doing 3 tests. Test 1: no filter; Test 2: weka.filters.supervised.instance.SpreadSubsample -M 1.0; and Test 3: weka.filters.supervised.instance.Resample -B 1.0 -Z 130.3.
If my research is correct, I should import the filters like this. Now I'm lost on how to set "-M 1.0" for SpreadSubsample (my undersampling test) and "-B 1.0 -Z 130.3" for Resample (my oversampling test).
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.instance.Resample;
import weka.filters.supervised.instance.SpreadSubsample;
And I have Test 1 (my no-filter test) coded below:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class Fraud {

    public static void main(String args[]) {
        try {
            // Creating J48 classifier for the tree
            J48 j48Classifier = new J48();

            // Setting the path for the dataset
            String FraudDataset = "C:\\Users\\Owner\\Desktop\\CreditCard\\CreditCard.arff";
            BufferedReader bufferedReader = new BufferedReader(new FileReader(FraudDataset));

            // Creating the dataset instances
            Instances datasetInstances = new Instances(bufferedReader);
            datasetInstances.setClassIndex(datasetInstances.numAttributes() - 1);

            Evaluation evaluation = new Evaluation(datasetInstances);

            // Cross-validate model, 10 folds
            evaluation.crossValidateModel(j48Classifier, datasetInstances, 10, new Random(1));
            System.out.println(evaluation.toSummaryString("\nResults", false));
        }
        // Catching exceptions
        catch (Exception e) {
            System.out.println("Error occurred!!!! \n" + e.getMessage());
        }
        System.out.print("DT Successfully executed.");
    }
}
The results of my code are:
Results
Correctly Classified Instances 284649 99.9445 %
Incorrectly Classified Instances 158 0.0555 %
Kappa statistic 0.8257
Mean absolute error 0.0008
Root mean squared error 0.0232
Relative absolute error 24.2995 %
Root relative squared error 55.9107 %
Total Number of Instances 284807
DT Successfully executed.
Does anyone have an idea how I can add the filters, with the settings I want, to the code for Tests 2 and 3? Any help will be appreciated. I will run the 3 tests multiple times and compare the results; I want to see which of the 3 works best.
-M 1.0 and -B 1.0 -Z 130.3 are the options that you supply to the filters from the command-line. These filters implement the weka.core.OptionHandler interface, which offers the setOptions and getOptions methods.
For example, SpreadSubsample can be instantiated like this:
import weka.filters.supervised.instance.SpreadSubsample;
import weka.core.Utils;
...
SpreadSubsample spread = new SpreadSubsample();
// Utils.splitOptions generates an array from an option string
spread.setOptions(Utils.splitOptions("-M 1.0"));
// alternatively:
// spread.setOptions(new String[]{"-M", "1.0"});
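The Resample filter from your Test 3 is configured the same way; here is a minimal sketch using the options quoted in the question:
import weka.filters.supervised.instance.Resample;
import weka.core.Utils;
...
Resample resample = new Resample();
resample.setOptions(Utils.splitOptions("-B 1.0 -Z 130.3"));
// alternatively:
// resample.setOptions(new String[]{"-B", "1.0", "-Z", "130.3"});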
In order to apply the filters, you should use the FilteredClassifier approach. E.g., for SpreadSubsample you would do something like this:
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.filters.supervised.instance.SpreadSubsample;
import weka.core.Utils;
...
// base classifier
J48 j48 = new J48();
// filter
SpreadSubsample spread = new SpreadSubsample();
spread.setOptions(Utils.splitOptions("-M 1.0"));
// meta-classifier
FilteredClassifier fc = new FilteredClassifier();
fc.setFilter(spread);
fc.setClassifier(j48);
And then evaluate the fc classifier object on your dataset.
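For example, a minimal sketch reusing the cross-validation code and the datasetInstances variable from your Test 1:
// evaluate the meta-classifier fc exactly like the plain J48 in Test 1
Evaluation evaluation = new Evaluation(datasetInstances);
evaluation.crossValidateModel(fc, datasetInstances, 10, new Random(1));
System.out.println(evaluation.toSummaryString("\nResults", false));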
I have a single column of data, exported from Google Sheets as CSV, and also from LibreOffice as CSV. I've tried to read both files using OpenCSV, but I am only getting a small portion of the data.
How can I read this file in? I don't really see any commas in this CSV file...but it's only a single column of data.
file:
thufir@dur:~/jaxb$
thufir@dur:~/jaxb$ head input.csv
Field 1
Foo # 16
bar
baz
fdkfdl
fdsfdsfsdfgh
thufir@dur:~/jaxb$
output:
thufir@dur:~/NetBeansProjects/BaseXFromJAXB$
thufir@dur:~/NetBeansProjects/BaseXFromJAXB$ gradle run
> Task :run
Jan 10, 2019 3:36:08 PM net.bounceme.dur.basexfromjaxb.csv.ReaderForCVS printMap
INFO: Foo # 16
Jan 10, 2019 3:36:08 PM net.bounceme.dur.basexfromjaxb.csv.ReaderForCVS printMap
INFO: Field 1
Jan 10, 2019 3:36:08 PM net.bounceme.dur.basexfromjaxb.csv.ReaderForCVS printMap
INFO: Foo # 16
BUILD SUCCESSFUL in 1s
3 actionable tasks: 1 executed, 2 up-to-date
thufir@dur:~/NetBeansProjects/BaseXFromJAXB$
code:
package net.bounceme.dur.basexfromjaxb.csv;

import com.opencsv.CSVReaderHeaderAware;
import java.io.File;
import java.io.FileReader;
import java.net.URI;
import java.util.Collection;
import java.util.Map;
import java.util.logging.Logger;

public class ReaderForCVS {

    private static final Logger LOG = Logger.getLogger(ReaderForCVS.class.getName());

    private Map<String, String> values;

    public ReaderForCVS() {
    }

    public void unmarshal(URI inputURI) throws Exception {
        FileReader f = new FileReader(new File(inputURI));
        values = new CSVReaderHeaderAware(f).readMap();
    }

    public void printMap() {
        Collection<String> stringValues = values.values();
        for (String s : stringValues) {
            LOG.info(s);
        }
        for (Map.Entry<String, String> item : values.entrySet()) {
            String key = item.getKey();
            String value = item.getValue();
            LOG.info(key);
            LOG.info(value);
        }
    }
}
Frankly, I can't tell whether the library is reading in the file in a funky way, or the file is mangled in some way, or what. I'll be looking for CSV from websites, but I'm not sure what that establishes. I don't think it's likely that the library isn't parsing properly, but neither can I see the problem with this data.
There are only so many ways to export data from a spreadsheet as CSV, and I've tried a few. The content of the file is immaterial, but that structure (lines with no content, just a single column, special characters) is what I'm dealing with.
Reading in the file as text gives the desired output...
It looks like it works similarly to the other CSVReaders (see the OpenCSV guide on reading). Here is some sample code I used which seemed to work:
Map<String, String> values;
CSVReaderHeaderAware csvReaderHeaderAware = new CSVReaderHeaderAware(new StringReader(DAOConstants.IND_DATA));
// readMap() returns one row at a time, so keep calling it until it returns null
while ((values = csvReaderHeaderAware.readMap()) != null) {
    for (Map.Entry<String, String> entry : values.entrySet()) {
        System.out.println("Key = " + entry.getKey() + ", Value = " + entry.getValue());
    }
}
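Applied to the input.csv from the question, a minimal sketch (assuming OpenCSV 4.x) keeps calling readMap() in a loop; the ReaderForCVS.unmarshal() above calls it only once, which is why only the first row ever shows up:
import com.opencsv.CSVReaderHeaderAware;
import java.io.FileReader;
import java.util.Map;

public class ReadAllRows {
    public static void main(String[] args) throws Exception {
        try (FileReader f = new FileReader("input.csv")) {
            CSVReaderHeaderAware reader = new CSVReaderHeaderAware(f);
            Map<String, String> row;
            // readMap() returns one row per call and null at end of file
            while ((row = reader.readMap()) != null) {
                System.out.println(row); // e.g. {Field 1=Foo # 16}
            }
        }
    }
}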
The code:
package org.javautil.salesdata;

import java.io.File;
import java.io.IOException;
import java.util.Map;

import org.javautil.util.ListOfNameValue;

import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;

// https://github.com/FasterXML/jackson-dataformats-text/tree/master/csv
public class Manufacturers {

    private static final String fileName = "src/main/resources/pdssr/manufacturers.csv";

    ListOfNameValue getManufacturers() throws IOException {
        ListOfNameValue lnv = new ListOfNameValue();
        File csvFile = new File(fileName);

        CsvMapper mapper = new CsvMapper();
        CsvSchema schema = CsvSchema.emptySchema().withHeader(); // use first row as header; otherwise defaults are fine
        MappingIterator<Map<String, String>> it = mapper.readerFor(Map.class)
                .with(schema)
                .readValues(csvFile);
        while (it.hasNext()) {
            Map<String, String> rowAsMap = it.next();
            System.out.println(rowAsMap);
        }
        return lnv;
    }
}
The data:
"mfr_id","mfr_cd","mfr_name"
"0000000020","F-L", "Frito-Lay"
"0000000030","GM", "General Mills"
"0000000040","HVEND", "Hershey Vending"
"0000000050","HFUND", "Hershey Fund Raising"
"0000000055","HCONC", "Hershey Concession"
"0000000060","SNYDERS", "Snyder's of Hanover"
"0000000080","KELLOGG", "Kellogg & Keebler"
"0000000115","KARS", "Kar Nut Product (Kar's)"
"0000000135","MARS", "Mars Chocolate "
"0000000145","POORE", "Inventure Group (Poore Brothers)"
"0000000150","WOW", "WOW Foods"
"0000000160","CADBURY", "Cadbury Adam USA, LLC"
"0000000170","MONOGRAM", "Monogram Food"
"0000000185","JUSTBORN", "Just Born"
"0000000190","HOSTESS", "Hostess, Dolly Madison"
"0000000210","SARALEE", "Sara Lee"
The exception is
com.fasterxml.jackson.databind.exc.RuntimeJsonMappingException: Too many entries: expected at most 3 (value #3 (4 chars) "LLC"")
I thought I would throw out my own CSV parser and adopt a supported project with more functionality, but most of them are far slower, just plain break, or have examples all over the web that don't work with the current release of the product.
The problem is that your file does not meet the CSV standard: the third field always starts with a space.
mfr_id","mfr_cd","mfr_name"
"0000000020","F-L", "Frito-Lay"
"0000000030","GM", "General Mills"
"0000000040","HVEND", "Hershey Vending"
"0000000050","HFUND", "Hershey Fund Raising"
From Wikipedia:
According to RFC 4180, spaces outside quotes in a field are not allowed; however, the RFC also says that "Spaces are considered part of a field and should not be ignored." and "Implementors should 'be conservative in what you do, be liberal in what you accept from others' (RFC 793, section 2.10) when processing CSV files."
Jackson is being "liberal" in processing most of your records; but when it finds
"0000000160","CADBURY", "Cadbury Adam USA, LLC"
It has no choice but to treat it as 4 fields:
'0000000160'
'CADBURY'
' "Cadbury Adam USA'
' LLC"'
I would suggest fixing the file, as that will allow parsing with most CSV libraries. You could also try another library; there is no shortage of them.
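If fixing the export itself is not an option, one workaround is to normalize the file before parsing it, so the space after the separating comma disappears. This is only a sketch, and it assumes the sequence ", " (closing quote, comma, space, opening quote) only ever appears between fields, never inside a value; the -fixed output file name is made up for the example:
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FixCsv {
    public static void main(String[] args) throws Exception {
        String raw = new String(
                Files.readAllBytes(Paths.get("src/main/resources/pdssr/manufacturers.csv")),
                StandardCharsets.UTF_8);
        // turn `", "` into `","` so the opening quote of the third field is recognized
        String fixed = raw.replace("\", \"", "\",\"");
        Files.write(Paths.get("src/main/resources/pdssr/manufacturers-fixed.csv"),
                fixed.getBytes(StandardCharsets.UTF_8));
    }
}
After that, the fixed file should parse with the Jackson code from the question.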
univocity-parsers can handle that without any issues. It's built to deal with all sorts of tricky and non-standard CSV files and is also faster than the parser you are using.
Try this code:
import java.io.File;
import java.util.Map;
import com.univocity.parsers.common.record.Record;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
...
String fileName = "src/main/resources/pdssr/manufacturers.csv";
CsvParserSettings settings = new CsvParserSettings();
settings.setHeaderExtractionEnabled(true);
CsvParser parser = new CsvParser(settings);
for (Record record : parser.iterateRecords(new File(fileName))) {
    Map<String, String> rowAsMap = record.toFieldMap();
    System.out.println(rowAsMap);
}
Hope this helps.
Disclosure: I'm the author of this library. It's open source and free (Apache 2.0 license)
I have a .pkb file. It contains a package, and under that package there are multiple functions.
I have to get the following details out of it:
package name
function names (for all functions one by one)
params in function
return type of function
Approach: I am parsing the .pkb file. I have taken grammars from these sources:
Presto
ANTLRv4 grammar for PL/SQL
After getting these grammars, I downloaded antlr-4.5.3-complete.jar. Then, using
java -cp antlr-4.5.3-complete.jar org.antlr.v4.Tool grammar.g4
I generated the listener, lexer, parser and other files for each grammar separately.
After this I created two projects in Eclipse, one for each grammar. I imported the generated files into the respective projects and added antlr-4.5.3-complete.jar to the build path. Then I used the following code to check whether my .pkb file is correct:
public static void parse(String file) {
    try {
        SqlBaseLexer lex = new SqlBaseLexer(new org.antlr.v4.runtime.ANTLRInputStream(file));
        CommonTokenStream tokens = new CommonTokenStream(lex);
        SqlBaseParser parser = new SqlBaseParser(tokens);
        System.err.println(parser.getNumberOfSyntaxErrors() + " Errors");
    } catch (RecognitionException e) {
        System.err.println(e.toString());
    } catch (java.lang.OutOfMemoryError e) {
        System.err.println(file + ":");
        System.err.println(e.toString());
    } catch (java.lang.ArrayIndexOutOfBoundsException e) {
        System.err.println(file + ":");
        System.err.println(e.toString());
    }
}
I am not getting any error in parsing the file.
But after this I am stuck on the next steps. I need to get the package name, function names, parameters, etc.
How can I get these details?
Also, is my approach correct for obtaining the required output?
The Presto grammar is a generic SQL grammar which is not suitable for parsing Oracle packages. The ANTLRv4 grammar for PL/SQL is the right tool for your task.
Generally, an ANTLR grammar on its own works as a validator. When you want to do some additional processing while parsing, you should use ANTLR actions (see the overview slide in this presentation). These are blocks of text written in the target language (e.g. Java) and enclosed in curly braces (see the documentation).
There are at least two ways to solve your task with ANTLR actions.
Stdout output
The simplest way is to add println()s for certain rules.
To print the package name, modify the package_body rule in plsql.g4 as follows:
package_body
: BODY package_name (IS | AS) package_obj_body*
(BEGIN seq_of_statements | END package_name?)
{System.out.println("Package name is "+$package_name.text);}
;
Similarly, to print information about a function's arguments and return type, add println()s to the create_function_body rule. But there is an issue with printing the parameters. If you use $parameter.text, it will return the name, type specification and default value according to the parameter rule, without spaces (as a token sequence). If you add a println() to the parameter rule and use $parameter_name.text, it will print all parameters' names (including parameters of procedures, not only functions). So you can add an ANTLR return value to the parameter rule and assign $parameter_name.text to the return value:
parameter returns [String p_name]
: parameter_name (IN | OUT | INOUT | NOCOPY)*
type_spec? default_value_part?
{$p_name=$parameter_name.text;}
;
Thus, in the context of create_function_body, we can access the parameter's name via $parameter.p_name:
create_function_body
: (CREATE (OR REPLACE)?)? FUNCTION function_name
{System.out.println("Parameters of function "+$function_name.text+":");}
('(' parameter {System.out.println($parameter.p_name);}
(',' parameter {System.out.println($parameter.p_name);})* ')')?
RETURN type_spec
(invoker_rights_clause|parallel_enable_clause|result_cache_clause|DETERMINISTIC)*
((PIPELINED? (IS | AS) (DECLARE? declare_spec* body | call_spec))
| (PIPELINED | AGGREGATE) USING implementation_type_name) ';'
{System.out.println("Return type of function "
+$function_name.text+" is "
+ $type_spec.text);}
;
Accumulation
You can also save intermediate results in variables and access them as parser class members. E.g., you can accumulate the function names in a variable func_name. For this, add a @members section at the beginning of the grammar:
grammar plsql;
@members {
String func_name = "";
}
And modify function_name rule as follows:
function_name
: id ('.' id_expression)? {func_name = func_name+$id.text + " ";}
;
Using lexer and parser classes
Here is an example of an application, parse.java, that runs your parser:
import org.antlr.v4.runtime.*;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class parse {

    static String readFile(String path) throws IOException {
        byte[] encoded = Files.readAllBytes(Paths.get(path));
        return new String(encoded, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // create input stream `in`
        ANTLRInputStream in = new ANTLRInputStream(readFile(args[0]));
        // create lexer `lex` with `in` at input
        plsqlLexer lex = new plsqlLexer(in);
        // create token stream `tokens` with `lex` at input
        CommonTokenStream tokens = new CommonTokenStream(lex);
        // create parser with `tokens` at input
        plsqlParser parser = new plsqlParser(tokens);
        // call start rule of parser
        parser.sql_script();
        // print func_name
        System.out.println("Function names: " + parser.func_name);
    }
}
Compile and run
After this, generate the Java code with ANTLR:
java org.antlr.v4.Tool plsql.g4
and compile your Java code:
javac plsqlLexer.java plsqlParser.java plsqlListener.java parse.java
then run it for some .pkb file:
java parse green_tools.pkb
You can find modified parse.java, plsql.g4 and green_tools.pkb here.
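As a side note, instead of embedding actions in the grammar, you can also walk the parse tree with the listener classes ANTLR already generated for you. The sketch below assumes the class and accessor names that ANTLR derives from the rule names in plsql.g4 (plsqlBaseListener, Create_function_bodyContext, function_name(), type_spec()); check the generated sources if yours differ:
import org.antlr.v4.runtime.tree.ParseTreeWalker;

public class FunctionInfoListener extends plsqlBaseListener {
    @Override
    public void enterCreate_function_body(plsqlParser.Create_function_bodyContext ctx) {
        // called once for every create_function_body parsed in the .pkb file
        System.out.println("Function: " + ctx.function_name().getText());
        System.out.println("Return type: " + ctx.type_spec().getText());
    }
}
In parse.java you would then attach it after building the parser, e.g. ParseTreeWalker.DEFAULT.walk(new FunctionInfoListener(), parser.sql_script());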
I am trying to make some SPARQL queries using vc-db-1.rdf and q1.rq from the ARQ examples. Here is my Java code:
import com.hp.hpl.jena.rdf.model.*;
import com.hp.hpl.jena.util.FileManager;
import com.hp.hpl.jena.query.*;
import com.hp.hpl.jena.query.ARQ;
import com.hp.hpl.jena.iri.*;

import java.io.*;

public class querier extends Object {

    static final String inputFileName = "vc-db-1.rdf";

    public static void main(String args[]) {
        // Create an empty in-memory model
        Model model = ModelFactory.createDefaultModel();

        // use the FileManager to open the bloggers RDF graph from the filesystem
        InputStream in = FileManager.get().open(inputFileName);
        if (in == null) {
            throw new IllegalArgumentException("File: " + inputFileName + " not found");
        }

        // read the RDF/XML file
        model.read(in, "");

        // Create a new query
        String queryString = "PREFIX vcard: <http://www.w3.org/2001/vcard-rdf/3.0#> SELECT ?y ?givenName WHERE { ?y vcard:Family \"Smith\" . ?y vcard:Given ?givenName . }";
        QueryFactory.create(queryString);
    }
}
Compilation passes just fine.
The problem is that the query is not even executed; I am getting an error while creating it at the line
QueryFactory.create(queryString);
with the following explanation:
C:\Wallet\projects\java\ARQ_queries>java querier
Exception in thread "main" java.lang.NoSuchMethodError: com.hp.hpl.jena.iri.IRI.
resolve(Ljava/lang/String;)Lcom/hp/hpl/jena/iri/IRI;
at com.hp.hpl.jena.n3.IRIResolver.resolveGlobal(IRIResolver.java:191)
at com.hp.hpl.jena.sparql.mgt.SystemInfo.createIRI(SystemInfo.java:31)
at com.hp.hpl.jena.sparql.mgt.SystemInfo.<init>(SystemInfo.java:23)
at com.hp.hpl.jena.query.ARQ.init(ARQ.java:373)
at com.hp.hpl.jena.query.ARQ.<clinit>(ARQ.java:385)
at com.hp.hpl.jena.query.Query.<clinit>(Query.java:53)
at com.hp.hpl.jena.query.QueryFactory.create(QueryFactory.java:68)
at com.hp.hpl.jena.query.QueryFactory.create(QueryFactory.java:40)
at com.hp.hpl.jena.query.QueryFactory.create(QueryFactory.java:28)
at querier.main(querier.java:24)
How can I solve this? Thank you.
It looks like you're missing the IRI library on the classpath (the IRI library is separate from the main Jena JAR). Jena has runtime dependencies on several other libraries which are included in the lib directory of the Jena distribution. All of these need to be on your classpath at runtime (but not necessarily at compile time).
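For example (just a sketch; the exact JAR names differ between Jena versions, and this assumes the distribution's lib directory sits next to your compiled querier.class), the run command on Windows could look like this:
java -cp ".;lib/*" querier
The lib/* wildcard puts every JAR in lib on the runtime classpath, which covers the IRI library along with Jena's other dependencies; on Linux/macOS use : instead of ; as the path separator.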