I'm using Java with JDBC to run MySQL code. I want to execute a DDL script, but JDBC executes only a single statement at a time, so it cannot run a whole .sql file out of the box.
What I'm trying to do is use ANTLR4 to parse the .sql file, split it into individual statements, and then execute them one by one with JDBC.
I've gotten this far:
InputStream resourceAsStream = Main.class.getClassLoader()
.getResourceAsStream("an-arbitrary-ddl.sql");
CharStream codePointCharStream = CharStreams.fromStream(resourceAsStream);
MySqlLexer tokenSource = new MySqlLexer(new CaseChangingCharStream(codePointCharStream, true));
TokenStream tokenStream = new CommonTokenStream(tokenSource);
MySqlParser mySqlParser = new MySqlParser(tokenStream);
// Where do I go from here?
I'm sure I'm just not searching for the correct terms because I'm new to Antlr and manually parsing code. I can't find any reference from here as to what I need to do to get individual sql statements out of the MySqlParser. What do I need to do next?
A parser is not the right tool for this kind of problem. A statement splitter is fairly easy to write by hand and runs much faster. I implemented such a splitter in C++ for MySQL Workbench, and it shouldn't be difficult to port to Java. The code is very fast (1 million lines of SQL in under 1 second on an average machine); a full parser would take much longer.
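To give an idea of what a hand-written splitter can look like in Java, here is a minimal sketch. It is not a port of the MySQL Workbench code; the class name SimpleSqlSplitter and its split method are made up for illustration. It assumes the script uses the default ";" delimiter, drops comments (including MySQL's executable /*! ... */ comments), and does not handle the DELIMITER command used for stored routines.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of JDBC or ANTLR): splits a script on top-level semicolons. */
final class SimpleSqlSplitter {

    private SimpleSqlSplitter() {
    }

    static List<String> split(String script) {
        List<String> statements = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int n = script.length();
        int i = 0;
        while (i < n) {
            char c = script.charAt(i);
            if (c == '\'' || c == '"' || c == '`') {
                // Quoted string or identifier: copy it verbatim, semicolons included.
                int end = findClosingQuote(script, i, c);
                current.append(script, i, end);
                i = end;
            } else if (c == '#' || isDashComment(script, i)) {
                // Line comment: skip up to (not including) the end of the line.
                int end = script.indexOf('\n', i);
                i = (end < 0) ? n : end;
            } else if (script.startsWith("/*", i)) {
                // Block comment: skip past the closing marker.
                int end = script.indexOf("*/", i + 2);
                i = (end < 0) ? n : end + 2;
            } else if (c == ';') {
                // Top-level semicolon: the current statement is complete.
                addIfNotBlank(statements, current);
                current.setLength(0);
                i++;
            } else {
                current.append(c);
                i++;
            }
        }
        addIfNotBlank(statements, current); // trailing statement without ';'
        return statements;
    }

    // MySQL treats "--" as a comment start only when whitespace (or end of input) follows.
    private static boolean isDashComment(String s, int i) {
        return s.startsWith("--", i)
                && (i + 2 >= s.length() || Character.isWhitespace(s.charAt(i + 2)));
    }

    // Returns the index just past the closing quote, or the script length if unterminated.
    private static int findClosingQuote(String s, int start, char quote) {
        int n = s.length();
        int i = start + 1;
        while (i < n) {
            char c = s.charAt(i);
            if (c == '\\' && quote != '`') {
                i += 2; // backslash escape inside a string literal
            } else if (c == quote) {
                if (i + 1 < n && s.charAt(i + 1) == quote) {
                    i += 2; // doubled quote is an escaped quote
                } else {
                    return i + 1;
                }
            } else {
                i++;
            }
        }
        return n;
    }

    private static void addIfNotBlank(List<String> out, StringBuilder sb) {
        String statement = sb.toString().trim();
        if (!statement.isEmpty()) {
            out.add(statement);
        }
    }
}
```

Each returned string has no trailing semicolon, so it can be passed to a JDBC Statement one at a time.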
I'm sure this can be improved, but the simplest approach I found was to create a listener and pass a Consumer<String> to its constructor. The listener visits each statement and recursively rebuilds its text from the terminal nodes. There is probably a better solution, but I no longer have time to optimize this.
/**
* @author Paul Nelson Baker
* @see GitHub
* @see LinkedIn
* @since 2018-09
*/
public class SqlStatementListener extends MySqlParserBaseListener {
private final Consumer<String> sqlStatementConsumer;
public SqlStatementListener(Consumer<String> sqlStatementConsumer) {
this.sqlStatementConsumer = sqlStatementConsumer;
}
@Override
public void enterSqlStatement(MySqlParser.SqlStatementContext ctx) {
if (ctx.getChildCount() > 0) {
StringBuilder stringBuilder = new StringBuilder();
recreateStatementString(ctx.getChild(0), stringBuilder);
stringBuilder.setCharAt(stringBuilder.length() - 1, ';');
String recreatedSqlStatement = stringBuilder.toString();
sqlStatementConsumer.accept(recreatedSqlStatement);
}
super.enterSqlStatement(ctx);
}
private void recreateStatementString(ParseTree currentNode, StringBuilder stringBuilder) {
if (currentNode instanceof TerminalNode) {
stringBuilder.append(currentNode.getText());
stringBuilder.append(' ');
}
for (int i = 0; i < currentNode.getChildCount(); i++) {
recreateStatementString(currentNode.getChild(i), stringBuilder);
}
}
}
Next you need to traverse the statements. The string consumer from earlier lets you redirect the output wherever you need: it can be as simple as printing to stdout, or it can just as easily append to a list.
public List<String> mySqlStatementsFrom(String sourceCode) {
List<String> statements = new ArrayList<>();
mySqlStatementsToConsumer(sourceCode, statements::add);
return statements;
}
public void mySqlStatementsToConsumer(String sourceCode, Consumer<String> mySqlStatementConsumer) {
CharStream codePointCharStream = CharStreams.fromString(sourceCode);
MySqlLexer tokenSource = new MySqlLexer(new CaseChangingCharStream(codePointCharStream, true));
TokenStream tokenStream = new CommonTokenStream(tokenSource);
MySqlParser mySqlParser = new MySqlParser(tokenStream);
SqlStatementListener statementListener = new SqlStatementListener(mySqlStatementConsumer);
ParseTreeWalker.DEFAULT.walk(statementListener, mySqlParser.sqlStatements());
}
I'm using the model builder addon for OpenNLP to create a better NER model.
Following this post, I used the code posted by markg:
public class ModelBuilderAddonUse {
private static List<String> getSentencesFromSomewhere() throws Exception
{
List<String> list = new ArrayList<String>();
BufferedReader reader = new BufferedReader(new FileReader("D:\\Work\\workspaces\\default\\UpdateModel\\documentrequirements.docx"));
String line;
while ((line = reader.readLine()) != null)
{
list.add(line);
}
reader.close();
return list;
}
public static void main(String[] args) throws Exception {
/**
* establish a file to put sentences in
*/
File sentences = new File("D:\\Work\\workspaces\\default\\UpdateModel\\sentences.text");
/**
* establish a file to put your NER hits in (the ones you want to keep based
* on prob)
*/
File knownEntities = new File("D:\\Work\\workspaces\\default\\UpdateModel\\knownentities.txt");
/**
* establish a BLACKLIST file to put your bad NER hits in (also can be based
* on prob)
*/
File blacklistedentities = new File("D:\\Work\\workspaces\\default\\UpdateModel\\blentities.txt");
/**
* establish a file to write your annotated sentences to
*/
File annotatedSentences = new File("D:\\Work\\workspaces\\default\\UpdateModel\\annotatedSentences.txt");
/**
* establish a file to write your model to
*/
File theModel = new File("D:\\Work\\workspaces\\default\\UpdateModel\\nl-ner-person.bin");
//------------create a bunch of file writers to write your results and sentences to a file
FileWriter sentenceWriter = new FileWriter(sentences, true);
FileWriter blacklistWriter = new FileWriter(blacklistedentities, true);
FileWriter knownEntityWriter = new FileWriter(knownEntities, true);
//set some thresholds to decide where to write hits, you don't have to use these at all...
double keeperThresh = .95;
double blacklistThresh = .7;
/**
* Load your model as normal
*/
TokenNameFinderModel personModel = new TokenNameFinderModel(new File("D:\\Work\\workspaces\\default\\UpdateModel\\nl-ner-person.bin"));
NameFinderME personFinder = new NameFinderME(personModel);
/**
* do your normal NER on the sentences you have
*/
for (String s : getSentencesFromSomewhere()) {
sentenceWriter.write(s.trim() + "\n");
sentenceWriter.flush();
String[] tokens = s.split(" ");//better to use a tokenizer really
Span[] find = personFinder.find(tokens);
double[] probs = personFinder.probs();
String[] names = Span.spansToStrings(find, tokens);
for (int i = 0; i < names.length; i++) {
//YOU PROBABLY HAVE BETTER HEURISTICS THAN THIS TO MAKE SURE YOU GET GOOD HITS OUT OF THE DEFAULT MODEL
if (probs[i] > keeperThresh) {
knownEntityWriter.write(names[i].trim() + "\n");
}
if (probs[i] < blacklistThresh) {
blacklistWriter.write(names[i].trim() + "\n");
}
}
personFinder.clearAdaptiveData();
blacklistWriter.flush();
knownEntityWriter.flush();
}
//flush and close all the writers
knownEntityWriter.flush();
knownEntityWriter.close();
sentenceWriter.flush();
sentenceWriter.close();
blacklistWriter.flush();
blacklistWriter.close();
/**
* THIS IS WHERE THE ADDON IS GOING TO USE THE FILES (AS IS) TO CREATE A NEW MODEL. YOU SHOULD NOT HAVE TO RUN THE FIRST PART AGAIN AFTER THIS RUNS, JUST NOW PLAY WITH THE
* KNOWN ENTITIES AND BLACKLIST FILES AND RUN THE METHOD BELOW AGAIN UNTIL YOU GET SOME DECENT RESULTS (A DECENT MODEL OUT OF IT).
*/
DefaultModelBuilderUtil.generateModel(sentences, knownEntities, blacklistedentities, theModel, annotatedSentences, "person", 3);
}
}
It runs, but my output stops at:
annotated sentences: 1862
knowns: 58
Building Model using 1862 annotations
reading training data...
But according to the example in the post, it should go further, like this:
Indexing events using cutoff of 5
Computing event counts... done. 561755 events
Indexing... done.
Sorting and merging events... done. Reduced 561755 events to 127362.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 127362
Number of Outcomes: 3
Number of Predicates: 106490
...done.
Can anyone help me fix this problem so I can generate a model?
I have searched a lot but can't find any good documentation about it.
I would really appreciate it, thanks.
Correct the path to your training data file like this:
File sentences = new File("D:/Work/workspaces/default/UpdateModel/sentences.text");
instead of
File sentences = new File("D:\\Work\\workspaces\\default\\UpdateModel\\sentences.text");
Update
This is how I use it, with the files added to the project folder. Try it like this:
File sentences = new File("src/training/resources/CreateModel/sentences.txt");
Check my repository on GitHub for reference.
This should help.
I use Spark 2.0.1.
I am trying to find the distinct values in a JavaRDD as below:
JavaRDD<String> distinct_installedApp_Ids = filteredInstalledApp_Ids.distinct();
This line throws the following exception:
Exception in thread "main" java.lang.StackOverflowError
at org.apache.spark.rdd.RDD.checkpointRDD(RDD.scala:226)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:84)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
..........
The same stacktrace is repeated again and again.
The input filteredInstalledApp_Ids is large, with millions of records. Is the issue the number of records, or is there a more efficient way to find distinct values in a JavaRDD? Any help would be much appreciated. Thanks in advance. Cheers.
Edit 1:
Adding the filter method
JavaRDD<String> filteredInstalledApp_Ids = installedApp_Ids
.filter(new Function<String, Boolean>() {
@Override
public Boolean call(String v1) throws Exception {
return v1 != null;
}
}).cache();
Edit 2:
Added the method used to generate installedApp_Ids
public JavaRDD<String> getIdsWithInstalledApps(String inputPath, JavaSparkContext sc,
JavaRDD<String> installedApp_Ids) {
JavaRDD<String> appIdsRDD = sc.textFile(inputPath);
try {
JavaRDD<String> appIdsRDD1 = appIdsRDD.map(new Function<String, String>() {
@Override
public String call(String t) throws Exception {
String delimiter = "\t";
String[] id_Type = t.split(delimiter);
StringBuilder temp = new StringBuilder(id_Type[1]);
if ((temp.indexOf("\"")) != -1) {
String escaped = temp.toString().replace("\\", "");
escaped = escaped.replace("\"{", "{");
escaped = escaped.replace("}\"", "}");
temp = new StringBuilder(escaped);
}
// To remove empty character in the beginning of a
// string
JSONObject wholeventObj = new JSONObject(temp.toString());
JSONObject eventJsonObj = wholeventObj.getJSONObject("eventData");
int appType = eventJsonObj.getInt("appType");
if (appType == 1) {
try {
return (String.valueOf(appType));
} catch (JSONException e) {
return null;
}
}
return null;
}
}).cache();
if (installedApp_Ids != null)
return sc.union(installedApp_Ids, appIdsRDD1);
else
return appIdsRDD1;
} catch (Exception e) {
e.printStackTrace();
}
return null;
}
I assume the main dataset is in inputPath. Judging by the split on "\t" in your code, it appears to be a tab-separated file with JSON-encoded values.
I think you could make your code a bit simpler by combining Spark SQL's DataFrames with the from_json function. I'm using Scala and leave converting the code to Java as a home exercise :)
Loading the inputPath text file and parsing its lines can be as simple as the following:
import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...
// sep sets the field delimiter; the question's code splits on a tab
val dataset = spark.read.option("sep", "\t").csv(inputPath)
You can display the content using the show operator.
dataset.show(truncate = false)
You should see the JSON-encoded lines.
It appears that the JSON lines contain eventData and appType fields.
val jsons = dataset.withColumn("asJson", from_json(...))
See functions object for reference.
With JSON lines, you can select the fields of your interest:
val apptypes = jsons.select("eventData.appType")
And then union it with installedApp_Ids.
I'm sure the code gets easier to read (and hopefully to write too). The migration will also give you extra optimizations that you may or may not be able to write yourself using the assembler-like RDD API.
Best of all, filtering out nulls is as simple as using the na operator, which gives you DataFrameNaFunctions such as drop. I'm sure you'll like them.
This does not necessarily answer your initial question, but the java.lang.StackOverflowError might go away just by doing the code migration, and the code gets easier to maintain, too.
I have a large file that I am trying to parse with Antlr in Java, and I would like to show the progress.
It looked like I could do the following:
CommonTokenStream tokens = new CommonTokenStream(lexer);
int maxTokenIndex = tokens.size();
and then use maxTokenIndex in a ParseTreeListener as such:
public void exitMyRule(MyRuleContext context) {
int tokenIndex = context.start.getTokenIndex();
myReportProgress(tokenIndex, maxTokenIndex);
}
The second half of that appears to work. I get ever increasing values for tokenIndex. However, tokens.size() is returning 0. This makes it impossible to gauge how much progress I have made.
Is there a good way to get an estimate of how far along I am?
The following appears to work.
File file = getFile();
ANTLRInputStream input = new ANTLRInputStream(new FileReader(file));
ProgressMonitor progress = new ProgressMonitor(null,
"Loading " + file.getName(),
null,
0,
input.size());
Then extend MyGrammarBaseListener with
@Override
public void exitMyRule(MyRuleContext context) {
progress.setProgress(context.stop.getStopIndex());
}
I need to test the method below by calling it locally from a main method:
public TokenFilter create(TokenStream input) {
if (protectedWords != null){
input = new KeywordMarkerFilter(input,protectedWords);
}
return new KStemFilter(input);
}
The problem I'm facing is that I need to pass a string as input, but I'm not sure how to turn it into a token stream.
Please help.
To get a TokenStream from search text, you have to use an Analyzer:
Analyzer analyzer = ...; // your analyzer
TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(searchText));
Note that it should be the same analyzer that is used to build the index.
I am just starting out with Java and TestNG test cases.
I need to write a class that reads data from a file, builds an in-memory data structure, and uses this data structure for further processing. I would like to test whether this DS is being populated correctly. This would call for dumping the DS into a file and then comparing the input file with the dumped file. Is there a TestNG assert available for file matching? Is this a common practice?
I think it would be better to compare the data itself, not the written-out data.
So I would write a method in the class that returns this data structure (let's call it getDataStructure()) and then write a unit test that compares it with the correct data.
This only requires a correct equals() method in your data structure class; then you can do:
Assert.assertEquals(yourClass.getDataStructure(), correctData);
Of course if you need to write out the data structure to a file, then you can test the serialization and deserialization separately.
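As an illustration of what "a correct equals() method" might look like, here is a minimal sketch. The class WordIndex and its add method are hypothetical stand-ins for your own data structure; the point is only that equals() and hashCode() are both derived from the same state, so assertEquals can compare two instances by content.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical data structure, standing in for the class under test. */
final class WordIndex {
    private final List<String> words = new ArrayList<>();

    void add(String word) {
        words.add(word);
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof WordIndex)) {
            return false;
        }
        // Two indexes are equal when they hold the same words in the same order.
        return words.equals(((WordIndex) other).words);
    }

    @Override
    public int hashCode() {
        // Must be consistent with equals(): derived from the same field.
        return words.hashCode();
    }

    @Override
    public String toString() {
        // A readable toString() makes assertion failure messages useful.
        return "WordIndex" + words;
    }
}
```

With this in place, the test only has to build the expected instance by hand and compare it with the one produced from the file.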
File compare/matching can be extracted to a utility method or something like that.
If you need it only for testing, there are add-ons for JUnit:
http://junit-addons.sourceforge.net/junitx/framework/FileAssert.html
If you need to compare files outside the testing environment, you can use this simple function:
public static boolean fileContentEquals(String filePathA, String filePathB) throws Exception {
if (!compareFilesLength(filePathA, filePathB)) return false;
BufferedInputStream streamA = null;
BufferedInputStream streamB = null;
try {
File fileA = new File(filePathA);
File fileB = new File(filePathB);
streamA = new BufferedInputStream(new FileInputStream(fileA));
streamB = new BufferedInputStream(new FileInputStream(fileB));
int chunkSizeInBytes = 16384;
byte[] bufferA = new byte[chunkSizeInBytes];
byte[] bufferB = new byte[chunkSizeInBytes];
int totalReadBytes = 0;
while (totalReadBytes < fileA.length()) {
int readBytes = streamA.read(bufferA);
streamB.read(bufferB);
// read() returns -1 at end of stream, so stop on any non-positive count
if (readBytes <= 0) break;
MessageDigest digestA = MessageDigest.getInstance(CHECKSUM_ALGORITHM);
MessageDigest digestB = MessageDigest.getInstance(CHECKSUM_ALGORITHM);
digestA.update(bufferA, 0, readBytes);
digestB.update(bufferB, 0, readBytes);
if (!MessageDigest.isEqual(digestA.digest(), digestB.digest()))
{
closeStreams(streamA, streamB);
return false;
}
totalReadBytes += readBytes;
}
closeStreams(streamA, streamB);
return true;
} finally {
closeStreams(streamA, streamB);
}
}
public static void closeStreams(Closeable ...streams) {
for (int i = 0; i < streams.length; i++) {
Closeable stream = streams[i];
closeStream(stream);
}
}
public static boolean compareFilesLength(String filePathA, String filePathB) {
File fileA = new File(filePathA);
File fileB = new File(filePathB);
return fileA.length() == fileB.length();
}
private static void closeStream(Closeable stream) {
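// a stream is still null if opening it failed before the finally block ran
if (stream == null) return;
try {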
stream.close();
} catch (IOException e) {
// ignore exception
}
}
Your choice, but having a reusable utility class with that functionality is better, imho.
Good luck and have fun.
Personally I would do the opposite. Surely you need a way to compare two of these data structures in the Java world, so the test would read from the file, build the DS, do its processing, and then assert that it's equal to an "expected" DS you set up in your test.
(using JUnit4)
@Test
public void testProcessingDoesWhatItShould() {
final DataStructure original = readFromFile(filename);
final DataStructure actual = doTheProcessingYouNeedToDo(original);
final DataStructure expected = generateMyExpectedResult();
Assert.assertEquals("data structure", expected, actual);
}
If this DS is a simple Java Bean, then you can use EqualsBuilder from Apache Commons to compare two objects.
Compare the bytes loaded from the file system with the bytes you are going to write to the file system.
Pseudocode:
byte[] loadedBytes = loadFileContentFromFile(file) // maybe apache commons IOUtils.toByteArray(InputStream input)
byte[] writeBytes = constructBytesFromDataStructure(dataStructure)
Assert.assertTrue(java.util.Arrays.equals(writeBytes ,loadedBytes));
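A runnable version of the same idea, using only the JDK, might look like the sketch below. The class name FileBytes and the method sameContent are made up for illustration, and it assumes both files are small enough to load fully into memory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper: compares two files byte for byte. */
final class FileBytes {

    private FileBytes() {
    }

    static boolean sameContent(Path first, Path second) throws IOException {
        // Reads both files completely; suitable for test fixtures, not for very large files.
        byte[] a = Files.readAllBytes(first);
        byte[] b = Files.readAllBytes(second);
        return Arrays.equals(a, b);
    }
}
```

In a test this reduces to a single assertion, for example Assert.assertTrue(FileBytes.sameContent(inputPath, dumpedPath)), where inputPath and dumpedPath are whatever paths your test uses.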