I need to test the method below by calling it locally from a main method:
public TokenFilter create(TokenStream input) {
    if (protectedWords != null) {
        input = new KeywordMarkerFilter(input, protectedWords);
    }
    return new KStemFilter(input);
}
The problem I'm facing is I need to pass a string as input, but I'm not sure how to parse it as a token stream.
Please help.
To get a TokenStream from search text, you have to use an Analyzer:
Analyzer analyzer = ...; // your analyzer
TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(searchText));
Note that it should be the same analyzer that is used to build the index.
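A minimal sketch of a main method that wires this together, assuming the create(TokenStream) method lives on a class you can instantiate (MyFilterFactory below is a placeholder for that class, and StandardAnalyzer is just an example analyzer):

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class FilterSmokeTest {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(); // ideally the same analyzer used for indexing
        TokenStream input = analyzer.tokenStream(null, new StringReader("running jumped swimming"));

        // MyFilterFactory is a placeholder for whatever class holds create(TokenStream)
        TokenStream filtered = new MyFilterFactory().create(input);

        CharTermAttribute term = filtered.addAttribute(CharTermAttribute.class);
        filtered.reset();                     // mandatory before the first incrementToken()
        while (filtered.incrementToken()) {
            System.out.println(term.toString());
        }
        filtered.end();
        filtered.close();
        analyzer.close();
    }
}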
I'm starting to work with Apache Lucene 8.0. I would like to know how to convert my String text variable to lowercase using Lucene. I'm not really sure how to do it because I couldn't find any examples. What I want would be something like this:
public class DocumentLowercase {

    private Analyzer analyzer;

    public Analyzer DocAnalysis(Document d) {
        analyzer = new StandardAnalyzer();
        String text = d.text();
        // ** Here convert String text into lowercase **
        // ** maybe using LowerCaseTokenizer? But how? **
        return analyzer;
    }
}
StandardAnalyzer already converts everything to lower case!
Check the docs here: http://lucene.apache.org/core/8_0_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html
They say:
Filters StandardTokenizer with LowerCaseFilter and StopFilter, using a
configurable list of stop words.
You can also see in the source code which components a StandardAnalyzer includes:
@Override
protected TokenStreamComponents createComponents(final String fieldName) {
    final StandardTokenizer src = new StandardTokenizer();
    src.setMaxTokenLength(maxTokenLength);
    TokenStream tok = new LowerCaseFilter(src);
    tok = new StopFilter(tok, stopwords);
    return new TokenStreamComponents(r -> {
        src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
        src.setReader(r);
    }, tok);
}
If you want to customize your analyzer anyway, you should look into CustomAnalyzer.
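For instance, a minimal sketch of a CustomAnalyzer that only tokenizes and lower-cases could look like this (the factory names "standard" and "lowercase" are the usual SPI names; verify them against your Lucene version):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

// StandardTokenizer followed by LowerCaseFilter, nothing else
Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("standard")
        .addTokenFilter("lowercase")
        .build();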
I'm using Java with JDBC to run MySql code. I want to execute a DDL script, but JDBC can only execute a single statement at a time, which makes it unsuitable to execute a whole .sql file out of the box.
What I'm trying to do is use Antlr4 to parse the .sql file so I can break up each individual statement and then iteratively execute them with JDBC.
I've gotten this far:
InputStream resourceAsStream = Main.class.getClassLoader()
.getResourceAsStream("an-arbitrary-ddl.sql");
CharStream codePointCharStream = CharStreams.fromStream(resourceAsStream);
MySqlLexer tokenSource = new MySqlLexer(new CaseChangingCharStream(codePointCharStream, true));
TokenStream tokenStream = new CommonTokenStream(tokenSource);
MySqlParser mySqlParser = new MySqlParser(tokenStream);
// Where do I go from here?
I'm sure I'm just not searching for the correct terms because I'm new to Antlr and to parsing code manually. I can't find any reference explaining how to get individual SQL statements out of the MySqlParser. What do I need to do next?
A parser is not the right tool for this kind of problem. A statement splitter is pretty easy to write manually and much faster if you do it yourself. I implemented such a splitter in C++ in MySQL Workbench. It shouldn't be difficult to port it to Java. The code is very fast (1 million lines of SQL code in under 1 second on an average machine). A parser would need much longer.
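For illustration, a very simplified splitter along those lines might look like the sketch below. It only handles single-quoted, double-quoted, and backtick-quoted strings plus "--" line comments, and deliberately ignores MySQL specifics such as /* */ comments, escaped quotes, and DELIMITER changes:

import java.util.ArrayList;
import java.util.List;

public final class SimpleStatementSplitter {

    public static List<String> split(String script) {
        List<String> statements = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char quote = 0;                         // current quote char (', " or `); 0 when outside a string
        for (int i = 0; i < script.length(); i++) {
            char c = script.charAt(i);
            if (quote != 0) {                   // inside a quoted string or identifier
                current.append(c);
                if (c == quote) {
                    quote = 0;
                }
            } else if (c == '\'' || c == '"' || c == '`') {
                quote = c;
                current.append(c);
            } else if (c == '-' && i + 1 < script.length() && script.charAt(i + 1) == '-') {
                while (i < script.length() && script.charAt(i) != '\n') {
                    i++;                        // skip the rest of a "--" comment line
                }
            } else if (c == ';') {
                String statement = current.toString().trim();
                if (!statement.isEmpty()) {
                    statements.add(statement);
                }
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        String tail = current.toString().trim();
        if (!tail.isEmpty()) {
            statements.add(tail);
        }
        return statements;
    }
}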
I'm sure this can be improved; the simplest way I could find was to create a listener and provide its constructor with a Consumer&lt;String&gt; object. The listener looks at individual statements and recursively reconstructs them. There is probably a more optimal solution, but I no longer have time to try to optimize this.
/**
 * @author Paul Nelson Baker
 * @see GitHub
 * @see LinkedIn
 * @since 2018-09
 */
public class SqlStatementListener extends MySqlParserBaseListener {

    private final Consumer<String> sqlStatementConsumer;

    public SqlStatementListener(Consumer<String> sqlStatementConsumer) {
        this.sqlStatementConsumer = sqlStatementConsumer;
    }

    @Override
    public void enterSqlStatement(MySqlParser.SqlStatementContext ctx) {
        if (ctx.getChildCount() > 0) {
            StringBuilder stringBuilder = new StringBuilder();
            recreateStatementString(ctx.getChild(0), stringBuilder);
            stringBuilder.setCharAt(stringBuilder.length() - 1, ';');
            String recreatedSqlStatement = stringBuilder.toString();
            sqlStatementConsumer.accept(recreatedSqlStatement);
        }
        super.enterSqlStatement(ctx);
    }

    private void recreateStatementString(ParseTree currentNode, StringBuilder stringBuilder) {
        if (currentNode instanceof TerminalNode) {
            stringBuilder.append(currentNode.getText());
            stringBuilder.append(' ');
        }
        for (int i = 0; i < currentNode.getChildCount(); i++) {
            recreateStatementString(currentNode.getChild(i), stringBuilder);
        }
    }
}
Next you need to traverse the statements; the string consumer from earlier lets you lazily redirect the output wherever you need. This can be as simple as printing to stdout, but it can just as easily be used to append to a list.
public List<String> mySqlStatementsFrom(String sourceCode) {
    List<String> statements = new ArrayList<>();
    mySqlStatementsToConsumer(sourceCode, statements::add);
    return statements;
}

public void mySqlStatementsToConsumer(String sourceCode, Consumer<String> mySqlStatementConsumer) {
    CharStream codePointCharStream = CharStreams.fromString(sourceCode);
    MySqlLexer tokenSource = new MySqlLexer(new CaseChangingCharStream(codePointCharStream, true));
    TokenStream tokenStream = new CommonTokenStream(tokenSource);
    MySqlParser mySqlParser = new MySqlParser(tokenStream);
    SqlStatementListener statementListener = new SqlStatementListener(mySqlStatementConsumer);
    ParseTreeWalker.DEFAULT.walk(statementListener, mySqlParser.sqlStatements());
}
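To wire it up with JDBC as intended in the question, something along these lines should work, assuming it runs in the same class as the methods above and inside a method that handles the checked exceptions; the file name, connection URL, and credentials are placeholders:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// placeholders: adjust the path, URL, and credentials to your environment
String ddl = new String(Files.readAllBytes(Paths.get("an-arbitrary-ddl.sql")), StandardCharsets.UTF_8);
try (Connection connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/mydb", "user", "password");
     Statement statement = connection.createStatement()) {
    for (String sql : mySqlStatementsFrom(ddl)) {
        statement.execute(sql);   // run each extracted statement in order
    }
}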
I use Spark 2.0.1.
I am trying to find distinct values in a JavaRDD as below
JavaRDD<String> distinct_installedApp_Ids = filteredInstalledApp_Ids.distinct();
This line throws the exception below:
Exception in thread "main" java.lang.StackOverflowError
at org.apache.spark.rdd.RDD.checkpointRDD(RDD.scala:226)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:84)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84)
..........
The same stacktrace is repeated again and again.
The input filteredInstalledApp_Ids is large, with millions of records. Could the issue be the number of records, or is there a more efficient way to find distinct values in a JavaRDD? Any help would be much appreciated. Thanks in advance. Cheers.
Edit 1:
Adding the filter method
JavaRDD<String> filteredInstalledApp_Ids = installedApp_Ids
        .filter(new Function<String, Boolean>() {
            @Override
            public Boolean call(String v1) throws Exception {
                return v1 != null;
            }
        }).cache();
Edit 2:
Added the method used to generate installedApp_Ids
public JavaRDD<String> getIdsWithInstalledApps(String inputPath, JavaSparkContext sc,
        JavaRDD<String> installedApp_Ids) {

    JavaRDD<String> appIdsRDD = sc.textFile(inputPath);
    try {
        JavaRDD<String> appIdsRDD1 = appIdsRDD.map(new Function<String, String>() {
            @Override
            public String call(String t) throws Exception {
                String delimiter = "\t";
                String[] id_Type = t.split(delimiter);
                StringBuilder temp = new StringBuilder(id_Type[1]);
                if ((temp.indexOf("\"")) != -1) {
                    String escaped = temp.toString().replace("\\", "");
                    escaped = escaped.replace("\"{", "{");
                    escaped = escaped.replace("}\"", "}");
                    temp = new StringBuilder(escaped);
                }
                // To remove empty character in the beginning of a string
                JSONObject wholeventObj = new JSONObject(temp.toString());
                JSONObject eventJsonObj = wholeventObj.getJSONObject("eventData");
                int appType = eventJsonObj.getInt("appType");
                if (appType == 1) {
                    try {
                        return (String.valueOf(appType));
                    } catch (JSONException e) {
                        return null;
                    }
                }
                return null;
            }
        }).cache();
        if (installedApp_Ids != null)
            return sc.union(installedApp_Ids, appIdsRDD1);
        else
            return appIdsRDD1;
    } catch (Exception e) {
        e.printStackTrace();
    }
    return null;
}
I assume the main dataset is in inputPath. From the parsing code it appears to be a tab-separated file with JSON-encoded values.
I think you could make your code a bit simpler by combining Spark SQL's DataFrames with the from_json function. I'm using Scala and leave converting the code to Java as a home exercise :)
The lines where you load the inputPath text file and the line parsing itself can be as simple as the following:
import org.apache.spark.sql.SparkSession
val spark: SparkSession = ...
val dataset = spark.read.option("sep", "\t").csv(inputPath)
You can display the content using the show operator.
dataset.show(truncate = false)
You should see the JSON-encoded lines.
It appears that the JSON lines contain eventData and appType fields.
val jsons = dataset.withColumn("asJson", from_json(...))
See the functions object for reference.
With JSON lines, you can select the fields of your interest:
val apptypes = jsons.select("eventData.appType")
And then union it with installedApp_Ids.
I'm sure the code gets easier to read (and hopefully to write too). The migration will also give you extra optimizations that you may or may not be able to write yourself using the assembler-like RDD API.
And the best part is that filtering out nulls is as simple as using the na operator, which gives you DataFrameNaFunctions like drop. I'm sure you'll like them.
It does not necessarily answer your initial question, but this java.lang.StackOverflowError might go away just by doing the code migration, and the code will get easier to maintain, too.
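If you want a rough Java starting point for the same idea, it could look something like the sketch below. The payload column name _c1 and the JSON schema are assumptions about the input file, and from_json is only available in Spark releases newer than the 2.0.1 mentioned in the question:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class AppTypeExtractor {
    public static Dataset<Row> appTypes(SparkSession spark, String inputPath) {
        // assumed schema of the JSON payload: { "eventData": { "appType": <int> } }
        StructType schema = new StructType()
                .add("eventData", new StructType().add("appType", DataTypes.IntegerType));

        return spark.read()
                .option("sep", "\t")                       // the question's file is tab-separated
                .csv(inputPath)
                .withColumn("asJson", from_json(col("_c1"), schema))
                .select(col("asJson.eventData.appType"))
                .na().drop();                              // drop rows where the JSON could not be parsed
    }
}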
StandardAnalyzer treats the space character as a token separator, but I want StandardAnalyzer not to split tokens on the space character. How can I override the tokenizer of StandardAnalyzer? If that's not possible, please suggest another Analyzer, with an example, that does not split on the space character.
This code can help you:
Analyzer ana = new StandardAnalyzer(LUCENE_30, Collections.emptySet());
Note that, the answer is version-dependent. For Lucene 4.0, use:
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40, CharArraySet.EMPTY_SET);
Edit:
Constructs a StandardTokenizer filtered by a StandardFilter, a org.apache.lucene.analysis.LowerCaseFilter and a org.apache.lucene.analysis.StopFilter.
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
    StandardTokenizer tokenStream = new StandardTokenizer(matchVersion, reader);
    tokenStream.setMaxTokenLength(maxTokenLength);
    TokenStream result = new StandardFilter(tokenStream);
    result = new LowerCaseFilter(result);
    result = new StopFilter(enableStopPositionIncrements, result, stopSet);
    return result;
}

private static final class SavedStreams {
    StandardTokenizer tokenStream;
    TokenStream filteredTokenStream;
}
Well, I replaced StandardAnalyzer with KeywordAnalyzer, so this will be used for indexing and searching ... Then in the search method I added these lines:
parser.setDefaultOperator(Operator.AND);
if (searchWord.contains(" ")) {
    searchWord = searchWord.replace(" ", "?");
}
I need to implement email confirmation in my Java web application. I am stuck on the email I have to send to the user.
I need to combine a template (of a confirmation email) with the User object, and this will be the HTML content of the confirmation email.
I thought about using XSLT as the template engine, but I don't have an XML form of the User object and I don't really know how to create XML from a User instance.
I thought about JSP, but how do I render a JSP page with an object and get the HTML as a result?
Any idea what packages I can use to create a template and combine it with an object?
I have used the following before. I seem to recall it wasn't complicated:
http://velocity.apache.org/
How complex is the user object? If it's just five string-valued fields (say) you could simply supply these as string parameters to the transformation, avoiding the need to build XML from your Java data.
Alternatively, Java XSLT processors typically provide some way to invoke methods on Java objects from within the XSLT code. So you could supply the Java object as a parameter to the stylesheet and invoke its methods using extension functions. The details are processor-specific.
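As a sketch of the first option with plain JAXP, passing the fields as stylesheet parameters: the field names, template path, and the user variable are hypothetical, and the stylesheet is assumed to declare matching xsl:param elements.

import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// hypothetical User object with a couple of string-valued fields
Transformer transformer = TransformerFactory.newInstance()
        .newTransformer(new StreamSource("confirmation-email.xsl"));
transformer.setParameter("firstName", user.getFirstName());
transformer.setParameter("email", user.getEmail());

// the stylesheet still needs some input document; an empty element is enough
// when all the data arrives through parameters
StringWriter html = new StringWriter();
transformer.transform(new StreamSource(new StringReader("<root/>")), new StreamResult(html));
String emailBody = html.toString();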
Instead of learning new code and debugging someone else's complicated code, I decided to write my own small, suitable util:
public class StringTemplate {

    private String filePath;
    private String charsetName;
    private Collection<AbstractMap.SimpleEntry<String, String>> args;

    public StringTemplate(String filePath, String charsetName,
            Collection<AbstractMap.SimpleEntry<String, String>> args) {
        this.filePath = filePath;
        this.charsetName = charsetName;
        this.args = args;
    }

    public String generate() throws FileNotFoundException, IOException {
        StringBuilder builder = new StringBuilder();
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                getClass().getResourceAsStream(filePath), charsetName));
        try {
            String line = null;
            while ((line = reader.readLine()) != null) {
                builder.append(line);
                builder.append(System.getProperty("line.separator"));
            }
        } finally {
            reader.close();
        }
        // replace every occurrence of each key with its value
        for (AbstractMap.SimpleEntry<String, String> arg : this.args) {
            int index = builder.indexOf(arg.getKey());
            while (index != -1) {
                builder.replace(index, index + arg.getKey().length(), arg.getValue());
                index += arg.getValue().length();
                index = builder.indexOf(arg.getKey(), index);
            }
        }
        return builder.toString();
    }
}
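A hypothetical usage of this util, assuming the template file lives on the classpath and uses ${...} markers chosen by the caller; the placeholder names, template path, and user accessor methods are made up for illustration:

import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;

// keys are whatever literal markers appear in the template file
List<AbstractMap.SimpleEntry<String, String>> args = Arrays.asList(
        new AbstractMap.SimpleEntry<>("${userName}", user.getName()),
        new AbstractMap.SimpleEntry<>("${confirmationLink}", confirmationUrl));

String emailHtml = new StringTemplate("/templates/confirmation-email.html", "UTF-8", args).generate();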