Remove Space Character from Lucene Standard Analyzer - java

StandardAnalyzer treats the space character as a token separator, but I do not want it to split tokens on spaces. How can I override the tokenizer of StandardAnalyzer? If that is not possible, please suggest another Analyzer, with an example, that does not treat the space character as a token boundary.

This code can help you:
Analyzer ana = new StandardAnalyzer(LUCENE_30, Collections.emptySet());
Note that the answer is version-dependent. For Lucene 4.0, use:
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40, CharArraySet.EMPTY_SET);
Edit:
Constructs a StandardTokenizer filtered by a StandardFilter, a org.apache.lucene.analysis.LowerCaseFilter and a org.apache.lucene.analysis.StopFilter.
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
    StandardTokenizer tokenStream = new StandardTokenizer(matchVersion, reader);
    tokenStream.setMaxTokenLength(maxTokenLength);
    TokenStream result = new StandardFilter(tokenStream);
    result = new LowerCaseFilter(result);
    result = new StopFilter(enableStopPositionIncrements, result, stopSet);
    return result;
}

private static final class SavedStreams {
    StandardTokenizer tokenStream;
    TokenStream filteredTokenStream;
}
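If the goal is an analyzer that never splits on the space character, here is a minimal sketch, assuming Lucene 7+ (where createComponents takes only the field name) and a hypothetical class name. It relies on KeywordTokenizer, which emits the entire input as a single token:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordTokenizer;

// Sketch: KeywordTokenizer never splits, so spaces stay inside the token.
public class NoSpaceSplitAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        KeywordTokenizer src = new KeywordTokenizer();
        TokenStream result = new LowerCaseFilter(src);
        return new TokenStreamComponents(src, result);
    }
}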

Well, I replaced StandardAnalyzer with KeywordAnalyzer, so the same analyzer is used for indexing and searching ... Then in the search method I added these lines:
parser.setDefaultOperator(Operator.AND);
if (searchWord.contains(" ")) {
    searchWord = searchWord.replace(" ", "?");
}
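For context, a minimal sketch of the search side, assuming the classic QueryParser and a hypothetical field name "content"; each space is replaced with the '?' wildcard, which matches exactly one character:

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class WildcardSearchExample {
    public static void main(String[] args) throws ParseException {
        QueryParser parser = new QueryParser("content", new KeywordAnalyzer());
        parser.setDefaultOperator(QueryParser.Operator.AND);
        String searchWord = "hello world";
        if (searchWord.contains(" ")) {
            searchWord = searchWord.replace(" ", "?"); // '?' matches a single character
        }
        Query query = parser.parse(searchWord);
        System.out.println(query);
    }
}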

Related

Antlr pattern matching and lexer modes

I am trying to compile a pattern for an HTML grammar. The code below shows how to parse a string matching the htmlAttribute rule:
String code = "href=\"val\"";
CharStream chars = CharStreams.fromString(code);
Lexer lexer = new HTMLLexer(chars);
lexer.pushMode(HTMLLexer.TAG);
TokenStream tokens = new CommonTokenStream(lexer);
HTMLParser parser = new HTMLParser(tokens);
parser.htmlAttribute();
But when I try:
ParseTreePatternMatcher matcher = new ParseTreePatternMatcher(lexer, parser);
matcher.compile(code, HTMLParser.RULE_htmlAttribute);
it fails with this error:
line 1:0 no viable alternative at input 'href="val"'
org.antlr.v4.runtime.NoViableAltException
at org.antlr.v4.runtime.atn.ParserATNSimulator.noViableAlt(ParserATNSimulator.java:2026)
at org.antlr.v4.runtime.atn.ParserATNSimulator.execATN(ParserATNSimulator.java:467)
at org.antlr.v4.runtime.atn.ParserATNSimulator.adaptivePredict(ParserATNSimulator.java:393)
at org.antlr.v4.runtime.ParserInterpreter.visitDecisionState(ParserInterpreter.java:316)
at org.antlr.v4.runtime.ParserInterpreter.visitState(ParserInterpreter.java:223)
at org.antlr.v4.runtime.ParserInterpreter.parse(ParserInterpreter.java:194)
at org.antlr.v4.runtime.tree.pattern.ParseTreePatternMatcher.compile(ParseTreePatternMatcher.java:205)
When I tried:
List<? extends Token> tokenList = matcher.tokenize(code);
The result contained a single token, the same as when using the lexer with DEFAULT_MODE. Is there some way to fix this?
The problem was the following code from ParseTreePatternMatcher::tokenize:
TextChunk textChunk = (TextChunk)chunk;
ANTLRInputStream in = new ANTLRInputStream(textChunk.getText());
lexer.setInputStream(in);
Token t = lexer.nextToken();
Lexer::setInputStream clears _modeStack and sets _mode to 0. One possible solution is to extend ParseTreePatternMatcher, override method tokenize and insert lexer.pushMode(lexerMode) after lexer.setInputStream(in):
TextChunk textChunk = (TextChunk)chunk;
ANTLRInputStream in = new ANTLRInputStream(textChunk.getText());
lexer.setInputStream(in);
lexer.pushMode(lexerMode);
Token t = lexer.nextToken();
But the tokenize method uses Chunk and TextChunk, which cannot be accessed from outside the package, so we are forced to define the extension class in the same package as ParseTreePatternMatcher.
Another solution I am considering is to modify the byte code of the method using ASM.
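An alternative that avoids both the package restriction and bytecode rewriting is the following untested sketch: subclass the generated lexer and re-push the required mode whenever the input stream is reset, since ParseTreePatternMatcher::tokenize calls setInputStream for every text chunk:

Lexer patternLexer = new HTMLLexer(CharStreams.fromString("")) {
    @Override
    public void setInputStream(IntStream input) {
        // setInputStream resets _mode and _modeStack, so restore the mode here
        super.setInputStream(input);
        pushMode(HTMLLexer.TAG);
    }
};
ParseTreePatternMatcher matcher = new ParseTreePatternMatcher(patternLexer, parser);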

Search pattern within String in JAVA

I'm using PDFBox in Java and successfully read a PDF. Now I want to search for a specific word and retrieve only the number that follows it. To be concrete, I want to search for Tax and retrieve the number that follows it. The two strings appear to be separated by a tab.
My code currently looks like this:
File file = new File("yes.pdf");
try {
    PDDocument document = PDDocument.load(file);
    PDFTextStripper pdfStripper = new PDFTextStripper();
    String text = pdfStripper.getText(document);
    System.out.println(text);
    // search for the word "Tax"
    // retrieve the number after the word "Tax"
    document.close();
} catch (IOException e) {
    e.printStackTrace();
}
I have used a similar approach in my project. I hope it helps.
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractNumber {
    public static void main(String[] args) throws IOException {
        PDDocument doc = PDDocument.load(new File("yourFile location"));
        PDFTextStripper stripper = new PDFTextStripper();
        List<String> digitList = new ArrayList<String>();
        // read the text from the pdf
        String string = stripper.getText(doc);
        // digits preceded by a letter
        Pattern mainPattern = Pattern.compile("[a-zA-Z]\\d+");
        Matcher mainMatcher = mainPattern.matcher(string);
        while (mainMatcher.find()) {
            // keep only the digits of each match
            Pattern subPattern = Pattern.compile("\\d+");
            String subText = mainMatcher.group();
            Matcher subMatcher = subPattern.matcher(subText);
            subMatcher.find();
            digitList.add(subMatcher.group());
        }
        if (doc != null) {
            doc.close();
        }
        if (digitList != null && digitList.size() > 0) {
            for (String digit : digitList) {
                System.out.println(digit);
            }
        }
    }
}
The regular expression [a-zA-Z]\d+ finds runs of one or more digits preceded by a letter in the PDF text.
The \d+ expression then extracts just the digits from each match.
You can also use a different regular expression to match a specific number of digits.
You can get more ideas from this tutorial.
The best way to do something like this is to use regular expressions. I often use this tool to write my regular expressions. Your regex should probably look something like tax\s([0-9]+). You can take a look at this tutorial on how to use regex in Java.
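A minimal sketch applying that regex, where the sample text and the case-insensitive (?i) flag are assumptions standing in for the output of pdfStripper.getText(document), and \s+ allows a tab or several spaces:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TaxExtractor {
    public static void main(String[] args) {
        // sample text standing in for the stripped PDF content
        String text = "Subtotal\t100\nTax\t25\nTotal\t125";
        // (?i) makes the match case-insensitive; group 1 captures the digits
        Pattern p = Pattern.compile("(?i)tax\\s+([0-9]+)");
        Matcher m = p.matcher(text);
        if (m.find()) {
            System.out.println(m.group(1)); // prints 25
        }
    }
}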

How to apply LowerCase to a String using Lucene

I'm starting to work with Apache Lucene 8.0. I would like to know how to convert my String text variable into lowercase using Lucene. I'm not really sure how to do it because I couldn't find any examples. What I want would be something like this:
public class DocumentLowercase {
    private Analyzer analyzer;

    public Analyzer DocAnalysis(Document d) {
        analyzer = new StandardAnalyzer();
        String text = d.text();
        // here convert the String text into lowercase
        // maybe using LowerCaseTokenizer? But how?
        return analyzer;
    }
}
StandardAnalyzer already converts everything to lower case!
Check the docs here: http://lucene.apache.org/core/8_0_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html
They say:
Filters StandardTokenizer with LowerCaseFilter and StopFilter, using a
configurable list of stop words.
You can also see in the source code which components a StandardAnalyzer includes:
@Override
protected TokenStreamComponents createComponents(final String fieldName) {
    final StandardTokenizer src = new StandardTokenizer();
    src.setMaxTokenLength(maxTokenLength);
    TokenStream tok = new LowerCaseFilter(src);
    tok = new StopFilter(tok, stopwords);
    return new TokenStreamComponents(r -> {
        src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
        src.setReader(r);
    }, tok);
}
If you want to customize your analyzer anyway you should look into CustomAnalyzer.
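For example, a hedged sketch of a CustomAnalyzer that keeps the standard tokenizer and lower-casing but omits the stop filter, assuming the analyzers-common module is on the classpath; "standard" and "lowercase" are the registered names of the StandardTokenizer and LowerCaseFilter factories:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

public class CustomAnalyzerExample {
    public static void main(String[] args) throws IOException {
        // tokenize with StandardTokenizer, then lower-case; no stop words
        Analyzer analyzer = CustomAnalyzer.builder()
                .withTokenizer("standard")
                .addTokenFilter("lowercase")
                .build();
        System.out.println(analyzer);
    }
}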

How to rename Columns via Lambda function - fasterXML

I'm using the FasterXML library to parse my CSV file. The CSV file has the column names in its first line. Unfortunately I need the columns to be renamed. I have a lambda function for this, where I can pass the value read from the CSV file in and get the new value.
My code looks like this, but it does not work:
CsvSchema csvSchema = CsvSchema.emptySchema().withHeader();
ArrayList<HashMap<String, String>> result = new ArrayList<HashMap<String, String>>();
MappingIterator<HashMap<String, String>> it = new CsvMapper().reader(HashMap.class)
        .with(csvSchema)
        .readValues(new File(fileName));
while (it.hasNext())
    result.add(it.next());
System.out.println("changing the schema columns.");
for (int i = 0; i < csvSchema.size(); i++) {
    String name = csvSchema.column(i).getName();
    String newName = getNewName(name);
    csvSchema.builder().renameColumn(i, newName);
}
csvSchema.rebuild();
When I try to print out the columns later, they are still the same as in the first line of my CSV file.
Additionally, I noticed that csvSchema.size() equals 0. Why?
You could instead use uniVocity-parsers for that. The following solution streams the input rows to the output, so you don't need to load everything into memory and then write your data back with new headers. It will be much faster:
public static void main(String... args) throws Exception {
    Writer output = new StringWriter(); // use a FileWriter for your case

    CsvWriterSettings writerSettings = new CsvWriterSettings(); // many options here - check the documentation
    final CsvWriter writer = new CsvWriter(output, writerSettings);

    CsvParserSettings parserSettings = new CsvParserSettings(); // many options here as well
    parserSettings.setHeaderExtractionEnabled(true); // indicates the first row of the input are headers

    parserSettings.setRowProcessor(new AbstractRowProcessor() {
        @Override
        public void processStarted(ParsingContext context) {
            writer.writeHeaders("Column A", "Column B", "... etc");
        }

        @Override
        public void rowProcessed(String[] row, ParsingContext context) {
            writer.writeRow(row);
        }

        @Override
        public void processEnded(ParsingContext context) {
            writer.close();
        }
    });

    CsvParser parser = new CsvParser(parserSettings);

    Reader reader = new StringReader("A,B,C\n1,2,3\n4,5,6"); // use a FileReader for your case
    parser.parse(reader); // all rows are parsed and submitted to the RowProcessor implementation of the parserSettings

    System.out.println(output.toString());
    // nothing else to do. All resources are closed automatically in case of errors.
}
You can easily select the columns by using parserSettings.selectFields("B", "A") in case you want to reorder/eliminate columns.
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
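If you want to stay with Jackson, note that CsvSchema instances are immutable: rebuild() and the builder calls return new objects rather than modifying the schema in place, and a schema created with emptySchema().withHeader() contains no column definitions, which is why csvSchema.size() is 0. A hedged sketch of one way to build a renamed copy, assuming oldSchema is a schema that already has columns and getNewName is the renaming function from the question:

CsvSchema.Builder builder = new CsvSchema.Builder();
// CsvSchema is iterable over its columns
for (CsvSchema.Column col : oldSchema) {
    builder.addColumn(getNewName(col.getName()), col.getType());
}
CsvSchema renamed = builder.setUseHeader(true).build();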

running java application

I need to test the method below by calling it locally from a main method:
public TokenFilter create(TokenStream input) {
    if (protectedWords != null) {
        input = new KeywordMarkerFilter(input, protectedWords);
    }
    return new KStemFilter(input);
}
The problem I'm facing is that I need to pass a string as input, but I'm not sure how to turn it into a token stream.
Please help.
To get a TokenStream from a search text, you have to use an Analyzer:
Analyzer analyzer = ...; // your analyzer
TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(searchText));
Note that it should be the same analyzer that is used to build the index.
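Putting it together, a minimal sketch of such a test; StandardAnalyzer and the sample input are assumptions, and the KStemFilter line stands in for calling the create method above:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.KStemFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CreateFilterTest {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer();
        TokenStream input = analyzer.tokenStream(null, new StringReader("running runs ran"));
        TokenStream filtered = new KStemFilter(input); // stand-in for create(input)
        CharTermAttribute term = filtered.addAttribute(CharTermAttribute.class);
        filtered.reset();
        while (filtered.incrementToken()) {
            System.out.println(term.toString()); // prints the stemmed tokens
        }
        filtered.end();
        filtered.close();
    }
}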
