I'm starting to work with Apache Lucene 8.0 and would like to know how to convert my String text variable to lowercase using Lucene. I'm not sure how to do it because I couldn't find any examples. What I want is something like this:
public class DocumentLowercase {
private Analyzer analyzer;
public Analyzer DocAnalysis(Document d) {
analyzer = new StandardAnalyzer();
String text = d.text();
// Here: convert the String text into lowercase
// Maybe using LowerCaseTokenizer? But how?
return analyzer;
}
}
StandardAnalyzer already converts everything to lower case!
Check the docs here: http://lucene.apache.org/core/8_0_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html
They say:
Filters StandardTokenizer with LowerCaseFilter and StopFilter, using a
configurable list of stop words.
You can also see in the source code which components a StandardAnalyzer includes:
@Override
protected TokenStreamComponents createComponents(final String fieldName) {
final StandardTokenizer src = new StandardTokenizer();
src.setMaxTokenLength(maxTokenLength);
TokenStream tok = new LowerCaseFilter(src);
tok = new StopFilter(tok, stopwords);
return new TokenStreamComponents(r -> {
src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
src.setReader(r);
}, tok);
}
If you still want to customize your analyzer, you should look into CustomAnalyzer.
I'm using PDFBox in Java and have successfully read a PDF. Now I want to search for a specific word and retrieve only the number that follows it. To be concrete, I want to search for Tax and retrieve the tax amount. The two strings seem to be separated by a tab.
My code currently looks like this:
File file = new File("yes.pdf");
// try-with-resources closes the document even if reading fails
try (PDDocument document = PDDocument.load(file)) {
    PDFTextStripper pdfStripper = new PDFTextStripper();
    String text = pdfStripper.getText(document);
    System.out.println(text);
    // search for the word "Tax"
    // retrieve the number after the word "Tax"
} catch (IOException e) {
    e.printStackTrace();
}
I have used something similar in my project. I hope it helps.
public class ExtractNumber {
public static void main(String[] args) throws IOException {
PDDocument doc = PDDocument.load(new File("yourFile location"));
PDFTextStripper stripper = new PDFTextStripper();
List<String> digitList = new ArrayList<String>();
//Read Text from pdf
String string = stripper.getText(doc);
// digits that immediately follow a letter
Pattern mainPattern = Pattern.compile("[a-zA-Z]\\d+");
//Provide actual text
Matcher mainMatcher = mainPattern.matcher(string);
while (mainMatcher.find()) {
//Get only numbers
Pattern subPattern = Pattern.compile("\\d+");
String subText = mainMatcher.group();
Matcher subMatcher = subPattern.matcher(subText);
subMatcher.find();
digitList.add(subMatcher.group());
}
if (doc != null) {
doc.close();
}
if(digitList != null && digitList.size() > 0 ) {
for(String digit: digitList) {
System.out.println(digit);
}
}
}
}
The regular expression [a-zA-Z]\d+ matches a letter immediately followed by one or more digits in the PDF text.
The \d+ expression then extracts just the digits from each of those matches.
You can also use a different regular expression to match a specific number of digits.
You can get more ideas from this tutorial.
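The best way to do something like that is to use regular expressions. I often use this tool to write my regular expressions. Your regex should probably look something like: tax\s([0-9]+). Note that Java regexes are case-sensitive by default, so either write Tax or compile the pattern with Pattern.CASE_INSENSITIVE. You can take a look at this tutorial on how to use regex in Java.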
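As a rough sketch of that suggestion, the snippet below applies a pattern close to tax\s([0-9]+) to a plain string. The class name TaxExtractor, the method findTax, and the sample text are made up for illustration. The word boundary, the \s+ (one or more whitespace characters, which covers a tab), and the case-insensitive flag are assumptions added here, not part of the regex above. In practice the input would be the text returned by PDFTextStripper.getText.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds the first integer that follows the word "tax".
final class TaxExtractor {

    // \b keeps words such as "Syntax" from matching; \s+ accepts spaces or tabs.
    private static final Pattern TAX =
            Pattern.compile("\\btax\\s+([0-9]+)", Pattern.CASE_INSENSITIVE);

    public static Optional<String> findTax(String text) {
        Matcher matcher = TAX.matcher(text);
        // group(1) is the captured run of digits after the whitespace
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the text extracted from the PDF
        String sample = "Subtotal\t100\nTax\t25\nTotal\t125";
        System.out.println(findTax(sample).orElse("not found"));
    }
}
```

If the amount can contain a decimal part or a thousands separator, the capture group would need to be widened accordingly.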
Hi, I have a PDF file and I need to search for a particular string in it. I tried various methods, and I am able to read all the contents of the PDF file but unable to find a particular string.
In this file, I need to search for strings such as Telephone, Garbage, Rent, etc. individually.
Could you please help me?
I have the code below for reading the file.
public class PDFBoxReader {
private PDFParser parser;
private PDFTextStripper pdfStripper;
private PDDocument pdDoc ;
private COSDocument cosDoc ;
private String Text ;
private String filePath;
private File file;
public PDFBoxReader() {
}
public String ToText() throws IOException
{
this.pdfStripper = null;
this.pdDoc = null;
this.cosDoc = null;
file = new File("D:\\report.pdf");
parser = new PDFParser(new FileInputStream(file));
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdDoc.getNumberOfPages();
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(10);
// reading text from page 1 to 10
// if you want to get text from full pdf file use this code
// pdfStripper.setEndPage(pdDoc.getNumberOfPages());
Text = pdfStripper.getText(pdDoc);
return Text;
}
public void setFilePath(String filePath) {
this.filePath = filePath;
}
}
It would be great if someone could help me with a code that searches for a particular string. Thanks in advance.
Try String.indexOf("substring"), where the String is what your ToText() method returns and substring is the string you wish to search for. (Side note: the Java convention is camelCase method names, which would be toText() in this case.)
This method finds the index of the first occurrence of the given substring in your long String of text, or returns -1 if it does not occur. So you could do String.indexOf("Telephone") to find the first occurrence of the word Telephone in your String.
If you want the stuff directly after that substring, the index would simply be String.indexOf("substring")+"substring".length().
You can even find the next occurrence (or the next after that) with another variation of this method: String.indexOf("substring", indexOfLastOccurrence+"substring".length()).
Example:
String myPDF = ToText();
int start = myPDF.indexOf("Rent"); // index of the 1st occurrence of "Rent", or -1 if there is none
while (start >= 0) {
    int rentIndex = start + "Rent".length(); // first character after this occurrence
    String rent = myPDF.substring(rentIndex); // everything after it
    // Narrow this down with rent.substring(beginIndex, endIndex) if you only want the few characters that follow.
    // Then process rent, e.g. Integer.parseInt(rent.trim()) or something.
    start = myPDF.indexOf("Rent", rentIndex); // next occurrence, or -1 when no more exist
}
Both methods can be found in the Java API:
http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#indexOf(java.lang.String)
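For the keywords named in the question (Telephone, Garbage, Rent), the indexOf/substring approach could be wrapped in a small helper like the one below. This is only a sketch: the class KeywordSearch, the method valuesAfter, and the sample text are hypothetical, and it assumes the value of interest is whatever follows the keyword on the same line, which may not match the layout of the real report.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper built on String.indexOf and String.substring.
final class KeywordSearch {

    /** Returns the rest of the line after each occurrence of keyword in text. */
    public static List<String> valuesAfter(String text, String keyword) {
        if (keyword.isEmpty()) {
            throw new IllegalArgumentException("keyword must not be empty");
        }
        List<String> values = new ArrayList<>();
        int start = text.indexOf(keyword);
        while (start >= 0) {
            int valueStart = start + keyword.length();
            int lineEnd = text.indexOf('\n', valueStart);
            if (lineEnd < 0) {
                lineEnd = text.length(); // keyword is on the last line
            }
            values.add(text.substring(valueStart, lineEnd).trim());
            // Continue searching after this occurrence; -1 ends the loop.
            start = text.indexOf(keyword, valueStart);
        }
        return values;
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the String returned by ToText()
        String sample = "Telephone 120\nGarbage 35\nRent 900\nRent 950";
        for (String keyword : new String[] {"Telephone", "Garbage", "Rent"}) {
            System.out.println(keyword + " -> " + valuesAfter(sample, keyword));
        }
    }
}
```

Note that indexOf is case-sensitive and also matches inside longer words, so "Rent" would be found in "Rental" as well.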
StandardAnalyzer treats the space character as a token separator, and I don't want it to split tokens on spaces. How can I override the tokenizer of StandardAnalyzer? If that is not possible, please suggest another Analyzer, with an example, that does not split on the space character.
This code can help you:
Analyzer ana = new StandardAnalyzer(Version.LUCENE_30, Collections.emptySet());
Note that the answer is version-dependent. For Lucene 4.0, use:
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40, CharArraySet.EMPTY_SET);
Edit:
Constructs a StandardTokenizer filtered by a StandardFilter, a org.apache.lucene.analysis.LowerCaseFilter and a org.apache.lucene.analysis.StopFilter.
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
StandardTokenizer tokenStream = new StandardTokenizer(matchVersion, reader);
tokenStream.setMaxTokenLength(maxTokenLength);
TokenStream result = new StandardFilter(tokenStream);
result = new LowerCaseFilter(result);
result = new StopFilter(enableStopPositionIncrements, result, stopSet);
return result;
}
private static final class SavedStreams {
StandardTokenizer tokenStream;
TokenStream filteredTokenStream;
}
Well, I replaced StandardAnalyzer with KeywordAnalyzer, which is now used for both indexing and searching. Then, in the search method, I added these lines:
parser.setDefaultOperator(Operator.AND);
if(searchWord.contains(" ")){
searchWord = searchWord.replace(" ", "?");
}
I'm attempting to establish a reliable and fast way to transform XML to JSON using Java, and I've started to use XStream to perform this task. However, when I run the code below, the test fails due to whitespace (including newlines). If I remove these characters, the test passes.
@Test
public void testXmlWithWhitespaceBeforeStartElementCanBeConverted() throws Exception {
String xml =
"<root>\n" +
" <foo>bar</foo>\n" + // remove the newlines and white space to make the test pass
"</root>";
String expectedJson = "{\"root\": {\n" +
" \"foo\": bar\n" +
"}}";
String actualJSON = transformXmlToJson(xml);
Assert.assertEquals(expectedJson, actualJSON);
}
private String transformXmlToJson(String xml) throws XmlPullParserException {
XmlPullParser parser = XppFactory.createDefaultParser();
HierarchicalStreamReader reader = new XppReader(new StringReader(xml), parser, new NoNameCoder());
StringWriter write = new StringWriter();
JsonWriter jsonWriter = new JsonWriter(write);
HierarchicalStreamCopier copier = new HierarchicalStreamCopier();
copier.copy(reader, jsonWriter);
jsonWriter.close();
return write.toString();
}
The test fails with the exception:
com.thoughtworks.xstream.io.json.AbstractJsonWriter$IllegalWriterStateException: Cannot turn from state SET_VALUE into state START_OBJECT for property foo
at com.thoughtworks.xstream.io.json.AbstractJsonWriter.handleCheckedStateTransition(AbstractJsonWriter.java:265)
at com.thoughtworks.xstream.io.json.AbstractJsonWriter.startNode(AbstractJsonWriter.java:227)
at com.thoughtworks.xstream.io.json.AbstractJsonWriter.startNode(AbstractJsonWriter.java:232)
at com.thoughtworks.xstream.io.copy.HierarchicalStreamCopier.copy(HierarchicalStreamCopier.java:36)
at com.thoughtworks.xstream.io.copy.HierarchicalStreamCopier.copy(HierarchicalStreamCopier.java:47)
at testConvertXmlToJSON.transformXmlToJson(testConvertXmlToJSON.java:30)
Is there a way to tell the copy process to ignore the ignorable whitespace? I cannot find any obvious way to enable this behaviour, but I think it should be there. I know I can pre-process the XML to remove the whitespace, or maybe just use another library.
Update
I can work around the issue by using a decorator for the HierarchicalStreamReader interface and suppressing the whitespace node manually, but this still does not feel ideal. It would look something like the code below, which makes the test pass.
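For reference, the pre-processing route mentioned above could look roughly like the following, using only the JDK's DOM and Transformer APIs rather than anything from XStream. The class name XmlWhitespaceStripper and the method stripBlankText are made up for this sketch. It assumes that every whitespace-only text node is ignorable, which is not true for documents where such whitespace is significant, and it does no hardening for untrusted XML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical pre-processor: drops whitespace-only text nodes before conversion.
final class XmlWhitespaceStripper {

    public static String stripBlankText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            removeBlankTextNodes(doc.getDocumentElement());

            // Serialize the cleaned DOM back to a String without an XML declaration.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process XML", e);
        }
    }

    private static void removeBlankTextNodes(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before a possible removal
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getTextContent().trim().isEmpty()) {
                node.removeChild(child);
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                removeBlankTextNodes(child);
            }
            child = next;
        }
    }

    public static void main(String[] args) {
        String xml = "<root>\n  <foo>bar</foo>\n</root>";
        System.out.println(stripBlankText(xml));
    }
}
```

The returned string could then be passed to transformXmlToJson in place of the original XML, at the cost of parsing the document twice.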
public class IgnoreWhitespaceHierarchicalStreamReader implements HierarchicalStreamReader {
private HierarchicalStreamReader innerHierarchicalStreamReader;
public IgnoreWhitespaceHierarchicalStreamReader(HierarchicalStreamReader hierarchicalStreamReader) {
this.innerHierarchicalStreamReader = hierarchicalStreamReader;
}
public String getValue() {
String getValue = innerHierarchicalStreamReader.getValue();
System.out.printf("getValue = '%s'\n", getValue);
if(innerHierarchicalStreamReader.hasMoreChildren() && getValue.length() >0) {
if(getValue.matches("^\\s+$")) {
System.out.printf("*** White space value suppressed\n");
getValue = "";
}
}
return getValue;
}
// rest of interface ...
Any help is appreciated.
Comparing two XML documents as String objects is not a good idea. How are you going to handle the case where the XML is the same but the nodes are not in the same order?
e.g.
<xml><node1>1</node1><node2>2</node2></xml>
is similar to
<xml><node2>2</node2><node1>1</node1></xml>
but when you do a String compare it will always return false.
Instead, use a tool like XMLUnit. Refer to the following link for more details:
Best way to compare 2 XML documents in Java
I need to test the method below by calling it locally from a main method:
public TokenFilter create(TokenStream input) {
if (protectedWords != null){
input = new KeywordMarkerFilter(input,protectedWords);
}
return new KStemFilter(input);
}
The problem I'm facing is that I need to pass a string as input, but I'm not sure how to turn it into a token stream.
Please help.
To get a TokenStream from the search text, you have to use an Analyzer:
Analyzer analyzer = ...; // your analyzer
TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(searchText));
Note that it should be the same analyzer that is used to build the index.