Solr custom Tokenizer Factory works randomly

I am new to Solr and need a filter that lemmatizes text, both when indexing documents and when analyzing queries.
I created a custom TokenizerFactory that lemmatizes the text before passing it to the StandardTokenizer.
Tests in the Solr analysis screen work fairly well (indexing is fine, although a query is sometimes analyzed twice). When indexing real documents, however, only the first document is analyzed, and queries are analyzed seemingly at random: only the first query is analyzed, and I have to wait a while before another one is. It is not a performance problem, because I see the same behavior when I merely modify the text instead of lemmatizing it.
Here is the code:
package test.solr.analysis;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeSource.AttributeFactory;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
//import test.solr.analysis.TestLemmatizer;
public class TestLemmatizerTokenizerFactory extends TokenizerFactory {
//private TestLemmatizer lemmatizer = new TestLemmatizer();
private final int maxTokenLength;
public TestLemmatizerTokenizerFactory(Map<String,String> args) {
super(args);
assureMatchVersion();
maxTokenLength = getInt(args, "maxTokenLength", StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
if (!args.isEmpty()) {
throw new IllegalArgumentException("Unknown parameters: " + args);
}
}
public String readFully(Reader reader){
char[] arr = new char[8 * 1024]; // 8K at a time
StringBuffer buf = new StringBuffer();
int numChars;
try {
while ((numChars = reader.read(arr, 0, arr.length)) > 0) {
buf.append(arr, 0, numChars);
}
} catch (IOException e) {
e.printStackTrace();
}
System.out.println("### READFULLY ### => " + buf.toString());
/*
The original return with lemmatized text would be this:
return lemmatizer.getLemma(buf.toString());
To test it I only change the text adding "lemmatized" word
*/
return buf.toString() + " lemmatized";
}
@Override
public StandardTokenizer create(AttributeFactory factory, Reader input) {
// I print this to see when enters to the tokenizer
System.out.println("### Standar tokenizer ###");
StandardTokenizer tokenizer = new StandardTokenizer(luceneMatchVersion, factory, new StringReader(readFully(input)));
tokenizer.setMaxTokenLength(maxTokenLength);
return tokenizer;
}
}
With this, only the first indexed text gets the word "lemmatized" appended.
Then, on the first query, if I search for "example" it looks for "example" and "lemmatized", so it returns the first document.
On subsequent searches the query is not modified. To get another query with "lemmatized" appended, I have to wait a few minutes.
What is happening?
Thank you all.

I highly doubt that the create method is invoked on each query (for starters, performance concerns come to mind). I would take the safe route and create a Tokenizer that wraps a StandardTokenizer, then override the setReader method and do the work there.
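One way to read this advice: the read-and-rewrite step should live in whatever hook runs for every new input, not in the factory's create method, which Solr may not call again once it has a tokenizer it can reuse. A minimal sketch of that step in plain Java follows. The class name `TextRewritingReader` and the `rewrite` parameter are hypothetical, the lemmatizer is represented by any `Function<String, String>`, and the wiring into a Lucene Tokenizer or CharFilter is deliberately left out because it depends on the Lucene version.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.Function;

/** Hypothetical helper: drains a Reader, rewrites the text, returns a fresh Reader. */
final class TextRewritingReader {

    private TextRewritingReader() {}

    /** Reads everything from the given reader into a String. */
    static String readFully(Reader reader) {
        StringBuilder buf = new StringBuilder();
        char[] chunk = new char[8 * 1024]; // 8K at a time, as in the question
        try {
            int n;
            while ((n = reader.read(chunk, 0, chunk.length)) != -1) {
                buf.append(chunk, 0, n);
            }
        } catch (IOException e) {
            // surface the failure instead of silently indexing partial text
            throw new UncheckedIOException(e);
        }
        return buf.toString();
    }

    /** Applies rewrite (for example a lemmatizer) to the whole input and wraps the result. */
    static Reader wrap(Reader input, Function<String, String> rewrite) {
        return new StringReader(rewrite.apply(readFully(input)));
    }
}
```

Called from a per-input hook, something like `TextRewritingReader.wrap(input, text -> lemmatizer.getLemma(text))` would then rewrite every document and every query, not only the first one.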

Related

Is there a way to read csv file from S3 using Java without downloading it

I was able to connect Java to AWS S3, and I was able to perform basic operations like listing buckets. I need a way to read a CSV file without downloading it. I am attaching my current code here.
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.Bucket;
import com.amazonaws.services.s3.model.CannedAccessControlList;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.Properties;
public class test {
public static void main(String args[])throws IOException
{
AWSCredentials credentials =new BasicAWSCredentials("----","----");
AmazonS3 s3client = AmazonS3ClientBuilder
.standard()
.withCredentials(new AWSStaticCredentialsProvider(credentials))
.withRegion(Regions.US_EAST_2)
.build();
List<Bucket> buckets = s3client.listBuckets();
for(Bucket bucket : buckets) {
System.out.println(bucket.getName());
}
}
}

There is a way, with code like this. The file to read is fetched into an S3Object, and its content stream is then passed to an InputStreamReader:
S3Object obj = s3client.getObject("<Bucket Name>", "File Name");
BufferedReader reader = new BufferedReader(new InputStreamReader(obj.getObjectContent()));
// read the first row (the header) and split it into columns
String line = reader.readLine();
String row[] = line.split(",");
// number of columns
int length = row.length;
while((line=reader.readLine()) != null) {
// split the current line into its values
String value[] = line.split(",");
for(int i=0;i<length;i++) {
System.out.print(value[i]+" ");
}
System.out.println();
}

The answers by @jay and @Elikill58 are super helpful! This just adds a bit of clarity and accessibility.
Once all the authentication work is done, you get an object from an S3 bucket with the .getObject(String bucketName, String fileName) function. Note what the documentation says about file names:
An Amazon S3 bucket has no directory hierarchy such as you would find in a typical computer file system. You can, however, create a logical hierarchy by using object key names that imply a folder structure. For example, instead of naming an object sample.jpg, you can name it photos/2006/February/sample.jpg.
To get an object from such a logical hierarchy, specify the full key
name for the object in the GET operation. For a virtual hosted-style
request example, if you have the object
photos/2006/February/sample.jpg, specify the resource as
/photos/2006/February/sample.jpg. For a path-style request example, if
you have the object photos/2006/February/sample.jpg in the bucket
named examplebucket, specify the resource as
/examplebucket/photos/2006/February/sample.jpg.
Once you have the S3Object that is returned, just pass it into the function below (a modified version of @jay's that fixes a few errors)!
private static void parseCSVS3Object(S3Object data) {
BufferedReader reader = new BufferedReader(new InputStreamReader(data.getObjectContent()));
try {
// Get all the csv headers
String line = reader.readLine();
String[] headers = line.split(",");
// Get number of columns and print headers
int length = headers.length;
for (String header : headers) {
System.out.print(header + " ");
}
while((line = reader.readLine()) != null) {
System.out.println();
// get and print the next line (row)
String[] row = line.split(",");
for (String value : row) {
System.out.print(value + " ");
}
}
} catch (IOException e) {
throw new RuntimeException(e);
}
}
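Because `getObjectContent()` hands back an ordinary `InputStream`, the parsing can also be written against `InputStream` alone and tried out without S3. The sketch below is one such variant, not the SDK's own approach: the class name `CsvStreamParser` is made up for illustration, UTF-8 is an assumed encoding, and `split(",")` still does not handle quoted fields that contain commas.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that parses simple comma-separated text from any InputStream. */
final class CsvStreamParser {

    private CsvStreamParser() {}

    /** Returns one String[] per line; the first entry holds the header row. */
    static List<String[]> parse(InputStream in) {
        List<String[]> rows = new ArrayList<>();
        // try-with-resources closes the reader and, with it, the underlying stream
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                rows.add(line.split(",", -1)); // -1 keeps trailing empty columns
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }
}
```

With the SDK this would be called as `CsvStreamParser.parse(data.getObjectContent())`; closing the stream when done also releases the underlying HTTP connection.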

For your code to read the file, it needs the contents, and that means copying them to the local system, at least into memory.
However, you can request a byte "range" (Java) to read just a part of the object.

Replacing all text in powerpoint using Apache POI

I looked at the Apache POI documentation and created a function that redacts all the text in a PowerPoint file. The function works well for text in slides, but not for text in grouped text boxes. Is there a separate object that handles grouped items?
private static void redactText(XMLSlideShow ppt) {
for (XSLFSlide slide : ppt.getSlides()) {
System.out.println("REDACT Slide: " + slide.getTitle());
XSLFTextShape[] shapes = slide.getPlaceholders();
for (XSLFTextShape textShape : shapes) {
List<XSLFTextParagraph> textparagraphs = textShape.getTextParagraphs();
for (XSLFTextParagraph para : textparagraphs) {
List<XSLFTextRun> textruns = para.getTextRuns();
for (XSLFTextRun incomingTextRun : textruns) {
String text = incomingTextRun.getRawText();
System.out.println(text);
if (text.toLowerCase().contains("test")) {
String newText = text.replaceAll("(?i)" + "test", "XXXXXXXX");
incomingTextRun.setText(newText);
}
}
}
}
}
}

If the need is simply to get all text content regardless of which object holds it, one can do exactly that. Text contents are held in org.apache.xmlbeans.XmlString elements. In PowerPoint XML they are in a:t tags, with namespace a="http://schemas.openxmlformats.org/drawingml/2006/main".
The following code gets all text in all objects in all slides and replaces the case-insensitive string "test" with "XXXXXXXX".
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.xslf.usermodel.*;
import org.openxmlformats.schemas.presentationml.x2006.main.CTSlide;
import org.apache.xmlbeans.XmlObject;
import org.apache.xmlbeans.XmlString;
public class ReadPPTXAllText {
public static void main(String[] args) throws Exception {
XMLSlideShow slideShow = new XMLSlideShow(new FileInputStream("MicrosoftPowerPoint.pptx"));
for (XSLFSlide slide : slideShow.getSlides()) {
CTSlide ctSlide = slide.getXmlObject();
XmlObject[] allText = ctSlide.selectPath(
"declare namespace a='http://schemas.openxmlformats.org/drawingml/2006/main' " +
".//a:t"
);
for (int i = 0; i < allText.length; i++) {
if (allText[i] instanceof XmlString) {
XmlString xmlString = (XmlString)allText[i];
String text = xmlString.getStringValue();
System.out.println(text);
if (text.toLowerCase().contains("test")) {
String newText = text.replaceAll("(?i)" + "test", "XXXXXXXX");
xmlString.setStringValue(newText);
}
}
}
}
FileOutputStream out = new FileOutputStream("MicrosoftPowerPointChanged.pptx");
slideShow.write(out);
slideShow.close();
out.close();
}
}

If one doesn't like replacing via XML directly, it is possible to iterate over all slides and their shapes. If a shape is an XSLFTextShape, get the paragraphs and handle them as you did.
If you get an XSLFGroupShape, iterate over its getShapes() as well. Since a group can contain different types of shapes, you might use recursion for that. You might also handle the shape type XSLFTable.
But the real trouble starts when you realize that something you want to replace is divided across several runs ;-)
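One possible workaround for the split-run problem is to treat a paragraph's runs as a single string: join them, replace, write the result into the first run, and blank the rest. The sketch below shows only that bookkeeping on a plain `List<String>` standing in for the run texts. `RunTextReplacer` is a hypothetical name, and applying the result back through `XSLFTextRun.setText` is assumed, not shown. Note the trade-off: after a replacement, the whole paragraph text takes on the formatting of the first run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: replaces text that may be split across several runs. */
final class RunTextReplacer {

    private RunTextReplacer() {}

    /**
     * Returns the new text for each run. If the target does not occur in the
     * joined text, the runs are returned unchanged.
     */
    static List<String> replaceAcrossRuns(List<String> runTexts, String target, String replacement) {
        String joined = String.join("", runTexts);
        // quote the target so it is matched literally, ignoring case
        Pattern pattern = Pattern.compile(Pattern.quote(target), Pattern.CASE_INSENSITIVE);
        if (!pattern.matcher(joined).find()) {
            return new ArrayList<>(runTexts);
        }
        String replaced = pattern.matcher(joined).replaceAll(Matcher.quoteReplacement(replacement));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < runTexts.size(); i++) {
            // the first run carries the whole text, the others become empty
            result.add(i == 0 ? replaced : "");
        }
        return result;
    }
}
```

For the runs "This is a te", "st", " slide" and the target "test", the first run would become "This is a XXXXXXXX slide" and the other two would be emptied.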

how to read from a txt file in blackberry eclipse?

I am developing a simple BlackBerry application with the BlackBerry Java Plug-in for Eclipse. In it, I want to read data from an external text file. I searched for this and tried several tips, but none of them worked. I will describe my application...
my main file...
package com.nuc;
import net.rim.device.api.ui.UiApplication;
public class Launcher extends UiApplication
{
public static void main(String[] args)
{
Launcher theApp = new Launcher();
theApp.enterEventDispatcher();
}
public Launcher()
{
pushScreen(new MyScreen());
}
}
And then my app class is like....
package com.nuc;
import net.rim.device.api.ui.container.MainScreen;
import net.rim.device.api.ui.component.BasicEditField;
import net.rim.device.api.ui.component.Dialog;
import net.rim.device.api.ui.component.EditField;
import net.rim.device.api.ui.component.LabelField;
import net.rim.device.api.ui.container.GridFieldManager;
import net.rim.device.api.ui.Field;
import net.rim.device.api.ui.FieldChangeListener;
public final class MyScreen extends MainScreen implements FieldChangeListener
{
// declared variables...
public MyScreen()
{
//rest codes...
I want to show some details from a text file before my app starts, like the End User License Agreement, i.e., something that appears first.
My first question is where to put that text file. I found lots of guidance on the net, but nothing worked for Eclipse.
Secondly, how can I read the file and put its content in a dialog?
Please guide me on how to achieve this. Sample code would be appreciated, as I am new to this environment.

To add a file to your Eclipse project:
Right-click on the res folder of your project structure, click New, click Untitled Text File, then enter some text and save the file.
To read from the file and display it in a dialog, try something like the following code snippet:
try {
InputStream is = (InputStream) getClass().getResourceAsStream("/Text");
String str = "";
int ch;
while ((ch = is.read()) != -1) {
str += (char)ch;
}
synchronized (UiApplication.getEventLock()) {
Dialog.alert(str == null ? "Failed to read." : str);
}
} catch (Exception e) {
synchronized (UiApplication.getEventLock()) {
Dialog.alert(e.getMessage() + " + " + e.toString());
}
}
In the above code, "/Text" is the file name. If you get a NullPointerException, check the file name and path.

Rupak's answer is mostly correct, but there are a few problems with it. You definitely don't want to concatenate immutable strings in a situation like this. When you add two strings together (myString += "Another String"), Java creates a new String object holding the values of both, because it cannot change the contents of the existing strings. Usually this is fine if you just need to join two strings, but with a large file you create a new String object for EVERY character in the file (each object bigger than the last). There is a lot of overhead in creating these objects, AND the garbage collector (very slow) will have to run more often because of all the objects that need to be destroyed.
StringBuffer to the rescue! Using a StringBuffer in place of the String concatenation requires only one object to be created and will be much faster.
try {
InputStream is = (InputStream) getClass().getResourceAsStream("/Text");
StringBuffer str = new StringBuffer();
int ch;
while ((ch = is.read()) != -1) {
str.append((char)ch);
}
// locals used inside the anonymous Runnable must be final
final String text = str.toString();
UiApplication.getUiApplication().invokeLater(new Runnable(){
public void run(){
Dialog.alert(text);
}
});
} catch (Exception e) {
// also covers the NullPointerException thrown when the resource is not found
final String message = e.getMessage() + " + " + e.toString();
UiApplication.getUiApplication().invokeLater(new Runnable(){
public void run(){
Dialog.alert(message);
}
});
}
Also several developers on the Blackberry support forums recommend against using UiApplication.getEventLock() because it can be "dangerous". They recommend using invokeLater() instead. See Blackberry Support Forums

'Un'-externalize strings from Eclipse or Intellij

I have a bunch of strings in a properties file which I want to 'un-externalize', i.e., inline into my code.
I see that both Eclipse and Intellij have great support to 'externalize' strings from within code, however do any of them support inlining strings from a properties file back into code?
For example if I have code like -
My.java
System.out.println(myResourceBundle.getString("key"));
My.properties
key=a whole bunch of text
I want my java code to be replaced as -
My.java
System.out.println("a whole bunch of text");

I wrote a simple java program that you can use to do this.
Deexternalize.java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;
import java.util.Properties;
import java.util.Set;
import java.util.Stack;
import java.util.logging.Level;
import java.util.logging.Logger;
public class Deexternalize {
public static final Logger logger = Logger.getLogger(Deexternalize.class.toString());
public static void main(String[] args) throws IOException {
if(args.length != 2) {
System.out.println("Deexternalize props_file java_file_to_create");
return;
}
Properties defaultProps = new Properties();
FileInputStream in = new FileInputStream(args[0]);
defaultProps.load(in);
in.close();
File javaFile = new File(args[1]);
List<String> data = process(defaultProps,javaFile);
buildFile(javaFile,data);
}
public static List<String> process(Properties propsFile, File javaFile) {
List<String> data = new ArrayList<String>();
Set<Entry<Object,Object>> setOfProps = propsFile.entrySet();
int indexOf = javaFile.getName().indexOf(".");
String javaClassName = javaFile.getName().substring(0,indexOf);
data.add("public class " + javaClassName + " {\n");
StringBuilder sb = null;
// for some reason it's adding them in reverse order, so put them on a stack
Stack<String> aStack = new Stack<String>();
for(Entry<Object,Object> anEntry : setOfProps) {
sb = new StringBuilder("\tpublic static final String ");
sb.append(anEntry.getKey().toString());
sb.append(" = \"");
sb.append(anEntry.getValue().toString());
sb.append("\";\n");
aStack.push(sb.toString());
}
while(!aStack.empty()) {
data.add(aStack.pop());
}
if(sb != null) {
data.add("}");
}
return data;
}
public static final void buildFile(File fileToBuild, List<String> lines) {
BufferedWriter theWriter = null;
try {
// Check to make sure if the file exists already.
if(!fileToBuild.exists()) {
fileToBuild.createNewFile();
}
theWriter = new BufferedWriter(new FileWriter(fileToBuild));
// Write the lines to the file.
for(String theLine : lines) {
// DO NOT ADD windows carriage return.
if(theLine.endsWith("\r\n")){
theWriter.write(theLine.substring(0, theLine.length()-2));
theWriter.write("\n");
} else if(theLine.endsWith("\n")) {
// This case is UNIX format already since we checked for
// the carriage return already.
theWriter.write(theLine);
} else {
theWriter.write(theLine);
theWriter.write("\n");
}
}
} catch(IOException ex) {
logger.log(Level.SEVERE, null, ex);
} finally {
try {
// theWriter stays null if the FileWriter could not be opened
if(theWriter != null) {
theWriter.close();
}
} catch(IOException ex) {
logger.log(Level.SEVERE, null, ex);
}
}
}
}
Basically, all you need to do is call this java program with the location of the property file and the name of the java file you want to create that will contain the properties.
For instance this property file:
test.properties
TEST_1=test test test
TEST_2=test 2456
TEST_3=123456
will become:
java_test.java
public class java_test {
public static final String TEST_1 = "test test test";
public static final String TEST_2 = "test 2456";
public static final String TEST_3 = "123456";
}
Hope this is what you need!
EDIT:
I understand what you requested now. You can use my code to do what you want if you sprinkle in a bit of regex magic. Let's say you have the java_test file from above. Copy the inlined properties into the file in which you want to replace the myResourceBundle code.
For example,
TestFile.java
public class TestFile {
public static final String TEST_1 = "test test test";
public static final String TEST_2 = "test 2456";
public static final String TEST_3 = "123456";
public static void regexTest() {
System.out.println(myResourceBundle.getString("TEST_1"));
System.out.println(myResourceBundle.getString("TEST_1"));
System.out.println(myResourceBundle.getString("TEST_3"));
}
}
OK, now if you are using Eclipse (any modern IDE should be able to do this), go to Edit -> Find/Replace. In the window you should see a "Regular expressions" checkbox; check it. Now enter the following text into the Find field:
myResourceBundle\.getString\(\"(.+)\"\)
And the back reference
\1
into the replace.
Now click "Replace all" and voila! The code should have been inlined to your needs.
Now TestFile.java will become:
TestFile.java
public class TestFile {
public static final String TEST_1 = "test test test";
public static final String TEST_2 = "test 2456";
public static final String TEST_3 = "123456";
public static void regexTest() {
System.out.println(TEST_1);
System.out.println(TEST_1);
System.out.println(TEST_3);
}
}
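The same find-and-replace can also be scripted with `java.util.regex`, which may be handy when many files are involved. This is a sketch under a few assumptions: the class name `BundleCallRewriter` is invented, the bundle variable is called `myResourceBundle` as in the example, and every key is a valid Java identifier with a matching constant. It uses `[^"]+` in place of `.+` so that two calls on one line are matched separately, and `$1` is the group reference syntax for Java replacement strings.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: turns myResourceBundle.getString("KEY") into the constant KEY. */
final class BundleCallRewriter {

    // [^"]+ stops at the closing quote, so each call is matched on its own
    private static final Pattern CALL =
        Pattern.compile("myResourceBundle\\.getString\\(\"([^\"]+)\"\\)");

    private BundleCallRewriter() {}

    /** Replaces every bundle lookup in the given source text with the bare key. */
    static String rewrite(String source) {
        return CALL.matcher(source).replaceAll("$1");
    }
}
```

Reading each source file into a String, passing it through `rewrite`, and writing it back would reproduce the IDE steps in bulk.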

You may use Eclipse "Externalize Strings" widget. It can also be used for un-externalization. Select required string(s) and press "Internalize" button. If the string was externalized before, it'll be put back and removed from messages.properties file.

Maybe if you can explain how you need to do this, you could get the correct answer.
The short answer to your question is no, especially in IntelliJ (I do not know enough about Eclipse). Of course, the slightly longer but still not very useful answer is to write a plugin. (It would take a list of property files, read the keys and values into a map, and then do a regular-expression replace of ResourceBundle.getString("Key") with the value from the map for that key. I will write this plugin myself if you can convince me that there are more people with this requirement.)
The more elaborate answer is this.
1_ First, I would refactor all the code that reads property files into a single class (or module, called PropertyFileReader).
2_ I would make that property file reader module iterate over all the keys in the property file(s) and store that information in a map.
3_ I can either create a static map object with the populated values or create a constants class out of it. Then I would replace the logic in the property file reader module to use a get on the map or the static class rather than reading the property file.
4_ Once I am sure that the application performs OK (by checking that all my unit tests pass), I would remove my property files.
Note: If you are using Spring, there is an easy way to split out all property key-value pairs from a list of property files. Let me know if you use Spring.

I would recommend something else: split the externalized strings into localizable and non-localizable properties files. It would probably be easier to move some strings to another file than to move them back into source code (which would hurt maintainability, by the way).
Of course, you can write a simple (to some extent) Perl (or whatever) script that searches for calls to resource bundles and introduces a constant in their place...
In other words, I haven't heard of a de-externalizing mechanism; you need to do it by hand (or write an automated script yourself).

An awesome one-liner from @potong:
sed 's|^\([^=]*\)=\(.*\)|s#Messages.getString("\1")#"\2"#g|;s/\\/\\\\/g' messages.properties |
sed -i -f - *.java
Run this inside your src dir, and see the magic.
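For readers without sed, the same inlining can be sketched in Java: look up each key in a `Properties` object and substitute the value as a string literal. The class name `BundleInliner` is hypothetical and the `Messages.getString("...")` call shape is taken from the one-liner above. Only backslashes and double quotes are escaped, so values containing newlines or other control characters would need more care. Keys missing from the properties are left untouched.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: inlines Messages.getString("key") calls as string literals. */
final class BundleInliner {

    private static final Pattern CALL =
        Pattern.compile("Messages\\.getString\\(\"([^\"]+)\"\\)");

    private BundleInliner() {}

    /** Returns the source text with every known key replaced by its quoted value. */
    static String inline(String source, Properties props) {
        Matcher m = CALL.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = props.getProperty(m.group(1));
            String replacement = (value == null)
                ? m.group()                        // unknown key: keep the call as it is
                : "\"" + escape(value) + "\"";     // known key: emit a Java string literal
            // quoteReplacement keeps '$' and '\' in the value from being interpreted
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Escapes backslashes and double quotes for use inside a Java string literal. */
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

Loading `messages.properties` with `Properties.load` and running `inline` over each `.java` file would mirror what the two sed commands do.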

Indentation issues with Staxmate API

I am using the Staxmate API to generate an XML file. After reading the tutorial (http://staxmate.codehaus.org/Tutorial) I tried making the changes in my code. Finally I added the call
doc.setIndentation("\n ", 1, 1);
which causes the newly generated XML file to be empty! Without this method call the entire XML file gets generated as expected.
Suspecting something fishy in the project setup, I created a Test class in the same package with the code given in the tutorial:
package ch.synlogic.iaf.export;
import java.io.File;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import org.codehaus.staxmate.SMOutputFactory;
import org.codehaus.staxmate.out.SMOutputDocument;
import org.codehaus.staxmate.out.SMOutputElement;
public class Test {
public static void main(String[] args) {
main("c:\\tmp\\empl.xml");
}
public static void main(String fname)
{
// 1: need output factory
SMOutputFactory outf = new SMOutputFactory(XMLOutputFactory.newInstance());
SMOutputDocument doc;
try {
doc = outf.createOutputDocument(new File(fname));
// (optional) 3: enable indentation (note spaces after backslash!)
doc.setIndentation("\n ", 1, 1);
// 4. comment regarding generation time
doc.addComment(" generated: "+new java.util.Date().toString());
SMOutputElement empl = doc.addElement("employee");
empl.addAttribute(/*namespace*/ null, "id", 123);
SMOutputElement name = empl.addElement("name");
name.addElement("first").addCharacters("Tatu");
name.addElement("last").addCharacters("Saloranta");
// 10. close the document to close elements, flush output
doc.closeRoot();
} catch (XMLStreamException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
Now when I invoke the main(String) method from my code the problem still persists, whereas if I just run class Test as it is, it works smoothly! My code involves database initializations and some other product-specific actions.
I am lost. Any thoughts on how I should proceed with this?

Indentation works with the Woodstox API:
WstxOutputFactory factory = new WstxOutputFactory();
factory.setProperty(WstxOutputFactory.P_AUTOMATIC_EMPTY_ELEMENTS, true);
SMOutputFactory outf = new SMOutputFactory(factory);
doc = outf.createOutputDocument(fout);
doc.setIndentation("\n ", 1, 1);

The following works for me:
context.setIndentation("\r\n\t\t\t\t\t\t\t\t", 2, 1); // indent with a Windows line break (CRLF) and 1 tab per level
