Ignore a Token output in Lucene's IncrementToken() method - java

I am trying to write a custom filter in Lucene which recognizes whether two consecutive words in a text start with a capital letter followed by lower case, in which case the two words are to be joined as one token.
The overridden incrementToken method has the following code:
@Override
public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
        return false;
    }
    // Case where the previous token did NOT start with a capital letter followed by lower case
    if (previousTokenCanditateMainName == false) {
        if (CheckIfMainName(termAtt.term())) {
            previousTokenCanditateMainName = true;
            tempString = this.termAtt.term();       /* This is the */
            // myToken.offsetAtt = this.offsetAtt;  /* token I need to "delete" */
            tempStartOffset = this.offsetAtt.startOffset();
            tempEndOffset = this.offsetAtt.endOffset();
            return true;
        } else {
            return true;
        }
    }
    // Case where the previous token WAS a proper name (starting with a capital and continuing with lower case)
    else {
        if (CheckIfMainName(termAtt.term())) {
            previousTokenCanditateMainName = false;
            posIncrAtt.setPositionIncrement(0);
            termAtt.setTermBuffer(tempString + TOKEN_SEPARATOR + this.termAtt.term());
            offsetAtt.setOffset(tempStartOffset, this.offsetAtt.endOffset());
            return true;
        } else {
            previousTokenCanditateMainName = false;
            return true;
        }
    }
}
My question is: once I find the first token that meets my requirements, how can I "ignore" it?
Currently the code joins the two tokens perfectly, but I also get an extra token containing the first of the two that I identified.
I tried using the same method setEnablePositionIncrements(true) that the built-in StopFilter uses, but in that case my filter needs to be of type TokenFilter, which does not allow me to override the incrementToken method.
I hope I phrased my problem properly.

You might have a custom method:
private void tokenize()
where you do the splitting and the custom joins. The resulting List<String> tokens needs to be held as an attribute of the tokenizer.
In the incrementToken method you simply check whether this attribute is null and initialize it if necessary.
You also need to append the tokens to the termAttribute in the incrementToken() method:
termAttribute.append(tokens.get(tokenIndex));
This implies that your Tokenizer needs to have an attribute like this:
private CharTermAttribute termAttribute = addAttribute(CharTermAttribute.class);
You will probably also need some fine tuning, but that's only a draft of how this can be achieved in a pretty simple way.
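For concreteness, here is a rough sketch of that draft, assuming Lucene 3.x-era APIs. JoiningTokenizer, tokenize() and isMainName() are illustrative names, and reset() and offset handling are omitted:
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class JoiningTokenizer extends Tokenizer {
    private final CharTermAttribute termAttribute = addAttribute(CharTermAttribute.class);
    private List<String> tokens;
    private int tokenIndex;

    public JoiningTokenizer(Reader input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (tokens == null) {
            tokenize(); // lazy initialization, as described above
        }
        if (tokenIndex >= tokens.size()) {
            return false;
        }
        clearAttributes();
        termAttribute.append(tokens.get(tokenIndex++));
        return true;
    }

    // Reads the whole input, splits on whitespace, and joins two consecutive
    // capitalized words into a single entry of the list.
    private void tokenize() throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = input.read()) != -1) {
            sb.append((char) c);
        }
        tokens = new ArrayList<String>();
        String[] words = sb.toString().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            if (words[i].isEmpty()) {
                continue; // can happen with leading whitespace
            }
            if (i + 1 < words.length && isMainName(words[i]) && isMainName(words[i + 1])) {
                tokens.add(words[i] + " " + words[i + 1]); // join the pair
                i++; // skip the second word of the pair
            } else {
                tokens.add(words[i]);
            }
        }
    }

    private boolean isMainName(String word) {
        return word.length() > 1
                && Character.isUpperCase(word.charAt(0))
                && word.substring(1).equals(word.substring(1).toLowerCase());
    }
}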

Related

Can I get the Field value in String into custom TokenFilter in Apache Solr?

I need to write a custom LemmaTokenFilter, which replaces and indexes words with their lemmatized (base) forms. The problem is that I get the base forms from an external API, meaning I need to call the API, send my text, parse the response and pass it as a Map<String, String> to my LemmaTokenFilter. The map contains pairs of <originalWord, baseFormOfWord>. However, I cannot figure out how I can access the full value of the text field which is being processed by the TokenFilters.
One idea is to go through the tokenStream one by one when the LemmaTokenFilter is being created by the LemmaTokenFilterFactory. However, I would need to watch out not to edit anything in the tokenStream and to somehow reset the current token (since I would need to call the .increment() method on it to get all the tokens). Most importantly, this seems unnecessary, since the field value is already there somewhere and I don't want to spend time trying to put it together again from the tokens. This implementation would probably be too slow.
Another idea would be to just process every token separately, however calling an external API with only one word and then parsing the response is definitely too inefficient.
I have found something on using the ResourceLoaderAware interface, however I don't really understand how I could use it to my advantage. I could probably save the map to a text file before every indexing, but writing to a file, opening it and reading from it before every document is indexed seems too slow as well.
So the best way would be to just pass the value of the field as a String to the constructor of LemmaTokenFilter; however, I don't know how to access it from the create() method of the LemmaTokenFilterFactory.
I could not find any help googling it, so any ideas are welcome.
Here's what I have so far:
public final class LemmaTokenFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private Map<String, String> lemmaMap;

    protected LemmaTokenFilter(TokenStream input, Map<String, String> lemmaMap) {
        super(input);
        this.lemmaMap = lemmaMap;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            String term = termAtt.toString();
            String lemma;
            if ((lemma = lemmaMap.get(term)) != null) {
                termAtt.setEmpty();
                termAtt.copyBuffer(lemma.toCharArray(), 0, lemma.length());
            }
            return true;
        } else {
            return false;
        }
    }
}

public class LemmaTokenFilterFactory extends TokenFilterFactory implements ResourceLoaderAware {

    public LemmaTokenFilterFactory(Map<String, String> args) {
        super(args);
        if (!args.isEmpty()) {
            throw new IllegalArgumentException("Unknown parameters: " + args);
        }
    }

    @Override
    public TokenStream create(TokenStream input) {
        return new LemmaTokenFilter(input, getLemmaMap(getFieldValue(input)));
    }

    private String getFieldValue(TokenStream input) {
        //TODO: how?
        return "Šach je desková hra pro dva hráče, v dnešní soutěžní podobě zároveň považovaná i za odvětví sportu.";
    }

    private Map<String, String> getLemmaMap(String data) {
        return UdPipeService.getLemma(data);
    }

    @Override
    public void inform(ResourceLoader loader) throws IOException {
    }
}
1. API-based approach:
You can create an analysis chain with the custom lemmatizer on top. To design this lemmatizer, I guess you can look at the implementation of the Keyword Tokenizer:
read everything inside the input and then call your API;
replace all your tokens from the API response in the input text;
after that, further down the analysis chain, use a standard or whitespace tokenizer to tokenize your data (a rough sketch of this idea follows after these notes).
2. File-based approach:
It follows all the same steps, except that instead of calling the API it can use a hashmap built from the files mentioned while defining the TokenStream.
Now coming to ResourceLoaderAware:
It is required when you need to indicate to your TokenStream that a resource has changed; it has an inform method which takes care of that. For reference, you can look into StemmerOverrideFilter.
Keyword Tokenizer: emits the entire input as a single token.
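A very rough sketch of the API-based idea as a CharFilter, so that a standard or whitespace tokenizer can run on the rewritten text afterwards. This is illustrative only: offset correction is omitted, and getLemmatizedText is an assumed plain-text variant of the question's UdPipeService, not a real method:
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.CharFilter;

public final class LemmatizingCharFilter extends CharFilter {
    private Reader lemmatized;

    public LemmatizingCharFilter(Reader in) {
        super(in);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (lemmatized == null) {
            // Read everything, like the Keyword Tokenizer does.
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = input.read()) != -1) {
                sb.append((char) c);
            }
            // One external call for the whole field value.
            lemmatized = new StringReader(UdPipeService.getLemmatizedText(sb.toString()));
        }
        return lemmatized.read(cbuf, off, len);
    }

    @Override
    protected int correct(int currentOff) {
        return currentOff; // no offset correction in this sketch
    }
}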
So I think I found the answer, or actually two answers.
One would be to write my client application in a way that incoming requests are first processed: the field value is sent to the external API, and the response is stored in some global variable, which can then be accessed from the custom TokenFilters.
Another one would be to use custom UpdateRequestProcessors, which allow us to modify the content of the incoming document, calling the external API and again saving the response so it's somehow globally accessible from the custom TokenFilters. Here Erik Hatcher talks about the use of the ScriptUpdateProcessor, which I believe can be used in my case too.
Hope this helps anyone stumbling upon a similar problem, because I had a hard time looking for a solution to this (could not find any similar threads on SO).

Is there a way I can detect a repeated input for a HashSet (Java)?

I am trying to make a hangman game as follows:
public void guessLetter(String letter) {
    HashSet<String> guessedLettersA = new HashSet<>();
    guessedLettersA.add(letter);
    for (String guessedLetterA : guessedLettersA) {
        this.guessedLetters += guessedLetterA;
    }
    if (!this.word.contains(letter)) {
        this.numberOfFaults++;
    }
}

public boolean letterCheck() {
    if ( = false ) {
        System.out.println("You have already guessed this letter!");
    }
I am currently working on the letterCheck method and want to see if one of the inputs is a repeat, and let the user know that their guess doesn't count. I assume it won't add to their faults or count as another guess because it is never added to the hashset. Where I am struggling is how to do this: I was thinking of using the built-in way a HashSet returns false to detect a repeat, but I have no idea how to implement this since it needs to refer to another method, and I don't know how to make a String HashSet return booleans. I would greatly appreciate any help at all, thanks.
The add API returns false for an existing value (one that it cannot add to the Set), so your condition can be written as:
boolean letterCheck = guessedLettersA.add(letter);
if (!letterCheck) {
    System.out.println("You have already guessed this letter!");
}
Note: The invocation of this block is solely dependent on the design of your application.
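Note also that guessLetter in the question creates a fresh HashSet on every call, so it can never remember earlier guesses; the set has to be a field that lives as long as the game. A minimal sketch of how the pieces could fit together (the field names are taken from the question, the rest is illustrative):
private final HashSet<String> guessedLettersA = new HashSet<>();

public void guessLetter(String letter) {
    if (!guessedLettersA.add(letter)) { // add() returns false for a repeat
        System.out.println("You have already guessed this letter!");
        return; // a repeated guess costs no faults and is not recorded again
    }
    this.guessedLetters += letter;
    if (!this.word.contains(letter)) {
        this.numberOfFaults++;
    }
}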

Method retake in Java

I'm developing a project in which I have a method to check whether a JTextField is empty or not, but I was wondering if there is a way to implement that method just once and pass several JTextField components to it to check if they are empty. If so, could you please tell me how? Here's my sample code.
public static void Vacio(JTextField txt) {
    if (txt.getText().trim().equals(null) == true) { /*Message*/ }
}
Also, I would like to know if I could improve the method using some lambda expressions. Thanks beforehand.
Use:
if (txt.getText().trim().length() == 0)
    // Do something
Your code will not work because a blank string ("") is not a null String. I simply check whether the trimmed text of the JTextField has length 0.
A sample function:
public boolean isEmpty(JTextField jtf) {
    try {
        jtf.getText();
    } catch (NullPointerException e) {
        return true;
    }
    if (jtf.getText().trim().length() == 0)
        return true;
    return false;
}
I cannot imagine how adding Lambda expressions can improve what you're trying to do (?).
Anyway, to check for an empty String I'd probably use:
field.getText().trim().isEmpty()
You don't need to check for null but you do need to catch NullPointerException in the event that the underlying document in the JTextField is null.
For the other part of your question, if you really want to check multiple JTextFields in one method you could pass them as a variable-length argument list:
public static void vacio(JTextField... fields) {
    for (JTextField field : fields) {
        try {
            if (field.getText().trim().isEmpty()) {
                // do something
            }
        } catch (NullPointerException ex) {
            // handle exception (maybe log it?)
        }
    }
}
and call it like:
vacio(field1, field2, field3);
But, generally speaking, keeping functions brief and only doing one thing is usually better than trying to make a function do too much.
One final aside: your method is named Vacio, but Java naming conventions suggest you should compose method names using mixed case, beginning with a lower case letter and starting each subsequent word with an upper case letter.
Check explicitly for null and then compare with "".
public static void Vacio(JTextField txt) {
    String str = null;
    try {
        str = txt.getText();
    } catch (NullPointerException npe) {
        System.out.println("The document is null!");
        return;
    }
    if (str.trim().equals("")) { /*Message*/ }
}

complex if( ) or enum?

In my app, I need to branch out if the input matches some specific 20 entries.
I thought of using an enum
public enum dateRule { is_on, is_not_on, is_before,...}
and a switch on the enum constant to do a function
switch (dateRule.valueOf(input)) {
    case is_on:
    case is_not_on:
    case is_before:
    // ...
        // function()
        break;
}
But the input strings will be like 'is on', 'is not on', 'is before' etc., without _ between words.
I learnt that an enum cannot have constants containing spaces.
Possible ways I could make out:
1. Using an if statement to compare the 20 possible inputs, which gives a long condition like
if (input.equals("is on") ||
    input.equals("is not on") ||
    input.equals("is before") ...) { // function() }
2. Working on the input to insert _ between words, but even input strings that are not among these 20 can have multiple words.
Is there a better way to implement this?
You can define your own version of the valueOf method inside the enum (just don't call it valueOf).
public enum State {
    IS_ON,
    IS_OFF;

    public static State translate(String value) {
        return valueOf(value.toUpperCase().replace(' ', '_'));
    }
}
Simply use it like before.
State state = State.translate("is on");
The earlier switch statement would still work.
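One caveat worth noting, as an addition to this answer: valueOf throws IllegalArgumentException when the input does not match any constant, so inputs outside the 20 known entries need a guard, for example:
try {
    State state = State.translate(input);
    // ... use state ...
} catch (IllegalArgumentException e) {
    // input is not one of the known rules; handle it here
}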
It is possible to separate the enum identifier from the value. Something like this:
public enum MyEnumType {
    IS_BEFORE("is before"),
    IS_ON("is on"),
    IS_NOT_ON("is not on");

    public final String value;

    MyEnumType(final String value) {
        this.value = value;
    }
}
You can also add methods to the enum type (the method can have arguments as well), something like this:
public boolean isOnOrNotOn() {
    return this.value.contentEquals(IS_ON.value) || this.value.contentEquals(IS_NOT_ON.value);
}
Use in switch:
switch (MyEnumType.valueOf(input)) {
    case IS_ON: ...
    case IS_NOT_ON: ...
    case IS_BEFORE: ...
}
And when you print the value, for example System.out.println(IS_ON.value), it will show is on.
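If you want System.out.println(IS_ON) itself to print is on, you could also override toString() inside the enum (a small addition, not part of the original answer):
@Override
public String toString() {
    return this.value;
}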
If you're using Java 7, you can also choose the middle road here, and do a switch statement with Strings:
switch (input) {
    case "is on":
        // do stuff
        break;
    case "is not on":
        // etc
}
You're not really breaking the concept up enough; both solutions are brittle.
Look at your syntax:
"is": can be removed, seems to be ubiquitous;
"not": optional, apply a ! to the output comparison;
on, before, after: apply comparisons.
So do a split between spaces, parse the split words to ensure they exist in the syntax definition, and then do a step-by-step evaluation of the expression passed in, as in the sketch below. This will allow you to easily extend the syntax without having to add an "is" and "is not" variant for each combination, and keeps your code easy to read.
Having multiple conditions munged into one for the purposes of switch statements leads to huge bloat over time.
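A minimal sketch of that step-by-step evaluation, assuming java.util.Date values named date and target (all names here are illustrative, and the string switch requires Java 7):
String[] parts = input.trim().split("\\s+");
int i = 0;
if (parts[i].equals("is")) {
    i++; // "is" is ubiquitous, skip it
}
boolean negated = false;
if (i < parts.length && parts[i].equals("not")) {
    negated = true;
    i++;
}
boolean result;
switch (parts[i]) {
    case "on":     result = date.equals(target); break;
    case "before": result = date.before(target); break;
    case "after":  result = date.after(target);  break;
    default: throw new IllegalArgumentException("Unknown operator: " + parts[i]);
}
if (negated) {
    result = !result;
}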
Thanks for the suggestions. They guided me here.
This is almost the same as the other answers, just a bit simplified.
To summarize: I need to compare the input string with a set of 20 strings and, if they match, do something; else, do something else.
Static set of strings to which the input needs to be compared:
is on, is not on, is before, is after, etc. (20 entries)
I created an enum
public enum dateRule {
    is_on,
    is_not_on,
    is_before,
    is_after,
    // ...
}
and I am switching on the formatted value of the input:
if (isARule(in = input.replace(" ", "_"))) {
    switch (dateRule.valueOf(in)) {
        case is_on:
        case is_not_on:
        case is_before: ...
    }
}
I copied the formatted value of 'input' to 'in' so that I can reuse input without another replace of '_' with ' '.
private static boolean isARule(String value) {
    for (dateRule rule : dateRule.values()) {
        if (rule.toString().equals(value)) {
            return true;
        }
    }
    return false;
}
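As a side note, the linear scan in isARule() could be replaced with a map built once from the enum values; a minimal sketch (an alternative, not part of the original answer):
private static final Map<String, dateRule> LOOKUP = new HashMap<String, dateRule>();
static {
    for (dateRule rule : dateRule.values()) {
        LOOKUP.put(rule.toString(), rule);
    }
}
// isARule(value) then becomes LOOKUP.containsKey(value),
// and the matched rule is simply LOOKUP.get(value).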
Problem solved.
Reference: https://stackoverflow.com/a/4936895/1297564

Fastest way to lookup a String value

I have a simple application that reads data in small strings from large text files and saves them to a database. To save each such String, the application calls the following method several (maybe thousands or more) times:
void setValue(String value) {
    if (!ignore(value)) {
        // Save the value in the database
    }
}
Currently, I implement the ignore() method by just successively comparing a set of Strings, e.g.
public boolean ignore(String value) {
    if (value.equalsIgnoreCase("Value 1") || value.equalsIgnoreCase("Value 2")) {
        return true;
    }
    return false;
}
However, because I need to check against many such "ignorable" values, which will be defined in another part of the code, I need to use a data structure for this check instead of multiple consecutive if statements.
So, my question is: what would be the fastest data structure from standard Java to implement this? A HashMap? A Set? Something else?
Initialization time is not an issue, since it will happen statically and once per application invocation.
EDIT: The solutions suggested thus far (including HashSet) appear slower than just using a String[] with all the ignored words and just running "equalsIgnoreCase" against each of these.
Use a HashSet, storing the values in lowercase, and its contains() method, which has better lookup performance than TreeSet (constant-time versus log-time for contains).
Set<String> ignored = new HashSet<String>();
ignored.add("value 1"); // store in lowercase
ignored.add("value 2"); // store in lowercase

public boolean ignore(String value) {
    return ignored.contains(value.toLowerCase());
}
Storing the values in lowercase and searching for the lowercased input avoids the hassle of dealing with case during comparison, so you get the full speed of the HashSet implementation and zero collection-related code to write (e.g. Collator, Comparator, etc.).
EDITED
Thanks to Jon Skeet for pointing out that certain Turkish characters behave oddly when calling toLowerCase(), but if you're not intending to support Turkish input (or perhaps other languages with non-standard case issues) then this approach will work well for you.
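One common mitigation, if locale-sensitive case folding is a concern (an addition, not part of the original answer), is to pin the locale used for lowercasing with java.util.Locale.ROOT, so the behavior never depends on the default locale:
ignored.add("value 1".toLowerCase(Locale.ROOT)); // store with locale-neutral folding
// ...
return ignored.contains(value.toLowerCase(Locale.ROOT));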
In most cases I'd normally start with a HashSet<String> - but as you want case-insensitivity, that makes it slightly harder.
You can try using a TreeSet<Object> with an appropriate Collator for case-insensitivity. For example:
Collator collator = Collator.getInstance(Locale.US);
collator.setStrength(Collator.SECONDARY);
TreeSet<Object> set = new TreeSet<Object>(collator);
Note that you can't create a TreeSet<String> as Collator only implements Comparator<Object>.
EDIT: While the above version works with just strings, it may be faster to create a TreeSet<CollationKey>:
Collator collator = Collator.getInstance(Locale.US);
collator.setStrength(Collator.SECONDARY);
TreeSet<CollationKey> set = new TreeSet<CollationKey>();
for (String value : valuesToIgnore) {
    set.add(collator.getCollationKey(value));
}
Then:
public boolean ignore(String value) {
    return set.contains(collator.getCollationKey(value));
}
It would be nice to have a way of storing the collation keys for all ignored values but then avoid creating new collation keys when testing, but I don't know of a way of doing that.
Add the words to ignore to a list and just check if the word is in that list.
That makes it dynamic.
If using Java 7 this is a fast way to do it:
public boolean ignore(String value) {
    switch (value.toLowerCase()) { // see comment by Jon Skeet
        case "lowercased_ignore_value1":
        case "lowercased_ignore_value2":
        // etc
            return true;
        default:
            return false;
    }
}
It seems that String[] is slightly better (performance-wise) than the other methods proposed, so I will use that.
It is simply something like this:
public boolean ignore(String value) {
    for (String ignore : IGNORED_VALUES) {
        if (ignore.equalsIgnoreCase(value)) {
            return true;
        }
    }
    return false;
}
The IGNORED_VALUES object is just a String[] with all ignored values in there.
