Tokenize a string using Apache Lucene (Java)

How can I tokenize a string based on a pattern?
Example: in the following string
arg1:aaa,bbb AND arg2:ccc OR arg3:ddd,eee,fff
I first want to tokenize on AND and OR, giving:
Token set 1 arg1:aaa,bbb
Token set 2 arg2:ccc
Token set 3 arg3:ddd,eee,fff
Later I want to pass each token set to a method and tokenize it further (on ":" and then on ","):
Token set 1
Token 1 aaa
Token 2 bbb
Token set 2
Token 1 ccc
Token set 3
Token 1 ddd
Token 2 eee
Token 3 fff
How can I tokenize using a custom pattern in Lucene?

To implement custom tokenization, you would generally write your own Tokenizer. The main method to implement is TokenStream.incrementToken().
Your Tokenizer can then be incorporated into an Analyzer.
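Before wiring anything into Lucene, the two-stage splitting from the question can be sketched in plain Java. This is only an illustration of the logic a custom Tokenizer would have to reproduce inside incrementToken(); it does not use the Lucene API, and the class and method names (QuerySplitter, splitClauses, splitValues) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class QuerySplitter {
    // Stage 1: split on the keywords AND / OR surrounded by whitespace.
    static List<String> splitClauses(String query) {
        return Arrays.asList(query.trim().split("\\s+(?:AND|OR)\\s+"));
    }

    // Stage 2: drop the "name:" prefix, then split the value list on commas.
    static List<String> splitValues(String clause) {
        int colon = clause.indexOf(':');
        String values = colon >= 0 ? clause.substring(colon + 1) : clause;
        return Arrays.asList(values.split(","));
    }

    public static void main(String[] args) {
        String query = "arg1:aaa,bbb AND arg2:ccc OR arg3:ddd,eee,fff";
        for (String clause : splitClauses(query)) {
            System.out.println(clause + " -> " + splitValues(clause));
        }
    }
}
```

Keeping the two stages as separate methods mirrors the question, where each token set is passed on to a second tokenization step.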

Related

REST parameter validation

Is there a way to restrict a request parameter to specific lengths? My parameter can be of length 4 or 6.
But specifying it like below:
@Size(min=4, max=6)
@RequestParam String param1
would also allow length 5, which is invalid in my case. Is there a way to accomplish this without a custom validator?
Thanks

You could try the @Pattern annotation, which verifies that a string matches a specific regexp.
Then you need to build a regexp along these lines: ^(?=[0-9]*$)(?:.{4}|.{6})$ (checks that the string consists of exactly 4 digits or exactly 6 digits).
Pattern annotation docs
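The behaviour of that expression can be checked with plain java.util.regex before putting it into an annotation. A minimal sketch follows; the class name LengthCheck and the method isValid are made up for the illustration.

```java
import java.util.regex.Pattern;

class LengthCheck {
    // Same expression as in the answer: digits only, exactly 4 or exactly 6 characters.
    // In a controller it would go into @Pattern(regexp = "...") on the parameter.
    static final Pattern FOUR_OR_SIX_DIGITS =
        Pattern.compile("^(?=[0-9]*$)(?:.{4}|.{6})$");

    static boolean isValid(String param) {
        return param != null && FOUR_OR_SIX_DIGITS.matcher(param).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"1234", "12345", "123456", "12a4"}) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

If only digits are expected anyway, the shorter expression \d{4}|\d{6} should accept the same inputs.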

Restrict particular domain in email regular expression

I have an existing regex which validates the email input field.
[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-zA-Z0-9!$%&'*+/=?^_`{|}~-]+)*(\\.)?@(?:[a-zA-Z0-9ÄÖÜäöü](?:[a-zA-Z0-9-_ÄÖÜäöü]*[a-zA-Z0-9_ÄÖÜäöü])?\\.)+[a-zA-Z]{2,}
Now I want this regex to reject two particular email domains: wt.com and des.net.
To do that I changed the expression as follows:
[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-zA-Z0-9!$%&'*+/=?^_`{|}~-]+)*(\\.)?@(?!wt\\.com)(?!des\\.net)(?:[a-zA-Z0-9ÄÖÜäöü](?:[a-zA-Z0-9-_ÄÖÜäöü]*[a-zA-Z0-9_ÄÖÜäöü])?\\.)+[a-zA-Z]{2,}
After this change it no longer matches any email ID that ends with wt.com or des.net, which is correct.
The problem is that it also fails to match wt.comm, or any other address with extra letters after the restricted string.
I only want to reject emails whose domain is exactly wt.com or des.net.
How do I do that?
Below are sample emails and whether each should match:
ajcom@wt.com : no match
ajcom@aa.an : match
ajcom@wt.coms : match
ajcom@des.net : no match
ajcom@des.neta : match

If you want to prevent only wt.com and des.net which have no characters after it you can add $ anchor (which represents end of string) at the end of each negative-look-ahead.
So instead of (?!wt\\.com)(?!des\\.net) use (?!wt\\.com$)(?!des\\.net$)
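The effect of the anchored lookaheads can be tried on the sample addresses from the question. In this sketch the local part and the domain part are deliberately simplified (an assumption made for brevity, not the question's full expression); only the two lookaheads are taken from the answer. The names DomainFilter and isAllowed are hypothetical.

```java
import java.util.regex.Pattern;

class DomainFilter {
    // Simplified local part and domain (assumption); the two negative
    // lookaheads with the $ anchor are the part taken from the answer.
    static final Pattern EMAIL = Pattern.compile(
        "[\\w.+-]+@(?!wt\\.com$)(?!des\\.net$)(?:[a-zA-Z0-9-]+\\.)+[a-zA-Z]{2,}");

    static boolean isAllowed(String email) {
        return EMAIL.matcher(email).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "ajcom@wt.com", "ajcom@aa.an", "ajcom@wt.coms", "ajcom@des.net", "ajcom@des.neta"
        };
        for (String s : samples) {
            System.out.println(s + " : " + (isAllowed(s) ? "match" : "no match"));
        }
    }
}
```

Because the lookahead only fails when the domain is followed directly by the end of the input, wt.coms and des.neta pass while wt.com and des.net are rejected.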

Regex to remove

I need a regex to remove \" and '
String date="\"CCB \\\"E Safety\\\" Internet Banking security components 3.0.7.0\"'Configuration & \\\"Service Tool v3.02.00'"
Result string: CCB E Safety Internet Banking security components 3.0.7.0. Configuration & Service Tool v3.02.00
I'm using this:
System.out.println(date.replaceAll("[\\W+]", " ").replaceAll("\\s+", " "));
But it also removes the dots:
CCB E Safety Internet Banking security components 3 0 7 0 Configuration Service Tool v3 02 00

data = date.replaceAll("[\\\\\"'\\s]+", " ").trim();
Result
CCB E Safety Internet Banking security components 3.0.7.0 Configuration & Service Tool v3.02.00
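As a self-contained check, the replacement above can be wrapped in a small class and run against the string from the question. The class name QuoteCleaner and the method clean are made up for the sketch.

```java
class QuoteCleaner {
    // Collapses every run of backslashes, double quotes, single quotes and
    // whitespace into one space, then trims the ends.
    static String clean(String input) {
        return input.replaceAll("[\\\\\"'\\s]+", " ").trim();
    }

    public static void main(String[] args) {
        String date = "\"CCB \\\"E Safety\\\" Internet Banking security components 3.0.7.0\"'Configuration & \\\"Service Tool v3.02.00'";
        System.out.println(clean(date));
    }
}
```

Including \s in the character class is what prevents double spaces: a quote next to a space is replaced together with it by a single space.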

Don't use regex!
The characters you want to remove can be removed without using regex, and it's a whole lot easier to read:
String data = date.replace("\"", " ").replace("'", " ").trim();
In case you are wondering, replace() still replaces all occurrences, but the search parameter is just plain text, whereas replaceAll() uses a regex search parameter.
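The difference between the two methods is easiest to see with a character that is special in regular expressions, such as the dot. A small sketch (ReplaceDemo and its method names are hypothetical):

```java
class ReplaceDemo {
    // replace(): the search argument is plain text.
    static String literal(String s) {
        return s.replace(".", "-");
    }

    // replaceAll(): the search argument is a regex, so "." matches any character.
    static String regex(String s) {
        return s.replaceAll(".", "-");
    }

    // replaceAll() with the dot escaped behaves like the literal version.
    static String regexEscaped(String s) {
        return s.replaceAll("\\.", "-");
    }

    public static void main(String[] args) {
        System.out.println(literal("3.0.7.0"));      // 3-0-7-0
        System.out.println(regex("3.0.7.0"));        // -------
        System.out.println(regexEscaped("3.0.7.0")); // 3-0-7-0
    }
}
```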

How to extract the session id from an RTSP message's content?

I have a string like this:
RTSP/1.0 200 OK
CSeq: 3
Server: Ants Rtsp Server/1.0
Date: 21 Oct 2016 15:55:30 GMT
Session: 980603187; timeout=60
Transport: RTP/AVP/TCP;unicast;interleaved=0-1;ssrc=F006B800
I want to extract the session number(980603187)
Could someone please provide some help?

Simply use a regular expression with a capturing group, then extract the value of the group as follows:
String content ="RTSP/1.0 200 OK\n" +
"CSeq: 3\n" +
"Server: Ants Rtsp Server/1.0\n" +
"Date: 21 Oct 2016 15:55:30 GMT\n" +
"Session: 980603187; timeout=60\n" +
"Transport: RTP/AVP/TCP;unicast;interleaved=0-1;ssrc=F006B800\n";
Pattern pattern = Pattern.compile("Session: ([a-zA-Z0-9$\\-_.+]+)");
Matcher matcher = pattern.matcher(content);
if (matcher.find()) {
    System.out.println(matcher.group(1));
}
Output:
980603187
Explanation:
Session: ([a-zA-Z0-9$\\-_.+]+)
Session: matches the characters Session: literally (case sensitive)
([a-zA-Z0-9$\\-_.+]+): Capturing group that matches with several consecutive ALPHA, DIGIT or SAFE characters (at least one) (cf RFC 2326 chapter 3.4 Session Identifiers)

Use a regex! Given String str = .., extract the value with a regex that captures everything between Session: and ;:
Session: (.+);
Feel free to restrict the group to word characters (\\w+) or digits (\\d+). Mind the double escaping in Java. The first group, m.group(1), is your result:
Pattern p = Pattern.compile("Session: (.+);");
Matcher m = p.matcher(str);
if (m.find()) {
    System.out.println(m.group(1));
}
Outputs 980603187. Check out the Regex101 for the explanation.
In some cases the ; timeout part is optional, so the regex needs to be amended:
Session: (.+?)[\n;]
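A minimal sketch of the optional-timeout variant follows. One adaptation is assumed here: \r is added to the terminating character class, because RTSP lines end with CR/LF and the dot in Java does not match a carriage return. The names SessionRegex and sessionId are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SessionRegex {
    // Lazy group: stops at the first ';' or line terminator after the id.
    static final Pattern SESSION = Pattern.compile("Session: (.+?)[\\r\\n;]");

    // Returns the session id, or null when no Session header is found.
    static String sessionId(String content) {
        Matcher m = SESSION.matcher(content);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(sessionId("CSeq: 3\r\nSession: 980603187; timeout=60\r\n"));
        System.out.println(sessionId("CSeq: 3\r\nSession: 980603187\r\nTransport: RTP/AVP/TCP\r\n"));
    }
}
```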

Once you have each header you can look it up in RFC 2326, which specifies the RTSP protocol.
First of all, you should split your string into lines. The lines end with CR/LF according to the specification. The first line indicates the response, the other should be header fields.
The definition is:
Session = "Session" ":" session-id [ ";" "timeout" "=" delta-seconds ]
where session-id is specified as:
session-id = 1*( ALPHA | DIGIT | safe )
which means you should not confuse it with a number. The definition of safe is
safe = "\$" | "-" | "_" | "." | "+"
and alpha means all upper- and lowercase letters. This means it is possible to put in a base 64 url encoded binary session-id, by the way.
OK, now it becomes a question of looking for the session ID. You step through all lines (except the first one) and then look for the line that matches:
^Session[ \t]*:[ \t]*([a-zA-Z0-9\$\-_.+]+).*$
this will match only valid session headers / valid session identifiers. Note that the standard is vague about white-space, so I skipped over space and tab characters before and after the colon ':'. The session identifier is then in group 1 of the regular expression.
You can of course easily extend this by including the timeout in the regular expression, once you need it.
Note that you will have to double escape the backslash characters before using the regular expression in Java. It's also possible to use the Posix character classes defined in the Pattern class to make the regular expression more readable.
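The line-by-line approach described above could look roughly like this in Java, with the backslashes double escaped. The class name RtspSessionParser and the method findSessionId are made up for the sketch, and splitting on \r?\n (rather than strictly CR/LF) is an assumption to tolerate bare line feeds.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RtspSessionParser {
    // The expression from the answer, escaped for a Java string literal.
    static final Pattern SESSION_HEADER =
        Pattern.compile("^Session[ \\t]*:[ \\t]*([a-zA-Z0-9$\\-_.+]+).*$");

    static Optional<String> findSessionId(String response) {
        String[] lines = response.split("\\r?\\n");
        // Line 0 is the status line, so header fields start at index 1.
        for (int i = 1; i < lines.length; i++) {
            Matcher m = SESSION_HEADER.matcher(lines[i]);
            if (m.matches()) {
                return Optional.of(m.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String response = "RTSP/1.0 200 OK\r\n"
            + "CSeq: 3\r\n"
            + "Session: 980603187; timeout=60\r\n"
            + "Transport: RTP/AVP/TCP;unicast;interleaved=0-1;ssrc=F006B800\r\n";
        System.out.println(findSessionId(response).orElse("(none)"));
    }
}
```

Returning an Optional makes the "no Session header" case explicit for the caller instead of relying on null.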

If you have Apache Commons Lang among your dependencies, you can do it in one line:
StringUtils.substringBetween(string, "Session: ", ";");

Java Regular Expression to Handle Strings Contains Next Line

I have text like this:
Customer Owned 03/26 04/25 0.00
Modem
Here "Modem" is on the next line.
Now I need to write the data into a spreadsheet as
Customer Owned Modem 03/26 04/25 0.00
I wrote this regex:
([a-zA-Z = ]*) ([[0-9]{2}/[0-9]{2} ]*) (-?[0-9]*\\.[0-9]+)
I am getting the description as "Customer Owned" instead of "Customer Owned Modem". Is there any way to handle this with a regex?

You could try this regex:
([A-Za-z ]+)([^A-Za-z]+)[\r\n]*([A-Za-z]+)
And replace by:
\1\3 \2
Here's a demo using your example.
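In Java the same replacement can be applied with String.replaceAll, where group references in the replacement are written $1, $2, $3 rather than \1, \2, \3. A minimal sketch (LineMerger and merge are hypothetical names); the trailing trim() is added because the second group also swallows the line break, which would otherwise end up at the end of the result.

```java
class LineMerger {
    // Group 1: leading description, group 2: the non-letter columns
    // (including the line break), group 3: the word from the next line.
    static String merge(String text) {
        return text
            .replaceAll("([A-Za-z ]+)([^A-Za-z]+)[\\r\\n]*([A-Za-z]+)", "$1$3 $2")
            .trim();
    }

    public static void main(String[] args) {
        System.out.println(merge("Customer Owned 03/26 04/25 0.00\nModem"));
    }
}
```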

To match newline you can try \r?\n
Update your regex accordingly to include newline as well as the text thereafter

Please add the following to your regular expression. This will detect an end of line on all platforms and capture the following line in the last group. You can then concatenate group 1 and the last group together.
(?:\n|\r|\n\r|\r\n)([a-zA-Z = ]*)$
