Java RegEX and group - java

I need to parse log files and extract some values into variables.
The log file will contain a string like this:
String logStr = "21:19:03 -[ 8b4]- ERROR - Jhy AlarmOccure::OnAdd - Updated existing alarm: ID [StrValue1:StrValu2|StrValue3], Instance [4053], SetStatus [0], AckStatus [1], SetTime [DateValue4], ClearedTime [DateValue5]";
I need to get StrValue1, StrValue2, StrValue3, DateValue4 and DateValue5 into variables. These fields change whenever there is an error.
First I was trying to at least get StrValue1, but I am not getting the expected result.
Pattern twsPattern = Pattern.compile(".*?ID ?[([^]:]*):([^]|]*)|([^]]*)]");//.*ID\\s$.([^]:]*.):.([^]|]*.)|.([^]]*.).]
Matcher twsMatcher = twsPattern.matcher(logStr);
if(twsMatcher.find()){
System.out.println(twsMatcher.start());
System.out.println(twsMatcher.group());
System.out.println(twsMatcher.end());
}
I am not able to understand how grouping works in regex.

Try the regexp ([a-zA-Z]+) \[([^\]]+)\].
For string 21:19:03 -[ 8b4]- ERROR - Jhy AlarmOccure::OnAdd - Updated existing alarm: ID [StrValue1:StrValu2|StrValue3], Instance [4053], SetStatus [0], AckStatus [1], SetTime [DateValue4], ClearedTime [DateValue5] it returns:
ID and StrValue1:StrValu2|StrValue3
Instance and 4053
SetStatus and 0
AckStatus and 1
SetTime and DateValue4
ClearedTime and DateValue5
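As a rough Java sketch of the same idea, the pattern can be applied in a find() loop so that each "Name [value]" pair is collected. The class name LogFields and the parse helper are made up for illustration; only the regex comes from the answer above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogFields {
    // A run of letters, one space, then a bracketed value.
    // Group 1 is the field name, group 2 is the text between the brackets.
    private static final Pattern FIELD = Pattern.compile("([a-zA-Z]+) \\[([^\\]]+)\\]");

    // Collects every "Name [value]" pair in the order it appears in the line.
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        String logStr = "21:19:03 -[ 8b4]- ERROR - Jhy AlarmOccure::OnAdd - Updated existing alarm: "
                + "ID [StrValue1:StrValu2|StrValue3], Instance [4053], SetStatus [0], AckStatus [1], "
                + "SetTime [DateValue4], ClearedTime [DateValue5]";
        parse(logStr).forEach((name, value) -> System.out.println(name + " and " + value));
    }
}
```

The ID value still contains all three parts, so it would need a second split (for example on : and |) to get the individual strings.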

Good on you for the attempt! You're actually doing quite well. You need to escape square brackets that you don't mean as character classes, i.e.
.*?ID ?\[
       ^
Also note that ([^]:]*) means "the longest possible string of characters without a closing square bracket or colon."
You probably also want to escape the |, as that is an alternation operator in regular expressions, i.e.
\|

Long story short, your regex fails to escape some characters, such as [ and | (the latter only needs escaping outside a character class, []).
So when you want to actually match the [ char, you have to use \[ (or \\[ inside the java string). Also, the negation in the group ([^]:]*) is not what it seems. You probably want just ([^:]*), which matches everything until a :.
To make it work, then, you would simply use Matcher#group(int) to retrieve the values. This is the adapted code with the final regex:
String logStr = "21:19:03 -[ 8b4]- ERROR - Jhy AlarmOccure::OnAdd - Updated existing alarm: ID [StrValue1:StrValu2|StrValue3], Instance [4053], SetStatus [0], AckStatus [1], SetTime [DateValue4], ClearedTime [DateValue5]";
Pattern twsPattern = Pattern.compile(".*?ID ?\\[([^:]*):([^|]*)\\|([^\\]]*)\\].*?SetTime ?\\[([^\\]]*)\\][^\\[]+\\[([^\\]]*)\\]");
Matcher twsMatcher = twsPattern.matcher(logStr);
if (twsMatcher.find()){
System.out.println(twsMatcher.group(1)); // StrValue1
System.out.println(twsMatcher.group(2)); // StrValu2
System.out.println(twsMatcher.group(3)); // StrValue3
System.out.println(twsMatcher.group(4)); // DateValue4
System.out.println(twsMatcher.group(5)); // DateValue5
}

I like more general solutions, but here is a very specific pattern you can use if it suits you. It will capture all of the values in a string as long as they follow the same, very specific pattern.
ID (?:\[([^\]:]+):([^\]|]+)\|([^\]]+)\]).*?SetTime \[([^\]]+)\], ClearedTime \[([^\]]+)\]
Here is the result:
1: ID [StrValue1:StrValu2|StrValue3], Instance [4053], SetStatus [0], AckStatus [1], SetTime [DateValue4], ClearedTime [DateValue5]
[1]: StrValue1
[2]: StrValu2
[3]: StrValue3
[4]: DateValue4
[5]: DateValue5
Multiple Matches per line
This version will just match each instance in a string of ID, SetTime, or ClearedTime followed by a bracketed value.
(ID|SetTime|ClearedTime) \[([^\]]+)\]
Results
1: ID [StrValue1:StrValu2|StrValue3]
[1]: ID
[2]: StrValue1:StrValu2|StrValue3
1: SetTime [DateValue4]
[1]: SetTime
[2]: DateValue4
1: ClearedTime [DateValue5]
[1]: ClearedTime
[2]: DateValue5

Related

MongoDB TextCriteria split on specific characters [duplicate]

Example:
> db.stuff.save({"foo":"bar"});
> db.stuff.find({"foo":"bar"}).count();
1
> db.stuff.find({"foo":"BAR"}).count();
0
You could use a regex.
In your example that would be:
db.stuff.find( { foo: /^bar$/i } );
I must say, though, maybe you could just downcase (or upcase) the value on the way in rather than incurring the extra cost every time you find it. Obviously this won't work for people's names and such, but it may suit use cases like tags.
UPDATE:
The original answer is now obsolete. Mongodb now supports advanced full text searching, with many features.
ORIGINAL ANSWER:
It should be noted that searching with regex's case insensitive /i means that mongodb cannot search by index, so queries against large datasets can take a long time.
Even with small datasets, it's not very efficient. You take a far bigger cpu hit than your query warrants, which could become an issue if you are trying to achieve scale.
As an alternative, you can store an uppercase copy and search against that. For instance, I have a User table that has a username which is mixed case, but the id is an uppercase copy of the username. This ensures case-sensitive duplication is impossible (having both "Foo" and "foo" will not be allowed), and I can search by id = username.toUpperCase() to get a case-insensitive search for username.
If your field is large, such as a message body, duplicating data is probably not a good option. I believe using an extraneous indexer like Apache Lucene is the best option in that case.
Starting with MongoDB 3.4, the recommended way to perform fast case-insensitive searches is to use a Case Insensitive Index.
I personally emailed one of the founders to please get this working, and he made it happen! It was an issue on JIRA since 2009, and many have requested the feature. Here's how it works:
A case-insensitive index is made by specifying a collation with a strength of either 1 or 2. You can create a case-insensitive index like this:
db.cities.createIndex(
{ city: 1 },
{
collation: {
locale: 'en',
strength: 2
}
}
);
You can also specify a default collation per collection when you create them:
db.createCollection('cities', { collation: { locale: 'en', strength: 2 } } );
In either case, in order to use the case-insensitive index, you need to specify the same collation in the find operation that was used when creating the index or the collection:
db.cities.find(
{ city: 'new york' }
).collation(
{ locale: 'en', strength: 2 }
);
This will return "New York", "new york", "New york" etc.
Other notes
The answers suggesting to use full-text search are wrong in this case (and potentially dangerous). The question was about making a case-insensitive query, e.g. username: 'bill' matching BILL or Bill, not a full-text search query, which would also match stemmed words of bill, such as Bills, billed etc.
The answers suggesting to use regular expressions are slow, because even with indexes, the documentation states:
"Case insensitive regular expression queries generally cannot use indexes effectively. The $regex implementation is not collation-aware and is unable to utilize case-insensitive indexes."
$regex answers also run the risk of user input injection.
If you need to create the regexp from a variable, this is a much better way to do it: https://stackoverflow.com/a/10728069/309514
You can then do something like:
var string = "SomeStringToFind";
var regex = new RegExp(["^", string, "$"].join(""), "i");
// Creates a regex of: /^SomeStringToFind$/i
db.stuff.find( { foo: regex } );
This has the benefit of being more programmatic, and you can get a performance boost by compiling it ahead of time if you're reusing it a lot.
Keep in mind that the previous example:
db.stuff.find( { foo: /bar/i } );
will cause every entry containing bar to match the query (bar1, barxyz, openbar), which could be very dangerous for a username search in an auth function.
You may need to make it match only the search term by using the appropriate regexp syntax:
db.stuff.find( { foo: /^bar$/i } );
See http://www.regular-expressions.info/ for syntax help on regular expressions
db.company_profile.find({ "companyName" : { "$regex" : "Nilesh" , "$options" : "i"}});
db.zipcodes.find({city : "NEW YORK"}); // Case-sensitive
db.zipcodes.find({city : /NEW york/i}); // Note the 'i' flag for case-insensitivity
TL;DR
The correct way to do this in MongoDB:
Do not use RegExp.
Use MongoDB's built-in text indexing and search instead.
Step 1 :
db.articles.insert(
[
{ _id: 1, subject: "coffee", author: "xyz", views: 50 },
{ _id: 2, subject: "Coffee Shopping", author: "efg", views: 5 },
{ _id: 3, subject: "Baking a cake", author: "abc", views: 90 },
{ _id: 4, subject: "baking", author: "xyz", views: 100 },
{ _id: 5, subject: "Café Con Leche", author: "abc", views: 200 },
{ _id: 6, subject: "Сырники", author: "jkl", views: 80 },
{ _id: 7, subject: "coffee and cream", author: "efg", views: 10 },
{ _id: 8, subject: "Cafe con Leche", author: "xyz", views: 10 }
]
)
Step 2 :
You need to create a text index on whichever field you want to search; the $text operator requires one:
db.articles.createIndex( { subject: "text" } )
step 3 :
db.articles.find( { $text: { $search: "coffee",$caseSensitive :true } } ) //FOR SENSITIVITY
db.articles.find( { $text: { $search: "coffee",$caseSensitive :false } } ) //FOR INSENSITIVITY
One very important thing to keep in mind when using a Regex based query - When you are doing this for a login system, escape every single character you are searching for, and don't forget the ^ and $ operators. Lodash has a nice function for this, should you be using it already:
db.stuff.find({ foo: { $regex: '^' + _.escapeRegExp(bar) + '$', $options: 'i' } })
Why? Imagine a user entering .* as his username. That would match all usernames, enabling a login by just guessing any user's password.
Suppose you want to search "column" in "Table" and you want case insensitive search. The best and efficient way is:
//create empty JSON Object
mycolumn = {};
//check if column has valid value
if(column) {
mycolumn.column = {$regex: new RegExp(column), $options: "i"};
}
Table.find(mycolumn);
It just adds your search value as RegEx and searches in with insensitive criteria set with "i" as option.
Mongo (current version 2.0.0) doesn't allow case-insensitive searches against indexed fields - see their documentation. For non-indexed fields, the regexes listed in the other answers should be fine.
For searching a variable and escaping it:
const escapeStringRegexp = require('escape-string-regexp')
const name = 'foo'
db.stuff.find({name: new RegExp('^' + escapeStringRegexp(name) + '$', 'i')})
Escaping the variable protects the query against attacks with '.*' or other regex.
escape-string-regexp
The best method is in your language of choice, when creating a model wrapper for your objects, have your save() method iterate through a set of fields that you will be searching on that are also indexed; those set of fields should have lowercase counterparts that are then used for searching.
Every time the object is saved again, the lowercase properties are then checked and updated with any changes to the main properties. This will make it so you can search efficiently, but hide the extra work needed to update the lc fields each time.
The lower case fields could be a key:value object store or just the field name with a prefixed lc_. I use the second one to simplify querying (deep object querying can be confusing at times).
Note: you want to index the lc_ fields, not the main fields they are based off of.
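A minimal Java sketch of the lc_ idea, assuming the document is held as a plain Map before it is handed to whatever driver or model layer does the saving. The class name, the method name and the field list are all hypothetical; nothing here is a MongoDB API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class LowercaseCopies {
    // Returns a copy of the document with an "lc_<field>" entry for every
    // searchable field that holds a String. Call it on every save so the
    // lowercase copies never drift from the main properties.
    static Map<String, Object> withLowercaseCopies(Map<String, Object> doc, List<String> searchable) {
        Map<String, Object> out = new LinkedHashMap<>(doc);
        for (String field : searchable) {
            Object value = doc.get(field);
            if (value instanceof String) {
                // Locale.ROOT keeps the result independent of the JVM's default locale.
                out.put("lc_" + field, ((String) value).toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> user = new LinkedHashMap<>();
        user.put("username", "FooBar");
        user.put("age", 30);
        System.out.println(withLowercaseCopies(user, Arrays.asList("username")));
    }
}
```

A query would then lowercase the search term the same way and compare it against lc_username.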
Using Mongoose this worked for me:
var find = function(username, next){
User.find({'username': {$regex: new RegExp('^' + username, 'i')}}, function(err, res){
if(err) throw err;
next(null, res);
});
}
If you're using MongoDB Compass:
Go to the collection, in the filter type -> {Fieldname: /string/i}
For Node.js using Mongoose:
Model.find({FieldName: {$regex: "stringToSearch", $options: "i"}})
The aggregation framework was introduced in mongodb 2.2 . You can use the string operator "$strcasecmp" to make a case-insensitive comparison between strings. It's more recommended and easier than using regex.
Here's the official document on the aggregation command operator: https://docs.mongodb.com/manual/reference/operator/aggregation/strcasecmp/#exp._S_strcasecmp .
You can use Case Insensitive Indexes:
The following example creates a collection with no default collation, then adds an index on the name field with a case-insensitive collation.
/*
 * strength: CollationStrength.Secondary
 * Secondary level of comparison. Collation performs comparisons up to secondary
 * differences, such as diacritics. That is, collation performs comparisons of
 * base characters (primary differences) and diacritics (secondary differences).
 * Differences between base characters take precedence over secondary
 * differences.
 */
db.users.createIndex( { name: 1 }, { collation: { locale: 'tr', strength: 2 } } )
To use the index, queries must specify the same collation.
db.users.insert( [ { name: "Oğuz" },
{ name: "oğuz" },
{ name: "OĞUZ" } ] )
// does not use index, finds one result
db.users.find( { name: "oğuz" } )
// uses the index, finds three results
db.users.find( { name: "oğuz" } ).collation( { locale: 'tr', strength: 2 } )
// does not use the index, finds three results (different strength)
db.users.find( { name: "oğuz" } ).collation( { locale: 'tr', strength: 1 } )
or you can create a collection with default collation:
db.createCollection("users", { collation: { locale: 'tr', strength: 2 } } )
db.users.createIndex( { name : 1 } ) // inherits the default collation
I'm surprised nobody has warned about the risk of regex injection when using /^bar$/i where bar comes from user input, such as a password or an account ID search (e.g. bar = .*#myhackeddomain.com). So here is my suggestion: use the \Q ... \E quoting escapes that Perl provides:
db.stuff.find( { foo: /^\Qbar\E$/i } );
You should also escape the \ characters in the bar variable as \\ so that the quoting cannot be ended early by input such as bar = '\E.*#myhackeddomain.com\Q'.
Another option is to use a regex escape char strategy like the one described here Javascript equivalent of Perl's \Q ... \E or quotemeta()
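For comparison, this is roughly how the same \Q ... \E quoting looks on the Java side, where Pattern.quote builds the quoted form and also copes with a \E embedded in the input. It only illustrates the escaping idea with java.util.regex; it is not a MongoDB query, and the class and method names are made up.

```java
import java.util.regex.Pattern;

final class LiteralMatch {
    // Builds an anchored, case-insensitive pattern that treats the whole
    // user input as literal text, whatever metacharacters it contains.
    static Pattern exactIgnoreCase(String userInput) {
        return Pattern.compile("^" + Pattern.quote(userInput) + "$", Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        String hostile = "\\E.*#myhackeddomain.com\\Q"; // tries to break out of the quoting
        System.out.println(exactIgnoreCase("bar").matcher("BAR").matches());
        System.out.println(exactIgnoreCase(".*").matcher("anyone").matches());
        System.out.println(exactIgnoreCase(hostile).matcher("x#myhackeddomain.com").matches());
    }
}
```

The first check matches because only case differs; the other two do not, because .* and the \E trick are both treated as plain characters.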
Use RegExp.
If none of the other options work for you, RegExp is a good option. It makes the match case-insensitive.
var username = new RegExp("^" + "John" + "$", "i");
Use username in your queries and you're done.
I hope it will work for you too. All the best.
If the query contains special characters, a simple regex will not work. You will need to escape those special characters.
The following helper function can help without installing any third-party library:
const escapeSpecialChars = (str) => {
return str.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&");
}
And your query will be like this:
db.collection.find({ field: { $regex: escapeSpecialChars(query), $options: "i" }})
Hope it will help!
Using a filter works for me in C#.
string s = "searchTerm";
var filter = Builders<Model>.Filter.Where(p => p.Title.ToLower().Contains(s.ToLower()));
var listSorted = collection.Find(filter).ToList();
var list = collection.Find(filter).ToList();
It may even use the index because I believe the methods are called after the return happens but I haven't tested this out yet.
This also avoids a problem of
var filter = Builders<Model>.Filter.Eq(p => p.Title.ToLower(), s.ToLower());
that mongodb will think p.Title.ToLower() is a property and won't map properly.
I had faced a similar issue and this is what worked for me:
const flavorExists = await Flavors.findOne({
'flavor.name': { $regex: flavorName, $options: 'i' },
});
Yes it is possible
You can use the $expr like that:
$expr: {
$eq: [
{ $toLower: '$STRUNG_KEY' },
{ $toLower: 'VALUE' }
]
}
Please do not use a regex, because it can cause a lot of problems, especially if the string comes from the end user.
I've created a simple Func for the case insensitive regex, which I use in my filter.
private Func<string, BsonRegularExpression> CaseInsensitiveCompare = (field) =>
BsonRegularExpression.Create(new Regex(field, RegexOptions.IgnoreCase));
Then you simply filter on a field as follows.
db.stuff.find({"foo": CaseInsensitiveCompare("bar")}).count();
These have been tested for string searches
{'_id': /.*CM.*/} ||find _id where _id contains ->CM
{'_id': /^CM/} ||find _id where _id starts ->CM
{'_id': /CM$/} ||find _id where _id ends ->CM
{'_id': /.*UcM075237.*/i} ||find _id where _id contains ->UcM075237, ignore upper/lower case
{'_id': /^UcM075237/i} ||find _id where _id starts ->UcM075237, ignore upper/lower case
{'_id': /UcM075237$/i} ||find _id where _id ends ->UcM075237, ignore upper/lower case
For anyone using Golang who wants case-insensitive search with MongoDB and the globalsign mgo library:
collation := &mgo.Collation{
Locale: "en",
Strength: 2,
}
err := collection.Find(query).Collation(collation)
As you can see in mongo docs - since version 3.2 $text index is case-insensitive by default: https://docs.mongodb.com/manual/core/index-text/#text-index-case-insensitivity
Create a text index and use $text operator in your query.

How to trim/cut string in java by symbol?

I'm working on a project where my API returns url with id at the end of it and I want to extract it to be used in another function. Here's example url:
String advertiserUrl = "http://../../.../uuid/advertisers/4"; // <<< this is the ID I want to extract
At the moment I'm using Java's substring() method, but this is not the best approach because IDs could grow to three digits and I would only get part of them. Here's my current approach:
String id = advertiserUrl.substring(advertiserUrl.length()-1,advertiserUrl.length());
System.out.println(id); // 4
It works in this case, but if the ID were e.g. "123" I would only get "3" after using substring. So my question is: is there a way to cut/trim a string on slashes ("/")? Say there are 5 slashes in my current URL; could the string be cut off after the fifth slash is detected? Any other sensible approach would be helpful too. Thanks.
P.S. The uuid in the URL may vary in length too.
You don't need to use regular expressions for this.
Use String#lastIndexOf along with substring instead:
String advertiserUrl = "http://../../.../uuid/advertisers/4";// <<< this is the ID i want to extract.
// this implies your URLs always end with "/[some value of undefined length]".
// Other formats might throw exception or yield unexpected results
System.out.println(advertiserUrl.substring(advertiserUrl.lastIndexOf("/") + 1));
Output
4
Update
To find the uuid value, you can use regular expressions:
String advertiserUrl = "http://111.111.11.111:1111/api/ppppp/2f5d1a31-878a-438b-a03b-e9f51076074a/advertisers/9";
//                           | preceded by "/"
//                           |     | any non-"/" character, reluctantly quantified
//                           |     |     | followed by "/advertisers"
Pattern p = Pattern.compile("(?<=/)[^/]+?(?=/advertisers)");
Matcher m = p.matcher(advertiserUrl);
if (m.find()) {
System.out.println(m.group());
}
Output
2f5d1a31-878a-438b-a03b-e9f51076074a
You can either split the string on slashes and take the last position of the array returned, or use the lastIndexOf("/") to get the index of the last slash and then substring the rest of the string.
Use the lastIndexOf() method, which returns the index of the last occurrence of the specified character.
String id = advertiserUrl.substring(advertiserUrl.lastIndexOf('/') + 1, advertiserUrl.length());
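Both suggestions can be sketched side by side. The helper names below are made up, and the comments spell out how each one behaves on input without a slash or with a trailing slash, which is where the two differ.

```java
final class LastSegment {
    // lastIndexOf returns -1 when there is no '/', so substring(0) then
    // yields the whole string. A trailing '/' yields an empty string.
    static String byLastIndexOf(String url) {
        return url.substring(url.lastIndexOf('/') + 1);
    }

    // String.split drops trailing empty strings, so a trailing '/' is
    // ignored here and the segment before it is returned instead.
    static String bySplit(String url) {
        String[] parts = url.split("/");
        return parts.length == 0 ? "" : parts[parts.length - 1];
    }

    public static void main(String[] args) {
        String advertiserUrl = "http://../../.../uuid/advertisers/123";
        System.out.println(byLastIndexOf(advertiserUrl));
        System.out.println(bySplit(advertiserUrl));
    }
}
```

If the ID is needed as a number, Integer.parseInt on the result would be the next step, with the usual NumberFormatException handling for malformed URLs.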

Java named backreferences not matching

I'm writing a simplified SQL parser that's using regexes to match each valid command. I'm stuck on matching the following:
attribute1 type1, attribute2 type2, attribute3 type3, ...
Where attributes are names of table columns, and types can be a CHAR(size), INT, or DEC. This is used in a CREATE TABLE statement:
CREATE TABLE student (id INT, name CHAR(20), gpa DEC);
To debug it, I'm trying to match this:
id INT, name CHAR(20), gpa DEC
with this:
(?<attributepair>[A-Za-z0-9_]+ (INT|(CHAR\([0-9]{1,3}\))|DEC))(, \k<attributepair>)*
I even tried it without naming the backreference:
([A-Za-z0-9_]+ (INT|(CHAR\([0-9]{1,3}\))|DEC))(, \1)*
I tested the latter regex with regexpal and it matched, but neither matches when I try it in my Java program. Is there something I'm missing? How can I make this work? Perhaps this has something to do with how I'm calling Pattern.compile(), such as a missing flag. I'm also on JDK v7.
Update: I've found that although matches() returns false, lookingAt() and find() return true. It's matching each individual attribute. I want to craft my regex so it matches the whole expression rather than each attribute.
There is no "match as many times as possible and join all the groups together" in Java.
You either have to do it yourself using:
while(matcher.find()) {
// ...
}
... or using a regex that already matches everything in a single call to find.
For example, you could try the following regex (as Java String) instead, which will match all your attributes at once.
(?:\\w+ (?:INT|CHAR(?:\\(\\d{1,3}\\))?|DEC)(?:, )?)+
Here is a working example:
final String str = "CREATE TABLE student (id INT, name CHAR(20), gpa DEC);";
final Pattern p = Pattern.compile("(?:\\w+ (?:INT|CHAR(?:\\(\\d{1,3}\\))?|DEC)(?:, )?)+");
final Matcher m = p.matcher(str);
if(m.find()) {
System.out.println(m.group()); // prints "id INT, name CHAR(20), gpa DEC"
}
Output:
id INT, name CHAR(20), gpa DEC
When you do something like ([A-Za-z0-9_]+ (INT|(CHAR\([0-9]{1,3}\))|DEC))(, \1)* the backreference is for what the first group actually matched.
That is, id INT, id INT, name CHAR(20), gpa DEC would work with the backreference in the sense that id INT, id INT would become part of the same match. (If you stick that in regexpal you'll see the difference quite clearly based on the highlights.)
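Since a backreference repeats the matched text rather than the sub-pattern, one way to get matches() to accept the whole list is to repeat the sub-pattern itself by building the regex from a string constant. The sketch below assumes the same attribute grammar as the question; the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AttributeList {
    // One "name type" pair, without capturing groups so it can be reused.
    private static final String ATTR = "[A-Za-z0-9_]+ (?:INT|CHAR\\([0-9]{1,3}\\)|DEC)";
    // The sub-pattern is repeated textually instead of using \1.
    private static final Pattern WHOLE = Pattern.compile(ATTR + "(?:, " + ATTR + ")*");
    // Same pair with capturing groups, used to pull out each name and type.
    private static final Pattern PAIR =
            Pattern.compile("([A-Za-z0-9_]+) (INT|CHAR\\([0-9]{1,3}\\)|DEC)");

    // True only if the entire input is a comma-separated attribute list.
    static boolean isValid(String s) {
        return WHOLE.matcher(s).matches();
    }

    // Returns {name, type} for each attribute; call only after isValid succeeds.
    static List<String[]> pairs(String s) {
        List<String[]> result = new ArrayList<>();
        Matcher m = PAIR.matcher(s);
        while (m.find()) {
            result.add(new String[] { m.group(1), m.group(2) });
        }
        return result;
    }

    public static void main(String[] args) {
        String attrs = "id INT, name CHAR(20), gpa DEC";
        if (isValid(attrs)) {
            for (String[] p : pairs(attrs)) {
                System.out.println(p[0] + " -> " + p[1]);
            }
        }
    }
}
```

Validation and extraction are kept separate because a repeated capturing group in Java only retains its last iteration.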

Java - Capture optional field with regexp?

I have a regex that correctly captures values from a string.
The regex looks like this:
intGetHatSaatRenk_v22=anyType{SiraNo=(.*?); HatKodu=(.*?) ; GunTipi=(.*?); Gidis=(.*?); ? };
But the problem is that the source looks like this:
intGetHatSaatRenk_v22=anyType{SiraNo=54; HatKodu=502 ; GunTipi=C; Gidis=12:00; RenkGidis=0000FF; };
intGetHatSaatRenk_v22=anyType{SiraNo=55; HatKodu=502 ; GunTipi=C; Gidis=12:07; }; intGetHatSaatRenk_v22=anyType{SiraNo=56; HatKodu=502 ; GunTipi=C; Gidis=12:14; };
As you can see, there is an optional field named RenkGidis. How can I get the value of RenkGidis when it is present?
With the regex above, when RenkGidis exists, group(4) comes out as 12:00; RenkGidis=0000FF, but group(4) should be only 12:00.
I hope I have explained my problem clearly.
Might want to make the last group optional:
intGetHatSaatRenk_v22=anyType\{SiraNo=([^;\s]*);\s+HatKodu=([^;\s]*)\s*;\s+GunTipi=([^;\s]*);\s+Gidis=([^;\s]*);(?:\s+RenkGidis=([^;\s]*);)?
As a Java string:
"intGetHatSaatRenk_v22=anyType\\{SiraNo=([^;\\s]*);\\s+HatKodu=([^;\\s]*)\\s*;\\s+GunTipi=([^;\\s]*);\\s+Gidis=([^;\\s]*);(?:\\s+RenkGidis=([^;\\s]*);)?"
In the last group, (?: ... ) is a non-capturing group, so it does not appear in the output; the ( ... ) inside it is captured as usual.
I also changed .*? to [^;\s]* (the negation of [;\s], i.e. any character that is neither whitespace nor ;).
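A small Java sketch of how the optional group behaves in practice, using the Java string above: when RenkGidis is absent, group(5) is null, so it has to be checked before use. The class name, the describe helper and the "-" placeholder are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HatSaat {
    private static final Pattern RECORD = Pattern.compile(
            "intGetHatSaatRenk_v22=anyType\\{SiraNo=([^;\\s]*);\\s+HatKodu=([^;\\s]*)\\s*;"
            + "\\s+GunTipi=([^;\\s]*);\\s+Gidis=([^;\\s]*);(?:\\s+RenkGidis=([^;\\s]*);)?");

    // Returns "SiraNo Gidis RenkGidis" per record, with "-" when RenkGidis is missing.
    static List<String> describe(String source) {
        List<String> rows = new ArrayList<>();
        Matcher m = RECORD.matcher(source);
        while (m.find()) {
            String renk = m.group(5); // null when the optional part did not take part in the match
            rows.add(m.group(1) + " " + m.group(4) + " " + (renk == null ? "-" : renk));
        }
        return rows;
    }

    public static void main(String[] args) {
        String source =
                "intGetHatSaatRenk_v22=anyType{SiraNo=54; HatKodu=502 ; GunTipi=C; Gidis=12:00; RenkGidis=0000FF; }; "
                + "intGetHatSaatRenk_v22=anyType{SiraNo=55; HatKodu=502 ; GunTipi=C; Gidis=12:07; };";
        describe(source).forEach(System.out::println);
    }
}
```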
As Alan mentioned in the comments, to avoid getting a null match for the optional part, make only the RenkGidis= prefix optional and wrap the value in an alternation with nothing: ([^;\s]*|)
intGetHatSaatRenk_v22=anyType\{SiraNo=([^;\s]*);\s+HatKodu=([^;\s]*)\s*;\s+GunTipi=([^;\s]*);\s+Gidis=([^;\s]*);(?:\s+RenkGidis=)?([^;\s]*|)
As a Java string:
"intGetHatSaatRenk_v22=anyType\\{SiraNo=([^;\\s]*);\\s+HatKodu=([^;\\s]*)\\s*;\\s+GunTipi=([^;\\s]*);\\s+Gidis=([^;\\s]*);(?:\\s+RenkGidis=)?([^;\\s]*|)"
The regex could look like this
intGetHatSaatRenk_v22=anyType\{SiraNo=(.*?); HatKodu=(.*?) ; GunTipi=(.*?); Gidis=(.*?);( RenkGidis=.*?;\s*|\s*)\};
Group 5 will then be either " RenkGidis=0000FF;" or " ". You can then use a second regex to get 0000FF.

Java Regular Expression to Handle Strings Contains Next Line

I have a text like this
Customer Owned 03/26 04/25 0.00
Modem
Here, Modem is on the next line.
Now I need to write the data into a spreadsheet as
Customer Owned Modem 03/26 04/25 0.00
I wrote a regex as
([a-zA-Z = ]*) ([[0-9]{2}/[0-9]{2} ]*) (-?[0-9]*\\.[0-9]+)
I am getting the description as "Customer Owned" instead of "Customer Owned Modem". Is there any way to handle this with a regex?
You could try this regex:
([A-Za-z ]+)([^A-Za-z]+)[\r\n]*([A-Za-z]+)
And replace by:
\1\3 \2
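In Java the replacement string uses $1 rather than \1. A minimal sketch with the regex above, assuming the wrapped word is always the last thing in the text (the class and method names are made up):

```java
import java.util.regex.Pattern;

final class JoinWrappedLine {
    // Group 1: description letters and spaces, group 2: the non-letter run
    // (dates, amount and the line break), group 3: the wrapped word.
    private static final Pattern WRAPPED =
            Pattern.compile("([A-Za-z ]+)([^A-Za-z]+)[\\r\\n]*([A-Za-z]+)");

    // Moves the wrapped word up next to the description. The line break ends
    // up at the end of group 2, so trim() removes it.
    static String join(String text) {
        return WRAPPED.matcher(text).replaceFirst("$1$3 $2").trim();
    }

    public static void main(String[] args) {
        System.out.println(join("Customer Owned 03/26 04/25 0.00\nModem"));
    }
}
```

Note that group 2 swallows the line break because [^A-Za-z]+ matches it, so [\r\n]* in the pattern ends up matching nothing for this input.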
To match newline you can try \r?\n
Update your regex accordingly to include newline as well as the text thereafter
Please add the following to your regular expression. This will detect an end of line on all platforms and capture the following line in the last group. You can then concatenate group 1 and the last group together.
(?:\n|\r|\n\r|\r\n)([a-zA-Z = ]*)$
