java regex matcher results != to notepad++ regex find result

I am trying to extract data out of a website access log as part of a java program. Every entry in the log has a url. I have successfully extracted the url out of each record.
Within the url, there is a parameter that I want to capture so that I can use it to query a database. Unfortunately, it doesn't seem that the web developers used any one standard to write the parameter's name.
The parameter is usually called "course_id", but I have also seen "courseId", "course%3DId", "course%253Did", etc. The format for the parameter name and value is usually course_id=_22222_1, where the number I want is between the "_" and "_1". (The value is always the same, even if the parameter name varies.)
So, my idea was to use the regex /^.*course_id[^_]*_(\d*)_1.*$/i to find and extract the number.
In Java, my code is:
java.util.regex.Pattern courseIDPattern = java.util.regex.Pattern.compile(".*course[^i]*id[^_]*_(\\d*)_1.*", java.util.regex.Pattern.CASE_INSENSITIVE);
java.util.regex.Matcher courseIDMatcher = courseIDPattern.matcher(_url);
_courseID = "";
if (courseIDMatcher.matches())
{
    _courseID = retrieveCourseID(courseIDMatcher.group(1));
    return;
}
This works for a lot of the records. However, some records do not record the course_id, even though the parameter is in the url. One such example is the record:
/webapps/contentDetail?course_id=_223629_1&content_id=_3641164_1&rich_content_level=RICH&language=en_US&v=1&ver=4.1.2
However, I used notepad++ to do a regex replace on this (in fact, every) url using the regex above, and the url was successfully replaced by the course ID, implying that the regex is not incorrect.
Am I doing something wrong in the java code, or is the java matcher broken?
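One thing worth ruling out, as a guess rather than a diagnosis: matches() must consume the entire input, and . does not match line terminators by default, so a url that still carries a trailing \r or \n from the log line would fail even though the visible text looks fine. Traced by hand, the pattern from the question does match the sample url when nothing trails it. The sketch below uses find() with the same core pattern so that surrounding characters stop mattering. The class and method names are hypothetical, and \d+ replaces \d* so that an empty id is not accepted.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: class and method names are hypothetical, not from the question. */
class CourseIdExtractor {

    // Core of the question's pattern without the leading and trailing ".*":
    // "course", any non-'i' characters, "id", any non-'_' characters, then _<digits>_1.
    private static final Pattern COURSE_ID = Pattern.compile(
            "course[^i]*id[^_]*_(\\d+)_1", Pattern.CASE_INSENSITIVE);

    /** Returns the digits between "_" and "_1", or empty if no course id is found. */
    static Optional<String> extractCourseId(String url) {
        Matcher m = COURSE_ID.matcher(url);
        // find() looks for the pattern anywhere in the input, whereas
        // matches() requires the whole input to match.
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String url = "/webapps/contentDetail?course_id=_223629_1&content_id=_3641164_1"
                + "&rich_content_level=RICH&language=en_US&v=1&ver=4.1.2";
        System.out.println(extractCourseId(url).orElse("(none)")); // prints 223629
    }
}
```

If this returns the id for the failing records while the original code does not, the difference between find() and matches() (and therefore stray characters in _url) is the likely cause; if both fail, the problem is probably elsewhere, for example in retrieveCourseID.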

Related

Redisearch query with "begin with" instead of "contains"

I am trying to understand how to perform queries in Redisearch strictly with "begins with", and I keep getting "contains".
For example if I have fields with values like 'football', 'myfootball', 'greenfootball' and would provide a search term like this:
> FT.SEARCH myIdx @myfield:foot*
I want just to get 'football' but I keep getting other fields that contain the word instead of beginning with that word.
Is there a way to avoid this?
I was trying to use VERBATIM and things like @myfield:^foot* but nothing.
I am using JRedisearch as a client but eventually I had to enter the DB and perform these queries manually in order to figure out what's happening. That being said, is this possible to do with this client at the moment?
Thanks
EDIT
A sample of my index setup:
Client client = new Client(INDEX_NAME, url, PORT);
Schema sc = new Schema().addSortableTextField("url", 1.0); // using this field for query
client.dropIndex(true);
client.createIndex(sc, Client.IndexOptions.Default());
return client;
Sample document:
id: // random uuid
urlPath: myfootbal
application: web
market: Europe

After checking the RDB you provided, I see that searching for foot* does not return myfootbal. The replies look like this: /dot-com/plp/football/x/index.html. You get those replies because the url is tokenized, and '/' is one of the tokenizer's separator characters, so football is indexed as a term of its own. If you do not want the urls to be tokenized, declare the field as TAG rather than TEXT. The entire url is then indexed as is, and a search for foot* will no longer return it.
For more information about TAGS see the FT.CREATE documentation: https://oss.redislabs.com/redisearch/Commands.html
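To make the difference concrete, here is a simplified model in plain Java. It is not RediSearch's real tokenizer: it assumes a TEXT field is split on every non-alphanumeric character and that a TAG field keeps the whole value as one term. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Locale;

/** Simplified model only; this is not how RediSearch is implemented. */
class PrefixMatchModel {

    // TEXT-like behaviour (assumed): split the value on non-alphanumeric
    // characters and test every resulting token against the prefix.
    static boolean textFieldMatches(String value, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        return Arrays.stream(value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+"))
                .filter(token -> !token.isEmpty())
                .anyMatch(token -> token.startsWith(p));
    }

    // TAG-like behaviour (assumed): the whole value is a single term,
    // so the prefix must match from the first character.
    static boolean tagFieldMatches(String value, String prefix) {
        return value.toLowerCase(Locale.ROOT).startsWith(prefix.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String url = "/dot-com/plp/football/x/index.html";
        System.out.println(textFieldMatches(url, "foot")); // true: "football" is a token
        System.out.println(tagFieldMatches(url, "foot"));  // false: the value starts with "/"
    }
}
```

In JRediSearch the schema would presumably declare the field with a tag-field method such as Schema.addTagField("url") instead of addSortableTextField; treat that method name as an assumption and check it against the client version in use.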

How to replace a query string in an Apache Velocity template?

In my web application I'm trying to prevent users from inserting JavaScript in the freeText parameter when they're running a search.
To do this, I've written code in the header Velocity file to check whether the query string contains a parameter called freeText, and if so, use the replace method to replace the characters within the parameter value. However, when you load the page, it still displays the original query string - I'm unsure on how to replace the original query string with my new one which has the replaced characters.
This is my code:
#set($freeTextParameter = "$request.getParameter('freeText')")
freeTextParameter: $freeTextParameter
#if($freeTextParameter)
##Do the replacement:
#set($replacedQueryString = "$freeTextParameter.replace('confirm','replaced')")
replacedQueryString after doing the replace: $replacedQueryString
The query string now: $request.getQueryString()
The freeText parameter now: $request.getParameter('freeText')
#end
In the code above, the replacedQueryString variable has changed as expected (ie the replacement has been carried out as expected), but the $request.getQueryString() and $request.getParameter('freeText') are still the same as before, as if the replacement had never happened.
Seeing as there is a request.getParameter method which works fine for getting the parameters, I assumed there would be a request.setParameter method to do the same thing in reverse, but there isn't.

The Java String is an immutable object, which means that the replace() method will return an altered string, without changing the original one.
Since the parameters map given by the HttpServletRequest object cannot be modified, this approach doesn't work well if your templates rely on $request.getParameter('freeText').
Instead, if you use VelocityTools, you can rely on $params.freeText in your templates. Then you can adjust your WEB-INF/tools.xml file to make this parameters map alterable:
<?xml version="1.0"?>
<tools>
  <toolbox scope="request">
    <tool key="params" readOnly="false"/>
    ...
  </toolbox>
  ...
</tools>
(Version 2.0+ of the tools is required).
Then, in your header, you can do:
#set($params.freeText = $params.freeText.replace('confirm','replaced'))
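The point about immutability can be seen in a few lines of plain Java. This is only an illustration with a hypothetical class name: replace() hands back a new String, and the original value (like the request parameter in the question) stays exactly as it was.

```java
/** Sketch: String.replace returns a new String and leaves the receiver untouched. */
class ImmutableReplaceDemo {

    static String sanitize(String freeText) {
        // The result has to be captured; freeText itself is never modified.
        return freeText.replace("confirm", "replaced");
    }

    public static void main(String[] args) {
        String original = "confirm(1)";
        String replaced = sanitize(original);
        System.out.println(original); // confirm(1)
        System.out.println(replaced); // replaced(1)
    }
}
```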

I managed to fix the issue myself. It turned out that there was another file (which gets called on every page) in which the "$!request.getParameter('freeText')" variable is used. I have updated that file so that it uses the new $!replacedQueryString variable (i.e. the one with the JavaScript stripped out) instead of the existing "$!request.getParameter('freeText')" variable. This now prevents the JavaScript from being executed on every page.
So, this is the final working code in the header Velocity file:
#set($freeTextParameter = "$!m.request.httpRequest.getParameter('freeText')")
#if($freeTextParameter)
#set($replacedQueryString = "$freeTextParameter.replace('confirm','').replace('<','').replace('>','').replace('(','').replace(')','').replace(';','').replace('/','').replace('\"','').replace('&','').replace('+','').replace('script','').replace('prompt','').replace('*','').replace('.','')")
#end
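A different technique, not the one used in the code above, is to escape the HTML-significant characters when the value is written out, rather than stripping a fixed list of substrings. The sketch below shows that idea in plain Java under the assumption that the value is echoed into HTML text or a quoted attribute. The class name is hypothetical, and a maintained escaping library or tool would normally be preferable to hand-rolled code.

```java
/** Sketch of output escaping; hypothetical helper, not part of Velocity or the answer above. */
class HtmlEscapeSketch {

    // Replaces the five characters that can change HTML structure with entities.
    static String escapeHtml(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this approach a search for a word such as "script" or a value containing "." still works, because nothing is deleted from the user's input.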

Using regex to find chars in a string and replace

When returning a string value from an incoming request in my network based app, I have a string like this
'post http://a.com\r\nHost: a.com\r\n'
The issue is that the host is always changing, so I need to replace it with my defined host. To accomplish that I tried using a regex, but I am stuck trying to find the 'Host: a.com' chars in the string and replace them with a defined value.
I tried using this example www.javamex.com/tutorials/regular_expressions/search_replace_loop.shtml#.VUWvt541jqB changing the pattern compile to :([\\d]+) but it still remains unchanged.
My goal is to replace given chars in a string with a defined value and returning the new string with the defined value.
Any pointers?
EDIT:
Sample of a typical incoming request:
Post http://example.com\r\nHost: example.com\r\nConnection: close\r\n
Another incoming request might take this form:
GET http://example2.net\r\nContent-Length: 2\r\nConnection: close\r\nHost: example2.net\r\n
I want to change them into these forms:
Post http://example.com\r\nHost: mycustomhostvalue.com\r\nConnection: close\r\n
GET http://example2.net\r\nContent-Length: 2\r\nConnection: close\r\nHost: mycustomhostvalue.com\r\n

Use a regex to replace it, like this:
content = content.replaceAll("Host:\\s*(\\w)*\\.\\w*", "Host: newhost.com")
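This replaces a header such as Host: a.com with Host: newhost.com. The pattern stops after the first pair of dot-separated labels, so a host such as a.co.uk is only partly replaced.
Note: as per the comment by cfqueryparam, you may want to use a regex like this to cover .co.uk and such: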
Host:\\s*.*?(?=\\\\r\\\\n)
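A minimal sketch of the same replacement as a reusable method, assuming the request string contains real carriage-return and line-feed characters rather than the literal text \r\n. The class and method names are hypothetical. It replaces the whole Host line whatever the host looks like, which sidesteps the .co.uk issue.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: hypothetical helper for rewriting the Host header of a raw request. */
class HostHeaderRewriter {

    // (?i) ignores case; (?m) lets ^ match at the start of every line.
    // [^\r\n]* consumes the rest of the header line but not the line break.
    private static final Pattern HOST_LINE = Pattern.compile("(?im)^Host:[^\\r\\n]*");

    static String replaceHost(String request, String newHost) {
        // quoteReplacement keeps '$' or '\' in newHost from being read as group references.
        return HOST_LINE.matcher(request)
                .replaceAll(Matcher.quoteReplacement("Host: " + newHost));
    }

    public static void main(String[] args) {
        String request = "Post http://example.com\r\nHost: example.com\r\nConnection: close\r\n";
        System.out.print(replaceHost(request, "mycustomhostvalue.com"));
    }
}
```

Because the pattern is anchored to the start of a line, the position of the Host header within the request does not matter.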

apache commons-validator alternative for new gTLDS

I need to validate emails and domains. I just need a formal validation, no whois or other forms of domain lookup needed.
Currently I'm using apache's commons-validator v1.4.0
Unfortunately my customers use the new gTLDs, like .bike or .productions that are not yet supported by the DomainValidator class.
See Apache's Jira issue for more details.
Are there any sound alternatives that I may easily include in my Maven POM?

If you are not concerned about internationalized addresses, you could change last part of address, and continue to use Apache commons.
This approach is based on the fact that whatever the TLD is, the validity of the whole domain name is equivalent to the validity of the same domain name with the TLD replaced with com. For example:
abc.def.com is valid. Similarly abc.def.name, abc.def.xx--kput3i, abc.def.uk are valid.
ab,de.com is not valid. Similarly ab,de.name, ab,de.xx-kput3i, ab,de.uk are not valid.
So instead of calling
return EmailValidator.getInstance().isValid(userEmail);
You can call
if (userEmail == null) {
    return false;
}
return EmailValidator.getInstance().isValid(userEmail.trim().replaceFirst("\\.\\p{Alpha}[\\p{Alnum}-]*\\p{Alnum}$", ".com"));
Explanation
The regular expression "\\.\\p{Alpha}[\\p{Alnum}-]*\\p{Alnum}$" checks for the TLD part: it's at the end of the string (because of the $), it starts with a dot and contains no other dot, and it conforms to the standards: begins with an ASCII Alpha character, followed by zero or more alphanumerics or dashes, and ends with an alphanumeric character.
I am using trim() because until now, if you used EmailValidator, it allows spaces before and after the address. Removing the spaces just makes it easier to replace the TLD, and it shouldn't matter as far as the validity of the address is concerned.
If the string doesn't have a valid TLD at the end, String.replaceFirst() will return it as is. It could still be valid, because email addresses of the format x@[n.n.n.n], where n.n.n.n is a valid IP address, are valid. So basically, if you didn't find a TLD, you let EmailValidator decide the validity issue itself.
Of course, if the TLD is not an IANA-recognized TLD, this validation will not tell you that. An e-mail like david@galaxy.hoopie-frood will be accepted as legal, but IANA doesn't have that TLD as yet.
Checking a domain is similar, without the trim() part:
if (userDomain == null) {
    return false;
}
return DomainValidator.getInstance().isValid(userDomain.replaceFirst("\\.\\p{Alpha}[\\p{Alnum}-]*\\p{Alnum}$", ".com"));
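To see what the TLD substitution does on its own, here is a JDK-only sketch that applies the same regular expression without calling commons-validator. The class and method names are hypothetical; the result would then be handed to EmailValidator or DomainValidator as described above.

```java
/** Sketch: isolates the TLD-replacement step; names are hypothetical. */
class TldNormalizer {

    // A dot, an ASCII letter, then letters/digits/dashes, ending in a
    // letter or digit at the very end of the input.
    static final String TLD_REGEX = "\\.\\p{Alpha}[\\p{Alnum}-]*\\p{Alnum}$";

    /** Swaps a trailing TLD for ".com"; input without a TLD-like suffix is returned unchanged. */
    static String withComTld(String address) {
        return address.trim().replaceFirst(TLD_REGEX, ".com");
    }

    public static void main(String[] args) {
        System.out.println(withComTld("user@example.bike"));  // user@example.com
        System.out.println(withComTld("x@[10.0.0.1]"));       // x@[10.0.0.1]
    }
}
```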
I have also tried JavaMail's email address validation, but I don't really like it: it allows completely invalid domain names such as net-name.net- (ending with a dash) or IP addresses (which are not allowed for e-mail without square brackets around them), and it's only good for e-mail addresses, not for domains.
Internationalization
If you need to check for internationalized domains and e-mails, it's a bit different. It's easy to check for internationalized domains (for example 元気。テスト). All you need to do is convert them to ASCII with java.net.IDN.toASCII() (yielding xn--z4qx76d.xn--zckzah for my example domain - this is a valid TLD), and then do the same as I wrote above.
Internationalized e-mails are a different story. If the local part is ASCII, you can convert the domain part to ASCII. If you have to display the email address, you need to use the Unicode version, and if you have to send an email message, you use the ASCII version.
But recently a standard has been introduced for internationalized local parts as well, which also allows sending to the unicode version of the domain name without translating it to ASCII first. Whether you want to support that or not requires some thought, as not many mail servers and mail transfer agents support it at the moment.
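The domain conversion step mentioned above can be sketched as follows. java.net.IDN.toASCII is part of the JDK; the wrapper class and method name are hypothetical. The example domain is written with Unicode escapes so the snippet does not depend on the source file encoding.

```java
import java.net.IDN;

/** Sketch: converts an internationalized domain to its ASCII form before validation. */
class IdnSketch {

    static String toAsciiDomain(String unicodeDomain) {
        // IDN.toASCII also treats the ideographic full stop (U+3002) as a label separator.
        return IDN.toASCII(unicodeDomain);
    }

    public static void main(String[] args) {
        // The example domain from the text above, spelled with escapes.
        String domain = "\u5143\u6C17\u3002\u30C6\u30B9\u30C8";
        System.out.println(toAsciiDomain(domain));
    }
}
```

The ASCII result can then go through the same TLD replacement and validation as any other domain, while the Unicode form is kept for display.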

Copied the implementation from DomainValidator and replaced the TOP_LABEL_REGEX expression with "\\p{Alpha}[\\p{Alnum}-]*\\p{Alpha}".
In addition, I removed validation against the hard coded list of approved gTLDs. This is, basically, quite weak in that it doesn't validate against the actual domains. But I think it's good enough (catches the gTLDs similar to XN--YGBI2AMMX).
See full list of approved gTLDs here.
// Copied from org.apache.commons.validator.routines.DomainValidator
private static final String DOMAIN_LABEL_REGEX = "\\p{Alnum}(?>[\\p{Alnum}-]*\\p{Alnum})*";
// Changed to include new gTLD - http://data.iana.org/TLD/tlds-alpha-by-domain.txt
private static final String TOP_LABEL_REGEX = "\\p{Alpha}[\\p{Alnum}-]*\\p{Alpha}";
// Copied from org.apache.commons.validator.routines.DomainValidator
private static final String DOMAIN_NAME_REGEX = "^(?:" + DOMAIN_LABEL_REGEX + "\\.)+" + "(" + TOP_LABEL_REGEX + ")$";
private static final RegexValidator domainRegex = new RegexValidator(DOMAIN_NAME_REGEX);
private static final EmailValidator EMAIL_VALIDATOR = new EmailValidator();
public static boolean isValidDomain(String domain) {
    String[] groups = domainRegex.match(domain);
    return groups != null && groups.length > 0;
}
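For trying the relaxed expressions without the library on the classpath, the same check can be sketched with java.util.regex alone. This is an approximation under that assumption: the class name is hypothetical and RegexValidator is replaced by a plain Pattern.

```java
import java.util.regex.Pattern;

/** JDK-only sketch; hypothetical class that does not use commons-validator. */
class RelaxedDomainCheck {

    // Same label expressions as in the snippet above.
    private static final String DOMAIN_LABEL_REGEX = "\\p{Alnum}(?>[\\p{Alnum}-]*\\p{Alnum})*";
    private static final String TOP_LABEL_REGEX = "\\p{Alpha}[\\p{Alnum}-]*\\p{Alpha}";

    // One or more "label." groups followed by the relaxed top-level label.
    private static final Pattern DOMAIN_NAME = Pattern.compile(
            "^(?:" + DOMAIN_LABEL_REGEX + "\\.)+(" + TOP_LABEL_REGEX + ")$");

    static boolean isValidDomain(String domain) {
        return domain != null && DOMAIN_NAME.matcher(domain).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidDomain("example.bike")); // true
        System.out.println(isValidDomain("ab,de.com"));    // false
    }
}
```

As the answer says, this only checks the shape of the name; it accepts any well-formed top-level label, whether or not IANA has delegated it.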

What I often do in this situation is check out the source code for the library in question (it's open source, remember?), modify it to suit my requirements, and then contribute the patch back to the project.
Your use case certainly sounds like it would be a useful contribution.

I made you a public suffix list Java API. The method PublicSuffixList.getRegistrableDomain() can be used for Domain validation:
PublicSuffixListFactory factory = new PublicSuffixListFactory();
PublicSuffixList suffixList = factory.build();
assertNull(suffixList.getRegistrableDomain("galaxy.hoopie-frood"));
assertNotNull(suffixList.getRegistrableDomain("example.bike"));

While DomainValidator is missing some of the new TLDs, for me the best solution was to update TLD.
DomainValidator.updateTLDOverride(ArrayType.COUNTRY_CODE_PLUS, new String[]{"someTLD"});
And then create the EmailValidator instance:
EmailValidator.getInstance(false, true)

Java regular expression for file path

I am developing an application where the user needs to supply a local or remote file location, and I have to validate that location.
Below is the requirement to validate the file location.
Path doesn't contain special characters * | " < > ?.
And path like "c:" is also not valid.
Paths like
c:\,
c:\newfolder,
\\casdfhn\share
are valid while
c:
non,
\\casfdhn
are not.
I have implemented the code based on this requirement:
String FILE_LOCATION_PATTERN = "^(?:[\\w]\\:(\\[a-z_\\-\\s0-9\\.]+)*)";
String REMOTE_LOCATION_PATTERN = "\\\\[a-z_\\-\\s0-9\\.]+(\\[a-z_\\-\\s0-9\\.]+)+";
Pattern locationPattern = Pattern.compile(FILE_LOCATION_PATTERN);
Matcher locationMatcher = locationPattern.matcher(iAddress);
if (locationMatcher.matches()) {
    return true;
}
locationPattern = Pattern.compile(REMOTE_LOCATION_PATTERN);
locationMatcher = locationPattern.matcher(iAddress);
return locationMatcher.matches();
Test:
'worklocation' pass
'C:\dsrasr' didn't pass (but should pass)
'C:\saefase\are' didn't pass (but should pass)
'\\asfd\sadfasf' didn't pass (but should pass)
'\\asfdas' didn't pass (and should not pass)
'\\' didn't pass (and should not pass)
'C:' passed (but should not pass)
I tried many regular expression but didn't satisfy the requirement. I am looking for help for this requirement.

The following should work:
([A-Z|a-z]:\\[^*|"<>?\n]*)|(\\\\.*?\\.*)
The lines highlighted in green and red are those that passed. The non-highlighted lines failed.
Bear in mind the regex above is not escaped for java

from your restrictions this seems very simple.
^(C:)?(\\[^\\"|^<>?\\s]*)+$
Starts with C:\ or slash ^(C:)?\\
and can have anything other than those special characters for the rest ([^\\"|^<>?\\s\\\])*
and matches the whole path $
Edit: seems C:/ and / were just examples. to allow anything/anything use this:
^([^\\"|^<>?\\s])*(\\[^\\"|^<>?\\s\\\]*)+$
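Since the expressions above are not escaped for Java, here is a sketch of one way to write the check as Java string literals. It is not taken from either answer: it assumes a local path is a drive letter, a colon and a backslash followed by optional segments, and a remote path is \\server\share with optional further segments. The class name is hypothetical, and only the characters listed in the question (* | " < > ?) are rejected inside a segment.

```java
import java.util.regex.Pattern;

/** Sketch only: hypothetical validator for the rules listed in the question. */
class PathValidator {

    // One path segment: anything except a backslash, * | " < > ? and line breaks.
    private static final String SEG = "[^\\\\*|\"<>?\\r\\n]+";

    // Drive letter, colon, mandatory backslash, then optional
    // backslash-separated segments and an optional trailing backslash.
    private static final Pattern LOCAL = Pattern.compile(
            "[A-Za-z]:\\\\(?:" + SEG + "(?:\\\\" + SEG + ")*\\\\?)?");

    // Two backslashes, server name, backslash, share name, then optional
    // further segments and an optional trailing backslash.
    private static final Pattern REMOTE = Pattern.compile(
            "\\\\\\\\" + SEG + "\\\\" + SEG + "(?:\\\\" + SEG + ")*\\\\?");

    static boolean isValidLocation(String path) {
        return path != null
                && (LOCAL.matcher(path).matches() || REMOTE.matcher(path).matches());
    }

    public static void main(String[] args) {
        System.out.println(isValidLocation("c:\\newfolder"));      // true
        System.out.println(isValidLocation("\\\\casdfhn\\share")); // true
        System.out.println(isValidLocation("c:"));                 // false
    }
}
```

Segments are separated by a mandatory backslash rather than an optional one, which keeps the pattern from backtracking heavily on long invalid input.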
