Regular expression keyword match in URL

Regular expression keyword match in URL - java

I have a list of URL in a large file (20 mb), and I have a set of keywords. If the set of keywords matches the url then I want to extract the URL.
Example:keyword="contact"
URL:http://www.365media.com/offices-and-contact.html
I need a regular expression to match the keywords with my list of URLs.
My Java code:
public class FileRead {
public static void main(String[] ags) throws FileNotFoundException
{
Scanner in=new Scanner(new File("D:\\Log\\Links.txt"));
String input;
String[] reg=new String[]{".*About.*",".*Available.*",".*Author.*",".*Blog.*",".*Business.*",
".*Career.*",".*category.*",".*City.*",".*Company.*",".*Contain.*",".*Contact.*",".*Download.*",
".*Email.*"};
while(in.hasNext())
{
input=in.nextLine();
//for(String s:reg)
patternFind(input,".*email.*");
}
}
public static void patternFind(String input,String reg)
{
Pattern p=Pattern.compile(reg);
Matcher m=p.matcher(input);
while(m.find())
System.out.println(m.group());
}
}

If you only want to match for the existence of any Keyword in the current line, you can simply use
for (String s: reg) {
if (input.contains(s)) {
// do something
}
}
instead of
patternFind(input,".email.");
Anyways, a regular expression equivalent to match any of the words would be:
.*(About|Available|Author|And|So|On...).*
I'm not sure which one is faster. String.contains() is simpler, a Pattern is precompiled which could perform better when applied many times, as it is the case here.

Why you can't do this:
For all line (URLs) in the file check if some of your pattern works on the URL
the code is pretty obvious

I'm going to give a bit general solution. I think you should be able to adapt the idea to your code.
Supposed you have a list of bare keywords in a file and you read it into a String[], or you hard-code the list of keywords in a String[], for example:
String keywords[] = {"about", "available", "email"};
For all the keywords, use Pattern.quote() to make sure they are recognized as literal string. Then concatenate the keywords with bar character | as separator (OR), and surround everything with parentheses (). The end result will be like this. Alternatively, you can look at the keywords yourself and write the regex without the quoting \Q and \E. You can also just ignore the Pattern.quote() step, if you are sure that the keywords do not contain regex.
(\Qabout\E|\Qavailable\E|\Qemail\E)
Add .* to 2 ends to make it matches the rest of the URL, plus (?i) at the beginning to enable case-insensitive match.
(?i).*(\Qabout\E|\Qavailable\E|\Qemail\E).*
Then you can compile the Pattern and call matcher(inputString).matches() on each line of input to check whether the URL has the keyword.
If the keyword is too common in a URL, such as "com", "net", "www", and you want to make the search more fine grain, more tweaking must be done.

Related

Escape special characters using Regex in java [duplicate]

Does Java have a built-in way to escape arbitrary text so that it can be included in a regular expression? For example, if my users enter "$5", I'd like to match that exactly rather than a "5" after the end of input.

Since Java 1.5, yes:
Pattern.quote("$5");

Difference between Pattern.quote and Matcher.quoteReplacement was not clear to me before I saw following example
s.replaceFirst(Pattern.quote("text to replace"),
Matcher.quoteReplacement("replacement text"));

It may be too late to respond, but you can also use Pattern.LITERAL, which would ignore all special characters while formatting:
Pattern.compile(textToFormat, Pattern.LITERAL);

I think what you're after is \Q$5\E. Also see Pattern.quote(s) introduced in Java5.
See Pattern javadoc for details.

First off, if
you use replaceAll()
you DON'T use Matcher.quoteReplacement()
the text to be substituted in includes a $1
it won't put a 1 at the end. It will look at the search regex for the first matching group and sub THAT in. That's what $1, $2 or $3 means in the replacement text: matching groups from the search pattern.
I frequently plug long strings of text into .properties files, then generate email subjects and bodies from those. Indeed, this appears to be the default way to do i18n in Spring Framework. I put XML tags, as placeholders, into the strings and I use replaceAll() to replace the XML tags with the values at runtime.
I ran into an issue where a user input a dollars-and-cents figure, with a dollar sign. replaceAll() choked on it, with the following showing up in a stracktrace:
java.lang.IndexOutOfBoundsException: No group 3
at java.util.regex.Matcher.start(Matcher.java:374)
at java.util.regex.Matcher.appendReplacement(Matcher.java:748)
at java.util.regex.Matcher.replaceAll(Matcher.java:823)
at java.lang.String.replaceAll(String.java:2201)
In this case, the user had entered "$3" somewhere in their input and replaceAll() went looking in the search regex for the third matching group, didn't find one, and puked.
Given:
// "msg" is a string from a .properties file, containing "<userInput />" among other tags
// "userInput" is a String containing the user's input
replacing
msg = msg.replaceAll("<userInput \\/>", userInput);
with
msg = msg.replaceAll("<userInput \\/>", Matcher.quoteReplacement(userInput));
solved the problem. The user could put in any kind of characters, including dollar signs, without issue. It behaved exactly the way you would expect.

To have protected pattern you may replace all symbols with "\\\\", except digits and letters. And after that you can put in that protected pattern your special symbols to make this pattern working not like stupid quoted text, but really like a patten, but your own. Without user special symbols.
public class Test {
public static void main(String[] args) {
String str = "y z (111)";
String p1 = "x x (111)";
String p2 = ".* .* \\(111\\)";
p1 = escapeRE(p1);
p1 = p1.replace("x", ".*");
System.out.println( p1 + "-->" + str.matches(p1) );
//.*\ .*\ \(111\)-->true
System.out.println( p2 + "-->" + str.matches(p2) );
//.* .* \(111\)-->true
}
public static String escapeRE(String str) {
//Pattern escaper = Pattern.compile("([^a-zA-z0-9])");
//return escaper.matcher(str).replaceAll("\\\\$1");
return str.replaceAll("([^a-zA-Z0-9])", "\\\\$1");
}
}

Pattern.quote("blabla") works nicely.
The Pattern.quote() works nicely. It encloses the sentence with the characters "\Q" and "\E", and if it does escape "\Q" and "\E".
However, if you need to do a real regular expression escaping(or custom escaping), you can use this code:
String someText = "Some/s/wText*/,**";
System.out.println(someText.replaceAll("[-\\[\\]{}()*+?.,\\\\\\\\^$|#\\\\s]", "\\\\$0"));
This method returns: Some/\s/wText*/\,**
Code for example and tests:
String someText = "Some\\E/s/wText*/,**";
System.out.println("Pattern.quote: "+ Pattern.quote(someText));
System.out.println("Full escape: "+someText.replaceAll("[-\\[\\]{}()*+?.,\\\\\\\\^$|#\\\\s]", "\\\\$0"));

^(Negation) symbol is used to match something that is not in the character group.
This is the link to Regular Expressions
Here is the image info about negation:

Implementing a Negative Lookahead in Regex to exclude a block of code if it contains a certain string

This is a follow up to an original question I posted here, but I would appreciate help in expanding its capabilities a bit. I have the following string I am trying to capture from (let's call it output):
ltm pool TEST_POOL {
Some strings
above headers
records {
baz:1 {
ANY STRING
HERE
session-status enabled
}
foobar:23 {
ALSO ANY
STRING HERE
session-status enabled
}
}
members {
qux:45 {
ALSO ANY
STRINGS HERE
session-status enabled
}
bash:2 {
AND ANY
STRING HERE
session-status user-disabled
}
topaz:789 {
AND ANY
STRING HERE
session-status enabled
}
}
Some strings
below headers
}
Consider each line of output to be separated by a typical line break. For the sake of this question, let's refer to records and members as "titles" and baz, foobar, qux, bash, and topaz as "headers". I am trying to formulate a regex in Java that will capture all headers between the brackets of a given title EXCEPT those that contain the string session-status user-disabled between their own header brackets as can be seen above. For example, given we want to find all headers of title members with this code:
String regex = "(?:\\bmembers\\s*\\{|(?<!^)\\G[^{]+\\{[^}]+\\})\\s*?\\n\\s*([^:{}]+)(?=:\\d)";
final Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(output);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
The output should be just ...
qux
topaz
Thus, it should exclude the bash header because it has session-status user-disabled in between its brackets. I'm having trouble implementing a negative lookahead in the regex I'm using to accomplish this. In addition, baz and foobar should also not match because they are contained within the brackets of a different "title" all together. There can be any number of titles and any number of headers. Some help in modifying my regex to include a negative lookahead to solve this problem would be much appreciated.

I built off of your previous expression and added an alternation that will attempt to match any "header" using a non-capturing group if it contains the string session-status user-disabled. In doing so, those "headers" will be negated because they aren't captured. Only titles of "headers" that contain the string session-status enabled will be matched.
Example Here
(?:\bmembers\s*\{|(?<!^)\G)\s*?\n\s*(?:(?:[^{]*\{[^}]*?session-status user-disabled[^}]*\})|([^:{}]+)(?=:\d)[^{]*\{[^}]*\})

BASIC Lexer with regex written in Java

I have to code a Lexer in Java for a dialect of BASIC.
I group all the TokenType in Enum
public enum TokenType {
INT("-?[0-9]+"),
BOOLEAN("(TRUE|FALSE)"),
PLUS("\\+"),
MINUS("\\-"),
//others.....
}
The name is the TokenType name and into the brackets there is the regex that I use to match the Type.
If i want to match the INT type i use "-?[0-9]+".
But now i have a problem. I put into a StringBuffer all the regex of the TokenType with this:
private String pattern() {
StringBuffer tokenPatternsBuffer = new StringBuffer();
for(TokenType token : TokenType.values())
tokenPatternsBuffer.append("|(?<" + token.name() + ">" + token.getPattern() + ")");
String tokenPatternsString = tokenPatternsBuffer.toString().substring(1);
return tokenPatternsString;
}
So it returns a String like:
(?<INT>-?[0-9]+)|(?<BOOLEAN>(TRUE|FALSE))|(?<PLUS>\+)|(?<MINUS>\-)|(?<PRINT>PRINT)....
Now i use this string to create a Pattern
Pattern pattern = Pattern.compile(STRING);
Then I create a Matcher
Matcher match = pattern.match("line of code");
Now i want to match all the TokenType and group them into an ArrayList of Token. If the code syntax is correct it returns an ArrayList of Token (Token name, value).
But i don't know how to exit the while-loop if the syntax is incorrect and then Print an Error.
This is a piece of code used to create the ArrayList of Token.
private void lex() {
ArrayList<Token> tokens = new ArrayList<Token>();
int tokenSize = TokenType.values().length;
int counter = 0;
//Iterate over the arrayLinee (ArrayList of String) to get matches of pattern
for(String linea : arrayLinee) {
counter = 0;
Matcher match = pattern.matcher(linea);
while(match.find()) {
System.out.println(match.group(1));
counter = 0;
for(TokenType token : TokenType.values()) {
counter++;
if(match.group(token.name()) != null) {
tokens.add(new Token(token , match.group(token.name())));
counter = 0;
continue;
}
}
if(counter==tokenSize) {
System.out.println("Syntax Error in line : " + linea);
break;
}
}
tokenList.add("EOL");
}
}
The code doesn't break if the for-loop iterate over all TokenType and doesn't match any regex of TokenType. How can I return an Error if the Syntax isn't correct?
Or do you know where I can find information on developing a lexer?

All you need to do is add an extra "INVALID" token at the end of your enum type with a regex like ".+" (match everything). Because the regexs are evaluated in order, it will only match if no other token was found. You then check to see if the last token in your list was the INVALID token.

If you are working in Java, I recommend trying out ANTLR 4 for creating your lexer. The grammar syntax is much cleaner than regular expressions, and the lexer generated from your grammar will automatically support reporting syntax errors.

If you are writing a full lexer, I'd recommend use an existing grammar builder. Antlr is one solution but I personally recommend parboiled instead, which allows to write grammars in pure Java.

Not sure if this was answered, or you came to an answer, but a lexer is broken into two distinct phases, the scanning phase and the parsing phase. You can combine them into one single pass (regex matching) but you'll find that a single pass lexer has weaknesses if you need to do anything more than the most basic of string translations.
In the scanning phase you're breaking the character sequence apart based on specific tokens that you've specified. What you should have done was included an example of the text you were trying to parse. But Wiki has a great example of a simple text lexer that turns a sentence into tokens (eg. str.split(' ')). So with the scanner you're going to tokenize the block of text into chunks by spaces(this should be the first action almost always) and then you're going to tokenize even further based on other tokens, such as what you're attempting to match.
Then the parsing/evaluation phase will iterate over each token and decide what to do with each token depending on the business logic, syntax rules etc., whatever you set it. This could be expressing some sort of math function to perform (eg. max(3,2)), or a more common example is for query language building. You might make a web app that has a specific query language (SOLR comes to mind, as well as any SQL/NoSQL DB) that is translated into another language to make requests against a datasource. Lexers are commonly used in IDE's for code hinting and auto-completion as well.
This isn't a code-based answer, but it's an answer that should give you an idea on how to tackle the problem.

Pattern syntax error

The following regex works in the find dialog of Eclipse but throws an exception in Java.
I can't find why
(?<=(00|\\+))?[\\d]{1}[\\d]*
The syntax error is at runtime when executing:
Pattern.compile("(?<=(00|\\+))?[\\d]{1}[\\d]*")
In the find I used
(?<=(00|\+))?[\d]{1}[\d]*
I want to match phone numbers with or without the + or 00. But that is not the point because I get a Syntax error at position 13. I don't get the error if I get rid of the second "?"
Pattern.compile("(?<=(00|\\+))[\\d]{1}[\\d]*")
Please consider that instead of 1 sometime I need to use a greater number and anyway the question is about the syntax error

If your data looks like 00ddddd or +ddddd where d is digit you want to get #Bergi's regex (?<=00|\\+)\\d+ will do the trick. But if your data sometimes don't have any part that you want to ignore like ddddd then you probably should use group mechanism like
String[] data={"+123456","00123456","123456"};
Pattern p=Pattern.compile("(?:00|\\+)?(\\d+)");
Matcher m=null;
for (String s:data){
m=p.matcher(s);
if(m.find())
System.out.println(m.group(1));
}
output
123456
123456
123456

Here is an example that works for me:
public static void main(String[] args) {
Pattern pattern = Pattern.compile("(?<=00|\\+)(\\d+)");
Matcher matcher = pattern.matcher("+1123456");
if (matcher.find()) {
System.out.println(matcher.group(1));
}
}

You might shorten your regex a lot. The character classes are not needed when there is only one class inside - just use \d. And {1} is quite useless as well. Also, you can use + for matching "one or more" (it's short for {1,}). Next the additional grouping in your lookbehind should not be needed.
And last, why is that lookbehind optional (with ?)? Just leave it away if you don't need it. This might even be the source of your pattern syntax error - a lookaround must not be optional.
Try this:
/(?<=00|\+)\d+/
Java:
"(?<=00|\\+)\\d+"

Java regular expression for extracting the data between tags

I am trying to a regular expression which extracs the data from a string like
<B Att="text">Test</B><C>Test1</C>
The extracted output needs to be Test and Test1. This is what I have done till now:
public class HelloWorld {
public static void main(String[] args)
{
String s = "<B>Test</B>";
String reg = "<.*?>(.*)<\\/.*?>";
Pattern p = Pattern.compile(reg);
Matcher m = p.matcher(s);
while(m.find())
{
String s1 = m.group();
System.out.println(s1);
}
}
}
But this is producing the result <B>Test</B>. Can anybody point out what I am doing wrong?

Three problems:
Your test string is incorrect.
You need a non-greedy modifier in the group.
You need to specify which group you want (group 1).
Try this:
String s = "<B Att=\"text\">Test</B><C>Test1</C>"; // <-- Fix 1
String reg = "<.*?>(.*?)</.*?>"; // <-- Fix 2
// ...
String s1 = m.group(1); // <-- Fix 3
You also don't need to escape a forward slash, so I removed that.
See it running on ideone.
(Also, don't use regular expressions to parse HTML - use an HTML parser.)

If u are using eclipse there is nice plugin that will help you check your regular expression without writing any class to check it.
Here is link:
http://regex-util.sourceforge.net/update/
You will need to show view by choosing Window -> Show View -> Other, and than Regex Util
I hope it will help you fighting with regular expressions

It almost looks like you're trying to use regex on XML and/or HTML. I'd suggest not using regex and instead creating a parser or lexer to handle this type of arrangement.

I think the bestway to handle and get value of XML nodes is just treating it as an XML.
If you really want to stick to regex try:
<B[^>]*>(.+?)</B\s*>
understanding that you will get always the value of B tag.
Or if you want the value of any tag you will be using something like:
<.*?>(.*?)</.*?>

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.