regex to find email address from a String

My goal is to extract email addresses from a web page. I have the page source and am reading it line by line. I want to get the email address from the current line, which may or may not contain one. I have seen many regex examples, but most of them are for validating an email address. I want to extract the address from the page source, not validate it. It should work the way http://emailx.discoveryvip.com/ does.
Some example input lines are:
1)<p>Send details to neeraj@yopmail.com</p>
2)<p>Interested should send details directly to www.abcdef.com/abcdef/. Should you have any questions, please email neeraj@yopmail.com.
3)Note :- Send your queries at neeraj@yopmail.com for more details call Mr. neeraj 012345678901.
I want to get neeraj@yopmail.com from examples 1, 2 and 3.
I am using Java and I am not good at regex. Please help.

You can validate e-mail address formats according to RFC 2822 with this:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
and here's an explanation from regular-expressions.info:
This regex has two parts: the part before the @, and the part after the @. There are two alternatives for the part before the @: it can either consist of a series of letters, digits and certain symbols, including one or more dots. However, dots may not appear consecutively or at the start or end of the email address. The other alternative requires the part before the @ to be enclosed in double quotes, allowing any string of ASCII characters between the quotes. Whitespace characters, double quotes and backslashes must be escaped with backslashes.
And you can check this out here: Rubular example.

The correct code is
Pattern p = Pattern.compile("\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b",
        Pattern.CASE_INSENSITIVE);
Matcher matcher = p.matcher(input);
Set<String> emails = new HashSet<String>();
// find() moves to the next match on each call, so this collects every address
while (matcher.find()) {
    emails.add(matcher.group());
}
This gives you the set of email addresses found in your long text / HTML input.
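As a self-contained sketch of the same approach: the class name `EmailExtractor` and the method `extract` are made up for illustration, and the TLD length is relaxed from `{2,4}` to `{2,}` on the assumption that longer TLDs such as `.museum` should also match. It collects every address found in one line of page source.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class EmailExtractor {
    // Same shape as the pattern above, with the TLD length relaxed to "2 or more".
    private static final Pattern EMAIL = Pattern.compile(
            "\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,}\\b",
            Pattern.CASE_INSENSITIVE);

    // Returns every distinct address in the line, in order of first appearance.
    static Set<String> extract(String line) {
        Set<String> emails = new LinkedHashSet<>();
        Matcher matcher = EMAIL.matcher(line);
        while (matcher.find()) {
            emails.add(matcher.group());
        }
        return emails;
    }

    public static void main(String[] args) {
        System.out.println(extract("<p>Send details to neeraj@yopmail.com</p>"));
    }
}
```

Calling `extract` once per line while reading the page source, and adding the results to one shared set, would cover the line-by-line scenario from the question.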

You need something like this regex:
".*(\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b).*"
When it matches, you can extract the first group and that will be your email.
String regex = ".*(\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b).*";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("your text here");
if (m.matches()) {
    String email = m.group(1);
    // do something with your email
}
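One caveat worth checking, sketched below under the assumption that a line can contain several addresses: because the leading `.*` is greedy, `matches()` with this pattern captures only the last address on the line, whereas a `find()` loop without the surrounding `.*` sees all of them. The class name `MatchesVsFind` and the sample text are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison of the two matching styles.
final class MatchesVsFind {
    private static final String EMAIL = "\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b";

    // Whole-line match: the greedy leading .* leaves only the last address for group 1.
    static String lastEmail(String line) {
        Matcher m = Pattern.compile(".*(" + EMAIL + ").*", Pattern.CASE_INSENSITIVE).matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    // Scanning match: find() visits each address in turn.
    static List<String> allEmails(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(EMAIL, Pattern.CASE_INSENSITIVE).matcher(line);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "write to a@b.com or c@d.org";
        System.out.println(lastEmail(line));
        System.out.println(allEmails(line));
    }
}
```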

This is a simple way to extract all emails from an input String using Android's Patterns.EMAIL_ADDRESS (android.util.Patterns):
public static List<String> getEmails(@NonNull String input) {
    List<String> emails = new ArrayList<>();
    Matcher matcher = Patterns.EMAIL_ADDRESS.matcher(input);
    while (matcher.find()) {
        int matchStart = matcher.start(0);
        int matchEnd = matcher.end(0);
        emails.add(input.substring(matchStart, matchEnd));
    }
    return emails;
}

Related

How to select specific fragment from the whole text?

After registering on the site, I receive the credentials by mail in the format:
some text /
login: example@mail.com /
password: example123 /
some text
I need to select and copy exactly the login and password, without the surrounding text. All the text is located in one table. I have no idea how to do this and would be very grateful for any suggestions.

You could either split the string (quick and dirty):
String input = "some text / login: example@mail.com / password: example123 / some text";
// iterate over lines if necessary, or join them first with String.join("\n", lines)
// split(...) returns an array: [1] is the text after the label, and the next [0] is its first word
String username = input.split("login: ")[1].split(" ")[0];
String password = input.split("password: ")[1].split(" ")[0];
or use a regex and pattern match:
String input = "some text / login: example@mail.com / password: example123 / some text";
String emailRegex = "login: (\\S+@\\S+\\.\\S+)";
Pattern pattern = Pattern.compile(emailRegex, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
    String loginEmail = matcher.group(1); // just the address, without the "login: " label
    System.out.println(loginEmail);
}
Do the same for the password.
If you want something that's scalable and easier to manage, use the regex. If you're not concerned about performance, I think the first option is relatively straightforward, but if the email format changes significantly you will need to rework it. There are probably many other ways that others can suggest.
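A possible way to pull both values out in one pass is sketched below. It assumes the mail keeps the `login: <value> / password: <value>` layout from the question and that neither value contains whitespace; `CredentialsParser` and `parse` are made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for the mail layout shown in the question.
final class CredentialsParser {
    // \s* between the parts also tolerates line breaks around the "/" separators.
    private static final Pattern CREDENTIALS = Pattern.compile(
            "login:\\s*(?<login>\\S+)\\s*/\\s*password:\\s*(?<password>\\S+)");

    // Returns {login, password}, or null when the layout is not found.
    static String[] parse(String mailBody) {
        Matcher m = CREDENTIALS.matcher(mailBody);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group("login"), m.group("password") };
    }

    public static void main(String[] args) {
        String[] c = parse("some text / login: example@mail.com / password: example123 / some text");
        System.out.println(c[0] + " " + c[1]);
    }
}
```

Named groups keep the call site readable if more fields are added to the mail later.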

Regex Redirect URL excludes token

I'm trying to create a redirect URL for my client. We have a service in which you specify "fromUrl" -> "toUrl" and which uses a Java regex Matcher. But I can't get it to include the token when it converts the URL. For example:
/fromurl/login?token=7c8Q8grW5f2Kz7RP1%2FWsqpVB%2FEluVOGfXQdW4I0v82siR2Ism1D8VCvEmKJr%2BKhHhicwPey0uIiTxN049Be8TNsypf
Should be:
/tourl/login?token=7c8Q8grW5f2Kz7RP1%2FWsqpVB%2FEluVOGfXQdW4I0v82siR2Ism1D8VCvEmKJr%2BKhHhicwPey0uIiTxN049Be8TNsypf
but it excludes the token so the result I get is:
/fromurl/login/
/tourl/login/
I tried various regex patterns like " ?.* and [%5E//?]+)/([^/?]+)/(?.*)?$ and (/*) etc", but none of them seems to work.
I'm not that familiar with regex. How can I solve this?

This can be easily done using simple string replace but if you insist on using regular expressions:
Pattern p = Pattern.compile("fromurl");
String originalUrlAsString = "/fromurl/login?token=7c8Q8grW5f2Kz7RP1%2FWsqpVB%2FEluVOGfXQdW4I0v82siR2Ism1D8VCvEmKJr%2BKhHhicwPey0uIiTxN049Be8TNsypf ";
String newRedirectedUrlAsString = p.matcher(originalUrlAsString).replaceAll("tourl");
System.out.println(newRedirectedUrlAsString);

If I understand you correctly you need something like this?
String from = "/my/old/url/login?token=7c8Q8grW5f2Kz7RP1%2FWsqpVB%2FEluVOGfXQdW4I0v82siR2Ism1D8VCvEmKJr%2BKhHhicwPey0uIiTxN049Be8TNsypf";
String to = from.replaceAll("\\/(.*)\\/", "/my/new/url/");
System.out.println(to); // /my/new/url/login?token=7c8Q8grW5f2Kz7RP1%2FWsqpVB%2FEluVOGfXQdW4I0v82siR2Ism1D8VCvEmKJr%2BKhHhicwPey0uIiTxN049Be8TNsypf
This will replace everything between the first and the last forward slash.

Can you describe more precisely what the original URL looks like? The regular expression depends on it.
Assuming that only the first occurrence of fromurl needs to be replaced, the following code is enough:
String from = "/fromurl/login?token=7c8Q8grW5f2Kz7RP1%2FWsqpVB%2FEluVOGfXQdW4I0v82siR2Ism1D8VCvEmKJr%2BKhHhicwPey0uIiTxN049Be8TNsypf";
String to = from.replaceFirst("fromurl", "tourl");
But if it is necessary to use more complex rules to determine the substring to replace, you can use:
String from = "/fromurl/login?token=7c8Q8grW5f2Kz7RP1%2FWsqpVB%2FEluVOGfXQdW4I0v82siR2Ism1D8VCvEmKJr%2BKhHhicwPey0uIiTxN049Be8TNsypf";
String to = "";
String regularExpresion = "(<<pre>>)(fromurl)(<<pos>>)";
Pattern pattern = Pattern.compile(regularExpresion);
Matcher matcher = pattern.matcher(from);
if (matcher.matches()) {
to = from.replaceAll(regularExpresion, "$1tourl$3");
}
NOTE: the <<pre>> and <<pos>> placeholders stand for the parts of the pattern before and after fromurl; I don't know the real shape of the URL, so fill them in accordingly.
NOTE 2: $1 and $3 refer to the first and the third group.
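For the URL in the question, the placeholders could be filled in roughly as follows. This is a sketch that assumes the path always starts with `/fromurl` and that everything after it, query string included, should be carried over unchanged; `RedirectRewriter` and `rewrite` are made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical rewrite rule for the example URL.
final class RedirectRewriter {
    // Group 1 captures everything after the leading "/fromurl" segment, query string included.
    private static final Pattern FROM = Pattern.compile("^/fromurl((?:[/?].*)?)$");

    static String rewrite(String url) {
        Matcher m = FROM.matcher(url);
        // Leave URLs that do not start with the /fromurl segment untouched.
        if (!m.matches()) {
            return url;
        }
        return "/tourl" + m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(rewrite("/fromurl/login?token=abc%2Fdef"));
    }
}
```

Concatenating `m.group(1)` instead of passing it through a replacement string sidesteps any special handling of `$` or `\` characters that a token might contain.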

Although the existing answers should solve the issue, and some are similar, the solution below may also help. It uses a fairly simple regex (assuming your input always has the same format as your example):
private static String replaceUrl(String inputUrl){
String regex = "/.*(/login\\?token=.*)";
String toUrl = "/tourl";
Pattern p = Pattern.compile(regex);
Matcher matcher = p.matcher(inputUrl);
if (matcher.find()) {
return toUrl + matcher.group(1);
} else
return null;
}
You can write a test to check whether it works for other expected inputs/outputs if you want to change the format and adjust the regex:
String inputUrl = "/fromurl/login?token=7c8Q8grW5f2Kz7RP1%2FWsqpVB%2FEluVOGfXQdW4I0v82siR2Ism1D8VCvEmKJr%2BKhHhicwPey0uIiTxN049Be8TNsypf";
String expectedUrl = "/tourl/login?token=7c8Q8grW5f2Kz7RP1%2FWsqpVB%2FEluVOGfXQdW4I0v82siR2Ism1D8VCvEmKJr%2BKhHhicwPey0uIiTxN049Be8TNsypf";
if (expectedUrl.equals(replaceUrl(inputUrl))){
System.out.println("Success");
}

Regex: how to extract a JSESSIONID cookie value from cookie string?

I might receive the following cookie string.
hello=world;JSESSIONID=sdsfsf;Path=/ei
I need to extract the value of JSESSIONID
I use the following pattern but it doesn't seem to work. However https://regex101.com shows it's correct.
Pattern PATTERN_JSESSIONID = Pattern.compile(".*JSESSIONID=(?<target>[^;\\n]*)");

You can reach your goal with a simpler regex: (^|;)JSESSIONID=([^;]*). Here is the demo on Regex101 (you forgot to link your own regular expression using the save button). Take a look at the following code. You have to extract the matched value using the Matcher class; the value is in the second capturing group:
String cookie = "hello=world;JSESSIONID=sdsfsf;Path=/ei";
Pattern PATTERN_JSESSIONID = Pattern.compile("(^|;)JSESSIONID=([^;]*)");
Matcher m = PATTERN_JSESSIONID.matcher(cookie);
if (m.find()) {
    System.out.println(m.group(2)); // group(0) would be the whole match, including ";JSESSIONID="
}
Output value:
sdsfsf
Of course, the result depends on all the possible variations of the input text. The snippet above works whenever the value sits between JSESSIONID= and the next ; character or the end of the string.

You can try the regex below:
JSESSIONID=([^;]+)
String cookies = "hello=world;JSESSIONID=sdsfsf;Path=/ei;submit=true";
Pattern pat = Pattern.compile("\\bJSESSIONID=([^;]+)");
Matcher matcher = pat.matcher(cookies);
boolean found = matcher.find();
System.out.println("Session ID: " + (found ? matcher.group(1) : "not found"));

You can also get what you are aiming for by splitting the string and replacing the prefix. The version below works for me:
String s = "hello=world;JSESSIONID=sdsfsf;Path=/ei";
for (String part : s.split(";")) {
    // look for the entry by its name, not by its full "name=value" text
    if (part.startsWith("JSESSIONID=")) {
        System.out.println(part.replace("JSESSIONID=", ""));
    }
}
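If more than one cookie value is needed, the splitting idea can be extended into a small name-to-value map. The sketch below assumes a simple `name=value;name=value` string without quoted values; `CookieParser` and `parse` are made-up names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; real cookie headers can be more complex than this handles.
final class CookieParser {
    // Splits "name=value;name=value" into a map; entries without '=' are skipped.
    static Map<String, String> parse(String cookieString) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String part : cookieString.split(";")) {
            int eq = part.indexOf('=');
            if (eq > 0) {
                cookies.put(part.substring(0, eq).trim(), part.substring(eq + 1).trim());
            }
        }
        return cookies;
    }

    public static void main(String[] args) {
        System.out.println(parse("hello=world;JSESSIONID=sdsfsf;Path=/ei").get("JSESSIONID"));
    }
}
```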

How to extract word from string?

Suppose I have a string:
String message = "you should try http://google.com/";
Now, I want to send "http://google.com/" to a new String url.
What I want to do is check if a "word" in the string begins with "http://" and extract that word, where a word is something that's surrounded by spaces (the general English definition of a word).
I have no idea how to extract the string, and the best I can do is use startsWith on the whole string. How do I use startsWith on a word, and extract the word?
Sorry if this is a little bit difficult to explain.
Thanks in advance!
EDIT: Also, what should I do to extract the word from the REGEX operation? And how should I handle it if there is more than 1 url in the string?

Use Pattern & Matcher classes.
String str = "blabla http://www.mywebsite.com blabla";
String regex = "((https?:\\/\\/)?(www.)?(([a-zA-Z0-9-]){2,}\\.){1,4}([a-zA-Z]){2,6}(\\/([a-zA-Z-_/.0-9#:+?%=&;,]*)?)?)";
Matcher m = Pattern.compile(regex).matcher(str);
if (m.find()) {
String url = m.group(); //value "http://www.mywebsite.com"
}
This regex will work for http://..., https://... and even www... URLs. Other regexes can easily be found on the net.

You can try this:
String str = "blabla http://www.mywebsite.com blabla";
Matcher m = Pattern.compile("(http://.*)").matcher(str);
if (m.find()) {
String url = (new StringTokenizer(m.group(), " ")).nextToken();
}

The "correct" way to perform this task is to split the String by whitespace -- String#split("\\s+") -- and then pass each piece to the URL constructor. If a piece starts with your prefix and a MalformedURLException is thrown, it is invalid. The URL class constructor is far better tested and more robust than any solution that you or I could come up with. So please use it, and don't reinvent the wheel.
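A minimal sketch of that split-and-validate idea follows. Note one substitution: it validates through `new URI(word).toURL()` rather than the `URL(String)` constructor, because recent JDKs deprecate that constructor. `UrlWords` and `extract` are made-up names, and the method returns every matching word, which also covers the case of several URLs in one string.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
final class UrlWords {
    // Returns every whitespace-separated word that starts with the prefix and parses as a URL.
    static List<String> extract(String message, String prefix) {
        List<String> urls = new ArrayList<>();
        for (String word : message.trim().split("\\s+")) {
            if (!word.startsWith(prefix)) {
                continue;
            }
            try {
                new URI(word).toURL(); // throws if the word is not a well-formed URL
                urls.add(word);
            } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
                // Not a well-formed URL: skip this word.
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(extract("you should try http://google.com/", "http://"));
    }
}
```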

You can use Java regex for this.
The following regex catches any string starting with http:// or https:// up to the next whitespace character:
Pattern urlPattern = Pattern.compile("(https?://\\S*)");
Matcher matcher = urlPattern.matcher(myString);
if (matcher.find()) {
    String url = matcher.group();
}

How to remove dot (.) character using a regex for email addresses of type "abcd.efgh@xyz.com" in java?

I was trying to write a regex to detect email addresses of the type 'abc@xyz.com' in Java. I came up with a simple pattern.
String line = // my line containing email address
Pattern myPattern = Pattern.compile("()(\\w+)( *)@( *)(\\w+)\\.com");
Matcher myMatcher = myPattern.matcher(line);
This will however also detect email addresses of the type 'abcd.efgh@xyz.com'.
I went through http://www.regular-expressions.info/ and links on this site like
How to match only strings that do not contain a dot (using regular expressions)
Java RegEx meta character (.) and ordinary dot?
So I changed my pattern to the following to avoid detecting 'efgh@xyz.com'
Pattern myPattern = Pattern.compile("([^\\.])(\\w+)( *)@( *)(\\w+)\\.com");
Matcher myMatcher = myPattern.matcher(line);
String mailid = myMatcher.group(2) + "@" + myMatcher.group(5) + ".com";
If String 'line' contained the address 'abcd.efgh@xyz.com', my String mailid will come back with 'fgh@xyz.com'. Why does this happen? How do I write the regex to detect only 'abc@xyz.com' and not 'abcd.efgh@xyz.com'?
Also, how do I write a single regex to detect email addresses like 'abc@xyz.com' and 'efg at xyz.com' and 'abc (at) xyz (dot) com' from strings? Basically, how would I implement OR logic in regex for doing something like check for @ OR at OR (at)?
After some comments below I tried the following expression to get the part before the @ squared away.
Pattern myPattern = Pattern.compile("((([\\w]+\\.)+[\\w]+)|([\\w]+))@(\\w+)\\.com");
Matcher myMatcher = myPattern.matcher(line);
What will the myMatcher groups be? How are these groups numbered when we have nested brackets?
System.out.println(myMatcher.group(1));
System.out.println(myMatcher.group(2));
System.out.println(myMatcher.group(3));
System.out.println(myMatcher.group(4));
System.out.println(myMatcher.group(5));
the output was like
abcd.efgh
abcd.efgh
abcd.
null
xyz
for abcd.efgh@xyz.com
abc
null
null
abc
xyz
for abc@xyz.com
Thanks.

You can use the | operator in your regex to match @ OR at OR (at): @|at|\(at\).
You can avoid matching addresses with a dot before the @ by anchoring the pattern with ^ at the beginning and matching against the whole string:
Try this:
Pattern myPattern = Pattern.compile("^(\\w+)\\s*(@|at|\\(at\\))\\s*(\\w+)\\.(\\w+)");
Matcher myMatcher = myPattern.matcher(line);
if (myMatcher.matches())
{
    String mail = myMatcher.group(1) + "@" + myMatcher.group(3) + "." + myMatcher.group(4);
    System.out.println(mail);
}
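The same alternation idea can be stretched to cover the '(dot)' form from the question as well. The sketch below is meant for a candidate string that holds only the address, since in free text the bare word "at" would produce false positives; `ObfuscatedEmail` and `normalize` are made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical normalizer for the three spellings listed in the question.
final class ObfuscatedEmail {
    // Separators: "@", "(at)" or the bare word "at"; then ".", "(dot)" or the bare word "dot".
    private static final Pattern ADDRESS = Pattern.compile(
            "(\\w+)\\s*(?:@|\\(at\\)|\\bat\\b)\\s*(\\w+)\\s*(?:\\.|\\(dot\\)|\\bdot\\b)\\s*(\\w+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the address in user@domain.tld form, or null if the candidate does not fit.
    static String normalize(String candidate) {
        Matcher m = ADDRESS.matcher(candidate.trim());
        if (!m.matches()) {
            return null;
        }
        return m.group(1) + "@" + m.group(2) + "." + m.group(3);
    }

    public static void main(String[] args) {
        System.out.println(normalize("abc (at) xyz (dot) com"));
    }
}
```

Because `matches()` requires the whole candidate to fit, an address with a dot before the separator is rejected rather than truncated.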

Your first pattern needs to combine the two requirements, word characters and no dots, into one character class; you currently have them in separate groups. It should be:
[^\\.\\W]+
This is 'not dots' and 'not non-word characters'.
So you have:
Pattern myPattern = Pattern.compile("([^\\.\\W]+)( *)@( *)(\\w+)\\.com");
To answer your second question, you can use OR in a regex with the | character:
(@|at)
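On the nested-brackets part of the question: capturing groups are numbered by the position of their opening parenthesis, counted from left to right, and a group that did not take part in the match returns null. The sketch below reuses the nested pattern from the question to show this; `GroupNumbering` and `groups` are made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of how nested groups are numbered.
final class GroupNumbering {
    // Groups, numbered by opening parenthesis:
    //   1: whole part before the @   ((([\w]+\.)+[\w]+)|([\w]+))
    //   2: dotted alternative        (([\w]+\.)+[\w]+)
    //   3: last "word." repetition   ([\w]+\.)
    //   4: plain alternative         ([\w]+)
    //   5: domain name               (\w+)
    private static final Pattern P =
            Pattern.compile("((([\\w]+\\.)+[\\w]+)|([\\w]+))@(\\w+)\\.com");

    // Returns groups 1..5 of the first match, or null if there is no match.
    static String[] groups(String line) {
        Matcher m = P.matcher(line);
        if (!m.find()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(groups("abcd.efgh@xyz.com")));
    }
}
```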
