I have a scenario where I want to identify a mail signature in a large text and remove it. The signature looks like this:
name | some text | some text | some text E-mail:abc#xyz.com
It appears within a paragraph. Note that the number of pipe delimiters may be three or more, but the signature always ends with the E-mail field.
I need Java code to locate these portions using a regex and then remove them. Any pointers would help.
Thanks in advance.
Just to add: the signature pattern above may occur one or more times in a large text. Also, the text between the pipe delimiters (shown as "some text") varies, as do the name and the E-mail field.
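You will find the email with:
[^|]+$
That matches everything after the last pipe up to the end of the line, i.e. the last field including the E-mail part. If the text has several lines, add the (?m) flag so that $ matches at each line end.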
Try this:
public static void main(String[] args) {
    String str = "name | some text | some text | some text E-mail:abc#xyz.com";
    // Greedily match everything up to the last whitespace after the last pipe
    String regex = ".*\\|.*\\s+";
    String email = str.replaceAll(regex, "");
    System.out.println(email); // E-mail:abc#xyz.com
}
After splitting the string, compare the last element of the resulting array with an email regex; I'm sure you can find one online.
String[] s = yourString.split("\\|");
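Putting the ideas above together, here is a minimal sketch that removes every signature in one pass. It rests on assumptions that are not in the question: each signature sits on a single line, and the E-mail value contains no whitespace. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class SignatureRemover {
    // Assumption: a signature occupies one line, contains at least three pipes,
    // and ends with "E-mail:" followed by a run of non-whitespace characters.
    private static final Pattern SIGNATURE = Pattern.compile(
            "(?m)^[^|\\r\\n]*(?:\\|[^|\\r\\n]*){3,}E-mail:\\S+[ \\t]*(?:\\r?\\n)?");

    // Removes every line that looks like a signature, including its line break.
    static String removeSignatures(String text) {
        return SIGNATURE.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String text = "Hello,\n"
                + "name | some text | some text | some text E-mail:abc#xyz.com\n"
                + "Regards";
        System.out.println(removeSignatures(text));
    }
}
```

If the signature can start in the middle of a line, the pattern needs some other way to tell where the name begins, which the question does not specify.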
I have a Java string delimited by |-| like the one below.
I can't find an example of splitting on a |-| delimiter anywhere else, so I believe this question is unique.
String agent = "iOS|-|iPhone|-|18.2.3|-|kuoipo-kjpopoo-kijhloii-kllkijii";
What is the correct regex to split the contents into a String array like the one below?
String[] dataarray;
dataarray[0]="iOS";
dataarray[1]="iPhone";
dataarray[2]="18.2.3";
dataarray[3]="kuoipo-kjpopoo-kijhloii-kllkijii";
Already tried:
agent.split("\\|-\\|");
Thanks in Advance.
This won't work, because an unescaped | is the regex alternation operator:
agent.split("|-|")
Escape the pipes instead:
agent.split("\\|-\\|")
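As an alternative to escaping by hand, Pattern.quote can turn the whole delimiter into a literal. A minimal sketch (the class and helper names are just for illustration):

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class DelimiterSplit {
    // Splits on the delimiter taken literally; Pattern.quote removes
    // the special meaning of characters such as |
    static String[] splitOnLiteral(String input, String delimiter) {
        return input.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        String agent = "iOS|-|iPhone|-|18.2.3|-|kuoipo-kjpopoo-kijhloii-kllkijii";
        // prints [iOS, iPhone, 18.2.3, kuoipo-kjpopoo-kijhloii-kllkijii]
        System.out.println(Arrays.toString(splitOnLiteral(agent, "|-|")));
    }
}
```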
I have HTML code with img src tags pointing to URLs. Some have mysite.com/myimage.png as src, others have mysite.com/1234/12/12/myimage.png. I want to replace these URLs with a cache file path. I'm looking for something like this:
String website = "mysite.com";
String text = webContent.replaceAll(website+ "\\d{4}\\/\\d{2}\\/\\d{2}", String.valueOf(cacheDir));
This code, however, does not work when the URL does not have the extra date stamp at the end. Does anyone know how I might achieve this? Thanks!
Try this one:
mysite\.com/(\d{4}/\d{2}/\d{2}/)?
Here ? means zero or one occurrence, so the date part is optional.
Note: escape the dot as \. to match a literal dot, because an unescaped . matches any character in a regex.
Sample code :
String[] webContents = new String[] { "mysite.com/myimage.png",
"mysite.com/1234/12/12/myimage.png" };
for (String webContent : webContents) {
String text = webContent.replaceAll("mysite\\.com/(\\d{4}/\\d{2}/\\d{2}/)?",
String.valueOf("mysite.com/abc/"));
System.out.println(text);
}
output:
mysite.com/abc/myimage.png
mysite.com/abc/myimage.png
You are missing a forward slash between the website.com and the first 4 digits.
String text = webContent.replaceAll(Pattern.quote(website) + "/\\d{4}\\/\\d{2}\\/\\d{2}", String.valueOf(cacheDir));
I'd also recommend using a literal for your website.com value (the Pattern.quote part).
Finally you are also missing the last forward slash after the last two digits so it won't be replaced, but that may be on purpose...
Try:
String text = webContent.replaceAll("(?<="+website+")(.*)(?=\\/)",
String.valueOf(cacheDir));
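Combining the suggestions above (optional date part, Pattern.quote for the site name), a sketch could look like the following. The method name and the cacheDir values are made up; Matcher.quoteReplacement is added on the assumption that the cache path may contain backslashes or $, which replaceAll would otherwise treat specially.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CachePathRewriter {
    // Replaces "website/" plus an optional "yyyy/mm/dd/" date stamp with "cacheDir/".
    static String rewrite(String webContent, String website, String cacheDir) {
        String regex = Pattern.quote(website) + "/(?:\\d{4}/\\d{2}/\\d{2}/)?";
        // quoteReplacement keeps backslashes and $ in the path literal
        return webContent.replaceAll(regex, Matcher.quoteReplacement(cacheDir + "/"));
    }

    public static void main(String[] args) {
        String cacheDir = "/data/cache"; // hypothetical cache directory
        System.out.println(rewrite("<img src=\"mysite.com/myimage.png\">", "mysite.com", cacheDir));
        System.out.println(rewrite("<img src=\"mysite.com/1234/12/12/myimage.png\">", "mysite.com", cacheDir));
    }
}
```

Both calls should yield the same src value, since the date segment is optional in the pattern.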
The BNF form of a URL is given in RFC 1738:
http://www.w3.org/Addressing/rfc1738.txt
What I need to do is extract the URLs from HTML text. I was wondering whether I can represent the grammar as a regex, like this:
String alpha = "[a-zA-Z]";
String alphadigit = "[a-zA-Z0-9]";
String domainlabel = alphadigit+"|"+alphadigit+"("+alphadigit+"|-)*?"+alphadigit;
//String toplabel = alpha+"|"+alpha+"("+alphadigit+"|-)*?"+alphadigit;
String toplabel = "com|org|net|mil|edu|(co\\.[a-z]+)";
String hostname = "(("+domainlabel+")\\.)*("+toplabel+")";
String hostport = hostname;
String lowalpha = "([a-z])";
String hialpha = "([A-Z])";
String alpha = "("+lowalpha+"|"+hialpha+")";
String digit = "([0-9])";
String safe = "($|-|_|.|\\+)";
String extra = "(!|\\*|'|\\(|\\)|,)";
//String national = "{" | "}" | "|" | "\" | "^" | "~" | "[" | "]" | "`";
String punctuation = "(<|>|#|%|\")";
String reserved = "(;|/|?|:|#|&|=)";
String hex = "("+digit+"[A-Fa-f]"+")";
String escape = "(%"+hex+hex+")";
String unreserved = "("+alpha+"|"+digit+"|"+safe+"|"+extra+")";
String uchar = "("+unreserved+"|"+escape+")";
String hsegment = "(("+uchar+"|;|:|#|&|=)*)";
String search = "("+uchar+"|;|:|#|&|=)?)";
String hpath = hsegment+"(/"+hsegment+")*";
//String httpurl = "http://"+hostport+"(/"+hpath+"(?"+search+")?)?";
String httpurl = "http://"+hostport+"/"+hpath;
The final regex:
http://(([a-zA-Z0-9]|[a-zA-Z0-9]([a-zA-Z0-9]|-)*?[a-zA-Z0-9])\.)*(com|org|net|mil|edu|(co\.[a-z]+))/(((((([a-z])|([A-Z]))|([0-9])|($|-|_|.|\+)|(!|\*|'|\(|\)|,))|(%(([0-9])[A-Fa-f])(([0-9])[A-Fa-f])))|;|:|#|&|=)*)(/(((((([a-z])|([A-Z]))|([0-9])|($|-|_|.|\+)|(!|\*|'|\(|\)|,))|(%(([0-9])[A-Fa-f])(([0-9])[A-Fa-f])))|;|:|#|&|=)*))*
So you can see I converted the whole BNF into one big regular expression, which will be used with java.util.regex methods to extract the URLs from text. Is this the correct approach? If it is, then why do we need to write a context-free grammar? What disadvantages does the regex approach have?
Besides, in a grammar-based parser for, say, a programming language, the grammar is used to validate whether the code follows the grammar rules and to show error messages otherwise. The grammar also gives us a syntax tree, which is used to evaluate the expression. For URLs we don't evaluate anything; we just need to extract the URLs from the rest of the text.
I ask because I was previously trying to parse email addresses. After an exhaustive search for regular expressions, none of them turned out to be 100% accurate, and some comments pointed out the limitations of regex in matching the exact BNF form of email addresses in the RFC. Hence a grammar (instead of a regex) might be required, and hence this question about URLs.
Thanks
Well, I think your issue could be solved more easily using some heuristics about what an http link looks like in free text. It could also run faster than such a complicated regexp, especially on large texts:
An http link (URL) starts with the unique prefix http://
From start to end, a URL does not contain certain characters (whitespace, for example). When you come across such a character, you have found the end of the URL.
If the URL you are extracting is within tags (such as the href property of an anchor tag) then I'd recommend using JSoup to parse and inspect the HTML.
http://jsoup.org/
Within the body of text, I'm certain a simpler regex approach is possible, perhaps matching on the protocol (http://).
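The heuristic described in these answers can be sketched as follows. The set of terminating characters (whitespace, quotes, angle brackets) and the optional "s" for https are assumptions of this sketch, not something taken from the RFC; the class name is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlExtractor {
    // Heuristic: a URL starts at http:// (or https://) and runs until the
    // first whitespace, quote, or angle bracket.
    private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

    static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "See http://jsoup.org/ and "
                + "<a href=\"http://www.w3.org/Addressing/rfc1738.txt\">the spec</a>";
        // prints [http://jsoup.org/, http://www.w3.org/Addressing/rfc1738.txt]
        System.out.println(extractUrls(text));
    }
}
```

This does not validate the URLs against the BNF; it only finds candidates. Trailing punctuation in prose (for example a full stop right after a link) would be captured too and may need trimming.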
I have a String holding a folder/file name, and I am creating a folder or file with that string. The string may contain some characters that prevent the desired folder or file from being created,
e.g.
String folder = "ArslanFolder 20/01/2013";
So I want to replace these characters with "_".
Here are the characters:
private static final String ReservedChars = "|\\?*<\":>+[]/'";
What would the regular expression be? I know about replaceAll(), but I want to create a regular expression for it.
Use this code:
String folder = "ArslanFolder 20/01/2013 ? / '";
String result = folder.replaceAll("[|?*<\":>+\\[\\]/']", "_");
And the result would be:
ArslanFolder 20_01_2013 _ _ _
You didn't say that spaces should be replaced, so the spaces are still there; you could add the space to the character class if necessary.
I used one of these:
String alphaOnly = input.replaceAll("[^\\p{Alpha}]+","");
String alphaAndDigits = input.replaceAll("[^\\p{Alpha}\\p{Digit}]+","");
See this link:
Replace special characters
Try this:
replaceAll("[\\W]", "_");
It will replace every character that is not a letter, digit, or underscore (including spaces) with an underscore.
This is the correct solution:
String result = inputString.replaceAll("[\\\\|?\u0000*<\":>+\\[\\]/']", "_");
Kent's answer is good, but it doesn't include the NUL character and \.
This is also a safe way to rename user-supplied file names, for example.
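For repeated use, the character class from the answers above can be compiled once and wrapped in a helper. This is only a sketch; the class and method names are invented, and the character set is the one listed in the question plus NUL and the backslash.

```java
import java.util.regex.Pattern;

public class FileNameSanitizer {
    // Character class covering | \ ? NUL * < " : > + [ ] / '
    private static final Pattern RESERVED =
            Pattern.compile("[\\\\|?\\x00*<\":>+\\[\\]/']");

    // Replaces each reserved character with an underscore.
    static String sanitize(String name) {
        return RESERVED.matcher(name).replaceAll("_");
    }

    public static void main(String[] args) {
        // prints ArslanFolder 20_01_2013
        System.out.println(sanitize("ArslanFolder 20/01/2013"));
    }
}
```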
I have a string which contains the value given below. I want to replace the HTML img tags containing a specific customerId with some new text. I tried a small Java program, which is not giving me the expected output. Here is the program info.
My input string is
String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
+ "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";
Regex is
String regex = "(?s)\\<img.*?customerId=3340.*?>";
The new text I want to put into the input string is
EDIT Starts:
String newText = "<img src=\"getCustomerNew.do\">";
EDIT ENDS:
Now I am doing
String outputText = inputText.replaceAll(regex, newText);
The output is
Starting here.. Replacing Text ..Ending here
but my expected output is
Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p><p>someText</p>Replacing Text ..Ending here
Please note that in my expected output only the img tag containing customerId=3340 is replaced with Replacing Text. I don't understand why both img tags are being replaced in the actual output.
You've got "wildcard"/"any" patterns (.*?) in there. Even though they are lazy, the first one can still run past the end of the first img tag: the match starts at the first <img, and .*? keeps expanding, across > characters, until it reaches customerId=3340. The final > then closes the second tag, so both tags end up inside one match.
You should be able to fix this by changing the .*? parts to something like [^>]+ so that the matching won't span past the first > character.
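A sketch of the [^>] variant, using the input string from the question. The helper name is made up; Pattern.quote, the \b after the id (so that 3340 does not also match 33401), and Matcher.quoteReplacement are additions of this sketch rather than part of the original suggestion.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgTagReplace {
    // Replaces only those img tags whose attributes contain the given customerId.
    static String replaceImg(String html, String customerId, String newText) {
        // [^>]* cannot cross a '>', so a match never spans more than one tag
        String regex = "<img[^>]*customerId=" + Pattern.quote(customerId) + "\\b[^>]*>";
        return html.replaceAll(regex, Matcher.quoteReplacement(newText));
    }

    public static void main(String[] args) {
        String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
                + "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";
        System.out.println(replaceImg(inputText, "3340", "Replacing Text"));
    }
}
```

This still assumes that no attribute value contains a > character, which a regex cannot guarantee for arbitrary HTML.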
Parsing HTML with regular expressions is bound to cause pain.
As other people have told you in the comments, HTML is not a regular language so using regex for manipulating it is usually painful. Your best option is to use an HTML parser. I haven't used Jsoup before, but googling a little bit it seems you need something like:
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
public class MyJsoupExample {
public static void main(String args[]) {
String inputText = "<html><head></head><body><p><img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123\"/></p>"
+ "<p>someText <img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456\"/></p></body></html>";
Document doc = Jsoup.parse(inputText);
Elements myImgs = doc.select("img[src*=customerId=3340]");
for (Element element : myImgs) {
element.replaceWith(new TextNode("my replaced text", ""));
}
System.out.println(doc.toString());
}
}
Basically the code gets the list of img nodes with a src attribute containing a given string
Elements myImgs = doc.select("img[src*=customerId=3340]");
then loop over the list and replace those nodes with some text.
UPDATE
If you don't want to replace the whole img node with text but instead you need to give a new value to its src attribute then you can replace the block of the for loop with:
element.attr("src", "my new value");
or if you want to change just a part of the src value then you can do:
String srcValue = element.attr("src");
element.attr("src", srcValue.replace("getCustomers.do", "getCustomerNew.do"));
which is very similar to what I posted in this thread.
What happens is that your regex starts matching at the first img tag, then consumes everything (regardless of whether the quantifier is greedy or not) until it finds customerId=3340, and then continues consuming everything until it finds >.
If you want it to consume just the img with customerId=3340, think about what distinguishes this tag from the other tags it may match.
In this particular case, one possible solution is to look at what is behind that img tag using a look-behind operator (which doesn't consume a match). This regex will work:
String regex = "(?<=</p>)<img src=\".*?customerId=3340.*?>";