Extracting Strings from a String in Java - java

Hello, I want to extract "Hello, World!", "and", and the paragraph "This is a minimal....." from the string below in Java. I am having trouble extracting them, so can anyone help me with it?
The input string is different every time, and I want to extract the text between each pair of square-bracket tags []......[].
String s1 = "[sh1] Hello, World! [/s11] and [pp]This is a minimal \"hello world\" HTML document. It demonstrates the basic structure of an HTML file and anchors. [/xy]";
Thanks

Use Pattern and Matcher to match the text between square-bracket tags:
Pattern pattern = Pattern.compile("\\[[^\\]]*\\]([^\\]]*)\\[[^\\]]*\\]");
Matcher matcher = pattern.matcher(s1);
while (matcher.find()) {
    System.out.println("Found value: " + matcher.group(1).trim());
}
Demo: https://ideone.com/kNKBgg
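Note that the pattern above consumes both an opening and a closing tag with every match, so text that sits between one closing tag and the next opening tag (the "and" in the question) is skipped. A minimal sketch of one way to pick up every segment is to split on the tags instead. It assumes the tags never nest and never contain a ] character; the class and method names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

class BracketSplit {
    // Splits the input on every [...] tag and returns the non-empty,
    // trimmed pieces of text that remain between the tags.
    static List<String> textBetweenTags(String input) {
        List<String> parts = new ArrayList<>();
        for (String piece : input.split("\\[[^\\]]*\\]")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String s1 = "[sh1] Hello, World! [/s11] and [pp]This is a minimal \"hello world\" HTML document. [/xy]";
        for (String part : textBetweenTags(s1)) {
            System.out.println(part);
        }
    }
}
```

For the string in the question this yields three pieces: "Hello, World!", "and", and the paragraph. The same caveat as in the answers applies: this is fine for simple bracket markers, not for real HTML.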

Please don't use regular expressions for this (that is what Pattern and Matcher do); see here for the reason why you shouldn't. They are workable for this particular bracket example, but if you expect full-blown HTML, don't do it.
If you want to extract content from HTML, use a parser, for example SAXParser or DOMParser; see the Oracle documentation for examples.

Related

How to replace a given substring with "" from a given string?

I went through a couple of examples for replacing a given substring of a string with "" but could not get the result I wanted. The string is too long to post, but it contains a substring like this:
/image/journal/article?img_id=24810&t=1475128689597
I want to replace this substring with "". The values of img_id and t can vary, so I have to use a regular expression. I tried the following code:
String regex="^/image/journal/article?img_id=([0-9])*&t=([0-9])*$";
content=content.replace(regex,"");
Here content is the original string. This code does not replace anything in content. Any help would be appreciated. Thanks in advance.
Use replaceAll, which treats its first argument as a regular expression (replace treats it as literal text):
content=content.replaceAll("[0-9]*","");
Code
String content="/image/journal/article?img_id=24810&t=1475128689597";
content=content.replaceAll("[0-9]*","");
System.out.println(content);
Output :
/image/journal/article?img_id=&t=
Update: a simpler approach, perhaps a little less elegant but easy:
String content="sas/image/journal/article?img_id=24810&t=1475128689597";
content=content.replaceAll("\\/image.*","");
System.out.println(content);
Output:
sas
If something follows t=1475128689597, for example /?tag=343sdds, and you want to keep ?tag=343sdds, use the following:
String content="sas/image/journal/article?img_id=24810&t=1475128689597/?tag=343sdds";
content=content.replaceAll("(\\/image.*[0-9]+[\\/])","");
System.out.println(content);
Output:
sas?tag=343sdds
If you're trying to replace the numbers in the URL with two quotation marks, like so:
/image/journal/article?img_id=""&t=""
Then you need to use escaped quotes \"\" as the replacement, edit your regex so that it matches only the numbers, and switch to replaceAll:
content=content.replaceAll(regex,"\"\"");
You can use the java.util.regex classes to replace the part of your string that matches a given pattern (regex) with "" (or any other string literal), as follows:
String content = "ALPHA_/image/journal/article?img_id=24810&t=1475128689597_BRAVO";
String regex = "\\/image\\/journal\\/article\\?img_id=\\d+&t=\\d+";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(content);
if (matcher.find()) {
    String replacement = matcher.replaceAll("PK");
    System.out.println(replacement); // Will print ALPHA_PK_BRAVO
}
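For the case in the question, where the match should simply disappear, it may help to spell out why the original attempt replaces nothing: String.replace takes literal text rather than a regex, the unescaped ? makes the preceding "e" optional instead of matching a question mark, and ^ ... $ only match when the link is the entire string. A minimal sketch with those three points addressed (the class and method names are illustrative, not from the question):

```java
class ImageLinkRemover {
    // "\\?" matches a literal question mark; "\\d+" matches the varying numbers.
    // No ^ or $ anchors, so the link may appear anywhere inside the content.
    private static final String IMAGE_LINK = "/image/journal/article\\?img_id=\\d+&t=\\d+";

    // replaceAll interprets its first argument as a regular expression.
    static String removeImageLinks(String content) {
        return content.replaceAll(IMAGE_LINK, "");
    }

    public static void main(String[] args) {
        String content = "ALPHA_/image/journal/article?img_id=24810&t=1475128689597_BRAVO";
        System.out.println(removeImageLinks(content)); // prints ALPHA__BRAVO
    }
}
```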

How to extract image url from a string?

I am trying to extract an image URL from inside a string, using Pattern and Matcher with a regular expression. Whenever I debug the code, both matcher.matches() and matcher.find() return false.
Here are the image URL, the regular expression, and my code:
Pattern pattern_name;
Matcher matcher_name;
String regex = "(http(s?):/)(/[^/]+)+\" + \"\\.(?:jpg|gif|png)";
String url = "http://www.medivision360.com/pharma/pages/articleImg/thumbnail/thumb3756d839adc5da3.jpg";
pattern_name = Pattern.compile(regex);
matcher_name = pattern_name.matcher(url);
matcher_name.matches();
matcher_name.find();
There is a problem with the regex: the \" + \" must have come from string-concatenation code that ended up inside the pattern. That subpattern requires a double quote, one or more spaces, then another space, and another double quote to appear right before the extension. It matches something like http://www.medivision360.com/pharma/pages/articleImg/thumbnail/thumb3756d839adc5da3"  ".jpg.
Also, the two capture groups at the beginning are redundant; you do not need them.
Use
String regex = "https?:/(?:/[^/]+)+\\.(?:jpg|gif|png)";
See this demo
Java demo:
String rx = "https?:/(?:/[^/]+)+\\.(?:jpg|gif|png)";
String url = "http://www.medivision360.com/pharma/pages/articleImg/thumbnail/thumb3756d839adc5da3.jpg";
Pattern pat = Pattern.compile(rx);
Matcher matcher = pat.matcher(url);
if (matcher.matches()) {
    System.out.println(matcher.group());
}
Note that Matcher#matches() requires a full string match, while Matcher#find() will find a partial match, a match inside a larger string.
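The difference between the two calls can be seen in a small sketch that reuses the pattern above. The class name, method names, and the example.com URL are assumptions for illustration only:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesVsFind {
    private static final Pattern IMAGE_URL =
            Pattern.compile("https?:/(?:/[^/]+)+\\.(?:jpg|gif|png)");

    // matches(): true only if the whole input is an image URL.
    static boolean isImageUrl(String input) {
        return IMAGE_URL.matcher(input).matches();
    }

    // find(): returns the first image URL found anywhere in the input, or null.
    static String firstImageUrl(String input) {
        Matcher m = IMAGE_URL.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "Image: http://example.com/pics/cat.png (thumbnail)";
        System.out.println(isImageUrl(text));    // false, extra text around the URL
        System.out.println(firstImageUrl(text)); // http://example.com/pics/cat.png
    }
}
```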
You've escaped the double quotes in the string concatenation,
so the regex engine sees this: (http(s?):/)(/[^/]+)+" + "\.(?:jpg|gif|png)
after the Java compiler parses the string literal.
You can un-escape them: "(http(s?):/)(/[^/]+)+" + "\\.(?:jpg|gif|png)"
or just join the two parts together: "(http(s?):/)(/[^/]+)+\\.(?:jpg|gif|png)"
If the expression is always at the end, I would suggest:
([^/?]+)(?=/?(?:$|\?))

How to get (split) filenames from string in java?

I have a string that contains file names like:
"file1.txt file2.jpg tricky file name.txt other tricky filenames containing áéíőéáóó.gif"
How can I get the file names, one by one?
I am looking for the safest, most thorough method, preferably something standard in Java. There has got to be a regular expression for this already out there; I am counting on your experience.
Edit: expected results:
"file1.txt", "file2.jpg", "tricky file name.txt", "other tricky filenames containing áéíőéáóó.gif"
Thanks for the help,
Sziro
The regular expression that enrico.bacis suggested, (\S.*?\.\S+), will not work if there are file names with no characters before the ".", such as .project.
A correct pattern would be:
(([^ .]+ +)*\S*\.\S+)
You can try it here.
A Java program that extracts the file names could look like this:
String patternStr = "([^ .]+ +)*\\S*\\.\\S+";
String input = "file1.txt .project file2.jpg tricky file name.txt other tricky filenames containing áéíoéáóó.gif";
Pattern pattern = Pattern.compile(patternStr, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
    System.out.println(matcher.group());
}
If you want to use regular expressions you can find all the occurrences of:
(\S.*?\.\S+)
(you can test it here)
Spaces in the file names make this trickier.
If you can assume there are no dots (.) inside the file names themselves, you can use the dot to find each individual record, as has been suggested.
If you can't assume that (e.g. my file.new something.txt), I'd suggest creating a list of acceptable extensions, e.g. .doc, .jpg, .pdf etc.
I know the list may be long, so it's not ideal. Once you have it, you can look for these extensions and assume that whatever precedes each one is a valid file name.
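The extension-list idea can be sketched as follows. The list of extensions here is an assumption chosen to cover the examples in this thread, and the class and method names are made up; a real list would have to be adapted to the data:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KnownExtensionSplitter {
    // A file name starts at a non-space character and runs (lazily) up to the
    // first known extension that is followed by whitespace or the end of input.
    private static final Pattern FILE_NAME =
            Pattern.compile("\\S.*?\\.(?:txt|jpg|gif|doc|pdf)(?=\\s|$)");

    static List<String> split(String input) {
        List<String> names = new ArrayList<>();
        Matcher m = FILE_NAME.matcher(input);
        while (m.find()) {
            names.add(m.group());
        }
        return names;
    }

    public static void main(String[] args) {
        String input = "file1.txt file2.jpg tricky file name.txt my file.new something.txt";
        for (String name : split(input)) {
            System.out.println(name);
        }
    }
}
```

Because .new is not in the list, "my file.new something.txt" comes out as a single file name. The obvious trade-off is that any file whose extension is missing from the list gets glued onto the next file name.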
String txt = "file1.txt file2.jpg tricky file name.txt other tricky filenames containing áéíőéáóó.gif";
Pattern pattern = Pattern.compile("\\S.*?\\.\\S+"); // regex from enrico.bacis's answer
Matcher matcher = pattern.matcher(txt);
while (matcher.find()) {
    System.out.println(matcher.group().trim());
}

how to choose text from a file

I have a text file like:
"GET /opacial/index.php?op=results&catalog=1&view=1&language=el&numhits=10&query=\xce\x95\xce\xbb\xce\xbb\xce\xac\xce\xb4\xce\xb1%20--%20\xce\x95\xce\xb8\xce\xbd\xce\xb9\xce\xba\xce\xad\xcf\x82%20\xcf\x83\xcf\x87\xce\xad\xcf\x83\xce\xb5\xce\xb9\xcf\x82%20--%20\xce\x99\xcf\x83\xcf\x84\xce\xbf\xcf\x81\xce\xaf\xce\xb1&search_field=11&page=1
And I want to cut out all the characters after the word "query" and before "&search".
I am trying to extract the data using patterns, but something is wrong. Can you give me example code for the input above?
EDIT:
Another problem, apart from the one above, is that a Matcher works only on a CharSequence, and I have a file, which cannot be cast to a CharSequence... :\
Something like this?
String yourNewText=yourOldText.split("query")[1].split("&search")[0];
To see how to read a file into a String, you can look here (there are different possibilities).
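As a rough sketch of the file part: a String is a CharSequence, so once the file content has been read into a String it can be handed to a Matcher directly. The example below assumes the file is small enough to read in one go and is UTF-8 encoded; the class and method names are hypothetical:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QueryExtractor {
    // Lazily captures everything between "query=" and "&search_field".
    private static final Pattern QUERY = Pattern.compile("query=(.*?)&search_field");

    // Returns the captured query text, or null if the input has no such part.
    static String extractQuery(CharSequence text) {
        Matcher m = QUERY.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Reads the whole file into a String (assumed UTF-8) and extracts from it.
    static String extractQueryFromFile(Path file) throws IOException {
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return extractQuery(text);
    }
}
```

For a log file with many request lines, the same extractQuery method could be called on each line in turn instead of on the whole file content.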
".*query\\=(.*)\\&search_field.*"
This regex should capture the part you want to remove in group 1. String.replace should then do the trick.
Edit - response to comment. The following code...
String s = "GET /opacial/index.php?op=results&catalog=1&view=1&language=el&numhits=10&query=\\xce\\x95\\xce\\xbb\\xce\\xbb\\xce\\xac\\xce\\xb4\\xce\\xb1%20--%20\\xce\\x95\\xce\\xb8\\xce\\xbd\\xce\\xb9\\xce\\xba\\xce\\xad\\xcf\\x82%20\\xcf\\x83\\xcf\\x87\\xce\\xad\\xcf\\x83\\xce\\xb5\\xce\\xb9\\xcf\\x82%20 --%20\\xce\\x99\\xcf\\x83\\xcf\\x84\\xce\\xbf\\xcf\\x81\\xce\\xaf\\xce\\xb1&search_field=11&page=1";
Pattern p = Pattern.compile(".*query\\=(.*)\\&search_field.*");
Matcher m = p.matcher(s);
if (m.matches()) {
    String betweenQueryAndSearch = m.group(1);
    System.out.println(betweenQueryAndSearch);
}
Produced the following output....
\xce\x95\xce\xbb\xce\xbb\xce\xac\xce\xb4\xce\xb1%20--%20\xce\x95\xce\xb8\xce\xbd\xce\xb9\xce\xba\xce\xad\xcf\x82%20\xcf\x83\xcf\x87\xce\xad\xcf\x83\xce\xb5\xce\xb9\xcf\x82%20 --%20\xce\x99\xcf\x83\xcf\x84\xce\xbf\xcf\x81\xce\xaf\xce\xb1

Regex to remove html does not get rid of img tag

I am using a regex to remove HTML tags. I do something like this:
result.replaceAll("\\<.*?\\>", "");
However, it does not get rid of the img tags in the HTML. Any idea what a good way to do that is?
If you cannot use HTML parsers/cleaners, then I would at least suggest using the Pattern.DOTALL flag to handle multi-line HTML blocks. Consider code like this:
String str = "123 <img \nsrc='ping.png'>abd foo";
Pattern pt = Pattern.compile("<.*?>", Pattern.DOTALL);
Matcher matcher = pt.matcher(str);
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
    matcher.appendReplacement(sb, "");
}
matcher.appendTail(sb);
System.out.println("Output: " + sb);
OUTPUT
Output: 123 abd foo
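To see what the flag changes: by default . does not match line terminators, so a tag that spans two lines is never matched by <.*?>. The short sketch below (class and method names are made up) compares the pattern without the flag, with Pattern.DOTALL, and with the negated character class <[^>]*>, which crosses line breaks without any flag:

```java
import java.util.regex.Pattern;

class TagStripper {
    // DOTALL lets '.' match line terminators, so multi-line tags are matched too.
    private static final Pattern TAG = Pattern.compile("<.*?>", Pattern.DOTALL);

    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String str = "123 <img \nsrc='ping.png'>abd foo";
        // Without DOTALL the tag survives, because '.' stops at the newline.
        System.out.println(str.replaceAll("<.*?>", "").equals(str)); // true
        System.out.println(stripTags(str));                          // 123 abd foo
        System.out.println(str.replaceAll("<[^>]*>", ""));           // 123 abd foo
    }
}
```

This still is not a parser: a > inside an attribute value or a comment will cut a tag short, which is why the answers below recommend a real HTML parser.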
To give a more concrete recommendation, use JSoup (or NekoHTML) to parse the HTML into a Java object.
Once you've got a Document object it can easily be traversed to remove the tags. This cookbook recipe shows how to get attributes and text from the DOM.
Another suggestion is HtmlCleaner.
I'm just reiterating what others have already said, but this point cannot be overstated: DO NOT USE REGEXES TO PARSE HTML. There are a thousand similar questions about this on SO. Use a proper HTML parser; it will make your life much easier and is far more robust and reliable. Take a look at Dom4j, Jericho, JSoup. Please.
So, a piece of code for you.
I use http://htmlparser.sourceforge.net/ to parse HTML. It is not overcomplicated and quite straightforward to use.
Basically it looks like this:
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
...
String html; /* read your HTML into variable 'html' */
String result=null;
....
try {
    Parser p = new Parser(html);
    NodeList nodes = p.parse(null);
    result = nodes.asString();
} catch (ParserException e) {
    e.printStackTrace();
}
That will give you plain text stripped of tags (but entities such as &amp; will not be decoded). And of course you can do plenty more with this library, such as applying filters and visitors, iterating over nodes, and so on.
Use an HTML parser instead. Iterate over the resulting object, print it however you like, and you will get the best result.
I have been able to do this with the code snippet below.
String htmlContent = values.get(position).getContentSnippet();
String plainTextContent = htmlContent.replaceAll("<img .*?/>", "");
I used the above regex to clean the img tags in my RSS content.
