I'm trying to write a small program that extracts information from a website. I only want the text between two strings, "ORIGIN" and "//". I'm not getting any errors in the code, but nothing is printed to the screen. Could someone point out what I'm doing wrong?
import java.io.IOException;
import java.io.PrintStream;
import java.io.FileOutputStream;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.regex.*;
class main {
public static void main(String[] args) throws IOException {
Document doc = Jsoup.connect("http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?val=293762&db=nuccore&dopt=genbank&extrafeat=976&fmt_mask=0&retmode=html&withmarkup=on&log$=seqview&maxplex=3&maxdownloadsize=1000000").get();
String text = doc.text();
String pattern1 = "ORIGIN";
String pattern2 = "//";
String regexString = Pattern.quote(pattern1) + "(.*?)" + Pattern.quote(pattern2);
Pattern pattern = Pattern.compile(regexString, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
String textInBetween = matcher.group(1);
}
Pattern p = Pattern.compile(Pattern.quote(pattern1) + "(.*?)" + Pattern.quote(pattern2));
Matcher m = p.matcher(text);
while (m.find()) {
System.out.println(m.group(1));
}
}
}
You need to use the DOTALL flag to match any possible newline characters
Pattern pattern = Pattern.compile(Pattern.quote(pattern1) + "(.*?)" +
Pattern.quote(pattern2), Pattern.DOTALL);
You have to compile the patterns with DOTALL modifier:
Pattern pattern = Pattern.compile(regexString, Pattern.MULTILINE | Pattern.DOTALL);
Pattern p = Pattern.compile(Pattern.quote(pattern1) + "(.*?)" + Pattern.quote(pattern2), Pattern.DOTALL);
This modifier allows the period . to match every character, including newlines. Without it, the dot matches every character except line terminators.
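The difference can be seen in a minimal sketch like the one below. It assumes the fetched text still contains line breaks between ORIGIN and //; the sample string, the class name DotallDemo and the helper between are made up for illustration and are not part of the original program.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotallDemo {
    // Returns the text between "ORIGIN" and "//", or null when nothing matches.
    static String between(String text, int flags) {
        Pattern p = Pattern.compile(
                Pattern.quote("ORIGIN") + "(.*?)" + Pattern.quote("//"), flags);
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample; the real page content may differ.
        String text = "ORIGIN\n 1 gatcctccat\n 61 ccgcaaatgg\n//";
        // Without DOTALL the dot cannot cross a line break, so there is no match.
        System.out.println(between(text, 0));
        // With DOTALL the group spans the lines between the two markers.
        System.out.println(between(text, Pattern.DOTALL));
    }
}
```

If the text turns out to have no line breaks at all, DOTALL makes no difference and the cause of the missing output lies elsewhere.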
I want to get TY_111.22-L007-C010, Tzo11-L010-C100 and Tff-L010-C110 from this string with a regex:
"12.5*MAX(\"TY_111.22-L007-C010\";\"Tzo11-L010-C100\";\"Tff-L010-C110\")
I tested T.*-L\d*-C\d* but it doesn't give the result I want.
My Java test code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "T.*-L\\d*-C\\d*";
final String string = "\"12.5*MAX(\\\"TY_111.22-L007-C010\\\";\\\"Tzo11-L010-C100\\\";\\\"Tff-L010-C110\\\"";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
You need to use this regex T.*?\-L\d*?\-C\d*
final String regex = "T.*?\\-L\\d*?\\-C\\d*";
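Note: the important change is the non-greedy quantifier .*? instead of .*, which stops each match at the first -L rather than the last one. Escaping the hyphens as \- is optional, because a hyphen is only special inside a character class. You can also call matcher.group() instead of matcher.group(0); the two are equivalent, and since your regex has no capturing groups, the inner loop never prints anything.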
Outputs
Full match: TY_111.22-L007-C010
Full match: Tzo11-L010-C100
Full match: Tff-L010-C110
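To see what the non-greedy quantifier changes, the following sketch runs the greedy and the lazy pattern against the string from the question. The class name GreedyVsLazy and the helper findAll are illustrative only, and the hyphens are left unescaped because they are literal outside a character class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Collects every full match of the regex in the input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "12.5*MAX(\"TY_111.22-L007-C010\";\"Tzo11-L010-C100\";\"Tff-L010-C110\")";
        // Greedy: .* runs to the last "-L", so everything ends up in one long match.
        System.out.println(findAll("T.*-L\\d*-C\\d*", input));
        // Lazy: .*? stops at the first "-L", giving three separate matches.
        System.out.println(findAll("T.*?-L\\d*-C\\d*", input));
    }
}
```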
Why use a verbose regex pattern matcher when you can handle the problem with one line of code:
String input = "12.5*MAX(\"Txxxx-L007-C010\";\"Txxxx-L010-C100\";\"Txxxx-L010-C110\")";
String[] matches = input.replaceAll("^.*?\"|\"[^\"]*$", "")
.split("\";\"");
System.out.println(Arrays.toString(matches));
This prints:
[Txxxx-L007-C010, Txxxx-L010-C100, Txxxx-L010-C110]
OK...I used three lines of code, but the first and third are just for setting up the data and printing it.
I have this string:
values="[72, 216, 930],[250],[72],[228, 1539],[12]";
I am trying to combine two patterns in order to get the last number from each [] group that holds several numbers, and the single number from each [] group that holds only one.
pattern="\\, ([0-9]+)\\]|\\[([0-9]+)\\]"
But it outputs null:
930, null, null, 1539, null
How do I solve this problem?
Here we may not need to bound the match on the left at all. We can simply anchor on the ] on the right and capture the digits immediately before it, with an expression such as:
([0-9]+)\]
If you like, we can also bound it from the left, similar to this expression:
([\[\s,])([0-9]+)(\])
Try this.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = ", ([0-9]+)]";
final String string = "[72, 216, 930],[250],[72],[228, 1539],[12]";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
Output:
Full match: , 930]
Group 1: 930
Full match: , 1539]
Group 1: 1539
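Note that the pattern above only matches brackets that contain a comma, so [250], [72] and [12] are skipped. The null values in the question come from the alternation itself: each match goes through exactly one branch, so the group in the other branch stays null. A minimal sketch that keeps the question's two-branch pattern and reads whichever group took part might look like this (the class name LastNumberPerBracket and the helper lastNumbers are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastNumberPerBracket {
    static List<String> lastNumbers(String values) {
        // Branch 1 sets group 1 (last number after a comma),
        // branch 2 sets group 2 (the only number in a bracket).
        Pattern pattern = Pattern.compile(", ([0-9]+)\\]|\\[([0-9]+)\\]");
        Matcher matcher = pattern.matcher(values);
        List<String> result = new ArrayList<>();
        while (matcher.find()) {
            // Only one of the two groups is non-null for any given match.
            String last = matcher.group(1) != null ? matcher.group(1) : matcher.group(2);
            result.add(last);
        }
        return result;
    }

    public static void main(String[] args) {
        String values = "[72, 216, 930],[250],[72],[228, 1539],[12]";
        System.out.println(lastNumbers(values)); // [930, 250, 72, 1539, 12]
    }
}
```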
package Sample;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class StackOverFlow{
final static String regex = "\\d*]";
final static String string = "[72, 216, 930],[250],[72],[228, 1539],[12]";
final static Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final static Matcher matcher = pattern.matcher(string);
public static void main(String[] args) {
while (matcher.find()) {
String val = matcher.group(0).replace("]", "");
System.out.println(val);
}
}
}
Output:
930
250
72
1539
12
To make sure that the data is actually in between square brackets, you could use a capturing group, start the match with [ and end the match with ]
\[(?:\d+,\h+)*(\d+)]
In Java
\\[(?:\\d+,\\h+)*(\\d+)]
\[ Match [
(?:\d+,\h+)* Repeat 0+ times matching 1+ digit, comma and 1+ horizontal whitespace chars
(\d+) Capture in group 1 matching 1+ digit
] Match closing square bracket
For example:
String regex = "\\[(?:\\d+,\\h+)*(\\d+)]";
String string = "[72, 216, 930],[250],[72],[228, 1539],[12]";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
Result:
930
250
72
1539
12
It seems like for a structure such as this, it's likely beneficial to parse the whole thing into memory, then index into the elements you're particularly interested in to your heart's content. Should the structure change unexpectedly/dynamically, you won't need to rewrite your regex, just index as needed as many times as you wish:
import java.util.*;
class Main {
public static void main(String[] args) {
String values = "[72, 216, 930],[250],[72],[228, 1539],[12]";
String[] data = values.substring(1, values.length() - 1).split("\\]\\s*,\\s*\\[");
ArrayList<String[]> result = new ArrayList<>();
for (String d : data) {
result.add(d.split("\\s*,\\s*"));
}
System.out.println(result.get(0)[result.get(0).length-1]); // => 930
System.out.println(result.get(1)[0]); // => 250
}
}
I am trying to extract a url from the string. But I am unable to skip the double quotes in the output.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
class Main {
public static void main(String[] args) {
String s1 = "<a id=\"BUTTON_LINK\" style=\"%%BUTTON_LINK%%\" target=\"_blank\" href=\"https://||domainName||/basketReviewPageLoadAction.do\">%%CHECKOUT%%</a>";
//System.out.println(s1);
Pattern pattern = Pattern.compile("\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))");
Matcher matcher = pattern.matcher(s1);
if(matcher.find()){
String url = matcher.group(1);
System.out.println(url);
}
}
}
My Output is:
"https://||domainName||/basketReviewPageLoadAction.do"
Expected Output is:
https://||domainName||/basketReviewPageLoadAction.do
I cannot simply do a string replace: I have to add a few GET parameters to this URL and then put it back into the original string.
Regex: (?<=href=")([^\"]*) Substitution: $1?params...
Details:
(?<=) Positive Lookbehind
() Capturing group
[^] Match a single character not present in the list
* Matches between zero and unlimited times
$1 Group 1.
Java code:
Using replaceAll, you can append your parameters (here ?abc=12) to the end of capturing group $1, which in this case holds the href value.
String text = "<a id=\"BUTTON_LINK\" style=\"%%BUTTON_LINK%%\" target=\"_blank\" href=\"https://||domainName||/basketReviewPageLoadAction.do\">%%CHECKOUT%%</a>";
text = text.replaceAll("(?<=href=\")([^\"]*)", String.format("$1%s", "?abc=12"));
System.out.print(text);
Output:
<a id="BUTTON_LINK" style="%%BUTTON_LINK%%" target="_blank" href="https://||domainName||/basketReviewPageLoadAction.do?abc=12">%%CHECKOUT%%</a>
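One caveat with this approach: in a replaceAll replacement string, $ and \ have special meaning, so parameters containing them would be misread. A hedged sketch of one way around that, using Matcher.quoteReplacement, is shown here. The class name HrefParams, the helper appendToHref and the example.com URL are hypothetical.

```java
import java.util.regex.Matcher;

class HrefParams {
    // Appends params to the value of every href="..." attribute in html.
    static String appendToHref(String html, String params) {
        // quoteReplacement keeps '$' and '\' in params from being read as group references.
        return html.replaceAll("(?<=href=\")([^\"]*)",
                "$1" + Matcher.quoteReplacement(params));
    }

    public static void main(String[] args) {
        String html = "<a target=\"_blank\" href=\"https://example.com/a.do\">link</a>";
        System.out.println(appendToHref(html, "?price=$5"));
    }
}
```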
You can try one of these options:
System.out.println(url.replaceAll("^\"|\"$", ""));
System.out.println(url.substring(1, url.length()-1));
It's ugly, but it seems to work. Hope this helps.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;
class Main {
public static void main(String[] args) {
String s1 = "<a id=\"BUTTON_LINK\" style=\"%%BUTTON_LINK%%\" target=\"_blank\" href= \"https://||domainName||/basketReviewPageLoadAction.do\">%%CHECKOUT%%</a>";
//System.out.println(s1);
Pattern pattern = Pattern.compile("\\s*(?i)href\\s*=\\s*(\"([^\"]*)\"|'([^']*)'|([^'\">\\s]+))");
Matcher matcher = pattern.matcher(s1);
if (matcher.find()) {
String url = Stream.of(matcher.group(2), matcher.group(3),
matcher.group(4)).filter(s -> s != null).collect(Collectors.joining());
System.out.print(url);
}
}
}
This solution worked for now.
Pattern pattern = Pattern.compile("\\s*(?i)href\\s*=\\s*\"([^\"]*)");
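Since the goal was to add parameters and put the URL back into the original string, the offsets of the capturing group can be used directly instead of extracting and re-attaching the URL. This is only a sketch under that assumption; the class name HrefOffsets and the helper addParams are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefOffsets {
    // Inserts params right after the URL captured in group 1, if an href is found.
    static String addParams(String html, String params) {
        Pattern pattern = Pattern.compile("(?i)href\\s*=\\s*\"([^\"]*)");
        Matcher matcher = pattern.matcher(html);
        if (!matcher.find()) {
            return html; // no double-quoted href: leave the input unchanged
        }
        // end(1) is the index just past the last character of the URL.
        return new StringBuilder(html).insert(matcher.end(1), params).toString();
    }

    public static void main(String[] args) {
        String s1 = "<a id=\"BUTTON_LINK\" target=\"_blank\" href=\"https://||domainName||/basketReviewPageLoadAction.do\">%%CHECKOUT%%</a>";
        System.out.println(addParams(s1, "?abc=12"));
    }
}
```

This variant only handles double-quoted href values, like the pattern above it.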
You could try this:
s1 = s1.replace("\"", "");
I'm trying to extract a string between '/' and '.' of a URL. For example, I have a URL like "some.com/part1/part2/part3/stringINeed.xyz". I need to extract "stringINeed" from the above URL, the one between last '/' and the '.' nothing else.
So far, I tried the following and it gives an empty output:
import java.util.*;
import java.lang.*;
import java.io.*;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
class Extract
{
public static void main (String[] args) throws java.lang.Exception
{
String str = "part1/part2/part3/stringINeed.xyz" ;
Pattern pattern = Pattern.compile("/(.*?).");
Matcher matcher = pattern.matcher(str);
if (matcher.find()) {
System.out.println(matcher.group(1));
}
}
}
What is wrong with my code? Can anyone help?
Use this regex:
[^/.]+(?=\.[^.]+$)
In Java:
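Pattern regex = Pattern.compile("[^/.]+(?=\\.[^.]+$)");
Matcher regexMatcher = regex.matcher(str); // str is the input string from the question
if (regexMatcher.find()) {
String resultString = regexMatcher.group(); // the whole match; no capturing group needed
System.out.println(resultString);
}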
Explanation
[^/.]+ matches any chars that are not a slash or a dot
The lookahead (?=\.[^.]+$) asserts that what follows is a dot followed by non-dots and the end of the string
Without regex
str.substring(str.lastIndexOf('/') + 1, str.lastIndexOf('.'));
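As for why the original attempt printed an empty line: in "/(.*?)." the final dot is unescaped, so it matches any character, and the lazy group is then satisfied with zero characters. Escaping the dot alone is not enough either, because the match still starts at the first slash. The sketch below compares the three patterns; the class name LazyDotDemo, the helper firstGroup and the third pattern are illustrative, not taken from the answers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyDotDemo {
    // Returns group 1 of the first match, or null if the regex does not match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String str = "part1/part2/part3/stringINeed.xyz";
        // Original pattern: matches "/p" with an empty group.
        System.out.println("[" + firstGroup("/(.*?).", str) + "]");
        // Escaped dot: starts at the first slash, so the group is too long.
        System.out.println(firstGroup("/(.*?)\\.", str));
        // Excluding '/' and '.' and anchoring at the end isolates the last segment.
        System.out.println(firstGroup("([^/.]+)\\.[^/.]+$", str));
    }
}
```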
I am getting a compile-time error.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
class gfile
{
public static void main(String args[]) {
// create a Pattern
Pattern p = Pattern.compile("<div class="dinner">(.*?)</div>");//some prob with this line
// create a Matcher and use the Matcher.group() method
String can="<tr>"+
"<td class="summaryinfo">"+
"<div class="dinner">1,000</div>" +
"<div style="margin-top:5px " +
"font-weight:bold">times</div>"+
"</td>"+
"</tr>";
Matcher matcher = p.matcher(can);
// extract the group
if(matcher.find())
{
System.out.println(matcher.group());
}
else
System.out.println("could not find");
}
}
You have unescaped quotes inside your call to Pattern.compile.
Change:
Pattern p = Pattern.compile("<div class="dinner">(.*?)</div>");
To:
Pattern p = Pattern.compile("<div class=\"dinner\">(.*?)</div>");
Note: I just saw the same problem in your String can.
Change it to:
String can="<tr>"+
"<td class=\"summaryinfo\">"+
"<div class=\"dinner\">1,000</div>" +
"<div style=\"margin-top:5px " +
"font-weight:bold\">times</div>"+
"</td>"+
"</tr>";
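I don't know if this fixes everything, but it will at least compile now.
Also note that (.*?) means "any character, any number of repetitions, as few as possible".
On its own it could match nothing or everything; here the surrounding <div class="dinner"> and </div> literals pin it to the text up to the first </div>.
The compile error itself comes from the fact that your quotes aren't escaped.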
You should use an HTML parser to parse and process HTML - not a regular expression.
As already pointed out, you'll need to escape the double quotes inside all of your strings.
And, if you want to have "1,000" as result, you'll need to use group(1), else you'll get the complete match of the pattern.
Resulting code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
class gfile
{
public static void main(String args[]) {
// create a Pattern
Pattern p = Pattern.compile("<div class=\"dinner\">(.*?)</div>");
// create a Matcher and use the Matcher.group() method
String can="<tr>"+
"<td class=\"summaryinfo\">"+
"<div class=\"dinner\">1,000</div>" +
"<div style=\"margin-top:5px " +
"font-weight:bold\">times</div>"+
"</td>"+
"</tr>";
Matcher matcher = p.matcher(can);
if(matcher.find())
{
System.out.println(matcher.group(1));
}
else
System.out.println("could not find");
}
}
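(.*?) should stay as it is. Changing it to (.*)? would make the quantifier greedy, so the group would run up to the last </div> on the line instead of the first.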
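To check how the two forms behave on the sample HTML, a small sketch such as the following can be run. The class name DinnerGroup and the helper capture are illustrative, and the HTML is the question's string joined into one line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DinnerGroup {
    // Returns group 1 of the first match, or null if there is no match.
    static String capture(String regex, String html) {
        Matcher m = Pattern.compile(regex).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<tr><td class=\"summaryinfo\">"
                + "<div class=\"dinner\">1,000</div>"
                + "<div style=\"margin-top:5px font-weight:bold\">times</div>"
                + "</td></tr>";
        // Lazy group: stops at the first </div>.
        System.out.println(capture("<div class=\"dinner\">(.*?)</div>", html));
        // Optional greedy group: runs on to the last </div>.
        System.out.println(capture("<div class=\"dinner\">(.*)?</div>", html));
    }
}
```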