Getting some data from HTML using regex - java

I am trying to extract some data from HTML. This is my code:
public static void main(String[] args) {
final String str = "<div class=\"b-vacancy-list-salary\">\n" +
" from 50 000\n" +
" to 70 000\n" +
" USD.\n" +
" </div>";
System.out.println(Arrays.toString(getTagValues(str).toArray()));
}
static final String tag = "<div class=\"b-vacancy-list-salary\">\n";
private static final Pattern TAG_REGEX = Pattern.compile(tag+"(.+?)</div>");
private static List<String> getTagValues(final String str) {
System.out.println(tag);
final List<String> tagValues = new ArrayList<String>();
final Matcher matcher = TAG_REGEX.matcher(str);
while (matcher.find()) {
tagValues.add(matcher.group(1));
}
return tagValues;
}
It prints [] instead of the salary text. What's wrong?

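By default . does not match a line feed, so one fix is to strip the line feeds before matching (and drop the \n from the tag), as in the code below.
A more robust way to parse HTML, though, is to use a DOM parser or XPath.
For example: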
public static void main(String[] args) {
final String str = "<div class=\"b-vacancy-list-salary\">\n"
+ " from 50 000\n"
+ " to 70 000\n"
+ " USD.\n"
+ " </div>";
System.out.println(Arrays.toString(getTagValues(str).toArray()));
}
static final String tag = "<div class=\"b-vacancy-list-salary\">";
private static final Pattern TAG_REGEX = Pattern.compile(tag + "(.+?)</div>");
private static List<String> getTagValues(final String str) {
System.out.println(tag);
final List<String> tagValues = new ArrayList<String>();
final Matcher matcher = TAG_REGEX.matcher(str.replace("\n", ""));
while (matcher.find()) {
tagValues.add(matcher.group(1).trim());
}
return tagValues;
}
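
As a rough illustration of the DOM/XPath route mentioned above, the sketch below uses the JDK's built-in XML parser and XPath engine. It only works here because the sample fragment happens to be well-formed XML; real-world HTML usually is not, and would need a lenient HTML parser instead. The class and method names (SalaryXPath, extractSalary) are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original question or answers.
class SalaryXPath {

    // Assumption: the input is well-formed XML, which most real HTML is not.
    static String extractSalary(String html) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(html)));
            // For a node-set result, evaluate() returns the text of the first matching node.
            String raw = XPathFactory.newInstance().newXPath()
                    .evaluate("//div[@class='b-vacancy-list-salary']", doc);
            // Collapse line breaks and indentation into single spaces.
            return raw.trim().replaceAll("\\s+", " ");
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        String str = "<div class=\"b-vacancy-list-salary\">\n"
                + " from 50 000\n"
                + " to 70 000\n"
                + " USD.\n"
                + " </div>";
        System.out.println(extractSalary(str)); // from 50 000 to 70 000 USD.
    }
}
```

Unlike the regex versions, this does not care how the text inside the div is broken across lines.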

Instead of
private static final Pattern TAG_REGEX = Pattern.compile(tag+"(.+?)</div>");
use
private static final Pattern TAG_REGEX = Pattern.compile(tag+"([\\s\\S]+?)</div>");

Try adding Pattern.DOTALL as the second parameter of Pattern.compile. This enables the dot in the pattern to match newlines. Not sure if this quite gives you what you want, but it may help you get started.
private static final Pattern TAG_REGEX = Pattern.compile(tag + "(.+?)</div>",
Pattern.DOTALL);
See the Javadoc for Pattern.DOTALL.

By default . does not match a newline, so .* stops at the end of the line. Try this:
Pattern.compile(tag + "((.|\n)*)</div>");

You need to make the "." match newline characters. You can do this by putting "(?s)" at the front of your regular expression; so in your case, do Pattern.compile("(?s)" + tag + "(.+?)</div>");
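
To make the difference concrete, here is a small sketch that runs the same input through the default pattern and through three of the fixes suggested in the answers above. The class name DotAllDemo and the helper firstGroup are invented for this illustration, and Pattern.quote is used only so the opening tag is treated literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the fixes proposed in the answers.
class DotAllDemo {
    // The opening tag WITHOUT a trailing \n, quoted so it is matched literally.
    static final String OPEN = Pattern.quote("<div class=\"b-vacancy-list-salary\">");

    // Returns group 1 of the first match with whitespace collapsed, or null if nothing matches.
    static String firstGroup(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1).trim().replaceAll("\\s+", " ") : null;
    }

    public static void main(String[] args) {
        String html = "<div class=\"b-vacancy-list-salary\">\n"
                + " from 50 000\n to 70 000\n USD.\n </div>";

        Pattern plain = Pattern.compile(OPEN + "(.+?)</div>");                  // . stops at \n
        Pattern dotAll = Pattern.compile(OPEN + "(.+?)</div>", Pattern.DOTALL); // flag argument
        Pattern inline = Pattern.compile("(?s)" + OPEN + "(.+?)</div>");        // embedded flag
        Pattern charClass = Pattern.compile(OPEN + "([\\s\\S]+?)</div>");       // no flag needed

        System.out.println(firstGroup(plain, html));     // null
        System.out.println(firstGroup(dotAll, html));    // from 50 000 to 70 000 USD.
        System.out.println(firstGroup(inline, html));    // from 50 000 to 70 000 USD.
        System.out.println(firstGroup(charClass, html)); // from 50 000 to 70 000 USD.
    }
}
```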

Related

Converting all float values in String from scientific notation to decimal notation

So I have an XML string that looks like this:
<CONFIG><Setting1><o1>44</o1><o2>1.0E-4</o2><o3>955</o3><o4>1.5E-4</o4><o5>Surname</o5></Setting1>....</CONFIG>
How would I go about converting every float in the string from scientific notation to decimal notation?
Edit: To clarify, I'm not looking to convert only a single float value from scientific to decimal notation. The string is read from an XML file that I serialized from a POJO, so all of the float values in the string need to be converted. Sadly, the XML framework I used (SimpleXML) only represents floats in scientific notation.
UPDATE:
I tried finding the float values with a regex, and it works. "found" will be the new converted decimal. How would I go about replacing each occurrence of the found pattern with the "found" string?
public static void ScientificToDecimal(String text){
String found;
Pattern pattern = Pattern.compile("\\d+[.]\\d+E[+-]\\d");
Matcher matcher = pattern.matcher(text);
while(matcher.find()){
found = new BigDecimal(matcher.group()).toPlainString();
Log.i("Converted: ", matcher.group() + " to " + found);
}
}
UPDATE2: This works well enough for me.
public static String scientificToDecimal(String text){
String replacementText = "";
StringBuffer sb = new StringBuffer();
Pattern pattern = Pattern.compile("\\d+[.]\\d+E[+-]\\d");
Matcher matcher = pattern.matcher(text);
while(matcher.find()){
replacementText = new BigDecimal(matcher.group()).toPlainString();
matcher.appendReplacement(sb,replacementText);
Log.i("Converted: ", matcher.group() + " to " + replacementText);
}
matcher.appendTail(sb);
return sb.toString();
}
Look at patterns such as >1.0E-4< or >1.5E-4<: you can find them with a regex and then do a string replacement.
Use XmlPullParser (consult its guide) to read the double values and convert each one to plain decimal notation, or potentially use your regex.
Here is an enhancement that handles all of the following forms:
a) 1.0E-4
b) 1.0E4
c) 1.0E+4
public static String scientificToDecimal(String text){
    StringBuffer sb = new StringBuffer();
    // One pattern covers 5.0E4, 5.0E+4 and 5.0E-4: the sign is optional,
    // and \\d+ accepts exponents with more than one digit (e.g. 5.0E10).
    Pattern pattern = Pattern.compile("\\d+[.]\\d+E[-+]?\\d+");
    Matcher matcher = pattern.matcher(text);
    while(matcher.find()){
        String replacementText = new BigDecimal(matcher.group()).toPlainString();
        matcher.appendReplacement(sb, replacementText);
        // System.out.println("Converted: " + matcher.group() + " to " + replacementText);
    }
    // Copy whatever follows the last match (or the whole text if nothing matched).
    matcher.appendTail(sb);
    return sb.toString();
}
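
For comparison, the same replacement can be written more compactly on Java 9 or later, where Matcher.replaceAll accepts a function. This is only a sketch: the class name SciToPlain is invented, the sample string is a shortened variant of the XML from the question with one extra made-up value (2.5E10), and it may not be available on older Android runtimes.

```java
import java.math.BigDecimal;
import java.util.regex.Pattern;

// Hypothetical helper; requires Java 9+ for Matcher.replaceAll(Function).
class SciToPlain {
    // Digits, a dot, digits, then E with an optional sign and one or more exponent digits.
    private static final Pattern SCI = Pattern.compile("\\d+[.]\\d+E[-+]?\\d+");

    static String convert(String text) {
        // Each match is replaced by its plain (non-scientific) decimal form.
        return SCI.matcher(text)
                .replaceAll(m -> new BigDecimal(m.group()).toPlainString());
    }

    public static void main(String[] args) {
        String xml = "<o2>1.0E-4</o2><o3>955</o3><o4>1.5E-4</o4><o5>2.5E10</o5>";
        System.out.println(convert(xml));
        // <o2>0.00010</o2><o3>955</o3><o4>0.00015</o4><o5>25000000000</o5>
    }
}
```

Note that toPlainString keeps the scale of the original literal, which is why 1.0E-4 becomes 0.00010 rather than 0.0001.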

Extracting a value from a file name base on regex in Java

Suppose my file name pattern is something like %#_Report_%$_for_%&.xls, where the placeholders %# and %$ can match any characters but %& is a date.
How can I get the actual values of those placeholders from a file name in Java?
For example, if the actual file name is Genr_Report_123_for_20151105.xls, how do I get:
%# value is Genr
%$ value is 123
%& value is 20151105
You can do it like this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Rgx {
private String str1 = "", str2 = "", date = "";
public static void main(String[] args) {
String fileName = "Genr_Report_123_for_20151105.xls";
Rgx rgx = new Rgx();
rgx.extractValues(fileName);
System.out.println(rgx.str1 + " " + rgx.str2 + " " + rgx.date);
}
private void extractValues(String fileName) {
Pattern pat = Pattern.compile("([^_]+)_Report_([^_]+)_for_([\\d]+)\\.xls");
Matcher m = pat.matcher(fileName);
if (m.find()) {
str1 = m.group(1);
str2 = m.group(2);
date = m.group(3);
}
}
}
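
A variation on the answer above, offered as a sketch rather than a definitive solution: named groups make the three placeholders easier to read, matches() insists that the whole file name fits the template, and the date part is checked by actually parsing it. The names ReportName, prefix, id and date are invented here, and the pattern assumes that the first two values never contain an underscore and that the date is always eight digits in yyyyMMdd form.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for file names shaped like <prefix>_Report_<id>_for_<yyyyMMdd>.xls
class ReportName {
    private static final Pattern NAME = Pattern.compile(
            "(?<prefix>[^_]+)_Report_(?<id>[^_]+)_for_(?<date>\\d{8})\\.xls");

    // Returns {prefix, id, ISO date}, or null if the name does not fit the template.
    static String[] parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            return null;
        }
        // BASIC_ISO_DATE parses yyyyMMdd; an impossible date throws DateTimeParseException.
        LocalDate date = LocalDate.parse(m.group("date"), DateTimeFormatter.BASIC_ISO_DATE);
        return new String[] { m.group("prefix"), m.group("id"), date.toString() };
    }

    public static void main(String[] args) {
        String[] parts = parse("Genr_Report_123_for_20151105.xls");
        System.out.println(parts[0] + " " + parts[1] + " " + parts[2]); // Genr 123 2015-11-05
    }
}
```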

Split mathematical string in Java

I have this string: "23+43*435/675-23". How can I split it? The final result I want is:
String 1st=23
String 2nd=435
String 3rd=675
String 4th=23
I already used this method:
String s = "hello+pLus-minuss*multi/divide";
String[] split = s.split("\\+");
String[] split1 = s.split("\\-");
String[] split2 = s.split("\\*");
String[] split3 = s.split("\\/");
String plus = split[1];
String minus = split1[1];
String multi = split2[1];
String div = split3[1];
System.out.println(plus+"\n"+minus+"\n"+multi+"\n"+div+"\n");
But it gives me this result:
pLus-minuss*multi/divide
minuss*multi/divide
multi/divide
divide
But I need the result in this form:
pLus
minuss
multi
divide
Try this:
public static void main(String[] args) {
String s ="23+43*435/675-23";
String[] ss = s.split("[-+*/]");
for(String str: ss)
System.out.println(str);
}
Output:
23
43
435
675
23
I don't know why you want to store the values in variables before printing them, but in that case try the code below:
public static void main(String[] args) {
String s = "hello+pLus-minuss*multi/divide";
String[] ss = s.split("[-+*/]");
String first =ss[1];
String second =ss[2];
String third =ss[3];
String fourth =ss[4];
System.out.println(first+"\n"+second+"\n"+third+"\n"+fourth+"\n");
}
Output:
pLus
minuss
multi
divide
Try this out :
String data = "23+43*435/675-23";
Pattern pattern = Pattern.compile("[^\\+\\*\\/\\-]+");
Matcher matcher = pattern.matcher(data);
List<String> list = new ArrayList<String>();
while (matcher.find()) {
list.add(matcher.group());
}
for (int index = 0; index < list.size(); index++) {
System.out.println(index + " : " + list.get(index));
}
Output :
0 : 23
1 : 43
2 : 435
3 : 675
4 : 23
I think it is only an index issue: split the remainder (index 1) again at each step and take index 0 of every result.
String[] split = s.split("\\+");
String[] split1 = split[1].split("\\-");
String[] split2 = split1[1].split("\\*");
String[] split3 = split2[1].split("\\/");
String hello= split[0];//split[0]=hello,split[1]=pLus-minuss*multi/divide
String plus= split1[0];//split1[0]=pLus,split1[1]=minuss*multi/divide
String minus= split2[0];//split2[0]=minuss,split2[1]=multi/divide
String multi= split3[0];//split3[0]=multi,split3[1]=divide
String div= split3[1];
If the order of operators matters, change your code to this:
String s = "hello+pLus-minuss*multi/divide";
String[] split = s.split("\\+");
String[] split1 = split[1].split("\\-");
String[] split2 = split1[1].split("\\*");
String[] split3 = split2[1].split("\\/");
String plus = split1[0];
String minus = split2[0];
String multi = split3[0];
String div = split3[1];
System.out.println(plus + "\n" + minus + "\n" + multi + "\n" + div + "\n");
Otherwise, to split on any operator and store the parts in variables, do this:
public static void main(String[] args) {
String s = "hello+pLus-minuss*multi/divide";
String[] ss = s.split("[-+*/]");
String plus = ss[1];
String minus = ss[2];
String multi = ss[3];
String div = ss[4];
System.out.println(plus + "\n" + minus + "\n" + multi + "\n" + div + "\n");
}
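
A short sketch of why the character class in these answers is written as [-+*/]: inside square brackets +, * and / need no escaping, and a hyphen placed first is taken literally instead of starting a range. The second method goes slightly beyond what the question asks and keeps the operators as separate tokens by splitting on zero-width lookarounds. The class and method names are made up for the example.

```java
import java.util.Arrays;

// Hypothetical demo class for splitting a simple arithmetic string.
class SplitDemo {
    // Splits on any single operator; the operators themselves are discarded.
    static String[] operands(String expr) {
        return expr.split("[-+*/]");
    }

    // Splits just before and just after each operator, so operators survive as tokens.
    static String[] tokens(String expr) {
        return expr.split("(?<=[-+*/])|(?=[-+*/])");
    }

    public static void main(String[] args) {
        String s = "23+43*435/675-23";
        System.out.println(Arrays.toString(operands(s))); // [23, 43, 435, 675, 23]
        System.out.println(Arrays.toString(tokens(s)));   // [23, +, 43, *, 435, /, 675, -, 23]
    }
}
```

Neither method understands negative numbers or parentheses; both assume a flat sequence of operands separated by single operators.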

Scanner - parsing code values using delimiter regex

I'm trying to use a Scanner to read in lines of code from a string of the form "p.addPoint(x,y);"
The regex format I'm after is:
*anything*.addPoint(*spaces or nothing* OR ,*spaces or nothing*
What I've tried so far isn't working: [[.]+\\.addPoint(&&[\\s]*[,[\\s]*]]
Any ideas what I'm doing wrong?
I tested this in Python, but the regexp should be transferable to Java:
>>> regex = '(\w+\.addPoint\(\s*|\s*,\s*|\s*\)\s*)'
>>> re.split(regex, 'poly.addPoint(3, 7)')
['', 'poly.addPoint(', '3', ', ', '7', ')', '']
Your regexp seems seriously malformed. Even if it wasn't, matching infinitely many repetitions of the . wildcard character at the beginning of the string would probably result in huge swaths of text matching that aren't actually relevant/desired.
Edit: Misunderstood the original spec., current regexp should be correct.
Another way:
public class MyPattern {
private static final Pattern ADD_POINT;
static {
String varName = "[\\p{Alnum}_]++";
String argVal = "([\\p{Alnum}_\\p{Space}]++)";
String regex = "(" + varName + ")\\.addPoint\\(" +
argVal + "," +
argVal + "\\);";
ADD_POINT = Pattern.compile(regex);
System.out.println("The Pattern is: " + ADD_POINT.pattern());
}
public void findIt(String filename) throws FileNotFoundException {
Scanner s = new Scanner(new FileReader(filename));
while (s.findWithinHorizon(ADD_POINT, 0) != null) {
final MatchResult m = s.match();
System.out.println(m.group(0));
System.out.println(" arg1=" + m.group(2).trim());
System.out.println(" arg2=" + m.group(3).trim());
}
}
public static void main(String[] args) throws FileNotFoundException {
MyPattern p = new MyPattern();
final String fname = "addPoint.txt";
p.findIt(fname);
}
}
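
For the delimiter approach the question originally asks about, a possible sketch is shown below. It treats everything that is not a coordinate as one delimiter: the receiver plus ".addPoint(", and any run of whitespace, commas, closing parentheses and semicolons. Wrapping the alternatives in a repeated group matters, because two adjacent delimiter matches would otherwise produce an empty token between them. The class name, the DELIMS constant and the assumption that coordinates are plain integers are all made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper: reads the integer arguments of successive addPoint(x,y); calls.
class AddPointScanner {
    // One delimiter = any run of "name.addPoint(" or whitespace , ) ; characters.
    private static final String DELIMS = "(?:\\w+\\.addPoint\\(|[\\s,);])+";

    static List<Integer> readCoordinates(String source) {
        List<Integer> values = new ArrayList<>();
        try (Scanner scanner = new Scanner(source)) {
            scanner.useDelimiter(DELIMS);
            // Stops at the first token that is not an integer.
            while (scanner.hasNextInt()) {
                values.add(scanner.nextInt());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String code = "p.addPoint(3, 7);\nq.addPoint( 4 , 5 );";
        System.out.println(readCoordinates(code)); // [3, 7, 4, 5]
    }
}
```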

JAVA regex failing

I have a string in this format:
;1=2011-10-23T16:16:53+0530;2=2011-10-23T16:16:53+0530;3=2011-10-23T16:16:53+0530;4=2011-10-23T16:16:53+0530;
I have written the following code to extract the string 2011-10-23T16:16:53+0530 from (;1=2011-10-23T16:16:53+0530;)
Pattern pattern = Pattern.compile("(;1+)=(\\w+);");
String strFound= "";
Matcher matcher = pattern.matcher(strindData);
while (matcher.find()) {
strFound= matcher.group(2);
}
But it is not working as expected. Can you please give me any hint?
"Can you please give me any hint?" Yes: none of -, : and + is matched by \w, which covers only letters, digits and the underscore, so (\\w+) cannot match the whole timestamp.
Do you have to use a regex? Why not call String.split() to break up the string on semicolon boundaries, then call it again to break up each chunk at the equals sign? At that point you'll have an integer and the date in string form. From there you can parse the date string.
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
public final class DateScan {
private static final String INPUT = ";1=2011-10-23T16:16:53+0530;2=2011-10-23T16:16:53+0530;3=2011-10-23T16:16:53+0530;4=2011-10-23T16:16:53+0530;";
public static void main(final String... args) {
final SimpleDateFormat parser = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssZ");
final String[] pairs = INPUT.split(";");
for (final String pair : pairs) {
if ("".equals(pair)) {
continue;
}
final String[] integerAndDate = pair.split("=");
final Integer integer = Integer.parseInt(integerAndDate[0]);
final String dateString = integerAndDate[1];
try {
final Date date = parser.parse(dateString);
System.out.println(integer + " -> " + date);
} catch (final ParseException pe) {
System.err.println("bad date: " + dateString + ": " + pe);
}
}
}
}
I've changed the input a bit, but only for presentation reasons.
You can try this:
String input = " ;1=2011-10-23T16:16:53+0530; 2=2011-10-23T16:17:53+0530;3=2011-10-23T16:18:53+0530;4=2011-10-23T16:19:53+0530;";
Pattern p = Pattern.compile("(;\\d+?)?=(.+?);");
Matcher m = p.matcher(input);
while(m.find()){
System.out.println(m.group(2));
}
