Java - Regex for the given string

I have the following html code segment:
<br>
Date: 2010-06-20, 1:37AM PDT<br>
<br>
Daddy: www.google.com
<br>
I want to extract
Date: 2010-06-20, 1:37AM PDT
and
Daddy: www.google.com
with the help of java regex.
So what regex should I use?

This should give you a nice starting point:
String text =
" <br>\n" +
" Date: 2010-06-20, 1:37AM PDT<br> \n" +
" <br> \n" +
"Daddy: www.google.com \n" +
"<br>";
String[] parts = text.split("(?:\\s*<br>\\s*)+");
for (String part : parts) {
System.out.println("[" + part + "]");
}
This prints (as seen on ideone.com):
[]
[Date: 2010-06-20, 1:37AM PDT]
[Daddy: www.google.com]
This uses String[] String.split(String regex). The regex pattern is "one or more of <br>, with preceding or trailing whitespace". The leading empty string appears because the text starts with a delimiter.
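To drop the leading empty string with the JDK alone, one option is to stream the pieces and filter out the empty ones. The following is a minimal sketch, assuming Java 8 or later; the class and method names (BrSplitter, splitOnBr) are made up for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class BrSplitter {
    // Same delimiter as above: one or more <br>, each with optional surrounding whitespace.
    private static final Pattern BR = Pattern.compile("(?:\\s*<br>\\s*)+");

    // Hypothetical helper: splits on runs of <br> and discards empty pieces.
    static List<String> splitOnBr(String text) {
        return BR.splitAsStream(text)
                .filter(part -> !part.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String text =
                " <br>\n" +
                " Date: 2010-06-20, 1:37AM PDT<br> \n" +
                " <br> \n" +
                "Daddy: www.google.com \n" +
                "<br>";
        for (String part : splitOnBr(text)) {
            System.out.println("[" + part + "]");
        }
    }
}
```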
Guava alternative
You can also use Splitter from Guava. It's actually a lot more readable, and can omitEmptyStrings().
Splitter splitter = Splitter.on("<br>").trimResults().omitEmptyStrings();
for (String part : splitter.split(text)) {
System.out.println("[" + part + "]");
}
This prints:
[Date: 2010-06-20, 1:37AM PDT]
[Daddy: www.google.com]

Related

Parse html content for a value

I receive a Http response after a call as Html String and I would like to scrape certain value stored inside the ReportViewer1 variable.
<html>
....................
...........
<script type="text/javascript">
var ReportViewer1 = new ReportViewer('ReportViewer1', 'ReportViewer1_ReportToolbar', 'ReportViewer1_ReportArea_WaitControl', 'ReportViewer1_ReportArea_ReportCell', 'ReportViewer1_ReportArea_PreviewFrame', 'ReportViewer1_ParametersAreaCell', 'ReportViewer1_ReportArea_ErrorControl', 'ReportViewer1_ReportArea_ErrorLabel', 'ReportViewer1_CP', '/app/Telerik.ReportViewer.axd', 'a90a0d41efa6429eadfefa42fc529de1', 'Percent', '100', '', 'ReportViewer1_EditorPlaceholder', 'ReportViewer1_CalendarFrame', 'ReportViewer1_ReportArea_DocumentMapCell', {
CurrentPageToolTip: 'STR_TELERIK_MSG_CUR_PAGE_TOOL_TIP',
ExportButtonText: 'Export',
ExportToolTip: 'Export',
ExportSelectFormatText: 'Export to the selected format',
FirstPageToolTip: 'First page',
LabelOf: 'of',
LastPageToolTip: 'Last Page',
ProcessingReportMessage: 'Generating report...',
NoPageToDisplay: 'No page to display.',
NextPageToolTip: 'Next page',
ParametersToolTip: 'Click to close parameters area|Click to open parameters area',
DocumentMapToolTip: 'Hide document map|Show document map',
PreviousPageToolTip: 'Previous page',
TogglePageLayoutToolTip: 'Switch to interactive view|Switch to print preview',
SessionHasExpiredError: 'Session has expired.',
SessionHasExpiredMessage: 'Please, refresh the page.',
PrintToolTip: 'Print',
RefreshToolTip: 'Refresh',
NavigateBackToolTip: 'Navigate back',
NavigateForwardToolTip: 'Navigate forward',
ReportParametersSelectAllText: '<select all>',
ReportParametersSelectAValueText: '<select a value>',
ReportParametersInvalidValueText: 'Invalid value.',
ReportParametersNoValueText: 'Value required.',
ReportParametersNullText: 'NULL',
ReportParametersPreviewButtonText: 'Preview',
ReportParametersFalseValueLabel: 'False',
ReportParametersInputDataError: 'Missing or invalid parameter value. Please input valid data for all parameters.',
ReportParametersTrueValueLabel: 'True',
MissingReportSource: 'The source of the report definition has not been specified.',
ZoomToPageWidth: 'Page Width',
ZoomToWholePage: 'Full Page'
}, 'ReportViewer1_ReportArea_ReportArea', 'ReportViewer1_ReportArea_SplitterCell', 'ReportViewer1_ReportArea_DocumentMapCell', true, true, 'PDF', 'ReportViewer1_RSID', true);
</script>
...................
...................
</html>
The value is a90a0d41efa6429eadfefa42fc529de1 and this is in the middle of this content:
'/app/Telerik.ReportViewer.axd', 'a90a0d41efa6429eadfefa42fc529de1', 'Percent', '100',
What's the best way to parse this value using Java?
Parse the HTML with the String class:
public class HtmlParser {
public static void main(String args[]){
String result = getValuesProp(html);
System.out.println("Result: "+ result);
}
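static String PIVOT = "Telerik.ReportViewer.axd";
public static String getValuesProp(String json) {
// Locate the pivot; the wanted value is the next quoted argument after it.
int i = json.indexOf(PIVOT);
if (i < 0) {
return null; // pivot not found
}
// Skip the pivot itself plus the "'," that closes its argument.
i += PIVOT.length() + 2;
String subString = json.substring(i);
// Move just past the opening quote of the value.
subString = subString.substring(subString.indexOf("'") + 1);
// Cut at the closing quote.
return subString.substring(0, subString.indexOf("'"));
}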
static String html ="<html>\n" +
"\n" +
"<script type=\"text/javascript\">\n" +
" var ReportViewer1 = new ReportViewer('ReportViewer1', 'ReportViewer1_ReportToolbar', 'ReportViewer1_ReportArea_WaitControl', 'ReportViewer1_ReportArea_ReportCell', 'ReportViewer1_ReportArea_PreviewFrame', 'ReportViewer1_ParametersAreaCell', 'ReportViewer1_ReportArea_ErrorControl', 'ReportViewer1_ReportArea_ErrorLabel', 'ReportViewer1_CP', '/app/Telerik.ReportViewer.axd', 'a90a0d41efa6429eadfefa42fc529de1', 'Percent', '100', '', 'ReportViewer1_EditorPlaceholder', 'ReportViewer1_CalendarFrame', 'ReportViewer1_ReportArea_DocumentMapCell', {\n" +
" CurrentPageToolTip: 'STR_TELERIK_MSG_CUR_PAGE_TOOL_TIP',\n" +
" ExportButtonText: 'Export',\n" +
" ExportToolTip: 'Export',\n" +
" ExportSelectFormatText: 'Export to the selected format',\n" +
" FirstPageToolTip: 'First page',\n" +
" LabelOf: 'of',\n" +
" LastPageToolTip: 'Last Page',\n" +
" ProcessingReportMessage: 'Generating report...',\n" +
" NoPageToDisplay: 'No page to display.',\n" +
" NextPageToolTip: 'Next page',\n" +
" ParametersToolTip: 'Click to close parameters area|Click to open parameters area',\n" +
" DocumentMapToolTip: 'Hide document map|Show document map',\n" +
" PreviousPageToolTip: 'Previous page',\n" +
" TogglePageLayoutToolTip: 'Switch to interactive view|Switch to print preview',\n" +
" SessionHasExpiredError: 'Session has expired.',\n" +
" SessionHasExpiredMessage: 'Please, refresh the page.',\n" +
" PrintToolTip: 'Print',\n" +
" RefreshToolTip: 'Refresh',\n" +
" NavigateBackToolTip: 'Navigate back',\n" +
" NavigateForwardToolTip: 'Navigate forward',\n" +
" ReportParametersSelectAllText: '<select all>',\n" +
" ReportParametersSelectAValueText: '<select a value>',\n" +
" ReportParametersInvalidValueText: 'Invalid value.',\n" +
" ReportParametersNoValueText: 'Value required.',\n" +
" ReportParametersNullText: 'NULL',\n" +
" ReportParametersPreviewButtonText: 'Preview',\n" +
" ReportParametersFalseValueLabel: 'False',\n" +
" ReportParametersInputDataError: 'Missing or invalid parameter value. Please input valid data for all parameters.',\n" +
" ReportParametersTrueValueLabel: 'True',\n" +
" MissingReportSource: 'The source of the report definition has not been specified.',\n" +
" ZoomToPageWidth: 'Page Width',\n" +
" ZoomToWholePage: 'Full Page'\n" +
" }, 'ReportViewer1_ReportArea_ReportArea', 'ReportViewer1_ReportArea_SplitterCell', 'ReportViewer1_ReportArea_DocumentMapCell', true, true, 'PDF', 'ReportViewer1_RSID', true);\n" +
" </script>\n" +
"\n" +
"</html>";
}
I would read the text a line at a time, the way most files are read. Because the format will always be the same, you look for a line that begins with the characters "var ReportViewer1". Then you know you have found the line you want. You may need to strip some leading whitespace first, although it will presumably always be formatted with the same whitespace too (up to you, really).
When you have the line, use the String split() method to split that line into an array. There are nice delimiters there to split on: "," or " " or ", ". Again, see what works best for you.
Test the split-up line parts for '/app/Telerik.ReportViewer.axd'; the next member of your split array will be the value you are looking for.
Again, the formatting should always be the same, so you can rely on that to find your variable. Of course, study the HTML text to make sure it does always follow the same format within the line you are investigating, but looking at it, I assume it probably does.
In short: find your line, split it on a delimiter, and use some logic to find the element you are after in the split-up line parts.
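The steps above could be sketched roughly as follows. This is only an illustration under the assumption that the markup keeps the format shown in the question; the class name, the method name and the shortened sample line are made up, and splitting on "\\R" assumes Java 8 or later.

```java
import java.util.Optional;

final class ReportViewerIdExtractor {
    // The argument that directly precedes the wanted value, quotes included.
    private static final String MARKER = "'/app/Telerik.ReportViewer.axd'";

    // Hypothetical helper: finds the "var ReportViewer1" line and returns the argument after MARKER.
    static Optional<String> extractId(String html) {
        for (String line : html.split("\\R")) {            // one line at a time
            String trimmed = line.trim();
            if (!trimmed.startsWith("var ReportViewer1")) {
                continue;                                   // not the line we are after
            }
            String[] parts = trimmed.split(",\\s*");       // split on commas plus optional whitespace
            for (int i = 0; i < parts.length - 1; i++) {
                if (parts[i].trim().equals(MARKER)) {
                    // The next member holds the value; drop its surrounding quotes.
                    return Optional.of(parts[i + 1].trim().replace("'", ""));
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Shortened stand-in for the real response body.
        String html = "<html>\n<script type=\"text/javascript\">\n"
                + " var ReportViewer1 = new ReportViewer('ReportViewer1', 'ReportViewer1_CP', "
                + "'/app/Telerik.ReportViewer.axd', 'a90a0d41efa6429eadfefa42fc529de1', 'Percent', '100', {\n"
                + " ExportButtonText: 'Export'\n"
                + " }, true);\n</script>\n</html>";
        System.out.println(extractId(html).orElse("not found"));
    }
}
```

Returning an Optional rather than null is just one way to signal that the line or the marker was not found.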

Read specific line out of String

I have a String that looks like this:
String meta = "1 \n"
+ "Herst \n"
+ "01 Jan 2019 – 31 Dec 2020 \n"
+ "01 Jan 2020 \n"
+ "CONFIG \n"
+ "XML \n"
+ "AES \n"
+ "RSA \n"
+ "256 \n"
+ "16 \n"
+ "128 \n";
What is the smartest way if I want to read a specific line out of this String in Java?
For example, I need in another part of my code the number of the second last line (in this case it's 16). How can I read this number out of the String?
If it's already in String form, just split it into lines using \n as a delimiter to get an array of lines:
String[] lines = meta.split("\n");
Then you can easily get a specific line. For instance, System.out.println(lines[9]) will print 16.
If you need the 16 in the form of an int, you'd need to remove the whitespace around it and parse it:
int parsed = Integer.parseInt(lines[9].trim());
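If the number of lines can vary, the second-to-last line can also be addressed relative to the end of the array instead of by the fixed index 9. A small sketch, with a hypothetical helper name (lineFromEnd) and a shortened sample string; it relies on split("\n") discarding trailing empty strings, so the final "\n" does not add an extra element:

```java
final class LineReader {
    // Hypothetical helper: returns the trimmed line that sits offsetFromEnd lines before the end
    // (1 = last line, 2 = second-to-last line, ...).
    static String lineFromEnd(String text, int offsetFromEnd) {
        String[] lines = text.split("\n");
        return lines[lines.length - offsetFromEnd].trim();
    }

    public static void main(String[] args) {
        String meta = "CONFIG \n"
                + "XML \n"
                + "256 \n"
                + "16 \n"
                + "128 \n";
        int parsed = Integer.parseInt(lineFromEnd(meta, 2));
        System.out.println(parsed);
    }
}
```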

Regex match till the end of text in Java

I want to fetch all the email addresses from the From field using a regex, that is, get every line of text that starts with "From:" and ends with a "\n" newline.
Here is the complete text on which I want to apply this regex:
Sent: Tue Mar 05 15:42:11 IST 2019
From: xtest#xyz.co.in
To: akm#xyz.com
Subject: Re: Foausrnisfseur invadlide (030000000000:3143)
Message:
----------------------------
Sent: Tue Mar 05 15:40:51 IST 2019
From: ytest#xyz.com
To: bpcla#xpanxion.com
Subject: Foausrnisfseur invadlide (O4562000888885456:3143)
Message:
This is not right please correct
Termes de paiement Foausrnisfseur non spécifiés
impact potentiel: 3 000,00
You should write From field with abc#xyz.com
and not From: field with abc#xyz.com in the column
Date détecté: 2019-02-26 12:55:03
---- Please do not delete or modify this line. (2423000000000149:3143) ----
-------------------------
Sent: Tue Mar 05 15:40:51 IST 2019
From: ytest#xyz.co.in
To: bpcla#xpanxion.com
Subject: Foausrnisfseur invadlide (O4562000888885456:3143)
I have tried the following patterns, but they did not work:
[^.?!]*(?<=[.?\s!])string(?:(?=[\s.?!])[^.?!]*(?:[.?!].*)?)?$
/^([\w\s\.]*)string([\w\s\.]*)$/
"^\\w*\\s*((?m)Name.*$)"
The desired result from the above text is:
xtest#xyz.co.in,
ytest#xyz.com,
ytest#xyz.co.in,
PS: I want a regex that works in Java.
Try this pattern: ^From:\s*(\S+)$
It first matches the beginning of a line with ^, then matches From: literally, then zero or more whitespace characters with \s*, then one or more non-whitespace characters, which are stored in a capturing group; $ matches the end of a line. In Java, ^ and $ match at line boundaries only when the MULTILINE flag (or the inline (?m)) is set.
To get the e-mail address, just use the value of the first capturing group.
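In Java, this pattern might be applied as in the sketch below. The class and method names are hypothetical, and the sample text is a shortened version of the one in the question with real line breaks between the lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FromLineExtractor {
    // MULTILINE makes ^ and $ match at the start and end of every line, not just of the whole input.
    private static final Pattern FROM_LINE =
            Pattern.compile("^From:\\s*(\\S+)$", Pattern.MULTILINE);

    // Hypothetical helper: collects group 1 of every "From:" line.
    static List<String> extractSenders(String text) {
        List<String> senders = new ArrayList<>();
        Matcher m = FROM_LINE.matcher(text);
        while (m.find()) {
            senders.add(m.group(1));
        }
        return senders;
    }

    public static void main(String[] args) {
        String text = "Sent: Tue Mar 05 15:42:11 IST 2019\n"
                + "From: xtest#xyz.co.in\n"
                + "To: akm#xyz.com\n"
                + "Message:\n"
                + "and not From: field with abc#xyz.com in the column\n"
                + "Sent: Tue Mar 05 15:40:51 IST 2019\n"
                + "From: ytest#xyz.com\n";
        System.out.println(extractSenders(text));
    }
}
```

Because of the ^ anchor, the "From:" that appears in the middle of the message body is not picked up.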
Demo
String test = " Sent: Tue Mar 05 15:42:11 IST 2019 "
+ " From: xtest#xyz.co.in "
+ " To: akm#xyz.com "
+ " Subject: Re: Foausrnisfseur invadlide (030000000000:3143) "
+ " Message: "
+ " "
+ " "
+ " ---------------------------- "
+ " "
+ " Sent: Tue Mar 05 15:40:51 IST 2019 "
+ " From: ytest#xyz.com "
+ " To: bpcla#xpanxion.com "
+ " Subject: Foausrnisfseur invadlide (O4562000888885456:3143) "
+ " Message: "
+ " This is not right please correct "
+ " Termes de paiement Foausrnisfseur non spécifiés "
+ " impact potentiel: 3 000,00 "
+ " You should write From field with abc#xyz.com "
+ " and not From: field with abc#xyz.com in the column "
+ " Date détecté: 2019-02-26 12:55:03 "
+ " "
+ " "
+ " ---- Please do not delete or modify this line. (2423000000000149:3143) ---- "
+ " " + " ------------------------- "
+ " Sent: Tue Mar 05 15:40:51 IST 2019 " + " From: ytest#xyz.co.in "
+ " To: bpcla#xpanxion.com "
+ " Subject: Foausrnisfseur invadlide (O4562000888885456:3143) ";
String emailRegex = "[a-zA-Z0-9._%+-]+#[A-Za-z0-9.-]+\\.[a-zA-Z]{2,6}";
// "From\\:\\s" matches the literal "From:" followed by one whitespace character; the parentheses capture the e-mail address.
// Using (.*\n) as the group would also work, but is not recommended.
Pattern pattern = Pattern.compile("From\\:\\s(" + emailRegex + ")");
Matcher match = pattern.matcher(test);
while (match.find()) {
System.out.println(match.group(1));
}
Output:
xtest#xyz.co.in
ytest#xyz.com
ytest#xyz.co.in
Use this regular expression for your case:
From:\s+([\w-]+#([\w-]+\.)+[\w-]+)
I have tried this regular expression with https://www.freeformatter.com/java-regex-tester.html#ad-output and it is matching what you require.
Your required match is in capture Group 1.
Working Demo: https://regex101.com/r/dGaPbD/4
String emailRegex = "[^\\s]+"; // Replace with a stricter e-mail pattern if needed
Matcher m = Pattern.compile("(?m)^From:\\s*(" + emailRegex + ")\\s*$").matcher(yourString);
List<String> allMatches = new ArrayList<String>();
while (m.find()) {
    allMatches.add(m.group(1)); // collect each address
    System.out.println(m.group(1));
}

What Regular Expression Will Get Last Price Listed On Receipt?

I have the following expression:
(?!\d+\s+TOTAL\s+)\$+\d+\.?\d+\s+
It produces the result "$23.00$0.03$23.80" from the following text:
SPEEDWAY 3007906
Wallace NC 28466
TRAM: 1086244
9/17/2017 2:12 pm
Pump 08
Regular Unleaded
8,716 # $2,639/6131
GAS TOTAL $23.00
TAX $0.03
TOTAL $23.80
Uisa
What regular expression will pull just $23.80 in this case? If I add positive lookahead, so that the expression is "(?!\d+\s+TOTAL\s+)\$+\d+\.?\d+\s+(?=.*\$\d+\.?\d+)", the result is "$23.00$0.03" and not "$23.80".
Please help. Thanks in advance.
Try this:
(?<=^TOTAL)\s*(\$\s*\d+\.?\d*)\s*$
Make sure you compile it with the MULTILINE flag, so that ^ and $ match at line boundaries.
The full match includes the whitespace around the value, so use capturing group 1 (or strip the whitespace) to get just the value.
Example:
String in = "SPEEDWAY 3007906\n" +
"Wallace NC 28466 \n" +
"TRAM: 1086244 \n" +
"9/17/2017 2:12 pm \n" +
"Pump 08 \n" +
"Regular Unleaded \n" +
"8,716 # $2,639/6131 \n" +
"GAS TOTAL $23.00\n" +
"TAX $0.03 \n" +
"TOTAL $23.80\n" +
"Uisa ";
Pattern p = Pattern.compile("(?<=^TOTAL)\\s*(\\$\\s*\\d+\\.?\\d*)\\s*$", Pattern.MULTILINE);
Matcher m = p.matcher(in);
if(m.find()) {
System.out.println(m.group(1));
}
This should print just the captured value, $23.80.
Maybe you could use a negative lookbehind to assert that what is before TOTAL is not GAS and capture your value in group 1.
(?<!GAS )TOTAL\s*(\$\d+\.\d+)
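A minimal Java sketch of this lookbehind variant follows. The class and method names are hypothetical, and the receipt is shortened to its last lines; the lookbehind rejects a TOTAL that is immediately preceded by "GAS " (with exactly one space).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReceiptTotal {
    // TOTAL not directly preceded by "GAS ", followed by a dollar amount captured in group 1.
    private static final Pattern TOTAL =
            Pattern.compile("(?<!GAS )TOTAL\\s*(\\$\\d+\\.\\d+)");

    // Hypothetical helper: returns the first amount that follows a plain TOTAL, if any.
    static Optional<String> findTotal(String receipt) {
        Matcher m = TOTAL.matcher(receipt);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String receipt = "Regular Unleaded \n"
                + "GAS TOTAL $23.00\n"
                + "TAX $0.03 \n"
                + "TOTAL $23.80\n"
                + "Uisa ";
        System.out.println(findTotal(receipt).orElse("no total found"));
    }
}
```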

Digits are getting deleted when splitting a string

I have a string from which I need to remove all mentioned punctuations and spaces. My code looks as follows:
String s = "s[film] fever(normal) curse;";
String[] spart = s.split("[,/?:;\\[\\]\"{}()\\-_+*=|<>!`~##$%^&\\s+]");
System.out.println("spart[0]: " + spart[0]);
System.out.println("spart[1]: " + spart[1]);
System.out.println("spart[2]: " + spart[2]);
System.out.println("spart[3]: " + spart[3]);
But, I am getting some elements which are blank. The output is:
spart[0]: s
spart[1]: film
spart[2]:
spart[3]: normal
- is a special character in regex character classes. For instance, [a-z] matches all chars from a to z inclusive. Note that you've got )-_ in your regex.
- defines a range inside a character class in the regular expression passed to String.split, so it needs to be escaped:
String[] part = line.toLowerCase().split("[,/?:;\"{}()\\-_+*=|<>!`~##$%^&]");
String[] spart = s.split("[,/?:;\\[\\]\"{}()\\-_+*=|<>!`~##$%^&\\s]+");
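The last line differs from the code in the question only in where the + sits, and that is what removes the blank elements. A small sketch to compare the two (the class and constant names are made up for illustration):

```java
import java.util.Arrays;

final class PunctuationSplit {
    // '+' inside the brackets is a literal plus sign: every delimiter character splits on its own,
    // so two delimiters in a row (such as "] ") leave an empty string between them.
    static final String SINGLE = "[,/?:;\\[\\]\"{}()\\-_+*=|<>!`~##$%^&\\s+]";

    // '+' after the brackets is a quantifier: a whole run of delimiter characters is one split point.
    static final String RUN = "[,/?:;\\[\\]\"{}()\\-_+*=|<>!`~##$%^&\\s]+";

    public static void main(String[] args) {
        String s = "s[film] fever(normal) curse;";
        System.out.println(Arrays.toString(s.split(SINGLE)));
        System.out.println(Arrays.toString(s.split(RUN)));
    }
}
```

With the sample string, the first call yields seven elements, two of them empty, while the second yields s, film, fever, normal and curse.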
