Capture multiple groups in regex - java

I want to parse some CSS in Java.
It doesn't have to be perfect and should only capture a specific style class.
Let's assume the CSS looks something like this:
.someunimportantclass .txt-value .input-suffix {
margin-left: 4px;
}
/* Table Columns */
body.thisone table .column-bic {
min-width: 70px;
}
body.thisone table .column-char35,
body.thisone table .column-somethingdifferent,
body.thisone table .column-somethingdifferent2,
body.thisone table .column-closebutstilldifferent {
{
min-width: 245px;
}
body.thisone table .column-code {
min-width: 25px;
text-align: center;
}
My approach with regex only partially works. Right now I have:
body.thisone table \.([a-z]*-[\w]*) \{[\s]*(.*)\: ([\w]*);
This captures all the single-line classes. It doesn't work with multiple classes that share the same attribute(s), or with classes that have several attributes. I experimented a little with the group quantifiers (like + and ?) but couldn't really figure out how to do it.
Another problem I haven't really thought about is how to map those groups into Java Objects. With just one attribute to one class it is as easy as
for (int i = 1; i <= matcher.groupCount(); i += 3) {
classes.add(matcher.group(i));
attributes.put(matcher.group(i + 1), matcher.group(i + 2));
}
with classes as List<String> and attributes as Map<String, String>.
But off the top of my head I cannot come up with a way to do it with several classes and/or attributes.

{
String foo = ".someunimportantclass .txt-value .input-suffix {\n" +
" margin-left: 4px;\n" +
"}\n" +
"\n" +
"/* Table Columns */\n" +
"\n" +
"body.thisone table .column-bic {\n" +
" min-width: 70px;\n" +
"}\n" +
"\n" +
"body.thisone table .column-char35,\n" +
"body.thisone table .column-somethingdifferent,\n" +
"body.thisone table .column-somethingdifferent2,\n" +
"body.thisone table .column-closebutstilldifferent {\n" +
"{\n" +
" min-width: 245px;\n" +
"}\n" +
"\n" +
"body.thisone table .column-code {\n" +
" min-width: 25px;\n" +
" text-align: center;\n" +
"}";
String key = "body.thisone table";
If I understand the data structure you expect correctly, it would be similar to:
HashMap<String, HashMap<String, String>> matchingClasses = new HashMap<>();
The pattern to find a CSS Class name with similar structure would be:
// Pattern pattern = Pattern.compile(key + "\\s\\.([a-z]*-[\\w]*)(?:,[^{]+)?\\s*");
And then we can capture its contents with a lookahead, so that colliding classes can be re-matched as well:
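// Pattern pattern = Pattern.compile(key + "\\s\\.([a-z]*-[\\w]*)(?=(?:,[^{]+)?\\s*" +
// "\\{\\s*(.*?)\\s*\\})");
Since the CSS class contents are multi-line we have to compile this with DOTALL. The literal braces must be escaped, because Java rejects an unescaped { that does not start a quantifier, and the closing \\} gives the lazy (.*?) something to stop at.
Pattern pattern = Pattern.compile(key + "\\s\\.([a-z]*-[\\w]*)(?=(?:,[^{]+)?\\s*" +
"\\{\\s*(.*?)\\s*\\})", Pattern.DOTALL);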
From there we can match with the regex, after compiling another pattern to break down the CSS class contents:
Pattern content = Pattern.compile("([\\w-]+)\\s*:\\s*([^;]+);");
Matcher matcher = pattern.matcher(foo);
while (matcher.find()) {
// matcher.group(1); // This is the class name.
// matcher.group(2); // This is the class contents.
We can get the attribute value pairs like this:
HashMap<String, String> attributes = new HashMap<>();
Matcher contents = content.matcher(matcher.group(2));
while (contents.find())
attributes.put(contents.group(1), contents.group(2));
And then add it into our matchingClasses hashmap.
if (! attributes.isEmpty())
matchingClasses.put(matcher.group(1), attributes);
}
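Putting the pieces above together, here is a minimal self-contained sketch. The names CssClassExtractor and extract are made up for illustration. It assumes the literal braces in the pattern are escaped and that the rule body is closed with \\}, and it adds Pattern.quote so the '.' in the key is matched literally. It is not a real CSS parser: nested braces or braces inside comments will confuse it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CssClassExtractor {

    // Matches one "name: value;" declaration inside a rule body.
    private static final Pattern DECLARATION =
            Pattern.compile("([\\w-]+)\\s*:\\s*([^;]+);");

    /** Maps each class name that follows selectorPrefix to its declarations. */
    static Map<String, Map<String, String>> extract(String css, String selectorPrefix) {
        Pattern rule = Pattern.compile(
                Pattern.quote(selectorPrefix)          // treat "body.thisone table" literally
                        + "\\s\\.([a-z]*-\\w*)"        // group 1: class name
                        + "(?=(?:,[^{]+)?\\s*"         // look ahead past sibling selectors
                        + "\\{\\s*(.*?)\\s*\\})",      // group 2: rule body
                Pattern.DOTALL);

        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        Matcher ruleMatcher = rule.matcher(css);
        while (ruleMatcher.find()) {
            Map<String, String> attributes = new LinkedHashMap<>();
            Matcher declaration = DECLARATION.matcher(ruleMatcher.group(2));
            while (declaration.find()) {
                attributes.put(declaration.group(1), declaration.group(2).trim());
            }
            if (!attributes.isEmpty()) {
                result.put(ruleMatcher.group(1), attributes);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String css = "body.thisone table .column-a,\n"
                + "body.thisone table .column-b {\n"
                + "  min-width: 245px;\n"
                + "}\n"
                + "body.thisone table .column-code {\n"
                + "  min-width: 25px;\n"
                + "  text-align: center;\n"
                + "}";
        System.out.println(extract(css, "body.thisone table"));
    }
}
```

Because the rule body sits inside a lookahead, each selector in a comma-separated list is found on its own and gets the same set of attributes.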

Related

Parsing a specific text value with JSoup

Hey, does anyone know how to parse the "Light rain", " 7°C", and "Limited"? These are stored as #text, so that's kind of throwing me off. For reference, to parse "Temperature:", it would be Element element5 = doc.select("strong").get(3);
Thanks!
The nodes from your example are called text nodes. In Jsoup, you can read the text of an element by using the text() method. So given your example, we'd select the td element and then use text() to get its text value.
However, this would also include the text from any child elements, so in your case this would produce Weather: Light rain as a single string. Fortunately, Jsoup also has an ownText() method that only extracts the text nodes that are direct children of the element (and not those of its descendants). So given your example code, you could write it like this:
Element element5 = doc.select("td").get(3);
String value = element5.ownText();
There are various ways to extract the required text; one of them is td.childNode(1).toString(). A complete solution is shown below:
public static void main(String[] args) {
// Parse HTML String using JSoup library
String HTMLSTring = "<html>\n" +
" <head></head>\n" +
" <body>\n" +
" <table class=\"table\"> \n" +
" <tbody>\n" +
" <tr> \n" +
" <td><strong>Weather: </strong>Light Rain</td> \n" +
" </tr> \n" +
" <tr> \n" +
" <td><strong>Tempratue: </strong>70 C</td> \n" +
" </tr> \n" +
" <tr> \n" +
" <td><strong>Visibility: </strong>Limited</td> \n" +
" </tr> \n" +
" <tr> \n" +
" <td><strong>Runs open: </strong>0</td> \n" +
" </tr>\n" +
" </tbody>\n" +
" </table>\n" +
" </body>\n" +
"</html>"
+ "<head></head>";
Document html = Jsoup.parse(HTMLSTring);
Elements tds = html.getElementsByTag("td");
for (Element td : tds) {
//String tdStrongText = td.childNode(0).childNodes().get(0).toString();
String tdStrongText = td.select("strong").text();
System.out.print(tdStrongText + " : ");
String tdText = td.childNode(1).toString();
System.out.println(tdText);
}
}

How to get the values stored in a span class in selenium

I have a span like the one in the attached picture. I want to fetch all three values, i.e. 0.413%, 0.012%, and --.
When I navigate to this span and get its text, all three values are stored in one string, but I want them one by one.
The '--' can be anywhere. How can I fetch these values?
<span class="text-light ng-binding" ng-show="calculatorStatus == 'COMPLETED'" style="font-size: 0.85em;">
0.413%
<br/>
0.012%
<br/>
--
</span>
Actual: 0.413% \n 0.012% \n --
Expected: 0.413%, 0.012%, --
Looks like homework. Hmmm. OK. I keep this in my kitbag:
public String getTextFromElementsTextNodes(WebDriver webDriver, WebElement element) throws IllegalArgumentException {
String text = "";
if (webDriver instanceof JavascriptExecutor) {
text = (String)((JavascriptExecutor) webDriver).executeScript(
"var nodes = arguments[0].childNodes;" +
"var text = '';" +
"for (var i = 0; i < nodes.length; i++) {" +
" if (nodes[i].nodeType == Node.TEXT_NODE) {" +
" text += nodes[i].textContent;" +
" }" +
"}" +
"return text;"
, element);
} else {
throw new IllegalArgumentException("driver is not an instance of JavascriptExecutor");
}
return text;
}
It returns all characters, including non-ASCII line breaks. I usually just want the text, so I add this:
getTextFromElementsTextNodes(driver, anElement).replaceAll("[^\\x00-\\x7F]", " ");
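To get the values one by one, as the question asks, the returned string can be split on line breaks. This is only a sketch: SpanValueSplitter and splitValues are hypothetical names, and it assumes the text arrives with the values separated by line breaks, as in the "Actual" output shown in the question.

```java
import java.util.ArrayList;
import java.util.List;

class SpanValueSplitter {

    /** Splits raw element text on line breaks and drops blank entries. */
    static List<String> splitValues(String rawText) {
        List<String> values = new ArrayList<>();
        // \R matches any line-break sequence (\n, \r\n, and Unicode line separators).
        for (String line : rawText.split("\\R")) {
            String value = line.trim();
            if (!value.isEmpty()) {
                values.add(value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String rawText = "  0.413%\n \n 0.012%\n\n  --\n";
        System.out.println(String.join(", ", splitValues(rawText)));
    }
}
```

Since blank lines are skipped rather than relying on positions, it does not matter where the '--' appears.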

Unable to Select option from dropdown using JavascriptExecutor

Can anyone provide me a failsafe(ish) method for selecting text from dropdowns on this page I am practicing on?
https://www.club18-30.com/club18-30
Specifically, the 'from' and 'to' airport dropdowns. I am using the following code:
public void selectWhereFrom(String query, String whereFromSelect) throws InterruptedException {
WebElement dropDownContainer = driver.findElement(By.xpath(departureAirportLocator));
dropDownContainer.click();
selectOption(query,whereFromSelect);
}
public void selectOption(String query, String option) {
String script =
"function selectOption(s) {\r\n" +
" var sel = document.querySelector(' " + query + "');\r\n" +
" for (var i = 0; i < sel.options.length; i++)\r\n" +
" {\r\n" +
" if (sel.options[i].text.indexOf(s) > -1)\r\n" +
" {\r\n" +
" sel.options[i].selected = true;\r\n" +
" break;\r\n" +
" }\r\n" +
" }\r\n" +
"}\r\n" +
"return selectOption('" + option + "');";
javaScriptExecutor(script);
}
This seems to successfully populate the box with text but when I hit 'Search' I then receive a message saying I need to select an option, suggesting it has not registered the selection?
I would rather avoid JavaScriptExecutor, but I haven't been able to make these selects work with the regular Selenium Select mechanism.
I would set up a function for each dropdown, one for setting the departure airport and another for setting the destination airport. I've tested the code below and it works.
The functions
public static void setDepartureAirport(String airport)
{
driver.findElement(By.cssSelector("div.departureAirport div.departurePoint")).click();
String xpath = "//div[contains(@class, 'departurePoint')]//ul//li[contains(@class, 'custom-select-option') and contains(text(), '"
+ airport + "')]";
driver.findElement(By.xpath(xpath)).click();
}
public static void setDestinationAirport(String airport)
{
driver.findElement(By.cssSelector("div.destinationAirport div.airportSelect")).click();
String xpath = "//div[contains(@class, 'destinationAirport')]//ul//li[contains(@class, 'custom-select-option') and contains(text(), '"
+ airport + "')]";
driver.findElement(By.xpath(xpath)).click();
}
and you call them like
driver.get("https://www.club18-30.com/club18-30");
setDepartureAirport("(MAN)");
setDestinationAirport("(IBZ)");
I would suggest that you use the 3-letter airport codes for your search, e.g. "(MAN)" for Manchester. That will be unique to each airport but you can use any unique part of the text.

Extracting Capture Group from Non-Capture Group in Java

I have a string, let's call it output, that equals the following:
ltm data-group internal str_testclass {
records {
baz {
data "value 1"
}
foobar {
data "value 2"
}
topaz {}
}
type string
}
I'm trying to extract the substring between the quotes for a given "record" name. So given foobar, I want to extract value 2. The substring I want always appears in the form shown above: the "record" name, a whitespace, an opening brace, a newline, whitespace, the string data, and then the value between the quotes. The one exception is when there is no value, which always looks like topaz above: the "record" name is followed by just an opening and a closing brace, and in that case I'd like to get an empty string. How could I write a line of Java to capture this? So far I have:
String myValue = output.replaceAll("(?:foobar\\s{\n\\s*data "([^\"]*)|()})","$1 $2");
But I'm not sure where to go from here.
Let's start by extracting the "records" structure with the following regex: ltm\s+data-group\s+internal\s+str_testclass\s*\{\s*records\s*\{\s*(?<records>([^\s}]+\s*\{\s*(data\s*"[^"]*")?\s*\}\s*)*)\}\s*type\s*string\s*\}
Then, within the "records" group, find successive matches of [^\s}]+\s*\{\s*(?:data\s*"(?<data>[^"]*)")?\s*\}\s*. The "data" group contains what you're looking for and will be null in the "topaz" case.
Java strings:
"ltm\\s+data-group\\s+internal\\s+str_testclass\\s*\\{\\s*records\\s*\\{\\s*(?<records>([^\\s}]+\\s*\\{\\s*(data\\s*\"[^\"]*\")?\\s*\\}\\s*)*)\\}\\s*type\\s*string\\s*\\}"
"[^\\s}]+\\s*\\{\\s*(?:data\\s*\"(?<data>[^\"]*)\")?\\s*\\}\\s*"
Demo:
String input =
"ltm data-group internal str_testclass {\n" +
" records {\n" +
" baz {\n" +
" data \"value 1\"\n" +
" }\n" +
" foobar {\n" +
" data \"value 2\"\n" +
" }\n" +
" topaz {}\n" +
" empty { data \"\"}\n" +
" }\n" +
" type string\n" +
"}";
Pattern language = Pattern.compile("ltm\\s+data-group\\s+internal\\s+str_testclass\\s*\\{\\s*records\\s*\\{\\s*(?<records>([^\\s}]+\\s*\\{\\s*(data\\s*\"[^\"]*\")?\\s*\\}\\s*)*)\\}\\s*type\\s*string\\s*\\}");
Pattern record = Pattern.compile("(?<name>[^\\s}]+)\\s*\\{\\s*(?:data\\s*\"(?<data>[^\"]*)\")?\\s*\\}\\s*");
Matcher lgMatcher = language.matcher(input);
if (lgMatcher.matches()) {
String records = lgMatcher.group("records");
Matcher rdMatcher = record.matcher(records);
while (rdMatcher.find()) {
System.out.printf("%s:%s%n", rdMatcher.group("name"), rdMatcher.group("data"));
}
} else {
System.err.println("Language not recognized");
}
Output:
baz:value 1
foobar:value 2
topaz:null
empty:
Alternatives: as you're parsing a custom language, you could try writing an ANTLR grammar or creating a Groovy DSL.
Your regex shouldn't even compile, because you are not escaping the " inside your regex String, so it is ending your String at the first " inside your regex.
Instead, try this regex:
String regex = key + "\\s\\{\\s*\\n\\s*data\\s*\"([^\"]*)\"";
You can check out how it works here on regex101.
Try something like this getRecord() method where key is the record 'name' you're searching for, e.g. foobar, and the input is the string you want to search through.
public static void main(String[] args) {
String input = "ltm data-group internal str_testclass { \n" +
" records { \n" +
" baz { \n" +
" data \"value 1\" \n" +
" } \n" +
" foobar { \n" +
" data \"value 2\" \n" +
" }\n" +
" topaz {}\n" +
" } \n" +
" type string \n" +
"}";
String bazValue = getRecord("baz", input);
String foobarValue = getRecord("foobar", input);
String topazValue = getRecord("topaz", input);
System.out.println("Record data value for 'baz' is '" + bazValue + "'");
System.out.println("Record data value for 'foobar' is '" + foobarValue + "'");
System.out.println("Record data value for 'topaz' is '" + topazValue + "'");
}
private static String getRecord(String key, String input) {
String regex = key + "\\s\\{\\s*\\n\\s*data\\s*\"([^\"]*)\"";
final Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(input);
if (matcher.find()) {
//if we find a record with data return it
return matcher.group(1);
} else {
//else see if the key exists with empty {}
final Pattern keyPattern = Pattern.compile(key);
Matcher keyMatcher = keyPattern.matcher(input);
if (keyMatcher.find()) {
//return empty string if key exists with empty {}
return "";
} else {
//else handle error, throw exception, etc.
System.err.println("Record not found for key: " + key);
throw new RuntimeException("Record not found for key: " + key);
}
}
}
Output:
Record data value for 'baz' is 'value 1'
Record data value for 'foobar' is 'value 2'
Record data value for 'topaz' is ''
You could try
(?:foobar\s{\s*data "(.*)")
I think the replaceAll() isn't necessary here. Would something like this work:
String var1 = "foobar";
String regex = "(?:" + var1 + "\\s\\{\\n\\s*data \"([^\"]*)\")";
You can then use this as your regex to pass into your pattern and matcher to find the substring.
You can simply turn this into a function so that you can pass in the record name you are searching for:
public static String searchString(String str)
{
// Build the regex for the given record name; group 1 is the quoted value.
return "(?:" + str + "\\s\\{\\n\\s*data \"([^\"]*)\")";
}
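A fuller sketch of that idea, wired up to Pattern and Matcher, could look like the following. RecordSearch and findData are hypothetical names, Pattern.quote is added so a record name containing regex metacharacters is matched literally, and the whitespace handling is slightly looser (\\s* instead of a fixed space and newline). It only looks for the record name followed by a brace, so a name that is a suffix of another name would also match.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordSearch {

    /** Returns the quoted data value of the named record, if it has one. */
    static Optional<String> findData(String recordName, String input) {
        Pattern pattern = Pattern.compile(
                Pattern.quote(recordName) + "\\s*\\{\\s*data\\s+\"([^\"]*)\"");
        Matcher matcher = pattern.matcher(input);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String input = "records {\n"
                + "  baz {\n"
                + "    data \"value 1\"\n"
                + "  }\n"
                + "  foobar {\n"
                + "    data \"value 2\"\n"
                + "  }\n"
                + "  topaz {}\n"
                + "}";
        System.out.println("foobar -> '" + findData("foobar", input).orElse("") + "'");
        System.out.println("topaz  -> '" + findData("topaz", input).orElse("") + "'");
    }
}
```

Calling orElse("") gives the empty string the question asks for when a record such as topaz has no data.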

How to get the src url and a href html

I have this piece of HTML/CSS text. I want to extract the links contained in three kinds of constructs: href, src, and url(...). This is what I've tried so far:
String texto2 = "url(\"primeiro url\")\n" +
"url('2 url')\n" +
"href=\"1 href\"\n" +
"src=\"1 src\"\n" +
"src='2 src'\n" +
"url('3 url')\n" +
"\n" +
".camera_target_content .camera_link {\n" +
" background: url(../images/blank.gif);\n" +
" display: block;\n" +
" height: 100%;\n" +
" text-decoration: none;\n" +
"}";
String exp = "(?:href|src)=[\"'](.+)[\"']+|(?:url)\\([\"']*(.*)[\"']*\\)";
// expression to capture the links from src and href
Pattern pattern = Pattern.compile(exp);
// preparing the expression
Matcher matcher = pattern.matcher(texto2);
// getting the URLs and printing them
while(matcher.find()) {
System.out.println(texto2.substring(matcher.start(), matcher.end()));
}
So far, so good: it works with find(), except that I need to get the clean link, something like this:
img/image.gif
and not:
 href = "img/image.gif"
     src = "img/image.gif"
     url (img/image.gif)
I then tried to get just the link by reading a capture group; this is what I've tried so far:
String texto2 = "url(\"primeiro url\")\n" +
"url('2 url')\n" +
"href=\"1 href\"\n" +
"src=\"1 src\"\n" +
"src='2 src'\n" +
"url('3 url')\n" +
"\n" +
".camera_target_content .camera_link {\n" +
" background: url(../images/blank.gif);\n" +
" display: block;\n" +
" height: 100%;\n" +
" text-decoration: none;\n" +
"}";
String exp = "(?:href|src)=[\"'](.+)[\"']+|(?:url)\\([\"']*(.*)[\"']*\\)";
// expression to capture the links from src and href
Pattern pattern = Pattern.compile(exp);
// preparing the expression
Matcher matcher = pattern.matcher(texto2);
// getting the URLs and printing them
while(matcher.find()) {
String s = matcher.group(2);
System.out.println(s);
}
It turns out that this version does not fully work: it grabs the url(...) values perfectly, but not the href and src ones. Can someone help me spot the problem?
Use jsoup. Parse the HTML string into a DOM and you can then use CSS selectors to pull out the values as you would with jQuery in JavaScript. Note that this will only work if you're actually working with HTML; the string at the top of your example is not HTML.
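If you stay with regex for a mixed string like this one, note that in an alternation only the branch that matched sets its capture group. For href and src matches, group(2) is therefore null and the value is in group(1). Below is a rough sketch that picks whichever group is set. LinkExtractor and extractLinks are made-up names, and the pattern is a tightened variant of the one in the question (negated character classes instead of greedy .+ and .*), so treat it as an assumption, not a drop-in fix. It will not handle URLs that contain quotes or parentheses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {

    // Group 1: value of href="..." or src='...'. Group 2: argument of url(...).
    private static final Pattern LINK = Pattern.compile(
            "(?:href|src)\\s*=\\s*[\"']([^\"']*)[\"']"
                    + "|url\\(\\s*[\"']?([^\"')]*)[\"']?\\s*\\)");

    static List<String> extractLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher matcher = LINK.matcher(text);
        while (matcher.find()) {
            // Only the alternative that matched has a non-null group.
            links.add(matcher.group(1) != null ? matcher.group(1) : matcher.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String text = "url(\"primeiro url\")\n"
                + "href=\"1 href\"\n"
                + "src='2 src'\n"
                + "background: url(../images/blank.gif);";
        extractLinks(text).forEach(System.out::println);
    }
}
```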
