Splitting string with similar starting pattern

So, I've been trying to split something I'm reading from a file. But everything that I've tried does not give me only the part that I want.
What I have as string is this:
Scenario:
Bunch of stuf here
Just typing stuff for the example...
Scenario:
More stuff here
A lot more stuff here
XX123
I want to get everything from 'Scenario:' to 'XX123'
Like this:
Scenario:
More stuff here
A lot more stuff here
XX123
The file that I'm reading from has a lot of those 'Scenario:' blocks, and using Java's Pattern doesn't give me only the part that I want. Instead it returns everything from the first 'Scenario:' it finds up to 'XX123'.
I also tried to use StringUtils.substringBetween, same result.
Thanks in advance

The old-fashioned way to do it would look something like this:
String inputText = /* the text read from the file */;
String END_MARKER = "XX123";
int indexOfEnd = inputText.indexOf(END_MARKER);
// search backwards from the end marker for the nearest "Scenario:"
int indexOfScenario = inputText.lastIndexOf("Scenario:", indexOfEnd);
String result = inputText.substring(indexOfScenario,
        indexOfEnd + END_MARKER.length());
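
Since the question mentions Pattern, here is a regex variant of the same idea, as a minimal sketch rather than a definitive solution. It swaps the index search for a negative lookahead that forbids a second "Scenario:" inside the match. The class and method names (ScenarioExtractor, extract) are made up for illustration, and the markers are assumed to be exactly "Scenario:" and "XX123" as in the question.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScenarioExtractor {
    // "Scenario:", then characters that never begin another "Scenario:",
    // up to the first "XX123". DOTALL lets '.' match line breaks.
    private static final Pattern BLOCK = Pattern.compile(
            "Scenario:(?:(?!Scenario:).)*?XX123", Pattern.DOTALL);

    // Hypothetical helper: returns the first block that ends in the marker.
    static Optional<String> extract(String text) {
        Matcher m = BLOCK.matcher(text);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String text = String.join("\n",
                "Scenario:",
                "Bunch of stuff here",
                "Scenario:",
                "More stuff here",
                "A lot more stuff here",
                "XX123");
        // Prints the second block only, from its "Scenario:" line to "XX123".
        System.out.println(extract(text).orElse("(no match)"));
    }
}
```

Calling find() repeatedly on the same Matcher would return every such block in the file, one per end marker.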

Related

Regex: Read value between multiple brackets

I'm currently working on translating a website (Smarty) with Poedit. To get all the text from the .tpl files I'm using a regex to get the data between {t} and {/t}. For example:
{t}Password incorrect, please try again{/t}
The regex will read Password incorrect, please try again and place it in a .po file. This is all working fine. It goes wrong when it gets a little more advanced.
Sometimes the text between the {t} tags uses a parameter. This looks like this:
{t 1=$email|escape 2=$mailbox}No $1 given, please check your $2{/t}
This is also working great.
The real problem starts when I use brackets inside the parameter, like this:
{t 1={site info='name'} 2=$mailbox}visit %1 or go to your %2{/t}
My regex closes at the first closing bracket it sees, so the result will be 2=$mailbox}visit %1 or go to your %2.
My regex looks like this:
\{t.*?\}?[}]([^\{]+)\{\/t\}|\{t\}([^\{]+)\{\/t\}
The regex is used inside a java program.
Does anybody have a way to fix this problem?

The easiest solution I see for this is to normalize the .tpl files. Just use a regex that matches all tags, something like this one:
{[^}]*[^{]*}
I had the same issue to solve, and normalizing worked pretty well.
The normalizing-method would look like this:
final String regex = "\\{[^\\}]*[^\\{]*\\}";
private String normalizeContent(String content) {
return content.replaceAll(regex, "");
}
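
To see what the normalizing step does, here is a small self-contained sketch that runs the regex above on the problematic line from the question. The class name TemplateNormalizer is made up for illustration. The regex has only been checked against the examples in the question: it strips every brace-delimited tag, so any text outside {t}...{/t} is left in place, and tags with more than one nested {...} parameter may not be removed cleanly.

```java
class TemplateNormalizer {
    // Regex from the answer above, unchanged.
    private static final String TAG_REGEX = "\\{[^\\}]*[^\\{]*\\}";

    static String normalizeContent(String content) {
        // Remove every match, leaving only the text between the tags.
        return content.replaceAll(TAG_REGEX, "");
    }

    public static void main(String[] args) {
        String line = "{t 1={site info='name'} 2=$mailbox}visit %1 or go to your %2{/t}";
        // prints: visit %1 or go to your %2
        System.out.println(normalizeContent(line));
    }
}
```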

Combining Regex in Java

I have some issues with a Java program that fetches information out of an HTML table.
To fetch information out of every column I use the following RegEx:
<td>([^<]*)</td>
This works very nice for me.
For fetching the Linknames I use this:
<a[^>]*>(.*?)</a>
This is also working very very good.
But sometimes I need information from a column that contains a link. Therefore I wanted to combine these regexes with:
<td>([^<]*)</td>|<a[^>]*>(.*?)</a>
I thought that it would work like this:
It gets everything between <td> and </td>.
If the content is a link, it gets just the link name.
But this is not working. I'm not the best at regex, so I need help combining these two steps.
Thanks very very very much.

The code I'm using:
Pattern pattern = Pattern.compile("<td>([^<]*)</td>|<a[^>]*>(.*?)</a>");
String line = "Here are the lines saved from the HTML downloader";
Matcher matcher = pattern.matcher(line);
for (int startPoint = 0; matcher.find(startPoint); startPoint = matcher.end())
{
System.out.println(matcher.group(1));
}
This is just a snippet, but that's how it works in general. (Normally the String is saved in an array.)
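
One likely reason the combined pattern misbehaves: the alternation matches either a <td> cell with no tags inside or an <a> element anywhere, and group(1) is null whenever the second alternative is the one that matched. A cell containing a link is also never matched by the first alternative, because [^<]* stops at the "<" of the link. A possible way to combine the two steps is to nest the link pattern inside the cell pattern, sketched below. It assumes each cell holds either plain text or a single link. The names CellExtractor and cellTexts are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CellExtractor {
    // Inside <td>...</td>: either one link (group 1 = link name)
    // or plain text without tags (group 2).
    private static final Pattern CELL = Pattern.compile(
            "<td>(?:\\s*<a[^>]*>(.*?)</a>\\s*|([^<]*))</td>");

    static List<String> cellTexts(String line) {
        List<String> texts = new ArrayList<>();
        Matcher m = CELL.matcher(line);
        while (m.find()) {
            // Only one of the two groups takes part in any given match.
            texts.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return texts;
    }

    public static void main(String[] args) {
        String line = "<tr><td>plain</td><td><a href=\"x.html\">Link name</a></td></tr>";
        System.out.println(cellTexts(line)); // [plain, Link name]
    }
}
```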

Matcher.find() only find the last match in JUnit Test

I have this weird problem. I have this Java method that works fine in my program:
/*
* Extract all image urls from the html source code
*/
public void extractImageUrlFromSource(ArrayList<String> imgUrls, String html) {
Pattern pattern = Pattern.compile("\\<[ ]*[iI][mM][gG][\t\n\r\f ]+.*[sS][rR][cC][ ]*=[ ]*\".*\".*>");
Matcher matcher = pattern.matcher(html);
while (matcher.find()) {
imgUrls.add(extractImgUrlFromTag(matcher.group()));
}
}
This method works fine in my Java application. But whenever I test it in a JUnit test, it only adds the last URL to the ArrayList:
/**
* Test of extractImageUrlFromSource method, of class ImageDownloaderProc.
*/
@Test
public void testExtractImageUrlFromSource() {
System.out.println("extractImageUrlFromSource");
String html = "<html><title>fdjfakdsd</title><body><img kfjd src=\"http://image1.png\">df<img dsd src=\"http://image2.jpg\"></body><img dsd src=\"http://image3.jpg\"></html>";
ArrayList<String> imgUrls = new ArrayList<String>();
ArrayList<String> expimgUrls = new ArrayList<String>();
expimgUrls.add("http://image1.png");
expimgUrls.add("http://image2.jpg");
expimgUrls.add("http://image3.jpg");
ImageDownloaderProc instance = new ImageDownloaderProc();
instance.extractImageUrlFromSource(imgUrls, html);
imgUrls.stream().forEach((x) -> {
System.out.println(x);
});
assertArrayEquals(expimgUrls.toArray(), imgUrls.toArray());
}
Is JUnit at fault? Remember, it works fine in my application.

I think there is a problem in the regex:
"\\<[ ]*[iI][mM][gG][\t\n\r\f ]+.*[sS][rR][cC][ ]*=[ ]*\".*\".*>"
The problem (or at least one problem) is the first .*. The + and * metacharacters are greedy, which means that they will attempt to match as many characters as possible. In your unit test, I think that what is happening is that the .* is matching everything up to the last 'src' in the input string.
I suspect that the reason that this "works" in your application is that the input data is different. Specifically, I suspect that you are running your application on input files where each img element is on a different line. Why does this make a difference? Well, it turns out that by default, the . metacharacter does not match line breaks.
For what it is worth, using regexes to "parse" HTML is generally thought to be a bad idea. For a start, it is horribly fragile. People who do a lot of this kind of stuff tend to use proper HTML parsers ... like "jsoup".
Reference: RegEx match open tags except XHTML self-contained tags
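
If you do stay with a regex, the greedy matching described above can be avoided by replacing each .* with a character class that cannot cross the end of the tag or the closing quote. The following is a minimal sketch under that assumption, not a robust HTML parser. The class and method names (ImgSrcExtractor, extractImageUrls) are made up for illustration, and the src value is assumed to be double-quoted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcExtractor {
    // [^>]* cannot run past the end of the current tag and [^"]* cannot
    // run past the closing quote, so each <img> is matched on its own.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<\\s*img\\s+[^>]*?src\\s*=\\s*\"([^\"]*)\"[^>]*>",
            Pattern.CASE_INSENSITIVE);

    static List<String> extractImageUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            urls.add(m.group(1)); // group 1 is the quoted src value
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<body><img kfjd src=\"http://image1.png\">df"
                + "<img dsd src=\"http://image2.jpg\"></body>";
        System.out.println(extractImageUrls(html));
    }
}
```

Because the capture group already isolates the URL, a separate step like extractImgUrlFromTag is not needed in this variant.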

I wish I could comment as I'm not sure about this, but it might be worth mentioning...
This line looks like it's extracting the URLs from the wrong array...did you mean to extract from expimgUrls instead of imgUrls?
instance.extractImageUrlFromSource(imgUrls, html);
I haven't gotten this far in my Java education so I may be incorrect...I just looked over the code and noticed it. I hope someone else who knows more can actually give you a solid answer!

cannot parse String with Java Regex

I have a string formatted as below:
source1.type1.8371-(12345)->source2.type3.3281-(38270)->source4.type2.903..
It's a path; the number in () is the weight of the edge. I tried to split it using Java's Pattern with the following regexes:
[a-zA-Z.0-9]+-{1}({1}\\d+){1}
[a-zA-Z_]+.[a-zA-Z_]+.(\\d)+-(\\d+)
[a-zA-Z.0-9]+-{1}({1}\\d+){1}-{1}>{1}
Hopefully it would split the string into fields like
source1.type1.8371-(12345)
source2.type3.3281-(38270)
..
but none of them work; I always get the whole string back as a single field.

It looks like you just want String.split("->") (javadoc). This splits on the symbol -> and returns an array containing the parts between ->.
String str = "source1.type1.8371-(12345)->source2.type3.3281-(38270)->source4.type2.903..";
for(String s : str.split("->")){
System.out.println(s);
}
Output
source1.type1.8371-(12345)
source2.type3.3281-(38270)
source4.type2.903..

It seems to me like you want to split at the ->'s, so you could use something like str.split("->"). If you were more specific about why you need this, maybe we could understand why you were trying to use those complicated regexes.
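
If a Pattern is still wanted, for example to pick out only the segments that carry a weight, note that unescaped parentheses in a regex form a group instead of matching literal "(" and ")". Here is a minimal sketch with the parentheses escaped. The names PathSegments and segments are made up for illustration, and the character class is a guess at what a node name may contain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PathSegments {
    // A node name, a dash, then the weight in literal parentheses.
    private static final Pattern SEGMENT =
            Pattern.compile("[A-Za-z0-9_.]+-\\((\\d+)\\)");

    static List<String> segments(String path) {
        List<String> found = new ArrayList<>();
        Matcher m = SEGMENT.matcher(path);
        while (m.find()) {
            found.add(m.group()); // m.group(1) would be just the weight
        }
        return found;
    }

    public static void main(String[] args) {
        String str = "source1.type1.8371-(12345)->source2.type3.3281-(38270)->source4.type2.903..";
        // The trailing node has no weight, so it is not matched.
        System.out.println(segments(str));
    }
}
```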

Java: reading a string in a particular format

I am not posting the code I am stuck with. I am trying this in Java:
Issue:
I have words like:
,xxxx-1223
yyyyy,xxdd-345
$,xxxxr-7
sdsdsdd-18
So whatever format I have, I should be able to read the last part:
xxxx-1223
xxdd-345
xxxxr-7
sdsdsdd-18
Whatever the words may be, all I need is to get the words as shown.

Use String#lastIndexOf(int) to find where the last comma occurs, and use String#substring(int) to get the rest of the string that follows. If the string contains no comma, lastIndexOf returns -1, so substring(0) returns the whole string.
String input = /* whatever */;
int lastComma = input.lastIndexOf(',');
String output = input.substring(lastComma + 1);

String[] str=yourWord.split(",");
String output=str[str.length-1];

You can use this regex:
(\\w+-\\d+)$
Or this specific problem can simply be solved using String.split() or String.substring(int) methods
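
For completeness, here is how the regex above might be applied with Matcher, as a small sketch. The class and method names (LastTokenExtractor, lastToken) are made up for illustration. It assumes the wanted part always has the form word characters, a dash, then digits, at the very end of the string.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastTokenExtractor {
    // Word characters, a dash and digits, anchored at the end of the input.
    private static final Pattern TAIL = Pattern.compile("(\\w+-\\d+)$");

    static Optional<String> lastToken(String word) {
        Matcher m = TAIL.matcher(word);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] words = {",xxxx-1223", "yyyyy,xxdd-345", "$,xxxxr-7", "sdsdsdd-18"};
        for (String word : words) {
            System.out.println(lastToken(word).orElse("(no match)"));
        }
    }
}
```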
