Extracting a text in regex - java

I have text in csv format like below.
hi="hello",how="are",,,you="",thank="you"
Please help me to get a regex to extract the above text
hello,are,,,,you
Basically I want to take only values in the above key value pairs and keep the commas as they are (technically to frame a perfect csv)
Note: I am actually looking for a pure regex, not with java functions..please.
Thank you

So starting from your sentence :
String str = "hi=\"hello\",how=\"are\",,,you=\"\",thank=\"you\"";
You can remove all occurences of word=" and of " :
str = str.replaceAll("(\\w*=\"|\")", ""); //remove word=" OR "
Another way, this will catch the group key="value" and replace it by value :
str = str.replaceAll("\\w+=\"(.*?)\"", "$1"); // $1 is the value catch in ()

Related

Java won't replace all strings, because there is text next to the tags (post improved)

I'm working on a program, which formats HTML Code, extracted from a PDF file.
I have a String list, which contains paragraphs and is divided by that.
As the PDF has hyperlinks, I decided to replace them with a foot note number "[1]".
This will be used for citation of sources. I will eventually plan, to put it at the end of a paragraph, or sentence, so you can look up the sources, like you would in a book.
My Problem
For some reason not all the hyperlinks are replaced.
The reason is most likely, that there is text directly next to the tag.
Hell<a href="http://www.example.com">o old chap!
Specifically the "o" part and the "hell" part is blocking the java .replaceAll function, from doing it's job.
Expected Result
Hello [1] old chap!
EDIT:
If I would just add space, before and after the URL, it might split some words like "help", into "hel p", which is also not an option.
My code would have to replace the URL tag (without the ) and create no new extra spaces.
This is some of my code, where the problem occures:
for (int i = 0; i < EN.length; i++) {
Pattern pattern_URL = Pattern.compile("<a(.+?)\">", Pattern.DOTALL);
Matcher matcher_URL = pattern_URL.matcher(EN[i]); //Checks in the curren Array part.
if (matcher_URL.find() == true) {
source_number++;
String extractedURL = matcher_URL.group(0);
//System.out.println(extractedURL);
String extractedURL_fully = extractedURL.replaceAll("href=\"", ""); //Anführungszeichen
//System.out.println(extractedURL_fully);
String nobracketURL = extractedURL.replaceAll("\\)", ""); //Remove round brackets from URL
EN[i] = EN[i].replaceAll("\\)\"", "\""); /*Replace round brackets from URL in Array. (For some reasons there have been href URLs, with an bracket at the end. This was already in the PDF. They were causing massive problems, because it didn't comment them out, so the entire replaceAll command didn't function.)*/
EN[i] = EN[i].replaceAll(nobracketURL, "[" + source_number + "]"); //Replace URL tags with number and Edgy brackets
}
else{
//System.out.println("FALSE: " + "[" + i + "]");
}
}
The whole idea of this is, that it loops through the array and replaces all the URLs, including it's starting tag <a until the end of the starting tag "> (which can also be seen in the pattern regex.)
Correct me if I'm wrong, but what you need is to eliminate all the <a> tags from a given string, right? If that's the case all you needed to do was use a code like the following:
final String string = "<a href=\"http://www.example.com\">Sen";
final Pattern pattern = Pattern.compile("<a(.+?)>", Pattern.DOTALL);
final Matcher matcher = pattern.matcher(string);
final String result = matcher.replaceAll("");
System.out.println(result); // prints "Sen"
Notice I didn't use the replaceAll from the String object, but from the Matcher object. This replaces all matches for the empty string "".

How can I get non-matching groups using a Matcher in Java?

I'm trying to write a java regex to catch some groups of words from a String using a Matcher.
Say i got this string: "Hello, we are #happy# to see you today".
I would like to get 2 group of matches, one having
Hello, we are
to see you today
and the other
happy
So far, I was only able to match the word between the #s using this Pattern:
Pattern p = Pattern.compile("#(.+?)#");
I've read about negative lookahead and lookaround, played a bit with it but without success.
I assume I should do some sort of negation of the regex so far, but I couldn't come up with anything.
Any help would be really appreciated, thank you.
From comment:
I may incur in a string where I got more than one instances of words wrapped by #, such as "#Hello# kind #stranger#"
From comment:
I need to apply some different style format to both the text inside and outside.
Since you need to apply different stylings, the code need to process each block of text separately, and needs to know if the text is inside or outside a #..# section.
Note, in the following code, it will silently skip the last #, if there is an odd number of them.
String input = ...
for (Matcher m = Pattern.compile("([^#]+)|#([^#]+)#").matcher(input); m.find(); ) {
if (m.start(1) != -1) {
String outsideText = m.group(1);
System.out.println("Outside: \"" + outsideText + "\"");
} else {
String insideText = m.group(2);
System.out.println("Inside: \"" + insideText + "\"");
}
}
Output for input = "Hello, we are #happy# to see you today"
Outside: "Hello, we are "
Inside: "happy"
Outside: " to see you today"
Output for input = "#Hello# kind #stranger#"
Inside: "Hello"
Outside: " kind "
Inside: "stranger"
Output for input = "This #text# has unpaired # characters"
Outside: "This "
Inside: "text"
Outside: " has unpaired "
Outside: " characters"
The best I could do is splitting in 3 groups, then merging the group 1 and 4 :
(^.*)(\#(.+?)\#)(.*)
Test it here
EDIT: Taking remarks from the comments :
(^[^\#]*)(?:\#(.+?)\#)([^\#]*)
Thanks to #Lino we don't capture the useless group with # anymore, and we capture anything except #, instead of any non whitespace character in the 1st and 2nd groups.
Test it here
Is this solution fine?
Pattern pattern =
Pattern.compile("([^#]+)|#([^#]*)#");
Matcher matcher =
pattern.matcher("Hello, we are #happy# to see you today");
List<String> notBetween = new ArrayList<>(); // not surrounded by #
List<String> between = new ArrayList<>(); // surrounded by #
while (matcher.find()) {
if (Objects.nonNull(matcher.group(1))) notBetween.add(matcher.group(1));
if (Objects.nonNull(matcher.group(2))) between.add(matcher.group(2));
}
System.out.println("Printing group 1");
for (String string :
notBetween) {
System.out.println(string);
}
System.out.println("Printing group 2");
for (String string :
between) {
System.out.println(string);
}

Java Split String by colon on both side

Can you suggest me an approach by which I can split a String which is like:
:31C:150318
:31D:150425 IN BANGLADESH
:20:314015040086
So I tried to parse that string with
:[A-za-z]|\\d:
This kind of regular expression, but it is not working . Please suggest me a regular expression by which I can split that string with 20 , 31C , 31D etc as Keys and 150318 , 150425 IN BANGLADESH etc as Values .
If I use string.split(":") then it would not serve my purpose.
If a string is like:
:20: MY VALUES : ARE HERE
then It will split up into 3 string , and key 20 will be associated with "MY VALUES" , and "ARE HERE" will not associated with key 20 .
You may use matching mechanism instead of splitting since you need to match a specific colon in the string.
The regex to get 2 groups between the first and second colon and also capture everything after the second colon will look like
^:([^:]*):(.*)$
See demo. The ^ will assert the beginning of the string, ([^:]*) will match and capture into Group 1 zero or more characters other than :, and (.*) will match and capture into Group 2 the rest of the string. $ will assert the position at the end of a single line string (as . matches any symbol but a newline without Pattern.DOTALL modifier).
String s = ":20:AND:HERE";
Pattern pattern = Pattern.compile("^:([^:]*):(.*)$");
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
System.out.println("Key: " + matcher.group(1) + ", Value: " + matcher.group(2) + "\n");
}
Result for this demo: Key: 20, Value: AND:HERE
You can use the following to split:
^[:]+([^:]+):
Try with split function of String class
String[] splited = string.split(":");
For your requirements:
String c = ":31D:150425 IN BANGLADESH:todasdsa";
c=c.substring(1);
System.out.println("C="+c);
String key= c.substring(0,c.indexOf(":"));
String value = c.substring(c.indexOf(":")+1);
System.out.println("key="+key+" value="+value);
Result:
C=31D:150425 IN BANGLADESH:todasdsa
key=31D value=150425 IN BANGLADESH:todasdsa

Java Regex ReplaceAll with grouping

I want to surround all tokens in a text with tags in the following manner:
Input: " abc fg asd "
Output:" <token>abc</token> <token>fg</token> <token>asd</token> "
This is the code I tried so far:
String regex = "(\\s)([a-zA-Z]+)(\\s)";
String text = " abc fg asd ";
text = text.replaceAll(regex, "$1<token>$2</token>$3");
System.out.println(text);
Output:" <token>abc</token> fg <token>asd</token> "
Note: for simplicity we can assume that the input starts and ends with whitespaces
Use lookaround:
String regex = "(?<=\\s)([a-zA-Z]+)(?=\\s)";
...
text = text.replaceAll(regex, "<token>$1</token>");
If your tokens are only defined with a character class you don't need to describe what characters are around. So this should suffice since the regex engine walks from left to right and since the quantifier is greedy:
String regex = "[a-zA-Z]+";
text = text.replaceAll(regex, "<token>$0</token>");
// meaning not a space, 1+ times
String result = input.replaceAll("([^\\s]+)", "<token>$1</token>");
this matches everything that isn't a space. Prolly the best fit for what you need. Also it's greedy meaning it will never leave out a character that it shouldn't ( it will never find the string "as" in the string "asd" when there is another character with which it matches)

JAVA - Ignore part of strings containing "#"

I'm having some difficulties in excluding part of strings after the "#" symbol.
I explain myself better:
This is a sample input text a user could insert in a textbox:
Some Text
Some Text again #A comment
#A comment line
Another Text
Another Text again#Comment
I need to read this text and ignore all text after "#" symbol.
This should be the expected output:
Some Text;Some Text again;Another Text;Another Text again
As for now here's the code:
This replaces all newlines with ";"
readText = userInputTextArea.getText();
readTextAllInALine = readText.replaceAll("\\n", ";");
so the output after this is:
Some Text;Some Text again #A comment;#A comment line;Another Text;Another Text again#Comment
This code is to ignore all characters after the first "#" but works fine just for the first line if we read it all sequentially.
int startIndex = inputCommandText.indexOf("#");
int endIndex = inputCommandText.indexOf(";");
String toBeReplaced = inputCommandText.substring(startIndex, endIndex);
readTextAllInALine.replace(toBeReplaced, "");
I'm stuck in finding a way for having the expected output. I was thinking of using a StringTokenizer, processing every line, removing text after "#" or ignoring the whole line if it starts with "#", and then printing all tokens (i.e. all lines) separating them with ";" but I cannot make it work.
Any help will be appreciated.
Thank you very much in advance.
Regards.
Just call this replace command on your pure string, retrieved from the text input. The regex #[^;]* grabs everything, starting at the hash until it reads a semicolon. Afterwards it replaces it with an empty string.
public static void main(String[] args) {
String text = "Some Text;Some Text again #A comment;#A comment line;Another Text;Another Text again#Comment";
System.out.println(text);
text = text.replaceAll("#[^;]*", "");
System.out.println(text);
}
A regex is useful here but it's tricky because your pattern is moderately complex. The comments are end line so they can appear in more than one arrangement.
I came up with the following which is a two-pass:
replaceAll(" *(#.*(?=\\n|$))", "").replaceAll("\\n+", ";");
The two-pass circumvents the fact that sometimes you get a duplicate line break. The first expression replaces comments but not new line characters and the second expression replaces multiple new line characters with a single semicolon.
The individual parts of the expression in the first pass are the following:
" *"
This includes zero or more leading spaces in the comment match. IE in "...again #A...", we want to remove that space between n and #.
"(#.* )"
The start of the comment match: matches a # followed by zero or more characters. (Typically the . matches any character except a new line.)
"(?= )"
This is a positive lookahead and where the regex starts to get tricky. It looks for whatever is inside this expression but doesn't include it in the text that's matched. It asserts that the #.* is followed by a certain string but doesn't replace that certain string.
"\\n|$"
The lookahead finds a new line or the end anchor. This will find a comment ended with a new line character or a comment that is at the end of the String. But again, since it's inside the lookahead, the new line doesn't get replaced.
So given the input:
String text = (
"Some Text" + '\n' +
"Some Text again #A comment" + '\n' +
"#A comment line" + '\n' +
"Another Text" + '\n' +
"Another Text again#Comment"
);
System.out.println(
text.replaceAll(" *(#.*(?=\\n|$))", "").replaceAll("\\n+", ";")
);
The output is:
Some Text;Some Text again;Another Text;Another Text again
readText = userInputTextArea.getText();
readText = readText.replaceAll("\\s*#[^\n]*", "");
readText = readText.replaceAll("\n+", ";");
Just to make it clear, Coxer's reply is the way to go. Far more precise and clean. But in any case, if you fancy experimenting here is a recursive solution that will work:
public class IgnoreHash {
#Test
public void test() {
String readTextAllInALine = "Some Text;Some Text again #A comment;#A comment line;Another Text;Another Text again#Comment;";
String actualResult = removeHashComments(readTextAllInALine);
Assert.assertEquals(actualResult, "Some Text;Some Text again ;Another Text;Another Text again");
}
private String removeHashComments(String input) {
StringBuffer result = new StringBuffer();
int hashIndex = input.indexOf("#");
int endIndex = input.indexOf(";");
if(hashIndex != -1){
result.append(input.substring(0, hashIndex));
//first line
if(hashIndex < endIndex ) {
result.append(removeHashComments(input.substring(endIndex)));
} // the case of ;#
else if (endIndex == hashIndex-1) {
int endIndex2 = input.indexOf(";", hashIndex+1);
result.append(removeHashComments(input.substring(endIndex2+1)));
}
else {
result.append(removeHashComments(input.substring(hashIndex)));
}
}
return result.toString();
}
}

Categories

Resources