I have a big text file and I want to remove everything that is between
double curly brackets.
So given the text below:
String text = "This is {{\n" +
"{{the multiline\n" +
"text}} file }}\n" +
"what I\n" +
"{{ to {{be\n" +
"changed}}\n" +
"}} want.";
String cleanedText = Pattern.compile("(?<=\\{\\{).*?\\}\\}", Pattern.DOTALL).matcher(text).replaceAll("");
System.out.println(cleanedText);
I want the output to be:
This is what I want.
I have googled around and tried many different things, but I couldn't find anything close to my case, and as soon as I change the pattern a little, everything gets worse.
Thanks in advance.
You can use this:
public static void main(String[] args) {
    String text = "This is {{\n" +
            "{{the multiline\n" +
            "text}} file }}\n" +
            "what I\n" +
            "{{ to {{be\n" +
            "changed}}\n" +
            "}} want.";
    String cleanedText = text.replaceAll("\\n", "");
    String previous;
    // Remove innermost {{...}} pairs repeatedly; stopping when a pass removes
    // nothing guarantees termination even if some braces never match the pattern
    do {
        previous = cleanedText;
        cleanedText = cleanedText.replaceAll("\\{\\{[a-zA-Z\\s]*\\}\\}", "");
    } while (!cleanedText.equals(previous));
    System.out.println(cleanedText);
}
A regular expression cannot express arbitrarily nested structures; i.e. any syntax that requires a recursive grammar to describe.
If you want to solve this using Java Pattern, you need to do it by repeated pattern matching. Here is one solution:
String res = input;
while (true) {
String tmp = res.replaceAll("\\{\\{[^}]*\\}\\}", "");
if (tmp.equals(res)) {
break;
}
res = tmp;
}
This is not very efficient ...
That can be transformed into an equivalent, but more concise form:
String res = input;
String tmp;
while (!(tmp = res.replaceAll("\\{\\{[^}]*\\}\\}", "")).equals(res)) {
res = tmp;
}
... but I prefer the first version because it is (IMO) a lot more readable.
I am not an expert in regular expressions, so I just wrote a loop which does this for you. If you can't or don't want to use a regex, it could be helpful for you ;)
public static void main(String args[]) {
String text = "This is {{\n" +
"{{the multiline\n" +
"text}} file }}\n" +
"what I\n" +
"{{ to {{be\n" +
"changed}}\n" +
"}} want.";
int openBrackets = 0;
String output = "";
char[] input = text.toCharArray();
for(int i=0;i<input.length;i++){
if(input[i] == '{'){
openBrackets++;
continue;
}
if(input[i] == '}'){
openBrackets--;
continue;
}
if(openBrackets==0){
output += input[i];
}
}
System.out.println(output);
}
My suggestion is to remove anything between curly brackets, starting at the innermost pair:
String text = "This is {{\n" +
"{{the multiline\n" +
"text}} file }}\n" +
"what I\n" +
"{{ to {{be\n" +
"changed}}\n" +
"}} want.";
Pattern p = Pattern.compile("\\{\\{[^{}]+?}}", Pattern.MULTILINE);
while (p.matcher(text).find()) {
text = p.matcher(text).replaceAll("");
}
resulting in the output
This is
what I
want.
This might fail if the text contains single curly brackets or unbalanced pairs of brackets, but it could be good enough for your case.
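If single braces can appear in the text, one alternative to repeated regex replacement is a single pass with a depth counter that reacts only to the two-character tokens {{ and }}. The sketch below is an illustration and is not taken from any of the answers above: the names BraceStripper and stripDoubleBraces are made up for the example, and it assumes that an unmatched }} outside any block should be kept as ordinary text.

```java
class BraceStripper {
    // Removes everything between {{ and }}, including nested pairs.
    static String stripDoubleBraces(String text) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // number of currently open {{ blocks
        int i = 0;
        while (i < text.length()) {
            if (text.startsWith("{{", i)) {
                depth++;
                i += 2;
            } else if (depth > 0 && text.startsWith("}}", i)) {
                depth--;
                i += 2;
            } else {
                // Keep characters only while outside every {{...}} block
                if (depth == 0) {
                    out.append(text.charAt(i));
                }
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "This is {{\n" +
                "{{the multiline\n" +
                "text}} file }}\n" +
                "what I\n" +
                "{{ to {{be\n" +
                "changed}}\n" +
                "}} want.";
        System.out.println(stripDoubleBraces(text));
    }
}
```

Like the loop-based answers, this keeps the line breaks that sit outside the removed blocks, so the sample still prints on three lines. Single braces such as {b} pass through untouched.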
I have a file in which only the String fields are wrapped in double quotes, but I need to add the missing double quotes to the other fields and write the result to a file using Java.
For example,
123 ,6 ,"abc#yahoo.com"
"
should be converted to
"123 ","6 ","abc#yahoo.com" "
without trimming any value, just adding the missing text qualifier around the fields. I have tried splitting on the delimiter and then wrapping the fields in quotes, but it did not work.
Please share if you have solved an issue like this.
You can use the String.replaceAll method:
string.replaceAll("(^|,)(?!\")([^,]+)", "$1\"$2\"");
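As a quick check of what that pattern does, here is a small sketch that applies it to the sample line from the question. The class and method names (QuoteMissingFields, quoteUnquotedFields) are made up for the example, and it assumes that no quoted field contains a comma, since the pattern treats every comma as a field separator.

```java
class QuoteMissingFields {
    // Wraps every field that does not already start with a double quote.
    // (^|,)   start of the line or a separating comma, kept as $1
    // (?!\")  the field must not already begin with a quote
    // ([^,]+) the field text up to the next comma, kept as $2
    static String quoteUnquotedFields(String line) {
        return line.replaceAll("(^|,)(?!\")([^,]+)", "$1\"$2\"");
    }

    public static void main(String[] args) {
        String line = "123 ,6 ,\"abc#yahoo.com\"";
        System.out.println(quoteUnquotedFields(line));
    }
}
```

The spaces inside the fields are preserved because [^,]+ captures them along with the value. A quoted field that itself contains a comma would be split incorrectly by this approach, so a CSV library is the safer choice for such data.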
There's a solution without such a complicated regular expression: split your input by , and wrap the resulting Strings:
String[] splitted = input.split(",");
for (int i = 0; i < splitted.length; ++i) { // arrays expose length, not size()
    // startsWith is safe for empty fields, unlike charAt(0)
    if (!splitted[i].startsWith("\"")) {
        splitted[i] = "\"" + splitted[i] + "\"";
    }
}
String output = String.join(",", splitted); // or any other joining technique; String.join is available from Java 8
You can easily do it by using the split() method from the String class and adding a quote only where it is needed.
Basically, I would try something like this:
public static void main(String... args) {
    String st = "123 ,6 ,\"abc#yahoo.com";
    String[] results = st.split(",");
    for (int i = 0; i < results.length; i++) {
        String s = results[i];
        if (!s.startsWith("\""))
            s = "\"" + s + "\"";
        if (!s.endsWith("\""))
            s += "\"";
        results[i] = s;
    }
    // Joining the fields afterwards avoids a trailing comma
    String result = String.join(",", results);
    System.out.println(st);
    System.out.println("-------------");
    System.out.println(result);
}
This keeps the spaces and adds any missing quotes.
I tried the following and it works:
public static void test() {
    String str = "123 ,6 ,\"abc#yahoo.com \"";
    String result = "", temp = "";
    StringTokenizer token = new StringTokenizer(str, ",");
    while (token.hasMoreTokens()) {
        temp = token.nextToken();
        // The tokenizer drops the commas, so put them back between fields
        if (!result.isEmpty())
            result += ",";
        if (!temp.startsWith("\""))
            result += "\"" + temp + "\"";
        else
            result += temp;
    }
    System.out.println(result);
}
Please check...
Try using a simple for-each loop with an if check: if the value has no double quotes, add quotes around it.
// YourContainer stands for whatever collection of String fields you have
for (String value : YourContainer) {
    if (!value.contains("\"")) {
        String s = "\"" + value + "\"";
    }
}
If you are getting fields like ""Test"" with doubled double quotes, try using the replace function.
for (String value : YourContainer) {
    String s = value; // start from the original value so s is always initialized
    if (!value.contains("\"")) {
        s = "\"" + value + "\"";
    }
    s = s.replace("\"\"", "\"");
}
I want to initialize a String in Java, but that string needs to include quotes; for example: "ROM". I tried doing:
String value = " "ROM" ";
but that doesn't work. How can I include "s within a string?
In Java, you can escape quotes with \:
String value = " \"ROM\" ";
In reference to your comment after Ian Henry's answer, I'm not quite 100% sure I understand what you are asking.
If it is about getting double quote marks added into a string, you can concatenate the double quotes into your string, for example:
String theFirst = "Java Programming";
String ROM = "\"" + theFirst + "\"";
Or, if you want to do it with one String variable, it would be:
String ROM = "Java Programming";
ROM = "\"" + ROM + "\"";
Of course, this reassigns ROM to a new String rather than modifying the original one, since Java Strings are immutable.
If you are wanting to do something like turn the variable name into a String, you can't do that in Java, AFAIK.
Not sure what language you're using (you didn't specify), but you should be able to "escape" the quotation mark character with a backslash: "\"ROM\""
\ = \\
" = \"
new line = \r\n, \r, or \n (depends on the OS), but usually \n is enough.
tabulator = \t
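A short sketch of those escapes in Java string literals follows. The class name EscapeDemo and the helper quoted are made up for the example; System.lineSeparator() is the standard way to get the platform's own line ending if \n is not enough.

```java
class EscapeDemo {
    // Wraps s in literal double quotes; \" is the escape for a quote character.
    static String quoted(String s) {
        return "\"" + s + "\"";
    }

    public static void main(String[] args) {
        System.out.println(quoted("ROM"));     // quote characters are part of the string
        System.out.println("C:\\temp");        // \\ produces a single backslash
        System.out.println("name\tvalue");     // \t produces a tab
        System.out.print("line 1\nline 2");    // \n produces a line feed
        System.out.print(System.lineSeparator()); // platform-specific line ending
    }
}
```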
Just escape the quotes:
String value = "\"ROM\"";
In Java, you can use char value with ":
char quotes ='"';
String strVar=quotes+"ROM"+quotes;
Here is a full Java example:
public class QuoteInJava {
public static void main (String args[])
{
System.out.println ("If you need to 'quote' in Java");
System.out.println ("you can use single \' or double \" quote");
}
}
Here is the output:
If you need to 'quote' in Java
you can use single ' or double " quote
Look into this one ... call from anywhere you want.
public String setdoubleQuote(String myText) {
String quoteText = "";
if (!myText.isEmpty()) {
quoteText = "\"" + myText + "\"";
}
return quoteText;
}
This applies double quotes to a non-empty dynamic string. Hope this is helpful.
This tiny java method will help you produce standard CSV text of a specific column.
public static String getStandardizedCsv(String columnText) {
    // A field needs quoting if it contains a line feed, a comma, or a double quote
    boolean containsLineFeed = columnText.contains("\n");
    boolean containsCommas = columnText.contains(",");
    boolean containsDoubleQuotes = columnText.contains("\"");
    // Double any embedded quotes; Strings are immutable, so assign the result back
    columnText = columnText.replace("\"", "\"\"");
    if (containsLineFeed || containsCommas || containsDoubleQuotes) {
        columnText = "\"" + columnText + "\"";
    }
    return columnText;
}
Suppose ROM is a String variable which equals "strval".
You can simply do
String value = " \"" + ROM + "\" ";
It will be stored as
value= " "strval" ";
Possible Duplicate:
Query about the trim() method in Java
I am parsing a site's usernames and other information, and each one has a bunch of spaces after it (but spaces in between the words).
For example: "Bob the Builder " or "Sam the welder ". The numbers of spaces vary from name to name. I figured I'd just use .trim(), since I've used this before.
However, it's giving me trouble. My code looks like this:
for (int i = 0; i < splitSource3.size(); i++) {
splitSource3.set(i, splitSource3.get(i).trim());
}
The result is just the same; no spaces are removed at the end.
Thank you in advance for your excellent answers!
UPDATE:
The full code is a bit more complicated, since there are HTML tags that are parsed out first. It goes exactly like this:
for (String s : splitSource2) {
if (s.length() > "<td class=\"dddefault\">".length() && s.substring(0, "<td class=\"dddefault\">".length()).equals("<td class=\"dddefault\">")) {
splitSource3.add(s.substring("<td class=\"dddefault\">".length()));
}
}
System.out.println("\n");
for (int i = 0; i < splitSource3.size(); i++) {
splitSource3.set(i, splitSource3.get(i).substring(0, splitSource3.get(i).length() - 5));
splitSource3.set(i, splitSource3.get(i).trim());
System.out.println(i + ": " + splitSource3.get(i));
}
}
UPDATE:
Calm down. I never said the fault lay with Java, and I never said it was a bug or broken or anything. I simply said I was having trouble with it and posted my code for you to collaborate on and help solve my issue. Note the phrase "my issue" and not "java's issue". I have actually had the code printing out
System.out.println(i + ": " + splitSource3.get(i) + "*");
in a for each loop afterward.
This is how I knew I had a problem.
By the way, the problem has still not been fixed.
UPDATE:
Sample output (minus single quotes):
'0: Olin D. Kirkland '
'1: Sophomore '
'2: Someplace, Virginia 12345<br />VA SomeCity<br />'
'3: Undergraduate '
EDIT the OP rephrased his question at Query about the trim() method in Java, where the issue was found to be Unicode whitespace characters which are not matched by String.trim().
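To illustrate that finding: String.trim() only removes characters up to U+0020, so a trailing non-breaking space (U+00A0) survives it. The sketch below is an assumption about what such input might look like, not the OP's actual data; the names UnicodeTrim and stripUnicode are made up for the example. It strips leading and trailing characters that are either regex whitespace (\s) or Unicode separators (\p{Z}).

```java
class UnicodeTrim {
    // Removes leading and trailing whitespace, including Unicode separator
    // characters such as the non-breaking space that trim() leaves in place.
    static String stripUnicode(String s) {
        return s.replaceAll("^[\\p{Z}\\s]+|[\\p{Z}\\s]+$", "");
    }

    public static void main(String[] args) {
        String scraped = "Bob the Builder\u00A0\u00A0 "; // hypothetical scraped value
        System.out.println("[" + scraped.trim() + "]");        // non-breaking spaces remain
        System.out.println("[" + stripUnicode(scraped) + "]"); // they are removed here
    }
}
```

Spaces between the words are left alone because the pattern is anchored to the start and end of the string.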
It just occurred to me that I used to have this sort of issue when I worked on a screen-scraping project. The key is that the downloaded HTML sometimes contains non-printable characters that are not whitespace characters either. These are very difficult to copy and paste into a browser. I assume this could have happened to you.
If my assumption is correct then you've got two choices:
Use a binary reader and figure out what those characters are - and delete them with String.replace(); E.g.:
private static String cutCharacters(String fromHtml) {
    String result = fromHtml;
    char[] problematicCharacters = {'\000', '\001', '\003'}; // this could be a private static final constant too
    for (char ch : problematicCharacters) {
        // String.replace has no (char, String) overload, so convert the char to a String first
        result = result.replace(String.valueOf(ch), "");
    }
    return result;
}
If you find some sort of reoccurring pattern in the HTML to be parsed then you can use regexes and substrings to cut the unwanted parts. E.g.:
private String getImportantParts(String fromHtml) {
Pattern p = Pattern.compile("(\\w*\\s*)"); //this could be a private static final constant as well.
Matcher m = p.matcher(fromHtml);
StringBuilder buff = new StringBuilder();
while (m.find()) {
buff.append(m.group(1));
}
return buff.toString().trim();
}
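For the first option, it helps to see exactly which characters are in the string before deciding what to delete. A minimal sketch, assuming you already have the scraped text as a Java String (the names CodePointDump and describe are made up for the example): it lists every code point in U+XXXX form so that invisible characters stand out.

```java
import java.util.stream.Collectors;

class CodePointDump {
    // Renders each code point of s as U+XXXX, separated by spaces.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String scraped = "Bob\u00A0"; // hypothetical value ending in a non-breaking space
        System.out.println(describe(scraped));
    }
}
```

Any code point in the output that is not an ordinary letter, digit, or U+0020 space is a candidate for the list of problematic characters.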
Works without a problem for me.
Here is your code, refactored a bit and (maybe) more readable:
final String openingTag = "<td class=\"dddefault\">";
final String closingTag = "</td>";
List<String> splitSource2 = new ArrayList<String>();
splitSource2.add(openingTag + "Bob the Builder " + closingTag);
splitSource2.add(openingTag + "Sam the welder " + closingTag);
for (String string : splitSource2) {
System.out.println("|" + string + "|");
}
List<String> splitSource3 = new ArrayList<String>();
for (String s : splitSource2) {
if (s.length() > openingTag.length() && s.startsWith(openingTag)) {
String nameWithoutOpeningTag = s.substring(openingTag.length());
splitSource3.add(nameWithoutOpeningTag);
}
}
System.out.println("\n");
for (int i = 0; i < splitSource3.size(); i++) {
String name = splitSource3.get(i);
int closingTagBegin = splitSource3.get(i).length() - closingTag.length();
String nameWithoutClosingTag = name.substring(0, closingTagBegin);
String nameTrimmed = nameWithoutClosingTag.trim();
splitSource3.set(i, nameTrimmed);
System.out.println("|" + splitSource3.get(i) + "|");
}
I know that's not a real answer, but I cannot post comments and this code wouldn't fit in a comment anyway, so I made it an answer so that Olin Kirkland can check his code.
I want to split a string using white space as the delimiter, but it should handle quoted strings intelligently. E.g. for a string like
"John Smith" Ted Barry
It should return three strings John Smith, Ted and Barry.
After messing around with it, you can use Regex for this. Run the equivalent of "match all" on:
((?<=("))[\w ]*(?=("(\s|$))))|((?<!")\w+(?!"))
A Java Example:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class Test
{
public static void main(String[] args)
{
String someString = "\"Multiple quote test\" not in quotes \"inside quote\" \"A work in progress\"";
Pattern p = Pattern.compile("((?<=(\"))[\\w ]*(?=(\"(\\s|$))))|((?<!\")\\w+(?!\"))");
Matcher m = p.matcher(someString);
while(m.find()) {
System.out.println("'" + m.group() + "'");
}
}
}
Output:
'Multiple quote test'
'not'
'in'
'quotes'
'inside quote'
'A work in progress'
The regular expression breakdown with the example used above can be viewed here:
http://regex101.com/r/wM6yT9
With all that said, regular expressions should not be the go-to solution for everything - I was just having fun. This example has a lot of edge cases, such as the handling of Unicode characters, symbols, etc. You would be better off using a tried-and-true library for this sort of task. Take a look at the other answers before using this one.
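For comparison, a simpler pattern that avoids lookarounds is an alternation: either a quoted run, or a run of non-whitespace characters. This is a sketch under stated assumptions, not the pattern from the answer above: the names QuotedSplitter and split are made up, escaped quotes inside a quoted string are not supported, and an unmatched quote is simply treated as part of an ordinary word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedSplitter {
    // Group 1: the text inside a pair of double quotes; group 2: a bare word.
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Exactly one of the two groups participates in each match
            tokens.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("\"John Smith\" Ted Barry"));
    }
}
```

Because \S matches any non-whitespace character, this handles non-ASCII words without extra work, unlike \w in the pattern above.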
Try this ugly bit of code.
String str = "hello my dear \"John Smith\" where is Ted Barry";
List<String> list = Arrays.asList(str.split("\\s"));
List<String> resultList = new ArrayList<String>();
StringBuilder builder = new StringBuilder();
for(String s : list){
if(s.startsWith("\"")) {
builder.append(s.substring(1)).append(" ");
} else {
resultList.add((s.endsWith("\"")
? builder.append(s.substring(0, s.length() - 1))
: builder.append(s)).toString());
builder.delete(0, builder.length());
}
}
System.out.println(resultList);
Well, I made a small snippet that does what you want and a few more things. Since you did not specify more conditions, I did not go to the trouble of handling them. I know this is a dirty way, and you can probably get better results with something ready-made, but for the fun of programming, here is the example:
String example = "hello\"John Smith\" Ted Barry lol\"Basi German\"hello";
int wordQuoteStartIndex=0;
int wordQuoteEndIndex=0;
int wordSpaceStartIndex = 0;
int wordSpaceEndIndex = 0;
boolean foundQuote = false;
for(int index=0;index<example.length();index++) {
if(example.charAt(index)=='\"') {
if(foundQuote==true) {
wordQuoteEndIndex=index+1;
//Print the quoted word
System.out.println(example.substring(wordQuoteStartIndex, wordQuoteEndIndex));//here you can remove quotes by changing to (wordQuoteStartIndex+1, wordQuoteEndIndex-1)
foundQuote=false;
if(index+1<example.length()) {
wordSpaceStartIndex = index+1;
}
}else {
wordSpaceEndIndex=index;
if(wordSpaceStartIndex!=wordSpaceEndIndex) {
//print the word in spaces
System.out.println(example.substring(wordSpaceStartIndex, wordSpaceEndIndex));
}
wordQuoteStartIndex=index;
foundQuote = true;
}
}
if(foundQuote==false) {
if(example.charAt(index)==' ') {
wordSpaceEndIndex = index;
if(wordSpaceStartIndex!=wordSpaceEndIndex) {
//print the word in spaces
System.out.println(example.substring(wordSpaceStartIndex, wordSpaceEndIndex));
}
wordSpaceStartIndex = index+1;
}
if(index==example.length()-1) {
if(example.charAt(index)!='\"') {
//print the word in spaces
System.out.println(example.substring(wordSpaceStartIndex, example.length()));
}
}
}
}
this also checks for words that were not separated with a space after or before the quotes, such as the words "hello" before "John Smith" and after "Basi German".
when the string is modified to "John Smith" Ted Barry the output is three strings,
1) "John Smith"
2) Ted
3) Barry
The string in the example is hello"John Smith" Ted Barry lol"Basi German"hello and prints
1)hello
2)"John Smith"
3)Ted
4)Barry
5)lol
6)"Basi German"
7)hello
Hope it helps
This is my own version, cleaned up from http://pastebin.com/aZngu65y (posted in the comment).
It can take care of Unicode. It will clean up all excessive spaces (even inside quotes); this can be good or bad depending on the need. There is no support for escaped quotes.
private static String[] parse(String param) {
String[] output;
param = param.replaceAll("\"", " \" ").trim();
String[] fragments = param.split("\\s+");
int curr = 0;
boolean matched = fragments[curr].matches("[^\"]*");
if (matched) curr++;
for (int i = 1; i < fragments.length; i++) {
if (!matched)
fragments[curr] = fragments[curr] + " " + fragments[i];
if (!fragments[curr].matches("(\"[^\"]*\"|[^\"]*)"))
matched = false;
else {
matched = true;
if (fragments[curr].matches("\"[^\"]*\""))
fragments[curr] = fragments[curr].substring(1, fragments[curr].length() - 1).trim();
if (fragments[curr].length() != 0)
curr++;
if (i + 1 < fragments.length)
fragments[curr] = fragments[i + 1];
}
}
if (matched) {
return Arrays.copyOf(fragments, curr);
}
return null; // Parameter failure (double-quotes do not match up properly).
}
Sample input for comparison:
"sdfskjf" sdfjkhsd "hfrif ehref" "fksdfj sdkfj fkdsjf" sdf sfssd
asjdhj sdf ffhj "fdsf fsdjh"
日本語 中文 "Tiếng Việt" "English"
dsfsd
sdf " s dfs fsd f " sd f fs df fdssf "日本語 中文"
"" "" ""
" sdfsfds " "f fsdf
(2nd line is empty, 3rd line is spaces, last line is malformed).
Please judge against your own expected output, since it may vary, but the baseline is that the 1st case should return [sdfskjf, sdfjkhsd, hfrif ehref, fksdfj sdkfj fkdsjf, sdf, sfssd].
commons-lang has a StrTokenizer class to do this for you, and there is also the java-csv library.
Example with StrTokenizer:
String params = "\"John Smith\" Ted Barry";
// Initialize tokenizer with input string, delimiter character, quote character
StrTokenizer tokenizer = new StrTokenizer(params, ' ', '"');
for (String token : tokenizer.getTokenArray()) {
System.out.println(token);
}
Output:
John Smith
Ted
Barry
I am trying to find a certain tag in an HTML page with Java. All I know is the kind of tag (div, span, ...) and the id. I don't know what it looks like, how much whitespace there is, or what else is in the tag. So I thought about using pattern matching, and I have the following code:
// <tag[any character may be there or not]id="myid"[any character may be there or not]>
String str1 = "<" + Tag + "[.*]" + "id=\"" + search + "\"[.*]>";
// <tag[any character may be there or not]id="myid"[any character may be there or not]/>
String str2 = "<" + Tag + "[.*]" + "id=\"" + search + "\"[.*]/>";
Pattern p1 = Pattern.compile( str1 );
Pattern p2 = Pattern.compile( str2 );
Matcher m1 = p1.matcher( content );
Matcher m2 = p2.matcher( content );
int start = -1;
int stop = -1;
String Anfangsmarkierung = null;
int whichMatch = -1;
while( m1.find() == true || m2.find() == true ){
if( m1.find() ){
//System.out.println( " ... " + m1.group() );
start = m1.start();
//ende = m1.end();
stop = content.indexOf( "<", start );
whichMatch = 1;
}
else{
//System.out.println( " ... " + m2.group() );
start = m2.start();
stop = m2.end();
whichMatch = 2;
}
}
But I get an exception from m1.start() (or m2.start()) when I enter the actual tag without the [.*], and I don't get any match when I use the regular expression :( I really haven't found an explanation for this. I haven't worked with Pattern or Matcher before, so I am a little lost. It would be awesome if anyone could explain what I am doing wrong or how I can do it better.
Thanks in advance :)
... dg
I know that I am broadening your question, but I think that using a dedicated library for parsing HTML documents (such as: http://htmlparser.sourceforge.net/) will be much easier and more accurate than regexps.
Here is an example for what you're trying to do adapted from one of my notes:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String tag = "thetag";
String id = "foo";
String content = "<tag1>\n"+
"<thetag name=\"Tag Name\" id=\"foo\">Some text</thetag>\n" +
"<thetag name=\"AnotherTag\" id=\"foo\">Some more text</thetag>\n" +
"</tag1>";
String patternString = "<" + tag + ".*?name=\"(.*?)\".*?id=\"" + id + "\".*?>";
System.out.println("Content:\n" + content);
System.out.println("Pattern: " + patternString);
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(content);
boolean found = false;
while (matcher.find()) {
System.out.format("I found the text \"%s\" starting at " +
"index %d and ending at index %d.%n",
matcher.group(), matcher.start(), matcher.end());
System.out.println("Name: " + matcher.group(1));
found = true;
}
if (!found) {
System.out.println("No match found.");
}
}
}
You'll notice that the pattern string becomes something like <thetag.*?name="(.*?)".*?id="foo".*?> which will search for tags named thetag where the id attribute is set to "foo".
Note the following:
It uses .*? to lazily (non-greedily) match zero or more of anything (if you don't understand, try removing the ? to see what I mean).
It uses a submatch expression between parenthesis (the name="(.*?)" part) to extract the contents of the name attribute (as an example).
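The difference the ? makes can be seen with a small sketch. The class and method names (GreedyVsLazy, firstMatch) and the sample markup are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<a id=\"x\"><b id=\"y\">";
        // Greedy: .* grabs as much as possible, so the match runs to the last '>'
        System.out.println(firstMatch("<.*>", html));
        // Lazy: .*? grabs as little as possible, so the match stops at the first '>'
        System.out.println(firstMatch("<.*?>", html));
    }
}
```

With the greedy form the two tags are swallowed as one match, which is why the pattern in the answer uses .*? throughout.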
I think each call to find advances through your matches. Calling m1.find() again inside the loop body moves the matcher past the match that the loop condition found; if there is no further match, m1.start() (or m2.start() in the else branch) throws an IllegalStateException. Calling find once per iteration and keeping the result in a flag avoids this problem.
boolean m1Matched = m1.find();
boolean m2Matched = m2.find();
while( m1Matched || m2Matched ) {
if( m1Matched ){
...
}
m1Matched = m1.find();
m2Matched = m2.find();
}
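Put together as something runnable, the "one find() per iteration" rule can be sketched as follows. This is an illustration under stated assumptions rather than a fix of the original code: the names FindOncePerIteration and findTagSpans are made up, a single pattern covers both the > and /> endings, and [^>]* is used for "anything inside the tag" because [.*] in a Java regex is a character class that matches only a literal dot or asterisk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindOncePerIteration {
    // Returns the {start, end} offsets of every opening tag with the given id.
    static List<int[]> findTagSpans(String content, String tag, String id) {
        String regex = "<" + Pattern.quote(tag) + "\\b[^>]*\\bid=\""
                + Pattern.quote(id) + "\"[^>]*>";
        Matcher m = Pattern.compile(regex).matcher(content);
        List<int[]> spans = new ArrayList<>();
        // find() is called exactly once per iteration, so start() and end()
        // always refer to the match that find() just reported
        while (m.find()) {
            spans.add(new int[] { m.start(), m.end() });
        }
        return spans;
    }

    public static void main(String[] args) {
        String content = "<p><div class=\"c\" id=\"myid\">x</div></p>";
        for (int[] span : findTagSpans(content, "div", "myid")) {
            System.out.println(content.substring(span[0], span[1]));
        }
    }
}
```

As the other answer points out, a dedicated HTML parser is still the more robust choice for real pages; this only shows the Matcher usage.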