I have written a program that reads a CSV file using a delimiter. I also have a use case where I need to build a string from the delimited data, so I created a regex to split the column data on the delimiter.
The challenge is that when the delimiter appears inside double quotes, the data should not be split there. Spark handles the quoted delimiter correctly, but my regex does not.
private static void readFromSourceFile(SparkSession sparkSession) {
String delType = ",";
final String regex = "["+delType+ "]{"+delType.length()+"}(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";
Dataset<Row> csv = sparkSession
.read().option("delimiter",delType)
.option("header",false)
.option("inferSchema",true)
.csv("src/main/resources/quotes2.csv");
char separator= '\u0001';
csv.show(false);
List<Row> df = csv.collectAsList();
String split[] = df.get(0).toString().split(regex);
System.out.println(split.length);
Arrays.stream(split).forEach(System.out::println);
}
The output of the program is -
The area marked in red is the string in double quotes; that column shouldn't have been split.
Input CSV file -
New,667.88,In Stock.,Now,true,true,B09D7MQ69X,B09D7MQ69X,NUC10i5FNHN 16GB+512GB,"Intel NUC10 NUC10i5FNHN Home & Business Desktop Mini PC,10th Generation Intel® Core™ i5-10210U, Upto 4.2 GHz, 4 core, 8 Thread, 25W Intel® UHD Graphics, 16GB RAM, 512GB PCIe SSD, Win 10 Pro 8GB RAM + 256GB SSD",false,"【Intel NUC10i5FNHN with RAM & SSD】 Intel NUC10 NUC10i5FNHN Mini PC/HTPC With All New Parts Assembled. Our store is HOT selling Intel NUC11 i5, i7, NUC10 i5, i7, NUC8, Barebone and Mini PC with various sizes of RAM or SSD. If you need to know more, please click on our Store Name:""GEEK + Computer Mall"" --------- ""Products"", OR click ""Visit the GEEK+ Store"" under the title.:BRK:【Quad Core Processor & Graphic 】 10th Generation Intel Core i5-10210U,1.6 GHz – 4.2 GHz Turbo, 4 core, 8 thread, 6MB Cache,25W Intel UHD Graphics, up to 1.0 GHz, 80 EU units.:BRK:【Storage Expansion Options】 Kingston 16GB DDR4 RAM"
Can someone suggest or provide a hint to improve the regex?
I find that splitting by complex delimiters leads to convoluted regular expressions, as your code showcases. In fact, the sub-expression "["+delType+ "]{"+delType.length()+"}" doesn’t make a lot of sense, and I strongly suspect this is a bug in your code (for instance if delType is <> your code would also split on occurrences of ><).
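To make the suspected bug concrete, here is a small sketch comparing the character-class construction from the question with a quoted delimiter. The class and method names are made up for illustration only.

```java
import java.util.regex.Pattern;

final class DelimiterPatternDemo {
    // Construction used in the question: a character class repeated delim.length() times.
    // For "<>" this yields [<>]{2}, which matches "<>", "><", "<<" and ">>".
    static String charClassPattern(String delim) {
        return "[" + delim + "]{" + delim.length() + "}";
    }

    // Quoting the delimiter makes the regex match exactly that character sequence.
    static String quotedPattern(String delim) {
        return Pattern.quote(delim);
    }
}
```

With the delimiter <>, splitting "a><b" on charClassPattern("<>") breaks the string in two, while quotedPattern("<>") leaves it intact.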
As an alternative, consider using a regular expression that exhaustively describes the lexical syntax of your input, and then match all tokens. This works particularly well when using named groups.
In your case (CSV with quoted fields, using doubled-up quotes to escape them), the lexical syntax of the tokens can be described by the following token types:
A delimiter (which is configurable, so we might need to handle strings of length > 1)
A quoted field (arbitrary characters surrounded by "…", where the characters in … can be anything except ", but they can also include "")
An unquoted field (arbitrary characters up to the next delimiter).
As a regular expression, this can be written as follows in Java:
Pattern.compile(
"(?<delim>" + d + ")|" +
"\"(?<quotedField>(?:[^\"]|\"\")*)\"|" +
"(?<field>.*?(?:(?=" + d + ")|$))"
);
Where d is defined as Pattern.quote(delim) (the quoting is important, in case the delimiter contains a regex special char!).
The only slight complication here is the last token type, because we are matching everything up to the next delimiter (.*? matches non-greedily), or until the end of the string.
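Afterwards, we iterate over all matches and collect those where either the group field or quotedField is set. One subtlety: the field alternative can also match the empty string (for example at the very end of the row), so the loop tracks whether a field is actually expected at the current position. Putting it all together inside a method:
// Requires java.util.ArrayList, java.util.List, java.util.regex.Matcher and java.util.regex.Pattern
static String[] parseCsvRow(String row, String delim) {
    final String d = Pattern.quote(delim);
    final Pattern pattern = Pattern.compile(
        "(?<delim>" + d + ")|" +
        "\"(?<quotedField>(?:[^\"]|\"\")*)\"|" +
        "(?<field>.*?(?:(?=" + d + ")|$))"
    );
    final Matcher matcher = pattern.matcher(row);
    final List<String> results = new ArrayList<>();
    // True at the start of the row and right after a delimiter.
    boolean expectField = true;
    while (matcher.find()) {
        if (matcher.group("delim") != null) {
            // A leading delimiter, or two delimiters in a row, enclose an empty field.
            if (expectField) {
                results.add("");
            }
            expectField = true;
        } else if (matcher.group("quotedField") != null) {
            // Turn the doubled-up quotes back into single quotes.
            results.add(matcher.group("quotedField").replace("\"\"", "\""));
            expectField = false;
        } else if (expectField) {
            results.add(matcher.group("field"));
            expectField = false;
        }
        // Otherwise this is a leftover match right after a complete field; ignore it.
    }
    return results.toArray(new String[0]);
}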
In real code I would wrap this in a CsvParser class instead of a single method, where the constructor creates the pattern based on the delimiter so that the pattern doesn’t have to be recompiled for each row.
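A minimal sketch of that CsvParser idea, assuming the same token pattern as above. The method name parseRow and the expectField flag are assumptions added for illustration. The flag is there because the field alternative can also match an empty string (for example at the very end of a row), so the loop only accepts a field at the start of the row or right after a delimiter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical wrapper that compiles the token pattern once per delimiter. */
final class CsvParser {
    private final Pattern pattern;

    CsvParser(String delim) {
        final String d = Pattern.quote(delim);
        this.pattern = Pattern.compile(
            "(?<delim>" + d + ")|"
            + "\"(?<quotedField>(?:[^\"]|\"\")*)\"|"
            + "(?<field>.*?(?:(?=" + d + ")|$))");
    }

    String[] parseRow(String row) {
        final Matcher matcher = pattern.matcher(row);
        final List<String> results = new ArrayList<>();
        // True at the start of the row and right after a delimiter.
        boolean expectField = true;
        while (matcher.find()) {
            if (matcher.group("delim") != null) {
                // A leading delimiter, or two delimiters in a row, enclose an empty field.
                if (expectField) {
                    results.add("");
                }
                expectField = true;
            } else if (matcher.group("quotedField") != null) {
                // Turn the doubled-up quotes back into single quotes.
                results.add(matcher.group("quotedField").replace("\"\"", "\""));
                expectField = false;
            } else if (expectField) {
                results.add(matcher.group("field"));
                expectField = false;
            }
            // Otherwise: a leftover match right after a complete field; skip it.
        }
        return results.toArray(new String[0]);
    }
}
```

One instance can then be reused for every row, for example new CsvParser(",").parseRow(line) inside the loop over the collected rows.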
I am working on a project for a beginners' Java course. I need to read a file and turn each line into an object, which I will eventually print out as a job listing. (Please, no ArrayList suggestions.)
So far I have gotten that file saved into a String[], which contains strings like this:
"iOS/Android Mobile App Developer - Java, Swift","Freshop, Inc.","$88,000 - $103,000 a year"
"Security Engineer - Offensive Security","Indeed","$104,000 - $130,000 a year"
"Front End Developer - CSS/HTML/Vue","HiddenLevers","$80,000 - $130,000 a year"
What I'm having trouble with is splitting each string into its three parts so they can be passed to my JobService createJob method, shown here:
public Job createJob(String[] Arrs) {
Job job = new Job();
job.setTitle(Arrs[0]);
job.setCompany(Arrs[1]);
job.setCompensation(Arrs[2]);
return job;
}
I am terrible at regex but know that .split(",") will break up the salary portion as well. If anyone could help figure out a reliable way to split these strings to fit into my method, I would be grateful!
Also, I'm super new, so please use language that commoners like me will understand...
You need a slightly better split criterion, something like "\"," (a quote followed by a comma), for example...
String text = "\"iOS/Android Mobile App Developer - Java, Swift\",\"Freshop, Inc.\",\"$88,000 - $103,000 a year\"";
String[] parts = text.split("\",");
for (String part : parts) {
System.out.println(part);
}
Which prints...
"iOS/Android Mobile App Developer - Java, Swift
"Freshop, Inc.
"$88,000 - $103,000 a year"
Now, if you want to remove the quotes, you can do something like....
String text = "\"iOS/Android Mobile App Developer - Java, Swift\",\"Freshop, Inc.\",\"$88,000 - $103,000 a year\"";
String[] parts = text.split("\",");
for (String part : parts) {
System.out.println(part.replace("\"", ""));
}
Regular Expression
No, I'm not that good at it either. I tried...
String[] parts = text.split("^\"|\",\"|\"$");
And while this works, it produces 4 elements, not 3 (first match is blank).
You could remove the leading and trailing quotes and then just split on "," (quote, comma, quote) instead...
text = text.substring(1, text.length() - 1);
String[] parts = text.split("\",\"");
Trim the leading and trailing quotes
Split on "," (quote, comma, quote)
As code:
String[] columns = line.replaceAll("^\"|\"$", "").split("\",\"");
^"|"$ means "a quote at start or a quote at end"
The regex for the split is just a literal ","
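Here is a small self-contained sketch of the trim-then-split approach. The class and method names are invented for illustration, and it assumes that every field in the line is wrapped in double quotes and that no field itself contains the three-character sequence ",". It returns a plain String[], so no ArrayList is involved.

```java
final class QuotedLineSplitter {
    // Step 1: strip the quote at the very start and the quote at the very end.
    // Step 2: split on the literal three-character sequence quote-comma-quote.
    static String[] splitQuotedLine(String line) {
        return line.replaceAll("^\"|\"$", "").split("\",\"");
    }
}
```

The resulting array can be handed straight to createJob, since its three elements are the title, the company and the compensation, without surrounding quotes.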
Input -
String ipXmlString = "<root>"
+ "<accntNoGrp><accntNo>1234567</accntNo></accntNoGrp>"
+ "<accntNoGrp><accntNo>6663823</accntNo></accntNoGrp>"
+ "</root>";
I tried the following to mask the values inside the <accntNo> tags:
String op = ipXmlString.replaceAll("<accntNo>(.+?)</accntNo>", "######");
But the above code replaces each whole <accntNo> element, tags included:
<root><accntNoGrp>######</accntNoGrp><accntNoGrp>######</accntNoGrp></root>
Expected Output:
<root><accntNoGrp><accntNo>#####67</accntNo></accntNoGrp><accntNoGrp><accntNo>#####23</accntNo></accntNoGrp></root>
How can I achieve this using Java regex? Could someone help?
Your replacement is wrong, you need to include the <accntNo> tag in the actual replacement. Also, it appears that you want to show the last two characters/numbers of the account number. In this case, we can capture this information during the match and use it in the replacement.
Code:
String op = ipXmlString.replaceAll("<accntNo>(?:.+?)(.{2})</accntNo>", "<accntNo>#####$1</accntNo>");
Explanation:
<accntNo> match an opening tag
(?:.+?) match, but do not capture, anything up until the last two characters
(.{2}) two characters before closing tag (and capture this)
</accntNo> match a closing tag
Note here that by using ?: inside the parentheses in the pattern, we tell the regex engine not to capture that group. There is no point in capturing anything before the last two characters of the account number because we don't want to use it.
The $1 quantity in the replacement refers to the first capture group. In this case, it is the last two characters of the account number. Hence, we build the replacement string you want this way.
Demo here:
Rextester
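If the account numbers can have different lengths, a fixed run of # characters no longer lines up with the masked digits. The sketch below is one possible variation, not the approach from the answer above: it uses Matcher.appendReplacement to emit one # per hidden character. The class name is made up, and using [^<] instead of . is an assumption so that a match can never run across a tag boundary.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AccountMasker {
    // Group 2: characters to hide; group 3: the last two characters, kept visible.
    private static final Pattern ACCNT =
        Pattern.compile("(<accntNo>)([^<]*?)([^<]{2})(</accntNo>)");

    static String mask(String xml) {
        Matcher m = ACCNT.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // One '#' for every character that is being hidden.
            StringBuilder hashes = new StringBuilder();
            for (int i = 0; i < m.group(2).length(); i++) {
                hashes.append('#');
            }
            String replacement = m.group(1) + hashes + m.group(3) + m.group(4);
            // quoteReplacement keeps any '$' or '\' in the value literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Values shorter than two characters do not match the pattern and are left unchanged.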
Try this code:
public static void main(String[] args) {
String ipXmlString = "<root>"
+ "<accntNoGrp><accntNo>1234567</accntNo></accntNoGrp>"
+ "<accntNoGrp><accntNo>6663823</accntNo></accntNoGrp>"
+ "</root>";
String replaceAll = ipXmlString.replaceAll("\\d+", "######");
System.out.println(replaceAll);
}
Prints:
<root><accntNoGrp><accntNo>######</accntNo></accntNoGrp><accntNoGrp><accntNo>######</accntNo></accntNoGrp></root>
Does Java have a built-in way to escape arbitrary text so that it can be included in a regular expression? For example, if my users enter "$5", I'd like to match that exactly rather than a "5" after the end of input.
Since Java 1.5, yes:
Pattern.quote("$5");
The difference between Pattern.quote and Matcher.quoteReplacement was not clear to me until I saw the following example:
s.replaceFirst(Pattern.quote("text to replace"),
Matcher.quoteReplacement("replacement text"));
It may be too late to respond, but you can also use Pattern.LITERAL, which makes the whole pattern string be treated as literal characters:
Pattern.compile(textToFormat, Pattern.LITERAL);
I think what you're after is \Q$5\E. Also see Pattern.quote(s) introduced in Java5.
See Pattern javadoc for details.
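The options mentioned in these answers can be compared side by side in a small sketch. The class and method names are invented for illustration; Pattern.quote and Pattern.LITERAL are the standard java.util.regex API.

```java
import java.util.regex.Pattern;

final class LiteralMatchDemo {
    // Wraps the user's text in \Q...\E so every character is matched literally.
    static boolean containsQuoted(String haystack, String userText) {
        return Pattern.compile(Pattern.quote(userText)).matcher(haystack).find();
    }

    // Same effect using the LITERAL flag instead of quoting the string.
    static boolean containsLiteralFlag(String haystack, String userText) {
        return Pattern.compile(userText, Pattern.LITERAL).matcher(haystack).find();
    }

    // Unprotected: "$5" is read as "end of input followed by 5", which can never match.
    static boolean containsRaw(String haystack, String userText) {
        return Pattern.compile(userText).matcher(haystack).find();
    }
}
```

For the haystack "Price: $5" and the user text "$5", the first two methods find a match and the raw version does not.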
First off, if you use replaceAll(), you DON'T use Matcher.quoteReplacement(), and the text to be substituted in includes a $1, it won't put a 1 at the end. It will look at the search regex for the first matching group and sub THAT in. That's what $1, $2 or $3 means in the replacement text: matching groups from the search pattern.
I frequently plug long strings of text into .properties files, then generate email subjects and bodies from those. Indeed, this appears to be the default way to do i18n in Spring Framework. I put XML tags, as placeholders, into the strings and I use replaceAll() to replace the XML tags with the values at runtime.
I ran into an issue where a user input a dollars-and-cents figure, with a dollar sign. replaceAll() choked on it, with the following showing up in a stack trace:
java.lang.IndexOutOfBoundsException: No group 3
at java.util.regex.Matcher.start(Matcher.java:374)
at java.util.regex.Matcher.appendReplacement(Matcher.java:748)
at java.util.regex.Matcher.replaceAll(Matcher.java:823)
at java.lang.String.replaceAll(String.java:2201)
In this case, the user had entered "$3" somewhere in their input and replaceAll() went looking in the search regex for the third matching group, didn't find one, and puked.
Given:
// "msg" is a string from a .properties file, containing "<userInput />" among other tags
// "userInput" is a String containing the user's input
replacing
msg = msg.replaceAll("<userInput \\/>", userInput);
with
msg = msg.replaceAll("<userInput \\/>", Matcher.quoteReplacement(userInput));
solved the problem. The user could put in any kind of characters, including dollar signs, without issue. It behaved exactly the way you would expect.
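A compact, self-contained version of this fix might look as follows. The class and method names are hypothetical, and the <userInput /> placeholder is the one from the description above.

```java
import java.util.regex.Matcher;

final class PlaceholderFiller {
    // Replaces the <userInput /> placeholder with the user's text, taken literally.
    static String fill(String msg, String userInput) {
        // quoteReplacement escapes '$' and '\' so they are not read as group references.
        return msg.replaceAll("<userInput />", Matcher.quoteReplacement(userInput));
    }
}
```

Calling replaceAll with the raw input "$3.50" as the replacement throws an IndexOutOfBoundsException, because $3 is read as a reference to a third capturing group, whereas fill("You paid <userInput /> today.", "$3.50") inserts the text unchanged.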
To build a protected pattern, you can prefix every character except letters and digits with a backslash (the replacement string "\\\\$1" below does this). After that, you can put your own special symbols into the protected pattern, so that it works not as plain quoted text but as a real pattern of your own, without any special symbols coming from the user.
public class Test {
public static void main(String[] args) {
String str = "y z (111)";
String p1 = "x x (111)";
String p2 = ".* .* \\(111\\)";
p1 = escapeRE(p1);
p1 = p1.replace("x", ".*");
System.out.println( p1 + "-->" + str.matches(p1) );
//.*\ .*\ \(111\)-->true
System.out.println( p2 + "-->" + str.matches(p2) );
//.* .* \(111\)-->true
}
public static String escapeRE(String str) {
//Pattern escaper = Pattern.compile("([^a-zA-Z0-9])");
//return escaper.matcher(str).replaceAll("\\\\$1");
return str.replaceAll("([^a-zA-Z0-9])", "\\\\$1");
}
}
Pattern.quote("blabla") works nicely.
Pattern.quote() works nicely. It encloses the text in "\Q" and "\E", and it also takes care of any "\E" that already occurs in the text.
However, if you need real regular expression escaping (or custom escaping), you can use this code:
String someText = "Some/s/wText*/,**";
System.out.println(someText.replaceAll("[-\\[\\]{}()*+?.,\\\\^$|#\\s]", "\\\\$0"));
This prints: Some/s/wText\*/\,\*\*
Code for example and tests:
String someText = "Some\\E/s/wText*/,**";
System.out.println("Pattern.quote: " + Pattern.quote(someText));
System.out.println("Full escape: " + someText.replaceAll("[-\\[\\]{}()*+?.,\\\\^$|#\\s]", "\\\\$0"));
The ^ (negation) symbol at the start of a character class is used to match any character that is not in that class.
I am using the Pattern and Matcher classes from Java.
I am reading a template text and I want to replace:
1. src="scripts/test.js" with src="scripts/test.js?Id=${Id}"
2. src="Servlet?Template=scripts/test.js" with src="Servlet?Id=${Id}&Template=scripts/test.js"
I'm using the below code to execute case 2. :
//strTemplateText is the Template's text
Pattern p2 = Pattern.compile("(?i)(src\\s*=\\s*[\"'])(.*?\\?)");
Matcher m2 = p2.matcher(strTemplateText);
strTemplateText = m2.replaceAll("$1$2Id=" + CurrentESSession.getAttributeString("Id", "") + "&");
The above code works correctly for case 2. but how can I create a regex to combine both cases 1. and 2. ?
Thank you
You don't need a regular expression. If you change case 2 to
replace Servlet?Template=scripts/test.js with Servlet?Template=scripts/test.js&Id=${Id}
all you need to do is check whether the source string contains a ?. If it does not, add ?Id=${Id}; otherwise add &Id=${Id}.
After all
if (strTemplateText.contains("?")) {
    strTemplateText += "&Id=${Id}";
}
else {
    strTemplateText += "?Id=${Id}";
}
does the job.
Or even shorter
strTemplateText += strTemplateText.contains("?") ? "&Id=${Id}" : "?Id=${Id}";
Your actual question doesn't match up so well with your example code. The example code seems to handle a more general case, and it substitutes an actual session Id value instead of a reference to one. The code below takes the example code to be more indicative of what you really want, but the same approach could be adapted to what you asked in the question text (using a simpler regex, even).
With that said, I don't see any way to do this with a single replaceAll() because the replacement text for the two cases is too different. You could nevertheless do it with one regex, in one pass, if you used a different approach:
Pattern p2 = Pattern.compile("(src\\s*=\\s*)(['\"])([^?]*?)(\\?.*?)?\\2",
Pattern.CASE_INSENSITIVE);
Matcher m2 = p2.matcher(strTemplateText);
StringBuffer revisedText = new StringBuffer();
while (m2.find()) {
// Append the whole match except the closing quote
m2.appendReplacement(revisedText, "$1$2$3$4");
// group 4 is the optional query string; null if none was matched
revisedText.append((m2.group(4) == null) ? '?' : '&');
revisedText.append("Id=");
revisedText.append(CurrentESSession.getAttributeString("Id", ""));
// append a copy of the opening quote
revisedText.append(m2.group(2));
}
m2.appendTail(revisedText);
strTemplateText = revisedText.toString();
That relies on BetaRide's observation that query parameter order is not significant, although the same general approach could accommodate a requirement to make Id the first query parameter, as in the question. It also matches the end of the src attribute in the pattern to the correct closing delimiter, which your pattern does not address (though it needs to do to avoid matching text that spans more than one src attribute).
Do note that nothing in the above prevents a duplicate query parameter 'Id' being added; this is consistent with the regex presented in the question. If you want to avoid that with the above approach then in the loop you need to parse the query string (when there is one) to determine whether an 'Id' parameter is already present.
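For experimenting with this approach outside the application, the loop can be wrapped in a small standalone method. This is only a sketch: the class name, the method name and the id parameter are assumptions that stand in for the CurrentESSession lookup, which is not available here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SrcIdAppender {
    private static final Pattern SRC = Pattern.compile(
        "(src\\s*=\\s*)(['\"])([^?]*?)(\\?.*?)?\\2", Pattern.CASE_INSENSITIVE);

    // id is a hypothetical stand-in for the session attribute value.
    static String appendId(String template, String id) {
        Matcher m = SRC.matcher(template);
        StringBuffer revised = new StringBuffer();
        while (m.find()) {
            // Copy the whole match except the closing quote.
            m.appendReplacement(revised, "$1$2$3$4");
            // Group 4 is the optional query string; null means there was none.
            revised.append(m.group(4) == null ? '?' : '&');
            // Appended directly, so a '$' in the id is not treated as a group reference.
            revised.append("Id=").append(id);
            // Close the attribute with the same quote character that opened it.
            revised.append(m.group(2));
        }
        m.appendTail(revised);
        return revised.toString();
    }
}
```

With id "42", src="scripts/test.js" becomes src="scripts/test.js?Id=42" and src="Servlet?Template=scripts/test.js" becomes src="Servlet?Template=scripts/test.js&Id=42".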
You can do the following:
//strTemplateText is the Template's text
String strTemplateText = "src=\"scripts/test.js\"";
strTemplateText = "src=\"Servlet?Template=scripts/test.js\"";
java.util.regex.Pattern p2 = java.util.regex.Pattern.compile("(src\\s*=\\s*[\"'])(.*?)((?:[\\w\\s\\d.\\-\\#]+\\/?)+)(?:[?]?)(.*?\\=.*)*(['\"])");
java.util.regex.Matcher m2 = p2.matcher(strTemplateText);
System.out.println(m2.matches());
strTemplateText = m2.replaceAll("$1$2$3?Id=" + CurrentESSession.getAttributeString("Id", "") + (m2.group(4)==null? "":"&") + "$4$5");
System.out.println(strTemplateText);
It works on both cases.
If you are using Java 7 or later, you could use named capturing groups to make the regex more readable and easier to debug.
I would like a regular expression that will extract email addresses from a String (using Java regular expressions).
Here's the regular expression that really works.
I've spent an hour surfing the web and testing different approaches,
and most of them didn't work although Google top-ranked those pages.
I want to share with you a working regular expression:
[_A-Za-z0-9-]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})
Here's the original link:
http://www.mkyong.com/regular-expressions/how-to-validate-email-address-with-regular-expression/
I had to add some dashes to allow for them. So a final result in Java:
final String MAIL_REGEX = "([_A-Za-z0-9-]+)(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,})";
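To actually extract addresses rather than validate one, the pattern can be driven with Matcher.find(). The following is a minimal sketch under the assumption that the separator in the pattern is the usual @ sign; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmailExtractor {
    private static final Pattern MAIL = Pattern.compile(
        "[_A-Za-z0-9-]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,})");

    // Returns every substring of text that matches the pattern, in order of appearance.
    static List<String> extract(String text) {
        List<String> emails = new ArrayList<>();
        Matcher m = MAIL.matcher(text);
        while (m.find()) {
            emails.add(m.group());
        }
        return emails;
    }
}
```

Like any simple email regex, this only covers common address shapes, so it is best treated as a heuristic rather than a full validator.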
Install this regex tester plugin into Eclipse, and you'll have a whale of a time testing regexes:
http://brosinski.com/regex/.
Points to note:
In the plugin, use only one backslash for character escape. But when you transcribe the regex into a Java/C# string you have to double them, because you are performing two escapes: first escaping the backslash for the Java/C# string mechanism, and then the actual regex character escape.
Surround the sections of the regex whose text you wish to capture with round brackets (parentheses). Then you can use the group functions in Java or C# regex to find out the values of those sections.
([_A-Za-z0-9-]+)(\.[_A-Za-z0-9-]+)@([A-Za-z0-9]+)(\.[A-Za-z0-9]+)
For example, using the above regex, the following string
abc.efg@asdf.cde
yields
start=0, end=16
Group(0) = abc.efg@asdf.cde
Group(1) = abc
Group(2) = .efg
Group(3) = asdf
Group(4) = .cde
Group 0 is always the capture of the whole string matched.
If you do not enclose any section in parentheses, you will only be able to detect a match but not capture the text.
It might be less confusing to create a few regexes rather than one long catch-all regex, since you can test them programmatically one by one and then decide which regexes should be consolidated. This helps especially when you find a new email pattern that you had never considered before.
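The group listing above can be reproduced in Java with Matcher.group(int). This is a small sketch; the class and method names are invented, and it assumes the separator in the sample pattern is the @ sign.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmailGroups {
    private static final Pattern P = Pattern.compile(
        "([_A-Za-z0-9-]+)(\\.[_A-Za-z0-9-]+)@([A-Za-z0-9]+)(\\.[A-Za-z0-9]+)");

    // Returns group 0 (the whole match) followed by groups 1..4, or an empty array if nothing matches.
    static String[] groups(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }
}
```

Note that in the Java source each regex backslash is doubled, as described in the first point above.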
a little late but ok.
Here is what I use. Just paste it into the Firebug console and run it. Then look on the web page for a textarea (most likely at the bottom of the page) that contains a comma-separated list of all email addresses found in A tags.
var jquery = document.createElement('script');
jquery.setAttribute('src', 'http://code.jquery.com/jquery-1.10.1.min.js');
document.body.appendChild(jquery);
var list = document.createElement('textarea');
list.setAttribute('id', 'emaillist');
document.body.appendChild(list);
var lijst = "";
$("#emaillist").val("");
$("a").each(function(idx,el){
var mail = $(el).filter('[href*="@"]').attr("href");
if(mail){
lijst += mail.replace("mailto:", "")+",";
}
});
$("#emaillist").val(lijst);
Android's built-in email address pattern (android.util.Patterns.EMAIL_ADDRESS) works perfectly:
public static List<String> getEmails(@NonNull String input) {
List<String> emails = new ArrayList<>();
Matcher matcher = Patterns.EMAIL_ADDRESS.matcher(input);
while (matcher.find()) {
int matchStart = matcher.start(0);
int matchEnd = matcher.end(0);
emails.add(input.substring(matchStart, matchEnd));
}
return emails;
}