Data manipulation on all columns in Dataset with Java API

After reading a CSV file into a Dataset, I want to remove spaces from String-type data using the Java API.
I am using Apache Spark 2.0.0.
Dataset<Row> dataset = sparkSession.read().format("csv").option("header", "true").load("/pathToCsv/data.csv");
Dataset<String> dataset2 = dataset.map(new MapFunction<Row, String>() {
    @Override
    public String call(Row value) throws Exception {
        // But this removes spaces from the first column only
        return value.getString(0).replace(" ", "");
    }
}, Encoders.STRING());
With MapFunction, I am not able to remove spaces from all columns.
But in Scala, I can perform the desired operation in spark-shell as follows:
val ds = spark.read.format("csv").option("header", "true").load("/pathToCsv/data.csv")
val opds = ds.select(ds.columns.map(c => regexp_replace(col(c), " ", "").alias(c)): _*)
The Dataset opds has the data without spaces. I want to achieve the same in Java, but in the Java API the columns method returns String[], and I am not able to apply functional programming to the Dataset.
Input Data
+----------------+----------+-----+---+---+
|               x|         y|    z|  a|  b|
+----------------+----------+-----+---+---+
|     Hello World|John Smith|There|  1|2.3|
|Welcome to world| Bob Alice|Where|  5|3.6|
+----------------+----------+-----+---+---+
Expected Output Data
+--------------+---------+-----+---+---+
|             x|        y|    z|  a|  b|
+--------------+---------+-----+---+---+
|    HelloWorld|JohnSmith|There|  1|2.3|
|Welcometoworld| BobAlice|Where|  5|3.6|
+--------------+---------+-----+---+---+

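Try:
import static org.apache.spark.sql.functions.regexp_replace;
// columns() is a method in the Java API, so it needs parentheses
for (String col : dataset.columns()) {
    dataset = dataset.withColumn(col, regexp_replace(dataset.col(col), " ", ""));
}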

You can use the following regex to remove all whitespace from the string:
value.getString(0).replaceAll("\\s+", "");
About \s+: it matches one or more whitespace characters, as many as possible.
Use replaceAll rather than replace here, because replace treats its first argument as a literal string, not as a regex.
More about the two methods: Difference between String replace() and replaceAll()
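As a small illustration of that difference, here is a sketch that uses only the JDK. The class name `WhitespaceDemo` and the sample string are made up for this example.

```java
class WhitespaceDemo {
    // Removes only the literal space character; tabs and newlines stay.
    static String stripSpaces(String s) {
        return s.replace(" ", "");
    }

    // Removes every run of regex whitespace (space, tab, newline, and so on).
    static String stripWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        String sample = "Hello \tWorld"; // a space followed by a tab
        System.out.println(stripSpaces(sample).contains("\t")); // true: the tab survives
        System.out.println(stripWhitespace(sample));            // HelloWorld
    }
}
```

For data that contains only plain spaces, as in the question's sample, both calls give the same result.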

Related

Compare and Highlight the differences of two dataframes using spark and java

I am using Spark and Java to compare two data frames.
Once I convert my CSV files into data frames, I want to highlight exactly what changed between the two data frames.
They have all their columns in common.
As you can see, the only difference between the data frames below is the row with emp_id 4 in df2.
Dataset<Row> df1 = spark.read().csv("/Users/dataframeOne.csv");
Dataset<Row> df2 = spark.read().csv("/Users/dataframeTwo.csv");
df1.unionAll(df2).except(df1.intersect(df2)).show(true);
Df1
+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|   romin|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+
Df2
+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|  romino|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+
Difference
+------+--------+--------+----------+-------+--------+
|emp_id|emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+--------+--------+----------+-------+--------+
|     4| sanjose|  romino|9848022331|  45123|SanRamon|
+------+--------+--------+----------+-------+--------+
How can I highlight 'romino', the incorrect field, in yellow using Java and Spark?

Highlighting something in Spark depends on your GUI, so as a first step I would suggest detecting the differing values and adding the information about the differences as an additional column to the dataframe.
Step 1: Add a suffix to all columns of the two dataframes and join them over the primary key (emp_id):
import static org.apache.spark.sql.functions.*;
private static Dataset<Row> prefix(Dataset<Row> df, String prefix) {
    for (String col : df.columns()) df = df.withColumnRenamed(col, col + prefix);
    return df;
}
[...]
Dataset<Row> df1 = spark.read().option("header", "true").csv(...);
Dataset<Row> df2 = spark.read().option("header", "true").csv(...);
String[] columns = df1.columns();
Dataset<Row> joined = prefix(df1, "_1").join(prefix(df2, "_2"),
    col("emp_id_1").eqNullSafe(col("emp_id_2")), "full_outer");
Step 2: create a list of column objects that check if the value from one table is different from the other table. This list will later be used as input parameter for map.
List<Column> diffs = new ArrayList<>();
for( String column: columns) {
diffs.add(lit(column));
diffs.add(when(col(column + "_1").eqNullSafe(col(column + "_2")), null)
.otherwise(concat_ws("/", col(column + "_1"), col(column + "_2"))));
}
Step 3: create a new column containing a map with all differences:
joined.withColumn("differences", map(diffs.toArray(new Column[]{})))
.withColumn("differences", map_filter(col("differences"), (k, v) -> not(v.isNull())))
.select("emp_id_1", "differences")
.filter(size(col("differences")).gt(0))
.show(false);
Output:
+--------+--------------------------+
|emp_id_1|differences               |
+--------+--------------------------+
|4       |{emp_name -> romin/romino}|
+--------+--------------------------+

\u000b and other Control Unicode Characters not compatible with docx4j? [duplicate]

The list of valid XML characters is well known, as defined by the spec it's:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
My question is whether or not it's possible to make a PCRE regular expression for this (or its inverse) without actually hard-coding the codepoints, by using Unicode general categories. An inverse might be something like [\p{Cc}\p{Cs}\p{Cn}], except that improperly covers linefeeds and tabs and misses some other invalid characters.

I know this isn't exactly an answer to your question, but it's helpful to have it here:
Regular Expression to match valid XML Characters:
[\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]
So to remove invalid chars from XML, you'd do something like
// filters control characters but allows only properly-formed surrogate sequences
private static Regex _invalidXMLChars = new Regex(
    @"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F\uFEFF\uFFFE\uFFFF]",
    RegexOptions.Compiled);
/// <summary>
/// removes any unusual unicode characters that can't be encoded into XML
/// </summary>
public static string RemoveInvalidXMLChars(string text)
{
    if (string.IsNullOrEmpty(text)) return "";
    return _invalidXMLChars.Replace(text, "");
}
I had our resident regex / XML genius, he of the 4,400+ upvoted post, check this, and he signed off on it.

For systems that internally store code points in UTF-16, it is common to use surrogate pairs (xD800-xDFFF) for code points above 0xFFFF. In those systems you must verify whether you can really use, for example, \u12345 or must specify it as a surrogate pair. (I just found out that in C# you can use \u1234 (16-bit) and \U00001234 (32-bit).)
According to Microsoft, "the W3C recommendation does not allow surrogate characters inside element or attribute names." While searching the W3C website I found C079 and C078, which might be of interest.

I tried this in java and it works:
private String filterContent(String content) {
return content.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", "");
}
Thank you Jeff.
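One caveat with the character class above: it stops at \uFFFD, so as far as I can tell it also strips characters above U+FFFF (emoji, for example), which the [#x10000-#x10FFFF] range of the XML spec allows. Below is a minimal sketch that filters by code point instead, assuming Java 8 or later. The class and method names are illustrative and do not come from any library.

```java
class XmlCharFilter {
    // True if cp is in the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Drops every code point that is not a valid XML 1.0 character.
    // Unpaired surrogates show up as code points in D800-DFFF and are dropped too.
    static String stripInvalid(String content) {
        StringBuilder out = new StringBuilder(content.length());
        content.codePoints()
               .filter(XmlCharFilter::isValidXmlChar)
               .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // The vertical tab between "a" and "b" is removed.
        System.out.println(stripInvalid("a\u000Bb")); // ab
    }
}
```

Working on code points rather than UTF-16 chars keeps well-formed surrogate pairs together, so no look-around is needed for them.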

The above solutions didn't work for me when the invalid character was present in the XML as a hex character reference, e.g.
<element>&#x8;</element>
The following code would break:
string xmlFormat = "<element>{0}</element>";
string invalid = "&#x8;";
string xml = string.Format(xmlFormat, invalid);
xml = Regex.Replace(xml, @"[\x01-\x08\x0B\x0C\x0E\x0F\u0000-\u0008\u000B\u000C\u000E-\u001F]", "");
XDocument.Parse(xml);
It returns:
XmlException: '', hexadecimal value 0x08, is an invalid character.
Line 1, position 14.
The following is the improved regex and fixed the problem mentioned above:
&#x([0-8BCEFbcef]|1[0-9A-Fa-f]);|[\x01-\x08\x0B\x0C\x0E\x0F\u0000-\u0008\u000B\u000C\u000E-\u001F]
Here is a unit test that covers the first 300 Unicode characters and verifies that only invalid characters are removed:
[Fact]
public void validate_that_RemoveInvalidData_only_remove_all_invalid_data()
{
string xmlFormat = "<element>{0}</element>";
string[] allAscii = (Enumerable.Range('\x1', 300).Select(x => ((char)x).ToString()).ToArray());
string[] allAsciiInHexCode = (Enumerable.Range('\x1', 300).Select(x => "&#x" + (x).ToString("X") + ";").ToArray());
string[] allAsciiInHexCodeLoweCase = (Enumerable.Range('\x1', 300).Select(x => "&#x" + (x).ToString("x") + ";").ToArray());
bool hasParserError = false;
IXmlSanitizer sanitizer = new XmlSanitizer();
foreach (var test in allAscii.Concat(allAsciiInHexCode).Concat(allAsciiInHexCodeLoweCase))
{
bool shouldBeRemoved = false;
string xml = string.Format(xmlFormat, test);
try
{
XDocument.Parse(xml);
shouldBeRemoved = false;
}
catch (Exception e)
{
if (test != "<" && test != "&") //these char are taken care of automatically by my convertor so don't need to test. You might need to add these.
{
shouldBeRemoved = true;
}
}
int xmlCurrentLength = xml.Length;
int xmlLengthAfterSanitize = Regex.Replace(xml, @"&#x([0-8BCEF]|1[0-9A-F]);|[\u0000-\u0008\u000B\u000C\u000E-\u001F]", "").Length;
if ((shouldBeRemoved && xmlCurrentLength == xmlLengthAfterSanitize) //it wasn't properly Removed
||(!shouldBeRemoved && xmlCurrentLength != xmlLengthAfterSanitize)) //it was removed but shouldn't have been
{
hasParserError = true;
Console.WriteLine(test + xml);
}
}
Assert.Equal(false, hasParserError);
}
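For readers working in Java, the improved regex from this answer can be tried as follows. This is a sketch, not a complete sanitizer: it keeps the answer's assumptions, so decimal references such as `&#8;` and hex references with leading zeros such as `&#x08;` are not covered. The class name `XmlReferenceSanitizer` is made up for the example.

```java
import java.util.regex.Pattern;

class XmlReferenceSanitizer {
    // First alternative: hex character references for the control characters
    // U+0000-U+0008, U+000B, U+000C and U+000E-U+001F.
    // Second alternative: the same control characters in raw form.
    private static final Pattern INVALID = Pattern.compile(
            "&#x([0-8BCEFbcef]|1[0-9A-Fa-f]);|[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]");

    static String sanitize(String xml) {
        return INVALID.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("<element>&#x8;</element>")); // <element></element>
        System.out.println(sanitize("<element>&#x9;</element>")); // unchanged, tab is valid
    }
}
```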

Another way to remove invalid XML characters in C# is the XmlConvert.IsXmlChar method (available since .NET Framework 4.0):
public static string RemoveInvalidXmlChars(string content)
{
return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
}
Or you can check that all characters are valid XML characters:
public static bool CheckValidXmlChars(string content)
{
return content.All(ch => System.Xml.XmlConvert.IsXmlChar(ch));
}
.Net Fiddle - https://dotnetfiddle.net/v1TNus
For example, the vertical tab (\v) is valid UTF-8 but not valid XML 1.0. Many libraries (including libxml2) miss it and silently output invalid XML.

In PHP, the regex would look like this:
protected function isStringValid($string)
{
$regex = '/[^\x{9}\x{a}\x{d}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u';
return (preg_match($regex, $string, $matches) === 0);
}
This would handle all 3 ranges from the xml specification:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Replace content of each files with it's header values in spark

I have a directory with several text files, and I access all of them in Spark as follows:
JavaRDD<String> filesRDD = sc.textFile(directoryName);
In each file, the first line is a header that contains some mapping values, e.g.:
"1,apple|4,banana|3,lemon"
That means that if there is a "3" in the content, it maps to "lemon".
An example of the content:
I like 1
John eat 3 and 1
and so on.
What I need to do is first filter lines from the content and then substitute the original values from the mapping. For example, I first filter by the string "like" and get "I like 1"; then I apply the mapping and get "I like apple".
Please note that the mapping header is different for each file. How can I do this? Since I'm new to Spark, I don't have much idea how to achieve this.

Do you want something like this?
var fruitPair = sc.parallelize(List("1,apple","4,banana","3,lemon")).map{ str =>
var temp = str.split(",")
(temp(0), temp(1))
}
fruitPair.toDF.show()
+---+------+
| _1|    _2|
+---+------+
|  1| apple|
|  4|banana|
|  3| lemon|
+---+------+
var contents = List("I like 1", "John eat 3 and 1")
var results = contents.map { content =>
var tmpContent = content
fruitPair.collect.foreach { item =>
var index = tmpContent.indexOf(item._1)
if (index >= 0) {
tmpContent = tmpContent.replace(item._1, item._2)
}
}
tmpContent
}
results.foreach{ it => println(it) }
I like apple
John eat lemon and apple
results: List[String] = List(I like apple, John eat lemon and apple)
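Since the question uses the Java API, here is a plain-Java sketch of the per-file logic only: parse the header into a map, keep the lines that contain a filter word, and substitute the mapped tokens. It deliberately leaves Spark out, so reading each file together with its own header is not shown. The class name `HeaderMapping` and its methods are hypothetical, and the sketch assumes tokens are separated by single spaces, so unlike a plain `replace` it would not rewrite the "1" inside "12".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HeaderMapping {
    // Parses a header such as 1,apple|4,banana|3,lemon into {1=apple, 4=banana, 3=lemon}.
    static Map<String, String> parseHeader(String header) {
        Map<String, String> mapping = new HashMap<>();
        for (String pair : header.split("\\|")) {
            String[] kv = pair.split(",", 2);
            if (kv.length == 2) {
                mapping.put(kv[0].trim(), kv[1].trim());
            }
        }
        return mapping;
    }

    // Keeps the lines containing keyword and replaces each space-separated
    // token that appears as a key in the mapping.
    static List<String> filterAndMap(List<String> lines, String keyword, Map<String, String> mapping) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            if (!line.contains(keyword)) {
                continue;
            }
            String[] tokens = line.split(" ");
            for (int i = 0; i < tokens.length; i++) {
                tokens[i] = mapping.getOrDefault(tokens[i], tokens[i]);
            }
            result.add(String.join(" ", tokens));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> mapping = parseHeader("1,apple|4,banana|3,lemon");
        List<String> lines = Arrays.asList("I like 1", "John eat 3 and 1");
        System.out.println(filterAndMap(lines, "like", mapping)); // [I like apple]
    }
}
```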

String#replaceAll() to replace anything but a = group

I have a parameter of key-value like this:
sign="aaaabbbb="
And I want to get the parameter name sign and the value "aaaabbbb=" (with the quote signs).
I thought I could split the string on = to get the first element of the array, which is the parameter name, and use String.replaceAll() to remove the sign= to get the value. Anyway, here is my sample code:
public class TestStringReplace {
public static void main(String[] argvs){
String s = "sign=\"aaaabbbb=\"";
String[] ss = s.split("=");
String value = s.replaceAll("\\[^=]+=","");
//EDIT: s.replaceAll("[^=]+=","") will not do the job either.
System.out.println(ss[0]);
System.out.println(value);
}
}
but the output shows this:
sign
sign="aaaabbbb="
Why does \\[^=]+= not match sign= and replace it with an empty string here? I'm quite new to Java regex and need some help.
Thanks in advance.

In Java you can use the following:
String str = "sign=\"aaaabbbb=\"";
String var1 = str.substring(0, str.indexOf('='));
String var2 = str.substring(str.indexOf('=')+1);
System.out.println("var1="+var1+", var2="+var2);
The above would have the following output:
var1=sign, var2="aaaabbbb="

Try the following regex ^\\w+= with replaceAll() instead of your regex:
public class TestStringReplace {
public static void main(String[] argvs){
String s = "sign=\"aaaabbbb=\"";
String[] ss = s.split("=");
String value = s.replaceAll("^\\w+=","");
System.out.println(ss[0]);
System.out.println(value);
}
}
This will remove the sign=.
You can see the DEMO here.
Note that with your "\\[^=]+=" regex you were trying to match the character [ literally in the beginning of your regex.
And it explains why you got sign="aaaabbbb=" as a result with replaceAll() which didn't replace anything because there's no match.

You're probably better off with an actual Pattern and back-references here.
For instance:
String[] test = {
"sign=\"aaaabbbb=\"",
// assuming a HTTP GET-styled parameter list
"blah?sign=\"aaaabbbb=\"",
"foo?sign=\"aaaabbbb=\"&blah=\"hodor\""
};
// | group 1: literal "sign"
// | | literal key-value delimiter and double quote
// | | | group 2: any character reluctantly quantified
// | | | | literal ending double quote
// | | | | | look-ahead for either "&" or end
// | | | | |
Pattern p = Pattern.compile("(sign)=\"(.+?)\"(?=$|&)");
Matcher m = null;
for (String s: test) {
m = p.matcher(s);
while (m.find()) {
System.out.printf(
"Found key: \"%s\" and value: \"%s\"%n", m.group(1), m.group(2)
);
}
}
Output
Found key: "sign" and value: "aaaabbbb="
Found key: "sign" and value: "aaaabbbb="
Found key: "sign" and value: "aaaabbbb="
Notes
I'm assuming a HTTP GET styled parameter list, but maybe you don't need to actually check for a next parameter key-value pair delimiter (i.e. &) - in which case you can remove the & part
I'm also assuming you want the "s out of your value back-reference, which kind of makes the following & check useless
Your current pattern for the replaceAll invocation will match as follows:
// "\\["  literal "[" (double-escaped, so no character class is opened)
// "^"    start-of-input anchor, which can never match after a literal "["
// "="    literal "="
// "]+"   literal "]", greedily quantified (1+ occurrences)
// "="    literal "="
"\\[^=]+="
Finally, if you really, really want to use String#replaceAll for this, here's a slightly different pattern than the one above:
for (String s: test) {
System.out.println(
s.replaceAll(
".*(sign)=\"(.+?)\"(?=$|&).*",
"Found key: \"$1\" and value: \"$2\""
)
);
}
It still uses back-references and will produce the same result, albeit in an uglier way: you can't reuse the $1 and $2 group values, since you're creating a new String replacing the original one.
Last possible solution, using String#split. This is the ugliest, as it won't work well with a list of parameters:
for (String s: test) {
System.out.println(
// | negative look-behind for start of input
// | | literal "="
// | | | literal "
// | | |
Arrays.toString(s.split("(?<!^)=\""))
);
}
Output
[sign, aaaabbbb]
[blah?sign, aaaabbbb] --> yuck
[foo?sign, aaaabbbb, &blah, hodor"] --> yuck again

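The double backslash is a mistake: it escapes the [ into a literal [, which never matches here.
Instead, do this:
String name = s.replaceAll("=.*", "");
// replaceFirst, not replaceAll: replaceAll would also remove the text up to the second "=" and leave only a quote
String value = s.replaceFirst(".*?=", "");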
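A short sketch for trying the question's patterns side by side, using only the JDK. The class name `KeyValueDemo` is illustrative. It also shows `split` with a limit of 2, which splits on the first `=` only, and it assumes the input contains at least one `=`.

```java
class KeyValueDemo {
    // Everything before the first "=".
    static String name(String s) {
        return s.split("=", 2)[0];
    }

    // Everything after the first "=", including any later "=" characters.
    // Assumes s contains at least one "="; otherwise the index is out of bounds.
    static String value(String s) {
        return s.split("=", 2)[1];
    }

    public static void main(String[] args) {
        String s = "sign=\"aaaabbbb=\"";
        System.out.println(name(s));  // sign
        System.out.println(value(s)); // "aaaabbbb="
        // The escaped "[" makes the pattern look for a literal "[", so nothing is replaced.
        System.out.println(s.replaceAll("\\[^=]+=", ""));
        // Without the escape, replaceAll removes both sign= and "aaaabbbb= and leaves a lone quote.
        System.out.println(s.replaceAll("[^=]+=", ""));
    }
}
```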

java indexOf returns -1 when it's supposed to return a positive number

I'm new to network programming and have never used Java for it before.
I'm writing a server in Java and have a problem processing messages from the client. I used
DataInputStream inputFromClient = new DataInputStream( socket.getInputStream() );
while ( true ) {
// Receive radius from the client
byte[] r=new byte[256000];
inputFromClient.read(r);
String Ffss =new String(r);
System.out.println( "Received from client: " + Ffss );
System.out.print("Found Index :" );
System.out.println(Ffss.indexOf( '\a' ));
System.out.print("Found Index :" );
System.out.println(Ffss.indexOf( ' '));
String Str = new String("add 12341\n13243423");
String SubStr1 = new String("\n");
System.out.print("Found Index :" );
System.out.println( Str.indexOf( SubStr1 ));
}
With this code and the sample input asg 23\aag, the output is:
Found Index :-1
Found Index :3
Found Index :9
Clearly, if the String object is created from scratch, indexOf can locate "\n".
Why does the code have a problem locating \a when the String is obtained from the DataInputStream?

Try String abc = new String("\\a"); you need \\ to get a backslash in a string, otherwise the \ starts an "escape sequence".

It looks like the a is being escaped.
Have a look at this article to understand how the back slash affects a string.
Escape Sequences
A character preceded by a backslash (\) is an escape
sequence and has special meaning to the compiler. The following table
shows the Java escape sequences:
| Escape Sequence | Description|
|:----------------|------------:|
| \t | Insert a tab in the text at this point.|
| \b | Insert a backspace in the text at this point.|
| \n | Insert a newline in the text at this point.|
| \r | Insert a carriage return in the text at this point.|
| \f | Insert a formfeed in the text at this point.|
| \' | Insert a single quote character in the text at this point.|
| \" | Insert a double quote character in the text at this point.|
| \\ | Insert a backslash character in the text at this point.|
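To tie this back to the question: text that arrives over a socket is not processed by the compiler, so a client that sends asg 23\aag delivers a real backslash followed by the letter a, and the Java literal "\\a" is what matches it. The sketch below uses a ByteArrayInputStream in place of the socket, which is an assumption made to keep the example self-contained, and the class name `BackslashSearch` is made up. It also decodes only the bytes that were actually read instead of the whole buffer.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class BackslashSearch {
    // Reads up to 1024 bytes once and decodes only the bytes that were read.
    static String readOnce(InputStream in) {
        try {
            byte[] buffer = new byte[1024];
            int n = in.read(buffer);
            return n <= 0 ? "" : new String(buffer, 0, n, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In source code the backslash must be doubled; the resulting text is asg 23\aag.
        byte[] wire = "asg 23\\aag".getBytes(StandardCharsets.UTF_8);
        String received = readOnce(new ByteArrayInputStream(wire));
        System.out.println(received.indexOf("\\a")); // 6: backslash followed by 'a'
        System.out.println(received.indexOf(' '));   // 3
    }
}
```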
