Apache POI Anomalous Whitespace (Resolved: \u00A0 non-breaking space)


EDIT: Resolved Answer: Was a 00a0 nonbreaking space, not a c0a0 nonbreaking space.
After using Apache POI to convert from docx to plaintext, and then reading the plaintext into Java and trying to parse it I've run into the following problems.
Output:
" "
first characterequals SPACE OR TAB
false
[B@5e481248
[B@66d3c617
ARRAYTOSTRING SPACE: [32]
ARRAYTOSTRING ?????: [-62, -96]
For code:
System.out.println("\t\"" + line.substring(0,1) + "\"\n\tfirst characterequals SPACE OR TAB \n\t" + (line.substring(0,1).equals(" ")
|| line.substring(0,1).equals("\t") ));
System.out.println(line.substring(0,1).getBytes());
System.out.println(" ".getBytes());
System.out.println("ARRAYTOSTRING SPACE: " + Arrays.toString(" ".getBytes()));
System.out.println("ARRAYTOSTRING ?????: " + Arrays.toString(line.substring(0,1).getBytes()));
String.trim() does not get rid of it
String.replaceAll("\\s", "") does not get rid of it
I'm trying to parse an enormous materials document and this is turning into a major hurdle. I have no idea what's going on or how to interface with it, can anyone shed some light on what's going on here?

This translates to the bytes with hex codes c2 a0, which according to this answer is a UTF-8 encoded non-breaking space. Note that this is not really a space and \s will not match it.
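A minimal sketch of one way to deal with this, assuming the goal is simply to treat U+00A0 like an ordinary space before parsing. The class and method names are made up for illustration and are not part of Apache POI.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class NbspCleaner {
    // Hypothetical helper: turns every non-breaking space (U+00A0) into a
    // regular space, then trims ordinary leading and trailing whitespace.
    static String clean(String line) {
        return line.replace('\u00A0', ' ').trim();
    }

    public static void main(String[] args) {
        String line = "\u00A0\u00A0Material: steel";

        // In UTF-8, U+00A0 is the two bytes c2 a0, which Java prints as [-62, -96].
        System.out.println(Arrays.toString("\u00A0".getBytes(StandardCharsets.UTF_8)));

        // trim() only strips characters up to U+0020, so the line is unchanged.
        System.out.println(line.trim().equals(line));

        System.out.println("\"" + clean(line) + "\"");
    }
}
```

If the non-breaking spaces should be removed together with other whitespace in a single pass, a character class such as [\s\u00A0] passed to replaceAll is another option.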

This worked for me:
valor = org.apache.commons.lang3.StringUtils.normalizeSpace(java.text.Normalizer.normalize(valor, java.text.Normalizer.Form.NFD));

Related

How to solve the IllegalDataException in jdom2 library?

I am using jdom 2.0.6 version and received this IllegalDataException:
Error in setText for tokenization:
it fails on calling the setText() method.
Element text = new Element("Text");
text.setText(doc.getText());
It seems that it doesn't accept some characters in the text. Two examples:
Originally Posted by Yvette H( 45) Odd socks, yes, no undies yes, no coat yes, no shoes odd. 🏻
ParryOtter said: Posted
Should I specify an encoding somewhere, or is there some other reason?

In fact you just have to escape your text which contains illegal characters with CDATA :
Element text = new Element("Text");
text.setContent(new CDATA(doc.getText()));
The reverse operation (reading text escaped with CDATA) is transparent in JDOM2; you won't have to escape it back.
For my tests I added an illegal character at the end of my text by creating one from hex value 0x2 like that :
String text = doc.getText();
int hex = 0x2;
text += (char) hex;

Replace empty string with Non-Disclosed

I am a novice in Java so please pardon my inexperience. I have a column (source) like below which has empty strings and I am trying to replace it with Non-Disclosed.
Source
Website
Drive-by
Realtor
Social Media
Billboard
Word of Mouth
Visitor
I tried:
String replacedString = Source.replace("", "Non-Disclosed");
After running the above snippet, everything gets replaced by Non-Disclosed:
Non-Disclosed
Non-Disclosed
Non-Disclosed
............
How can I tackle this issue? Any assistance would be appreciated.

I think you simply have to do : Source.replace("\n\n", "\nNon-Disclosed\n")

I am assuming that your entire column is stored in one string.
In that case you can use ^$ regex to represent empty line (with MULTILINE flag (?m) which will allow ^ and $ to represent start and end of lines).
This approach:
- will work for many line separators (\r, \n, \r\n)
- will not consume those line separators, so we don't need to add them back in the replacement part.
To use regex while replacing we can use replaceAll(regex, replacement) method
DEMO:
String text = "Source\r\n" +
"\r\n" +
"\r\n" +
"\r\n" +
"Website\r\n" +
"Drive-by\r\n" +
"Realtor\r\n" +
"Social Media\r\n" +
"\r\n" +
"Billboard\r\n" +
"\r\n" +
"Word of Mouth\r\n" +
"\r\n" +
"Visitor";
text = text.replaceAll("(?m)^$", "Non-Disclosed");
System.out.println(text);
Output:
Source
Non-Disclosed
Non-Disclosed
Non-Disclosed
Website
Drive-by
Realtor
Social Media
Non-Disclosed
Billboard
Non-Disclosed
Word of Mouth
Non-Disclosed
Visitor

You can use
String replacedString = Source.trim().isEmpty() ? "Non-Disclosed" : Source;
to replace only the "blank" String.
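As a rough sketch, the same check can be applied value by value, assuming the column has already been read into a list with one entry per cell. The class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SourceDefaults {
    // Returns the placeholder for null, empty, or whitespace-only values.
    static String orNonDisclosed(String value) {
        return (value == null || value.trim().isEmpty()) ? "Non-Disclosed" : value;
    }

    // Applies the check to every cell of the column, keeping the order.
    static List<String> fillBlanks(List<String> column) {
        List<String> result = new ArrayList<>();
        for (String cell : column) {
            result.add(orNonDisclosed(cell));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList("Website", "", "Drive-by", "  ", "Visitor");
        System.out.println(fillBlanks(source));
    }
}
```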

Java remove unwanted spaces in a text file and replace with character

I have the following text file. I want to remove the lines and spaces so that the text file has a clear delimiter to process. I cannot think of any way to remove the gaps between lines. Is there a way?
Student+James Smith+Status: Current Student+Student+James Fits+Status: Not a current Student
Textfile
Student
James Smith
Status: Current Student
Student
James Fits
Status: Not a current Student

I know that this
a.replaceAll("\\s+","");
removes whitespaces.
You could remove end of line characters in a similar fashion
a.replaceAll("\n","");
Where 'a' is a String.

Use a regex: take the whole text into a string and
String txt = "whole String";
String formatted = txt.replaceAll("[^A-Za-z0-9]", "-");
This replaces every character that is not a letter or digit, including the + sign and " ", with a "-" sign, so now you have a specific delimiter.

Something like find \s*\r?\n\s* replace +
Trims whitespace and adds delimiter '+'
Result:
Student+James Smith+Status: Current Student+Student+James Fits+Status: Not a current Student
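In Java, the find/replace above could be sketched as follows, assuming the whole file has been read into a single string. The class and method names are illustrative only, and the leading trim() is an addition so that a trailing newline does not become a trailing '+'.

```java
public class DelimiterJoiner {
    // Collapses each run of line breaks (with any surrounding whitespace)
    // into a single '+' delimiter.
    static String joinLines(String text) {
        return text.trim().replaceAll("\\s*\\r?\\n\\s*", "+");
    }

    public static void main(String[] args) {
        String text = "Student\n\nJames Smith\n\n  Status: Current Student\n";
        System.out.println(joinLines(text));
    }
}
```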

Try using this one.
\n+\s*
just use it like this :
yourStrVar.replaceAll("\n+ *", "+")

How to replace invalid characters using java

Invalid XML: Error on line 190: An invalid XML character (Unicode: 0x10) was found in the CDATA section.
I get this error while parsing an XML file, I used String.replaceAll to replace this character but my regex pattern seems to be incorrect.
The following is a different string, but it just gives me back the original string. How should I do it?
str = str.replaceAll("\\^p", "");

Use this:
String replaced = your_original_string.replaceAll("\\x10", "");
\xhh is the Java regex syntax for matching the single character with hexadecimal code hh.
Your error said Unicode: 0x10

str = str.replace("\u0010", "");
Or maybe you need a space
str = str.replace("\u0010", " ");
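If more than one control character may turn up, a broader filter is possible. The sketch below is a generalization rather than what the answers above do: it removes every code point outside the XML 1.0 Char production. The class name, constant, and method are hypothetical.

```java
public class XmlCharFilter {
    // Matches any code point outside the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final String INVALID_XML_10 =
        "[^\\x09\\x0A\\x0D\\x20-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]";

    // Drops every character that XML 1.0 does not allow in document content.
    static String stripInvalidXmlChars(String input) {
        return input.replaceAll(INVALID_XML_10, "");
    }

    public static void main(String[] args) {
        // 0x10 and 0x2 are both illegal in XML 1.0.
        String dirty = "price" + (char) 0x10 + " list" + (char) 0x2;
        System.out.println(stripInvalidXmlChars(dirty));
    }
}
```

Tabs, line feeds, and carriage returns are kept, as are supplementary characters such as emoji, since all of these are legal in XML 1.0.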

Encoding URL query parameters in Java

How does one encode query parameters to go on a url in Java? I know, this seems like an obvious and already asked question.
There are two subtleties I'm not sure of:
Should spaces be encoded on the url as "+" or as "%20"? In chrome if I type in "http://google.com/foo=?bar me" chrome changes it to be encoded with %20
Is it necessary/correct to encode colons ":" as %3A? Chrome doesn't.
Notes:
java.net.URLEncoder.encode doesn't seem to work, it seems to be for encoding data to be form submitted. For example, it encodes space as + instead of %20, and encodes colon which isn't necessary.
java.net.URI doesn't encode query parameters

java.net.URLEncoder.encode(String s, String encoding) can help too. It follows the HTML form encoding application/x-www-form-urlencoded.
URLEncoder.encode(query, "UTF-8");
On the other hand, Percent-encoding (also known as URL encoding) encodes space with %20. Colon is a reserved character, so : will still remain a colon, after encoding.

Unfortunately, URLEncoder.encode() does not produce valid percent-encoding (as specified in RFC 3986).
URLEncoder.encode() encodes everything just fine, except space is encoded to "+". All the Java URI encoders that I could find only expose public methods to encode the query, fragment, path parts etc. - but don't expose the "raw" encoding. This is unfortunate as fragment and query are allowed to encode space to +, so we don't want to use them. Path is encoded properly but is "normalized" first so we can't use it for 'generic' encoding either.
Best solution I could come up with:
return URLEncoder.encode(raw, "UTF-8").replaceAll("\\+", "%20");
If replaceAll() is too slow for you, I guess the alternative is to roll your own encoder...
EDIT: I had this code in here first which doesn't encode "?", "&", "=" properly:
//don't use - doesn't properly encode "?", "&", "="
new URI(null, null, null, raw, null).toString().substring(1);
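The replaceAll workaround above could be wrapped in a small helper along these lines. This is a sketch only; the class and method names are made up, and String.replace is used instead of replaceAll because no regex is needed. A literal '+' in the input is already encoded as %2B by URLEncoder, so every '+' left in the output stands for a space.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class QueryParamEncoder {
    // Form-encodes the value with URLEncoder, then rewrites '+' (space) as %20.
    static String encodeParam(String raw) {
        try {
            return URLEncoder.encode(raw, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeParam("bar me"));
        System.out.println(encodeParam("a+b&c=d"));
    }
}
```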

EDIT: URIUtil is no longer available in more recent versions, better answer at Java - encode URL or by Mr. Sindi in this thread.
URIUtil of Apache httpclient is really useful, although there are some alternatives
URIUtil.encodeQuery(url);
For example, it encodes space as "+" instead of "%20"
Both are perfectly valid in the right context. Although if you really preferred you could issue a string replace.

It is not necessary to encode a colon as %3A in the query, although doing so is not illegal.
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
query = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
It also seems that only percent-encoded spaces are valid, as I doubt that space is an ALPHA or a DIGIT.
Look to the URI specification for more details.

The built in Java URLEncoder is doing what it's supposed to, and you should use it.
A "+" or "%20" are both valid replacements for a space character in a URL. Either one will work.
A ":" should be encoded, as it's a separator character. i.e. http://foo or ftp://bar. The fact that a particular browser can handle it when it's not encoded doesn't make it correct. You should encode them.
As a matter of good practice, be sure to use the method that takes a character encoding parameter. UTF-8 is generally used there, but you should supply it explicitly.
URLEncoder.encode(yourUrl, "UTF-8");

I just want to add another way to resolve this problem.
If your project depends on Spring Web, you can use its utilities.
import org.springframework.web.util.UriUtils;
import java.nio.charset.StandardCharsets;
UriUtils.encode("vip:104534049:5", StandardCharsets.UTF_8);
Output:
vip%3A104534049%3A5

String param="2019-07-18 19:29:37";
param="%27"+param.trim().replace(" ", "%20")+"%27";
I observed in case of Datetime (Timestamp)
URLEncoder.encode(param,"UTF-8") does not work.

The white space character " " is converted into a + sign when using URLEncoder.encode. This differs from other programming languages such as JavaScript, which encode the space character as %20. But it is completely valid, as the spaces in query string parameters are represented by +, and not %20. The %20 is generally used to represent spaces in the URI itself (the URL part before ?).

If spaces are the only problem in your URL, the code below worked fine for me:
String url;
URL myUrl = new URL(url.replace(" ","%20"));
Example: if url is
www.xyz.com?para=hello sir
then the output of myUrl is
www.xyz.com?para=hello%20sir
