Handle special characters while writing XML through Java


Through a Java program I am creating an XML file of stock holders. The generated XML looks like this:
<?xml version="1.0" encoding="UTF-8" ?>
<urlset>
<url>
<loc>FirstName-LastName/id/</loc>
</url>
</urlset>
Some stock holders have special characters in their names, e.g. A. Pitkänen. For such a stock holder, the generated XML looks like this:
<?xml version="1.0" encoding="UTF-8" ?>
<urlset>
<url>
<loc>/A-Pitk寥n/ELS_1005091/</loc>
</url>
</urlset>
This makes the XML invalid. Why is this happening? The Java program is:
FileWriter fstream = new FileWriter("c:\\stock-holders.xml");
final BufferedWriter out = new BufferedWriter(fstream);
try {
// Making Connection and query the stock holders to get the resultset
String aId = "";
String aFName = "";
String aLName = "";
out.write("<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n");
out.write("<urlset>\n");
while (rs.next()) {
String url = "";
aFName = rs.getString(2);
if (StringUtils.isNotEmpty(aFName) ) {
aFName = aFName.trim();
url += aFName;
}
aLName = rs.getString(3);
if (StringUtils.isNotEmpty(aLName)) {
aLName = aLName.trim();
url += "-" + aLName;
}
aId = rs.getString(1);
if (StringUtils.isNotEmpty(aId)) {
aId = aId.trim();
url += "/" + aId + "/";
}
out.write("<url>\n");
out.write("<loc>" + url + "</loc>\n");
out.write("</url>\n");
out.flush();
}
out.write("</urlset>");
out.close();
}

Since your XML file is supposed to be written in UTF-8 encoding, you need to configure your Writers to use that encoding rather than the system default one:
FileOutputStream fstream = new FileOutputStream("c:\\stock-holders.xml");
OutputStreamWriter writer = new OutputStreamWriter(fstream, "UTF-8");
final BufferedWriter out = new BufferedWriter(writer);
Note that FileWriter is not recommended for this very reason: before Java 11 it could not be configured to use any encoding other than the default one.
Also, perhaps it would be better to use some existing API for constructing XML files (such as DOM or StAX) rather than do it by string concatenation. For example, your solution doesn't take into account that your data may contain characters that are illegal in XML and should be escaped.
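The StAX suggestion above could look roughly like the following. This is only a sketch: the class name UrlSetWriter, the method writeUrlSet and the sample values are made up for illustration, and the real program would feed it the URLs built from the result set. The writer is created with an explicit UTF-8 encoding, and writeCharacters takes care of escaping characters such as & and <.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class UrlSetWriter {

    // Writes one <url><loc>...</loc></url> entry per location, encoded as UTF-8.
    static void writeUrlSet(OutputStream out, List<String> locations) throws XMLStreamException {
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
        try {
            xml.writeStartDocument("UTF-8", "1.0");
            xml.writeStartElement("urlset");
            for (String loc : locations) {
                xml.writeStartElement("url");
                xml.writeStartElement("loc");
                xml.writeCharacters(loc); // escapes markup characters such as & and <
                xml.writeEndElement();    // </loc>
                xml.writeEndElement();    // </url>
            }
            xml.writeEndElement();        // </urlset>
            xml.writeEndDocument();
            xml.flush();
        } finally {
            xml.close(); // closes the XML writer only, not the underlying stream
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample data; \u00e4 is the letter a with diaeresis.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writeUrlSet(buffer, Arrays.asList("/A-Pitk\u00e4nen/ELS_1005091/", "/Smith-&-Sons/42/"));
        System.out.println(new String(buffer.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

To write to a file instead, pass a FileOutputStream for the target path as the first argument.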

I suspect the problem is that you are using a FileWriter instead of a FileOutputStream hooked up to an OutputStreamWriter that specifies "UTF-8" as the encoding.

You can use something shorter:
PrintWriter out = new PrintWriter("c:\\stock-holders.xml", "UTF-8");
This constructor is available since Java 1.5.
The Documentation says:
Creates a new PrintWriter, without automatic line flushing, with the
specified file name and charset. This convenience constructor creates
the necessary intermediate OutputStreamWriter, which will encode
characters using the provided charset.
You need to call flush() (or close(), which also flushes) once all write calls are done.

Related

Check whether data can be represented in a specified encoding

I'm writing a Java program that saves data to UTF8 text files. However, I'd also like to provide the option to save to IBM437 for compatibility with an old program that uses the same sort of data files.
How can I check to see if the data the user is trying to save isn't representable in IBM437? At the moment the file saves without complaining but results in unusual characters being replaced with question marks.
I'd prefer it if I could show a warning to the user that the data they are saving isn't supported in IBM437. The user could then have the option of manually replacing characters with the nearest ASCII equivalent.
Current code for saving is:
String encoding = "UTF-8";
if (forceLegacySupport)
{
// Force character encoding to IBM437
encoding = "IBM437";
}
BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(saveFile.getAbsoluteFile()), encoding));
IOController.writeFileToDisk(bw);
bw.close();

As JB Nizet mentioned in the comments, you can use a CharsetEncoder to check whether the text can be encoded.
For creating text as a UTF-8 String, here is just a suggestion from my end:
public static char[] cookie = "HEADER_COOKIE".toCharArray();
byte[] cookieInBytes = new byte[COOKIE_SIZE];
for(int i=0;i<cookie.length;i++)
{
if(i < cookie.length)
cookieInBytes[i] = (byte)cookie[i];
}
String headerStr = new String(cookieInBytes,StandardCharsets.UTF_8);
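The CharsetEncoder idea mentioned at the top of this answer could be sketched as follows. The class name EncodingCheck and the method firstUnencodable are hypothetical; the only API relied on is CharsetEncoder.canEncode. The method walks the text one code point at a time and reports the index of the first one the target charset cannot represent, which is enough to warn the user before saving.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class EncodingCheck {

    // Returns the index of the first character the charset cannot encode,
    // or -1 if the whole text is representable.
    static int firstUnencodable(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        int i = 0;
        while (i < text.length()) {
            // Step by code point so surrogate pairs are tested as one unit.
            int length = Character.charCount(text.codePointAt(i));
            if (!encoder.canEncode(text.subSequence(i, i + length))) {
                return i;
            }
            i += length;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Assumes the running JDK ships the IBM437 charset.
        Charset legacy = Charset.forName("IBM437");
        System.out.println(firstUnencodable("plain ASCII text", legacy));
        System.out.println(firstUnencodable("price: 5\u20ac", legacy));
    }
}
```

A result other than -1 could trigger the warning dialog, with the returned index used to highlight the offending character.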

Gujarati text in Java String

I have a Gujarati Bible and am trying to insert each verse into a MySQL database using a parser written in Java. When I assign Gujarati text to a Java String variable, it shows junk in the debugger.
E.g. this is my Gujarati text:
હે યહોવા તું મારો દેવ છે;
I assign it to Java String variable as shown below
verse._verseText = "હે યહોવા તું મારો દેવ છે;";
What I see in the debug window is all junk characters. Any help is appreciated. If you need more information, let me know and I will provide it.
UPDATE
Pasting my parser code here
private Boolean Insert(String _text)
{
BibleVerse verse = new BibleVerse();
String[] data = _text.split("\\|");
try
{
if (data[0].equals(bookName) || bookName.equals("All"))
{
verse._Version = "Gujarati";
verse._book = data[0];
verse._chapter = Integer.parseInt(data[1]);
verse._verse = Integer.parseInt(data[2]);
verse._verseText = new String(data[3].getBytes(), "UTF-8");
_bibleDatabase.Insert(verse);
pcs.firePropertyChange("logupdate", null, data[0] + " " + data[1] + "," + data[2] + " - INSERTED.");
}
else
{
pcs.firePropertyChange("logupdate", null, data[0] + " " + data[1] + "," + data[2] + " - SKIPPED.");
}
return true;
}
catch(Exception e)
{
pcs.firePropertyChange("logupdate", null, "ERROR : " + e.getMessage());
return false;
}
}
Here is the sample line from the text file
Isaiah|25|1|હે યહોવા તું મારો દેવ છે; હું તને મોટો માનીશ, હું તારા નામની સ્તુતિ કરીશ; કેમકે તેં અદભુત કાર્યો કર્યાં છે, તેં વિશ્વાસુપણે તથા સત્યતાથી પુરાતન સંકલ્પો પાર પાડ્યા છે.
UPDATE
Here is the code where I open & read file.
try
{
FileReader _file = new FileReader(this._filename);
_bufferedReader = new BufferedReader(_file);
SwingWorker parseWorker = new SwingWorker()
{
@Override
protected Object doInBackground() throws Exception
{
String line;
String[] data;
int lineno=0;
BibleVerse verse = new BibleVerse();
while ((line = _bufferedReader.readLine()) != null)
{
++lineno;
pcs.firePropertyChange("pgbupdate", null, lineno);
Insert(line);
}
_bufferedReader.close();
return null;
}
@Override
protected void done()
{
pcs.firePropertyChange("logupdate", null, "Parsing complete.");
}
};
parseWorker.execute();
}
catch (Exception e)
{
pcs.firePropertyChange("logupdate", null, "ERROR : " + e.getMessage());
}

The problem is this:
FileReader _file = new FileReader(this._filename);
This reads the file using the platform's default charset. If your data file is not encoded in that charset, you will get incorrect characters.
On Windows, the default charset is usually a legacy code page such as windows-1252, not UTF-8. On most other systems it's UTF-8, and from Java 18 onward UTF-8 is the default on every platform.
The easiest solution is to find out the actual encoding of your data file, so you can specify it explicitly in the code. The encoding of a file can be determined with the file command on Unix and Linux systems. In Windows, you may need to examine it with a binary editor, or install something like Cygwin, which has a file command of its own.
Once you know what it is, you should pass it explicitly to the construction of your Reader:
// Replace "UTF-8" with the actual encoding of your data file (if it's not UTF-8).
Reader _file = new InputStreamReader(new FileInputStream(this._filename), "UTF-8");
Once you've done that, there is no reason for any other part of your code to concern itself with bytes. You should replace this:
verse._verseText = new String(data[3].getBytes(), "UTF-8");
with this:
verse._verseText = data[3];
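As a rough, self-contained illustration of the advice above, the sketch below reads pipe-separated lines with an explicit UTF-8 charset and uses the text field as-is, with no byte conversion. The class name VerseFileReader and the method readVerseTexts are invented for this example, and it assumes the data file really is UTF-8.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class VerseFileReader {

    // Reads "book|chapter|verse|text" lines as UTF-8 and returns the text field of each line.
    static List<String> readVerseTexts(Path file) throws IOException {
        List<String> texts = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] data = line.split("\\|");
                if (data.length >= 4) {
                    texts.add(data[3]); // already a correctly decoded String
                }
            }
        }
        return texts;
    }

    public static void main(String[] args) throws IOException {
        // Sample Gujarati text written as Unicode escapes so this source file stays ASCII.
        String verse = "\u0AB9\u0AC7 \u0AAF\u0AB9\u0ACB\u0AB5\u0ABE";
        Path sample = Files.createTempFile("verses", ".txt");
        Files.write(sample, ("Isaiah|25|1|" + verse + "\n").getBytes(StandardCharsets.UTF_8));
        System.out.println(readVerseTexts(sample).get(0).equals(verse)); // prints true
        Files.delete(sample);
    }
}
```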

"How to inject Chinese characters using JavaScript?" is not quite the same problem, but I think the same solution may work in this case:
If the script is inline (in the HTML file), then it's using the encoding of the HTML file and you won't have an issue.
If the script is loaded from another file:
Your text editor must save the file in an appropriate encoding such as UTF-8 (it's probably doing this already if you're able to save it, close it, and reopen it with the characters still displaying correctly).
Your web server must serve the file with the right HTTP header specifying that it's UTF-8 (or whatever the encoding happens to be, as determined by your text editor settings). Here's an example of how to do this with PHP: "Set http header to utf-8 php".
If you can't have your web server do this, try to set the charset attribute on your script tag. I tried to see what the spec said should happen in the case of mismatching charsets defined by the tag and the HTTP headers, but couldn't find anything concrete, so just test and see if it helps.
If that doesn't work, place your script inline.

Java strings are Unicode, so Gujarati text can be stored in a String directly. If the encoding of your source file is in doubt, you can write the characters as Unicode escapes instead. See this chart of the Gujarati block: http://jrgraphix.net/r/Unicode/0A80-0AFF
So, for example, the first code point in that block:
char example = '\u0A80';
String result = Character.toString(example);

Csv: search for String and replace with another string

I have a .csv file that contains:
scenario, custom, master_data
1, ${CUSTOM}, A_1
I have a string:
a, b, c
and I want to replace 'custom' with 'a, b, c'. How can I do that and save to the existing .csv file?

Probably the easiest way is to read in one file and write to another file as you go, modifying it on a per-line basis.
You could try something with tokenizers. This may not be completely correct for your input/output, but you can adapt it to your CSV file formatting:
BufferedReader reader = new BufferedReader(new FileReader("input.csv"));
BufferedWriter writer = new BufferedWriter(new FileWriter("output.csv"));
String custom = "${CUSTOM}"; // the placeholder to look for
String replace = "a, b, c";
for (String line = reader.readLine(); line != null; line = reader.readLine())
{
// Rebuild the line token by token, swapping in the replacement where the placeholder appears
StringBuilder output = new StringBuilder();
StringTokenizer tokenizer = new StringTokenizer(line, ",");
while (tokenizer.hasMoreTokens())
{
String token = tokenizer.nextToken().trim();
if (output.length() > 0)
output.append(", ");
output.append(token.equals(custom) ? replace : token);
}
writer.write(output.toString());
writer.newLine();
}
reader.close();
writer.close();
If this is for a one-off thing, it also has the benefit of not having to research regular expressions (which are quite powerful and useful, and good to know, but maybe for a later date?).

Have a look at "Can you recommend a Java library for reading (and possibly writing) CSV files?"
Once the values have been read, search for strings/values that start with ${ and end with }, using a Java regular expression like \$\{(\w+)\}. Then look up the found key in a map to get the related value. Java Properties would be a good candidate.
Then write a new CSV file.
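The regex-plus-map approach described above might be sketched like this. The class PlaceholderReplacer and the method expand are hypothetical names, a plain Map stands in for Properties, and leaving unknown placeholders untouched is a design choice of this sketch rather than something the answer specifies.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderReplacer {

    // Matches ${NAME} and captures NAME.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\w+)\\}");

    // Replaces every ${KEY} whose key is in the map; unknown placeholders are left as they are.
    static String expand(String line, Map<String, String> values) {
        Matcher matcher = PLACEHOLDER.matcher(line);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String replacement = values.getOrDefault(matcher.group(1), matcher.group());
            // quoteReplacement keeps '$' and '\' in the replacement from being interpreted.
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("CUSTOM", "a, b, c");
        System.out.println(expand("1, ${CUSTOM}, A_1", values)); // prints 1, a, b, c, A_1
    }
}
```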

Since your replacement string is quite unique you can do it quickly without complicated parsing by just reading your file into a buffer, and then converting that buffer into a string. Replace all occurrences of the text you wish to replace with your target text. Then convert the string to a buffer and write that back to the file...
Pattern.quote is required because replaceAll treats its first argument as a regular expression, and ${CUSTOM} contains regex metacharacters. If you don't quote it you may run into unexpected results.
Also, it's generally not smart to overwrite your source file in place. It's best to create a new file, then delete the old one and rename the new one to the old name. That way an error halfway through will not destroy your data.
final Path yourPath = Paths.get("Your path");
byte[] buff = Files.readAllBytes(yourPath);
String s = new String(buff, Charset.defaultCharset());
s = s.replaceAll(Pattern.quote("${CUSTOM}"), "a, b, c");
Files.write(yourPath, s.getBytes());
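The advice about not overwriting the source file directly could be followed along these lines. This is a sketch under assumptions: SafeReplace and replaceInFile are made-up names, the file is assumed to fit in memory, and String.replace is used instead of replaceAll so no regex quoting is needed.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SafeReplace {

    // Writes the modified content to a temporary file next to the original,
    // then moves it over the original only after the write has succeeded.
    static void replaceInFile(Path file, String target, String replacement, Charset charset)
            throws IOException {
        String content = new String(Files.readAllBytes(file), charset);
        String updated = content.replace(target, replacement); // literal, not regex
        Path directory = file.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(directory, "replace", ".tmp");
        try {
            Files.write(temp, updated.getBytes(charset));
            Files.move(temp, file, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            Files.deleteIfExists(temp); // no-op after a successful move
        }
    }

    public static void main(String[] args) throws IOException {
        Path csv = Files.createTempFile("scenario", ".csv");
        Files.write(csv, "scenario, custom, master_data\n1, ${CUSTOM}, A_1\n"
                .getBytes(StandardCharsets.UTF_8));
        replaceInFile(csv, "${CUSTOM}", "a, b, c", StandardCharsets.UTF_8);
        System.out.print(new String(Files.readAllBytes(csv), StandardCharsets.UTF_8));
        Files.delete(csv);
    }
}
```

If the write fails partway, the original file is still intact and only the temporary file is discarded.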

Reading from property file containing utf 8 character

I am reading a property file which consists of a message in the UTF-8 character set.
Problem
The output is not in the appropriate format. I am using an InputStream.
The property file looks like
username=LBSUSER
password=Lbs#123
url=http://localhost:1010/soapfe/services/MessagingWS
timeout=20000
message=Spanish character are = {á é í, ó,ú ,ü, ñ, ç, å, Á, É, Í, Ó, Ú, Ü, Ñ, Ç, ¿, °, 4° año = cuarto año, €, ¢, £, ¥}
And I am reading the file like this,
Properties props = new Properties();
props.load(new FileInputStream("uinsoaptest.properties"));
String username = props.getProperty("username", "test");
String password = props.getProperty("password", "12345");
String url = props.getProperty("url", "12345");
int timeout = Integer.parseInt(props.getProperty("timeout", "8000"));
String messagetext = props.getProperty("message");
System.out.println("This is soap msg : " + messagetext);
The output of the above message is
You can see the message in the console after the line
{************************ SOAP MESSAGE TEST***********************}
I will be obliged if I can get any help reading this file properly. I can read this file with another approach but I am looking for less code modification.

Use an InputStreamReader with Properties.load(Reader reader):
FileInputStream input = new FileInputStream(new File("uinsoaptest.properties"));
props.load(new InputStreamReader(input, Charset.forName("UTF-8")));
As a method, this may resemble the following:
private Properties read( final Path file ) throws IOException {
final var properties = new Properties();
try( final var in = new InputStreamReader(
new FileInputStream( file.toFile() ), StandardCharsets.UTF_8 ) ) {
properties.load( in );
}
return properties;
}
Don't forget to close your streams. Java 7 introduced StandardCharsets.UTF_8.
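To see why the Reader matters: Properties.load(InputStream) is documented to decode its input as ISO 8859-1, whereas Properties.load(Reader) uses whatever charset the Reader was built with. The following small comparison (class and method names are invented for the example, and an in-memory byte array stands in for the file) loads the same UTF-8 bytes both ways.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class PropertiesEncodingDemo {

    // Loads via InputStream: the bytes are decoded as ISO 8859-1.
    static String loadViaStream(byte[] bytes) throws IOException {
        Properties props = new Properties();
        props.load(new ByteArrayInputStream(bytes));
        return props.getProperty("message");
    }

    // Loads via Reader: the bytes are decoded as UTF-8.
    static String loadViaReader(byte[] bytes) throws IOException {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            props.load(reader);
        }
        return props.getProperty("message");
    }

    public static void main(String[] args) throws IOException {
        // "a\u00f1o" is the Spanish word for "year", written with an escape for n-tilde.
        byte[] utf8 = "message=a\u00f1o".getBytes(StandardCharsets.UTF_8);
        System.out.println(loadViaReader(utf8).equals("a\u00f1o"));  // prints true
        System.out.println(loadViaStream(utf8).equals("a\u00f1o"));  // prints false
    }
}
```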

Use props.load(new FileReader("uinsoaptest.properties")) instead. By default it uses the encoding Charset.forName(System.getProperty("file.encoding")), which can be set to UTF-8 with the command-line parameter -Dfile.encoding=UTF-8. Calling System.setProperty("file.encoding", "UTF-8") at runtime is not reliable, because the JVM typically reads the default charset once at startup.

If somebody uses the @Value annotation, they could try StringUtils:
@Value("${title}")
private String pageTitle;
public String getPageTitle() {
return StringUtils.toEncodedString(pageTitle.getBytes(Charset.forName("ISO-8859-1")), Charset.forName("UTF-8"));
}

You should specify the UTF-8 encoding when you read the file. FileInputStream has no constructor that takes an encoding (it only reads raw bytes), so wrap it in an InputStreamReader:
new InputStreamReader(new FileInputStream("uinsoaptest.properties"), "UTF-8");
If you want the JVM to read UTF-8 files by default, you can add the following to the JAVA_TOOL_OPTIONS environment variable:
-Dfile.encoding=UTF-8

If anybody comes across this problem in Kotlin, like me:
The accepted solution of @Würgspaß works here as well. The corresponding Kotlin syntax:
Instead of the usual
val properties = Properties()
filePath.toFile().inputStream().use { stream -> properties.load(stream) }
I had to use
val properties = Properties()
InputStreamReader(FileInputStream(filePath.toFile()), StandardCharsets.UTF_8).use { stream -> properties.load(stream) }
With this, special UTF-8 characters are loaded correctly from the properties file given in filePath.

how to exclude tag from XML String in java

I am writing a piece of code in Java to send data to and receive data from a web page. But when I receive the XML data, it is still wrapped in tags like this:
<?xml version='1.0'?>
<document>
<title> TEST </title>
</document>
How can I get the data without the tags in Java?
This is what I tried. The function writes the data and should then get the response and use it in a System.out.println.
public static String User_Select(String username, String password) {
String mysql_type = "1"; // 1 = Select
try {
String urlParameters = "mysql_type=" + mysql_type + "&username=" + username + "&password=" + password;
URL url = new URL("http://localhost:8080/HTTP_Connection/index.php");
URLConnection conn = url.openConnection();
conn.setDoOutput(true);
OutputStreamWriter writer = new OutputStreamWriter(conn.getOutputStream());
writer.write(urlParameters);
writer.flush();
String line;
BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
while ((line = reader.readLine()) != null) {
System.out.println(line);
//System.out.println("Het werkt!!");
}
writer.close();
reader.close();
return line;
} catch (IOException iox) {
iox.printStackTrace();
return null;
}
}
Thanks in advance

I would suggest simply using a regex to read the XML and get the tag content that you are after.
That simplifies what you need to do and avoids pulling in additional (unnecessary) libraries.
There are also lots of Stack Overflow questions on this topic, "Regex for xml parsing" and "In RegEx, I want to find everything between two XML tags" to mention just two of them.

use DOMParser in java.
Check further in java docs

Use an XML Parser to Parse your XML. Here is a link to Oracle's Tutorial
Oracle Java XML Parser Tutorial

Simply pass the InputStream from URLConnection
Document doc = DocumentBuilderFactory.
newInstance().
newDocumentBuilder().
parse(conn.getInputStream());
From there you could use XPath to query the contents of the document, or simply walk the document model.
Take a look at Java API for XML Processing (JAXP) for more details
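A minimal sketch of the XPath route, using only the JAXP classes bundled with the JDK. The class name TitleExtractor and the method extractTitle are illustrative, and the XPath expression assumes the document/title structure shown in the question; in the real program the InputStream would be conn.getInputStream().

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class TitleExtractor {

    // Parses the XML and returns the trimmed text content of /document/title.
    static String extractTitle(InputStream xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(xml);
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("/document/title", doc).trim();
    }

    public static void main(String[] args) throws Exception {
        String response = "<?xml version='1.0'?><document><title> TEST </title></document>";
        InputStream in = new ByteArrayInputStream(response.getBytes(StandardCharsets.UTF_8));
        System.out.println(extractTitle(in)); // prints TEST
    }
}
```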

You have to use an XML parser. In your case a good fit is jsoup, which fetches data from the web and parses both XML and HTML. It will load the data, parse it, and give you what you want. Here is an example of how it works:
1. XML From an URL
String xml = Jsoup.connect("http://localhost:8080/HTTP_Connection/index.php")
.get().toString();
Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
String myTitle = doc.select("title").first().text(); // myTitle now contains "TEST"
Edit:
To send GET or POST parameters with your request, use this code:
String xml = Jsoup.connect("http://localhost:8080/HTTP_Connection/index.php")
.data("param1Name", "param1Value")
.data("param2Name", "param2Value").get().toString();
You can use get() to invoke the HTTP GET method or post() to invoke the HTTP POST method.
2. XML From String
You can use jsoup to parse XML data in a String:
String xmlData = "<?xml version='1.0'?><document> <title> TEST </title> </document>";
Document doc = Jsoup.parse(xmlData, "", Parser.xmlParser());
String myTitle = doc.select("title").first().text(); // myTitle now contains "TEST"
