What I'm trying to do:
Generate a txt file on the server and download it on the client side.
I'm using Struts 2; here are the relevant code parts:
AwesomeAction.java
private InputStream fileInputStream;

public InputStream getFileInputStream() {
    return fileInputStream;
}

public String execute() {
    String res = "toto";
    fileInputStream = new StringInputStream(res);
    return SUCCESS;
}
struts.xml
<action name="awesomeAction" class="pathtomyawesomeaction">
    <result name="success" type="stream">
        <param name="contentType">text/plain</param>
        <param name="inputName">fileInputStream</param>
        <param name="contentDisposition">attachment;filename="id_opp.txt"</param>
        <param name="bufferSize">1024</param>
    </result>
    <result name="error" type="redirect">/erreur.do</result>
</action>
What is not working:
When I click on the link triggering the action, a file named "id_opp.txt" is downloaded and it contains all the text ("toto"), but a whitespace is added before each character:
" t o t o "
With server-side debugging, I'm sure that my variable contains "toto" on the server, so it must be some configuration I'm missing...
Any idea?
I'm using org.hsqldb.lib.StringInputStream for the InputStream. Since the String is built in that class, I can't use a FileInputStream or similar, and I'm not aware of any other way to do it.
I'm checking the encoding and will update as soon as I have some results.
Resolved thanks to Thomas:
Using StringInputStream was the root of the problem; I switched it to:
fileInputStream = new ByteArrayInputStream(res.getBytes(StandardCharsets.UTF_8));
which builds an InputStream for the "res" variable with an explicit encoding.
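For reference, a minimal sketch of the fixed action, under the assumption that it extends Struts 2's ActionSupport (the class skeleton beyond the snippets above is mine, not the original code):

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import com.opensymphony.xwork2.ActionSupport;

public class AwesomeAction extends ActionSupport {

    private InputStream fileInputStream;

    public InputStream getFileInputStream() {
        return fileInputStream;
    }

    @Override
    public String execute() {
        String res = "toto";
        // Encode with an explicit charset so the byte stream is
        // well-defined regardless of the platform default encoding.
        fileInputStream = new ByteArrayInputStream(res.getBytes(StandardCharsets.UTF_8));
        return SUCCESS;
    }
}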
Problems like this might occur due to different encodings. Internally Java stores strings using 16-bit characters and when you convert those into a byte representation (e.g. for writing to a stream) it will use some encoding (either one the caller provides or the default encoding which often is the system encoding).
Thus it would depend on what StringInputStream is doing with the string, i.e. how it converts the string to bytes and which encoding is used (if any).
Additionally, how the txt file is interpreted might depend on the reader if you don't add any information to indicate the encoding (like the BOM (byte order mark) for UTF-8).
Doing it as you did, i.e. using ByteArrayInputStream(res.getBytes(StandardCharsets.UTF_8)), would at least solve the problem when writing. Editors might then interpret the data correctly even if the BOM is missing (and since UTF-8 encodes 7-bit ASCII characters identically to single-byte encodings like ISO-Latin-1, even a "wrong" encoding in the reader might not be a problem for such text).
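To make the byte-level difference concrete, here is a small self-contained sketch (plain JDK, nothing assumed from the question; the numbers in the comments are what the standard charsets produce):

import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String s = "toto";
        // ASCII characters are one byte each in UTF-8...
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);   // 4
        // ...but two bytes each in UTF-16, plus a 2-byte BOM.
        System.out.println(s.getBytes(StandardCharsets.UTF_16).length);  // 10
        // A UTF-16 byte stream decoded as single-byte text shows a NUL
        // byte before every ASCII character - which looks exactly like
        // the "whitespace" observed in the downloaded file.
    }
}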
Related
I am using JAX-RS and have a simple POST web service that takes an InputStream containing a MIME message (XML + file).
The MIME message is in UTF-8; the file contained as a body part is an email message in MIME RFC 822 in ISO-8859-1 encoding, which I'm converting to PDF using Aspose.
When running as a web service, the resulting PDF has incorrect characters (ø, å, etc.). But when I tried the exact same input, reading it from a file instead and calling the method with a FileInputStream, the resulting PDF was OK.
Here is a simplified version of the code:
@POST
@Path(value = "/documents/convert/{flag}")
@Produces("text/plain")
public String convertFile(InputStream input, @PathParam("flag") String flag) throws WebApplicationException {
    FileInfo info = convertToPdf(input);
    return info.getResponse();
}
If I run this as a web service, it produces a PDF with incorrectly encoded characters, with a "box" instead of some characters (such as ø, å, etc.). When I run the same code with the same input by calling
FileInputStream fis = new FileInputStream(file);
convertFile(fis);
the resulting PDF has the correct encoding (the WS runs on a server; the test with the file was done on my local machine).
Could this be an incorrect locale setting on the server?
Do you use an InputStreamReader to read the FileInputStream? If so, did you initialize it using the two-parameter constructor, with Charset.forName("UTF-8") as the second argument (as you mentioned, the incoming stream is already in UTF-8)?
You might need to tell the container that it's UTF-8.
Something like:
@Produces("text/plain; charset=utf-8")
Apparently your local file and your MIME message body are not encoded the same way.
Your post states that the file is encoded in ISO-8859-1.
If you are using an InputStreamReader (as Xavier Coulon is suggesting) you should pass the expected encoding to it, in this case
Charset.forName("ISO-8859-1")
If this does not help, could you please provide the content of the convertToPdf(InputStream is) method
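For illustration, a minimal sketch of reading a body part with an explicit charset instead of the platform default (the class and method names are hypothetical, not from the question's code):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class BodyPartReader {

    // Decode the incoming bytes with the encoding the data was actually
    // written in (here ISO-8859-1), not the platform default.
    static String readBody(InputStream input) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(input, Charset.forName("ISO-8859-1")));
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }
}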
For development I'm using ResourceBundle to read a UTF-8 encoded properties file (I set that in Eclipse's file properties on that file) directly from my resources directory in the IDE (native2ascii is used on the way to production), e.g.:
menu.file.open.label=&Öffnen...
label.btn.add.name=&Hinzufügen
label.btn.remove.name=&Löschen
Since that causes issues with the character encoding when using non-ASCII characters, I thought I'd be happy with:
ResourceBundle resourceBundle = ResourceBundle.getBundle("messages", Locale.getDefault());
String value = resourceBundle.getString(key);
value = new String(value.getBytes(), "UTF-8");
Well, it does work nicely for lower-case German umlauts, but not for the upper-case ones, the ß also doesn't work. Here's the value read with getString(key) and the value after the conversion with new String(value.getBytes(), "UTF-8"):
&Löschen => &Löschen
&Hinzufügen => &Hinzufügen
&Ã?ber => &??ber
&SchlieÃ?en => &Schlie??en
&Ã?ffnen... => &??ffnen...
The last three should be:
&Ã?ber => &Über
&SchlieÃ?en => &Schließen
&Ã?ffnen... => &Öffnen...
I guess that I'm not too far away from the truth, but what am I missing here?
Google found something similar, but that remained unanswered.
EDIT: a little more code
The problem is you're calling String.getBytes() without specifying an encoding - which will use the default platform encoding. You're then using the binary result of that operation as if it were in UTF-8.
If you use UTF-8 in both directions, it'll be fine:
// Should be a round-trip
value = new String(value.getBytes("UTF-8"), "UTF-8");
... but if you were trying to use this to compensate for reading a UTF-8-encoded property file without telling the code that performs the initial read which encoding to use, that won't work.
The code you've presented is basically always the wrong approach. Your "Since that causes issues with the character encoding" suggests that you'd already run across an earlier problem - so I'd go back to that, instead of trying to apply a broken fix. If you've already lost data when constructing the ResourceBundle, it's too late to go back later... you need to make sure the ResourceBundle itself is loaded correctly.
Please tell us exactly what problems you had with the ResourceBundle, and we can see if we can fix the root cause.
EDIT: It's not clear how you're running native2ascii. The fix may be as simple as changing to use:
native2ascii -encoding UTF-8 input.properties output.properties
Some notes:
If it is a String, it is UTF-16 internally; if the bytes were decoded with the wrong encoding on the way in, the string is already corrupt (and it is too late to fix).
new String(value.getBytes(), "UTF-8"); - this code will (at best) do nothing on a system that uses UTF-8 as the default encoding; otherwise it will corrupt the string.
.properties files must be ISO 8859-1 (the Properties type supports other formats and encodings, but I don't know how you would tell ResourceBundle that - see the sketch after these notes.)
System.out can introduce its own transcoding bugs (the PrintStream encodes UTF-16 strings to the default encoding; the receiving device must decode the bytes using the same encoding.)
I suspect you are trying to fix your problems in the wrong place.
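As an aside on the third note above: if you load the bundle yourself, Java 6 and later let you hand a PropertyResourceBundle a Reader with an explicit charset; a minimal sketch (the file name and key are placeholders):

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

public class Utf8BundleDemo {
    public static void main(String[] args) throws Exception {
        // Read the .properties file as UTF-8 explicitly, bypassing the
        // ISO 8859-1 assumption of the default ResourceBundle loading.
        Reader reader = new InputStreamReader(
                new FileInputStream("messages.properties"), "UTF-8");
        try {
            ResourceBundle bundle = new PropertyResourceBundle(reader);
            System.out.println(bundle.getString("menu.file.open.label"));
        } finally {
            reader.close();
        }
    }
}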
You are encoding the text with a different encoding to the one you are decoding with.
Try instead using the same character set for encoding and decoding.
value = new String(value.getBytes("UTF-8"), "UTF-8");
String s = "ßßßßß";
s += s.toUpperCase();
s = new String(s.getBytes("UTF-8"), "UTF-8");
System.out.println(s);
prints
ßßßßßSSSSSSSSSS
Today I was talking to one of my colleagues and he was pretty much on the same path as the other answers. So I tried to achieve what Jon Skeet had mentioned, namely creating the same file as in production. Since rebuilding the project after each change of a resource is out of the question, and since I hadn't done any of what solved this before (I guess this will be new to some), let me line it out (even if it may be just for personal reference ;) ). In short, this uses Eclipse's project builders.
Create an Ant-style build.xml
<?xml version="1.0" encoding="UTF-8"?>
<project>
    <property name="dir.resources" value="src/main/resources" />
    <property name="dir.target" value="bin/main" />
    <target name="native-to-ascii">
        <delete dir="${dir.target}" includes="**/*.properties" />
        <native2ascii src="${dir.resources}" dest="${dir.target}" includes="**/*.properties" />
    </target>
</project>
Its intention is to delete the properties-files in the target directory and use native2ascii to recreate them. The delete is necessary as native2ascii won't overwrite existing files.
In Eclipse go to the project properties and select "Builders", click "New...", pick "Ant Builder" (that's the slightly enhanced editor for run configurations)
In "Main" let "Buildfile" point to the Ant-script, set "Base Directory" to ${project_loc}
In "Refresh" tick "Refresh resources upon completion" and pick "The project containing the selected resource"
In "Targets" click "Set Targets" next to the "Auto Build" and pick native-to-ascii there (note that for some reason I had to do this later again)
This might not be necessary for everybody, but in "JRE" pick a proper execution environment
In "Build Options" tick off "Allocate Console" (however, you may want to keep this ticked on until you see that it's all working)
"Apply", "OK"
I was told that the newly created builder should be somewhere underneath the Java Builder (use Up/Down-button)
In the "Java Build Path" select the source folder with the resources (src/main/resources for me) and add an exclusion for **/*.properties
That should have been it. If you edit a properties-file and save it, it should automatically be converted to ASCII in the output folder. You can try with entering ü, which should end up as \u00fc.
Note that if you have a lot of properties-files, this may take some time. Just don't save after every keypress. :)
I have a Java servlet which gets RSS feeds and converts them to JSON. It works great on Windows, but it fails on CentOS.
The RSS feed contains Arabic and it shows unintelligible characters on CentOS. I am using these lines to encode the RSS feed:
byte[] utf8Bytes = Xml.getBytes("Cp1256");
// byte[] defaultBytes = Xml.getBytes();
String roundTrip = new String(utf8Bytes, "UTF-8");
I tried it on GlassFish and Tomcat. Both have the same problem: it works on Windows but fails on CentOS. What causes this and how can I solve it?
byte[] utf8Bytes = Xml.getBytes("Cp1256");
String roundTrip = new String(utf8Bytes, "UTF-8");
This is an attempt to correct a badly-decoded string. At some point prior to this operation you have read in Xml using the default encoding, which on your Windows box is code page 1256 (Windows Arabic). Here you are encoding that string back to code page 1256 to retrieve its original bytes, then decoding it properly as the encoding you actually wanted, UTF-8.
On your Linux server, it fails, because the default encoding is something other than Cp1256; it would also fail on any Windows server not installed in an Arabic locale.
The commented-out line that uses the default encoding instead of explicitly Cp1256 is more likely to work on a Linux server. However, the real fix is to find where Xml is being read, and fix that operation to use the correct encoding(*) instead of the default. Allowing the default encoding to be used is almost always a mistake, as it makes applications dependent on configuration that varies between servers.
(*: for this feed, that's UTF-8, which is the most common encoding, but it may differ for others. Finding out the right encoding for a feed depends on the Content-Type header returned for the resource and the <?xml encoding declaration. By far the best way to cope with this is to fetch and parse the resource using a proper XML library that knows about this, for example with DocumentBuilder.parse(uri).)
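For example, a minimal sketch of that approach with DocumentBuilder (the feed URL is a placeholder):

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class FeedFetcher {
    public static void main(String[] args) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // The parser sees the raw bytes and picks the right decoding from
        // the <?xml ... encoding="..."?> declaration, so no manual
        // re-encoding of strings is needed afterwards.
        Document doc = builder.parse("http://example.com/feed.xml");
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}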
There are many places where the wrong encoding can be used. Here is a complete list: http://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q8
I have a file (more specifically, a log4j configuration file) and I want to be able to read the file, pick out certain lines, and replace them. For example, within the file there is a string of text that indicates the directory it is stored in, or the level of the logger. I want to be able to replace those strings of text without reading in the file, writing it to another file, and deleting the original file. Is there a more efficient way of doing find-and-replace on text in a file using Java?
Here is an example of the text file I'm trying to work with:
log4j.rootLogger=DEBUG, A0
log4j.appender.A0=org.apache.log4j.RollingFileAppender
log4j.appender.A0.File=C:/log.txt
log4j.appender.A0.MaxFileSize=100KB
log4j.appender.A0.MaxBackupIndex=1
log4j.appender.A0.layout=org.apache.log4j.PatternLayout
log4j.appender.A0.layout.ConversionPattern=%-4r [%t] %-5p: %c %x - %m%n
I want to be able to read the file and replace 'DEBUG' with another level, or replace the file directory name 'C:/log.txt'. The log configuration file may also be written in XML; an example of that is featured below.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration>
    <appender class="org.apache.log4j.RollingFileAppender" name="A0">
        <param name="append" value="false"/>
        <param name="File" value="C:/log/.txt"/>
        <param name="MaxBackupIndex" value="1"/>
        <param name="MaxFileSize" value="100KB"/>
        <layout class="org.apache.log4j.PatternLayout">
            <param name="ConversionPattern" value="%-4r [%t] %-5p: %c %x - %m%n"/>
        </layout>
    </appender>
    <root>
        <level value="DEBUG"/>
        <appender-ref ref="A0"/>
    </root>
</log4j:configuration>
I'm thinking it may be possible to use a hash map for this type of implementation?
Any decent text editor has a search-and-replace facility that supports regular expressions.
If, however, you have reason to reinvent the wheel in Java, you can do:
Path path = Paths.get("test.txt");
Charset charset = StandardCharsets.UTF_8;

// Read the whole file into memory, replace, and write it back in place.
String content = new String(Files.readAllBytes(path), charset);
content = content.replaceAll("foo", "bar");
Files.write(path, content.getBytes(charset));
This only works for Java 7 or newer. If you are stuck on an older Java, you can do:
String content = IOUtils.toString(new FileInputStream(myfile), myencoding);
content = content.replaceAll(myPattern, myReplacement);
IOUtils.write(content, new FileOutputStream(myfile), myencoding);
In this case, you'll need to add error handling and close the streams after you are done with them.
IOUtils is documented at http://commons.apache.org/proper/commons-io/javadocs/api-release/org/apache/commons/io/IOUtils.html
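For completeness, a sketch of the pre-Java-7 variant with the stream closing and error handling added (class and method names are placeholders, and this is one way to do it, not the only one):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.commons.io.IOUtils;

public class FindReplace {
    public static void replaceInFile(String path, String regex,
                                     String replacement, String encoding) throws IOException {
        String content;
        InputStream in = new FileInputStream(path);
        try {
            content = IOUtils.toString(in, encoding);
        } finally {
            in.close(); // close before reopening the same file for writing
        }
        content = content.replaceAll(regex, replacement);
        OutputStream out = new FileOutputStream(path);
        try {
            IOUtils.write(content, out, encoding);
        } finally {
            out.close();
        }
    }
}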
After visiting this question and noting the initial concerns about the chosen solution, I figured I'd contribute this one for those not using Java 7. It uses FileUtils instead of IOUtils from Apache Commons. The advantage here is that readFileToString and writeStringToFile handle closing the files for you automatically (writeStringToFile doesn't document it, but you can read the source). Hopefully this recipe simplifies things for anyone new coming to this problem.
try {
    String content = FileUtils.readFileToString(new File("InputFile"), "UTF-8");
    content = content.replaceAll("toReplace", "replacementString");
    File tempFile = new File("OutputFile");
    FileUtils.writeStringToFile(tempFile, content, "UTF-8");
} catch (IOException e) {
    // Simple exception handling, replace with what's necessary for your use case!
    throw new RuntimeException("Generating file failed", e);
}
public static void replaceFileString(String oldPattern, String replacement) throws IOException {
    String fileName = Settings.getValue("fileDirectory");
    FileInputStream fis = new FileInputStream(fileName);
    String content = IOUtils.toString(fis, Charset.defaultCharset());
    fis.close();
    content = content.replaceAll(oldPattern, replacement);
    // Reopen the same file for writing, truncating the old content.
    FileOutputStream fos = new FileOutputStream(fileName);
    IOUtils.write(content, fos, Charset.defaultCharset());
    fos.close();
}
The above is my implementation of Meriton's example, and it works for me. The fileName is the full path (e.g. D:\utilities\settings.txt). I'm not sure which character set should be used, but I ran this code on a Windows XP machine just now and it did the trick without the temporary-file creation and renaming.
You might want to use Scanner to parse through the file and find the specific sections you want to modify. There are also String.split and StringTokenizer that may work, but at the level you're working at, Scanner might be what's needed.
Here's some additional info on what the difference is between them:
Scanner vs. StringTokenizer vs. String.Split
This is the sort of thing I'd normally use a scripting language for. It's very useful to have the ability to perform these sorts of transformations very simply using something like Ruby/Perl/Python (insert your favorite scripting language here).
I wouldn't normally use Java for this since it's too heavyweight in terms of development cycle/typing etc.
Note that if you want to manipulate XML in particular, it's advisable to read the file as XML and manipulate it as such (the above scripting languages have very useful and simple APIs for doing this sort of work). A simple text search/replace can invalidate your file in terms of character encoding etc. As always, it depends on the complexity of your search/replace requirements.
You can use Java's Scanner class to parse words of a file and process them in your application, and then use a BufferedWriter or FileWriter to write back to the file, applying the changes.
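A rough sketch of that idea, assuming the log4j.properties file from the question and a simple DEBUG-to-INFO swap (the file name and the replacement are placeholders):

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;

public class ScannerReplaceDemo {
    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        // Scan the file line by line, swapping the log level as we go.
        Scanner scanner = new Scanner(new File("log4j.properties"));
        while (scanner.hasNextLine()) {
            sb.append(scanner.nextLine().replace("DEBUG", "INFO")).append('\n');
        }
        scanner.close();
        // Write the processed content back over the original file.
        FileWriter writer = new FileWriter("log4j.properties");
        try {
            writer.write(sb.toString());
        } finally {
            writer.close();
        }
    }
}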
I think there may be a more efficient way of getting the scanner's position at some point, in order to implement in-place editing better, but since files are open either for reading or for writing, I'm not sure about that.
In any case, you can use libraries already available for parsing of XML files, which have all of this implemented already and will allow you to do what you want easily.
I'm trying to index Wikipedia dumps. My SAX parser makes Article objects for the XML with only the fields I care about, then sends them to my ArticleSink, which produces Lucene Documents.
I want to filter special/meta pages like those prefixed with Category: or Wikipedia:, so I made an array of those prefixes and test the title of each page against this array in my ArticleSink, using article.getTitle().startsWith(prefix). In English, everything works fine: I get a Lucene index with all the pages except those with matching prefixes.
In French, the prefixes with no accent also work (i.e. they filter the corresponding pages), some of the accented prefixes don't work at all (like Catégorie:), and some work most of the time but fail on some pages (like Wikipédia:), yet I cannot see any difference between the corresponding lines (in less).
I can't really inspect all the differences in the file because of its size (5 GB), but it looks like correct UTF-8 XML. If I take a portion of the file using grep or head, the accents are correct (even on the incriminated pages, the <title>Catégorie:something</title> is correctly displayed by grep). On the other hand, when I recreate a wiki XML by tail/head-cutting the original file, the same page (here Catégorie:Rock par ville) gets filtered in the small file but not in the original…
Any idea?
Alternatives I tried:
Getting the file (commented lines were tried without success*):
FileInputStream fis = new FileInputStream(new File(xmlFileName));
//ReaderInputStream ris = ReaderInputStream.forceEncodingInputStream(fis, "UTF-8" );
//(custom function opening the stream,
//reading it as UTF-8 into a Reader and returning another byte stream)
//InputSource is = new InputSource( fis ); is.setEncoding("UTF-8");
parser.parse(fis, handler);
Filtered prefixes:
ignoredPrefix = new String[] {"Catégorie:", "Modèle:", "Wikipédia:",
"Cat\uFFFDgorie:", "Mod\uFFFDle:", "Wikip\uFFFDdia:", //invalid char
"Catégorie:", "Modèle:", "Wikipédia:", // UTF-8 as ISO-8859-1
"Image:", "Portail:", "Fichier:", "Aide:", "Projet:"}; // those last always work
* ERRATUM
Actually, my bad: that one does work, I had tested the wrong index:
InputSource is = new InputSource( fis );
is.setEncoding("UTF-8"); // force UTF-8 interpretation
parser.parse(fis, handler);
Since you write the prefixes as plain strings into your source file, you want to make sure that you save that .java file in UTF-8, too (or any other encoding that supports the special characters you're using). Then, however, you have to tell the compiler which encoding the file is in with the -encoding flag:
javac -encoding utf-8 *.java
For the XML source, you could try
Reader r = new InputStreamReader(new FileInputStream(xmlFileName), "UTF-8");
InputStreams do not deal with encodings since they are byte-based, not character-based. So here we create a Reader from a FileInputStream - the latter (the stream) doesn't know about encodings, but the former (the reader) does, because we give the encoding in the constructor.
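Put together with the SAX calls from the question, a minimal sketch (the parser factory setup is assumed; handler stands for the question's own DefaultHandler):

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class Utf8SaxDemo {
    public static void parseUtf8(String xmlFileName, DefaultHandler handler) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // Decode the bytes ourselves with an explicit charset; the parser
        // then receives characters and cannot mis-detect the encoding.
        Reader r = new InputStreamReader(new FileInputStream(xmlFileName), "UTF-8");
        parser.parse(new InputSource(r), handler);
    }
}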