I want to port some PHP code that uses the ord() function. The PHP code follows:
$fh=fopen("../ip.data","r");
fseek($fh,327680);
$country_id=ord(fread($fh,1)); //return 137
I tried to write the equivalent code in Java:
FileReader ip=new FileReader("ip.data");
int i = 0;
int str;
while((str=ip.read()) != -1){
if(i==327681){
System.out.println(str); //return 0
break;
}
i++;
}
But the two are not equal.
I know ord('a')==97 in PHP, (int)'a'==97 in Java.
FileReader assumes the system's default charset for the input file, which might be, e.g., UTF-8.* In that case, up to four bytes would be read to form the single character you get from FileReader.read().
So maybe (just guesswork at this point, though) that's the problem. Your PHP code doesn't assume any encoding; it just reads an (8-bit) byte, and its equivalent in Java would be, e.g.:
FileInputStream fis = new FileInputStream("ip.data");
int country_id = fis.read(); // FileInputStream.read() reads a byte (0..255), not a character
*) To test that assumption, try:
FileReader fr = new FileReader("ip.data");
System.out.println(fr.getEncoding());
What does it print?
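Putting the pieces together, a minimal self-contained sketch of the PHP snippet (the class name is arbitrary; assumes ip.data is in the working directory):

import java.io.FileInputStream;
import java.io.IOException;

public class ReadByteAtOffset {
    public static void main(String[] args) throws IOException {
        try (FileInputStream fis = new FileInputStream("ip.data")) {
            // equivalent of fseek($fh, 327680)
            long skipped = fis.skip(327680);
            if (skipped != 327680) {
                throw new IOException("file is shorter than expected");
            }
            // equivalent of ord(fread($fh, 1)): read() returns 0..255, or -1 at EOF
            int countryId = fis.read();
            System.out.println(countryId); // 137 for the file from the question
        }
    }
}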
Try this:
String str = "t";
byte b = str.getBytes()[0];
int ascii = (int) b;
Maybe you have to use char.
I know that I can do this, but I also want to know: is there a shorter way? For example, why is there no method with the prototype public String readString(int len) in the Reader class hierarchy that would do what I want in a single line of code?
InputStream in = new FileInputStream("abc.txt");
InputStreamReader inReader = new InputStreamReader(in);
char[] foo = new char[5];
inReader.read(foo);
System.out.println(new String(foo));
// I think this way is too long
// for reading a string of only 5 characters
// from an InputStream or Reader
In Python 3, I can do this very easily, for UTF-8 and other encodings. Consider the following code:
fl = open("abc.txt", mode="r", encoding="utf-8")
fl.read(1) # returns a string of 1 character
fl.read(3) # returns a string of 3 characters
How can I do it in Java? Thanks.
How can I do it in Java ?
The way you're already doing it.
I'd recommend doing it in a reusable helper method, e.g.
import java.io.IOException;
import java.io.Reader;

final class IOUtil {
    public static String read(Reader in, int len) throws IOException {
        char[] buf = new char[len];
        int charsRead = in.read(buf);
        return (charsRead == -1 ? null : new String(buf, 0, charsRead));
    }
}
Then use it like this:
try (Reader in = Files.newBufferedReader(Paths.get("abc.txt"), StandardCharsets.UTF_8)) {
System.out.println(IOUtil.read(in, 5));
}
If you want to make a best effort to read up to the specified number of characters, you may use:
int len = 4;
String result;
try(Reader r = new FileReader("abc.txt")) {
CharBuffer b = CharBuffer.allocate(len);
do {} while(b.hasRemaining() && r.read(b) > 0);
result = b.flip().toString();
}
System.out.println(result);
While the Reader may read fewer than the specified number of characters (depending on the underlying stream), it will read at least one character before returning, or return -1 to signal the end of the stream. So the code above will loop until it has either read the requested number of characters or reached the end of the stream.
That said, a FileReader will usually read all requested characters in one go, and will only read fewer when reaching the end of the file.
I am trying to write a function that removes diacritics (I deliberately don't want to use Normalizer). The function looks like this:
private static String normalizeCharacter(Character curr) {
String sdiac = "áäčďéěíĺľňóôőöŕšťúůűüýřžÁÄČĎÉĚÍĹĽŇÓÔŐÖŔŠŤÚŮŰÜÝŘŽ";
String bdiac = "aacdeeillnoooorstuuuuyrzAACDEEILLNOOOORSTUUUUYRZ";
char[] s = sdiac.toCharArray();
char[] b = bdiac.toCharArray();
String ret;
for(int i = 0; i < sdiac.length(); i++){
if(curr == s[i])
curr = b[i];
}
ret = curr.toString().toLowerCase();
ret = ret.replace("\n", "").replace("\r","");
return ret;
}
The function is called like this (every character from the file is sent to it):
private static String readFile(String fName) {
File f = new File(fName);
StringBuilder sb = new StringBuilder();
try{
FileInputStream fStream = new FileInputStream(f);
Character curr;
while(fStream.available() > 0){
curr = (char) fStream.read();
sb.append(normalizeCharacter(curr));
System.out.print(normalizeCharacter(curr));
}
}catch(IOException e){
e.printStackTrace();
}
return sb.toString();
}
The file text.txt contains this: ľščťžýáíéúäôň, and I expect lcstzyaieuaon in return from the program, but instead of the expected string I get ¾è yaieuaoò. I know the problem is somewhere in the encoding, but I don't know where. Any ideas?
You are trying to convert bytes into characters.
However, the character ľ is not represented as a single byte. Its Unicode code point is U+013E, and its UTF-8 encoding is C4 BE, so it is represented by two bytes. The same is true for the other characters.
Suppose the encoding of your file is UTF-8. Then you read the byte value C4, and then you convert it to a char. This will give you the character U+00C4 (Ä), not U+013E. Then you read the BE, and it is converted to the character U+00BE (¾).
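A quick way to confirm this (a snippet; assumes java.nio.charset.StandardCharsets is imported):

byte[] bytes = "\u013E".getBytes(StandardCharsets.UTF_8); // "ľ"
for (byte b : bytes) {
    System.out.printf("%02X ", b); // prints: C4 BE
}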
So don't confuse bytes and characters. Instead of using the InputStream directly, you should wrap it with a Reader. A Reader is able to read characters based on the encoding it is created with:
BufferedReader reader = new BufferedReader(
new InputStreamReader(
new FileInputStream(f), StandardCharsets.UTF_8
)
);
Now you'll be able to read characters, or even whole lines, and the decoding will be handled for you.
int readVal;
while ( ( readVal = reader.read() ) != -1 ) {
curr = (char)readVal;
// ... the rest of your code
}
Remember that read() with no arguments still returns an int, so check for -1 before casting to char.
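Putting it together, a minimal sketch of the question's readFile rewritten on top of a Reader (normalizeCharacter is the method from the question; assumes the usual java.io imports plus java.nio.charset.StandardCharsets):

private static String readFile(String fName) {
    StringBuilder sb = new StringBuilder();
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(fName), StandardCharsets.UTF_8))) {
        int readVal;
        while ((readVal = reader.read()) != -1) {
            sb.append(normalizeCharacter((char) readVal));
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return sb.toString();
}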
I've never worked closely with the Java I/O API before, and I'm really frustrated now. I find it hard to believe how strange and complex it is, and how hard a simple task can be.
My task: I have 2 positions (starting byte, ending byte), pos1 and pos2. I need to read lines between these two bytes (including the starting one, not including the ending one) and use them as UTF8 String objects.
For example, in most script languages it would be a very simple 1-2-3-liner like that (in Ruby, but it will be essentially the same for Python, Perl, etc):
f = File.open("file.txt").seek(pos1)
while f.pos < pos2 {
s = f.readline
# do something with "s" here
}
It quickly becomes hell with the Java I/O APIs ;) In fact, I see two ways to read lines (ending with \n) from regular local files:
RandomAccessFile has getFilePointer() and seek(long pos), but its readLine() reads non-UTF-8 strings (not even byte arrays, but very strange strings with broken encoding), and it has no buffering (which probably means that every read*() call is translated into a single underlying OS read() => fairly slow).
BufferedReader has a great readLine() method, and it can even do some seeking with skip(long n), but it has no way to determine even the number of bytes already read, let alone the current position in the file.
I've tried to use something like:
FileInputStream fis = new FileInputStream(fileName);
FileChannel fc = fis.getChannel();
BufferedReader br = new BufferedReader(
new InputStreamReader(
fis,
CHARSET_UTF8
)
);
... and then using fc.position() to get the current file reading position and fc.position(newPosition) to set it, but it doesn't seem to work in my case: it looks like it returns the position up to which the BufferedReader has pre-filled its buffer, or something like that - these counters seem to be rounded up in 16K increments.
Do I really have to implement it all by myself, i.e. a file-reading interface which would:
allow me to get/set position in a file
buffer file reading operations
allow reading UTF8 strings (or at least allow operations like "read everything till the next \n")
Is there a quicker way than implementing it all myself? Am I overlooking something?
import org.apache.commons.io.input.BoundedInputStream;

FileInputStream file = new FileInputStream(filename);
file.skip(pos1);
BufferedReader br = new BufferedReader(
        new InputStreamReader(new BoundedInputStream(file, pos2 - pos1),
                StandardCharsets.UTF_8)); // explicit charset, since the question wants UTF-8
If you didn't care about pos2, then you wouldn't need Apache Commons IO.
I wrote this code to read UTF-8 using RandomAccessFile:
//File: CyclicBuffer.java
public class CyclicBuffer {
private static final int size = 3;
private FileChannel channel;
private ByteBuffer buffer = ByteBuffer.allocate(size);
public CyclicBuffer(FileChannel channel) {
    this.channel = channel;
    buffer.limit(0); // start empty so the first get() refills from the channel
}
private int read() throws IOException {
return channel.read(buffer);
}
/**
 * Returns the byte read.
 *
 * @return the byte read, or -1 if the end of file is reached
 * @throws IOException
 */
public byte get() throws IOException {
if (buffer.hasRemaining()) {
return buffer.get();
} else {
buffer.clear();
int eof = read();
if (eof == -1) {
return (byte) eof;
}
buffer.flip();
return buffer.get();
}
}
}
//File: UTFRandomFileLineReader.java
public class UTFRandomFileLineReader {
private final Charset charset = Charset.forName("utf-8");
private CyclicBuffer buffer;
private ByteBuffer temp = ByteBuffer.allocate(4096);
private boolean eof = false;
public UTFRandomFileLineReader(FileChannel channel) {
this.buffer = new CyclicBuffer(channel);
}
public String readLine() throws IOException {
if (eof) {
return null;
}
byte x = 0;
temp.clear();
while ((byte) -1 != (x = (buffer.get())) && x != '\n') {
if (temp.position() == temp.capacity()) {
temp = addCapacity(temp);
}
temp.put(x);
}
if (x == -1) {
eof = true;
}
temp.flip();
if (temp.hasRemaining()) {
return charset.decode(temp).toString();
} else {
return null;
}
}
private ByteBuffer addCapacity(ByteBuffer temp) {
ByteBuffer t = ByteBuffer.allocate(temp.capacity() + 1024);
temp.flip();
t.put(temp);
return t;
}
public static void main(String[] args) throws IOException {
RandomAccessFile file = new RandomAccessFile("/Users/sachins/utf8.txt",
"r");
UTFRandomFileLineReader reader = new UTFRandomFileLineReader(file
.getChannel());
int i = 1;
while (true) {
String s = reader.readLine();
if (s == null)
break;
System.out.println("\n line " + i++);
s = s + "\n";
for (byte b : s.getBytes(Charset.forName("utf-8"))) {
System.out.printf("%x", b);
}
System.out.printf("\n");
}
}
}
For @Ken Bloom: a very quick go at a Java 7 version. Note: I don't think this is the most efficient way; I'm still getting my head around NIO.2. Oracle has started their tutorial here.
Also note that this isn't using Java 7's new ARM syntax (which takes care of the exception handling for file-based resources); it wasn't working in the latest OpenJDK build that I have. But if people want to see the syntax, let me know.
/*
* Paths uses the default file system, note no exception thrown at this stage if
* file is missing
*/
Path file = Paths.get("C:/Projects/timesheet.txt");
ByteBuffer readBuffer = ByteBuffer.allocate(readBufferSize);
FileChannel fc = null;
try
{
/*
 * newByteChannel returns a SeekableByteChannel - this is the fun new construct that
 * supports asynchronous file-based I/O; e.g., if you declared an AsynchronousFileChannel
 * you could read and write to that channel simultaneously with multiple threads.
 */
fc = (FileChannel) Files.newByteChannel(file, StandardOpenOption.READ);
fc.position(startPosition);
while (fc.read(readBuffer) != -1)
{
    readBuffer.flip(); // limit = bytes just read, position = 0
    System.out.println(Charset.forName(encoding).decode(readBuffer));
    readBuffer.clear(); // make the whole buffer available for the next read
}
}
Start with a RandomAccessFile and use read or readFully to get a byte array between pos1 and pos2. Let's say that we've stored the data read in a variable named rawBytes.
Then create your BufferedReader using
new BufferedReader(new InputStreamReader(new ByteArrayInputStream(rawBytes)))
Then you can call readLine on the BufferedReader.
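A minimal sketch of that approach, with pos1 and pos2 as in the question and assuming UTF-8 (per the requirements):

RandomAccessFile raf = new RandomAccessFile("file.txt", "r");
raf.seek(pos1);
byte[] rawBytes = new byte[(int) (pos2 - pos1)];
raf.readFully(rawBytes); // read everything between pos1 and pos2
raf.close();

BufferedReader br = new BufferedReader(
        new InputStreamReader(new ByteArrayInputStream(rawBytes), StandardCharsets.UTF_8));
String s;
while ((s = br.readLine()) != null) {
    // do something with s
}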
Caveat: this probably uses more memory than if you could make the BufferedReader seek to the right location itself, because it preloads everything into memory.
I think the confusion is caused by the UTF-8 encoding and the possibility of double byte characters.
UTF-8 doesn't use a fixed number of bytes per character. I'm assuming from your post that you are using single-byte characters. For example, 412 bytes would mean 411 characters. But if the string used double-byte characters, you would get 206 characters.
The original java.io package didn't deal well with this multi-byte confusion. So, they added more classes to deal specifically with strings. The package mixes two different types of file handlers (and they can be confusing until the nomenclature is sorted out). The stream classes provide for direct data I/O without any conversion. The reader classes convert files to strings with full support for multi-byte characters. That might help clarify part of the problem.
Since you state you are using UTF-8 characters, you want the reader classes. In this case, I suggest FileReader. The skip() method in FileReader allows you to pass by X characters and then start reading text. Alternatively, I prefer the overloaded read() method since it allows you to grab all the text at one time.
If you assume your "bytes" are individual characters, try something like this:
FileReader fr = new FileReader(new File("x.txt"));
fr.skip(pos); // skip the first pos characters
char[] buffer = new char[pos2 - pos];
fr.read(buffer, 0, buffer.length); // the offset argument indexes into buffer, not the file
...
I'm late to the party here, but I ran across this problem in my own project.
After much traversal of Javadocs and Stack Overflow, I think I found a simple solution.
After seeking to the appropriate place in your RandomAccessFile, which I am here calling raFile, do the following:
FileDescriptor fd = raFile.getFD();
FileReader fr = new FileReader(fd);
BufferedReader br = new BufferedReader(fr);
Then you should be able to call br.readLine() to your heart's content, which will be much faster than calling raFile.readLine().
The one thing I'm not sure about is whether UTF8 strings are handled correctly.
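If UTF-8 matters, one variation (just a sketch, not something verified here) is to wrap the descriptor in an InputStreamReader with an explicit charset instead of relying on FileReader's platform default:

FileDescriptor fd = raFile.getFD();
BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream(fd), StandardCharsets.UTF_8));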
The Java I/O API is very flexible. Unfortunately, sometimes that flexibility makes it verbose. The main idea is that there are many streams, writers and readers that implement the wrapper pattern. For example, BufferedInputStream wraps any other InputStream. The same goes for output streams.
The difference between streams and readers/writers is that streams work with bytes while readers/writers work with characters.
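For example, the wrappers compose into chains. A typical stack for reading text (the file name is just an example) might be:

Reader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream("myfile.txt"), StandardCharsets.UTF_8));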
Fortunately some streams, writers and readers have convenient constructors that simplify coding. If you want to read a file, skipping some bytes first, you just have to say:

InputStream in = new FileInputStream("/usr/home/me/myfile.txt");
in.skip(1024); // skip() works on any InputStream; mark/reset support is not required
in.read();
It is not as complicated as you fear.
Channels are something different. They are part of the so-called "new I/O", or NIO. NIO is non-blocking, which is its main advantage. You can search the internet for any "java nio tutorial" and read about it. But it is more complicated than regular I/O and is not needed for most applications.
I am reading an XML document (UTF-8) and ultimately displaying the content on a Web page using ISO-8859-1. As expected, a few characters are not displayed correctly, such as “, – and ’ (they display as ?).
Is it possible to convert these characters from UTF-8 to ISO-8859-1?
Here is a snippet of code I have written to attempt this:
BufferedReader br = new BufferedReader(new InputStreamReader(urlConnection.getInputStream(), "UTF-8"));
StringBuilder sb = new StringBuilder();
String line = null;
while ((line = br.readLine()) != null) {
sb.append(line);
}
br.close();
byte[] latin1 = sb.toString().getBytes("ISO-8859-1");
return new String(latin1);
I'm not quite sure what's going awry, but I believe it's readLine() that's causing the grief (since the strings would be Java/UTF-16 encoded?). Another variation I tried was to replace latin1 with
byte[] latin1 = new String(sb.toString().getBytes("UTF-8")).getBytes("ISO-8859-1");
I have read previous posts on the subject and I'm learning as I go. Thanks in advance for your help.
I'm not sure if there is a normalization routine in the standard library that will do this. I do not think conversion of "smart" quotes is handled by the standard Unicode normalizer routines - but don't quote me.
The smart thing to do is to dump ISO-8859-1 and start using UTF-8. That said, it is possible to encode any normally allowed Unicode code point into a HTML page encoded as ISO-8859-1. You can encode them using escape sequences as shown here:
public final class HtmlEncoder {
private HtmlEncoder() {}
public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
T out) throws java.io.IOException {
for (int i = 0; i < sequence.length(); i++) {
char ch = sequence.charAt(i);
if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.BASIC_LATIN) {
out.append(ch);
} else {
int codepoint = Character.codePointAt(sequence, i);
// handle supplementary range chars
i += Character.charCount(codepoint) - 1;
// emit entity
out.append("&#x");
out.append(Integer.toHexString(codepoint));
out.append(";");
}
}
return out;
}
}
Example usage:
String foo = "This is Cyrillic Ya: \u044F\n"
+ "This is fraktur G: \uD835\uDD0A\n" + "This is a smart quote: \u201C";
StringBuilder sb = HtmlEncoder.escapeNonLatin(foo, new StringBuilder());
System.out.println(sb.toString());
Above, the character LEFT DOUBLE QUOTATION MARK (U+201C “) is encoded as &#x201c;. A couple of other arbitrary code points are likewise encoded.
Care needs to be taken with this approach. If your text needs to be escaped for HTML, that needs to be done before the above code or the ampersands end up being escaped.
Depending on your default encoding, the following lines could cause problems:
byte[] latin1 = sb.toString().getBytes("ISO-8859-1");
return new String(latin1);
In Java, a String/char is always UTF-16. A different encoding is only involved when you convert the characters to bytes. Say your default encoding is UTF-8: the latin1 buffer is then treated as UTF-8, and some Latin-1 sequences may form invalid UTF-8 byte sequences, so you will get ?.
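If the bytes really must pass through Latin-1, the decoding charset has to be stated explicitly rather than left to the platform default; a minimal sketch:

byte[] latin1 = sb.toString().getBytes("ISO-8859-1");
return new String(latin1, "ISO-8859-1"); // decode with the same charset used to encode

Note that characters outside Latin-1 (such as “) still cannot survive this round trip; getBytes replaces them with ?.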
With Java 8, McDowell's answer can be simplified like this (while preserving correct handling of surrogate pairs):
public final class HtmlEncoder {
private HtmlEncoder() {
}
public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
T out) throws java.io.IOException {
for (PrimitiveIterator.OfInt iterator = sequence.codePoints().iterator(); iterator.hasNext(); ) {
int codePoint = iterator.nextInt();
if (Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.BASIC_LATIN) {
out.append((char) codePoint);
} else {
out.append("&#x");
out.append(Integer.toHexString(codePoint));
out.append(";");
}
}
return out;
}
}
When you instantiate your String object, you need to indicate which encoding to use.
So replace :
return new String(latin1);
by
return new String(latin1, "ISO-8859-1");
In the last section of the code I print what the Reader gives me. But it's just bogus; where did I go wrong?
public static void read_impl(File file, String targetFile) {
// Create zipfile input stream
FileInputStream stream = new FileInputStream(file);
ZipInputStream zipFile = new ZipInputStream(new BufferedInputStream(stream));
// I'm looking for a specific file/entry
while (!zipFile.getNextEntry().getName().equals(targetFile)) {
zipFile.getNextEntry();
}
// Next step in api requires a reader
// The target file is a UTF-16 encoded text file
InputStreamReader reader = new InputStreamReader(zipFile, Charset.forName("UTF-16"));
// I can't make sense of what this prints
char buf[] = new char[1];
while (reader.read(buf, 0, 1) != -1) {
System.out.print(buf);
}
}
I'd guess that where you went wrong was believing that the file was UTF-16 encoded.
Can you show a few initial byte values if you don't decode them?
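For example, a quick hex dump of the entry's first bytes (a sketch; run it before wrapping the stream in a Reader) would show a UTF-16 byte order mark, FE FF or FF FE, if one is present:

byte[] head = new byte[16];
int n = zipFile.read(head);
for (int i = 0; i < n; i++) {
    System.out.printf("%02X ", head[i]);
}
System.out.println();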
Your use of a char array is a bit pointless, though at first glance it should work. Try this instead:
int c;
while ((c = reader.read()) != -1) {
System.out.print((char)c);
}
If that does not work either, then perhaps you got the wrong file, or the file does not contain what you think it does, or the console can't display the characters it contains.