I am reading in a file that is being sent through a socket and then trying to split it on newlines (\n). When I read in the file I use a byte[], and I convert the byte array to a String so that I can split it.
public String getUserFileData()
{
    try
    {
        byte[] mybytearray = new byte[1024];
        InputStream is = clientSocket.getInputStream();
        int bytesRead = is.read(mybytearray, 0, mybytearray.length);
        is.close();
        return new String(mybytearray);
    }
    catch(IOException e)
    {
    }
    return "";
}
Here is the code used to attempt to split the String:
public void readUserFile(String userData, Log logger)
{
String[] data;
String companyName;
data = userData.split("\n");
username = data[0];
password = data[1].toCharArray();
companyName = data[2];
quota = Float.parseFloat(data[3]);
company = new Company();
company.readCompanyFile("C:\\Users\\Chris\\Documents\\NetBeansProjects\\ArFile\\ArFile Clients\\" + companyName + "\\"
+ companyName + ".cmp");
cloudFiles = new CloudFiles();
cloudFiles.readCloudFiles(this, logger);
}
It causes this error
Exception in thread "AWT-EventQueue-1" java.lang.ArrayIndexOutOfBoundsException
You can use the readLine method in BufferedReader class.
Wrap the InputStream in an InputStreamReader, and wrap that in a BufferedReader:
InputStream is = clientSocket.getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(is));
Please also check the encoding of the stream - you might need to specify the encoding in the constructor of InputStreamReader.
As stated in comments, using a BufferedReader would be best - you should be using an InputStreamReader anyway in order to convert from binary to text.
// Or use a different encoding - whatever's appropriate
BufferedReader reader = new BufferedReader(
        new InputStreamReader(clientSocket.getInputStream(), "UTF-8"));
try {
    String line;
    // I'm assuming you want to read every incoming line
    while ((line = reader.readLine()) != null) {
        processLine(line);
    }
} finally {
    reader.close();
}
Note that it's important to state which encoding you want to use - otherwise it'll use the platform's default encoding, which will vary from machine to machine, whereas presumably the data is in one specific encoding. If you don't know which encoding that is yet, you need to find out. Until then, you simply can't reliably understand the data.
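On Java 7 or later you can also pass a java.nio.charset.StandardCharsets constant instead of the charset name; a minimal sketch of the same reader construction:
// Sketch: StandardCharsets.UTF_8 avoids the checked UnsupportedEncodingException
// that the String-based constructor can throw
BufferedReader reader = new BufferedReader(
        new InputStreamReader(clientSocket.getInputStream(), StandardCharsets.UTF_8));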
(I hope your real code doesn't have an empty catch block, by the way.)
Related
In my app I need to download a web page. I do it like this:
URL url = new URL(myUrl);
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setReadTimeout(5000000); // read timeout in milliseconds
conn.setConnectTimeout(5000000); // connect timeout in milliseconds
conn.setRequestMethod("GET");
conn.setDoInput(true);
conn.connect();
int response = conn.getResponseCode();
is = conn.getInputStream();
String s = readIt(is);
System.out.println("got: " + s);
My readIt function is:
public String readIt(InputStream stream) throws IOException {
    int len = 10000;
    Reader reader;
    reader = new InputStreamReader(stream, "UTF-8");
    char[] buffer = new char[len];
    reader.read(buffer);
    return new String(buffer);
}
The problem is that it doesn't download the whole page. For example, if myUrl is "https://wikipedia.org", the output is only the beginning of the page.
How can I download the whole page?
Update
The second answer here, Read/convert an InputStream to a String, solved my problem. The problem is in the readIt function. You should read the response from the InputStream like this:
static String convertStreamToString(java.io.InputStream is) {
    java.util.Scanner s = new java.util.Scanner(is).useDelimiter("\\A");
    return s.hasNext() ? s.next() : "";
}
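For example, with the conn set up as above, the call in the question becomes something like this (a sketch; it simply replaces readIt with the Scanner-based helper):
InputStream is = conn.getInputStream();
String s = convertStreamToString(is);
System.out.println("got: " + s);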
There are a number of mistakes in your code:
You are reading into a character buffer with a fixed size.
You are ignoring the result of the read(char[]) method. It returns the number of characters actually read ... and you need to use that.
You are assuming that read(char[]) will read all of the data. In fact, it is only guaranteed to return at least one character ... or zero to indicate that you have reached the end of stream. When you read from a network connection, you are liable to only get the data that has already been sent by the other end and buffered locally.
When you create the String from the char[] you are assuming that every position in the character array contains a character from your stream.
There are multiple ways to do it correctly, and this is one way:
public String readIt(InputStream stream) throws IOException {
    Reader reader = new InputStreamReader(stream, "UTF-8");
    char[] buffer = new char[4096];
    StringBuilder builder = new StringBuilder();
    int len;
    while ((len = reader.read(buffer)) > 0) {
        builder.append(buffer, 0, len);
    }
    return builder.toString();
}
Another way to do it is to look for an existing 3rd-party library method with a readFully(Reader) method.
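For example, if you happen to have Apache Commons IO on the classpath, a sketch of readIt could delegate the whole read-and-decode to IOUtils (this assumes commons-io 2.3+ for the Charset overload):
// Sketch: requires org.apache.commons.io.IOUtils and java.nio.charset.StandardCharsets
public String readIt(InputStream stream) throws IOException {
    return IOUtils.toString(stream, StandardCharsets.UTF_8);
}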
You need to read in a loop till there are no more bytes left in the InputStream.
while (-1 != (len = in.read(buffer))) {
    // do stuff here
}
You are reading at most 10000 characters from the input stream, and only from a single read() call.
Use a BufferedReader to make your life easier.
public String readIt(InputStream stream) throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
    StringBuilder out = new StringBuilder();
    String newLine = System.getProperty("line.separator");
    String line;
    while ((line = reader.readLine()) != null) {
        out.append(line);
        out.append(newLine);
    }
    return out.toString();
}
Just look at the code below:
try {
    String str = "上海上海";
    String gb2312 = new String(str.getBytes("utf-8"), "gb2312");
    String utf8 = new String(gb2312.getBytes("gb2312"), "utf-8");
    System.out.println(str.equals(utf8));
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
It prints false!!!
I ran this code under both JDK 7 and JDK 8, and my IDE's file encoding is UTF-8.
Can anyone help me?
What you are looking for is the encoding/decoding that happens when you output/input.
As #kalpesh said, internally it is all Unicode. If you want to READ a stream in a specific encoding and then WRITE it to a different one, you have to specify the encoding for the conversion between bytes (in the stream) and strings (in Java), and then between strings (in Java) and bytes (the output stream), like so:
InputStream is = new FileInputStream("utf8_encoded_text.txt");
OutputStream os = new FileOutputStream("gb2312_encoded.txt");
Reader r = new InputStreamReader(is, "utf-8");
BufferedReader br = new BufferedReader(r);
Writer w = new OutputStreamWriter(os, "gb2312");
BufferedWriter bw = new BufferedWriter(w);
String s = null;
while ((s = br.readLine()) != null) {
    bw.write(s);
}
br.close();
bw.close(); // flushes and closes the underlying output stream as well
Of course, you still have to do proper exception handling to make sure everything is properly closed.
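On Java 7 and later, a try-with-resources block handles that closing for you; a sketch of the same copy loop:
// Sketch: the resources are closed (and the writer flushed) automatically,
// even if an IOException is thrown inside the loop
try (BufferedReader br = new BufferedReader(
             new InputStreamReader(new FileInputStream("utf8_encoded_text.txt"), "utf-8"));
     BufferedWriter bw = new BufferedWriter(
             new OutputStreamWriter(new FileOutputStream("gb2312_encoded.txt"), "gb2312"))) {
    String s;
    while ((s = br.readLine()) != null) {
        bw.write(s);
        bw.newLine(); // keep the line breaks that readLine() strips
    }
}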
String gb2312 = new String(str.getBytes("utf-8"), "gb2312");
This statement is incorrect because the String constructor expects the byte array and the charset to match: here you are producing UTF-8 bytes but telling the constructor they are GB2312.
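For comparison, a minimal sketch (inside the same try/catch as above) that keeps the byte array and the charset matched round-trips cleanly, since 上海 is representable in GB2312:
String str = "上海上海";
byte[] gbBytes = str.getBytes("gb2312");             // encode with GB2312
String roundTripped = new String(gbBytes, "gb2312"); // decode with the SAME charset
System.out.println(str.equals(roundTripped));        // prints true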
I'm looking for a way to switch between reading bytes (as byte[]) and reading lines of Strings from a file. I know that a byte[] can be obtained from a file through a FileInputStream, and a String can be obtained through a BufferedReader, but using both of them at the same time is proving problematic. I know how long the sections of bytes are. The String encoding can be kept constant from when I write the file. The filetype is a custom one that is still in development, so I can change how I write data to it.
How can I read Strings and byte[]s from the same file in java?
Read as bytes. When you have read a sequence of bytes that you know should be a string, place those bytes in an array, wrap the array in a ByteArrayInputStream, use that as the underlying InputStream for a Reader to get the bytes as characters, and then read those characters to produce a String.
For the later parts of this process see the related SO question on how to create a String from an InputStream.
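A minimal sketch of that idea, assuming the file was written with UTF-8 (the method name is just illustrative):
// Sketch: turn a known section of bytes into a String via a Reader
static String sectionToString(byte[] bytes, int offset, int length) throws IOException {
    Reader reader = new InputStreamReader(
            new ByteArrayInputStream(bytes, offset, length), "UTF-8");
    StringBuilder sb = new StringBuilder();
    int c;
    while ((c = reader.read()) != -1) {
        sb.append((char) c);
    }
    return sb.toString();
}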
Read the file as Strings using a BufferedReader then use String.getBytes().
Why not try this:
BufferedReader bufferedReader = null;
try {
    bufferedReader = new BufferedReader(new FileReader("testing.txt"));
    String line;
    while ((line = bufferedReader.readLine()) != null) {
        byte[] b = line.getBytes();
    }
} finally {
    if (bufferedReader != null) {
        bufferedReader.close();
    }
}
or
FileInputStream in = null;
BufferedReader bufferedReader = null;
try {
    bufferedReader = new BufferedReader(new FileReader("xanadu.txt"));
    String line;
    while ((line = bufferedReader.readLine()) != null) {
        // read your line
    }
    in = new FileInputStream("xanadu.txt");
    int c;
    while ((c = in.read()) != -1) {
        // read your bytes (c)
    }
} finally {
    if (in != null) {
        in.close();
    }
    if (bufferedReader != null) {
        bufferedReader.close();
    }
}
Read everything as bytes from the buffered input stream, and convert string sections into Strings using the constructor that accepts a byte array:
String string = new String(bytes, offset, length, "US-ASCII");
Depending on how the data are actually encoded, you may need to use "UTF-8" or something else as the name of the charset.
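For example, if a section is known to be n bytes long, a sketch using DataInputStream.readFully (the method name is just illustrative) could look like this:
static String readSection(DataInputStream in, int n) throws IOException {
    byte[] section = new byte[n];   // n = known length of the string section
    in.readFully(section);          // reads exactly n bytes or throws EOFException
    return new String(section, "UTF-8");
}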
I am trying to convert all Windows special characters to their Unicode equivalent. We have a Flex application, where a user saves some Rich Text, and then it is emailed through a Java Emailer to their recipient. However, we keep running into Word's special characters that just show up in the email as a ?.
So far I've tried
private String replaceWordChars(String text_in) {
String s = text_in;
// smart single quotes and apostrophe
s = s.replaceAll("[\\u2018|\\u2019|\\u201A]", "\'");
// smart double quotes
s = s.replaceAll("[\\u201C|\\u201D|\\u201E]", "\"");
// ellipsis
s = s.replaceAll("\\u2026", "...");
// dashes
s = s.replaceAll("[\\u2013|\\u2014]", "-");
// circumflex
s = s.replaceAll("\\u02C6", "^");
// open angle bracket
s = s.replaceAll("\\u2039", "<");
// close angle bracket
s = s.replaceAll("\\u203A", ">");
// spaces
s = s.replaceAll("[\\u02DC|\\u00A0]", " ");
return s;
This works, but I don't want to hand-encode all Windows-1252 characters to their UTF-16 equivalents (assuming that's what Java's default character set is).
However, our users keep finding more characters from Microsoft Word that Java just can't handle. So I searched and searched, and found this example:
private String replaceWordChars(String text_in) {
String s = text_in;
try {
byte[] b = s.getBytes("Cp1252");
byte[] encoded = new String(b, "Cp1252").getBytes("UTF-16");
s = new String(encoded, "UTF-16");
} catch (UnsupportedEncodingException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return s;
But when I watch the encoding happen in the Eclipse debugger, nothing changes.
There has to be a simple solution to dealing with Microsoft's lovely encoding with Java.
Any thoughts?
You could try using java.nio.charset.Charset:
final Charset windowsCharset = Charset.forName("windows-1252");
final Charset utfCharset = Charset.forName("UTF-16");
final CharBuffer windowsEncoded = windowsCharset.decode(ByteBuffer.wrap(new byte[] {(byte) 0x91}));
final byte[] utfEncoded = utfCharset.encode(windowsEncoded).array();
System.out.println(new String(utfEncoded, utfCharset.displayName()));
Use the following steps:
Create an InputStreamReader using the source file's encoding (Windows-1252)
Create an OutputStreamWriter using the destination file's encoding (UTF-16)
Copy the information read from the reader to the writer. You can use BufferedReader and BufferedWriter to write contents line-by-line.
So your code may look like this:
public void reencode(InputStream source, OutputStream dest,
        String sourceEncoding, String destEncoding)
        throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(source, sourceEncoding));
    BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(dest, destEncoding));
    String in;
    while ((in = reader.readLine()) != null) {
        writer.write(in);
        writer.newLine();
    }
    writer.flush(); // make sure the buffered output actually reaches dest
}
This, of course, excludes try/catch stuff and delegates it to the caller.
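Calling it for the Windows-1252 to UTF-16 case in the question might look like this (a sketch; the file names are placeholders):
try (FileInputStream in = new FileInputStream("word-input.txt");
     FileOutputStream out = new FileOutputStream("output-utf16.txt")) {
    reencode(in, out, "windows-1252", "UTF-16");
}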
If you're just trying to get the contents as a string of sorts, you can replace the writer with StringWriter and return its toString value. Then you don't need a destination stream or encoding, just a place to dump characters:
public String decode(InputStream source, String sourceEncoding)
        throws IOException {
    BufferedReader reader = new BufferedReader(new InputStreamReader(source, sourceEncoding));
    StringWriter writer = new StringWriter();
    String in;
    while ((in = reader.readLine()) != null) {
        writer.write(in);
        writer.write('\n'); // Java newline should be fine, test this just in case
    }
    return writer.toString();
}
What seems to work so far for everything I've tested is:
private String replaceWordChars(String text_in) {
String s = text_in;
final Charset windowsCharset = Charset.forName("windows-1252");
final Charset utfCharset = Charset.forName("UTF-16");
byte[] incomingBytes = s.getBytes();
final CharBuffer windowsEncoded =
windowsCharset.decode(ByteBuffer.wrap(incomingBytes));
final byte[] utfEncoded = utfCharset.encode(windowsEncoded).array();
s = new String(utfEncoded);
return s;
}
My input is an InputStream that contains an XML document. The encoding used in the XML is unknown; it is declared in the first line of the XML document.
From this InputStream, I want to get the whole document as a String.
To do this, I use a BufferedInputStream to mark the beginning of the file and start reading the first line. I read this first line to get the encoding, and then use an InputStreamReader to generate a String with the correct encoding.
It seems that this is not the best way to achieve this goal, because it produces an OutOfMemoryError.
Any idea how to do it?
public static String streamToString(final InputStream is) {
    String result = null;
    if (is != null) {
        BufferedInputStream bis = new BufferedInputStream(is);
        bis.mark(Integer.MAX_VALUE);
        final StringBuilder stringBuilder = new StringBuilder();
        try {
            // stream reader that handles the encoding detection
            final InputStreamReader readerForEncoding = new InputStreamReader(bis, "UTF-8");
            final BufferedReader bufferedReaderForEncoding = new BufferedReader(readerForEncoding);
            String encoding = extractEncodingFromStream(bufferedReaderForEncoding);
            if (encoding == null) {
                encoding = DEFAULT_ENCODING;
            }
            // stream reader for the content, using the detected encoding
            bis.reset();
            final InputStreamReader readerForContent = new InputStreamReader(bis, encoding);
            final BufferedReader bufferedReaderForContent = new BufferedReader(readerForContent);
            String line = bufferedReaderForContent.readLine();
            while (line != null) {
                stringBuilder.append(line);
                line = bufferedReaderForContent.readLine();
            }
            bufferedReaderForContent.close();
            bufferedReaderForEncoding.close();
        } catch (IOException e) {
            // reset string builder
            stringBuilder.delete(0, stringBuilder.length());
        }
        result = stringBuilder.toString();
    } else {
        result = null;
    }
    return result;
}
The call to mark(Integer.MAX_VALUE) is causing the OutOfMemoryError, since it allows the BufferedInputStream's internal buffer to grow up to 2GB as the stream is read.
You can solve this by using an iterative approach. Set the mark readLimit to a reasonable value, say 8K. In 99% of cases this will work, but in pathological cases, e.g. 16K of spaces between the attributes in the declaration, you will need to try again. Thus, have a loop that tries to find the encoding; if it doesn't find it within the given mark region, it tries again, doubling the requested mark readLimit.
To be sure you don't advance the input stream past the mark limit, you should read the InputStream yourself, up to the mark limit, into a byte array. Then wrap the byte array in a ByteArrayInputStream and pass that to the constructor of the InputStreamReader assigned to 'readerForEncoding'.
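A sketch of that approach, reusing the question's extractEncodingFromStream helper and DEFAULT_ENCODING constant, with 8K as an assumed starting read limit (the doubling retry is left out for brevity):
private static String detectEncoding(BufferedInputStream bis) throws IOException {
    int limit = 8 * 1024;
    bis.mark(limit);
    byte[] prefix = new byte[limit];
    // a single read is enough for a sketch; a robust version would loop
    // until the buffer is full or the stream ends
    int n = bis.read(prefix, 0, prefix.length);
    bis.reset(); // safe: we never read past the mark limit
    if (n <= 0) {
        return DEFAULT_ENCODING;
    }
    BufferedReader sniffer = new BufferedReader(new InputStreamReader(
            new ByteArrayInputStream(prefix, 0, n), "UTF-8"));
    String encoding = extractEncodingFromStream(sniffer);
    return encoding != null ? encoding : DEFAULT_ENCODING;
}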
You can use this method to convert an InputStream to a String; it might help you:
private String convertStreamToString(InputStream input) throws Exception {
    BufferedReader reader = new BufferedReader(new InputStreamReader(input));
    StringBuilder sb = new StringBuilder();
    String line = null;
    while ((line = reader.readLine()) != null) {
        sb.append(line);
    }
    input.close();
    return sb.toString();
}