How to cut a String into 1 megabyte subString with Java?


I have come up with the following:
public static void cutString(String s) {
List<String> strings = new ArrayList<>();
int index = 0;
while (index < s.length()) {
strings.add(s.substring(index, Math.min(index + 1048576, s.length())));
index += 1048576;
}
}
But my problem is that in UTF-8 some characters take more than 1 byte, so cutting the String every 1048576 characters does not give pieces of 1 MB. I was thinking about using an Iterator, but that doesn't seem efficient. What would be the most efficient solution for this? A piece can be smaller than 1 MB to avoid slicing a character, just not bigger than that!

Quick, unsafe hack
You can use s.getBytes("UTF-8") to get an array with the actual bytes used by each UTF-8 character. Like this:
System.out.println("¡Adiós!".getBytes("UTF-8").length);
// Prints: 9
Once you have that, it's just a matter of splitting the byte array into chunks of length 1048576 and then turning the chunks back into strings with new String(chunk, "UTF-8").
However, doing it like that can break multi-byte characters at the beginning or end of a chunk. Say a 3-byte character starts at the last byte of the first chunk: its first byte would go into the first chunk and the other two bytes into the second chunk, so neither chunk decodes correctly.
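To make the breakage concrete, here is a minimal sketch that decodes an arbitrary slice of a UTF-8 byte array. The class and method names are made up for illustration. When a slice boundary falls inside a multi-byte character, Java's decoder substitutes U+FFFD (the replacement character) for the incomplete bytes instead of throwing.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class NaiveByteSplitDemo {
    // Decodes the byte range [from, to) as UTF-8.
    // Incomplete or stray bytes at the edges are replaced with U+FFFD.
    static String decodeSlice(byte[] utf8, int from, int to) {
        return new String(Arrays.copyOfRange(utf8, from, to), StandardCharsets.UTF_8);
    }
}
```

For "aé" (3 bytes in UTF-8), cutting after byte 2 leaves half of the "é" in each slice, so joining the two decoded slices no longer gives back the original string.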
Proper approach
If you can relax the "exactly 1 MB" requirement, you can take a safer approach: split the string into chunks of 1048576 characters (not bytes), then test each chunk's real length with getBytes, removing chars from the end as needed until the real size is less than or equal to 1 MB.
Here's an implementation that won't break characters, at the expense of having some chunks smaller than the given size:
public static List<String> cutString(String original, int chunkSize, String encoding) throws UnsupportedEncodingException {
List<String> strings = new ArrayList<>();
final int end = original.length();
int from = 0, to = 0;
do {
to = (to + chunkSize > end) ? end : to + chunkSize; // next chunk, watch out for small strings
String chunk = original.substring(from, to); // get chunk
while (chunk.getBytes(encoding).length > chunkSize) { // adjust chunk to proper byte size if necessary
chunk = original.substring(from, --to);
}
strings.add(chunk); // add chunk to collection
from = to; // next chunk
} while (to < end);
return strings;
}
I tested it with chunkSize = 24 so you could see the effect. It should work as well with any other size:
String test = "En la fase de maquetación de un documento o una página web o para probar un tipo de letra es necesario visualizar el aspecto del diseño. ٩(-̮̮̃-̃)۶ ٩(●̮̮̃•̃)۶ ٩(͡๏̯͡๏)۶ ٩(-̮̮̃•̃).";
for (String chunk : cutString(test, 24, "UTF-8")) {
System.out.println(String.format(
"Chunk [%s] - Chars: %d - Bytes: %d",
chunk, chunk.length(), chunk.getBytes("UTF-8").length));
}
/*
Prints:
Chunk [En la fase de maquetaci] - Chars: 23 - Bytes: 23
Chunk [ón de un documento o un] - Chars: 23 - Bytes: 24
Chunk [a página web o para pro] - Chars: 23 - Bytes: 24
Chunk [bar un tipo de letra es ] - Chars: 24 - Bytes: 24
Chunk [necesario visualizar el ] - Chars: 24 - Bytes: 24
Chunk [aspecto del diseño. ٩(] - Chars: 22 - Bytes: 24
Chunk [-̮̮̃-̃)۶ ٩(●̮̮] - Chars: 14 - Bytes: 24
Chunk [̃•̃)۶ ٩(͡๏̯͡] - Chars: 12 - Bytes: 23
Chunk [๏)۶ ٩(-̮̮̃•̃).] - Chars: 14 - Bytes: 24
*/
Another test with a 3 MB string like the one you mention in your comments:
String string = "0123456789ABCDEF";
StringBuilder bigAssString = new StringBuilder(1024*1024*3);
for (int i = 0; i < ((1024*1024*3)/16); i++) {
bigAssString.append(string);
}
System.out.println("bigAssString.length = " + bigAssString.toString().length());
bigAssString.replace((1024*1024*3)/4, ((1024*1024*3)/4)+1, "á");
for (String chunk : cutString(bigAssString.toString(), 1024*1024, "UTF-8")) {
System.out.println(String.format(
"Chunk [...] - Chars: %d - Bytes: %d",
chunk.length(), chunk.getBytes("UTF-8").length));
}
/*
Prints:
bigAssString.length = 3145728
Chunk [...] - Chars: 1048575 - Bytes: 1048576
Chunk [...] - Chars: 1048576 - Bytes: 1048576
Chunk [...] - Chars: 1048576 - Bytes: 1048576
Chunk [...] - Chars: 1 - Bytes: 1
*/
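Two caveats about the loop above: it re-encodes the whole chunk with getBytes each time it drops a char, and substring works on char indexes, so it can cut between the two halves of a surrogate pair (for example inside an emoji). Below is a sketch of an alternative that walks the string once by code point and adds up the UTF-8 size of each one. The names Utf8Chunker, split and utf8Length are hypothetical, and the sketch assumes maxBytes is at least 4 so that any single code point fits in a chunk.

```java
import java.util.ArrayList;
import java.util.List;

final class Utf8Chunker {
    // Number of bytes UTF-8 needs for one Unicode code point.
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    // Splits s into pieces whose UTF-8 encoding is at most maxBytes long,
    // cutting only between code points.
    static List<String> split(String s, int maxBytes) {
        if (maxBytes < 4) {
            throw new IllegalArgumentException("maxBytes must be at least 4");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;  // char index where the current chunk begins
        int bytes = 0;  // UTF-8 size of the current chunk so far
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            int len = utf8Length(cp);
            if (bytes + len > maxBytes) {  // this code point would overflow: close the chunk
                chunks.add(s.substring(start, i));
                start = i;
                bytes = 0;
            }
            bytes += len;
            i += Character.charCount(cp);  // 2 chars for a surrogate pair, else 1
        }
        if (start < s.length()) {
            chunks.add(s.substring(start));
        }
        return chunks;
    }
}
```

Calling Utf8Chunker.split(text, 1024 * 1024) would give pieces of at most 1 MB each. An unpaired surrogate is counted as 3 bytes here, which is never less than what getBytes produces for it, so a chunk can come out slightly under the limit but not over it.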

You can use a ByteArrayOutputStream with an OutputStreamWriter:
ByteArrayOutputStream out = new ByteArrayOutputStream();
Writer w = new OutputStreamWriter(out, "utf-8");
// write everything to the writer
w.write(myString);
w.flush(); // push any buffered characters through to the byte stream
byte[] bytes = out.toByteArray();
// Now you have the actual size of the string in bytes and can parcel it by MB.
// Be aware that problems may occur if a multi-byte character ends up split across two parcels.

Related

About Java Android BASE64 decoding to ASCII String

I have Base64 string data that I have received from a service.
I am able to decode this data and get a byte array.
But when I create a new String from that byte array, my server is not able to read the data properly.
The same process in C on a Linux-based device works fine with my server. That is to say, if I Base64-decode that same string on the device (using OpenSSL, getting a char array) and send it to my server, the server is able to read it properly.
I tried some sample code in Eclipse to understand the problem. Below is the sample:
String base1 =
"sUqVKrgErEId6j3rH8BMMpzvXuTf05rj0PlO/eLOoJwQb3rXrsplAl28unkZP0WvrXRTlpAmT3Y"
+ "ohtPFl2+zyUaCSrYfug5JtVHLoVsJ9++Afpx6A5dupn3KJQ9L9ItfWvatIlamQyMo2S5nDypCw79"
+ "B2HNAR/PG1wfgYG5OPMNjNSC801kQSE9ljMg3hH6nrRJhXvEVFlllKIOXOYuR/NORAH9k5W+rQeQ"
+ "7ONsnao2zvYjfiKO6eGleL6/DF3MKCnGx1sbci9488EQhEBBOG5FGJ7KjTPEQzn/rq3m1Yj9Le/r"
+ "KsmzbRNcJN2p/wy1xz9oHy8jWDm81iwRYndJYAQ==";
byte[] b3 = Base64.getDecoder().decode(base1.getBytes());
System.out.println("B3Len:" + b3.length );
String s2 = new String(b3);
System.out.println("S2Len:" + s2.length() );
System.out.println("B3Hex: " + bytesToHex(b3) );
System.out.println("B3HexLen: " + bytesToHex(b3).length() );
byte[] b2 = s2.getBytes();
System.out.println("B2Len:" + b2.length );
int count = 0;
for(int i = 0; i< b3.length; i++) {
if(b3[i] != b2[i]) {
count++;
System.out.println("Byte: " + i + " >> " + b3[i] + " != " + b2[i]);
}
}
System.out.println("Count: " + count);
System.out.println("B2Hex: " + bytesToHex(b2) );
System.out.println("B2HexLen: " + bytesToHex(b2).length() );
Below is output:
B3Len:256
S2Len:256
B3Hex:
b14a952ab804ac421dea3deb1fc04c329cef5ee4dfd39ae3d0f94efde2cea09c106f7ad7aeca
65025dbcba79193f45afad74539690264f762886d3c5976fb3c946824ab61fba0e49b551cba1
5b09f7ef807e9c7a03976ea67dca250f4bf48b5f5af6ad2256a6432328d92e670f2a42c3bf41
d8734047f3c6d707e0606e4e3cc3633520bcd35910484f658cc837847ea7ad12615ef1151659
65288397398b91fcd391007f64e56fab41e43b38db276a8db3bd88df88a3ba78695e2fafc317
730a0a71b1d6c6dc8bde3cf0442110104e1b914627b2a34cf110ce7febab79b5623f4b7bfaca
b26cdb44d709376a7fc32d71cfda07cbc8d60e6f358b04589dd25801
B3HexLen: 512
B2Len:256
Byte: 52 >> -112 != 63
Byte: 175 >> -115 != 63
Byte: 252 >> -99 != 63
Count: 3
B2Hex:
b14a952ab804ac421dea3deb1fc04c329cef5ee4dfd39ae3d0f94efde2cea09c106f7ad7aeca
65025dbcba79193f45afad7453963f264f762886d3c5976fb3c946824ab61fba0e49b551cba1
5b09f7ef807e9c7a03976ea67dca250f4bf48b5f5af6ad2256a6432328d92e670f2a42c3bf41
d8734047f3c6d707e0606e4e3cc3633520bcd35910484f658cc837847ea7ad12615ef1151659
65288397398b91fcd391007f64e56fab41e43b38db276a3fb3bd88df88a3ba78695e2fafc317
730a0a71b1d6c6dc8bde3cf0442110104e1b914627b2a34cf110ce7febab79b5623f4b7bfaca
b26cdb44d709376a7fc32d71cfda07cbc8d60e6f358b04583fd25801
B2HexLen: 512
I understand that there are extended characters in this string.
So here we can see that converting the bytes to a String and back does not work properly, because the two byte arrays differ.
I need this to work because I have a much larger Base64 string than the one in this sample, which I need to send to my server, and the server is trying to read an ASCII string.
Or, can anyone give me a solution that produces an ASCII String identical to the char array output from C (OpenSSL decoding) on the Linux device?
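For what it's worth, the three mismatches in the output (bytes -112, -115 and -99 coming back as 63, which is '?') are what you get when new String(byte[]) decodes with a platform default charset that has no character for those byte values, and getBytes() then encodes the substituted character. A minimal sketch follows, assuming the server really wants the raw decoded bytes: the simplest route is to send the byte[] itself, and if a String is unavoidable, ISO-8859-1 maps every byte value to exactly one char and back. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Latin1RoundTrip {
    // Decodes Base64 text to raw bytes.
    static byte[] decode(String base64) {
        return Base64.getDecoder().decode(base64);
    }

    // Wraps bytes in a String without loss:
    // ISO-8859-1 maps byte values 0..255 to chars U+0000..U+00FF one to one.
    static String toLatin1String(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Recovers the original bytes from a String built by toLatin1String.
    static byte[] fromLatin1String(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }
}
```

Whether the server accepts this depends on how the String is written to the wire; whatever writes it must also use ISO-8859-1, otherwise the same substitution happens one step later.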

Calculating APNS frame sizes / formatting byte stream

As per [ https://developer.apple.com/library/ios/documentation/NetworkingInternet/Conceptual/RemoteNotificationsPG/Chapters/CommunicatingWIthAPS.html#//apple_ref/doc/uid/TP40008194-CH101-SW4 ] the proper notification format appears to be:
OutputStream os; // assuming this connection is valid/open
// header
os.write(command); // 1 byte | value of 2 in doc but 1 in notnoop/java-apns ?
os.write(frame_length); // 4 bytes | value of total size of frame items:
// item 1 - token
os.write(item_id_1); // 1 byte | value of 1
os.write(item_length_1); // 2 bytes | value of size of data (32 for token):
os.write(item_data_1); // 32 bytes | token data
// = total frame item size: 35
// item 2 - payload
os.write(item_id_2); // 1 byte | value of 2
os.write(item_length_2); // 2 bytes | value of size of data (175 for my payload):
os.write(item_data_2); // 175 bytes | payload data
// = total frame item size: 178
// item 3 - identifier
os.write(item_id_3); // 1 byte | value of 3
os.write(item_length_3); // 2 bytes | value of size of data (4)
os.write(item_data_3); // 4 byte | identifier data
// = total frame item size: 7
// item 4 - expiration date
os.write(item_id_4); // 1 byte | value of 4
os.write(item_length_4); // 2 bytes | value of size of data (4)
os.write(item_data_4); // 4 byte | expiration data
// = total frame item size: 7
// item 5 - priority
os.write(item_id_5); // 1 byte | value of 5
os.write(item_length_5); // 2 bytes | value of size of data (1):
os.write(item_data_5); // 1 byte | priority data
// = total frame item size: 4
Assuming that's all correct, summing all of the frame item totals should give a frame data length of: 35 + 178 + 7 + 7 + 4 = 231.
However in looking over some of the notnoop/java-apns code:
public byte[] marshall() {
if (marshall == null) {
marshall = Utilities.marshallEnhanced(COMMAND, identifier,
expiry, deviceToken, payload);
}
return marshall.clone();
}
public int length() {
int length = 1 + 4 + 4 + 2 + deviceToken.length + 2 + payload.length;
//1 = ?
//4 = identifier length
//4 = expiration length
//2 = ?
//32 = token length
//2 = ?
//x = payload length
final int marshalledLength = marshall().length;
assert marshalledLength == length;
return length;
}
I fail to see how this is calculating the length correctly. My code however does not work while this presumably does. What am I doing wrong?

First problem: I was examining an enhanced format (uses command 1) while trying to use a new format (uses command 2).
Second problem: I wasn't using some form of ByteBuffering so the packet wasn't formed right.
The total frame size calculations were correct in both cases.
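For the second problem, here is a rough sketch of how the frame could be assembled with java.nio.ByteBuffer, so that the multi-byte length fields are written as big-endian integers rather than single bytes. It follows the item layout listed in the question (1-byte id, 2-byte length, data) and has not been checked against Apple's documentation; the class and method names are made up.

```java
import java.nio.ByteBuffer;

final class ApnsFrameSketch {
    // Appends one frame item: 1-byte item id, 2-byte length, then the data.
    static void putItem(ByteBuffer buf, int itemId, byte[] data) {
        buf.put((byte) itemId);
        buf.putShort((short) data.length);
        buf.put(data);
    }

    // Builds command (1 byte) + frame length (4 bytes) + the five items from the question.
    static byte[] buildFrame(byte[] token, byte[] payload,
                             int identifier, int expiry, byte priority) {
        // Each item costs 3 bytes of header (id + length) plus its data.
        int frameLength = (3 + token.length) + (3 + payload.length)
                + (3 + 4) + (3 + 4) + (3 + 1);
        ByteBuffer buf = ByteBuffer.allocate(1 + 4 + frameLength); // big-endian by default
        buf.put((byte) 2);        // command 2, as in the layout quoted in the question
        buf.putInt(frameLength);  // 4-byte frame length
        putItem(buf, 1, token);
        putItem(buf, 2, payload);
        putItem(buf, 3, ByteBuffer.allocate(4).putInt(identifier).array());
        putItem(buf, 4, ByteBuffer.allocate(4).putInt(expiry).array());
        putItem(buf, 5, new byte[] { priority });
        return buf.array();
    }
}
```

With a 32-byte token and a 175-byte payload, frameLength comes out as 231 and the whole packet as 236 bytes, which can then be written to the OutputStream in one os.write(frame) call.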

SMPP Submit Long Message and message split

We are using the cloudhopper SMPP library to send long SMS messages to the SMS gateway Innovativetxt.com. When we split a long message into parts of 140 bytes each, every part ends up holding 134 characters.
However, the industry standard seems to be 153 characters for each part of a GSM-encoded long message. Are we doing something wrong by having only 134 characters when we split at 140 bytes? If we try to submit a message larger than 140 bytes, the gateway provider rejects it with an "oversized message body" error.
Should we split the message into parts of 153 characters each to submit to the SMSC, instead of splitting into parts of 140 bytes each?
What is the best way to split a long message: by message size (i.e. 140 bytes) or by character count?
Has anyone faced the same issue with cloudhopper or another Java-based library, and what should we do?

It's a common confusion. You are doing everything right. Message lengths may be 160 chars (7-bit GSM 03.38), 140 chars (8-bit Latin), or 70 chars (16-bit UCS-2). Notice: 160 * 7 == 140 * 8 == 70 * 16.
When you split a long message, additional information such as the total number of parts and the part index is stored in the message body, in the so-called User Data Header (UDH). This header also takes up space. So with the UDH you are left with 153 GSM chars (7-bit), 134 chars/bytes (8-bit), or 67 two-byte Unicode chars (16-bit) of payload.
See also http://www.nowsms.com/long-sms-text-messages-and-the-160-character-limit
The UDH is 6 bytes long for a concatenated message with an 8-bit reference number, as in your case.
UDH structure
0x05: Length of UDH (5 bytes to follow)
0x00: Concatenated message Information Element (8-bit reference number)
0x03: Length of Information Element data (3 bytes to follow)
0xXX: Reference number for this concatenated message
0xYY: Number of fragments in the concatenated message
0xZZ: Fragment number/index within the concatenated message
Total message length, bits: 160*7 = 140*8 = 1120
UDH length, bits: 6*8 = 48
Left payload, bits: 1120-48 = 1072
For GSM 03.38 you get 1072/7 = 153 GSM (7-bit) chars + 1 filling unused bit.
For Latin you get 1072/8 = 134 (8-bit) chars.
For UCS-2 you get 1072/16 = 67 (16-bit) chars.
As you can see, 153 GSM chars equal 134 bytes minus 1 bit. These 134 chars are probably what Java reports to you. But once you split your long text message, you end up with a binary message containing both text and UDH, and you should treat the message as binary. I suggest you make binary dumps of the resulting parts and investigate them.
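The arithmetic above can be written down as a small helper. This is only a sketch of the calculation, with hypothetical names; it assumes the 6-byte UDH described above and that every character takes exactly bitsPerChar bits (so it ignores GSM 03.38 extension characters, which take two septets).

```java
final class SmsSegmentMath {
    static final int SEGMENT_BITS = 140 * 8; // 1120 bits of user data per SMS
    static final int UDH_BITS = 6 * 8;       // concatenation header, 8-bit reference number

    // Characters that fit in one part of a concatenated message.
    static int charsPerPart(int bitsPerChar) {
        return (SEGMENT_BITS - UDH_BITS) / bitsPerChar;
    }

    // Number of parts needed; a message that fits in a single SMS needs no UDH.
    static int partsNeeded(int messageChars, int bitsPerChar) {
        int singleSmsChars = SEGMENT_BITS / bitsPerChar;
        if (messageChars <= singleSmsChars) {
            return 1;
        }
        int perPart = charsPerPart(bitsPerChar);
        return (messageChars + perPart - 1) / perPart; // round up
    }
}
```

charsPerPart(7), charsPerPart(8) and charsPerPart(16) give 153, 134 and 67, matching the figures above.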

Here is a sample method for sending both short and long SMS messages:
public synchronized String sendSMSMessage(String aMessage,
String aSentFromNumber, String aSendToNumber,
boolean requestDeliveryReceipt) {
byte[] textBytes = CharsetUtil.encode(aMessage,
CharsetUtil.CHARSET_ISO_8859_1);
try {
SubmitSm submitMsg = new SubmitSm();
// add delivery receipt if enabled.
if (requestDeliveryReceipt) {
submitMsg
.setRegisteredDelivery(SmppConstants.REGISTERED_DELIVERY_SMSC_RECEIPT_REQUESTED);
}
submitMsg.setSourceAddress(new Address((byte) 0x03, (byte) 0x00,
aSentFromNumber));
submitMsg.setDestAddress(new Address((byte) 0x01, (byte) 0x01,
aSendToNumber));
if (textBytes != null && textBytes.length > 255) {
submitMsg.addOptionalParameter(new Tlv(SmppConstants.TAG_MESSAGE_PAYLOAD, textBytes, "message_payload"));
}else{
submitMsg.setShortMessage(textBytes);
}
logger.debug("About to send message to " + aSendToNumber
+ ", Msg is :: " + aMessage + ", from :: "
+ aSentFromNumber);
SubmitSmResp submitResp = smppSession.submit(submitMsg, 15000);
logger.debug("Message sent to " + aSendToNumber
+ " with message id " + submitResp.getMessageId());
return submitResp.getMessageId();
} catch (Exception ex) {
logger.error("Exception sending message [Msg, From, To] :: ["
+ aMessage + ", " + aSentFromNumber + ", " + aSendToNumber + "]",
ex);
}
logger.debug("Message **NOT** sent to " + aSendToNumber);
return "Message Not Submitted to " + aSendToNumber;
}

extract values of ping message

I am working on an application on android that performs ping requests (via android shell) and I read from the console the message displayed. A typical message is the following
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=46 time=186 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=46 time=209 ms
--- 8.8.8.8 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 186.127/197.891/209.656/11.772 ms
I store the above message in a String. I want to extract the values of the time, for example 186 and 209 and also the percentage for loss, 0 (in this case).
I was thinking of going through the string and looking for the values after "time=". However, I don't know how to do it.
How can I manipulate the string I have in order to extract the values?

Start by getting each line of the string:
String[] lines = pingResult.split("\n");
Then, loop and use substring.
for (String line : lines) {
if (!line.contains("time=")) continue;
// Find the index of "time="
int index = line.indexOf("time=");
String time = line.substring(index + "time=".length());
// do what you will
}
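If you want to parse to an int, you could additionally do this inside the loop:
int millis = Integer.parseInt(time.replaceAll("[^0-9]", ""));
This removes all non-digit characters, so it only suits whole-millisecond values such as "186 ms"; a value like "18.6 ms" would become 186.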
You can do something similar for the percentage:
for (String line : lines) {
if (!line.contains("%")) continue;
// Find the index of "received, "
int index1 = line.indexOf("received, ");
// Find the index of "%"
int index2 = line.indexOf("%");
String percent = line.substring(index1 + "received, ".length(), index2);
// do what you will
}
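The same extraction can also be sketched with regular expressions, which copes with decimal times and does not need the line split. The class and method names are made up, and the patterns assume output shaped like the sample in the question ("time=186 ms" and "0% packet loss").

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PingOutputParser {
    // Matches "time=186 ms" or "time=18.6 ms"; group 1 is the number.
    private static final Pattern TIME =
            Pattern.compile("time=([0-9]+(?:\\.[0-9]+)?) ?ms");
    // Matches "0% packet loss"; group 1 is the number.
    private static final Pattern LOSS =
            Pattern.compile("([0-9]+(?:\\.[0-9]+)?)% packet loss");

    // Returns every per-reply round-trip time, in milliseconds.
    static List<Double> roundTripTimes(String pingOutput) {
        List<Double> times = new ArrayList<>();
        Matcher m = TIME.matcher(pingOutput);
        while (m.find()) {
            times.add(Double.parseDouble(m.group(1)));
        }
        return times;
    }

    // Returns the packet loss percentage, or -1 if the summary line is missing.
    static double packetLoss(String pingOutput) {
        Matcher m = LOSS.matcher(pingOutput);
        return m.find() ? Double.parseDouble(m.group(1)) : -1;
    }
}
```

The summary line's "time 1000ms" has no "=" after "time", so it is not picked up as a round-trip time.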

String of 1's & 0's into Raw Data

I have a web page with random 1's and 0's in the body and I want to treat it as raw binary data and save it to a file.
<html>
<head>...</head>
<body>
<pre style="word-wrap: break-word; white-space: pre-wrap;">1 0 1 0 1 1 1 1 1 0</pre>
</body>
</html>
Alternatively, I can get the file in one column. If I just url.openStream() and read bytes, it spits out ascii values (49 & 48). I'm also not sure how to write one bit at a time to a file. How do I go about doing this?

<pre style="word-wrap: break-word; white-space: pre-wrap;">1 0 1 0 1 1 1 1 1 0</pre>
This can be sent as two (base64) or three (hex) bytes, so I am assuming efficiency isn't an issue here. ;) Once you extract the String you can convert it with the following (which only works for up to 63 bits):
String s = "1 0 1 0 1 1 1 1 1 0";
long l = Long.parseLong(s.replaceAll(" ", ""), 2);

You can read bits into byte and then write bytes to a file:
int byteIndex = 0;
int currentByte = 0;
while (hasBits) {
String bit = readBit();
currentByte = currentByte << 1 | Integer.valueOf(bit);
if (++byteIndex == 8) {
writeByte(currentByte);
currentByte = 0;
byteIndex = 0;
}
}
// write the rest of bits here
Also, I agree with Robert Rouhani: this is a very inefficient way to transfer data.
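Here is a self-contained sketch of the same idea, packing a whole string of 1's and 0's into a byte array. The class name is hypothetical, and the handling of leftover bits is an assumption: they are placed in the high bits of a final byte, padded with zeros on the right.

```java
import java.io.ByteArrayOutputStream;

final class BitPacker {
    // Packs the '0'/'1' characters of text into bytes, most significant bit first.
    // Any other character (spaces, newlines) is skipped.
    static byte[] pack(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int current = 0; // bits collected for the byte in progress
        int count = 0;   // how many bits are in current
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != '0' && c != '1') {
                continue;
            }
            current = (current << 1) | (c - '0');
            if (++count == 8) {
                out.write(current);
                current = 0;
                count = 0;
            }
        }
        if (count > 0) {
            out.write(current << (8 - count)); // pad the last byte with zero bits
        }
        return out.toByteArray();
    }
}
```

The resulting array can be saved with java.nio.file.Files.write(path, bytes). For the sample "1 0 1 0 1 1 1 1 1 0" this gives two bytes, 0xAF and 0x80.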
