Processing a bzip2 string/file in Scala

I'm punishing myself a bit by doing the Python Challenge series in Scala.
Now, one of the challenges is to read in a string that's been compressed with the bzip2 algorithm and output the result.
BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084
Now, after some digging it appears there isn't a standard Java library for bzip2 processing, but there is an implementation in the Apache Ant project that this guy has kindly extracted for use as a separate library.
The thing is, I can't seem to get it to work: the following code just hangs in the Scala REPL while the JVM maxes out at 100% CPU usage.
This is the code I'm trying:
import java.io.ByteArrayInputStream
import org.apache.tools.bzip2.CBZip2InputStream
import org.apache.commons.io.IOUtils
object ChallengeEight extends Application {
val inputString = """BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084"""
val inputStream = new ByteArrayInputStream( inputString.getBytes("UTF-8") ) // convert the string to an InputStream
inputStream.skip(2) // skip the "BZ" magic at the start
val bzipInputStream = new CBZip2InputStream(inputStream) // hangs here....
val result = IOUtils.toString(bzipInputStream, "UTF-8");
println(result)
}
Anyone got any ideas? Or is the CBZip2InputStream class expecting some extra bytes that you might find in a file that has been zipped with bzip2?
Any help would be appreciated
EDIT: For the record, this is the Python solution:
import bz2
un = "BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!" \
"\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084"
print bz2.decompress(un)

To escape the binary characters, use a Unicode escape sequence of the form \uXXXX, where XXXX is the hexadecimal code of the character.
val un = "BZh91AY&SYA\u00af\u0082\r\u0000\u0000\u0001\u0001\u0080\u0002\u00c0\u0002\u0000 \u0000!\u009ah3M\u0007<]\u00c9\u0014\u00e1BA\u0006\u00be\u00084"

You are enclosing your string in triple quotes, which means you pass the literal backslash sequences to the algorithm rather than the control/binary characters they represent. Note also that when turning the string back into bytes you want getBytes("ISO-8859-1") rather than getBytes("UTF-8"): Latin-1 maps each \u00XX character back to the single byte 0xXX, whereas UTF-8 expands values of 0x80 and above into two bytes and corrupts the compressed stream.
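For reference, here is the whole fix as a plain Java sketch (Java rather than Scala only for uniformity with the rest of this page; it reuses the Ant-derived CBZip2InputStream and commons-io IOUtils from the question):
import java.io.ByteArrayInputStream;
import org.apache.tools.bzip2.CBZip2InputStream;
import org.apache.commons.io.IOUtils;

public class ChallengeEight {
    public static void main(String[] args) throws Exception {
        String input = "BZh91AY&SYA\u00af\u0082\r\u0000\u0000\u0001\u0001"
                + "\u0080\u0002\u00c0\u0002\u0000 \u0000!\u009ah3M\u0007<]"
                + "\u00c9\u0014\u00e1BA\u0006\u00be\u00084";
        // ISO-8859-1 turns each \u00XX character back into the single byte 0xXX
        ByteArrayInputStream in =
                new ByteArrayInputStream(input.getBytes("ISO-8859-1"));
        in.skip(2); // consume the "BZ" magic, which CBZip2InputStream does not expect
        CBZip2InputStream bzIn = new CBZip2InputStream(in);
        System.out.println(IOUtils.toString(bzIn, "UTF-8"));
    }
}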

Related

Character coding between mysql and java

I have an error printing special characters in Java.
The system reads a product name from a MySQL database, and when checking the database from the command line, the data displays the registered-trademark symbol ® correctly.
A Java program then reads the database to get order information to print out as a PDF, but when the print is produced the ® symbol becomes 'fi'.
Is there a way of retaining the MySQL character coding when handling the data in Java?
Before printing to PDF, you can replace the special-character markup with the corresponding Unicode characters, as below.
public static String specialCharactersConversion( String charString ) {
    // isNotEmpty presumably comes from e.g. Apache commons-lang StringUtils
    if ( isNotEmpty( charString ) ) {
        charString = charString.replaceAll( "\\(R\\)", "\u00AE" ); // "(R)" -> ®
    }
    return charString;
}
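Hypothetical usage (the input string is illustrative):
String label = specialCharactersConversion("Acme(R) Widget");
System.out.println(label); // prints: Acme® Widget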
There is an open-source Java library, MgntUtils, that has a utility that converts Strings to Unicode sequences and vice versa:
result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World
So, before converting your text to PDF, you can convert the special characters (or the entire text) to Unicode sequences. This answer is copied, with modifications, from this question: Convert International String to \u Codes in java
The library can be found on Maven Central or on GitHub. It comes as a Maven artifact, with sources and Javadoc.
Here is the Javadoc for the class StringUnicodeEncoderDecoder.

Encoding issue with JLine

JLine is a library for intercepting user input at a console before the user presses Enter. It uses JNA or similar wizardry.
I'm doing a few experiments with it and I'm getting encoding problems when I input more "exotic" Unicode characters. The OS here is Windows 10 and I'm using Cygwin. Also, this is Groovy, but it should be obvious to Java people.
def terminal = org.jline.terminal.TerminalBuilder.builder().jna( true ).system( true ).build()
terminal.enterRawMode()
// NB the Terminal I get is class org.jline.terminal.impl.PosixSysTerminal
def reader = terminal.reader()
def bytes = [] // NB class ArrayList
int readInt = -1
while( readInt != 13 && readInt != 10 ) {
readInt = reader.read()
byte convertedByte = (byte)readInt
// see what the binary looks like:
String binaryString = String.format("%8s", Integer.toBinaryString( convertedByte & 0xFF)).replace(' ', '0')
println "binary |$binaryString|"
bytes << (byte)readInt // NB means "append to list"
println ">>> read |$readInt| byte |$convertedByte|"
}
// strip final byte (13 or 10)
bytes = bytes[0..-2]
println "z bytes $bytes, class ${bytes.class.name}"
def response = new String( (byte[])bytes.toArray(), 'UTF-8' )
// to get proper out encoding for Cygwin I then need to do this (I have no idea why!)
def psOut = new PrintStream(System.out, true, 'UTF-8' )
psOut.print( "using PrintStream: |$response|" )
This works fine with single-byte input, and letters like "é" (2 bytes in UTF-8) get handled fine too. But it goes wrong with "ẃ":
ẃ --> Unicode U+1E83
UTF-8 HEX: 0xE1 0xBA 0x83 (e1ba83)
BINARY: 11100001:10111010:10000011
Actually, the binary it puts out when you enter "ẃ" is 11100001:10111010:10010010.
This translates to U+1E92, another precomposed Latin character, "Ẓ" (Z with a dot below). And that is indeed what gets printed out in the response String.
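As a quick sanity check of the byte math above, here is a plain Java sketch, independent of JLine, that decodes both three-byte sequences:
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    public static void main(String[] args) {
        byte[] expected = { (byte) 0xE1, (byte) 0xBA, (byte) 0x83 }; // UTF-8 for U+1E83
        byte[] received = { (byte) 0xE1, (byte) 0xBA, (byte) 0x92 }; // what actually arrived
        System.out.println(new String(expected, StandardCharsets.UTF_8)); // prints ẃ
        System.out.println(new String(received, StandardCharsets.UTF_8)); // prints Ẓ
    }
}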
Unfortunately the JLine package hands you this reader, which is class org.jline.utils.NonBlocking$NonBlockingInputStreamReader... So I don't really know what I can do to investigate its encoding (I presume UTF-8) or somehow modify it... Can anyone explain what the problem is?
As far as I can tell this relates to a Cygwin-specific problem, as asked and then answered by me a year ago.
There is a solution, in my answer to the question I asked directly after this one... which correctly deals with Unicode input, even when outside the Basic Multilingual Plane, using JLine, ... and using a Cygwin console ... hopefully.

send a string from c++ to java through socket

I have a block of code on the C++ end that passes a string like this:
char someName[100]="Some String here";
send(sock,someName,sizeof(someName),0);
and on the other end I have Java code looking for a string message like this:
DataInputStream dIn= new DataInputStream(SOCK.getInputStream());
String filename=dIn.readUTF(); //Looks for "Some String here"
The code does not continue and throws a UTFDataFormatException. So I'm basically looking for a conversion of the C++ someName to UTF-8 so that both ends will be happy!
Thanks!
EDIT:
I tried using a BufferedReader on the Java side and got something like:
‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~‡áø~
DataInputStream.readUTF() reads a string encoded in modified UTF-8, which makes it a bad candidate for reading data that was not written with DataOutputStream.writeUTF().
Also, your C++ program sends a fixed-length string padded with zeros. This can also cause problems when the receiver doesn't expect it.
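A minimal Java-side sketch, assuming the C++ sender keeps writing its fixed 100-byte, zero-padded buffer: read exactly 100 bytes and trim at the first NUL, instead of calling readUTF().
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class FixedStringReader {
    static String readFixedString(DataInputStream dIn) throws IOException {
        byte[] buf = new byte[100]; // must match sizeof(someName) on the C++ side
        dIn.readFully(buf);         // blocks until all 100 bytes have arrived
        int len = 0;
        while (len < buf.length && buf[len] != 0) len++; // find the first NUL padding byte
        return new String(buf, 0, len, StandardCharsets.UTF_8);
    }
}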
C++11 also supports UTF-8 string literals, see here.
Alternatively, write the data into a text file from the C/C++ application and read that text file from the Java application.

Writing strings with chars like "ñ" to a txt file

I'm having a strange issue trying to write strings that contain characters like "ñ", "á" and so on to text files. Let me first show you my little piece of code:
import java.io.*;
public class test {
public static void main(String[] args) throws Exception {
String content = "whatever";
int c;
c = System.in.read();
content = content + (char)c;
FileWriter fw = new FileWriter("filename.txt");
BufferedWriter bw = new BufferedWriter(fw);
bw.write(content);
bw.close();
}
}
In this example, I'm just reading a char from the keyboard input and appending it to a given string, then writing the final string into a .txt file. The problem is that if I type an "ñ", for example (I have a Spanish-layout keyboard), when I check the .txt it shows a strange char "¤" where there should be an "ñ"; that is, the content of the file is "whatever¤". The same happens with "ç", "ú", etc. However, it writes fine ("whateverñ") if I just forget about the keyboard input and write:
...
String content = "whateverñ";
...
or
...
content = content + "ñ";
...
It makes me think that there might be something wrong with the read() method. Or maybe I'm using it wrongly? Or should I use a different method to get the keyboard input? I'm a bit lost here.
(I'm using JDK 7u45 on Windows 7 Pro x64.)
So ...
It works (i.e. you can read the accented characters on the output file) if you write them as literal strings.
It doesn't work when you read them from System.in and then write them.
This suggests that the problem is on the input side. Specifically, I think your console / keyboard must be using a character encoding for the input stream that does not match the encoding that Java thinks should be used.
You should be able to confirm this tentative diagnosis by outputting the characters you are reading in hexadecimal, and then checking the codes against the unicode tables (which you can find at unicode.org for example).
It strikes me as "odd" that the "platform default encoding" appears to be working on the output side, but not the input side. Maybe someone else can explain ... and offer a concrete suggestion for fixing it. My gut feeling is that the problem is in the way your keyboard is configured, not in Java or your application.
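One hedged guess at a concrete fix: on a Western-European Windows console the input code page is often CP850, where "ñ" is byte 0xA4, and that same byte rendered as Cp1252 is "¤", which matches the symptom above. A minimal sketch, assuming CP850 (verify the real code page with chcp):
import java.io.*;

public class TestWithCharsets {
    public static void main(String[] args) throws Exception {
        // Decode keyboard input with the console code page instead of the
        // platform default ("Cp850" is an assumption; check with chcp).
        Reader keyboard = new InputStreamReader(System.in, "Cp850");
        String content = "whatever" + (char) keyboard.read();

        // Write the file with an explicit encoding as well, so the result
        // does not depend on the platform default.
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream("filename.txt"), "UTF-8")) {
            out.write(content);
        }
    }
}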
Files do not remember their encoding: when you look at a .txt file, the text editor makes a "best guess" at the encoding used.
If you read the file back into your program, the text should be back to normal.
Also, try printing the "strange" character directly.

Creating Java DataInputStream data in Python

I have a Java program that reads stored object data with a DataInputStream.
Example:
DataInputStream tInput = new DataInputStream(getClass().getResourceAsStream(aDirectory + "/ResultItemInfo.dat"));
this._text = tInput.readUTF();
this._image = tInput.readUTF();
this._audio = tInput.readUTF();
this._random = false;
if (tInput.read() == 1) {
this._random = true;
}
this._hasMenu = false;
if (tInput.read() == 1) {
this._hasMenu = true;
}
Nice, isn't it?
There is an existing dataset, and now I have to add some records. If the tool I am required to make were written in Java too, this would be pretty easy. Unfortunately, it has to be written in Python, so I have to find a way to create, from Python, files that the Java application can read.
Is there any easy way to do this?
As a last resort, I could:
Modify the Java app and use XML. This would break compatibility with all existing data, so I really don't want to do it.
Use Jython. Possible solution, but I want pure C-Python.
Reverse-Engineer the data format. Not a good solution either.
For a string to be readUTF()-able, it must start with a two-byte big-endian length prefix followed by exactly that many bytes of (modified) UTF-8 encoded data.
So I suggest this piece of code should write a Unicode string the way readUTF() can read it:
encoded = data.encode('UTF-8')
count = len(encoded)
msb, lsb = divmod(count, 256)  # split the byte count into two big-endian bytes
outfile.write(chr(msb))
outfile.write(chr(lsb))
outfile.write(encoded)
The outfile must be opened in binary mode (e.g. "wb"). Note this is Python 2 style; in Python 3, chr() produces str, so write bytes([msb, lsb]) instead of the two chr() calls.
This follows the description of the Java DataInput interface. I did not try to run this, though.
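To verify the format, the Java side can read the Python-written file straight back with readUTF(). A minimal check (the file name here is a placeholder):
import java.io.DataInputStream;
import java.io.FileInputStream;

public class ReadBackCheck {
    public static void main(String[] args) throws Exception {
        // readUTF() consumes the two-byte length prefix, then that many bytes
        try (DataInputStream in = new DataInputStream(new FileInputStream("out.dat"))) {
            System.out.println(in.readUTF());
        }
    }
}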
