Story
While conducting an experiment I was saving a stream of random Bytes generated by a hardware RNG device. After the experiment was finished, I realized that the saving method was incorrect. I hope I can find the way how to fix the corrupted file so that I obtain the correct stream of random numbers back.
Example
The story of the problem can be explained in the following simple example.
Let's say I have a stream of random numbers in an input file randomInput.bin. I will simulate the stream of random numbers coming from the hardware RNG device by sending the input file to stdout via cat. I found two ways how to save this stream to a file:
A) Harmless saving method
This method gives me exactly the original stream of random Bytes.
import scala.sys.process._
import java.io.File
val res = ("cat randomInput.bin" #> new File(outputFile))!
B) Saving method leading to corruption
Unfortunately, this is the original saving method I chose.
import scala.sys.process._
import java.io.PrintWriter
val randomBits = "cat randomInput.bin".!!
val out = new PrintWriter(outputFile)
out.println(randomBits)
if (out != null) {
out.close()
Seq("chmod", "600", outputFile).!
}
The file saved using method B) is still binary, however, is is approximately 2x larger that the file saved by method A). Further analysis shows that the stream of random Bits is significantly less random.
Summary
I suspect that the saving method B) adds something to almost every byte, however, the understanding of this is behind my expertise in Java/Scala I/O.
I would very much appreciate if somebody explained me the low-level difference between methods A) and B). The goal is to revert the changes created by saving method B) and obtain the original stream of random Bytes.
Thank you very much in advance!
The problem is probably that println is meant for text, and this text is being encoded as Unicode, which uses multiple bytes for some or all characters, depending on which version of Unicode.
If the file is exactly 2x larger than it should be, then you've probably got a null byte every other byte, which could be easy to fix. Otherwise, it may be harder to figure out what you would need to do to recover the binary data. Viewing the corrupted file in a hex editor may help you see what happened. Either way, I think it may be easier to just generate new random data and save it correctly.
Especially if this is for an experiment, if your random data has been corrupted and then fixed, it may be harder to justify that the data is truly random compared to just generating it properly in the first place.
I am trying to program an auralization via Ray-Tracing in processing. To edit a sample over the information from the Ray Tracer, i need to convert a .wav File (File-Format: PCM-signed,16bit,stereo,2 bytes/frame, little endian) to an Float Array.
I converted the audio via an audioInputStream and a DataInputStream, where I am loading the audio into an byte Array.
Then I convert the byte Array to a float array like this.
byte[] samples;
float[] audio_data = float(samples);
When I convert the float Array back to a .wav File, I'm getting the sound of the original Audio-File.
But when I'm adding another Float Array to the Original signal and convert it back to a. wav file via the method above(even if I'm adding the same signal), i get a white noise signal instead of the wanted signal (I can hear the original signal under the white noise modulated, but very very silent).
I read about this problem before, that there can be problems by the conversion from the float array to a byte array. That's because float is a 32bit datatype and byte (in java) is only 16 bits and somehow the bytes get mixed together wrong so the white noise is the result. In Processing there is a data type with signed 16bit integers (named: "short") but i can't modify the amplitude anymore, because therefore i need float values, which i can't convert to short.
I also tried to handle the overflow (amplitude) in the float array by modulating the signal from 16 bit values (-32768/32767) to values from -1/1 and back again after mixing (adding) the signals. The result gave me white noise. When i added more than 2 signals it gaves me nothing (nothing to hear).
The concrete Problem I want to solve is to add many signals (more than 1000 with a decent delay to create a kind of reverbation) in the form of float Arrays. Then I want to combine them to one Float Array that i want to save as an audio file without white noise.
I hope you guys can help me.
If you have true PCM data points, there should be no problem using simple addition. The only issue is that on rare occasions (assuming your audio is not too hot to begin with) the values will go out of range. This will tend create a harsh distortion, not white noise. The fact that you are getting white noise suggests to me that maybe you are not converting your PCM sums back to bytes correctly for the format that you are outputting.
Here is some code I use in AudioCue to convert PCM back to bytes. The format is assumed to be 16-bit, 44100 fps, stereo, little-endian. I'm working with PCM as normalized floats. This algorithm does the conversion for a buffer's worth of data at a time.
for (int i = 0, n = buffer.length; i < n; i++)
{
buffer[i] *= 32767;
audioBytes[i*2] = (byte) buffer[i];
audioBytes[i*2 + 1] = (byte)((int)buffer[i] >> 8 );
}
Sometimes, a function like Math.min(Math.max(audioval, -1), 1) or Math.min(Math.max(audioval, -32767), 32767) is used to keep the values in range. More sophisticated limiters or compressor algorithms will scale the volume to fit. But still, if this is not handled, the result should be distortion, not white noise.
If the error is happening at another stage, we will need to see more of your code.
All this said, I wish you luck with the 1000-point echo array reverb. I hadn't heard of this approach working. Maybe there are processors that can handle the computational load now? (Are you trying to do this in real time?) My only success with coding real-time reverberation has been to use the Schroeder method, plugging the structure and values from the CCMRA Freeberb, working off of code from Craig Lindley's now ancient (copyright 2001) book "Digital Audio with Java". Most of that book deals with obsolete GUI code (pre-Swing!), but the code he gives for AllPass and Comb filters is still valid.
I recall when I was working on this that I tracked down references a better reverb to try and code, but I would have to do some real digging to try and find my notes. I was feeling over my head at the time, as the algorithm was presented via block diagrams not coding details or even pseudo-code. Would like to work on this again though and get a better reverb than the Shroeder-type to work. The Schoeder was passable for sounds that were not too percussive.
Getting a solution for real-time ray tracing would be a valuable accomplishment. Many applications in AR/VR and games.
I have created the below code and executed:
Ocr.setUp();
Ocr ocr = new Ocr();
ocr.startEngine("eng", Ocr.SPEED_FASTEST);
String s = ocr.recognize(theImage, Ocr.RECOGNIZE_TYPE_ALL, Ocr.OUTPUT_FORMAT_PLAINTEXT);
ocr.stopEngine();
Output:
Result: L‘i
L‘L’Ui l
Actually the image only contains the numeric values. Is it possible to extract only numeric value by using above code?
I have 1234 numeric value displayed in jpg file, I just want to print that numeric value in o/p console. Can anybody help me out?
I have a few technology-independent observations about your code.
"SPEED_FASTEST" indicates your preference for fast OCR. Fast is the opposite of High-quality. You either get speed or quality. If the image is clear - no problem, but if the image is less than perfect, Quality mode will have more algorithms to deal with defects.
Nowhere in your code it is specified that you constrain the character set to digits only. If you do not indicate the language or character set, then usually by default the entire English character set is used. See my response on this post here: OCR why not find only character
Typically if you post a sample image along with your question and code, contributors can give you better advice.
As tilte,
I tried to capture the file size info which is saved in 4 bytes of data from bitmap's file code,
but if I use byte[] to save it and any byte of file exceeds 127, it'd be misjudged as negative values,
in this case how could we correct these kind of values?
My book just simply plus 256 to it, but could it fixedly be -128?
if not, then the result still isn't correct.
I know we can just use int[] or larger array for it, just wanna know how to deal with these kind of problems!
Thanks a lot for helping!!!
According to here, the C compiler will pad out values when writing a structure to a binary file. As the example in the link says, when writing a struct like this:
struct {
char c;
int i;
} a;
to a binary file, the compiler will usually leave an unnamed, unused hole between the char and int fields, to ensure that the int field is properly aligned.
How could I to create an exact replica of the binary output file (generated in C), using a different language (in my case, Java)?
Is there an automatic way to apply C padding in Java output? Or do I have to go through compiler documentation to see how it works (the compiler is g++ by the way).
Don't do this, it is brittle and will lead to alignment and endianness bugs.
For external data it is much better to explicitly define the format in terms of bytes and write explicit functions to convert between internal and external format, using shift and masks (not union!).
This is true not only when writing to files, but also in memory. It is the fact that the struct is padded in memory, that leads to the padding showing up in the file, if the struct is written out byte-by-byte.
It is in general very hard to replicate with certainty the exact padding scheme, although I guess some heuristics would get you quite far. It helps if you have the struct declaration, for analysis.
Typically, fields larger than one char will be aligned so that their starting offset inside the structure is a multiple of their size. This means shorts will generally be on even offsets (divisible by 2, assuming sizeof (short) == 2), while doubles will be on offsets divisible by 8, and so on.
UPDATE: It is for reasons like this (and also reasons having to do with endianness) that it is generally a bad idea to dump whole structs out to files. It's better to do it field-by-field, like so:
put_char(out, a.c);
put_int(out, a.i);
Assuming the put-functions only write the bytes needed for the value, this will emit a padding-less version of the struct to the file, solving the problem. It is also possible to ensure a proper, known, byte-ordering by writing these functions accordingly.
Is there an automatic way to apply C
padding in Java output? Or do I have
to go through compiler documentation
to see how it works (the compiler is
g++ by the way).
Neither. Instead, you explicitly specify a data/communication format and implement that specification, rather than relying on implementation details of the C compiler. You won't even get the same output from different C compilers.
For interoperability, look at the ByteBuffer class.
Essentially, you create a buffer of a certain size, put() variables of different types at different positions, and then call array() at the end to retrieve the "raw" data representation:
ByteBuffer bb = ByteBuffer.allocate(8);
bb.order(ByteOrder.LITTLE_ENDIAN);
bb.put(0, someChar);
bb.put(4, someInteger);
byte[] rawBytes = bb.array();
But it's up to you to work out where to put padding-- i.e. how many bytes to skip between positions.
For reading data written from C, then you generally wrap() a ByteBuffer around some byte array that you've read from a file.
In case it's helpful, I've written more on ByteBuffer.
A handy way of reading/writing C structs in Java is to use the javolution Struct class (see http://www.javolution.org). This won't help you with automatically padding/aligning your data, but it does make working with raw data held in a ByteBuffer much more convenient. If you're not familiar with javolution, it's well worth a look as there's lots of other cool stuff in there too.
This hole is configurable, compiler has switches to align structs by 1/2/4/8 bytes.
So the first question is: Which alignment exactly do you want to simulate?
With Java, the size of data types are defined by the language specification. For example, a byte type is 1 byte, short is 2 bytes, and so on. This is unlike C, where the size of each type is architecture-dependent.
Therefore, it would be important to know how the binary file is formatted in order to be able to read the file into Java.
It may be necessary to take steps in order to be certain that fields are a specific size, to account for differences in the compiler or architecture. The mention of alignment seem to suggest that the output file will depend on the architecture.
you could try preon:
Preon is a java library for building codecs for bitstream-compressed data in a
declarative (annotation based) way. Think JAXB or Hibernate, but then for binary
encoded data.
it can handle Big/Little endian binary data, alignment (padding) and various numeric types along other features. It is a very nice library, I like it very much
my 0.02$
I highly recommend protocol buffers for exactly this problem.
As I understand it, you're saying that you don't control the output of the C program. You have to take it as given.
So do you have to read this file for some specific set of structures, or do you have to solve this in a general case? I mean, is the problem that someone said, "Here's the file created by program X, you have to read it in Java"? Or do they expect your Java program to read the C source code, find the structure definition, and then read it in Java?
If you've got a specific file to read, the problem isn't really very difficult. Either by reviewing the C compiler specifications or by studying example files, figure out where the padding is. Then on the Java side, read the file as a stream of bytes, and build the values you know are coming. Basically I'd write a set of functions to read the required number of bytes from an InputStream and turn them into the appropriate data type. Like:
int readInt(InputStream is,int len)
throws PrematureEndOfDataException
{
int n=0;
while (len-->0)
{
int i=is.read();
if (i==-1)
throw new PrematureEndOfDataException();
byte b=(byte) i;
n=(n<<8)+b;
}
return n;
}
You can alter the packing on the c side to ensure that no padding is used, or alternatively you can look at the resultant file format in a hex editor to allow you to write a parser in Java that ignores bytes that are padding.