I'm developing an application that verifies the signatures of PDF files. The application should detect the full history of updates applied to the file content before each signature was added.
For example:
Signer 1 signed the plain PDF file.
Signer 2 added a comment to the signed file, then signed it.
How can the application detect that Signer 2 added a comment before applying his signature?
I have tried to use iText and PDFBox.
As already explained in a comment, neither iText nor PDFBox brings along a high-level API telling you what changed in an incremental update in terms of UI objects (comments, text content, ...).
You can use them to render the different revisions of the PDF as bitmaps and compare those images.
Or you can use them to tell you the changes in terms of low-level COS objects (dictionaries, arrays, numbers, strings, ...).
But analyzing the changes in those images or low-level objects and determining their meaning in terms of UI objects (e.g. that a comment, and only a comment, has been added) is highly non-trivial.
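For the first, image-based approach, a minimal sketch using the PDFBox 2.x rendering API might look like the following; it assumes the two revisions have already been written to separate files, and the file names revision1.pdf/revision2.pdf as well as the 150 DPI are placeholders of mine:
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class RenderCompare {
    public static void main(String[] args) throws IOException {
        try (PDDocument doc1 = PDDocument.load(new File("revision1.pdf"));
             PDDocument doc2 = PDDocument.load(new File("revision2.pdf"))) {
            PDFRenderer renderer1 = new PDFRenderer(doc1);
            PDFRenderer renderer2 = new PDFRenderer(doc2);
            int pages = Math.min(doc1.getNumberOfPages(), doc2.getNumberOfPages());
            for (int i = 0; i < pages; i++) {
                BufferedImage img1 = renderer1.renderImageWithDPI(i, 150);
                BufferedImage img2 = renderer2.renderImageWithDPI(i, 150);
                System.out.printf("Page %d differs: %b%n", i + 1, differs(img1, img2));
            }
        }
    }

    // naive pixel-by-pixel comparison of two rendered pages
    static boolean differs(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight())
            return true;
        for (int y = 0; y < a.getHeight(); y++)
            for (int x = 0; x < a.getWidth(); x++)
                if (a.getRGB(x, y) != b.getRGB(x, y))
                    return true;
        return false;
    }
}
This only tells you that something visible changed on a page, not what changed, which is exactly the limitation described above.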
In response you asked
Can you explain more how I can detect changes in low-level COS objects?
What to Compare And What Changes to Consider
First of all you have to be clear about what document states you can compare to detect changes.
The PDF format allows appending changes to a PDF in so-called incremental updates. This allows changes to signed documents without cryptographically breaking their signatures, as the originally signed bytes are left as is:
There can be more incremental updates in-between, though, which are not signed; e.g. the "Changes for version 2" might include multiple incremental updates.
One might consider comparing the revisions created by arbitrary incremental updates. The problem here, though, is that you cannot identify the person who applied an incremental update that is not signed.
Thus, it usually makes more sense to compare the signed revisions only and to hold each signer responsible for all changes since the previous signed revision. The only exception is the whole file, which, as the current version of the PDF, is of special interest even if there is no signature covering all of it.
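As an aside, if you want to slice the signed revisions out of the file yourself (the iText tool further down simply uses SignatureUtil.extractRevision for this), a sketch with the PDFBox 2.x API could look like the following; the file names are placeholders of mine:
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.digitalsignature.PDSignature;

public class ExtractSignedRevisions {
    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(new File("signed.pdf").toPath());
        try (PDDocument document = PDDocument.load(bytes)) {
            List<PDSignature> signatures = document.getSignatureDictionaries();
            int index = 0;
            for (PDSignature signature : signatures) {
                int[] byteRange = signature.getByteRange();
                // a signed revision ends right after the last byte covered by the signature
                int end = byteRange[2] + byteRange[3];
                try (FileOutputStream result = new FileOutputStream("revision" + (++index) + ".pdf")) {
                    result.write(bytes, 0, end);
                }
            }
        }
    }
}
Consecutive revision files written like this can then be compared pair-wise.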
Next you have to decide what you consider a change. In particular:
Is every object override in an incremental update a change? Even those that override the original object with an identical copy?
What about changes that make a direct object indirect (or vice versa) but keep all contents and references intact?
What about addition of new objects that are not referred to from anywhere in the standard structure?
What about addition of objects that are not referenced from the cross reference streams or tables?
What about addition of data that's not following PDF syntax at all?
If you are indeed interested in such changes, too, existing PDF libraries usually don't provide the means to determine them out of the box; you will most likely at least have to change their code for traversing the chain of cross reference tables/streams, or even analyze the bytes of the update directly.
If you are not interested in such changes, though, there usually is no need to change or replace library routines.
As the changes enumerated above and similar ones make no difference when the PDF is processed by specification-conforming PDF processors, one can usually ignore them.
If this is your position, too, the following example tool might give you a starting point.
An Example Tool Based on iText 7
With the limitations explained above you can compare signed revisions of a PDF using iText 7 without changes to the library by loading the revisions to compare into separate PdfDocument instances and recursively comparing the PDF objects starting with the trailer.
I once implemented this as a small helper tool for personal use (so it is not completely finished yet; it's more of a work in progress). First there is the base class that allows comparing two arbitrary documents:
public class PdfCompare {
public static void main(String[] args) throws IOException {
System.out.printf("Comparing:\n* %s\n* %s\n", args[0], args[1]);
try ( PdfDocument pdfDocument1 = new PdfDocument(new PdfReader(args[0]));
PdfDocument pdfDocument2 = new PdfDocument(new PdfReader(args[1])) ) {
PdfCompare pdfCompare = new PdfCompare(pdfDocument1, pdfDocument2);
pdfCompare.compare();
List<Difference> differences = pdfCompare.getDifferences();
if (differences == null || differences.isEmpty()) {
System.out.println("No differences found.");
} else {
System.out.printf("%d differences found:\n", differences.size());
for (Difference difference : pdfCompare.getDifferences()) {
for (String element : difference.getPath()) {
System.out.print(element);
}
System.out.printf(" - %s\n", difference.getDescription());
}
}
}
}
public interface Difference {
List<String> getPath();
String getDescription();
}
public PdfCompare(PdfDocument pdfDocument1, PdfDocument pdfDocument2) {
trailer1 = pdfDocument1.getTrailer();
trailer2 = pdfDocument2.getTrailer();
}
public void compare() {
LOGGER.info("Starting comparison");
try {
compared.clear();
differences.clear();
LOGGER.info("START COMPARE");
compare(trailer1, trailer2, Collections.singletonList("trailer"));
LOGGER.info("START SHORTEN PATHS");
shortenPaths();
} finally {
LOGGER.info("Finished comparison and shortening");
}
}
public List<Difference> getDifferences() {
return differences;
}
class DifferenceImplSimple implements Difference {
DifferenceImplSimple(PdfObject object1, PdfObject object2, List<String> path, String description) {
this.pair = Pair.of(object1, object2);
this.path = path;
this.description = description;
}
@Override
public List<String> getPath() {
List<String> byPair = getShortestPath(pair);
return byPair != null ? byPair : shorten(path);
}
@Override public String getDescription() { return description; }
final Pair<PdfObject, PdfObject> pair;
final List<String> path;
final String description;
}
void compare(PdfObject object1, PdfObject object2, List<String> path) {
LOGGER.debug("Comparing objects at {}.", path);
if (object1 == null && object2 == null)
{
LOGGER.debug("Both objects are null at {}.", path);
return;
}
if (object1 == null) {
differences.add(new DifferenceImplSimple(object1, object2, path, "Missing in document 1"));
LOGGER.info("Object in document 1 is missing at {}.", path);
return;
}
if (object2 == null) {
differences.add(new DifferenceImplSimple(object1, object2, path, "Missing in document 2"));
LOGGER.info("Object in document 2 is missing at {}.", path);
return;
}
if (object1.getType() != object2.getType()) {
differences.add(new DifferenceImplSimple(object1, object2, path,
String.format("Type difference, %s in document 1 and %s in document 2",
getTypeName(object1.getType()), getTypeName(object2.getType()))));
LOGGER.info("Objects have different types at {}, {} and {}.", path, getTypeName(object1.getType()), getTypeName(object2.getType()));
return;
}
switch (object1.getType()) {
case PdfObject.ARRAY:
compareContents((PdfArray) object1, (PdfArray) object2, path);
break;
case PdfObject.DICTIONARY:
compareContents((PdfDictionary) object1, (PdfDictionary) object2, path);
break;
case PdfObject.STREAM:
compareContents((PdfStream)object1, (PdfStream)object2, path);
break;
case PdfObject.BOOLEAN:
case PdfObject.INDIRECT_REFERENCE:
case PdfObject.LITERAL:
case PdfObject.NAME:
case PdfObject.NULL:
case PdfObject.NUMBER:
case PdfObject.STRING:
compareContentsSimple(object1, object2, path);
break;
default:
differences.add(new DifferenceImplSimple(object1, object2, path, "Unknown object type " + object1.getType() + "; cannot compare"));
LOGGER.warn("Unknown object type at {}, {}.", path, object1.getType());
break;
}
}
void compareContents(PdfArray array1, PdfArray array2, List<String> path) {
int count1 = array1.size();
int count2 = array2.size();
if (count1 < count2) {
differences.add(new DifferenceImplSimple(array1, array2, path, "Document 1 misses " + (count2-count1) + " array entries"));
LOGGER.info("Array in document 1 is missing {} entries at {} for {}.", (count2-count1), path);
}
if (count1 > count2) {
differences.add(new DifferenceImplSimple(array1, array2, path, "Document 2 misses " + (count1-count2) + " array entries"));
LOGGER.info("Array in document 2 is missing {} entries at {} for {}.", (count1-count2), path);
}
if (alreadyCompared(array1, array2, path)) {
return;
}
int count = Math.min(count1, count2);
for (int i = 0; i < count; i++) {
compare(array1.get(i), array2.get(i), join(path, String.format("[%d]", i)));
}
}
void compareContents(PdfDictionary dictionary1, PdfDictionary dictionary2, List<String> path) {
List<PdfName> missing1 = new ArrayList<PdfName>(dictionary2.keySet());
missing1.removeAll(dictionary1.keySet());
if (!missing1.isEmpty()) {
differences.add(new DifferenceImplSimple(dictionary1, dictionary2, path, "Document 1 misses dictionary entries for " + missing1));
LOGGER.info("Dictionary in document 1 is missing entries at {} for {}.", path, missing1);
}
List<PdfName> missing2 = new ArrayList<PdfName>(dictionary1.keySet());
missing2.removeAll(dictionary2.keySet());
if (!missing2.isEmpty()) {
differences.add(new DifferenceImplSimple(dictionary1, dictionary2, path, "Document 2 misses dictionary entries for " + missing2));
LOGGER.info("Dictionary in document 2 is missing entries at {} for {}.", path, missing2);
}
if (alreadyCompared(dictionary1, dictionary2, path)) {
return;
}
List<PdfName> common = new ArrayList<PdfName>(dictionary1.keySet());
common.retainAll(dictionary2.keySet());
for (PdfName name : common) {
compare(dictionary1.get(name), dictionary2.get(name), join(path, name.toString()));
}
}
void compareContents(PdfStream stream1, PdfStream stream2, List<String> path) {
compareContents((PdfDictionary)stream1, (PdfDictionary)stream2, path);
byte[] bytes1 = stream1.getBytes();
byte[] bytes2 = stream2.getBytes();
if (!Arrays.equals(bytes1, bytes2)) {
differences.add(new DifferenceImplSimple(stream1, stream2, path, "Stream contents differ"));
LOGGER.info("Stream contents differ at {}.", path);
}
}
void compareContentsSimple(PdfObject object1, PdfObject object2, List<String> path) {
// vvv--- work-around for DEVSIX-4931, likely to be fixed in 7.1.15
if (object1 instanceof PdfNumber)
((PdfNumber)object1).getValue();
if (object2 instanceof PdfNumber)
((PdfNumber)object2).getValue();
// ^^^--- work-around for DEVSIX-4931, likely to be fixed in 7.1.15
if (!object1.equals(object2)) {
if (object1 instanceof PdfString) {
String string1 = object1.toString();
if (string1.length() > 40)
string1 = string1.substring(0, 40) + '\u22EF';
string1 = sanitize(string1);
String string2 = object2.toString();
if (string2.length() > 40)
string2 = string2.substring(0, 40) + '\u22EF';
string2 = sanitize(string2);
differences.add(new DifferenceImplSimple(object1, object2, path, String.format("String values differ, '%s' and '%s'", string1, string2)));
LOGGER.info("String values differ at {}, '{}' and '{}'.", path, string1, string2);
} else {
differences.add(new DifferenceImplSimple(object1, object2, path, String.format("Object values differ, '%s' and '%s'", object1, object2)));
LOGGER.info("Object values differ at {}, '{}' and '{}'.", path, object1, object2);
}
}
}
String sanitize(CharSequence string) {
char[] sanitized = new char[string.length()];
for (int i = 0; i < sanitized.length; i++) {
char c = string.charAt(i);
if (c >= 0 && c < ' ')
c = '\uFFFD';
sanitized[i] = c;
}
return new String(sanitized);
}
String getTypeName(byte type) {
switch (type) {
case PdfObject.ARRAY: return "ARRAY";
case PdfObject.BOOLEAN: return "BOOLEAN";
case PdfObject.DICTIONARY: return "DICTIONARY";
case PdfObject.LITERAL: return "LITERAL";
case PdfObject.INDIRECT_REFERENCE: return "REFERENCE";
case PdfObject.NAME: return "NAME";
case PdfObject.NULL: return "NULL";
case PdfObject.NUMBER: return "NUMBER";
case PdfObject.STREAM: return "STREAM";
case PdfObject.STRING: return "STRING";
default:
return "UNKNOWN";
}
}
List<String> join(List<String> path, String element) {
String[] array = path.toArray(new String[path.size() + 1]);
array[array.length-1] = element;
return Arrays.asList(array);
}
boolean alreadyCompared(PdfObject object1, PdfObject object2, List<String> path) {
Pair<PdfObject, PdfObject> pair = Pair.of(object1, object2);
if (compared.containsKey(pair)) {
//LOGGER.debug("Objects already compared at {}, previously at {}.", path, compared.get(pair));
Set<List<String>> paths = compared.get(pair);
boolean alreadyPresent = false;
// List<List<String>> toRemove = new ArrayList<>();
// for (List<String> formerPath : paths) {
// for (int i = 0; ; i++) {
// if (i == path.size()) {
// toRemove.add(formerPath);
// System.out.print('.');
// break;
// }
// if (i == formerPath.size()) {
// alreadyPresent = true;
// System.out.print(':');
// break;
// }
// if (!path.get(i).equals(formerPath.get(i)))
// break;
// }
// }
// paths.removeAll(toRemove);
if (!alreadyPresent)
paths.add(path);
return true;
}
compared.put(pair, new HashSet<>(Collections.singleton(path)));
return false;
}
List<String> getShortestPath(Pair<PdfObject, PdfObject> pair) {
Set<List<String>> paths = compared.get(pair);
//return (paths == null) ? null : Collections.min(paths, pathComparator);
return (paths == null || paths.isEmpty()) ? null : shortened.get(paths.stream().findFirst().get());
}
void shortenPaths() {
List<Map<List<String>, SortedSet<List<String>>>> data = new ArrayList<>();
for (Set<List<String>> set : compared.values()) {
SortedSet<List<String>> sortedSet = new TreeSet<List<String>>(pathComparator);
sortedSet.addAll(set);
for (List<String> path : sortedSet) {
while (path.size() >= data.size()) {
data.add(new HashMap<>());
}
SortedSet<List<String>> former = data.get(path.size()).put(path, sortedSet);
if (former != null) {
LOGGER.error("Path not well-defined for {}", path);
}
}
}
for (int pathSize = 3; pathSize < data.size(); pathSize++) {
for (Map.Entry<List<String>, SortedSet<List<String>>> pathEntry : data.get(pathSize).entrySet()) {
List<String> path = pathEntry.getKey();
SortedSet<List<String>> equivalents = pathEntry.getValue();
for (int subpathSize = 2; subpathSize < pathSize; subpathSize++) {
List<String> subpath = path.subList(0, subpathSize);
List<String> remainder = path.subList(subpathSize, pathSize);
SortedSet<List<String>> subequivalents = data.get(subpathSize).get(subpath);
if (subequivalents != null && subequivalents.size() > 1) {
List<String> subequivalent = subequivalents.first();
if (subequivalent.size() < subpathSize) {
List<String> replacement = join(subequivalent, remainder);
if (equivalents.add(replacement)) {
data.get(replacement.size()).put(replacement, equivalents);
}
}
}
}
}
}
shortened.clear();
for (Map<List<String>, SortedSet<List<String>>> singleLengthData : data) {
for (Map.Entry<List<String>, SortedSet<List<String>>> entry : singleLengthData.entrySet()) {
List<String> path = entry.getKey();
List<String> shortenedPath = entry.getValue().first();
shortened.put(path, shortenedPath);
}
}
}
List<String> join(List<String> path, List<String> elements) {
String[] array = path.toArray(new String[path.size() + elements.size()]);
for (int i = 0; i < elements.size(); i++) {
array[path.size() + i] = elements.get(i);
}
return Arrays.asList(array);
}
List<String> shorten(List<String> path) {
List<String> shortPath = path;
for (int subpathSize = path.size(); subpathSize > 2; subpathSize--) {
List<String> subpath = path.subList(0, subpathSize);
List<String> shortSubpath = shortened.get(subpath);
if (shortSubpath != null && shortSubpath.size() < subpathSize) {
List<String> remainder = path.subList(subpathSize, path.size());
List<String> replacement = join(shortSubpath, remainder);
if (replacement.size() < shortPath.size())
shortPath = replacement;
}
}
return shortPath;
}
final static Logger LOGGER = LoggerFactory.getLogger(PdfCompare.class);
final PdfDictionary trailer1;
final PdfDictionary trailer2;
final Map<Pair<PdfObject, PdfObject>, Set<List<String>>> compared = new HashMap<>();
final List<Difference> differences = new ArrayList<>();
final Map<List<String>, List<String>> shortened = new HashMap<>();
final static Comparator<List<String>> pathComparator = new Comparator<List<String>>() {
@Override
public int compare(List<String> o1, List<String> o2) {
int compare = Integer.compare(o1.size(), o2.size());
if (compare != 0)
return compare;
for (int i = 0; i < o1.size(); i++) {
compare = o1.get(i).compareTo(o2.get(i));
if (compare != 0)
return compare;
}
return 0;
}
};
}
(PdfCompare.java)
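A note on dependencies, since the listing above is not fully self-contained: any immutable pair type with value-based equals/hashCode works for Pair, and the logging goes through SLF4J. The imports below are an assumption of mine (Apache Commons Lang 3 for the pair); adjust them to your setup:
import java.io.IOException;
import java.util.*;

import org.apache.commons.lang3.tuple.Pair; // assumption: any pair type with value equality works
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.itextpdf.kernel.pdf.*;           // PdfDocument, PdfReader, PdfObject, PdfDictionary, ...
The subclass below additionally needs com.itextpdf.signatures.SignatureUtil.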
The tool to use this code for revision comparison is a subclass thereof:
public class PdfRevisionCompare extends PdfCompare {
public static void main(String[] args) throws IOException {
for (String arg : args) {
System.out.printf("\nComparing revisions of: %s\n***********************\n", args[0]);
try (PdfDocument pdfDocument = new PdfDocument(new PdfReader(arg))) {
SignatureUtil signatureUtil = new SignatureUtil(pdfDocument);
List<String> signatureNames = signatureUtil.getSignatureNames();
if (signatureNames.isEmpty()) {
System.out.println("No signed revisions detected. (no AcroForm)");
continue;
}
String previousRevision = signatureNames.get(0);
PdfDocument previousDocument = new PdfDocument(new PdfReader(signatureUtil.extractRevision(previousRevision)));
System.out.printf("* Initial signed revision: %s\n", previousRevision);
for (int i = 1; i < signatureNames.size(); i++) {
String currentRevision = signatureNames.get(i);
PdfDocument currentDocument = new PdfDocument(new PdfReader(signatureUtil.extractRevision(currentRevision)));
showDifferences(previousDocument, currentDocument);
System.out.printf("* Next signed revision (%d): %s\n", i+1, currentRevision);
previousDocument.close();
previousDocument = currentDocument;
previousRevision = currentRevision;
}
if (signatureUtil.signatureCoversWholeDocument(previousRevision)) {
System.out.println("No unsigned updates.");
} else {
showDifferences(previousDocument, pdfDocument);
System.out.println("* Final unsigned revision");
}
previousDocument.close();
}
}
}
static void showDifferences(PdfDocument previousDocument, PdfDocument currentDocument) {
PdfRevisionCompare pdfRevisionCompare = new PdfRevisionCompare(previousDocument, currentDocument);
pdfRevisionCompare.compare();
List<Difference> differences = pdfRevisionCompare.getDifferences();
if (differences == null || differences.isEmpty()) {
System.out.println("No differences found.");
} else {
System.out.printf("%d differences found:\n", differences.size());
for (Difference difference : differences) {
for (String element : difference.getPath()) {
System.out.print(element);
}
System.out.printf(" - %s\n", difference.getDescription());
}
}
}
public PdfRevisionCompare(PdfDocument pdfDocument1, PdfDocument pdfDocument2) {
super(pdfDocument1, pdfDocument2);
}
}
(PdfRevisionCompare.java)
I use this solution to call a JNA method from a .dll/.so library:
Creating C++ Structures via JNA
On Windows this code works perfectly, but on Linux I receive truncated double values from the Pointer data.
On Windows I receive this value:
40.7
from these bytes in Pointer:
[64, 68, 89, -103, -103, -103, -103, -102]
But on Linux I get this instead:
40.0
[64, 68, 0, 0, 0, 0, 0, 0]
The .so/.dll was compiled from the same source, and from Python (using "ctypes") I obtain correct values on Linux from the same .so.
I already tried writing/reading byte[] to/from the Pointer; nothing changed: on Windows everything is OK, but on Linux the doubles are truncated.
This is the C++ structure and the method that returns it inside the .dll/.so:
struct emxArray_real_T
{
double *data;
int *size;
int allocatedSize;
int numDimensions;
boolean_T canFreeData;
};
emxArray_real_T *emxCreate_real_T(int rows, int cols)
{
emxArray_real_T *emx;
***
return emx;
}
And this is the code in Java:
@Structure.FieldOrder({"data", "size", "allocatedSize", "numDimensions", "canFreeData"})
public class emxArray_real_T extends Structure {
public Pointer data;
public Pointer size;
public int allocatedSize = 1;
public int numDimensions = 1;
public boolean canFreeData = false;
***
public double[] getData() {
if (data == null) {
return new double[0];
}
return data.getDoubleArray(0, allocatedSize);
}
public void setData(double[] data) {
if (data.length != allocatedSize) {
throw new IllegalArgumentException("Data must have a length of " + allocatedSize + " but was "
+ data.length);
}
this.data.write(0, data, 0, data.length);
}
***
}
UPD: Obtaining bytes from the Pointer data:
public byte[] getDataByte() {
final int times = Double.SIZE / Byte.SIZE;
if (data == null) {
return new byte[0];
}
return data.getByteArray(0, allocatedSize * times);
}
UPD2: I receive this error when I try to write a double like -0.0000847 into the Pointer data on Linux (so it seems that on Linux the .so somehow uses the C type 'int' for the double representation):
Expected a value representable in the C type 'int'. Found inf instead. Error in ballshaftpasses (line 61)
UPD3: double[] <-> byte[] conversions (trying to solve the Linux issue):
Changes in getData() and setData():
public double[] getData() {
final int times = Double.SIZE / Byte.SIZE;
if (data == null) {
return new double[0];
}
return ByteDoubleConverterUtils.toDoubleArray(data.getByteArray(0, allocatedSize * times));
}
public void setData(double[] data) {
final byte[] bytes = ByteDoubleConverterUtils.toByteArray(data);
this.data.write(0, bytes, 0, bytes.length);
}
Utility class:
public class ByteDoubleConverterUtils {
public static byte[] toByteArray(double[] doubleArray){
int times = Double.SIZE / Byte.SIZE;
byte[] bytes = new byte[doubleArray.length * times];
for(int i=0;i<doubleArray.length;i++){
final ByteBuffer byteBuffer = ByteBuffer.wrap(bytes, i * times, times);
byteBuffer.order(ByteOrder.LITTLE_ENDIAN);
byteBuffer.putDouble(doubleArray[i]);
}
return bytes;
}
public static double[] toDoubleArray(byte[] byteArray){
int times = Double.SIZE / Byte.SIZE;
double[] doubles = new double[byteArray.length / times];
for(int i=0;i<doubles.length;i++){
final ByteBuffer byteBuffer = ByteBuffer.wrap(byteArray, i * times, times);
byteBuffer.order(ByteOrder.LITTLE_ENDIAN);
doubles[i] = byteBuffer.getDouble();
}
return doubles;
}
}
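As an aside (not a fix, just a more compact equivalent of the conversion above), a DoubleBuffer view does the same in one step; whether LITTLE_ENDIAN or ByteOrder.nativeOrder() is the right choice is exactly the open question here, so the order below is an assumption:
public static double[] toDoubleArray(byte[] byteArray) {
    double[] doubles = new double[byteArray.length / Double.BYTES];
    ByteBuffer.wrap(byteArray)
              .order(ByteOrder.LITTLE_ENDIAN) // or ByteOrder.nativeOrder()
              .asDoubleBuffer()
              .get(doubles);                  // bulk-convert all doubles at once
    return doubles;
}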
UPD4:
C++ Function Example
This is a C++ function example to reproduce the problem on Linux:
void emx_copy(const emxArray_real_T *in, emxArray_real_T *res)
{
int i0;
int loop_ub;
i0 = res->size[0] * res->size[1];
res->size[0] = in->size[0];
res->size[1] = in->size[1];
emxEnsureCapacity_real_T(res, i0);
loop_ub = in->size[0] * in->size[1];
for (i0 = 0; i0 < loop_ub; i0++) {
res->data[i0] = in->data[i0] * in->data[i0];
}
}
Java JNA Interface:
public interface EmxJna extends Library {
EmxJna INSTANCE = Native.load(Platform.isWindows() ? "emx.dll" : "emx.so", EmxJna.class);
emxArray_real_T emxCreate_real_T(int rows, int cols);
void emx_copy(emxArray_real_T in, emxArray_real_T res);
}
Java test method to call the lib and set the values 40.7 and -0.0000847 in the Pointer data:
private void emxTest() {
final emxArray_real_T input = EmxJna.INSTANCE.emxCreate_real_T(128, 32);
final double[] inputData = input.getData();
inputData[0] = 40.7d;
inputData[1] = -0.0000847d;
input.setData(inputData);
final emxArray_real_T output = EmxJna.INSTANCE.emxCreate_real_T(128, 32);
EmxJna.INSTANCE.emx_copy(input, output);
}
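A purely illustrative debugging sketch, using the helpers already defined above, to compare what actually lands behind the pointer on both platforms: placed inside emxTest() right after input.setData(inputData), it shows whether the bytes already differ on the write side or only after the native call.
// hypothetical debugging aid, to be placed right after input.setData(inputData)
System.out.println("native order: " + java.nio.ByteOrder.nativeOrder());
System.out.println("raw bytes   : " + java.util.Arrays.toString(input.getDataByte()));
System.out.println("read back   : " + java.util.Arrays.toString(input.getData()));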
My current assignment involves extracting all of the objects from a PDF file and then working with the parsed-out objects. But I have noticed an issue where some of the stream objects are skipped over entirely by my code.
I am completely confused and hoping someone can help me see what is going wrong here.
Here is the main parsing code.
void parseRawPDFFile() {
//Transform the bytes obtained from the file into a byte character sequence. This byte character sequence
//object is what allows us to use it in regex.
ByteCharSequence byteCharSequence = new ByteCharSequence(bytesFromFile.toByteArray());
byteCharSequence.getStringFromData();
Pattern pattern = Pattern.compile(SINGLE_OBJECT_REGEX);
Matcher matcher = pattern.matcher(byteCharSequence);
//While we have a match (apparently only one match exists at a time) keep looping over the list.
//When a match is found, get the starting and ending indices and manually cut these out char by char
//and assemble them into a new "ByteArrayOutputStream".
int counterOfDoom = 1;
while (matcher.find() ) {
for (int i = 0; i < matcher.groupCount(); i++) {
ByteArrayOutputStream cutOutArray = cutOutByteArrayOutputStreamFromOriginal(matcher.start(), matcher.end());
System.out.println("----------------------------------------------------");
System.out.println(cutOutArray);
//At this point we have cut out the object and can now send it for processing.
createPDFObject(cutOutArray);
System.out.println(counterOfDoom);
System.out.println("----------------------------------------------------");
counterOfDoom++;
}
}
}
Here is the code for the ByteCharSequence
(Credits for the core of this code here: http://blog.sarah-happy.ca/2013/01/java-regular-expression-on-byte-array.html)
public class ByteCharSequence implements CharSequence {
private final byte[] data;
private final int length;
private final int offset;
public ByteCharSequence(byte[] data) {
this(data, 0, data.length);
}
public ByteCharSequence(byte[] data, int offset, int length) {
this.data = data;
this.offset = offset;
this.length = length;
}
@Override
public int length() {
return this.length;
}
@Override
public char charAt(int index) {
return (char) (data[offset + index] & 0xff);
}
@Override
public CharSequence subSequence(int start, int end) {
return new ByteCharSequence(data, offset + start, end - start);
}
/**
* Get the string from the ByteCharSequence data.
* @return
*/
public String getStringFromData() {
//Load it into the method I know works to convert it to a string... Optimized? Probably not at all.
//But it works...
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
for (byte individualByte : data) {
byteArrayOutputStream.write(individualByte);
}
return byteArrayOutputStream.toString();
}
}
The pdf data that I am processing at present:
10 0 obj
<</Filter/FlateDecode/Length 1040>>stream
(Bunch of bytes)
endstream
endobj
12 0 obj
<</Filter/FlateDecode/Length 2574/N 3>>stream
(Bunch of bytes)
endstream
endobj
Some information that I was trying to look into:
1: From what I understand, there should be no limitation on how much can fit into the data structures, so size shouldn't be an issue.
Add the DOTALL flag to the pattern compile call so that your pattern matches newline characters =)
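For example, reusing your SINGLE_OBJECT_REGEX constant:
Pattern pattern = Pattern.compile(SINGLE_OBJECT_REGEX, Pattern.DOTALL);
With DOTALL the . metacharacter also matches line terminators, so stream content that happens to contain 0x0A or 0x0D bytes no longer cuts the match short, which would explain why some stream objects were skipped without it.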
Hi Team, I am trying to find a String "Henry" in a binary file and change it to a different string. FYI the file is the output of serialising an object. Original Question here
I am new to searching bytes and imagined this code would search for my byte[] and exchange it, but it doesn't come close to working; it doesn't even find a match.
{
byte[] bytesHenry = new String("Henry").getBytes();
byte[] bytesSwap = new String("Zsswd").getBytes();
byte[] seekHenry = new byte[bytesHenry.length];
RandomAccessFile file = new RandomAccessFile(fileString,"rw");
long filePointer;
while (seekHenry != null) {
filePointer = file.getFilePointer();
file.readFully(seekHenry);
if (bytesHenry == seekHenry) {
file.seek(filePointer);
file.write(bytesSwap);
break;
}
}
}
Okay, I see the bytesHenry == seekHenry problem and will swap to Arrays.equals(bytesHenry, seekHenry).
I think I need to move along by -4 byte positions each time I read 5 bytes.
Bingo, it finds it now:
while (seekHenry != null) {
filePointer = file.getFilePointer();
file.readFully(seekHenry);
if (Arrays.equals(bytesHenry,
seekHenry)) {
file.seek(filePointer);
file.write(bytesSwap);
break;
}
file.seek(filePointer);
file.read();
}
The following could work for you; see the method search(byte[] input, byte[] searchedFor), which returns the index where the first match starts, or -1.
public class SearchBuffer {
public static void main(String[] args) throws UnsupportedEncodingException {
String charset= "US-ASCII";
byte[] searchedFor = "ciao".getBytes(charset);
byte[] input = "aaaciaaaciaojjcia".getBytes(charset);
int idx = search(input, searchedFor);
System.out.println("index: "+idx); //should be 8
}
public static int search(byte[] input, byte[] searchedFor) {
//convert byte[] to Byte[]
Byte[] searchedForB = new Byte[searchedFor.length];
for(int x = 0; x<searchedFor.length; x++){
searchedForB[x] = searchedFor[x];
}
int idx = -1;
//search:
Deque<Byte> q = new ArrayDeque<Byte>(searchedForB.length);
for(int i=0; i<input.length; i++){
q.addLast(input[i]);
if(q.size() > searchedForB.length){
q.pop(); //slide the window: drop the oldest byte
}
if(q.size() == searchedForB.length){
//here I can check
Byte[] cur = q.toArray(new Byte[]{});
if(Arrays.equals(cur, searchedForB)){
//found! i is the index of the last matching byte
idx = i - searchedForB.length + 1;
break;
}
}
}
return idx;
}
}
From Fastest way to find a string in a text file with java:
The best realization I've found is in MIMEParser: https://github.com/samskivert/ikvm-openjdk/blob/master/build/linux-amd64/impsrc/com/sun/xml/internal/org/jvnet/mimepull/MIMEParser.java
/**
* Finds the boundary in the given buffer using Boyer-Moore algo.
* Copied from java.util.regex.Pattern.java
*
* @param mybuf boundary to be searched in this mybuf
* @param off start index in mybuf
* @param len number of bytes in mybuf
*
* @return -1 if there is no match or index where the match starts
*/
private int match(byte[] mybuf, int off, int len) {
Needed also:
private void compileBoundaryPattern();
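If you do not want to lift that class, here is a self-contained sketch of the same idea (a Boyer-Moore-Horspool style skip table over raw bytes; this is my own sketch, not the MIMEParser code itself):
import java.util.Arrays;

public class ByteSearch {
    public static void main(String[] args) {
        byte[] haystack = "aaaciaaaciaojjcia".getBytes();
        byte[] needle = "ciao".getBytes();
        System.out.println(indexOf(haystack, needle)); // prints 8
    }

    static int indexOf(byte[] haystack, byte[] needle) {
        if (needle.length == 0) return 0;
        // skip table: how far we may shift when the last byte of the window mismatches
        int[] shift = new int[256];
        Arrays.fill(shift, needle.length);
        for (int i = 0; i < needle.length - 1; i++) {
            shift[needle[i] & 0xff] = needle.length - 1 - i;
        }
        int pos = 0;
        while (pos <= haystack.length - needle.length) {
            int j = needle.length - 1;
            while (j >= 0 && haystack[pos + j] == needle[j]) {
                j--;
            }
            if (j < 0) {
                return pos; // full match starting at pos
            }
            // shift based on the last byte of the current window
            pos += shift[haystack[pos + needle.length - 1] & 0xff];
        }
        return -1;
    }
}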
So I have two AtomicBooleans and I need to check both of them, something like this:
if (atomicBoolean1.get() == true && atomicBoolean2.get() == false) {
// ...
}
But there is a race condition in between :(
Is there a way to combine two atomic boolean checks into a single one without using synchronization (i.e. synchronized blocks)?
Well, I can think of a couple of ways, but it depends on the functionality you need.
One way is to "cheat" and use an AtomicMarkableReference<Boolean>:
final AtomicMarkableReference<Boolean> twoBooleans = (
new AtomicMarkableReference<Boolean>(true, false)
);
void somewhere() {
boolean b0;                    // the "reference" value
boolean[] b1 = new boolean[1]; // receives the mark
b0 = twoBooleans.get(b1);      // atomically reads reference and mark together
b0 = false;
b1[0] = true;
twoBooleans.set(b0, b1[0]);    // atomically writes reference and mark together
}
But that's kind of a pain and only gets you two values.
So then you can use AtomicInteger with bit flags:
static final int FLAG0 = 1;
static final int FLAG1 = 1 << 1;
final AtomicInteger intFlags = new AtomicInteger(FLAG0);
void somewhere() {
int flags = intFlags.get();
int both = FLAG0 | FLAG1;
if((flags & both) == FLAG0) { // if FLAG0 has a 1 and FLAG1 has a 0
something();
}
flags &= ~FLAG0; // set FLAG0 to 0 (false)
flags |= FLAG1; // set FLAG1 to 1 (true)
intFlags.set(flags);
}
Also kind of a pain but it gets you 32 values. You could probably create a wrapper class around this if you really wanted. For example:
public class AtomicBooleanArray {
private final AtomicInteger intFlags = new AtomicInteger();
public void get(boolean[] arr) {
int flags = intFlags.get();
int f = 1;
for(int i = 0; i < 32; i++) {
arr[i] = (flags & f) != 0;
f <<= 1;
}
}
public void set(boolean[] arr) {
int flags = 0;
int f = 1;
for(int i = 0; i < 32; i++) {
if(arr[i]) {
flags |= f;
}
f <<= 1;
}
intFlags.set(flags);
}
public boolean get(int index) {
return (intFlags.get() & (1 << index)) != 0;
}
public void set(int index, boolean b) {
int f = 1 << index;
int current, updated;
do {
current = intFlags.get();
updated = b ? (current | f) : (current & ~f);
} while(!intFlags.compareAndSet(current, updated));
}
}
That's pretty good. A set may happen while the array is being filled in get, but the point is that you can get or set all 32 atomically. (The compare-and-set do-while loop is majorly ugly, but it's how the atomic classes themselves work for things like getAndAdd.)
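For illustration, hypothetical usage of that wrapper for the original two-flag check might look like this (indices 0 and 1 standing in for the two booleans):
AtomicBooleanArray flags = new AtomicBooleanArray();
flags.set(0, true);   // plays the role of atomicBoolean1
flags.set(1, false);  // plays the role of atomicBoolean2

boolean[] snapshot = new boolean[32];
flags.get(snapshot);  // one volatile read inside, so indices 0 and 1 come from the same moment
if (snapshot[0] && !snapshot[1]) {
    // ...
}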
AtomicReference seems impractical here. It allows atomic gets and sets but once you have your hands on the internal object you are no longer updating atomically. You'd have to create a brand new object each time.
final AtomicReference<boolean[]> booleanRefs = (
new AtomicReference<boolean[]>(new boolean[] { true, true })
);
void somewhere() {
boolean[] refs = booleanRefs.get();
refs[0] = false; // not atomic!!
boolean[] copy = booleanRefs.get().clone(); // pretty safe
copy[0] = false;
booleanRefs.set(copy);
}
If you want to perform an interim operation on the data atomically (get -> change -> set, without interference) you have to use a lock or synchronization. Personally I would use a lock or synchronization since it's usually the case that the entire update is what you want to hold on to.
** UNSAFE !! **
Don't do this!
This can (possibly) be done with sun.misc.Unsafe. Here's a class that uses Unsafe to write to two halves of a volatile long, cowboy style.
public class UnsafeBooleanPair {
private static final Unsafe UNSAFE;
private static final long[] OFFS = new long[2];
private static final long[] MASKS = new long[] {
-1L >>> 32L, -1L << 32L
};
static {
try {
UNSAFE = getTheUnsafe();
Field pair = UnsafeBooleanPair.class.getDeclaredField("pair");
OFFS[0] = UNSAFE.objectFieldOffset(pair);
OFFS[1] = OFFS[0] + 4L;
} catch(Exception e) {
throw new RuntimeException(e);
}
}
private volatile long pair;
public void set(int ind, boolean val) {
UNSAFE.putIntVolatile(this, OFFS[ind], val ? 1 : 0);
}
public boolean get(int ind) {
return (pair & MASKS[ind]) != 0L;
}
public boolean[] get(boolean[] vals) {
long p = pair;
vals[0] = (p & MASKS[0]) != 0L;
vals[1] = (p & MASKS[1]) != 0L;
return vals;
}
private static Unsafe getTheUnsafe()
throws Exception {
Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
theUnsafe.setAccessible(true);
return (Unsafe)theUnsafe.get(null);
}
}
Importantly, the Javadoc in the OpenJDK source for fieldOffset says not to do arithmetic with the offset. However, doing arithmetic with it appears to actually work, in that I don't get garbage.
This nets a single volatile read for the entire word, but also (potentially) a volatile write to either half of it. Potentially putByteVolatile could be used to split a long into 8 segments.
I wouldn't recommend that anybody use this (don't use this!) but it's kind of interesting as an oddity.
I can only think of two ways: use the lower two bits of an AtomicInteger or use a spinlock. I think Hotspot can optimize certain locks down to spinlocks on its own.
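A minimal sketch of the spinlock option, assuming the critical section is very short (Thread.onSpinWait() needs Java 9+; drop that line on older JVMs):
final AtomicBoolean spin = new AtomicBoolean();

void withSpinLock(Runnable critical) {
    // busy-wait until we flip the flag from false to true
    while (!spin.compareAndSet(false, true)) {
        Thread.onSpinWait(); // Java 9+ scheduling hint
    }
    try {
        critical.run();
    } finally {
        spin.set(false); // release
    }
}

// usage: the two reads are consistent provided all writers also go through withSpinLock
withSpinLock(() -> {
    if (atomicBoolean1.get() && !atomicBoolean2.get()) {
        // ...
    }
});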
Use a Lock:
Lock l = ...;
l.lock();
try {
// access the resource protected by this lock
} finally {
l.unlock();
}
It's technically not a synchronized block, even though it is a form of synchronization. I think that what you're asking for is the very definition of synchronization, so I don't think it is possible to do it 'without synchronization'.