I'm developing an application that verifies the signatures of PDF files. The application should detect the full history of updates made to the file content before each signature was applied.
For example:
Signer 1 signed the plain pdf file
Signer 2 added a comment to the signed file, then signed it
How can the application detect that Signer 2 added a comment before his signature?
I have tried to use iText and PDFBox.
As already explained in a comment, neither iText nor PDFBox brings along a high-level API telling you what changed in an incremental update in terms of UI objects (comments, text content, ...).
You can use them to render the different revisions of the PDF as bitmaps and compare those images.
Or you can use them to tell you the changes in terms of low-level COS objects (dictionaries, arrays, numbers, strings, ...).
But analyzing the changes in those images or low-level objects and determining their meaning in terms of UI objects (e.g. that a comment and only a comment has been added) is highly non-trivial.
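For the bitmap route, here is a minimal sketch (assuming iText 7 and PDFBox 2.x on the class path; "signed.pdf" is a placeholder path) that extracts each signed revision with iText and renders its first page with PDFBox for a later pixel-by-pixel comparison:
try (PdfDocument document = new PdfDocument(new PdfReader("signed.pdf"))) {
    SignatureUtil signatureUtil = new SignatureUtil(document);
    for (String name : signatureUtil.getSignatureNames()) {
        // extractRevision returns the bytes of the document up to and including this signature
        try (PDDocument revision = PDDocument.load(signatureUtil.extractRevision(name))) {
            BufferedImage page = new PDFRenderer(revision).renderImageWithDPI(0, 150);
            // compare 'page' pixel by pixel with the same page of the next revision
        }
    }
}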
In response you asked
Can you explain more how I can detect changes in low-level COS objects?
What to Compare And What Changes to Consider
First of all you have to be clear about what document states you can compare to detect changes.
The PDF format allows appending changes to a PDF in so-called incremental updates. This allows changes to signed documents without cryptographically breaking those signatures, as the original signed bytes are left as is.
There can be more incremental updates in-between, though, which are not signed; e.g. the "Changes for version 2" might include multiple incremental updates.
One might consider comparing the revisions created by arbitrary incremental updates. The problem, though, is that you cannot identify the person who applied an incremental update without a signature.
Thus, it usually makes more sense to compare the signed revisions only and to hold each signer responsible for all changes since the previous signed revision. The only exception is the whole file, which, as the current version of the PDF, is of special interest even if there is no signature covering all of it.
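For orientation, a minimal sketch (the file name is a placeholder) that lists the signed revisions using iText 7's SignatureUtil and checks whether the newest signature covers the whole file:
try (PdfDocument document = new PdfDocument(new PdfReader("signed.pdf"))) {
    SignatureUtil signatureUtil = new SignatureUtil(document);
    for (String name : signatureUtil.getSignatureNames()) {
        // false for the last signature means there are unsigned updates after it
        System.out.printf("%s covers whole document: %s%n",
                name, signatureUtil.signatureCoversWholeDocument(name));
    }
}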
Next you have to decide what you consider a change. In particular:
Is every object override in an incremental update a change? Even those that override the original object with an identical copy?
What about changes that make a direct object indirect (or vice versa) but keep all contents and references intact?
What about addition of new objects that are not referred to from anywhere in the standard structure?
What about addition of objects that are not referenced from the cross reference streams or tables?
What about addition of data that's not following PDF syntax at all?
If you are indeed interested in such changes, too, existing PDF libraries usually don't provide the means to determine them out of the box; you most likely will at least have to change their code for traversing the chain of cross-reference tables/streams, or even analyze the file bytes of the update directly.
If you are not interested in such changes, though, there usually is no need to change or replace library routines.
As the enumerated and similar changes make no difference when the PDF is processed by specification-conforming PDF processors, one can usually ignore them.
If this is your position, too, the following example tool might give you a starting point.
An Example Tool Based on iText 7
With the limitations explained above you can compare signed revisions of a PDF using iText 7 without changes to the library by loading the revisions to compare into separate PdfDocument instances and recursively comparing the PDF objects starting with the trailer.
I once implemented this as a small helper tool for personal use (so it is not completely finished yet, more work-in-progress). First there is the base class that allows comparing two arbitrary documents:
public class PdfCompare {
public static void main(String[] args) throws IOException {
System.out.printf("Comparing:\n* %s\n* %s\n", args[0], args[1]);
try ( PdfDocument pdfDocument1 = new PdfDocument(new PdfReader(args[0]));
PdfDocument pdfDocument2 = new PdfDocument(new PdfReader(args[1])) ) {
PdfCompare pdfCompare = new PdfCompare(pdfDocument1, pdfDocument2);
pdfCompare.compare();
List<Difference> differences = pdfCompare.getDifferences();
if (differences == null || differences.isEmpty()) {
System.out.println("No differences found.");
} else {
System.out.printf("%d differences found:\n", differences.size());
for (Difference difference : pdfCompare.getDifferences()) {
for (String element : difference.getPath()) {
System.out.print(element);
}
System.out.printf(" - %s\n", difference.getDescription());
}
}
}
}
public interface Difference {
List<String> getPath();
String getDescription();
}
public PdfCompare(PdfDocument pdfDocument1, PdfDocument pdfDocument2) {
trailer1 = pdfDocument1.getTrailer();
trailer2 = pdfDocument2.getTrailer();
}
public void compare() {
LOGGER.info("Starting comparison");
try {
compared.clear();
differences.clear();
LOGGER.info("START COMPARE");
compare(trailer1, trailer2, Collections.singletonList("trailer"));
LOGGER.info("START SHORTEN PATHS");
shortenPaths();
} finally {
LOGGER.info("Finished comparison and shortening");
}
}
public List<Difference> getDifferences() {
return differences;
}
class DifferenceImplSimple implements Difference {
DifferenceImplSimple(PdfObject object1, PdfObject object2, List<String> path, String description) {
this.pair = Pair.of(object1, object2);
this.path = path;
this.description = description;
}
@Override
public List<String> getPath() {
List<String> byPair = getShortestPath(pair);
return byPair != null ? byPair : shorten(path);
}
@Override public String getDescription() { return description; }
final Pair<PdfObject, PdfObject> pair;
final List<String> path;
final String description;
}
void compare(PdfObject object1, PdfObject object2, List<String> path) {
LOGGER.debug("Comparing objects at {}.", path);
if (object1 == null && object2 == null)
{
LOGGER.debug("Both objects are null at {}.", path);
return;
}
if (object1 == null) {
differences.add(new DifferenceImplSimple(object1, object2, path, "Missing in document 1"));
LOGGER.info("Object in document 1 is missing at {}.", path);
return;
}
if (object2 == null) {
differences.add(new DifferenceImplSimple(object1, object2, path, "Missing in document 2"));
LOGGER.info("Object in document 2 is missing at {}.", path);
return;
}
if (object1.getType() != object2.getType()) {
differences.add(new DifferenceImplSimple(object1, object2, path,
String.format("Type difference, %s in document 1 and %s in document 2",
getTypeName(object1.getType()), getTypeName(object2.getType()))));
LOGGER.info("Objects have different types at {}, {} and {}.", path, getTypeName(object1.getType()), getTypeName(object2.getType()));
return;
}
switch (object1.getType()) {
case PdfObject.ARRAY:
compareContents((PdfArray) object1, (PdfArray) object2, path);
break;
case PdfObject.DICTIONARY:
compareContents((PdfDictionary) object1, (PdfDictionary) object2, path);
break;
case PdfObject.STREAM:
compareContents((PdfStream)object1, (PdfStream)object2, path);
break;
case PdfObject.BOOLEAN:
case PdfObject.INDIRECT_REFERENCE:
case PdfObject.LITERAL:
case PdfObject.NAME:
case PdfObject.NULL:
case PdfObject.NUMBER:
case PdfObject.STRING:
compareContentsSimple(object1, object2, path);
break;
default:
differences.add(new DifferenceImplSimple(object1, object2, path, "Unknown object type " + object1.getType() + "; cannot compare"));
LOGGER.warn("Unknown object type at {}, {}.", path, object1.getType());
break;
}
}
void compareContents(PdfArray array1, PdfArray array2, List<String> path) {
int count1 = array1.size();
int count2 = array2.size();
if (count1 < count2) {
differences.add(new DifferenceImplSimple(array1, array2, path, "Document 1 misses " + (count2-count1) + " array entries"));
LOGGER.info("Array in document 1 is missing {} entries at {} for {}.", (count2-count1), path);
}
if (count1 > count2) {
differences.add(new DifferenceImplSimple(array1, array2, path, "Document 2 misses " + (count1-count2) + " array entries"));
LOGGER.info("Array in document 2 is missing {} entries at {} for {}.", (count1-count2), path);
}
if (alreadyCompared(array1, array2, path)) {
return;
}
int count = Math.min(count1, count2);
for (int i = 0; i < count; i++) {
compare(array1.get(i), array2.get(i), join(path, String.format("[%d]", i)));
}
}
void compareContents(PdfDictionary dictionary1, PdfDictionary dictionary2, List<String> path) {
List<PdfName> missing1 = new ArrayList<PdfName>(dictionary2.keySet());
missing1.removeAll(dictionary1.keySet());
if (!missing1.isEmpty()) {
differences.add(new DifferenceImplSimple(dictionary1, dictionary2, path, "Document 1 misses dictionary entries for " + missing1));
LOGGER.info("Dictionary in document 1 is missing entries at {} for {}.", path, missing1);
}
List<PdfName> missing2 = new ArrayList<PdfName>(dictionary1.keySet());
missing2.removeAll(dictionary2.keySet());
if (!missing2.isEmpty()) {
differences.add(new DifferenceImplSimple(dictionary1, dictionary2, path, "Document 2 misses dictionary entries for " + missing2));
LOGGER.info("Dictionary in document 2 is missing entries at {} for {}.", path, missing2);
}
if (alreadyCompared(dictionary1, dictionary2, path)) {
return;
}
List<PdfName> common = new ArrayList<PdfName>(dictionary1.keySet());
common.retainAll(dictionary2.keySet());
for (PdfName name : common) {
compare(dictionary1.get(name), dictionary2.get(name), join(path, name.toString()));
}
}
void compareContents(PdfStream stream1, PdfStream stream2, List<String> path) {
compareContents((PdfDictionary)stream1, (PdfDictionary)stream2, path);
byte[] bytes1 = stream1.getBytes();
byte[] bytes2 = stream2.getBytes();
if (!Arrays.equals(bytes1, bytes2)) {
differences.add(new DifferenceImplSimple(stream1, stream2, path, "Stream contents differ"));
LOGGER.info("Stream contents differ at {}.", path);
}
}
void compareContentsSimple(PdfObject object1, PdfObject object2, List<String> path) {
// vvv--- work-around for DEVSIX-4931, likely to be fixed in 7.1.15
if (object1 instanceof PdfNumber)
((PdfNumber)object1).getValue();
if (object2 instanceof PdfNumber)
((PdfNumber)object2).getValue();
// ^^^--- work-around for DEVSIX-4931, likely to be fixed in 7.1.15
if (!object1.equals(object2)) {
if (object1 instanceof PdfString) {
String string1 = object1.toString();
if (string1.length() > 40)
string1 = string1.substring(0, 40) + '\u22EF';
string1 = sanitize(string1);
String string2 = object2.toString();
if (string2.length() > 40)
string2 = string2.substring(0, 40) + '\u22EF';
string2 = sanitize(string2);
differences.add(new DifferenceImplSimple(object1, object2, path, String.format("String values differ, '%s' and '%s'", string1, string2)));
LOGGER.info("String values differ at {}, '{}' and '{}'.", path, string1, string2);
} else {
differences.add(new DifferenceImplSimple(object1, object2, path, String.format("Object values differ, '%s' and '%s'", object1, object2)));
LOGGER.info("Object values differ at {}, '{}' and '{}'.", path, object1, object2);
}
}
}
String sanitize(CharSequence string) {
char[] sanitized = new char[string.length()];
for (int i = 0; i < sanitized.length; i++) {
char c = string.charAt(i);
if (c >= 0 && c < ' ')
c = '\uFFFD';
sanitized[i] = c;
}
return new String(sanitized);
}
String getTypeName(byte type) {
switch (type) {
case PdfObject.ARRAY: return "ARRAY";
case PdfObject.BOOLEAN: return "BOOLEAN";
case PdfObject.DICTIONARY: return "DICTIONARY";
case PdfObject.LITERAL: return "LITERAL";
case PdfObject.INDIRECT_REFERENCE: return "REFERENCE";
case PdfObject.NAME: return "NAME";
case PdfObject.NULL: return "NULL";
case PdfObject.NUMBER: return "NUMBER";
case PdfObject.STREAM: return "STREAM";
case PdfObject.STRING: return "STRING";
default:
return "UNKNOWN";
}
}
List<String> join(List<String> path, String element) {
String[] array = path.toArray(new String[path.size() + 1]);
array[array.length-1] = element;
return Arrays.asList(array);
}
boolean alreadyCompared(PdfObject object1, PdfObject object2, List<String> path) {
Pair<PdfObject, PdfObject> pair = Pair.of(object1, object2);
if (compared.containsKey(pair)) {
//LOGGER.debug("Objects already compared at {}, previously at {}.", path, compared.get(pair));
Set<List<String>> paths = compared.get(pair);
boolean alreadyPresent = false;
// List<List<String>> toRemove = new ArrayList<>();
// for (List<String> formerPath : paths) {
// for (int i = 0; ; i++) {
// if (i == path.size()) {
// toRemove.add(formerPath);
// System.out.print('.');
// break;
// }
// if (i == formerPath.size()) {
// alreadyPresent = true;
// System.out.print(':');
// break;
// }
// if (!path.get(i).equals(formerPath.get(i)))
// break;
// }
// }
// paths.removeAll(toRemove);
if (!alreadyPresent)
paths.add(path);
return true;
}
compared.put(pair, new HashSet<>(Collections.singleton(path)));
return false;
}
List<String> getShortestPath(Pair<PdfObject, PdfObject> pair) {
Set<List<String>> paths = compared.get(pair);
//return (paths == null) ? null : Collections.min(paths, pathComparator);
return (paths == null || paths.isEmpty()) ? null : shortened.get(paths.stream().findFirst().get());
}
void shortenPaths() {
List<Map<List<String>, SortedSet<List<String>>>> data = new ArrayList<>();
for (Set<List<String>> set : compared.values()) {
SortedSet<List<String>> sortedSet = new TreeSet<List<String>>(pathComparator);
sortedSet.addAll(set);
for (List<String> path : sortedSet) {
while (path.size() >= data.size()) {
data.add(new HashMap<>());
}
SortedSet<List<String>> former = data.get(path.size()).put(path, sortedSet);
if (former != null) {
LOGGER.error("Path not well-defined for {}", path);
}
}
}
for (int pathSize = 3; pathSize < data.size(); pathSize++) {
for (Map.Entry<List<String>, SortedSet<List<String>>> pathEntry : data.get(pathSize).entrySet()) {
List<String> path = pathEntry.getKey();
SortedSet<List<String>> equivalents = pathEntry.getValue();
for (int subpathSize = 2; subpathSize < pathSize; subpathSize++) {
List<String> subpath = path.subList(0, subpathSize);
List<String> remainder = path.subList(subpathSize, pathSize);
SortedSet<List<String>> subequivalents = data.get(subpathSize).get(subpath);
if (subequivalents != null && subequivalents.size() > 1) {
List<String> subequivalent = subequivalents.first();
if (subequivalent.size() < subpathSize) {
List<String> replacement = join(subequivalent, remainder);
if (equivalents.add(replacement)) {
data.get(replacement.size()).put(replacement, equivalents);
}
}
}
}
}
}
shortened.clear();
for (Map<List<String>, SortedSet<List<String>>> singleLengthData : data) {
for (Map.Entry<List<String>, SortedSet<List<String>>> entry : singleLengthData.entrySet()) {
List<String> path = entry.getKey();
List<String> shortenedPath = entry.getValue().first();
shortened.put(path, shortenedPath);
}
}
}
List<String> join(List<String> path, List<String> elements) {
String[] array = path.toArray(new String[path.size() + elements.size()]);
for (int i = 0; i < elements.size(); i++) {
array[path.size() + i] = elements.get(i);
}
return Arrays.asList(array);
}
List<String> shorten(List<String> path) {
List<String> shortPath = path;
for (int subpathSize = path.size(); subpathSize > 2; subpathSize--) {
List<String> subpath = path.subList(0, subpathSize);
List<String> shortSubpath = shortened.get(subpath);
if (shortSubpath != null && shortSubpath.size() < subpathSize) {
List<String> remainder = path.subList(subpathSize, path.size());
List<String> replacement = join(shortSubpath, remainder);
if (replacement.size() < shortPath.size())
shortPath = replacement;
}
}
return shortPath;
}
final static Logger LOGGER = LoggerFactory.getLogger(PdfCompare.class);
final PdfDictionary trailer1;
final PdfDictionary trailer2;
final Map<Pair<PdfObject, PdfObject>, Set<List<String>>> compared = new HashMap<>();
final List<Difference> differences = new ArrayList<>();
final Map<List<String>, List<String>> shortened = new HashMap<>();
final static Comparator<List<String>> pathComparator = new Comparator<List<String>>() {
@Override
public int compare(List<String> o1, List<String> o2) {
int compare = Integer.compare(o1.size(), o2.size());
if (compare != 0)
return compare;
for (int i = 0; i < o1.size(); i++) {
compare = o1.get(i).compareTo(o2.get(i));
if (compare != 0)
return compare;
}
return 0;
}
};
}
(PdfCompare.java)
The tool to use this code for revision comparison is a subclass thereof:
public class PdfRevisionCompare extends PdfCompare {
public static void main(String[] args) throws IOException {
for (String arg : args) {
System.out.printf("\nComparing revisions of: %s\n***********************\n", args[0]);
try (PdfDocument pdfDocument = new PdfDocument(new PdfReader(arg))) {
SignatureUtil signatureUtil = new SignatureUtil(pdfDocument);
List<String> signatureNames = signatureUtil.getSignatureNames();
if (signatureNames.isEmpty()) {
System.out.println("No signed revisions detected. (no AcroForm)");
continue;
}
String previousRevision = signatureNames.get(0);
PdfDocument previousDocument = new PdfDocument(new PdfReader(signatureUtil.extractRevision(previousRevision)));
System.out.printf("* Initial signed revision: %s\n", previousRevision);
for (int i = 1; i < signatureNames.size(); i++) {
String currentRevision = signatureNames.get(i);
PdfDocument currentDocument = new PdfDocument(new PdfReader(signatureUtil.extractRevision(currentRevision)));
showDifferences(previousDocument, currentDocument);
System.out.printf("* Next signed revision (%d): %s\n", i+1, currentRevision);
previousDocument.close();
previousDocument = currentDocument;
previousRevision = currentRevision;
}
if (signatureUtil.signatureCoversWholeDocument(previousRevision)) {
System.out.println("No unsigned updates.");
} else {
showDifferences(previousDocument, pdfDocument);
System.out.println("* Final unsigned revision");
}
previousDocument.close();
}
}
}
static void showDifferences(PdfDocument previousDocument, PdfDocument currentDocument) {
PdfRevisionCompare pdfRevisionCompare = new PdfRevisionCompare(previousDocument, currentDocument);
pdfRevisionCompare.compare();
List<Difference> differences = pdfRevisionCompare.getDifferences();
if (differences == null || differences.isEmpty()) {
System.out.println("No differences found.");
} else {
System.out.printf("%d differences found:\n", differences.size());
for (Difference difference : differences) {
for (String element : difference.getPath()) {
System.out.print(element);
}
System.out.printf(" - %s\n", difference.getDescription());
}
}
}
public PdfRevisionCompare(PdfDocument pdfDocument1, PdfDocument pdfDocument2) {
super(pdfDocument1, pdfDocument2);
}
}
(PdfRevisionCompare.java)
I am new to Java and I have to use the code below, but it does not work because I have to specify the paths for the input data and output data. I got the code from the internet. Please help me.
class Svm_scale
{
private BufferedReader rewind(BufferedReader fp, String filename) throws IOException
{
fp.close();
return new BufferedReader(new FileReader(filename));
}
private void output_target(double value)
{
LnCount++;
if(y_scaling)
{
if(value == y_min)
value = y_lower;
else if(value == y_max)
value = y_upper;
else
value = y_lower + (y_upper-y_lower) *
(value-y_min) / (y_max-y_min);
}
formatterscaled.format(value+" ");
System.out.println(" Line Number "+LnCount + " ");
}
private void output(int index, double value)
{
count++;
double Threshold1=0,Threshold2=0;
Threshold1= Avg[index]+(STDV[index]/2);
Threshold2= Avg[index]-(STDV[index]/2);
if(value > Threshold1 )
value = 2;
else if(value < Threshold2 )
value = -2;
else
value = 0;
formatterscaled.format( formatter.format(value) + ",");
// System.out.println(" Counter "+count);
// }
}
String save_filename =Save1; // = null?
String restore_filename =null;
String scale_data_filename =Disc; // set this to the path where the output should be stored, e.g. = "C:\\temp\\scaled";
String data_filename =Libsvm; // set this to the path where the input can be get, e.g. = "C:\\temp\\inputdata"
These are the Strings you need to adapt so that the program can read and write. Save1, Disc, and Libsvm are not in your code, so one can only guess where they come from.
data_filename and scale_data_filename are required. save_filename seems to be optional and may be set to null.
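For example, a hypothetical assignment (Save1, Disc, and Libsvm replaced by plain literals; the paths are placeholders you must adapt) could look like this:
String save_filename = null;                         // optional, may stay null
String restore_filename = null;
String scale_data_filename = "C:\\temp\\scaled.txt"; // where the output is written
String data_filename = "C:\\temp\\inputdata.txt";    // where the input is read from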
Here is the background. I have the following input for my MapReduce job (example):
Apache Hadoop
Apache Lucene
StackOverflow
....
(Actually each line represents a user query. Not important here.) And I want my RecordReader class to read one line and then pass several key-value pairs to the mappers. For example, if the RecordReader gets Apache Hadoop, then I want it to generate the following key-value pairs and pass them to the mappers:
Apache Hadoop - 1
Apache Hadoop - 2
Apache Hadoop - 3
("-" is the separator here.) And I found RecordReader pass key-values in next() method:
next(key, value);
Every time RecordReader.next() is called, only one key and one value are passed as arguments. So how should I get my work done?
I believe you can simply use this:
public static class MultiMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
for (int i = 1; i <= n; i++) {
context.write(value, new IntWritable(i));
}
}
}
Here n is the number of values you want to pass. For example for the key-value pairs you specified:
Apache Hadoop - 1
Apache Hadoop - 2
Apache Hadoop - 3
n would be 3.
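One way to make n configurable (a sketch, assuming a made-up configuration key multimapper.n set in the driver via conf.setInt("multimapper.n", 3)) is to read it in the mapper's setup method:
private int n;

@Override
protected void setup(Context context) {
    // default to 3 if the hypothetical key is not set
    n = context.getConfiguration().getInt("multimapper.n", 3);
}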
I think if you want to send the same key to the mapper multiple times, you must implement your own RecordReader; for example, you can write a MultiRecordReader that extends LineRecordReader, and there you must change the nextKeyValue method.
This is the original code from LineRecordReader:
public boolean nextKeyValue() throws IOException {
if (key == null) {
key = new LongWritable();
}
key.set(pos);
if (value == null) {
value = new Text();
}
int newSize = 0;
// We always read one extra line, which lies outside the upper
// split limit i.e. (end - 1)
while (getFilePosition() <= end) {
newSize = in.readLine(value, maxLineLength,
Math.max(maxBytesToConsume(pos), maxLineLength));
pos += newSize;
if (newSize < maxLineLength) {
break;
}
// line too long. try again
LOG.info("Skipped line of size " + newSize + " at pos " +
(pos - newSize));
}
if (newSize == 0) {
key = null;
value = null;
return false;
} else {
return true;
}
}
and you can change it like this:
// additional fields: the current line and how often it has been emitted;
// note that key and value are both of type Text in this variant
private final Text line = new Text();
private int n = 0;

public boolean nextKeyValue() throws IOException {
    if (key == null) {
        key = new Text(); // the key is now the line itself
    }
    if (value == null) {
        value = new Text(); // the value is the counter 1..3
    }
    // only read a new line once the current one has been emitted three times
    if (n == 0) {
        int newSize = 0;
        while (getFilePosition() <= end) {
            newSize = in.readLine(line, maxLineLength,
                    Math.max(maxBytesToConsume(pos), maxLineLength));
            pos += newSize;
            if (newSize < maxLineLength) {
                break;
            }
            // line too long. try again
            LOG.info("Skipped line of size " + newSize + " at pos " +
                    (pos - newSize));
        }
        if (newSize == 0) {
            key = null;
            value = null;
            return false;
        }
    }
    n++;
    key.set(line);                  // change value --> key
    value.set(Integer.toString(n)); // the counter becomes the value
    if (n == 3) {                   // we don't go to the next line until n is three
        n = 0;
    }
    return true;
}
I think this can work for you.
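Note that to plug such a reader into a job you also need a custom InputFormat; a minimal sketch (MultiLineRecordReader is the assumed name of the modified reader above):
public class MultiLineInputFormat extends FileInputFormat<Text, Text> {
    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new MultiLineRecordReader(); // the reader sketched above
    }
}
// in the driver: job.setInputFormatClass(MultiLineInputFormat.class);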
Try not giving a key:
context.write(NullWritable.get(), new Text("Apache Hadoop - 1"));
context.write(NullWritable.get(), new Text("Apache Hadoop - 2"));
context.write(NullWritable.get(), new Text("Apache Hadoop - 3"));
I am trying to convert a Java function into equivalent Groovy code, but I am not able to find anything which does an && operation in a loop. Can anyone guide me through?
So far this is what I got
public List getAlert(def searchParameters, def numOfResult) throws UnsupportedEncodingException
{
List respList=null
respList = new ArrayList()
String[] searchStrings = searchParameters.split(",")
try
{
for(strIndex in searchStrings)
{
IQueryResult result = search(searchStrings[strIndex])
if(result!=null)
{
def count = 0
/*The below line gives me error*/
for(it in result.document && count < numOfResult)
{
}
}
}
}
catch(Exception e)
{
e.printStackTrace()
}
}
My Java code
public List getAlert(String searchParameters, int numOfResult) throws UnsupportedEncodingException
{
    List respList = null;
    respList = new ArrayList();
    String[] searchStrings = searchParameters.split(",");
    try {
        for (int strIndex = 0; strIndex < searchStrings.length; strIndex++) {
            IQueryResult result = search(searchStrings[strIndex]);
            if (result != null) {
                ListIterator it = result.documents();
                int count = 0;
                while ((it.hasNext()) && (count < numOfResult)) {
                    IDocumentSummary summary = (IDocumentSummary)it.next();
                    if (summary != null) {
                        String docid = summary.getSummaryField("infadocid").getStringValue();
                        int index = docid.indexOf("#");
                        docid = docid.substring(index + 1);
                        String url = summary.getSummaryField("url").getStringValue();
                        int i = url.indexOf("/", 8);
                        String endURL = url.substring(i + 1, url.length());
                        String body = summary.getSummaryField("infadocumenttitle").getStringValue();
                        String frontURL = produrl + endURL;
                        String strURL;
                        strURL = frontURL;
                        strURL = body;
                        String strDocId;
                        strDocId = frontURL;
                        strDocId = docid;
                        count++;
                    }
                }
            }
            result = null;
        }
    } catch (Exception e) {
        e.printStackTrace();
        return respList;
    }
    return respList;
}
It seems to me like
def summary = result.documents.first()
if (summary) {
String docid = summary.getSummaryField("infadocid").getStringValue()
...
strDocId = docid
}
is all you really need, because the for loop actually doesn't make much sense when all you want is to process the first record.
If there is a possibility that result.documents contains nulls, then replace first() with find()
Edit: To process more than one result:
def summaries = result.documents.take(numOfResult)
// above code assumes result.documents contains no nulls; otherwise:
// def count=0
// def summaries = result.documents.findAll { it && count++<numOfResult }
summaries.each { summary ->
String docid = summary.getSummaryField("infadocid").getStringValue()
...
strDocId = docid
}
In idiomatic Groovy code, many loops are replaced by iterator methods like each().
You know the while statement also exists in Groovy?
As a consequence, there is no reason to transform it into a for loop.
/*The below line gives me error*/
for(it in result.document && count < 1)
{
}
This line is giving you an error, because result.document will try to call result.getDocument() which doesn't exist.
Also, you should avoid using it as a variable name in Groovy, because within the scope of a closure it is the default name of the first closure parameter.
I haven't looked at the code thoroughly (or as the kids say, "tl;dr"), but I suspect if you just rename the file from .java to .groovy, it will probably work.
I am struggling to port a Perl program to Java, and learning Java as I go. A central component of the original program is a Perl module that does string prefix lookups in a +500 GB sorted text file using binary search
(essentially, "seek" to a byte offset in the middle of the file, backtrack to nearest newline, compare line prefix with the search string, "seek" to half/double that byte offset, repeat until found...)
I have experimented with several database solutions but found that nothing beats this in sheer lookup speed with data sets of this size. Do you know of any existing Java library that implements such functionality? Failing that, could you point me to some idiomatic example code that does random access reads in text files?
Alternatively, I am not familiar with the new (?) Java I/O libraries but would it be an option to memory-map the 500 GB text file (I'm on a 64-bit machine with memory to spare) and do binary search on the memory-mapped byte array? I would be very interested to hear any experiences you have to share about this and similar problems.
I am a big fan of Java's MappedByteBuffers for situations like this. It is blazing fast. Below is a snippet I put together for you that maps a buffer to the file, seeks to the middle, and then searches backwards to a newline character. This should be enough to get you going?
I have similar code (seek, read, repeat until done) in my own application, benchmarked java.io streams against MappedByteBuffer in a production environment, and posted the results on my blog (Geekomatic posts tagged 'java.nio') with raw data, graphs and all.
Two second summary? My MappedByteBuffer-based implementation was about 275% faster. YMMV.
To work for files larger than ~2 GB, which is a problem because of the cast and .position(int pos), I've crafted a paging algorithm backed by an array of MappedByteBuffers. You'll need to be working on a 64-bit system for this to work with files larger than 2-4 GB, because MBBs use the OS's virtual memory system to work their magic.
import static java.nio.channels.FileChannel.MapMode.READ_ONLY;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

public class StusMagicLargeFileReader {
private static final long PAGE_SIZE = Integer.MAX_VALUE;
private List<MappedByteBuffer> buffers = new ArrayList<MappedByteBuffer>();
private final byte raw[] = new byte[1];
public static void main(String[] args) throws IOException {
File file = new File("/Users/stu/test.txt");
FileChannel fc = (new FileInputStream(file)).getChannel();
StusMagicLargeFileReader buffer = new StusMagicLargeFileReader(fc);
long position = file.length() / 2;
String candidate = buffer.getString(position--);
while (position >= 0 && !candidate.equals("\n"))
candidate = buffer.getString(position--);
//have newline position or start of file...do other stuff
}
StusMagicLargeFileReader(FileChannel channel) throws IOException {
long start = 0, length = 0;
for (long index = 0; start + length < channel.size(); index++) {
if ((channel.size() / PAGE_SIZE) == index)
length = (channel.size() - index * PAGE_SIZE) ;
else
length = PAGE_SIZE;
start = index * PAGE_SIZE;
buffers.add((int) index, channel.map(READ_ONLY, start, length));
}
}
public String getString(long bytePosition) {
int page = (int) (bytePosition / PAGE_SIZE);
int index = (int) (bytePosition % PAGE_SIZE);
raw[0] = buffers.get(page).get(index);
return new String(raw);
}
}
I have the same problem. I am trying to find all lines that start with some prefix in a sorted file.
Here is a method I cooked up which is largely a port of Python code found here: http://www.logarithmic.net/pfh/blog/01186620415
I have tested it but not thoroughly just yet. It does not use memory mapping, though.
public static List<String> binarySearch(String filename, String string) {
List<String> result = new ArrayList<String>();
try {
File file = new File(filename);
RandomAccessFile raf = new RandomAccessFile(file, "r");
long low = 0;
long high = file.length();
long p = -1;
while (low < high) {
long mid = (low + high) / 2;
p = mid;
while (p >= 0) {
raf.seek(p);
char c = (char) raf.readByte();
//System.out.println(p + "\t" + c);
if (c == '\n')
break;
p--;
}
if (p < 0)
raf.seek(0);
String line = raf.readLine();
//System.out.println("-- " + mid + " " + line);
if (line.compareTo(string) < 0)
low = mid + 1;
else
high = mid;
}
p = low;
while (p >= 0) {
raf.seek(p);
if (((char) raf.readByte()) == '\n')
break;
p--;
}
if (p < 0)
raf.seek(0);
while (true) {
String line = raf.readLine();
if (line == null || !line.startsWith(string))
break;
result.add(line);
}
raf.close();
} catch (IOException e) {
System.out.println("IOException:");
e.printStackTrace();
}
return result;
}
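Usage is then simply (the file name and prefix are placeholders):
List<String> matches = binarySearch("/data/sorted.txt", "Apache");
for (String match : matches) {
    System.out.println(match);
}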
I am not aware of any library that has that functionality. However, correct code for an external binary search in Java should be similar to this:
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Comparator;

class ExternalBinarySearch {
final RandomAccessFile file;
final Comparator<String> test; // tests the element given as search parameter with the line. Insert a PrefixComparator here
public ExternalBinarySearch(File f, Comparator<String> test) throws FileNotFoundException {
this.file = new RandomAccessFile(f, "r");
this.test = test;
}
public String search(String element) throws IOException {
long l = file.length();
return search(element, -1, l-1);
}
/**
* Searches the given element in the range [low,high]. The low value of -1 is a special case to denote the beginning of a file.
* In contrast to every other line, a line at the beginning of a file doesn't need a \n directly before the line
*/
private String search(String element, long low, long high) throws IOException {
if(high - low < 1024) {
// search directly
long p = low;
while(p < high) {
String line = nextLine(p);
int r = test.compare(line,element);
if(r > 0) {
return null;
} else if (r < 0) {
p += line.length();
} else {
return line;
}
}
return null;
} else {
long m = low + ((high - low) / 2);
String line = nextLine(m);
int r = test.compare(line, element);
if(r > 0) {
return search(element, low, m);
} else if (r < 0) {
return search(element, m, high);
} else {
return line;
}
}
}
private String nextLine(long low) throws IOException {
if(low == -1) { // Beginning of file
file.seek(0);
} else {
file.seek(low);
}
int bufferLength = 65 * 1024;
byte[] buffer = new byte[bufferLength];
int r = file.read(buffer);
int lineBeginIndex = -1;
// search beginning of line
if(low == -1) { //beginning of file
lineBeginIndex = 0;
} else {
//normal mode
for(int i = 0; i < 1024; i++) {
if(buffer[i] == '\n') {
lineBeginIndex = i + 1;
break;
}
}
}
if(lineBeginIndex == -1) {
// no line begins within next 1024 bytes
return null;
}
int start = lineBeginIndex;
for(int i = start; i < r; i++) {
if(buffer[i] == '\n') {
// Found end of line
return new String(buffer, lineBeginIndex, i - lineBeginIndex + 1);
}
}
throw new IllegalArgumentException("Line too long");
}
}
Please note: I made up this code ad hoc: corner cases are not tested nearly well enough, the code assumes that no single line is larger than 64K, etc.
I also think that building an index of the offsets where lines start might be a good idea. For a 500 GB file, that index should be stored in an index file. You should gain a not-so-small constant factor with that index, because then there is no need to search for the next line in each step.
I know that was not the question, but building a prefix tree data structure like a (Patricia) trie (on disk/SSD) might be a good idea for the prefix search.
This is a simple example of what you want to achieve. I would probably first index the file, keeping track of the file position for each string. I'm assuming the strings are separated by newlines (or carriage returns):
RandomAccessFile file = new RandomAccessFile("filename.txt", "r");
List<Long> indexList = new ArrayList<Long>();
long pos = 0;
while (file.readLine() != null)
{
Long linePos = new Long(pos);
indexList.add(linePos);
pos = file.getFilePointer();
}
int indexSize = indexList.size();
Long[] indexArray = new Long[indexSize];
indexList.toArray(indexArray);
The last step is to convert to an array for a slight speed improvement when doing lots of lookups. I would probably convert the Long[] to a long[] also, but I did not show that above. Finally the code to read the string from a given indexed position:
int i; // Initialize this appropriately for your algorithm.
file.seek(indexArray[i]);
String line = file.readLine();
// At this point, line contains the string #i.
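The Long[]-to-long[] conversion mentioned above is straightforward; a short sketch:
long[] primitiveIndex = new long[indexArray.length];
for (int j = 0; j < primitiveIndex.length; j++) {
    primitiveIndex[j] = indexArray[j]; // unbox once here instead of at every lookup
}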
If you are dealing with a 500GB file, then you might want to use a faster lookup method than binary search - namely a radix sort which is essentially a variant of hashing. The best method for doing this really depends on your data distributions and types of lookup, but if you are looking for string prefixes there should be a good way to do this.
I posted an example of a radix sort solution for integers, but you can use the same idea - basically to cut down the sort time by dividing the data into buckets, then using O(1) lookup to retrieve the bucket of data that is relevant.
Option Strict On
Option Explicit On
Module Module1
Private Const MAX_SIZE As Integer = 100000
Private m_input(MAX_SIZE) As Integer
Private m_table(MAX_SIZE) As List(Of Integer)
Private m_randomGen As New Random()
Private m_operations As Integer = 0
Private Sub generateData()
' fill with random numbers between 0 and MAX_SIZE - 1
For i = 0 To MAX_SIZE - 1
m_input(i) = m_randomGen.Next(0, MAX_SIZE - 1)
Next
End Sub
Private Sub sortData()
For i As Integer = 0 To MAX_SIZE - 1
Dim x = m_input(i)
If m_table(x) Is Nothing Then
m_table(x) = New List(Of Integer)
End If
m_table(x).Add(x)
' clearly this is simply going to be MAX_SIZE -1
m_operations = m_operations + 1
Next
End Sub
Private Sub printData(ByVal start As Integer, ByVal finish As Integer)
If start < 0 Or start > MAX_SIZE - 1 Then
Throw New Exception("printData - start out of range")
End If
If finish < 0 Or finish > MAX_SIZE - 1 Then
Throw New Exception("printData - finish out of range")
End If
For i As Integer = start To finish
If m_table(i) IsNot Nothing Then
For Each x In m_table(i)
Console.WriteLine(x)
Next
End If
Next
End Sub
' run the entire sort, but just print out the first 100 for verification purposes
Private Sub test()
m_operations = 0
generateData()
Console.WriteLine("Time started = " & Now.ToString())
sortData()
Console.WriteLine("Time finished = " & Now.ToString & " Number of operations = " & m_operations.ToString())
' print out a random 100 segment from the sorted array
Dim start As Integer = m_randomGen.Next(0, MAX_SIZE - 101)
printData(start, start + 100)
End Sub
Sub Main()
test()
Console.ReadLine()
End Sub
End Module
I posted a gist https://gist.github.com/mikee805/c6c2e6a35032a3ab74f643a1d0f8249c that is a rather complete example based on what I found on Stack Overflow and some blogs. Hopefully someone else can use it.
import static java.nio.file.Files.isWritable;
import static java.nio.file.StandardOpenOption.READ;
import static org.apache.commons.io.FileUtils.forceMkdir;
import static org.apache.commons.io.IOUtils.closeQuietly;
import static org.apache.commons.lang3.StringUtils.isBlank;
import static org.apache.commons.lang3.StringUtils.trimToNull;
import java.io.File;
import java.io.IOException;
import java.nio.Buffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
public class FileUtils {
private FileUtils() {
}
private static boolean found(final String candidate, final String prefix) {
return isBlank(candidate) || candidate.startsWith(prefix);
}
private static boolean before(final String candidate, final String prefix) {
return prefix.compareTo(candidate.substring(0, prefix.length())) < 0;
}
public static MappedByteBuffer getMappedByteBuffer(final Path path) {
FileChannel fileChannel = null;
try {
fileChannel = FileChannel.open(path, READ);
return fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size()).load();
}
catch (Exception e) {
throw new RuntimeException(e);
}
finally {
closeQuietly(fileChannel);
}
}
public static String binarySearch(final String prefix, final MappedByteBuffer buffer) {
if (buffer == null) {
return null;
}
try {
long low = 0;
long high = buffer.limit();
while (low < high) {
int mid = (int) ((low + high) / 2);
final String candidate = getLine(mid, buffer);
if (found(candidate, prefix)) {
return trimToNull(candidate);
}
else if (before(candidate, prefix)) {
high = mid;
}
else {
low = mid + 1;
}
}
}
catch (Exception e) {
throw new RuntimeException(e);
}
return null;
}
private static String getLine(int position, final MappedByteBuffer buffer) {
// search backwards to the find the proceeding new line
// then search forwards again until the next new line
// return the string in between
final StringBuilder stringBuilder = new StringBuilder();
// walk it back
char candidate = (char)buffer.get(position);
while (position > 0 && candidate != '\n') {
candidate = (char)buffer.get(--position);
}
// we either are at the beginning of the file or a new line
if (position == 0) {
// we are at the beginning at the first char
candidate = (char)buffer.get(position);
stringBuilder.append(candidate);
}
// there is/are char(s) after new line / first char
if (isInBuffer(buffer, position)) {
//first char after new line
candidate = (char)buffer.get(++position);
stringBuilder.append(candidate);
//walk it forward
while (isInBuffer(buffer, position) && candidate != ('\n')) {
candidate = (char)buffer.get(++position);
stringBuilder.append(candidate);
}
}
return stringBuilder.toString();
}
private static boolean isInBuffer(final Buffer buffer, int position) {
return position + 1 < buffer.limit();
}
public static File getOrCreateDirectory(final String dirName) {
final File directory = new File(dirName);
try {
forceMkdir(directory);
isWritable(directory.toPath());
}
catch (IOException e) {
throw new RuntimeException(e);
}
return directory;
}
}
I had a similar problem, so I created a (Scala) library from solutions provided in this thread:
https://github.com/avast/BigMap
It contains utilities for sorting a huge file and binary searching in this sorted file...
If you truly want to try memory mapping the file, I found a tutorial on how to use memory mapping in Java NIO.
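For reference, a minimal memory-mapping sketch with java.nio (a single MappedByteBuffer is limited to 2 GB, so a 500 GB file needs an array of mappings as shown in the answers above; "data.txt" is a placeholder):
try (FileChannel channel = FileChannel.open(Paths.get("data.txt"), StandardOpenOption.READ)) {
    long size = Math.min(channel.size(), Integer.MAX_VALUE);
    MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
    byte firstByte = buffer.get(0); // random access without explicit seeks
}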