An application uses a JEditorPane to display HTML pages and also has the ability to print them. We construct the MediaPrintableArea for the PrinterJob attribute set like so:
float mediaWidth = mediaSize.getX(Size2DSyntax.MM);
float mediaHeight = mediaSize.getY(Size2DSyntax.MM);
float imageableX = 18;
float imageableY = 25;
float imageableWidth = (mediaWidth - (2 * imageableX));
float imageableHeight = (mediaHeight - (2 * imageableY));
MediaPrintableArea imageableArea = new MediaPrintableArea(imageableX, imageableY, imageableWidth, imageableHeight, Size2DSyntax.MM);
So we control the printable area of the page. However, when the moons align and a single line is just the right length, the end of the last character in the line is being cut off.
For example: if a line ends with the word "to", only the left-most half of the 'o' is visible on the printed page. I would expect that if the word ran off the edge of the printable area, "to" would wrap to the next line, but it doesn't.
Is there some other method of defining the printable area besides MediaPrintableArea? Is there anything that could cause the words not to wrap, or that affects how Java calculates their placement?
We've also tested several other printers and printed from browsers, where we can print beyond the point at which our Java print job cuts off, so I don't think this is a hardware problem.
You're probably rendering the JEditorPane starting from (0, 0) instead of from PageFormat.getImageableX(), PageFormat.getImageableY(). See http://java.sun.com/developer/onlineTraining/Programming/JDCBook/advprint.html for more information.
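For illustration, a minimal sketch of a Printable that draws the pane inside the imageable area might look like this (the editorPane field and the single-page handling are assumptions, not the OP's code):

import java.awt.Graphics;
import java.awt.Graphics2D;
import java.awt.print.PageFormat;
import java.awt.print.Printable;
import javax.swing.JEditorPane;

// Sketch only: render a JEditorPane inside the imageable area of the page.
class EditorPanePrintable implements Printable {
    private final JEditorPane editorPane;

    EditorPanePrintable(JEditorPane editorPane) {
        this.editorPane = editorPane;
    }

    @Override
    public int print(Graphics g, PageFormat pageFormat, int pageIndex) {
        if (pageIndex > 0) {
            return NO_SUCH_PAGE; // single-page sketch
        }
        Graphics2D g2d = (Graphics2D) g;
        // All drawing happens after translating to the top-left corner of the
        // printable area, not the corner of the paper.
        g2d.translate(pageFormat.getImageableX(), pageFormat.getImageableY());
        // Constrain the pane to the imageable size so lines wrap at the
        // printable area rather than at the paper edge.
        editorPane.setSize((int) pageFormat.getImageableWidth(),
                (int) pageFormat.getImageableHeight());
        editorPane.print(g2d);
        return PAGE_EXISTS;
    }
}

How the pane is sized for printing is a separate concern; the key point is that every drawing operation happens relative to getImageableX()/getImageableY().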
How can I start a new page at the paragraph level? I know I can do this at the document level, but this would break my formatting.
Perhaps there is a function to find the remaining lines on a page, depending on the font and font size?
You can definitely do this.
But it requires some knowledge about how iText renders its content.
Internally, when a Paragraph object is rendered, it uses a ParagraphRenderer. Each ParagraphRenderer has one or more LineRenderer objects as its children. And similarly, each LineRenderer has one or more TextRenderer objects.
In order to get information about where a paragraph would be split, you can ask the Paragraph object to perform layout against a given LayoutContext object (which contains the width and height of available space, as well as some other useful information), and get the LayoutResult back.
LayoutResult will be able to tell you where the Paragraph was split.
Try the following piece of code, where w denotes the available width and h denotes the available height:
Rectangle layoutRect = new Rectangle(w,h);
LayoutArea layoutArea = new LayoutArea(1,layoutRect);
LayoutContext context = new LayoutContext(layoutArea);
Text layoutText = new Text(s);
layoutText.setTextRise(0f);
layoutText.setSplitCharacters(new DefaultSplitCharacters());
layoutText.setFont(font);
layoutText.setFontSize(fontSize);
Paragraph p = new Paragraph().add(layoutText);
LayoutResult hwResult = p.createRendererSubTree().layout(context);
At that point, use an IDE to inspect the LayoutResult object. If the text was split, you should see a SplitRenderer, which can give you more information.
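Alternatively, a sketch of inspecting the result programmatically (reusing the variables from the snippet above; the constants and getters are part of iText 7's layout module):

// Sketch only: examine the LayoutResult returned by layout(context).
if (hwResult.getStatus() == LayoutResult.PARTIAL) {
    // The paragraph does not fit completely into the given area.
    // The split renderer holds the part that fits; the overflow renderer
    // holds the part that would continue on the next page.
    IRenderer fitsOnPage = hwResult.getSplitRenderer();
    IRenderer movesToNextPage = hwResult.getOverflowRenderer();
    Rectangle usedArea = hwResult.getOccupiedArea().getBBox();
    System.out.println("Height used in this area: " + usedArea.getHeight());
} else if (hwResult.getStatus() == LayoutResult.FULL) {
    // Everything fits into the available width and height.
}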
I have a couple of huge images which can't be loaded into memory in whole. I know that the images are tiled, and all the methods in the class ImageReader give me plausible non-zero return values for
getTileGridXOffset(int),
getTileGridYOffset(int),
getTileWidth(int) and
getTileHeight(int).
My problem now is that I want to read just one tile, using the ImageReader.readTile(int, int, int) method, to avoid having to load the entire image into memory. But how do I determine what the valid values for the tile coordinates are?
There are the methods getNumXTiles() and getNumYTiles() in the interface RenderedImage, but all attempts to create a RenderedImage from the source result in an out of memory/Java heap space error.
The tile coordinates can theoretically be anything, and I tried readTile(0, -1, -1), which also works for a few images I tested.
I also tried to read the metadata for those images, but I didn't find any useful information regarding the image layout.
Is there anyone who can tell me how to get the values for the tile coordinates without having to read the entire image into memory? Is there another way which does not require an instance of ImageLayout?
Thank you very much for your assistance.
First of all, you should check that the ImageReader in question supports tiling for the given image, using isImageTiled(imageIndex). If it doesn't, you can't expect useful values from the other methods.
If it does, all tiles for a given image must be equal in size (but the tiles in the last column/row may be truncated). This is the case for all tiled file formats that I know of (e.g. TIFF). Using this knowledge, the number of tiles in both dimensions can be calculated:
// Calculate number of x tiles/y tiles:
int cols = (int) Math.ceil(reader.getWidth(imageIndex) / (double) reader.getTileWidth(imageIndex));
int rows = (int) Math.ceil(reader.getHeight(imageIndex) / (double) reader.getTileHeight(imageIndex));
You can then, loop over the tile indexes (the first tile is always 0,0):
for (int row = 0; row < rows; row++) {
    for (int col = 0; col < cols; col++) {
        BufferedImage tile = reader.readTile(imageIndex, col, row);
        // ...do more processing...
    }
}
Or, if you only want to get a single tile, you obviously don't need the double for loops. :-)
Note: For ImageReaders/images that don't support tiling, the getTileWidth and getTileHeight methods will just return the same as getWidth and getHeight, respectively.
Also, the readTile API docs says:
If the arguments are out of range, an IllegalArgumentException is thrown. If the image is not tiled, the values 0, 0 will return the entire image; any other values will cause an IllegalArgumentException to be thrown.
This means your example, readTile(0, -1, -1), should always throw an IllegalArgumentException regardless of the tiling... I suspect some implementations may disregard the tile coordinates completely and give you the entire image anyway.
PS: The RenderedImage interface could in theory help you. But it would require a special implementation in the ImageReader. In most cases you will just get a normal BufferedImage (which implements RenderedImage), and is a single (1x1) tile.
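For completeness, a minimal sketch (file name and image index are assumptions) of obtaining a reader and reading a single tile without decoding the whole image:

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

// Sketch only: read one tile of a (possibly huge) tiled image.
static BufferedImage readOneTile(File file, int imageIndex, int col, int row) throws IOException {
    try (ImageInputStream input = ImageIO.createImageInputStream(file)) {
        Iterator<ImageReader> readers = ImageIO.getImageReaders(input);
        if (!readers.hasNext()) {
            throw new IOException("No ImageReader found for " + file);
        }
        ImageReader reader = readers.next();
        try {
            reader.setInput(input, true, true); // seek forward only, ignore metadata
            if (!reader.isImageTiled(imageIndex)) {
                throw new IOException("Image " + imageIndex + " is not tiled");
            }
            return reader.readTile(imageIndex, col, row); // tile indexes are 0-based
        } finally {
            reader.dispose();
        }
    }
}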
I need to read a plan exported by AutoCAD to PDF and place some markers with text on it with PDFBox.
Everything works fine, except the calculation of the width of the text, which is written next to the markers.
I skimmed through the whole PDF specification and read in detail the parts that deal with graphics and text, but to no avail. As far as I understand, the glyph coordinate space is set up at 1/1000 of the user coordinate space. Hence the width needs to be scaled by a factor of 1000, but it's still a fraction of the real width.
This is what I am doing to position the text:
float textWidth = font.getStringWidth(marker.id) * 0.043f;
contentStream.beginText();
contentStream.setTextScaling(1, 1, 0, 0);
contentStream.moveTextPositionByAmount(
        marker.endX + marker.getXTextOffset(textWidth, fontPadding),
        marker.endY + marker.getYTextOffset(fontSize, fontPadding));
contentStream.drawString(marker.id);
contentStream.endText();
The * 0.043f works as an approximation for one document, but fails for the next.
Do I need to reset any other transformation matrix except the text matrix?
EDIT: A full example project is on GitHub with tests and example PDFs: https://github.com/ascheucher/pdf-stamp-prototype
Thanks for your help!
Unfortunately the question and comments merely include (by running the sample project) the actual result for two source documents and the description
The annotating text should be center aligned on the top and bottom marker, aligned to the left on the right marker and aligned to the right on the left marker. The alignment is not working for me, as font.getStringWidth(..) returns only a fraction of what it should be. And the discrepancy seems to be different in the two PDFs.
but not a concrete sample discrepancy to repair.
There are several issues in the code, though, which may lead to such observations (and other ones, too!). Fixing them should be done first; this may already resolve the issues observed by the OP.
Which box to take
The code of the OP derives several values from the media box:
PDRectangle pageSize = page.findMediaBox();
float pageWidth = pageSize.getWidth();
float pageHeight = pageSize.getHeight();
float lineWidth = Math.max(pageWidth, pageHeight) / 1000;
float markerRadius = lineWidth * 10;
float fontSize = Math.min(pageWidth, pageHeight) / 20;
float fontPadding = Math.max(pageWidth, pageHeight) / 100;
These seem to be chosen to be optically pleasing in relation to the page size. But the media box is not, in general, the final displayed or printed page size, the crop box is. Thus, it should be
PDRectangle pageSize = page.findCropBox();
(Actually the trim box, the intended dimensions of the finished page after trimming, might even be more apropos; the trim box defaults to the crop box. For details read here.)
This is not relevant for the given sample documents as they do not contain explicit crop box definitions, so the crop box defaults to the media box. It might be relevant for other documents, though, e.g. those the OP could not include.
Which PDPageContentStream constructor to use
The code of the OP adds a content stream to the page at hand using this constructor:
PDPageContentStream contentStream = new PDPageContentStream(doc, page, true, true);
This constructor appends (first true) and compresses (second true) but unfortunately it continues in the graphics state left behind by the pre-existing content.
Details of the graphics state of importance for the observations at hand:
Transformation matrix - it may have been changed to scale (or rotate, skew, move ...) any new content added
Character spacing - it may have been changed to put any new characters added nearer to or farther from each other
Word spacing - it may have been changed to put any new words added nearer to or farther from each other
Horizontal scaling - it may have been changed to scale any new characters added
Text rise - it may have been changed to displace any new characters added vertically
Thus, a constructor should be chosen which also resets the graphics state:
PDPageContentStream contentStream = new PDPageContentStream(doc, page, true, true, true);
The third true tells PDFBox to reset the graphics state, i.e. to surround the former content with a save-state/restore-state operator pair.
This is relevant for the given sample documents, at least the transformation matrix is changed.
Setting and using the CalRGB color space
The OP's code sets the stroking and non-stroking color spaces to a calibrated color space:
contentStream.setStrokingColorSpace(new PDCalRGB());
contentStream.setNonStrokingColorSpace(new PDCalRGB());
Unfortunately new PDCalRGB() does not create a valid CalRGB color space object, its required WhitePoint value is missing. Thus, before selecting a calibrated color space, initialize it properly.
Thereafter the OP's code sets the colors using
contentStream.setStrokingColor(marker.color.r, marker.color.g, marker.color.b);
contentStream.setNonStrokingColor(marker.color.r, marker.color.g, marker.color.b);
These (int, int, int) overloads unfortunately use the RG and rg operators implicitly selecting the DeviceRGB color space. To not overwrite the current color space, use the (float[]) overloads with normalized (0..1) values instead.
While this is not relevant for the observed issue, it causes error messages by PDF viewers.
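For illustration, a sketch of the float[] variant (assuming marker.color holds 0-255 integer components, as in the OP's code):

// Sketch only: the float[] overloads keep the previously selected color space.
float[] rgb = new float[] {
        marker.color.r / 255f,
        marker.color.g / 255f,
        marker.color.b / 255f
};
contentStream.setStrokingColor(rgb);
contentStream.setNonStrokingColor(rgb);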
Calculating the width of a drawn string
The OP's code calculates the width of a drawn string using
float textWidth = font.getStringWidth(marker.id) * 0.043f;
and the OP is surprised
The * 0.043f works as an approximation for one document, but fails for the next.
There are two factors building this "magic" number:
As the OP has remarked, the glyph coordinate space is set up at 1/1000 of the user coordinate space, and the returned number is in glyph space, thus a factor of 0.001.
What the OP has ignored: he wants the width of the string at the font size he selected. But the font object has no knowledge of the current font size and returns the width for a font size of 1. As the OP selects the font size dynamically as Math.min(pageWidth, pageHeight) / 20, this factor varies; in the case of the two given sample documents it is about 42, but it will probably be totally different in other documents.
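Putting both factors together, a sketch of the corrected width calculation (reusing the OP's variable names) would be:

// Sketch only: glyph-space width (1/1000 units) scaled by the actual font size.
float textWidth = font.getStringWidth(marker.id) / 1000f * fontSize;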
Positioning text
The OP's code positions the text like this starting from identity text matrices:
contentStream.moveTextPositionByAmount(
        marker.endX + marker.getXTextOffset(textWidth, fontPadding),
        marker.endY + marker.getYTextOffset(fontSize, fontPadding));
using methods getXTextOffset and getYTextOffset:
public float getXTextOffset(float textWidth, float fontPadding) {
    if (getLocation() == Location.TOP)
        return (textWidth / 2 + fontPadding) * -1;
    else if (getLocation() == Location.BOTTOM)
        return (textWidth / 2 + fontPadding) * -1;
    else if (getLocation() == Location.RIGHT)
        return 0 + fontPadding;
    else
        return (textWidth + fontPadding) * -1;
}

public float getYTextOffset(float fontSize, float fontPadding) {
    if (getLocation() == Location.TOP)
        return 0 + fontPadding;
    else if (getLocation() == Location.BOTTOM)
        return (fontSize + fontPadding) * -1f;
    else
        return fontSize / 2 * -1;
}
In case of getXTextOffset I doubt that adding fontPadding for Location.TOP and Location.BOTTOM makes sense, especially in the light of the OP's desire
The annotating text should be center aligned on the top and bottom marker
For the text to be centered it should not be shifted off-center.
The case of getYTextOffset is more difficult. The OP's code is built upon two misunderstandings: It assumes
that the text position selected by moveTextPositionByAmount is the lower left, and
that the font size is the character height.
Actually the text position is on the baseline; the glyph origin of the next drawn glyph will be positioned there.
Thus, the y position either has to be corrected to take the descent into account (for centering on the whole glyph height) or to only use the ascent (for centering on the above-baseline glyph height).
And a font size does not denote the actual character height but is arranged so that the nominal height of tightly spaced lines of text is 1 unit for font size 1. "Tightly spaced" implies that some small amount of additional inter-line space is contained in the font size.
In essence for centering vertically one has to decide what to center on, whole height or above-baseline height, first letter only, whole label, or all font glyphs. PDFBox does not readily supply the necessary information for all cases but methods like PDFont.getFontBoundingBox() should help.
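As one possible sketch (my illustration, not the OP's code), centering on the above-baseline height could use the font descriptor's ascent:

// Sketch only: center the label vertically on its above-baseline height.
// getAscent() is in glyph space (1/1000 units), hence the division by 1000.
float ascent = font.getFontDescriptor().getAscent() / 1000f * fontSize;
float yTextOffset = -ascent / 2; // shift the baseline down by half the ascent

Whether to also include the descent (centering on the whole glyph height) is the design decision mentioned above.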
I have to prepare a training set for my Machine Learning course, in which for a given face image it gives you an answer representing the direction of the head (straight, left, right, up).
For this purpose I need to read a .pgm image file in Java and store its pixels in one row of a matrix X, and then store the appropriate answer for this image in a y vector. Finally I will save these two arrays in a .mat file.
The problem is that when I try to read the pixel values from a (P2 .pgm) image and print them to the console, they don't match the values shown in the MATLAB matrix viewer. What could be the problem?
This is my code:
try {
    InputStream f = Main.class.getResourceAsStream("an2i_left_angry_open.pgm");
    BufferedReader d = new BufferedReader(new InputStreamReader(f));
    String magic = d.readLine();   // first line contains P2 or P5
    String line = d.readLine();    // second line contains width and height
    while (line.startsWith("#")) { // ignoring comment lines
        line = d.readLine();
    }
    Scanner s = new Scanner(line);
    int width = s.nextInt();
    int height = s.nextInt();
    line = d.readLine();           // third line contains maxVal
    s = new Scanner(line);
    int maxVal = s.nextInt();
    for (int i = 0; i < 30; i++) { // printing first 30 values from the image, including spaces
        System.out.println((byte) d.read());
    }
} catch (EOFException eof) {
    eof.printStackTrace(System.out);
}
These are the values I get:
50
49
32
50
32
49
32
48
32
50
32
49
56
32
53
57
while this is what is actually in the image, as shown by the MATLAB viewer:
(sorry, I can't post images because of lack of reputation)
and this is what you find when you open the .pgm file in Notepad++:
Take a look at this post in particular. I've experienced similar issues with imread and with Java's ImageIO class, and for the longest time I could not find this link as proof that other people have experienced the same thing... until now. Similarly, someone experienced related issues in this post, but it isn't quite the same as what you're experiencing.
Essentially, the reason why images loaded in both Java and MATLAB are different is due to enhancement purposes. MATLAB scales the intensities so the image isn't mostly black. Essentially, the maximum intensity in your PGM gets scaled to 255 while the other intensities are linearly scaled to suit the dynamic range of [0,255]. So for example, if your image had a dynamic range from [0-100] in your PGM file before loading it in with imread, this would get scaled to [0-255] and not be the original scale of [0-100]. As such, you would have to know the maximum intensity value of the image before you loaded it in (by scanning through the file yourself). That is very easily done by reading the third line of the file. In your case, this would be 156. Once you find this, you would need to scale every value in your image so that it is rescaled to what it originally was before you read it in.
To confirm that this is the case, take a look at the first pixel in your image, which has intensity 21 in the original PGM file. MATLAB would thus scale the intensities such that:
scaled = round(val*(255/156));
val would be the input intensity and scaled is the output intensity. As such, if val = 21, then scaled would be:
scaled = round(21*(255/156)) = 34
This matches up with the first pixel when reading it out in MATLAB. Similarly, the sixth pixel in the first row, the original value is 18. MATLAB would scale it such that:
scaled = round(18*(255/156)) = 29
This again matches up with what you see in MATLAB. Starting to see the pattern now? Basically, to undo the scaling, you would need to multiply by the reciprocal of the scaling factor. As such, given that A is the image you loaded in, you need to do:
A_scaled = uint8(double(A)*(max_value/255));
A_scaled is the output image and max_value is the maximum intensity found in your PGM file before you loaded it in with imread. This undoes the scaling, as MATLAB scales the images from [0-255]. Note that I need to cast the image to double first, do the multiplication with the scaling factor as this will most likely produce floating point values, then re-cast back to uint8. Therefore, to bring it back to [0-max_value], you would have to scale in the opposite way.
Specifically in your case, you would need to do:
A_scaled = uint8(double(A)*(156/255));
The disadvantage here is that you need to know what the maximum value is prior to working with your image, which can get annoying. One possibility is to use MATLAB and actually open up the file with file pointers and get the value of the third line yourself. This is also an annoying step, but I have an alternative for you.
Alternative... probably better for you
Alternatively, here are two links to functions written in MATLAB that read and write PGM files without doing that unnecessary scaling, and they'll provide the results that you are expecting (unscaled).
Reading: http://people.sc.fsu.edu/~jburkardt/m_src/pgma_io/pgma_read.m.
Writing: http://people.sc.fsu.edu/~jburkardt/m_src/pgma_io/pgma_write.m
How the read function works is that it opens up the image using file pointers and manually parses in the data and stores the values into a matrix. You probably want to use this function instead of relying on imread. To save the images, file pointers are again used and the values are written such that the PGM standard is maintained and again, your intensities are unscaled.
Your Java implementation is printing the ASCII codes of the text characters "21 2 1" etc.:
50->2
49->1
32->SPACE
50->2
32->SPACE
49->1
etc.
Some PGM files use a text header but a binary representation for the pixels themselves. These are marked with a different magic string at the beginning. It looks like the Java code is reading the file as if it had binary pixels.
Instead, your PGM file has ASCII-coded pixels, where you want to scan a whitespace-separated value for each pixel. You do this the same way you read the width and height.
The debug code might look like this:
line = d.readLine(); // first image line
s = new Scanner(line);
for (int i = 0; i < 30; i++) { // printing the first 30 pixel values from the image
    System.out.println(s.nextInt());
}
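To go further and fill one row of the matrix X, a minimal sketch (assuming the header was parsed as in the OP's code above) could read every remaining value the same way:

// Sketch only: read all width*height ASCII pixel values into one row vector.
int[] pixels = new int[width * height];
Scanner pixelScanner = new Scanner(d); // scans the remaining whitespace-separated values
for (int i = 0; i < pixels.length; i++) {
    pixels[i] = pixelScanner.nextInt();
}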
I have a block of text I'm trying to interpret in java (or with grep/awk/etc) looking like the following:
Somewhat differently, plaques of the rN8 and rN9 mutants and human coronavirus OC43 as well as the more divergent
were of fully wild-type size, indicating that the suppressor mu- SARS-CoV, human coronavirus HKU1, and bat coronaviruses
tations, in isolation, were not noticeably deleterious to the HKU4, HKU5, and HKU9 (Fig. 6B). Thus, not only do mem-
--
able effect on the viral phenotype. A potentially related obser- sented for the existence of an interaction between nsp9
vation is that the mutation A2U, which is also neutral by itself, nsp8 (56). A hexadecameric complex of SARS-CoV nsp8 and
is lethal in combination with the AACAAG insertion (data not nsp7 has been found to bind to double-stranded RNA. The
And what I'd like to do is split it into two parts: left and right. I'm having trouble coming up with a regex or any other method that would split a block of text that is obviously visually divided into two columns, but not obviously so to a programming language. The lengths of the lines are variable.
I've considered looking for the first block and then finding the second by looking for multiple spaces, but I'm not sure that that's a robust solution. Any ideas, snippets, pseudo code, links, etc?
Text Source
The text has been run through pdftotext as follows: pdftotext -layout MyPdf.pdf
Blur the text and come up with an array of the character density per column of text. Then look for gaps and split there.
// "Blur" the text: replace single spaces between non-space characters with dots
// so that only real column gaps remain as runs of spaces.
String blurredText = text.replaceAll("(?<=\\S) (?=\\S)", ".");
String[] blurredLines = blurredText.split("\r\n?|\n");

int maxRowLength = 0;
for (String blurredLine : blurredLines) {
    maxRowLength = Math.max(maxRowLength, blurredLine.length());
}

// Character density per column.
int[] columnCounts = new int[maxRowLength];
for (String blurredLine : blurredLines) {
    for (int i = 0, n = blurredLine.length(); i < n; ++i) {
        if (blurredLine.charAt(i) != ' ') { ++columnCounts[i]; }
    }
}

// Look for runs of zero of at least length 3.
// Alternatively, you might look for the n longest runs of zeros.
// Alternatively, you might look for runs of length min(columnCounts) to ignore
// horizontal rules.
int minBreakLen = 3; // A tuning parameter.
List<Integer> breaks = new ArrayList<Integer>();
for (int i = 0; i < maxRowLength - minBreakLen; ++i) {
    if (columnCounts[i] != 0) { continue; }
    int runLength = 1;
    while (i + runLength < maxRowLength && 0 == columnCounts[i + runLength]) {
        ++runLength;
    }
    if (runLength >= minBreakLen) {
        breaks.add(i);
    }
    i += runLength - 1;
}
System.out.println(breaks);
I doubt there is any robust solution to this. I would go for some sort of heuristic approach.
Off the top of my head, I would calculate a histogram of the column index of the first character of each word, and split on the column with the highest score (the idea being to find lots of words that are all aligned horizontally). I might also choose to weight this based on the number of preceding spaces.
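A minimal sketch of that heuristic (my own illustration, untested against the OP's data):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: histogram of word-start columns; the most frequent column is a
// candidate for where the right-hand column begins.
static int guessSplitColumn(String[] lines) {
    int maxLen = 0;
    for (String line : lines) {
        maxLen = Math.max(maxLen, line.length());
    }
    int[] histogram = new int[maxLen + 1];
    Pattern wordStart = Pattern.compile("(?<= {2})\\S"); // word preceded by 2+ spaces
    for (String line : lines) {
        Matcher m = wordStart.matcher(line);
        while (m.find()) {
            histogram[m.start()]++;
        }
    }
    int best = 0;
    for (int col = 1; col < histogram.length; col++) {
        if (histogram[col] > histogram[best]) {
            best = col;
        }
    }
    return best; // column index where the right-hand block most often starts
}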
I work in this general area. I am surprised that a double-column bioscience text of recent times (SARS, etc.) would be rendered in double-column monospace as the original - it would be typeset in proportional font or in HTML. So I suspect your text came from some other format (such as PDF). If so then you should try to get that format. PDF is horrible to parse, but PDF flattened to monospace is probably worse.
If you possibly can, find someone who has worked in the area and see what they have done. If you have multiple documents (e.g. from different journals or reports) then your problem is worse. Yes, I could write an algorithm to solve the example you have posted, but my guess is it will break on the next set of documents. You will end up customising this for each different source (I and others have had to do this).
UPDATE: Thanks. As it's PDF, I would start by asking around. We collaborate with the group at Penn State (who have also done CiteSeer). I also have colleagues at Cambridge who have spent months on a PDF reader.
If you want to do it yourself - and it will take time - then I'd start with PDFBox. I've done quite a lot with this and I think it's better for this than pdf2text or pdftotext. I can't remember whether it has a double-column option - I think so.
UPDATE Here is a recent answer of several ways of tackling double-column PDF
http://metaoptimize.com/qa/questions/3943/methods-for-extracting-two-column-text-from-a-pdf
I'd certainly see what other people have done.
FWIW I spend a lot of time trying to convince people that scientists should not create their output in PDF because it destroys machine parsing - as you and I have found
UPDATE: You get the PDFs from your PI (== Principal Investigator?). In which case you'll get lots of different sources, which makes it worse.
What is the real problem you are trying to solve? I may be able to help