Android Studio ignores some breakpoints, while others work - java

In Android Studio I would like to inspect my variables using breakpoints, but some breakpoints are simply ignored while others work. I have no idea why.
I am especially interested in the variables of the following method.
Do you have any idea why the debugger ignores the breakpoints in it?
private Entry readEntry(XmlPullParser parser) throws XmlPullParserException, IOException {
parser.require(XmlPullParser.START_TAG, ns, "searchresults");
String longitude = null;
String latitude = null;
String place_id = null;
while (parser.next() != XmlPullParser.END_TAG) {
if (parser.getEventType() != XmlPullParser.START_TAG) {
continue;
}
String name = parser.getName();
if (name.equals("place")) {
longitude = readLongitude(parser);
latitude = readLatitude(parser);
place_id = readPlace_id(parser);
} else {
skip(parser);
}
}
return new Entry(longitude, latitude, place_id);
}
As you might have guessed, I am trying to parse an XML document in an AsyncTask, and this is one step along the way. I am trying to find out what I did wrong while parsing this XML document:
<searchresults timestamp="Fri, 26 Aug 16 10:38:53 +0000" attribution="Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright" querystring="warmbandwerk" polygon="false" exclude_place_ids="129799074,151481629" more_url="http://open.mapquestapi.com/nominatim/v1/search?format=xml&exclude_place_ids=129799074,151481629&accept-language=de,en-US;q=0.7,en;q=0.3&q=warmbandwerk">
<place place_id="129799074" osm_type="way" osm_id="220625788" place_rank="30" boundingbox="51.4947849,51.497683,6.7382846,6.746241" lat="51.49646145" lon="6.74226302912559" display_name="Warmbandwerk 1, Kaiser-Wilhelm-Straße, ThyssenKrupp Steel Europe AG, Hamborn, Duisburg, Regierungsbezirk Düsseldorf, Nordrhein-Westfalen, 47166, Deutschland" class="building" type="yes" importance="0.101"/>
<place place_id="151481629" osm_type="relation" osm_id="2917388" place_rank="30" boundingbox="51.4860521,51.4891516,6.7050164,6.7144352" lat="51.48741265" lon="6.70882226405504" display_name="Warmbandwerk 2, Werkstraße 1, ThyssenKrupp Steel Europe AG, Beeckerwerth, Meiderich/Beeck, Duisburg, Regierungsbezirk Düsseldorf, Nordrhein-Westfalen, 47139, Deutschland" class="building" type="yes" importance="0.101"/>
</searchresults>
Edit: Someone asked which breakpoints do not work. In the method above, none of them work. In the method below, all of them work up to the for statement; the breakpoints on the for statement and on the return are ignored.
private String loadXmlFromNetwork(String urlString) throws XmlPullParserException, IOException {
InputStream stream = null;
// Instantiate the parser
XMLParser XMLParser = new XMLParser();
List<XMLParser.Entry> entries = null;
StringBuilder htmlString = new StringBuilder();
htmlString.append("<h3>" + getResources().getString(R.string.page_title) + "</h3>");
htmlString.append("<em>" + getResources().getString(R.string.updated));
try {
stream = downloadUrl(urlString);
entries = XMLParser.parse(stream);
// Makes sure that the InputStream is closed after the app is
// finished using it.
} finally {
if (stream != null) {
stream.close();
}
}
// StackOverflowXmlParser returns a List (called "entries") of Entry objects.
// Each Entry object represents a single post in the XML feed.
// This section processes the entries list to combine each entry with HTML markup.
// Each entry is displayed in the UI as a link that optionally includes
// a text summary.
for (XMLParser.Entry entry : entries) {
htmlString.append("<p><a href='");
htmlString.append(entry.longitude);
htmlString.append(entry.latitude);
htmlString.append(entry.place_id);
}
return htmlString.toString();
}

Related

How to remove Images from PDF File?

Hello, and thank you for answering my question. This problem has perplexed me for a long time.
I have searched for a solution for a long time and read many articles on Stack Overflow and Google, but those articles are outdated or fragmented, so I have to ask for help.
I hope someone can help me, please.
public class TEST04 {
public static void main(String[] args) throws IOException {
System.out.println("Hi");
//ori pdf file
String oriPDFFile = IFileUtils.getDesktopPath().getAbsoluteFile() + "\\1.pdf";
//out pdf file
String outPDFFile = IFileUtils.getDesktopPath().getAbsoluteFile() + "\\2.pdf";
strip(oriPDFFile, outPDFFile);
}
//parse
public static void strip(String pdfFile, String pdfFileOut) throws IOException {
//load ori pdf file
PDDocument document = PDDocument.load(new File(pdfFile));
//get All pages
List<PDPage> pageList = IterUtil.toList(document.getDocumentCatalog().getPages());
for (int i = 0; i < pageList.size(); i++) {
PDPage page = pageList.get(i);
COSDictionary newDictionary = new COSDictionary(page.getCOSObject());
PDFStreamParser parser = new PDFStreamParser(page);
List tokens = parser.getTokens();
List newTokens = new ArrayList();
for (int j = 0; j < tokens.size(); j++) {
Object token = tokens.get(j);
if (token instanceof Operator) {
Operator operator = (Operator) token;
if (operator.getName().equals("Do")) {
COSName cosName = (COSName) newTokens.remove(newTokens.size() - 1);
deleteObject(newDictionary, cosName);
continue;
}
}
newTokens.add(token);
}
PDStream newContents = new PDStream(document);
try (OutputStream outputStream = newContents.createOutputStream()) {
ContentStreamWriter writer = new ContentStreamWriter(outputStream);
writer.writeTokens(newTokens);
}
page.setContents(newContents);
// ContentStreamWriter writer = new ContentStreamWriter(newContents.createOutputStream());
// writer.writeTokens( newTokens );
// page.setContents(newContents);
PDResources newResources = new PDResources(newDictionary);
page.setResources(newResources);
}
document.save(pdfFileOut);
document.close();
}
//delete
public static boolean deleteObject(COSDictionary d, COSName name) {
for(COSName key : d.keySet()) {
if( name.equals(key) ) {
d.removeItem(key);
return true;
}
COSBase object = d.getDictionaryObject(key);
if(object instanceof COSDictionary) {
if( deleteObject((COSDictionary)object, name) ) {
return true;
}
}
}
return false;
}
}
The stack trace:
It works the same way as in the example RemoveAllText.java, just with a different operator.
Use the code from that example, but use "Do" instead of "Tj".
Of course, if you need to load metadata etc., you should enumerate and check the images through the page resources (like in my example).
Following the tip in Ali Yavari's answer you created a test class. Unfortunately that test code produced an exception. This answer focuses on fixing your code.
According to the stack trace you posted an image of, the exception occurred while saving the document; some stream was asked to provide an InputStream and it failed with the message "Cannot read while there is an open stream writer".
So, let's have a look where your code opens a stream writer but does not close it again:
PDStream newContents = new PDStream(document);
ContentStreamWriter writer = new ContentStreamWriter(newContents.createOutputStream());
writer.writeTokens( newTokens );
page.setContents(newContents);
Indeed, here you ask a stream (the PDStream newContents) for something to write to (newContents.createOutputStream()) but don't close it.
You can do that like this:
PDStream newContents = new PDStream(document);
try (OutputStream outputStream = newContents.createOutputStream()) {
ContentStreamWriter writer = new ContentStreamWriter(outputStream);
writer.writeTokens(newTokens);
}
page.setContents(newContents);
A side note: you will have to rewrite what you do with the newDictionary object. Currently you
1. initialize it with the page dictionary entries,
2. recursively remove all entries with a key that is a name of an image you remove, and
3. set the page resources to this dictionary.
Item 2 can delete much more than you actually want; the same name in a different dictionary may refer to an entry with a completely different meaning. Furthermore, you recurse without further checks; if there is a circular relation among the dictionaries, this may result in an infinite recursion, i.e. a StackOverflowError.
Item 3 sets this manipulated page clone inappropriately as the resources of the original page. This creates a completely broken page structure.
Instead you should retrieve the resources from the page (resources = page.getResources()) and remove the images by putting them to null (resources.put(cosName, (PDXObject)null)).
In my other answer I focused on advice on how to fix the code in the question. Here I focus on a different approach to the task.
In your code you try to remove the bitmap images by inspecting the page content streams, finding Do operations therein drawing XObjects, and removing both this instruction and the referenced XObject.
It is a bit easier to instead simply replace all image XObjects in the resources by an empty form XObject. This is the approach used here.
As that approach is very easy to implement, I extended it to not only go through the immediate resources of the pages but also iterate into embedded form XObjects and patterns.
void replaceBitmapImagesResources(PDDocument document) throws IOException {
PDFormXObject pdFormXObject = new PDFormXObject(document);
pdFormXObject.setBBox(new PDRectangle(1, 1));
for (PDPage pdPage : document.getPages()) {
replaceBitmapImagesResources(pdPage.getResources(), pdFormXObject);
}
}
void replaceBitmapImagesResources(PDResources resources, PDFormXObject formXObject) throws IOException {
if (resources == null)
return;
for (COSName cosName : resources.getPatternNames()) {
PDAbstractPattern pdAbstractPattern = resources.getPattern(cosName);
if (pdAbstractPattern instanceof PDTilingPattern) {
PDTilingPattern pdTilingPattern = (PDTilingPattern) pdAbstractPattern;
replaceBitmapImagesResources(pdTilingPattern.getResources(), formXObject);
}
}
List<COSName> xobjectsToReplace = new ArrayList<>();
for (COSName cosName : resources.getXObjectNames()) {
PDXObject pdxObject = resources.getXObject(cosName);
if (pdxObject instanceof PDImageXObject) {
xobjectsToReplace.add(cosName);
} else if (pdxObject instanceof PDFormXObject) {
PDFormXObject pdFormXObject = (PDFormXObject) pdxObject;
replaceBitmapImagesResources(pdFormXObject.getResources(), formXObject);
}
}
for (COSName cosName : xobjectsToReplace) {
resources.put(cosName, formXObject);
}
}
(RemoveImages helper methods)
To apply this approach to a PDDocument simply call the first replaceBitmapImagesResources with that document as parameter.
Beware: I tried to keep the code simple. For production use, remember to limit the recursion to prevent endless recursion, as in some PDFs XObjects or patterns reference themselves directly or indirectly. You may also want to inspect page annotations and the resources of template pages.
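One common way to limit such recursion is to remember which objects have already been visited and to stop when one is seen again. The following is only a minimal sketch of that idea using plain JDK classes; ResourceNode and CycleSafeWalk are hypothetical stand-ins for illustration, not PDFBox types, and a real implementation would track the underlying resource dictionaries instead.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a resource container (page, form XObject, pattern);
// this is NOT a PDFBox class.
final class ResourceNode {
    final String name;
    final List<ResourceNode> children = new ArrayList<>();

    ResourceNode(String name) {
        this.name = name;
    }
}

final class CycleSafeWalk {
    // Visits every reachable node exactly once, even if nodes reference
    // each other (or themselves) in a cycle.
    static List<String> visitAll(ResourceNode root) {
        // Identity-based set: two distinct objects are never confused,
        // even if they would compare as equal.
        Set<ResourceNode> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        List<String> order = new ArrayList<>();
        walk(root, seen, order);
        return order;
    }

    private static void walk(ResourceNode node, Set<ResourceNode> seen, List<String> order) {
        if (node == null || !seen.add(node)) {
            return; // already processed: stop instead of recursing forever
        }
        order.add(node.name);
        for (ResourceNode child : node.children) {
            walk(child, seen, order);
        }
    }

    public static void main(String[] args) {
        ResourceNode page = new ResourceNode("page");
        ResourceNode form = new ResourceNode("form");
        page.children.add(form);
        form.children.add(page); // indirect self-reference
        System.out.println(visitAll(page));
    }
}
```

Applied to the methods above, the same pattern would mean passing a "seen" set along with each recursive replaceBitmapImagesResources call and returning early when the resources object has already been handled.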

Unsupported data type when getting mail JPG images

I'm trying to get the inline images of a mail, for which I have the following code:
protected void setCidAttachments(Message message, MensajeEmail mensajeEmail) {
try {
MimeMultipart mimeMultipart = (MimeMultipart) message.getDataHandler().getContent();
for (int k = 0; k < mimeMultipart.getCount(); k++) {
MimeBodyPart part = (MimeBodyPart) mimeMultipart.getBodyPart(k);
processPart(part, mensajeEmail);
}
}
catch (Exception e) {
log.error("Error obtendo adxuntos con cid", e);
}
}
private void processPart (BodyPart part, MensajeEmail mensajeEmail) throws MessagingException, IOException {
String type = getContentType(part);
StringBuilder content = new StringBuilder(mensajeEmail.getContenido());
if (isImage(type) && part.getDataHandler() != null && part.getDataHandler().getContent() != null) {
if (part.getDataHandler().getContent() instanceof MimeMultipart) {
MimeMultipart p = (MimeMultipart) part.getDataHandler().getContent();
for (int i = 0; i < p.getCount(); i++) {
BodyPart subpart = p.getBodyPart(i != p.getCount() - 1 ? i + 1 : i);
processPart(subpart, mensajeEmail);
}
} else {
mensajeEmail.setContenido(getInlineImage(part, content));
}
}
}
private String getInlineImage (BodyPart part, StringBuilder content) throws MessagingException, IOException {
Base64 decoder64 = new Base64();
ByteArrayOutputStream bos = new ByteArrayOutputStream();
// Get type
String type = getContentType(part);
// Get Content-ID
String contentId = getContentId(part);
// Replace
if (contentId.length() > 0) {
part.getDataHandler().writeTo(bos);
int start = content.indexOf("src=\"cid:" + contentId + "\"") + 5;
if (start > 4) {
int length = contentId.length() + 4;
content.replace(start, start + length, "data:" + (isImage(type) ? type : "image/png;") + " base64," + decoder64.encodeToString(bos.toByteArray()));
}
}
bos.close();
return content.toString();
}
private String getContentId (BodyPart part) throws MessagingException {
Enumeration headers = part.getAllHeaders();
while (headers.hasMoreElements()) {
Header header = (Header)headers.nextElement();
if (header.getName().equalsIgnoreCase("Content-ID"))
return cleanContentId(header.getValue());
}
return "";
}
private String getContentType (BodyPart part) throws MessagingException {
return part.getContentType().split(" ")[0];
}
private boolean isImage (String mime) {
return !mime.equals("text/html;") && !mime.equals("text/plain;");
}
private String cleanContentId (String contentId) {
if (contentId.charAt(0) == '<') contentId = contentId.substring(1, contentId.length() - 1);
return contentId;
}
This works perfectly fine when I send PNG images (which makes me think my code is indeed correct). However, when I try to send a JPG image, I get the following exception:
javax.activation.UnsupportedDataTypeException: Unknown image type image/jpeg; name=sony-car-796x418.jpg
at org.apache.geronimo.activation.handlers.AbstractImageHandler.getContent(AbstractImageHandler.java:57)
at javax.activation.DataSourceDataContentHandler.getContent(DataHandler.java:795)
at javax.activation.DataHandler.getContent(DataHandler.java:542)
at es.enxenio.fcpw.plinper.daemons.email.AbstractProtocoloObtencionEmail.processPart(AbstractProtocoloObtencionEmail.java:378)
...
Is the framework really not able to work with JPG images? Is there some way I can fix this?
EDIT: Gmail doesn't even let me send JPG images, so it's probably not a very common format for mail images. That makes me think it might not be widely implemented, which could be why Java doesn't seem to be able to work with it.
I found the problem. This line
if (isImage(type) && part.getDataHandler() != null && part.getDataHandler().getContent()
shouldn't check whether the type is an image but whether it is a multipart. Otherwise we could be processing a jpg image as a multipart. For some reason png images don't mind this and that's why it was working. Here are the changed parts of the code:
protected void setCidAttachments(Message message, MensajeEmail mensajeEmail) {
try {
processPart(message, mensajeEmail);
}
catch (Exception e) {
log.error("Error obtendo adxuntos con cid", e);
}
}
private void processPart(Part part, MensajeEmail mensajeEmail) throws MessagingException, IOException {
String type = getContentType(part);
StringBuilder content = new StringBuilder(mensajeEmail.getContenido());
if (isMultipart(type) && part.getDataHandler() != null && part.getDataHandler().getContent() != null && part.getDataHandler().getContent() instanceof MimeMultipart) {
MimeMultipart multipart = (MimeMultipart) part.getDataHandler().getContent();
for (int i = 0; i < multipart.getCount(); i++) {
BodyPart subpart = multipart.getBodyPart(i);
processPart(subpart, mensajeEmail);
}
}
else {
mensajeEmail.setContenido(getInlineImage(part, content));
}
}
private boolean isMultipart(String mime) {
return (Pattern.matches("multipart/.*", mime));
}
I got a similar exception running an app on Eclipse OSGi with Java 11 and with the bundles javax.mail.glassfish 1.4.1 and javax.activation 1.1.0 (I got these two from https://eclipse.org/orbit):
javax.activation.UnsupportedDataTypeException: Unknown image type image/jpeg; name=image001.jpg
at org.apache.geronimo.activation.handlers.AbstractImageHandler.getContent(AbstractImageHandler.java:57)
at javax.activation.DataHandler.getContent(DataHandler.java:147)
at javax.mail.internet.MimeBodyPart.getContent(MimeBodyPart.java:652)
at my.code.calling.getcontent.MyClass(MyClass.java:802)
The package org.apache.geronimo.activation.handlers is included in the javax.activation 1.1.0 bundle.
I resolved the problem by #-commenting the image/gif, image/jpeg handlers in the file META-INF/mailcap inside the javax.activation bundle:
## <apache license disclaimer> http://www.apache.org/licenses/LICENSE-2.0
##
## $Rev$ $Date: 2008/04/09 19:25:23 $
##
text/plain;; x-java-content-handler=org.apache.geronimo.activation.handlers.TextPlainHandler
text/html;; x-java-content-handler=org.apache.geronimo.activation.handlers.TextHtmlHandler
text/xml;; x-java-content-handler=org.apache.geronimo.activation.handlers.TextXmlHandler
#image/gif;; x-java-content-handler=org.apache.geronimo.activation.handlers.ImageGifHandler
#image/jpeg;; x-java-content-handler=org.apache.geronimo.activation.handlers.ImageJpegHandler
multipart/*;; x-java-content-handler=org.apache.geronimo.activation.handlers.MultipartHandler
There's no image/png here, that's why pngs are not a problem in the first place.
So by commenting gif and jpeg handlers, attachments of these types are now handled like pngs: getContent() will just yield an InputStream, instead of an AWT Image, which I think those geronimo ImageHandlers would produce if everything worked as intended.
Some internals: Geronimo AbstractImageHandler of javax.activation 1.1.0 tries to determine the type of image from javax.mail.glassfish 1.4.1 method IMAPBodyPart.getContent(), but this returns the mime-type incl. parameters, e.g. "image/jpeg; name=sony-car-796x418.jpg", which isn't understood and ultimately leads to the UnsupportedDataTypeException.
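To illustrate the point about parameters: a handler that compares the full header value against "image/jpeg" will not match "image/jpeg; name=...". Below is a minimal sketch, using only the JDK, of reducing a Content-Type value to its bare type before comparing it. ContentTypes, baseType and isMultipart are hypothetical helper names introduced here for illustration; they are not part of JavaMail or Geronimo. JavaMail's own Part.isMimeType(String) performs a comparable check that ignores parameters.

```java
import java.util.Locale;

final class ContentTypes {
    // Returns the bare "type/subtype" part of a Content-Type value,
    // dropping parameters such as "; name=..." and surrounding whitespace.
    static String baseType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // True for any multipart/* type, regardless of parameters like boundary=...
    static boolean isMultipart(String contentType) {
        return baseType(contentType).startsWith("multipart/");
    }

    public static void main(String[] args) {
        System.out.println(baseType("image/jpeg; name=sony-car-796x418.jpg"));
        System.out.println(isMultipart("multipart/related; boundary=\"abc\""));
    }
}
```

Compared with splitting on a space, as getContentType in the question does, this also avoids carrying the trailing semicolon along in values like "text/html;".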
javax.mail.glassfish also has an META-INF/mailcap file, whose image/* section interestingly looks like this:
# can't support image types because java.awt.Toolkit doesn't work on servers
#
#image/gif;; x-java-content-handler=com.sun.mail.handlers.image_gif
#image/jpeg;; x-java-content-handler=com.sun.mail.handlers.image_jpeg
That could be a lead; however, I still got the original JPEG exception in a GUI application as well.
Another thing: this error doesn't occur for me when running the same setup with Java 8 instead of 11, which probably has something to do with Java 8 shipping its own javax.activation package.

Using PDFBox to remove Optional Content Groups that are not enabled

I'm using Apache PDFBox from Java, and I have a source PDF with multiple optional content groups. What I want to do is export a version of the PDF that includes only the standard content and the optional content groups that were enabled. It is important for my purposes that I preserve any dynamic aspects of the original, so text fields are still text fields, vector images are still vector images, etc. This is required because I ultimately intend to use a PDF form editor that does not know how to handle optional content and would blindly render all of it, so I want to preprocess the source PDF and run the form editor on a less cluttered destination PDF.
I've been trying to find something that could give me any hints on how to do this with google, but to no avail. I don't know if I'm just using the wrong search terms, or if this is just something that is outside of what the PDFBox API was designed for. I rather hope it's not the latter. The info shown here does not seem to work (converting the C# code to java), because despite the pdf I'm trying to import having optional content, there does not seem to be any OC resources when I examine the tokens on each page.
for(PDPage page:pages) {
PDResources resources = page.getResources();
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
Collection tokens = parser.getTokens();
...
}
I'm truly sorry for not having any more code to show what I've tried so far, but I've just been poring over the java API docs for about 8 hours now trying to figure out what I might need to do this, and just haven't been able to figure it out.
What I DO know how to do is add text, lines, and images to a new PDPage, but I do not know how to retrieve that information from a given source page to copy it over, nor how to tell which optional content group such information is part of (if any). I am also not sure how to copy form fields in the source pdf over to the destination, nor how to copy the font information over.
Honestly, if there's a web page out there that I wasn't able to find with google with the searches that I tried, I'd be entirely happy to read up more about it, but I am really quite stuck here, and I don't know anyone personally that knows about this library.
Please help.
EDIT:
Trying what I understand from what was suggested below, I've written a loop to examine each XObject on the page as follows:
PDResources resources = pdPage.getResources();
Iterable<COSName> names = resources.getXObjectNames();
for(COSName name:names) {
PDXObject xobj = resources.getXObject(name);
PDFStreamParser parser = new PDFStreamParser(xobj.getStream().toByteArray());
parser.parse();
Object [] tokens = parser.getTokens().toArray();
for(int i = 0;i<tokens.length-1;i++) {
Object obj = tokens[i];
if (obj instanceof COSName && obj.equals(COSName.OC)) {
i++;
obj = tokens[i]; // reuse the variable; redeclaring it here would not compile
if (obj instanceof COSName) {
PDPropertyList props = resources.getProperties((COSName)obj);
if (props != null) {
...
However, after an OC key, the next entry in the tokens array is always an Operator tagged as "BMC". Nowhere am I finding any info that I can recognize from the named optional content groups.
Here's a robust solution for removing marked content blocks (open to feedback if anyone finds anything that isn't working right). You should be able to adjust for OC blocks...
This code properly handles nesting and removal of resources (xobject, graphics state and fonts - easy to add others if needed).
public class MarkedContentRemover {
private final MarkedContentMatcher matcher;
/**
*
*/
public MarkedContentRemover(MarkedContentMatcher matcher) {
this.matcher = matcher;
}
public int removeMarkedContent(PDDocument doc, PDPage page) throws IOException {
ResourceSuppressionTracker resourceSuppressionTracker = new ResourceSuppressionTracker();
PDResources pdResources = page.getResources();
PDFStreamParser pdParser = new PDFStreamParser(page);
PDStream newContents = new PDStream(doc);
OutputStream newContentOutput = newContents.createOutputStream(COSName.FLATE_DECODE);
ContentStreamWriter newContentWriter = new ContentStreamWriter(newContentOutput);
List<Object> operands = new ArrayList<>();
Operator operator = null;
Object token;
int suppressDepth = 0;
boolean resumeOutputOnNextOperator = false;
int removedCount = 0;
while (true) {
operands.clear();
token = pdParser.parseNextToken();
while(token != null && !(token instanceof Operator)) {
operands.add(token);
token = pdParser.parseNextToken();
}
operator = (Operator)token;
if (operator == null) break;
if (resumeOutputOnNextOperator) {
resumeOutputOnNextOperator = false;
suppressDepth--;
if (suppressDepth == 0)
removedCount++;
}
if (OperatorName.BEGIN_MARKED_CONTENT_SEQ.equals(operator.getName())
|| OperatorName.BEGIN_MARKED_CONTENT.equals(operator.getName())) {
COSName contentId = (COSName)operands.get(0);
final COSDictionary properties;
if (operands.size() > 1) {
Object propsOperand = operands.get(1);
if (propsOperand instanceof COSDictionary) {
properties = (COSDictionary) propsOperand;
} else if (propsOperand instanceof COSName) {
properties = pdResources.getProperties((COSName)propsOperand).getCOSObject();
} else {
properties = new COSDictionary();
}
} else {
properties = new COSDictionary();
}
if (matcher.matches(contentId, properties)) {
suppressDepth++;
}
}
if (OperatorName.END_MARKED_CONTENT.equals(operator.getName())) {
if (suppressDepth > 0)
resumeOutputOnNextOperator = true;
}
else if (OperatorName.SET_GRAPHICS_STATE_PARAMS.equals(operator.getName())) {
resourceSuppressionTracker.markForOperator(COSName.EXT_G_STATE, operands.get(0), suppressDepth == 0);
}
else if (OperatorName.DRAW_OBJECT.equals(operator.getName())) {
resourceSuppressionTracker.markForOperator(COSName.XOBJECT, operands.get(0), suppressDepth == 0);
}
else if (OperatorName.SET_FONT_AND_SIZE.equals(operator.getName())) {
resourceSuppressionTracker.markForOperator(COSName.FONT, operands.get(0), suppressDepth == 0);
}
if (suppressDepth == 0) {
newContentWriter.writeTokens(operands);
newContentWriter.writeTokens(operator);
}
}
if (resumeOutputOnNextOperator)
removedCount++;
newContentOutput.close();
page.setContents(newContents);
resourceSuppressionTracker.updateResources(pdResources);
return removedCount;
}
private static class ResourceSuppressionTracker{
// if the boolean is TRUE, then the resource should be removed. If the boolean is FALSE, the resource should not be removed
private final Map<COSName, Map<COSName, Boolean>> tracker = new HashMap<>();
public void markForOperator(COSName resourceType, Object resourceNameOperand, boolean preserve) {
if (!(resourceNameOperand instanceof COSName)) return;
if (preserve) {
markForPreservation(resourceType, (COSName)resourceNameOperand);
} else {
markForRemoval(resourceType, (COSName)resourceNameOperand);
}
}
public void markForRemoval(COSName resourceType, COSName refId) {
if (!resourceIsPreserved(resourceType, refId)) {
getResourceTracker(resourceType).put(refId, Boolean.TRUE);
}
}
public void markForPreservation(COSName resourceType, COSName refId) {
getResourceTracker(resourceType).put(refId, Boolean.FALSE);
}
public void updateResources(PDResources pdResources) {
for (Map.Entry<COSName, Map<COSName, Boolean>> resourceEntry : tracker.entrySet()) {
for(Map.Entry<COSName, Boolean> refEntry : resourceEntry.getValue().entrySet()) {
if (refEntry.getValue().equals(Boolean.TRUE)) {
// remove from the dictionary of the tracked resource type (XObject, ExtGState or Font)
pdResources.getCOSObject().getCOSDictionary(resourceEntry.getKey()).removeItem(refEntry.getKey());
}
}
}
}
private boolean resourceIsPreserved(COSName resourceType, COSName refId) {
return getResourceTracker(resourceType).getOrDefault(refId, Boolean.FALSE);
}
private Map<COSName, Boolean> getResourceTracker(COSName resourceType){
if (!tracker.containsKey(resourceType)) {
tracker.put(resourceType, new HashMap<>());
}
return tracker.get(resourceType);
}
}
}
Helper class:
public interface MarkedContentMatcher {
public boolean matches(COSName contentId, COSDictionary props);
}
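The nesting bookkeeping in removeMarkedContent above can be seen in isolation in a much simplified model. The sketch below treats a content stream as a flat list of strings and drops everything from a matching "/OC /name BDC" up to its balancing "EMC". MarkedContentFilter and strip are hypothetical names, the string tokens are an assumption made for illustration, and no PDFBox classes are involved; in a real document the name after /OC is a key into the page's Properties resources rather than the layer's display name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MarkedContentFilter {
    // Drops every token from a matching "/OC /<name> BDC" up to its balancing "EMC".
    // Tokens are plain strings here; a real content stream holds operand and operator objects.
    static List<String> strip(List<String> tokens, String propertyName) {
        List<String> out = new ArrayList<>();
        int depth = 0; // nesting depth inside the suppressed section, 0 = not suppressing
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            if (depth == 0) {
                boolean startsTarget = token.equals("/OC")
                        && i + 2 < tokens.size()
                        && tokens.get(i + 1).equals("/" + propertyName)
                        && tokens.get(i + 2).equals("BDC");
                if (startsTarget) {
                    depth = 1;
                    i += 2; // also skip the property name operand and the BDC operator
                } else {
                    out.add(token);
                }
            } else if (token.equals("BDC") || token.equals("BMC")) {
                depth++; // nested marked content inside the suppressed section
            } else if (token.equals("EMC")) {
                depth--; // reaching 0 ends the suppressed section
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "q", "/OC", "/MC0", "BDC", "(hidden)", "Tj",
                "/Span", "BMC", "(x)", "Tj", "EMC", "EMC",
                "(shown)", "Tj", "Q");
        System.out.println(strip(tokens, "MC0"));
    }
}
```

Note that the operands in front of the BDC operator are removed together with it; leaving them behind would produce operands without an operator in the rewritten stream.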
Optional Content Groups are marked with BDC and EMC. You will have to navigate through all of the tokens returned from the parser and remove the "section" from the array. Here is some C# code that was posted a while ago: "How to delete an optional content group alongwith its content from pdf using pdfbox?"
I investigated that (converting it to Java) but couldn't get it to work as expected. I managed to remove the content between BDC and EMC and then save the result using the same technique as the sample, but the PDF was corrupted. Perhaps that is due to my lack of C# knowledge (related to Tuples etc.).
Here is what I came up with. As I said, it doesn't work; perhaps you or someone else (mkl, Tilman Hausherr) can spot the flaw.
OCGDelete (PDDocument doc, int pageNum, String OCName) {
PDPage pdPage = (PDPage) doc.getDocumentCatalog().getPages().get(pageNum);
PDResources pdResources = pdPage.getResources();
PDFStreamParser pdParser = new PDFStreamParser(pdPage);
int ocgStart
int ocgLength
Collection tokens = pdParser.getTokens();
Object[] newTokens = tokens.toArray()
try {
for (int index = 0; index < newTokens.length; index++) {
obj = newTokens[index]
if (obj instanceof COSName && obj.equals(COSName.OC)) {
// println "Found COSName at "+index /// Found Optional Content
startIndex = index
index++
if (index < newTokens.size()) {
obj = newTokens[index]
if (obj instanceof COSName) {
prop = pdRes.getProperties(obj)
if (prop != null && prop instanceof PDOptionalContentGroup) {
if ((prop.getName()).equals(delLayer)) {
println "Found the Layer to be deleted"
println "prop Name was " + prop.getName()
index++
if (index < newTokens.size()) {
obj = newTokens[index]
if ((obj.getName()).equals("BDC")) {
ocgStart = index
println("OCG Start " + ocgStart)
ocgLength = -1
index++
while (index < newTokens.size()) {
ocgLength++
obj = newTokens[index]
println " Loop through relevant OCG Tokens " + obj
if (obj instanceof Operator && (obj.getName()).equals("EMC")) {
println "the next obj was " + obj
println "after that " + newTokens[index + 1] + "and then " + newTokens[index + 2]
println("OCG End " + ocgLength++)
break
}
index++
}
if (endIndex > 0) {
println "End Index was something " + (startIndex + ocgLength)
}
}
}
}
}
}
}
}
}
}
catch (Exception ex){
println ex.message()
}
for (int i = ocgStart; i < ocgStart+ ocgLength; i++){
newTokens.removeAt(i)
}
PDStream newContents = new PDStream(doc);
OutputStream output = newContents.createOutputStream(COSName.FLATE_DECODE);
ContentStreamWriter writer = new ContentStreamWriter(output);
writer.writeTokens(newTokens);
output.close();
pdPage.setContents(newContents);
}
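One detail that may contribute to the corruption, although this is only a guess and not confirmed in the thread, is the final removal loop: it removes by increasing index, so if the tokens are in a list, every removal shifts the remaining elements and each second token is skipped. The range also starts at the BDC operator, which leaves the preceding /OC and property name operands in the stream. A minimal sketch of removing a contiguous range in one step with the JDK's subList follows; RangeRemoval and withoutRange are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RangeRemoval {
    // Returns a copy of the list without the elements in [start, start + length).
    // The range is cleared in one step, so no index shifts happen mid-loop.
    static <T> List<T> withoutRange(List<T> tokens, int start, int length) {
        List<T> copy = new ArrayList<>(tokens);
        copy.subList(start, start + length).clear();
        return copy;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(withoutRange(tokens, 1, 3));
    }
}
```

The resulting list can then be handed to the ContentStreamWriter as before.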

Parsing HTML issues with Apache Tika

I am crawling a web page, extracting all the links from it, and then trying to parse each URL using Apache Tika and BoilerPipe with the code below. For some URLs parsing works well, but for others I get the error below. The stack trace points to line 102 of HTMLParser.java, which is this line:
String parsedText = tika.parseToString(htmlStream, md);
I have provided the HTMLParser code as well.
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@67c28a6a
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:203)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
at org.apache.tika.Tika.parseToString(Tika.java:357)
at edu.uci.ics.crawler4j.crawler.HTMLParser.parse(HTMLParser.java:102)
at edu.uci.ics.crawler4j.crawler.WebCrawler.handleHtml(WebCrawler.java:227)
at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:299)
at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:118)
at java.lang.Thread.run(Unknown Source)
Caused by: java.util.zip.ZipException: invalid block type
at java.util.zip.InflaterInputStream.read(Unknown Source)
at java.util.zip.ZipInputStream.read(Unknown Source)
at java.io.FilterInputStream.read(Unknown Source)
at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource$FakeZipEntry.<init>(ZipInputStreamZipEntrySource.java:114)
at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:55)
at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:82)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:220)
at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:152)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:65)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:67)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
... 8 more
This is my HTMLParser.java file-
public void parse(String htmlContent, String contextURL) {
InputStream htmlStream = null;
text = null;
title = null;
metaData = new HashMap<String, String>();
urls = new HashSet<String>();
char[] chars = htmlContent.toCharArray();
bulletParser.setCallback(textExtractor);
bulletParser.parse(chars);
try {
text = articleExtractor.getText(htmlContent);
} catch (BoilerpipeProcessingException e) {
e.printStackTrace();
}
if (text == null){
text = textExtractor.text.toString().trim();
}
title = textExtractor.title.toString().trim();
try {
Metadata md = new Metadata();
String utfHtmlContent = new String(htmlContent.getBytes(),"UTF-8");
htmlStream = new ByteArrayInputStream(utfHtmlContent.getBytes());
//The below line is at the line number 102 according to error above
String parsedText = tika.parseToString(htmlStream, md);
//very unlikely to happen
if (text == null){
text = parsedText.trim();
}
processMetaData(md);
} catch (Exception e) {
e.printStackTrace();
} finally {
IOUtils.closeQuietly(htmlStream);
}
bulletParser.setCallback(linkExtractor);
bulletParser.parse(chars);
Iterator<String> it = linkExtractor.urls.iterator();
String baseURL = linkExtractor.base();
if (baseURL != null) {
contextURL = baseURL;
}
int urlCount = 0;
while (it.hasNext()) {
String href = it.next();
href = href.trim();
if (href.length() == 0) {
continue;
}
String hrefWithoutProtocol = href.toLowerCase();
if (href.startsWith("http://")) {
hrefWithoutProtocol = href.substring(7);
}
if (hrefWithoutProtocol.indexOf("javascript:") < 0
&& hrefWithoutProtocol.indexOf("#") < 0) {
URL url = URLCanonicalizer.getCanonicalURL(href, contextURL);
if (url != null) {
urls.add(url.toExternalForm());
urlCount++;
if (urlCount > MAX_OUT_LINKS) {
break;
}
}
}
}
}
Any suggestions will be appreciated.
Sounds like a malformed OOXML document (.docx, .xlsx, etc.). To check whether the problem still occurs with the latest Tika version, you can download the tika-app jar and run it like this:
java -jar tika-app-1.0.jar --text http://url.of.the/troublesome/document.docx
This should print out the text contained in the document. If it doesn't work, please file a bug report with the URL of the troublesome document (or attach the document if it's not publicly available).
I had the same issue. The .docx files I was trying to parse were not simple documents; they were forms built in Microsoft Word, with label text and input fields beside it.
After I removed those files from the folder and posted the remaining files to the Solr engine for parsing and indexing, it worked.

Android: Parsing XML DOM parser. Converting childnodes to string

Again a question. This time I'm parsing XML messages I receive from a server.
Someone thought they were being smart and decided to place HTML pages in an XML message. Now I'm facing a problem, because I want to extract that HTML page as a string from this XML message.
Ok this is the XML message I'm parsing:
<AmigoRequest>
<From></From>
<To></To>
<MessageType>showMessage</MessageType>
<Param0>general message</Param0>
<Param1><html><head>test</head><body>Testhtml</body></html></Param1>
</AmigoRequest>
You see that in Param1 an HTML page is specified. I've tried to extract the message the following way:
public String getParam1(Document d) {
if (d.getDocumentElement().getTagName().equals("AmigoRequest")) {
NodeList results = d.getElementsByTagName("Param1");
// Messagetype depends on what message we are reading.
if (results.getLength() > 0 && results != null) {
return results.item(0).getFirstChild().getNodeValue();
}
}
return "";
}
Where d is the XML message in document form.
It always returns null, because getNodeValue() returns null for the first child, which is the <html> element rather than a text node.
When I try results.item(0).getFirstChild().hasChildNodes() it returns true, because it sees that there is a tag in the message.
How can I extract the HTML message <html><head>test</head><body>Testhtml</body></html> from Param1 as a string?
I'm using Android sdk 1.5 (well almost java) and a DOM Parser.
Thanks for your time and replies.
Antek
You could take the content of param1, like this:
public String getParam1(Document d) {
if (d.getDocumentElement().getTagName().equals("AmigoRequest")) {
NodeList results = d.getElementsByTagName("Param1");
// Messagetype depends on what message we are reading.
if (results != null && results.getLength() > 0) {
// String extractHTMLTags(String s) is a function that you have
// to implement so that it removes all the HTML tags from a string.
return extractHTMLTags(results.item(0).getTextContent());
}
}
return "";
}
All you have to do is to implement a function:
String extractHTMLTags(String s)
that will remove all HTML tag occurrences from a string.
For that you can take a look at this post: Remove HTML tags from a String
After a lot of checking and head-scratching, I came up with a simple change: you need to set your API level to 8.
EDIT: I just saw your comment above about getTextContent() not being supported on Android. I'm going to leave this answer up in case it's useful to someone who's on a different platform.
If your DOM API supports it, you can call getTextContent(), as follows:
public String getParam1(Document d) {
if (d.getDocumentElement().getTagName().equals("AmigoRequest")) {
NodeList results = d.getElementsByTagName("Param1");
// Messagetype depends on what message we are reading.
if (results != null && results.getLength() > 0) {
return results.item(0).getTextContent();
}
}
return "";
}
However, getTextContent() is a DOM Level 3 API call; not all parsers are guaranteed to support it. Xerces-J does.
By the way, in your original example, your check for null is in the wrong place; it should be:
if (results != null && results.getLength() > 0) {
Otherwise, you'd get a NPE if results really does come back as null.
Since getTextContent() isn't available to you, another option would be to write it -- it isn't hard. In fact, if you're writing this solely for your own use -- or your employer doesn't have overly strict rules about open source -- you could look at Apache's implementation as a starting point; lines 610-646 seem to contain most of what you need. (Please be respectful of Apache's copyright and license.)
Otherwise, some rough pseudocode for the method would be:
String getTextContent(Node node) {
if (node has no children)
return "";
if (node has 1 child)
return getTextContent(node.getFirstChild());
return getTextContent(node, new StringBuffer()).toString();
}
StringBuffer getTextContent(Node node, StringBuffer sb) {
for each child of node {
if (child is a text node) sb.append(child's text)
else getTextContent(child, sb);
}
return sb;
}
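The pseudocode above could be turned into plain Java roughly as follows. This is only a sketch: the class name DomText is made up for illustration, and it uses nothing beyond the basic org.w3c.dom Node and NodeList methods, so it does not rely on DOM Level 3 support. Note that for the message in the question it yields the text "testTesthtml", not the markup.

```java
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class (the name is an assumption, not part of any library).
final class DomText {
    private DomText() {}

    /** Concatenates the text of all text and CDATA nodes below the given node. */
    static String getTextContent(Node node) {
        StringBuilder sb = new StringBuilder();
        appendText(node, sb);
        return sb.toString();
    }

    private static void appendText(Node node, StringBuilder sb) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                // Leaf text: append it as-is.
                sb.append(child.getNodeValue());
            } else if (type != Node.COMMENT_NODE
                    && type != Node.PROCESSING_INSTRUCTION_NODE) {
                // Elements (and anything else with children): recurse.
                appendText(child, sb);
            }
        }
    }
}
```

Usage would look like DomText.getTextContent(results.item(0)), after checking that results is non-null and non-empty.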
Well i was almost there with the code...
public String getParam1(Document d) {
if (d.getDocumentElement().getTagName().equals("AmigoRequest")) {
NodeList results = d.getElementsByTagName("Param1");
// Messagetype depends on what message we are reading.
if (results.getLength() > 0 && results != null) {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db;
Element node = (Element) results.item(0); // get the value of Param1
Document doc2 = null;
try {
db = dbf.newDocumentBuilder();
doc2 = db.newDocument(); //create new document
doc2.appendChild(doc2.importNode(node, true)); //import the <html>...</html> result in doc2
} catch (ParserConfigurationException e) {
// TODO Auto-generated catch block
Log.d(TAG, " Exception ", e);
} catch (DOMException e) {
// TODO: handle exception
Log.d(TAG, " Exception ", e);
} catch (Exception e) {
// TODO: handle exception
e.printStackTrace();
}
return doc2. .....// All I'm missing is something to convert a Document to a string.
}
}
return "";
}
As explained in the comment in my code, all I am missing is a way to make a String out of a Document. You can't use the Transform class in Android, and doc2.toString() only gives you a string representation of the object, not the XML.
But my next step is to write my own parser if this doesn't work out ;)
Not the best code, but a temporary solution.
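If no serializer is available, one option is to walk the node tree and write the markup out by hand. The following is a rough sketch of that idea, not a complete XML serializer: the class name DomMarkup and the method innerXml are made up for illustration, it handles only elements, attributes, text and CDATA, and it always writes empty elements as an open/close pair.

```java
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class (names are assumptions, not part of any library).
final class DomMarkup {
    private DomMarkup() {}

    /** Returns the markup of everything inside the given node, excluding the node's own tags. */
    static String innerXml(Node parent) {
        StringBuilder sb = new StringBuilder();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            write(children.item(i), sb);
        }
        return sb.toString();
    }

    private static void write(Node node, StringBuilder sb) {
        short type = node.getNodeType();
        if (type == Node.ELEMENT_NODE) {
            sb.append('<').append(node.getNodeName());
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node a = attrs.item(i);
                sb.append(' ').append(a.getNodeName()).append("=\"")
                  .append(escape(a.getNodeValue(), true)).append('"');
            }
            sb.append('>');
            NodeList children = node.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                write(children.item(i), sb);
            }
            sb.append("</").append(node.getNodeName()).append('>');
        } else if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            sb.append(escape(node.getNodeValue(), false));
        }
        // Comments, processing instructions, etc. are skipped in this sketch.
    }

    // Escapes the characters that would otherwise break the markup.
    private static String escape(String s, boolean inAttribute) {
        String out = s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        return inAttribute ? out.replace("\"", "&quot;") : out;
    }
}
```

With this, DomMarkup.innerXml(results.item(0)) could be returned directly from getParam1, without creating a second Document.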
public String getParam1(String b) {
return b
.substring(b.indexOf("<Param1>") + "<Param1>".length(), b.indexOf("</Param1>"));
}
Where String b is the XML document string.
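A slightly more defensive variant of the same substring idea is sketched below. The class name Param1Extractor is made up for illustration, and returning an empty string when a tag is missing is an assumption that mirrors the other getParam1 versions in this thread. It still assumes the opening tag is written exactly as <Param1>, with no attributes or extra whitespace.

```java
// Hypothetical helper class (the name is an assumption).
final class Param1Extractor {
    private Param1Extractor() {}

    /** Returns the raw text between <Param1> and </Param1>, or "" if either tag is missing. */
    static String getParam1(String xml) {
        final String open = "<Param1>";
        final String close = "</Param1>";
        int start = xml.indexOf(open);
        if (start < 0) {
            return "";
        }
        start += open.length();
        // Search for the closing tag only after the opening tag.
        int end = xml.indexOf(close, start);
        if (end < 0) {
            return "";
        }
        return xml.substring(start, end);
    }
}
```

Without the checks, a message that lacks Param1 would make substring throw a StringIndexOutOfBoundsException instead of returning an empty result.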
