How can I get the LineNumber of the element when using Jsoup? - java

such as:
Document doc = Jsoup.parse(file,"UTF-8");
Elements eles = doc.getElementsByTag("style");
How can I get the lineNumber of eles[0] in the file?

There is no way to do this with the Jsoup API. I checked the source code: org.jsoup.parser.Parser maintains no position information for elements in the original input.
Please refer to the sources on GrepCode.
Given that Jsoup is built for extracting and manipulating data, I don't believe they will add such a feature in the future: it is ambiguous what an element's position is after manipulation, and maintaining actual references is costly.

There is no direct way, but there is an indirect one.
Once you find the point of interest, such as an attribute, add a token as an HTML comment before the element and write the document to a temporary file. The next step is to search for the token using a text editing tool.
The code is as follows.
Step-1:
// Find the elements of interest and mark each one with a token comment
for (Element element : doc.getAllElements()) {
    for (Attribute attribute : element.attributes()) {
        String myAttr = attribute.getKey();
        if (myAttr.equals("some-attribute-name-of-interest")) {
            System.out.println(attribute.getKey() + "::" + attribute.getValue());
            element.before("<!-- My Special Token : ABCDEFG -->");
        }
    }
}
Step-2:
// write the doc back to a temporary file
// see: How to save a jsoup document as text file
Step-3:
The last step is to search for "My Special Token : ABCDEFG" in the output file using a text editing tool.
jsoup is a nice library. I thought this would help others.
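The search in Step-3 can also be automated instead of being done in a text editor. Below is a minimal sketch in plain Java, assuming the marked-up document has already been serialized to a string or file in Step-2; the class and method names are hypothetical. Keep in mind that the line it reports is a line in the re-serialized output, which may differ from the line in the original file if the HTML was reformatted when it was written out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports the 1-based line numbers on which a token occurs.
class TokenLineFinder {

    static List<Integer> findTokenLines(String text, String token) {
        List<Integer> lines = new ArrayList<>();
        // Split on any common line terminator; -1 keeps trailing empty lines.
        String[] split = text.split("\r\n|\r|\n", -1);
        for (int i = 0; i < split.length; i++) {
            if (split[i].contains(token)) {
                lines.add(i + 1); // line numbers are 1-based
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in for the serialized document written in Step-2.
        String output = String.join("\n",
                "<html>",
                "<head>",
                "<!-- My Special Token : ABCDEFG -->",
                "<style>body { margin: 0; }</style>",
                "</head>",
                "</html>");
        System.out.println(findTokenLines(output, "My Special Token : ABCDEFG"));
    }
}
```

For the sample text above this prints [3], so the marked element starts on line 4 of the output.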

Related

Extract Reading Order Sequence in a Tagged PDF

I'm currently validating the correct order of the content in a Tagged PDF File.
Is there any way to extract the reading order numbers of Tagged PDF Files programmatically?
I've tried converting the tagged PDF to XML but I can't figure out which tags belong to a certain text.
I've tried the following libraries:
Syncfusion
iText 7
but I can't find any method that returns the reading order numbers.
Is it really possible? Thanks in advance!
You can extract the marked content tree of a tagged PDF using the PdfPig (.NET) library. My understanding is that the reading order is indicated by the marked-content identifier (MCID).
If a marked content element does not contain an MCID (like pagination elements), the MCID is set to -1.
Each MarkedContentElement will contain the letters, images and paths that belong to it:
using System.Linq; // needed for OrderBy
using UglyToad.PdfPig;
[...]
using (PdfDocument document = PdfDocument.Open(pathToFile))
{
    for (int p = 0; p < document.NumberOfPages; p++)
    {
        // pages are numbered starting at 1
        var page = document.GetPage(p + 1);
        // extract the page's marked content
        var markedContents = page.GetMarkedContents();
        // sort the marked content by MCID
        var orderedMarkedContents = markedContents
            .OrderBy(mc => mc.MarkedContentIdentifier);
        foreach (var mc in orderedMarkedContents)
        {
            // do something
        }
    }
}
If you want to extract the result to XML, you can have a look at the PageXmlTextExporter class. Have a look at the wiki for more information on ITextExporter and IReadingOrderDetector.
Note: I am an active contributor to this library.

Inserting elements in xml

I am trying to add elements to xml. I followed Java DOM - Inserting an element, after another
but that did not work for me. Here is my code:
Element e = dom.createElement("mapping");
e.setAttribute("resource", "/some/path/to/file");
Element lastChild = (Element)nList.item(nList.getLength()-1);
Element parent= (Element)nList.item(nList.getLength()-1).getParentNode();
lastChild.getParentNode().insertBefore(e, lastChild);
I also tried parent.appendChild(e); but neither approach works. There doesn't seem to be a problem with the code. What could the problem be?
I am using NetBeans on Mac OS X. Is this because of file permissions?
The Elements you are working with are part of a DOM in memory; they are not directly tied to your file contents. The code snippet you posted only modifies that DOM. You will only see the change in the file if you write your data back to it.
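Writing an in-memory DOM back out is commonly done with a Transformer from javax.xml.transform. Here is a minimal sketch, assuming a hypothetical <config> root element and working on strings so that it is self-contained; the class and method names are made up for illustration. To update a real file, the StreamResult would wrap a File instead of a StringWriter.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper: modifies a DOM and serializes it again.
class DomWriteBack {

    // Parses the XML, appends a <mapping> element to the root, and returns the serialized result.
    static String addMapping(String xml, String resource) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            Element e = dom.createElement("mapping");
            e.setAttribute("resource", resource);
            dom.getDocumentElement().appendChild(e); // changes the in-memory tree only

            // Serialize the modified tree; without this step nothing is persisted.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(dom), new StreamResult(out));
            return out.toString();
        } catch (Exception ex) {
            throw new IllegalStateException("Could not update XML", ex);
        }
    }

    public static void main(String[] args) {
        String xml = "<config><mapping resource=\"/a\"/></config>";
        System.out.println(addMapping(xml, "/some/path/to/file"));
    }
}
```

The appendChild call alone leaves the source untouched; only the output of transform contains the new element.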

Unusual output when using Jsoup in Java

I am getting this output when trying to use Jsoup to extract text from Wikipedia:
I don't have enough rep to post pictures as I am new to this site, but it's basically like this:
[]{k[]q[]f[]d[]d etc..
Here is part of my code:
public static void scrapeTopic(String url)
{
String html = getUrl("http://www.wikipedia.org/" + url);
Document doc = Jsoup.parse(html);
String contentText = doc.select("*").first().text();
System.out.println(contentText);
}
It appears to get all the information but in the wrong format!
I appreciate any help given
Thanks in advance
Here are some suggestions. When fetching an ordinary web page that doesn't require HTTP header fields such as cookies or a user agent to be set, just call:
Document doc = Jsoup.connect("givenURL").get();
This method reads the web page using a GET request. When you select elements using *, it matches any element, that is, every element of the document. Hence, doc.select("*").first() returns the #root element. Try printing it to see:
System.out.println(doc.select("*").first().tagName()); // #root
System.out.println(doc.select("*").first()); // prints the whole document
System.out.println(doc); // also prints the whole document, so the call above is pointless
System.out.println(doc.select("*").first() == doc); // checks whether they are the same object; prints true
I am assuming that you are just playing around to learn this API. Selectors are much more powerful, but a good start would be to try the general document-manipulation functions, e.g. doc.getElementsByTag().
However, on my local machine I was able to fetch the Document and parse it using your getUrl() function.

Get an XML from a url using dom

I have stored the URL that returns the XML response in a String variable (link), and I use a DOM parser to parse the XML data.
To make sure that I extract the data correctly, I first stored the XML on the local drive, built my parser, and read the data:
document = builder.parse(new File(filepath));
Then, when I tried to get it from the URL, I used:
document = builder.parse(new URL(link).openStream());
And it didn't work. What am I missing?
The XML data is stored in a list, which is then shown in a JSF dataTable.
Well, the above works just fine; the problem was the index of the elements in the NodeList. For some reason, when I was reading from the file, I had to use:
obj.setattribute1(cDetails.item(1).getTextContent());
obj.setattribute2(cDetails.item(3).getTextContent());
Note that the item index increases by 2 each time.
Now that I read from a URL, the index increases by 1 each time.
I am sure there is a reason for this that I don't yet understand, probably because of my limited knowledge, but the above works, and the item index now increases by 1 for each subsequent item in the NodeList.
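A likely explanation, offered as an assumption because the XML itself is not shown: the saved file was indented, so the parser reports the whitespace between elements as text nodes that occupy every other index, while the URL response contains no such whitespace. Selecting only element nodes makes the code independent of formatting. Here is a minimal sketch with hypothetical element names and a made-up helper class:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: reads child elements without relying on fixed indices.
class ElementChildren {

    // Returns the text of each child element of the root, skipping whitespace text nodes.
    static List<String> childElementTexts(String xml) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList children = document.getDocumentElement().getChildNodes();
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    texts.add(child.getTextContent());
                }
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Indented input: the elements sit at child indices 1 and 3.
        String indented = "<c>\n  <a>1</a>\n  <b>2</b>\n</c>";
        // Compact input: the elements sit at child indices 0 and 1.
        String compact = "<c><a>1</a><b>2</b></c>";
        System.out.println(childElementTexts(indented));
        System.out.println(childElementTexts(compact));
    }
}
```

Both calls print [1, 2], even though the raw child indices of the elements differ between the two inputs.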

Copy a whole ODT (Openoffice Writer) document section to other document with Openoffice Java API (UNO API)

I need to use the OpenOffice Java API to copy a document section and paste it over another document section. So far I have managed to copy the text of the section of the source document and paste it over the section at the target document (see the example below).
However, the problem is that non-text elements (graphics, formats, tables, etc.) don't get pasted on the destination document.
The code I have used to extract the text of the source section is:
// Load the source document
XComponent xComponentSource = this.ooHelper.loadDocument("file://" + fSource);
// Get sections
XTextSectionsSupplier textSectionsSupplierSource = (XTextSectionsSupplier)UnoRuntime.queryInterface(XTextSectionsSupplier.class, xComponentSource);
XNameAccess nameAccessSource = textSectionsSupplierSource.getTextSections();
// Get the section by name
XTextSection textSectionSource = (XTextSection)UnoRuntime.queryInterface(XTextSection.class, nameAccessSource.getByName("SeccEditable"));
// Get the section text
String sectionSource = textSectionSource.getAnchor().getString();
To paste the text over the target section, the code to select the section is the same, and I set the string:
textSectionDest.getAnchor().setString(sectionSource);
I have read the API Javadoc, and I haven't found any method to copy the entire section. Is there any way to do it?
I was having this same problem. I ended up solving it by creating two cursors, one at the start of the content I wanted duplicated and another at the end, then extending the selection of the first cursor to the second. This used the gotoRange method on the first cursor, passing in the second cursor and true to tell it to expand the selection.
Cursor Example:
http://api.openoffice.org/docs/DevelopersGuide/Text/Text.xhtml#1_3_1_1_Editing_Text
Then I created an AutoText container, group and entry containing the selection, and inserted (pasted) the content at a cursor position using the applyTo method of the AutoText entry. I used a GUID for the name of the AutoText container so it would be unique, and then deleted the container when I was done.
AutoText Example:
http://api.openoffice.org/docs/DevelopersGuide/Text/Text.xhtml#1_3_1_6_Auto_Text
I can post my code if you want, however it's written in Python.
