How to parse this site with Jsoup or another parser?

How to parse this site with Jsoup or another parser? - java

I'am trying to parse a page which has no defined encoding in its header, in the HTML it defines ISO-8859-1 as encoding. Jsoup isn't able to parse it with default settings (also HTMLunit and PHP's Simple HTML Dom Parser can't handle it by default). Even if I define the encoding for Jsoup myself it still isn't working. Can't figure out why.
Here's my code:
String url = "http://www.parkett.de";
Document doc = null;
try {
doc = Jsoup.parse(new URL(url).openStream(), "ISO-8859-1", url);
// doc = Jsoup.parse(new URL(url).openStream(), "CP1252", url);
} catch (IOException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
Element extractHtml = null;
Elements elements = null;
String title = null;
elements = doc.select("h1");
if(!elements.isEmpty()) {
extractHtml = elements.get(0);
title = extractHtml.text();
}
System.out.println(title);
Thanks for any suggestions!

When working with URLs, chapters 4 & 9 of the cookbook recommend using Jsoup.connect(...).get(). Chapter 5 suggests using Jsoup.parse() when loading a document from a local file.
public static void main(String[] args) {
Document doc = null;
try {
doc = Jsoup.connect("http://www.parkett.de/").get();
} catch (IOException e) {
e.printStackTrace();
}
Element firstH1 = doc.select("h1").first();
System.out.println((firstH1 != null) ? firstH1.text() : "First <h1> not found.");
}

Related

IText html to pdf wrapping line

Hello I'm creating javafx app with iText. I have html editor to write text and I want to create pdf from it. Everything works but when I have a really long line that is wrapped in html editor, in pdf it isn't wrapped, its out of page, how can I set wrapping page? here is my code:
PdfWriter writer = null;
try {
writer = new PdfWriter("doc.pdf");
} catch (FileNotFoundException e) {
e.printStackTrace();
}
//Initialize PDF document
PdfDocument pdf = new PdfDocument(writer);
// Initialize document
Document document = new Document(pdf, PageSize.A4);
List<IElement> list = null;
try {
list = HtmlConverter.convertToElements(editor.getHtmlText());
} catch (IOException e) {
e.printStackTrace();
}
// add elements to document
for (IElement p : list) {
document.add((IBlockElement) p);
}
// close document
document.close();
I also want to set line spacing for this text
Thank you for help

I don't get any errors for the following code:
public class stack_overflow_0008 extends AbstractSupportTicket{
private static String LONG_PIECE_OF_TEXT =
"Once upon a midnight dreary, while I pondered, weak and weary," +
"Over many a quaint and curious volume of forgotten lore—" +
"While I nodded, nearly napping, suddenly there came a tapping," +
"As of some one gently rapping, rapping at my chamber door." +
"Tis some visitor,” I muttered, “tapping at my chamber door—" +
"Only this and nothing more.";
public static void main(String[] args)
{
PdfWriter writer = null;
try {
writer = new PdfWriter(getOutputFile());
} catch (FileNotFoundException e) {
e.printStackTrace();
}
//Initialize PDF document
PdfDocument pdf = new PdfDocument(writer);
// Initialize document
Document document = new Document(pdf, PageSize.A4);
List<IElement> list = null;
try {
list = HtmlConverter.convertToElements("<p>" + LONG_PIECE_OF_TEXT + "</p>");
} catch (IOException e) {
e.printStackTrace();
}
for (IElement p : list) {
document.add((IBlockElement) p);
}
document.close();
}
}
The document is a single (A4) page PDF with one string neatly wrapped.
I think perhaps the content of your string is to blame?
Could you post the HTML you get from this editor object?
Update:
Using the code from this answer on the HTML shared in a new comment to the question, I get the following result:
As you can see, the content is distributed over two lines. No content "falls off the page."

Apache POI Table of contents not updating

I am using Apache POI XWPF components and java, to extract data from a .xml file into a word document. So far so good, but I am struggling to create a table of contents.
I have to create a table of contents at the start of the method and then I update it at the end to get all the new headers. Currently I use doc.createTOC(), where doc is a variable created from XWPFDocument, to create the table at the start and then I use doc.enforceUpdateFields() to update everything at the end of the document. But when I open the document after I ran the program, the table of contents is empty, but the navigation panel does include some of the headers I specified.
A comment recommended that I include some code. So i started off by create the document from a template:
XWPFDocument doc = new XWPFDocument(new FileInputStream("D://Template.docx"));
I then create a table of contents:
doc.createTOC();
Then throughout the method I add headers to the document:
XWPFParagraph documentControlHeading = doc.createParagraph();
documentControlHeading.setPageBreak(true);
documentControlHeading.setAlignment(ParagraphAlignment.LEFT);
documentControlHeading.setStyle("Tier1Header");
After all the headers are added, I want to update the document so that all the new headers will appear in the table of contents. I do this buy using the following command:
doc.enforceUpdateFields();

Hmmm... I am looking at the createTOC() method code, and it appears that it looks for styles that look like Heading #. So Tier1Header would not be found. Try creating your text first, and use styles like Heading 1 for your headings. Then add the TOC using createTOC(). It should find all the headings when the TOC is created. I do not know if enforceUpdateFields() affects the TOC.

//Your docx template should contain the following or something similar text //which will be searched for and replaced with a WORD TOC.
//${TOC}
public static void main(String[] args) throws IOException, OpenXML4JException {
XWPFDocument docTemplate = null;
try {
File file = new File(PATH_TO_FILE); //"C:\\Reports\\Template.docx";
FileInputStream fis = new FileInputStream(file);
docTemplate = new XWPFDocument(fis);
generateTOC(docTemplate);
saveDocument(docTemplate);
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} finally {
if (docTemplate != null) {
docTemplate.close();
}
}
}
private static void saveDocument(XWPFDocument docTemplate) throws FileNotFoundException, IOException {
FileOutputStream outputFile = null;
try {
outputFile = new FileOutputStream(OUTFILENAME);
docTemplate.write(outputFile);
} finally {
if (outputFile != null) {
outputFile.close();
}
}
}
public static void generateTOC(XWPFDocument document) throws InvalidFormatException, FileNotFoundException, IOException {
String findText = "${TOC}";
String replaceText = "";
for (XWPFParagraph p : document.getParagraphs()) {
for (XWPFRun r : p.getRuns()) {
int pos = r.getTextPosition();
String text = r.getText(pos);
if (text != null && text.contains(findText)) {
text = text.replace(findText, replaceText);
r.setText(text, 0);
addField(p, "TOC \\o \"1-3\" \\h \\z \\u");
break;
}
}
}
}
private static void addField(XWPFParagraph paragraph, String fieldName) {
CTSimpleField ctSimpleField = paragraph.getCTP().addNewFldSimple();
// ctSimpleField.setInstr(fieldName + " \\* MERGEFORMAT ");
ctSimpleField.setInstr(fieldName);
ctSimpleField.addNewR().addNewT().setStringValue("<<fieldName>>");
}

This is the code of createTOC(), obtained by inspecting XWPFDocument.class:
public void createTOC() {
CTSdtBlock block = getDocument().getBody().addNewSdt();
TOC toc = new TOC(block);
for (XWPFParagraph par : this.paragraphs) {
String parStyle = par.getStyle();
if ((parStyle != null) && (parStyle.startsWith("Heading"))) try {
int level = Integer.valueOf(parStyle.substring("Heading".length())).intValue();
toc.addRow(level, par.getText(), 1, "112723803");
} catch (NumberFormatException e) {
e.printStackTrace();
}
}
}
As you can see, it adds to the TOC all paragraphs having styles named "HeadingX", with X being a number. But, unfortunately, that's not sufficent. The method, in fact, is bugged/uncomplete in its implementation.
The page number passed to addRow() is always 1, it's not even calculated.
So, at the end, you will have a TOC with all your paragraphs and the trailing dots giving the proper indentation, but the pages will be always equal to "1".
EDIT
...but, there's a solution here.

Incoherent XPath output

I just discovered XPath and I'm trying to use it in order to parse an XML file. I read a few courses about it, but I am stuck with a problem. When I try to get a NodeList from the file, the getLength() method always returns 0.
However, when I try a
document.getElementsByTagName("crtx:env").getLength()
The output is correct (7 in my case).
I do not really understand, because my nodelist is built according to my Document, the output should be similar, isn't it ?
Here is a part of my code :
IFile f = (IFile) PlatformUI.getWorkbench().getActiveWorkbenchWindow().getActivePage().getActiveEditor().getEditorInput().getAdapter(IFile.class);
String fileURL = f.getLocation().toOSString();
DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = null;
try {
builder = builderFactory.newDocumentBuilder();
} catch (ParserConfigurationException e) {
e.printStackTrace();
}
Document document = null;
if (builder != null){
try{
document = builder.parse(fileURL);
System.out.println("DOCUMENT URI : " + document.getDocumentURI());
} catch (Exception e){
e.printStackTrace();
}
} else {
System.out.println("builder null");
}
XPath xPath = XPathFactory.newInstance().newXPath();
NodeList nodeList = null;
try {
nodeList = (NodeList) xPath.compile("crtx:env").evaluate(document, XPathConstants.NODESET);
} catch (XPathExpressionException e) {
e.printStackTrace();
}
System.out.println("NODELIST SIZE : " + nodeList.getLength());
System.out.println(document.getElementsByTagName("crtx:env").getLength());
}
The first System.out.println() returns coherent output (a good URI), but the two last lines return a different number.
Any help would be appreciated.
Thanks for reading.

XPath is defined for XML with namespaces so set
DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
builderFactory.setNamespaceAware(true);
Then to use a path with a namespace prefix you need to use https://docs.oracle.com/javase/7/docs/api/javax/xml/xpath/XPath.html#setNamespaceContext%28javax.xml.namespace.NamespaceContext%29 to bind the prefix(es) used to namespace URIs, see https://xml.apache.org/xalan-j/xpath_apis.html#namespacecontext for an example
With a namespace aware DOM you will need to change your getElementsByTagName call to use getElementsByTagNameNS however.

How to update html div content using java

I am working on a java rcp application. Whenever user updates the details in UI, we are suppose to update the same details in html report also. Is there a we can update/add the html elements using java. Using Jsoup I am able to get the required element ID, but not able to innert/update new element to it.
Document htmlFile = null;
try {
htmlFile = Jsoup.parse(new File("C:\\ItemDetails1.html"), "UTF-8");
} catch (IOException e) {
e.printStackTrace();
}
Element div = htmlFile.getElementById("row2_comment");
System.out.println("text: " + div.html());
div.html("<li><b>Comments</b></li><ul><li>Testing for comment</li></ul>");
Any thoughts

Try:
Element div =
htmlFile.getElementById("row2_comment");
div.appendElement("p").attr("class",
"beautiful").text("Some New Text")
To add a new paragraph with some style and text content

How to extract urls from an html file stored on my computer using java?

I need to find all the urls present in an html file which is stored in my computer itself and extract the links and store it to a variable. I'm using the code below to scan the file and get the lines. But I'm having a hard time extracting just the links. I would appreciate if someone could help me out.
Scanner htmlScanner = new Scanner(new File(args[0]));
PrintWriter output = new PrintWriter(new FileWriter(args[1]));
while(htmlScanner.hasNext()){
output.print(htmlScanner.next());
}
System.out.println("\nDone");
htmlScanner.close();
output.close();

You can actually do this with the Swing HTML parser. Though the Swing parser only understands HTML 3.2, tags introduced in later HTML versions will simply be treated as unknown, and all you actually want are links anyway.
static Collection<String> getLinks(Path file)
throws IOException,
MimeTypeParseException,
BadLocationException {
HTMLEditorKit htmlKit = new HTMLEditorKit();
HTMLDocument htmlDoc;
try {
htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
try (Reader reader =
Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
htmlKit.read(reader, htmlDoc, 0);
}
} catch (ChangedCharSetException e) {
MimeType mimeType = new MimeType(e.getCharSetSpec());
String charset = mimeType.getParameter("charset");
htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
htmlDoc.putProperty("IgnoreCharsetDirective", true);
try (Reader reader =
Files.newBufferedReader(file, Charset.forName(charset))) {
htmlKit.read(reader, htmlDoc, 0);
}
}
Collection<String> links = new ArrayList<>();
for (HTML.Tag tag : Arrays.asList(HTML.Tag.LINK, HTML.Tag.A)) {
HTMLDocument.Iterator it = htmlDoc.getIterator(tag);
while (it.isValid()) {
String link = (String)
it.getAttributes().getAttribute(HTML.Attribute.HREF);
if (link != null) {
links.add(link);
}
it.next();
}
}
return links;
}

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to parse this site with Jsoup or another parser? - java

Related

IText html to pdf wrapping line

Apache POI Table of contents not updating

Incoherent XPath output

How to update html div content using java

How to extract urls from an html file stored on my computer using java?

Categories

Resources