Jsoup parsing html duplication on writing to file

I seem to be getting an error where text is written to a file twice: the first time with incorrect formatting and the second time with correct formatting. The method below takes in the URL after it has been converted properly. It is supposed to print, each on its own line, the text of all the children of the dividers that are children of the divider "ffaq", where all the body text resides. Any help would be appreciated. I'm fairly new to jsoup, so an explanation would be nice as well.
/**
 * Method to deal with HTML 5 Gamefaq entries.
 * @param url The location of the HTML 5 entry to read.
 **/
public static void htmlDocReader(URL url) {
    try {
        Document doc = Jsoup.parse(url.openStream(), "UTF-8", url.toString());
        //parse pagination label
        String[] num = doc.select("div.span12").
                select("ul.paginate").
                select("li").
                first().
                text().
                split("\\s+");
        //get the max page number
        final int max_pagenum = Integer.parseInt(num[num.length - 1]);
        //create a new file based on the url path
        File file = urlFile(url);
        PrintWriter outFile = new PrintWriter(file, "UTF-8");
        //Add every page to the text file
        for(int i = 0; i < max_pagenum; i++) {
            //if not the first page then change the url
            if(i != 0) {
                String new_url = url.toString() + "?page=" + i;
                doc = Jsoup.parse(new URL(new_url).openStream(), "UTF-8",
                        new_url.toString());
            }
            Elements walkthroughs = doc.select("div.ffaq");
            for(Element elem : walkthroughs.select("div")) {
                for(Element inner : elem.children()) {
                    outFile.println(inner.text());
                }
            }
        }
        outFile.close();
    } catch(Exception e) {
        e.printStackTrace();
        System.exit(1);
    }
}

Each call to text() returns all the text inside that element's subtree, including the text of its descendants.
Consider the example below:
<div>
    text of div
    <span>text of span</span>
</div>
If you call text() on the div element, you will get
text of div text of span
Then, if you call text() on the span, you will get
text of span
Your loop visits div.ffaq and every div nested inside it (select("div") matches both), so the same text is printed once as part of an outer element and again when the inner element is visited. To avoid the duplicates, use ownText(), which returns only the direct text of the element and not the text of its children.
Long story short, change this:
for(Element elem : walkthroughs.select("div")) {
    for(Element inner : elem.children()) {
        outFile.println(inner.text());
    }
}
To this:
for(Element elem : walkthroughs.select("div")) {
    for(Element inner : elem.children()) {
        String line = inner.ownText().trim();
        if(!line.equals("")) //Skip empty lines
            outFile.println(line);
    }
}
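To see why the duplication happens without pulling in jsoup, here is a minimal library-free sketch. The Node class is a hypothetical stand-in for an HTML element, not jsoup's implementation; its text() and ownText() methods only mimic the behaviour described above, and it simplifies ordering by always putting an element's own text before its children's text.

```java
import java.util.ArrayList;
import java.util.List;

public class TextVsOwnText {
    /** Hypothetical stand-in for an HTML element: direct text plus child elements. */
    static final class Node {
        final String own;
        final List<Node> children = new ArrayList<>();

        Node(String own, Node... kids) {
            this.own = own;
            for (Node kid : kids) {
                children.add(kid);
            }
        }

        /** Mimics ownText(): only the text directly inside this element. */
        String ownText() {
            return own;
        }

        /** Mimics text(): own text followed by the text of every descendant. */
        String text() {
            StringBuilder sb = new StringBuilder(own);
            for (Node child : children) {
                String childText = child.text();
                if (!childText.isEmpty()) {
                    if (sb.length() > 0) {
                        sb.append(' ');
                    }
                    sb.append(childText);
                }
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) {
        Node span = new Node("text of span");
        Node div = new Node("text of div", span);

        // Visiting both elements with text() repeats the span's text.
        System.out.println(div.text());    // text of div text of span
        System.out.println(span.text());   // text of span

        // Visiting both elements with ownText() prints each piece once.
        System.out.println(div.ownText()); // text of div
        System.out.println(span.ownText()); // text of span
    }
}
```

The first pair of lines contains "text of span" twice, which is the duplication seen in the output file; the second pair contains it once.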

Related

Remove FixedLeading at the first line on each page

I want to remove the effect of setFixedLeading on the first line of each page (there are 100+ pages).
I read a large text (more than 100 pages) in a while loop. I set padding and margin to 0, but I still get an indent at the top of each page. Why does this happen, and how can I remove it?
public static final String DEST = "PDF.pdf";
public static void main(String[] args) throws FileNotFoundException {
    PdfDocument pdfDoc = new PdfDocument(new PdfWriter(DEST));
    Document doc = new Document(pdfDoc);
    doc.setMargins(0,0,0,0);
    for (int i = 0; i <20 ; i++) {
        Paragraph element = new Paragraph("p " + i);
        element.setPadding(0);
        element.setMargin(0);
        element.setFixedLeading(55);
        doc.add(element);
    }
    doc.close();
}
PDF file:
https://pdfhost.io/v/Byt9LHJcy_PDFpdf.pdf

At the time of element creation you don't know which page it will end up on, nor its resulting position. I don't think there is a property that lets you configure the behavior depending on whether it's the top element on a page (such a property would be too custom and tied to a specific workflow).
Fortunately, the layout mechanism is quite flexible and you can implement the desired behavior in a couple of lines of code.
First off, let's not use setFixedLeading and set the top margin for all paragraphs instead:
Document doc = new Document(pdfDocument);
doc.setMargins(0, 0, 0, 0);
for (int i = 0; i < 20; i++) {
    Paragraph element = new Paragraph("p " + i);
    element.setPadding(0);
    element.setMargin(0);
    element.setMarginTop(50);
    doc.add(element);
}
doc.close();
This changes almost nothing in the visual result; it's just another way of doing things.
Now we need a custom renderer to tweak the behavior of a paragraph when it is rendered at the top of the page. We are going to override the layout method and check whether the area we are given is located at the top of the page. If so, we will not apply the top margin:
private static class CustomParagraphRenderer extends ParagraphRenderer {
    Document document;
    public CustomParagraphRenderer(Paragraph modelElement, Document document) {
        super(modelElement);
        this.document = document;
    }
    @Override
    public IRenderer getNextRenderer() {
        return new ParagraphRenderer((Paragraph) modelElement);
    }
    @Override
    public LayoutResult layout(LayoutContext layoutContext) {
        if (layoutContext.getArea().getBBox().getTop() == document.getPdfDocument().getDefaultPageSize().getHeight()) {
            ((Paragraph)getModelElement()).setMarginTop(0);
        }
        return super.layout(layoutContext);
    }
}
Now the only thing we need to do is to set the custom renderer instance to each paragraph in the loop:
element.setNextRenderer(new CustomParagraphRenderer(element, doc));
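The layout override compares two float values with ==, which is fine for the zero-margin document in this example but would stop matching if the document had a non-zero top margin or if the coordinates picked up rounding error. A minimal sketch of a more tolerant check follows. The helper name isAtPageTop, the tolerance value, and the sample numbers are assumptions for illustration and are not part of iText.

```java
public class PageTopCheck {
    // Assumed tolerance in points; small enough not to match any real layout gap.
    private static final float EPS = 1e-3f;

    /**
     * Returns true when the top edge of the layout area coincides with the top
     * of the page's content box (page height minus the document's top margin).
     */
    static boolean isAtPageTop(float areaTop, float pageHeight, float topMargin) {
        return Math.abs(areaTop - (pageHeight - topMargin)) < EPS;
    }

    public static void main(String[] args) {
        float pageHeight = 842f; // assumed page height in points, for illustration

        System.out.println(isAtPageTop(842f, pageHeight, 0f));  // area starts at the page top
        System.out.println(isAtPageTop(787f, pageHeight, 0f));  // area starts lower on the page
        System.out.println(isAtPageTop(806f, pageHeight, 36f)); // page top with a 36pt top margin
    }
}
```

Inside the renderer, the values would come from layoutContext.getArea().getBBox().getTop() and the page size already used in the override; the top margin would have to be supplied by the caller.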

How to extract elements from a String with jsoup?

I want to write a small piece of code that will extract the "Kategorie" out of an href with jsoup.
Herrscher des Mittelalters
In this case I am searching for Herrscher des Mittelalters.
My code reads the first line of a .txt file with the BufferedReader.
BufferedReader r = new BufferedReader(new InputStreamReader(new FileInputStream(new File(FilePath)), Charset.forName("UTF-8")));
Document doc = Jsoup.parse(r.readLine());
Element elem = doc;
I know there are commands to get the href-link but I don't know commands to search for elements in the href-link.
Any suggestions?
Additional information: My .txt file contains full Wikipedia HTML pages.

This should get you all titles from links. You can split the titles further as you need:
Document d = Jsoup.parse("Herrscher des Mittelalters");
Elements links = d.select("a");
Set<String> categories = new HashSet<>();
for (Element link : links) {
    String title = link.attr("title");
    if (title.length() > 0) {
        categories.add(title);
    }
}
System.out.println(categories);
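Splitting the title further could look like the sketch below. It assumes the title attribute has the form "Kategorie:Herrscher des Mittelalters", which is a guess because the original link markup is not shown; the class and method names are hypothetical.

```java
public class CategoryTitle {
    /**
     * Returns the part of the title after the first colon, or the whole
     * title (trimmed) when it contains no colon.
     */
    static String categoryName(String title) {
        int colon = title.indexOf(':');
        return colon >= 0 ? title.substring(colon + 1).trim() : title.trim();
    }

    public static void main(String[] args) {
        // Assumed title format, for illustration only.
        System.out.println(categoryName("Kategorie:Herrscher des Mittelalters"));
        System.out.println(categoryName("Herrscher des Mittelalters"));
    }
}
```

Both calls print "Herrscher des Mittelalters", so the helper can be applied to every collected title whether or not it carries a namespace prefix.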

You can use the getElementsContainingText() method (org.jsoup.nodes.Document) to search for elements containing a given text.
Elements elements = doc.getElementsContainingText("Herrscher des Mittelalters");
for(int i=0; i<elements.size();i++) {
    Element element = elements.get(i);
    System.out.println(element.text());
}

How to update html div content using java

I am working on a Java RCP application. Whenever the user updates details in the UI, we are supposed to update the same details in the HTML report as well. Is there a way we can update or add HTML elements using Java? Using jsoup I am able to get the required element by ID, but I am not able to insert or update a new element in it.
Document htmlFile = null;
try {
    htmlFile = Jsoup.parse(new File("C:\\ItemDetails1.html"), "UTF-8");
} catch (IOException e) {
    e.printStackTrace();
}
Element div = htmlFile.getElementById("row2_comment");
System.out.println("text: " + div.html());
div.html("<li><b>Comments</b></li><ul><li>Testing for comment</li></ul>");
Any thoughts?

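Try:
Element div = htmlFile.getElementById("row2_comment");
div.appendElement("p")
    .attr("class", "beautiful")
    .text("Some New Text");
This adds a new paragraph with a class attribute and some text content to the div. Note that the change is made only to the in-memory Document; the file on disk stays the same until you write the document back.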
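jsoup edits the parsed document in memory and does not touch the file it was read from, so the report has to be written back explicitly. A minimal sketch of that step using only the JDK is shown below. The class and method names are hypothetical, and the HTML string stands in for what htmlFile.outerHtml() would return in the real program.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class HtmlReportWriter {
    /** Overwrites the file at target with the given HTML, encoded as UTF-8. */
    static void save(Path target, String html) {
        try {
            Files.write(target, html.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes html to a temporary file and reads it back; only used to demonstrate save(). */
    static String roundTrip(String html) {
        try {
            Path tmp = Files.createTempFile("report", ".html");
            try {
                save(tmp, html);
                return new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder for the string returned by htmlFile.outerHtml().
        String html = "<div id=\"row2_comment\"><p class=\"beautiful\">Some New Text</p></div>";
        System.out.println(roundTrip(html).equals(html));
    }
}
```

In the application, save would be called with the path of the report file after every change to the Document.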

JSoup extract only specific parts from Wikipedia

I have managed to extract the information in the "tables" on the right side of a Wikipedia article. However, I also want to get paragraphs from the main text of the articles.
The code I'm using at the moment only works about 60% of the time (I get NullPointerExceptions or no text at all). In the example below I'm only interested in the first two paragraphs, but that is irrelevant to my question.
In the picture below I show which parts I want the text from. I want to be able to iterate through all ... parts in the <div id="mw-content-text" ... class="mw-content-ltr"> block.
StringBuilder sb = new StringBuilder();
String url = baseUrl + location;
Document doc = Jsoup.connect(url).get();
Elements paragraphs = doc.select(".mw-content-ltr p");
Element firstParagraph = paragraphs.first();
Element elementTwo = firstParagraph.nextElementSibling();
if (elementTwo == null) {
    for (int i = 0; i < 2; i++) {
        sb.append(paragraphs.get(i).text());
    }
} else {
    sb.append(elementTwo.text());
}
return sb.toString();
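The code above fails when the selector matches nothing (first() then returns null) or matches fewer than two paragraphs, and it returns no text when the matched paragraphs are empty. One possible way to guard against both, sketched here without jsoup, is to collect the text of each matched paragraph into a list and take the first non-empty entries. The class and method names are hypothetical, and that empty paragraphs are the cause of the "no text" case is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class FirstParagraphs {
    /**
     * Joins the first `limit` non-blank paragraph texts with newlines.
     * Returns an empty string when there are none, instead of throwing.
     */
    static String firstNonEmpty(List<String> paragraphTexts, int limit) {
        StringBuilder sb = new StringBuilder();
        int taken = 0;
        for (String text : paragraphTexts) {
            if (taken == limit) {
                break;
            }
            if (text == null || text.trim().isEmpty()) {
                continue; // skip empty paragraphs
            }
            if (sb.length() > 0) {
                sb.append('\n');
            }
            sb.append(text.trim());
            taken++;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the text() of each element matched by the selector.
        List<String> texts = Arrays.asList("", "First paragraph.", "  ", "Second paragraph.", "Third paragraph.");
        System.out.println(firstNonEmpty(texts, 2));
        System.out.println(firstNonEmpty(Collections.<String>emptyList(), 2).isEmpty());
    }
}
```

With jsoup, the list would be filled by looping over the selected Elements and calling text() on each one.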

JSOUP - Getting value of textarea from HTML - CLOSED

I've just started using jsoup and I'm trying to get the value of a textarea. Below is the element info from the HTML:
Below is the code that I'm using to try to read the value of the textarea:
try {
    String html = "http://aviprobo.doorfree.com/control.html";
    Document doc = Jsoup.connect(html).get();
    Element textarea = doc.getElementById("control");
    System.out.println("textarea value = " + textarea.val());
} catch (IOException e) {
    //
}
The value of textarea.val() is empty. Could someone please point me in the right direction?
Thanks.

Document doc = Jsoup.connect("http://sports.163.com/13/0830/22/97IFSI5I00051CD5.html").get();
Entities.EscapeMode.base.getMap().clear();
Elements elements = doc.select("textarea[id^=photoList]");
for(Element e:elements){
    System.out.println(e.html());
}
