How to extract elements from a String with jsoup?

How to extract elements from a String with jsoup? - java

I want to write a small piece of code that will exctract the "Kategorie" out of a href with jsoup.
Herrscher des Mittelalters
In this case I am searching for Herrscher des Mittelalters.
My code reads the first line of a .txt file with the BufferedReader.
BufferedReader r = new BufferedReader(new InputStreamReader(new FileInputStream(new File(FilePath)), Charset.forName("UTF-8")));
Document doc = Jsoup.parse(r.readLine());
Element elem = doc;
I know there are commands to get the href-link but I don't know commands to search for elements in the href-link.
Any suggestions?
Additional information: My .txt file contains full Wikipedia HTML pages.

This should get you all titles from links. You can split the titles further as you need:
Document d = Jsoup.parse("Herrscher des Mittelalters");
Elements links = d.select("a");
Set<String> categories = new HashSet<>();
for (Element script : links) {
String title = script.attr("title");
if (title.length() > 0) {
categories.add(title);
}
}
System.out.println(categories);

You can use getElementsContainingText() method (org.jsoup.nodes.Document) to search for elements with with any text.
Elements elements = doc.getElementsContainingText("Herrscher des Mittelalters");
for(int i=0; i<elements.size();i++) {
Element element = elements.get(i);
System.out.println(element.text());
}

Related

Text associated to PDF paragraph in document content object wit PDFBox

I'm trying to get the text associated to a paragraph navigating through the content tree of a PDF file. I am using PDFBox and cannot find the link between the paragraph and the text that it contains (see code below):
public class ReadPdf {
public static void main( String[] args ) throws IOException{
MyBufferedWriter out = new MyBufferedWriter(new FileWriter(new File(
"C:/Users/wip.txt")));
RandomAccessFile raf = new RandomAccessFile(new File(
"C:/Users/mypdf.pdf"), "r");
PDFParser parser = new PDFParser(raf);
parser.parse();
COSDocument cosDoc = parser.getDocument();
out.write(cosDoc.getXrefTable().toString());
out.write(cosDoc.getObjects().toString());
PDDocument document = parser.getPDDocument()
document.getClass();
COSParser cosParser = new COSParser(raf);
PDStructureTreeRoot treeRoot = document.getDocumentCatalog().getStructureTreeRoot();
for (Object kid : treeRoot.getKids()){
for (Object kid2 :((PDStructureElement)kid).getKids()){
PDStructureElement kid2c = (PDStructureElement)kid2;
if (kid2c.getStandardStructureType() == "P"){
for (Object kid3 : kid2c.getKids()){
if (kid3 instanceof PDStructureElement){
PDStructureElement kid3c = (PDStructureElement)kid3;
}
else{
for (Entry<COSName, COSBase>entry : kid2c.getCOSObject().entrySet()){
// Print all the Keys in the paragraph COSDictionary
System.out.println(entry.getKey().toString());
System.out.println(entry.getValue().toString());}
}}}}}}}
When I print the contents I get the following Keys:
/P : Reference to Parent
/A : Format of the paragraph
/K : Position of the paragraph in the section
/C : Name of the paragraph (!= text)
/Pg : Reference to the page
Example output:
COSName{K}
COSInt{2}
COSName{Pg}
COSObject{12, 0}
COSName{C}
COSName{Normal}
COSName{A}
COSObject{434, 0}
COSName{S}
COSName{Normal}
COSName{P}
COSObject{421, 0}
Now none of these points to the actual text inside the paragraph.
I know that the relation can be obtained as it is parsed when I open the document with acrobat (see pic below):

I found a way to do this through the parsing of the Content Stream from a page.
Navigating through the PDF Specification Chapter 10.6.3 there is a link between the numbering of each Text Stream which comes under \P \MCID and an attribute of the Tag (PDStructureElement in PDFBox) which can be found in the COSObject.
1) To get the text and the MCID:
PDPage pdPage;
Iterator<PDStream> inputStream = pdPage.getContentStreams();
while (inputStream.hasNext()) {
try {
PDFStreamParser parser2 = new PDFStreamParser((PDStream)inputStream.next());
parser2.parse();
List<Object> tokens = parser2.getTokens();
for (int j = 0; j < tokens.size(); j++){
tokenString = (tokenString + tokens.get(j).toString()}
// here comes the parsing of the string. Chapter 5 specifies what each of the operators Tj (actual text), Tm, BDC, BT, ET, EMC mean, MCID
Then to get the tags and their attribute that matches MCID:
PDStructureElement pDStructureElement;
pDStructureElement .getCOSObject().getInt(COSName.K)
That should do it. In documents without Tags (document.getDocumentCatalog().getStructureTreeRoot() is empty of children) this match cannot be performed but the text can still be read using step 1.

Jsoup parsing html duplication on writing to file

I seem to be having this error where text is being written to a file twice, the first time with incorrect formatting and the second with correct formatting. The method below takes in this URL after it's been converted properly. The method is supposed to get print a newline in between the text conversion of all of the children of dividers that are children of the divider "ffaq" where all the body text resides. Any help would be appreciated. I'm fairly new to using jsoup so an explanation would be nice as well.
/**
* Method to deal with HTML 5 Gamefaq entries.
* #param url The location of the HTML 5 entry to read.
**/
public static void htmlDocReader(URL url) {
try {
Document doc = Jsoup.parse(url.openStream(), "UTF-8", url.toString());
//parse pagination label
String[] num = doc.select("div.span12").
select("ul.paginate").
select("li").
first().
text().
split("\\s+");
//get the max page number
final int max_pagenum = Integer.parseInt(num[num.length - 1]);
//create a new file based on the url path
File file = urlFile(url);
PrintWriter outFile = new PrintWriter(file, "UTF-8");
//Add every page to the text file
for(int i = 0; i < max_pagenum; i++) {
//if not the first page then change the url
if(i != 0) {
String new_url = url.toString() + "?page=" + i;
doc = Jsoup.parse(new URL(new_url).openStream(), "UTF-8",
new_url.toString());
}
Elements walkthroughs = doc.select("div.ffaq");
for(Element elem : walkthroughs.select("div")) {
for(Element inner : elem.children()) {
outFile.println(inner.text());
}
}
}
outFile.close();
} catch(Exception e) {
e.printStackTrace();
System.exit(1);
}
}

For every element you call text() you print all the text of its structure.
Assume the below example
<div>
text of div
<span>text of span</span>
</div>
if you call text() for div element you will get
text of div text of span
Then if you call text() for span you will get
text of span
What you need, in order to avoid duplicates is to use ownText(). This will get only the direct text of the element, and not the text of its children.
Long story sort change this
for(Element elem : walkthroughs.select("div")) {
for(Element inner : elem.children()) {
outFile.println(inner.text());
}
}
To this
for(Element elem : walkthroughs.select("div")) {
for(Element inner : elem.children()) {
String line = inner.ownText().trim();
if(!line.equals("")) //Skip empty lines
outFile.println(line);
}
}

JSoup extract only specific parts from Wikipedia

I have managed to extract the information in the "tables" on the right side of a Wikipedia article. However I also want to get paragraphs from the main text of the articles.
The code I'm using atm is only working about 60% of the time(Nullpointers or no text at all). In the example below I'm only interested in the tho first paragraphs, however that is irrelevant for my question.
In the picture below I show what parts I want the text from. I want to be able to iterate through all ... parts in the < divid="mw-content-text"....class="mw-content-ltr"> block.
StringBuilder sb = new StringBuilder();
String url = baseUrl + location;
Document doc = Jsoup.connect(url).get();
Elements paragraphs = doc.select(".mw-content-ltr p");
Element firstParagraph = paragraphs.first();
Element elementTwo = firstParagraph.nextElementSibling();
if (elementTwo == null) {
for (int i = 0; i < 2; i++) {
sb.append(paragraphs.get(i).text());
}
} else {
sb.append(elementTwo.text());
}
return sb.toString();

How do you find/replace a placeholder in a .docx file with Apache POI?

I have a file, "template.docx" that I would like to have placeholders (ie. [serial number]) that can be replaced with a string or maybe a table. I am using Apache POI and no i cannot use docx4j.
Is there a way to have the program iterate over all occurrences of "[serial number]" and replace them with a string? Many of these tags will be inside a large table so is there some equivalent command with the Apache POI to just pressing ctrl+f in word and using replace all?
Any suggestions would be appreciated, thanks

XWPFDocument (docx) has different kind of sub-elements like XWPFParagraphs, XWPFTables, XWPFNumbering etc.
Once you create XWPFDocument object via:
document = new XWPFDocument(inputStream);
You can iterate through all of Paragraphs:
document.getParagraphsIterator();
When you iterator through Paragraphs, For each Paragraph you will get multiple XWPFRuns which are multiple text blocks with same styling, some times same styling text blocks will be split into multiple XWPFRuns in which case you should look into this question to avoid splitting of your Runs, doing so will help identify your placeHolders without merging multiple Runs within same Paragraph. At this point you should expect that your placeHolder will not be split in multiple runs if that's the case then you can go ahead and Iterate over 'XWPFRun's for each paragraph and look for text matching your placeHolder, something like this will help:
XWPFParagraph para = (XWPFParagraph) xwpfParagraphElement;
for (XWPFRun run : para.getRuns()) {
if (run.getText(0) != null) {
String text = run.getText(0);
Matcher expressionMatcher = expression.matcher(text);
if (expressionMatcher.find() && expressionMatcher.groupCount() > 0) {
System.out.println("Expression Found...");
}
}
}
Where expressionMatcher is Matcher based on a RegularExpression for particular PlaceHolder. Try having regex that matches something optional before your PlaceHolder and after as well e.g \([]*)(PlaceHolderGroup)([]*)^, trust me it works best.
Once you find the right XWPFRun extract text of your interest in it and create a replacement text which should be easy enough, then you should replace new text with previous text in this particular run by:
run.setText(text, 0);
If you were to replace this whole XWPFRun with a completely a new XWPFRun or perhaps insert a new Paragraph/Table after the Paragraph owning this run, you would probably run into a few problems, like A. ConcurrentModificationException which means you cannot modify this List(of XWPFRuns) you are iterating and B. finding the position of new Element to insert. To resolve these issues you should have a List<XWPFParagraph> of XWPFParagarphs that can hold paras after which new Element is to be inserted. Once you have your List of replacement you can iterator over it and for each replacement Paragraph you simply get a cursor and insert new element at that cursor:
for (XWPFParagraph para: paras) {
XmlCursor cursor = (XmlCursor) para.getCTP().newCursor();
XWPFTable newTable = para.getBody().insertNewTbl(cursor);
//Generate your XWPF table based on what's inside para with your own logic
}
To create an XWPFTable, read this.
Hope this helps someone.

// Text nodes begin with w:t in the word document
final String XPATH_TO_SELECT_TEXT_NODES = "//w:t";
try {
// Open the input file
String fileName="test.docx";
String[] splited=fileName.split(".");
File dir=new File("D:\\temp\\test.docx");
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new FileInputStream(dir));
// Build a list of "text" elements
List<?> texts = wordMLPackage.getMainDocumentPart().getJAXBNodesViaXPath(XPATH_TO_SELECT_TEXT_NODES, true);
HashMap<String, String> mappings = new HashMap<String, String>();
mappings.put("1", "one");
mappings.put("2", "two");
// Loop through all "text" elements
Text text = null;
for (Object obj : texts) {
text = (Text) ((JAXBElement<?>) obj).getValue();
String textToReplace = text.getValue();
if (mappings.keySet().contains(textToReplace)) {
text.setValue(mappings.get(textToReplace));
}
}
wordMLPackage.save(new java.io.File("D:/temp/forPrint.docx"));//your path
} catch (Exception e) {
}
}
}

How to extract links from Google HTML result page?

I am reading a text file that contains HTML code from Google search results. Then I parse it and I try to extract the links with this code:
FileReader in = new FileReader("A.txt");
BufferedReader p = new BufferedReader(in);
while(p.readLine() != null)
{
String html = p.readLine();
Document doc = Jsoup.parse(html);
Elements Link = doc.select("a[href");
for(Element element :Link)
{
if(element != null)
{
System.out.println(element);
}
}
}
But I got many non-link strings. How can I show the links, not anything else?

Please try again with a complete selector, not only "a[href":
Elements links = doc.select("a[href]"); // a with href
See the Selector document for the full support - especially the examples on the right side.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to extract elements from a String with jsoup? - java

Related

Text associated to PDF paragraph in document content object wit PDFBox

Jsoup parsing html duplication on writing to file

JSoup extract only specific parts from Wikipedia

How do you find/replace a placeholder in a .docx file with Apache POI?

How to extract links from Google HTML result page?

Categories

Resources