How to get text from nested span using Jsoup?

How to get text from nested span using Jsoup? - java

I'm trying to get the text in the span
using this code below. However the output is behaving as if the nested spans don't exist
Elements tags = document.select("div[id=tags]");
for (Element tag:tags){
Elements child_tags = tag.getElementsByTag("class");
String key = tag.html();
System.out.println(key); //only as a test
for (Element child_tag:child_tags){
System.out.println("\t" + child_tag.text());
}
My output is
<hr />Tags:
<span id="category"></span>
<span id="voteSelector" class="initially_hidden"> <br /> </span>

Assuming you are trying the code on https://chesstempo.com/chess-problems/15 and the data you want is shown in the below image
Now, Using Jsoup you will get the data whatever is rendered as a source code in the browser,for confirmation you can press CTRL+U in browser which will open up a new window where the actual contents which Jsoup will get will be displayed. Now coming to your questions the part which you are trying to retrieve itself is not present in the browser source code check that by pressing CTRL+U.
If the contents are rendered using JAVASCRIPT those will not be visible to JSOUP and hence you have to use something else which will run the javascript and provide you the details.
JSoup does not run Javascript and is not a browser.
EDIT
There is a turnaround using SELENIUM. Below is the working code to get the exact source code of the url and the required data which you are looking for:
import java.io.IOException;
import java.io.PrintWriter;
import org.json.simple.parser.ParseException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
public class JsoupDummy {
public static void main(String[] args) throws IOException, ParseException {
System.setProperty("webdriver.gecko.driver", "D:\\thirdPartyApis\\geckodriver-v0.19.1-win32\\geckodriver.exe");
WebDriver driver = new FirefoxDriver();
try {
driver.get("https://chesstempo.com/chess-problems/15");
Document doc = Jsoup.parse(driver.getPageSource());
Elements elements = doc.select("span.ct-active-tag");
for (Element element:elements){
System.out.println(element.html());
}
} catch (Exception e) {
e.printStackTrace();
} finally {
/*write.flush();
write.close();*/
driver.quit();
}
}
}
You need selenium web driver Selenium Web Driver which simulates the browser behaviour and allows you to render the html content written by scripts as well.

Elements child_tags = tag.getElementsByTag("class");
With this line you are trying to get an element with tag class i.e <class>...</class>, which dose not exist. Change that line to:
Elements child_tags = tag.getElementsByClass("tag");
to get elements by attribute value of class = tag or to:
Elements child_tags = tag.getElementsByTag("span");
to get elements by tag name = span.

Related

Extracting answers to a flattened PDF form with iText 7

We have a few forms created from Adobe LiveCycle where users fill the dynamic forms and submits the document to our office where we stamp it with our signature and flatten it (at least most of the time - I've seen a few documents in our system that haven't been flattened yet but that can be a separate question, I'll focus on the flattened documents here because that's most of what we have).
I'm trying to use iText 7 to parse/extract the user's answers to our forms for migrating to an electronic solution that will happen a few months from now. I was able to make the example work in Java but I don't understand the process.
/*
This file is part of the iText (R) project.
Copyright (c) 1998-2020 iText Group NV
Authors: iText Software.
For more information, please contact iText Software at this address:
sales#itextpdf.com
*/
/**
* Example written by Bruno Lowagie in answer to:
* http://stackoverflow.com/questions/24506830/can-we-use-text-extraction-strategy-after-applying-location-extraction-strategy
*/
package ca.umanitoba.ad.research;
import com.itextpdf.kernel.font.PdfFont;
import com.itextpdf.kernel.geom.Rectangle;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.canvas.parser.EventType;
import com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor;
import com.itextpdf.kernel.pdf.canvas.parser.data.IEventData;
import com.itextpdf.kernel.pdf.canvas.parser.data.TextRenderInfo;
import com.itextpdf.kernel.pdf.canvas.parser.filter.TextRegionEventFilter;
import com.itextpdf.kernel.pdf.canvas.parser.listener.FilteredEventListener;
import com.itextpdf.kernel.pdf.canvas.parser.listener.LocationTextExtractionStrategy;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.FileOutputStream;
import java.io.Writer;
import java.io.BufferedWriter;
public class Main {
public static final String DEST = "./target/txt/parse_custom.txt";
public static final String SRC = "./src/main/resources/pdfs/nameddestinations.pdf";
public static void main(String[] args) throws IOException {
File file = new File(DEST);
file.getParentFile().mkdirs();
new Main().manipulatePdf(DEST);
}
protected void manipulatePdf(String dest) throws IOException {
PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC));
Rectangle rect = new Rectangle(36, 750, 523, 56);
CustomFontFilter fontFilter = new CustomFontFilter(rect);
FilteredEventListener listener = new FilteredEventListener();
// Create a text extraction renderer
LocationTextExtractionStrategy extractionStrategy = listener
.attachEventListener(new LocationTextExtractionStrategy(), fontFilter);
// Note: If you want to re-use the PdfCanvasProcessor, you must call PdfCanvasProcessor.reset()
PdfCanvasProcessor parser = new PdfCanvasProcessor(listener);
parser.processPageContent(pdfDoc.getFirstPage());
// Get the resultant text after applying the custom filter
String actualText = extractionStrategy.getResultantText();
pdfDoc.close();
// See the resultant text in the console
System.out.println(actualText);
try (Writer writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(dest)))) {
writer.write(actualText);
}
}
/*
* The custom filter filters only the text of which the font name ends with Bold or Oblique.
*/
protected class CustomFontFilter extends TextRegionEventFilter {
public CustomFontFilter(Rectangle filterRect) {
super(filterRect);
}
#Override
public boolean accept(IEventData data, EventType type) {
if (type.equals(EventType.RENDER_TEXT)) {
TextRenderInfo renderInfo = (TextRenderInfo) data;
PdfFont font = renderInfo.getFont();
if (null != font) {
String fontName = font.getFontProgram().getFontNames().getFontName();
return fontName.endsWith("Bold") || fontName.endsWith("Oblique");
}
}
return false;
}
}
}
Why is there a need to specify a Rectangle? Our forms are dynamic so users can add more fields as needed and we also accept paragraphs on some of the questions so the length will always vary so it's unlikely that the coordinates of the texts will be the same.
How can I change the flow so that I can perhaps just search for the question and then get the text right after it (presumably the answer) - I don't really know what the best way to parse a PDF is. If there's no other way except providing a Rectangle, can I programmatically determine the coordinates/dimensions of the rectangles?
From the example it looks like it's filtering the text based on whether it's bolded or italicized which I probably don't need but it looks to be easy enough to fix by modifying/removing the accept() method.

Please take a look at what that example is for: In the JavaDoc comment you can read
/**
* Example written by Bruno Lowagie in answer to:
* http://stackoverflow.com/questions/24506830/can-we-use-text-extraction-strategy-after-applying-location-extraction-strategy
*/
and that stack overflow question starts with
I used the following code to get data in PDF from a particular location. I want to get bold text present in that location
When you wonder, therefore,
Why is there a need to specify a Rectangle?
the answer is: because the example is about finding bold text in a particular location.
You mention your forms were dynamic before flattening and fields don't have fixed positions. Thus, this filter probably is not optimal for your use case.
How can I change the flow so that I can perhaps just search for the question and then get the text right after it
In that case simply don't filter at all but use a plain LocationTextExtractionStrategy to extract text, search for the question text in the extracted text, and use the text thereafter up to the next question text.
Alternatively, if you still have the unflattened dynamic forms, you may consider extracting the xfa xml and extract the filled-in data from that xml.

How to rewrite one specific line of a text file in java

The image below shows the format of my settings file for a web bot I'm developing. If you look at line 31 in the image you will see it says chromeVersion. This is so the program knows which version of the chromedriver to use. If the user enters an invalid response or leaves the field blank the program will detect that and determine the version itself and save the version it determines to a string called chromeVersion. After this is done I want to replace line 31 of that file with
"(31) chromeVersion(76/77/78), if you don't know this field will be filled automatically upon the first run of the bot): " + chromeVersion
To be clear I do not want to rewrite the whole file I just want to either change the value assigned to chromeVersion in the text file or rewrite that line with the version included.
Any suggestions or ways to do this would be much appreciated.
image

You will need to rewrite the whole file, except the byte length of the file remains the same after your modification. Since this is not guaranteed to be the case or to find out is too cumbersome here is a simple procedure:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
public class Lab1 {
public static void main(String[] args) {
String chromVersion = "myChromeVersion";
try {
Path path = Paths.get("C:\\whatever\\path\\toYourFile.txt");
List<String> lines = Files.readAllLines(path, StandardCharsets.UTF_8);
int lineToModify = 31;
lines.set(lineToModify, lines.get(lineToModify)+ chromVersion);
Files.write(path, lines, StandardCharsets.UTF_8);
} catch (IOException ex) {
ex.printStackTrace();
}
}
}
Note that this is not the best way to go for very large files. But for the small file you have it is not an issue.

Selenium moveByOffset doesn't do anything

I'm running latest selenium 2.41 with Firefox 28.0 on Linux Xubuntu 13.10
I'm trying to get FirefoxDriver to move the mouse over the page (in my test, I've used the wired webpage, that has a lot of hover-activated menus), but the moveByOffset is not doing anything noticeable to the mouse, at all:
package org.openqa.mytest;
import java.util.List;
import java.io.File;
import java.lang.*;
import org.openqa.selenium.By;
import org.openqa.selenium.OutputType;
import org.openqa.selenium.TakesScreenshot;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.*;
import org.openqa.selenium.firefox.*;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.interactions.*;
import org.apache.commons.io.FileUtils;
public class Example {
public static void main(String[] args) throws Exception {
// The Firefox driver supports javascript
FirefoxProfile profile = new FirefoxProfile();
profile.setEnableNativeEvents(true);
WebDriver driver = new FirefoxDriver(profile);
// Go to the Google Suggest home page
driver.get("http://www.wired.com");
File scrFile = ((TakesScreenshot)driver).getScreenshotAs(OutputType.FILE);
// now save the screenshto to a file some place
FileUtils.copyFile(scrFile, new File("./screenshot.png"));
Actions builder = new Actions(driver);
Action moveM = builder.moveByOffset(40, 40).build();
moveM.perform();
Action click = builder.click().build();
click.perform();
//click.release();
Action moveM2 = builder.moveByOffset(50, 50).build();
moveM2.perform();
Action click2 = builder.click().build();
click2.perform();
//click2.release();
Action moveM3 = builder.moveByOffset(150, 540).build();
moveM3.perform();
for( int i=0; i < 1000; i++)
{
moveM = builder.moveByOffset(200, 200).build();
moveM.perform();
Thread.sleep(500);
moveM = builder.moveByOffset(-200, -200).build();
moveM.perform();
Thread.sleep(500);
}
//Action click3 = builder.click().build();
//click3.perform();
//click3.release();
scrFile = ((TakesScreenshot)driver).getScreenshotAs(OutputType.FILE);
// now save the screenshto to a file some place
FileUtils.copyFile(scrFile, new File("./screenshot2.png"));
driver.quit();
}
}
I'm expecting the mouse the move over the different elements and trigger all the hover actions, but nothing is happening

The method moveByOffset of class Actions is or has been broken. See Selenium WebDriver Bug 3578
(The error is described some lines more down in this bug document).
A project member (barancev) claims that this error should have been fixed with Selenium version 2.42.
Nevertheless I found the same error in version 2.44 running on openSUSE 12.3 with Firefox 33.0. moveToElement works, moveToOffset doesn't.

I struggled as well getting drag and drop working.
It seems as if selenium has problems if the dragtarget is not visible, thus scrolling is requiered.
Anyway, that's the (Java) code that works. Note that I call "release()" without an argument - neither the dropable Element nor the dragable Element as argument worked for me. As well as "moveToElement(dropable)" didnt work for me, that's why I calculated the offset manually.
public void dragAndDrop(WebElement dragable, WebElement dropable,
int dropableOffsetX, int dropableOffsetY) {
Actions builder = new Actions(driver);
int offsetX = dropable.getLocation().x + dropableOffsetX
- dragable.getLocation().x;
int offsetY = dropable.getLocation().y + dropableOffsetY
- dragable.getLocation().y;
builder.clickAndHold(dragable).moveByOffset(offsetX, offsetY).release()
.perform();
}

i was also struggling with this and the solution that worked for me is below we have to add 1 to either X or Y co-ordinate.
Looks like (x,y) takes us to the edge of the element where its not clickable
Below worked for me
WebElement elm = drv.findElement(By.name(str));
Point pt = elm.getLocation();
int NumberX=pt.getX();
int NumberY=pt.getY();
Actions act= new Actions(drv);
act.moveByOffset(NumberX+1, NumberY).click().build().perform();
you could even try adding +1 to y coordinate that also works
act.moveByOffset(NumberX+1, NumberY).click().build().perform();

Please try using moveToElement. It should work.
Actions action = new Actions(webdriver);
WebElement we = webdriver.findElement(By.xpath("<XPATH HERE>"));
action.moveToElement(we).moveToElement(webdriver.findElement(By.xpath("/expression-here"))).click().build().perform();

i suggest that if your browser is not perform movetoelement and move to offset then you have put wrong offset of element
for find offset you use Cordinates plugin in chrome

Parsing xml file contents without knowing xml file structure

I've been working on learning some new tech using java to parse files and for the msot part it's going well. However, I'm at a lost as to how I could parse an xml file to where the structure is not known upon receipt. Lots of examples of how to do so if you know the structure (getElementByTagName seems to be the way to go), but no dynamic options, at least not that I've found.
So the tl;dr version of this question, how can I parse an xml file where I cannot rely on knowing it's structure?

Well the parsing part is easy; like helderdarocha stated in the comments, the parser only requires valid XML, it does not care about the structure. You can use Java's standard DocumentBuilder to obtain a Document:
InputStream in = new FileInputStream(...);
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
(If you're parsing multiple documents, you can keep reusing the same DocumentBuilder.)
Then you can start with the root document element and use familiar DOM methods from there on out:
Element root = doc.getDocumentElement(); // perform DOM operations starting here.
As for processing it, well it really depends on what you want to do with it, but you can use the methods of Node like getFirstChild() and getNextSibling() to iterate through children and process as you see fit based on structure, tags, and attributes.
Consider the following example:
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
public class XML {
public static void main (String[] args) throws Exception {
String xml = "<objects><circle color='red'/><circle color='green'/><rectangle>hello</rectangle><glumble/></objects>";
// parse
InputStream in = new ByteArrayInputStream(xml.getBytes("utf-8"));
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
// process
Node objects = doc.getDocumentElement();
for (Node object = objects.getFirstChild(); object != null; object = object.getNextSibling()) {
if (object instanceof Element) {
Element e = (Element)object;
if (e.getTagName().equalsIgnoreCase("circle")) {
String color = e.getAttribute("color");
System.out.println("It's a " + color + " circle!");
} else if (e.getTagName().equalsIgnoreCase("rectangle")) {
String text = e.getTextContent();
System.out.println("It's a rectangle that says \"" + text + "\".");
} else {
System.out.println("I don't know what a " + e.getTagName() + " is for.");
}
}
}
}
}
The input XML document (hard-coded for example) is:
<objects>
<circle color='red'/>
<circle color='green'/>
<rectangle>hello</rectangle>
<glumble/>
</objects>
The output is:
It's a red circle!
It's a green circle!
It's a rectangle that says "hello".
I don't know what a glumble is for.

How do I load a javascript file into the DOM using selenium?

I'm using Selenium WebDriver to try to insert an external javascript file into the DOM, rather than type out the entire thing into executeScript.
It looks like it properly places the node into the DOM, but then it just disregards the source, i.e. the function on said source js file doesn't run.
Here is my code:
import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;
public class Example {
public static void main(String[] args) {
WebDriver driver = new FirefoxDriver();
driver.get("http://google.com");
JavascriptExecutor js = (JavascriptExecutor) driver;
js.executeScript("document.getElementsByTagName('head')[0].innerHTML += '<script src=\"<PATH_TO_FILE>\" type=\"text/javascript\"></script>';");
}
}
The code of the javascript file I am linking to is
alert("hi Nate");
I've placed the js file on my localhost, I called it using file:///, and I tried it on an external server. No dice.
Also, in the Java portion, I tried appending 'scr'+'ipt' using that trick, but it still didn't work. When I inspect the DOM using Firefox's inspect element, I can see it loads the script node properly, so I'm quite confused.
I also tried this solution, which apparently was made for another version of Selenium (not webdriver) and thus didn't work in the least bit: Load an external js file containing useful test functions in selenium

According to this: http://docs.seleniumhq.org/docs/appendix_migrating_from_rc_to_webdriver.jsp
You might be using the browserbot to obtain a handle to the current
window or document of the test. Fortunately, WebDriver always
evaluates JS in the context of the current window, so you can use
“window” or “document” directly.
Alternatively, you might be using the browserbot to locate elements.
In WebDriver, the idiom for doing this is to first locate the element,
and then pass that as an argument to the Javascript. Thus:
So does the following work in webdriver?
WebDriver driver = new FirefoxDriver();
((JavascriptExecutor) driver)
.executeScript("var s=window.document.createElement('script');\
s.src='somescript.js';\
window.document.head.appendChild(s);");

Injecting our JS-File into DOM
Injecting our JS-File into browsers application from our local server, so that we can access our function using document object.
injectingToDOM.js
var getHeadTag = document.getElementsByTagName('head')[0];
var newScriptTag = document.createElement('script');
newScriptTag.type='text/javascript';
newScriptTag.src='http://localhost:8088/WebApplication/OurOwnJavaScriptFile.js';
// adding <script> to <head>
getHeadTag.appendChild(newScriptTag);
OurSeleniumCode.java
String baseURL = "http://-----/";
driver = new FirefoxDriver();
driver.navigate().to(baseURL);
JavascriptExecutor jse = (JavascriptExecutor) driver;
Scanner sc = new Scanner(new FileInputStream(new File("injectingToDOM.js")));
String inject = "";
while (sc.hasNext()) {
String[] s = sc.next().split("\r\n");
for (int i = 0; i < s.length; i++) {
inject += s[i];
inject += " ";
}
}
jse.executeScript(inject);
jse.executeScript("return ourFunction");
OurOwnJavaScriptFile.js
document.ourFunction = function(){ .....}
Note : If you are passing JS-File as String to executeScript() then don't use any comments in between JavaScript code, like injectingToDOM.js remove all comments data.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to get text from nested span using Jsoup? - java

Related

Extracting answers to a flattened PDF form with iText 7

How to rewrite one specific line of a text file in java

Selenium moveByOffset doesn't do anything

Parsing xml file contents without knowing xml file structure

How do I load a javascript file into the DOM using selenium?

Categories

Resources