I am playing around with Nutch. I am trying to write something that also includes detecting specific nodes in the DOM structure and extracting text data from around the node, e.g. text from parent nodes, sibling nodes etc. I researched and read some examples, then tried writing a plugin that does this for an image node. Some of the code:
if("img".equalsIgnoreCase(nodeName) && nodeType == Node.ELEMENT_NODE){
String imageUrl = "No Url";
String altText = "No Text";
String imageName = "No Image Name"; //For the sake of simpler code, default values set to
//avoid nullpointerException in findMatches method
NamedNodeMap attributes = currentNode.getAttributes();
List<String>ParentNodesText = new ArrayList<String>();
ParentNodesText = getSurroundingText(currentNode);
//Analyze the attributes values inside the img node. <img src="xxx" alt="myPic">
for(int i = 0; i < attributes.getLength(); i++){
Attr attr = (Attr)attributes.item(i);
if("src".equalsIgnoreCase(attr.getName())){
imageUrl = getImageUrl(base, attr);
imageName = getImageName(imageUrl);
}
else if("alt".equalsIgnoreCase(attr.getName())){
altText = attr.getValue().toLowerCase();
}
}
private List<String> getSurroundingText(Node currentNode){
List<String> SurroundingText = new ArrayList<String>();
while(currentNode != null){
if(currentNode.getNodeType() == Node.TEXT_NODE){
String text = currentNode.getNodeValue().trim();
SurroundingText.add(text.toLowerCase());
}
if(currentNode.getPreviousSibling() != null && currentNode.getPreviousSibling().getNodeType() == Node.TEXT_NODE){
String text = currentNode.getPreviousSibling().getNodeValue().trim();
SurroundingText.add(text.toLowerCase());
}
currentNode = currentNode.getParentNode();
}
return SurroundingText;
}
This doesn't seem to work properly. The img tag gets detected, and the image name and URL get retrieved, but that's as far as it goes. The getSurroundingText method also looks too ugly; I tried but couldn't improve it. I don't have a clear idea of where and how I can extract text that could be related to the image. Any help please?
You're on the right track. However, take a look at this example HTML:
<div>
<span>test1</span>
<img src="http://example.com" alt="test image" title="awesome title">
<span>test2</span>
</div>
In your case, I think the problem lies in the sibling nodes of the img node: you're looking at the direct siblings, and in the example above you might expect those to be the span nodes, but here they are actually dummy whitespace text nodes, so when you ask for the sibling of the img you get an empty node with no actual text.
If we rewrite the previous HTML as: <div><span>test1</span><img src="http://example.com" alt="test image" title="awesome title"><span>test2</span></div> then the sibling nodes of the img would be the span nodes that you want.
I'm assuming that in the previous example you want to get both "test1" and "test2". In that case you need to keep moving until you find some Node.ELEMENT_NODE and then fetch the text inside that node. One good practice is not to grab everything you find, but to limit your scope to p, span and div to improve accuracy.
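As a rough illustration of that idea, here is a minimal sketch (a helper of my own, not part of Nutch or of your plugin) that walks the siblings of the img node, and then the siblings of its ancestors, skipping text nodes and only taking text from p, span and div elements:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.w3c.dom.Node;

public class NearbyTextCollector {

    private static final List<String> ACCEPTED = Arrays.asList("p", "span", "div");

    // Collect text from the nearest p/span/div element on each side of the node,
    // repeating the search at every ancestor level.
    public static List<String> collectNearbyText(Node imgNode) {
        List<String> texts = new ArrayList<String>();
        for (Node n = imgNode; n != null; n = n.getParentNode()) {
            addNearestElementText(n.getPreviousSibling(), false, texts);
            addNearestElementText(n.getNextSibling(), true, texts);
        }
        return texts;
    }

    // Walk siblings in one direction, skipping text nodes and non-matching elements,
    // until a p/span/div element is found; then take its text content.
    private static void addNearestElementText(Node start, boolean forward, List<String> texts) {
        for (Node n = start; n != null; n = forward ? n.getNextSibling() : n.getPreviousSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE
                    && ACCEPTED.contains(n.getNodeName().toLowerCase())) {
                String text = n.getTextContent().trim();
                if (!text.isEmpty()) {
                    texts.add(text.toLowerCase());
                }
                return;
            }
        }
    }
}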
In my Word template file I have some tables, and sometimes their second column is formatted as an enumeration.
Using docx4j I'm filling it with dynamic content, and if there's only one entry I need to get rid of the enumeration style.
I found a place deep down in the structure that has a value for the enumeration, but when setting it to null I don't see any changes in my template.
//This value is "Listenabsatz" (German) and I want to get rid of it
//Setting this value to "" or setting pStyle to null didn't help
((PStyle)((PPr)((P)((java.util.ArrayList)((Tc)((JAXBElement)templateRow.content.get(1)).value).content).get(0)).pPr).pStyle).val
In my actual code this is the place where I'm trying to change it:
Tr templateRow = (Tr) rows.get(0);
Tc cell = (Tc) ((javax.xml.bind.JAXBElement) templateRow.getContent().get(1)).getValue();
P par = (P) (cell.getContent().get(0));
PPr parStyle = par.getPPr();
if (parStyle.getPStyle() != null && parStyle.getPStyle().getVal() != null) {
    parStyle.setPStyle(null);
    //parStyle.getPStyle().setVal("");
}
How can I remove that enumeration style successfully?
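One hedged guess, not verified against docx4j or this template: in Word documents the bullets/numbers often come from the paragraph's numbering properties (numPr) rather than from the paragraph style alone, so clearing both might be needed:

Tr templateRow = (Tr) rows.get(0);
Tc cell = (Tc) ((javax.xml.bind.JAXBElement) templateRow.getContent().get(1)).getValue();
P par = (P) cell.getContent().get(0);
PPr parStyle = par.getPPr();
if (parStyle != null) {
    parStyle.setPStyle(null); // drop the "Listenabsatz" style reference
    parStyle.setNumPr(null);  // also drop the numbering reference, if present
}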
I'm using JSoup to grab content from web pages.
I want to get all the links on a page that contain some text (it doesn't matter what the text is); it just needs to be non-empty, i.e. not an image-only or empty link.
Example of links I want:
<a href="...">Link to Some Page</a>
Since it contains the text "Link to Some Page"
Links I don't want:
<img src="someimage.jpg"/>
My code looks like this. How can I modify it to only get the first type of link?
Document document = // I get my document object
Elements linksOnPage = document.select("a[href]")
for (Element page : linksOnPage) {
String link = page.attr("abs:href");
// I do stuff with the link
}
You could do something like this.
It does its job, though it's probably not the fanciest solution out there.
Note: the text() function gets you clean text, so if there are any HTML code fragments inside it, it won't return them.
Document doc = // get the doc
Elements linksOnPage = doc.select("a");
for (Element pageElem : linksOnPage) {
    if (pageElem.text().trim().isEmpty())
        continue;
    String link = pageElem.attr("abs:href");
    // do smth with it
}
I am using this and it's working fine:
Document document = // I get my document object
Elements linksOnPage = document.select("a:matches(([^\\s]+))");
for (Element page : linksOnPage) {
String link = page.attr("abs:href");
// I do stuff with the link
}
I am trying to extract img elements using Jsoup. It works fine for images without any space in the filename, but it extracts only the first part if there is a whitespace.
I tried the code below.
String result = Jsoup.clean(content,"https://rally1.rallydev.com/", Whitelist.relaxed().preserveRelativeLinks(true), new Document.OutputSettings().prettyPrint(false));
Document doc = Jsoup.parse(result);
Elements images = doc.select("img");
e.g. the HTML content:
Description:<div>some text content<br /></div>
<div><img src=/slm/attachment/43647556403/My file with space.png /></div>
<div><img src=/slm/attachment/43648152373/my_file_without_space.png/></div>
The result content is:
Description:Some text content<br> <img src="/slm/attachment/43647556403/My"><img src="/slm/attachment/43648152373/my_file_without_space.png/">
in "result" for the image with space in file name has only first part "My". It ignored the content after whitespace.
How to extract filename if that contains space?
The problem can't be easily solved in Jsoup, since the src attribute value of the example with spaces actually is correctly identified to be only My. In this example the file, with and space.png parts are also parsed as attributes without values. Of course you can use Jsoup to concatenate the attribute keys that follow the src attribute back onto its value, for example like this:
String test =""
+ "<div><img src=/slm/attachment/43647556403/My file with space.png /></div>"
+ "<div><img src=/slm/attachment/43647556403/My file with space.png name=whatever/></div>"
+ "<div><img src=/slm/attachment/43647556403/This breaks it.png name=whatever/></div>"
+ "<div><img src=\"/slm/attachment/43647556403/This works.png\" name=whatever/></div>"
+ "<div><img src=/slm/attachment/43648152373/my_file_without_space.png/></div>";
Document doc = Jsoup.parse(test);
Elements imgs = doc.select("img");
for (Element img : imgs){
Attribute src = null;
StringBuffer newSrcVal = new StringBuffer();
List<String> toRemove = new ArrayList<>();
for (Attribute a : img.attributes()){
if (a.getKey().equals("src")){
newSrcVal.append(a.getValue());
src = a;
}
else if (newSrcVal.length()>0){
//we already found the scr tag
if (a.getValue().isEmpty()){
newSrcVal.append(" ").append(a.getKey());
toRemove.add(a.getKey());
}
else{
//the empty attributes, i.e. file name parts are over
break;
}
}
}
for (String toRemAttr : toRemove){
img.removeAttr(toRemAttr);
}
src.setValue(newSrcVal.toString());
}
System.out.println(doc);
This algorithm cycles over all img elements and, within each img, it cycles over its attributes. When it finds the src attribute it keeps it for reference and starts to fill the newSrcVal StringBuffer. All following value-less attributes are appended to newSrcVal until either another attribute with a value is found or there are no more attributes. Finally the src attribute value is reset with the contents of newSrcVal and the former empty attributes are removed from the DOM.
Note that this will not work when your filename contains two or more consecutive spaces. JSoup discards those spaces between attributes and therefore you can't get them back after parsing. If you need that, then you need to manipulate the input html before parsing.
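If you do want to go the pre-parsing route, a very rough sketch of such a manipulation could look like this, applied to the original content string before the clean/parse calls. The regex is an assumption tailored to markup like the sample above, where src is unquoted and the only attribute on the tag; it is not a general-purpose fix:

// Quote unquoted src values (which may contain spaces) before parsing, so Jsoup
// sees the whole path as one attribute value. Assumes src is the only attribute.
String fixed = content.replaceAll("src=([^\"'>][^>]*?)\\s*/?>", "src=\"$1\">");
Document doc = Jsoup.parse(fixed);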
You can do something like this:
Elements images = doc.select("img");
for (Element image : images) {
    String imgSrc = image.attr("src");
    imgSrc = imgSrc.substring(imgSrc.lastIndexOf("/") + 1); // this will give you name.png
}
I want to parse the following with htmlparser. I wrote code for the title and it's working fine. I tried the following tags but nothing is working. Please help, I am doing this kind of programming for the first time.
1)
I want to retrieve the img src URL from the img tag:
<div id="images">
<img src="../images/abc.jpg" align="right" style="padding-right:5px;">
2) I want to retrieve text content between <li> tags.
<ul>
<li>hello</li>
<li>how r u?</li>
<li>bye</li>
</ul>
I tried the following code to retrieve the img tag's src URL, but it throws a NullPointerException.
Parser parser=new Parser();
HasAttributeFilter imgfil=new HasAttributeFilter("align","right");
NodeList img=parser.parse(imgfil);
Node node1=img.elementAt(0);
ImageTag tg=(ImageTag) node1;
String url=tg.getText();
System.out.println(url);
I tried the following snippet too, but nothing works.
NodeList img=parser.extractAllNodesThatMatch(new AndFilter(new TagNameFilter("img"),new HasAttributeFilter("align","right")));
SimpleNodeIterator iterate=img.elements();
while (iterate.hasMoreNodes())
{
    Node node1 = iterate.nextNode();
    ImageTag tag = (ImageTag) node1;
    System.out.println(tag.getImageURL());
}
The second bit of code you tried will work if corrected; the first line has the problem:
NodeList img=parser.extractAllNodesThatMatch(new AndFilter(new TagNameFilter("img"),new HasAttributeFilter("align","right")));
I think I understand how to fix the problem: don't use parser.extractAllNodesThatMatch(), use parser.parse() instead and see if that helps.
Here's an example of what I mean:
NodeFilter filter1 = new AndFilter(new TagNameFilter("IMG"), new HasParentFilter(new HasAttributeFilter("id", "featured_story_1"), true));
NodeList list = parser.parse(filter1);
for (int i = 0; i < list.size(); i++)
{
    Node node = list.elementAt(i);
    ImageTag image = (ImageTag) node;
    System.out.println(image.getImageURL());
}
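Applied to your own filter, a hedged sketch of the corrected snippet might look like this (the URL is just a placeholder; the likely cause of the NullPointerException is that the Parser was never given any page or HTML to parse):

Parser parser = new Parser("http://example.com"); // placeholder URL, use your own page
NodeList img = parser.parse(new AndFilter(new TagNameFilter("img"),
        new HasAttributeFilter("align", "right")));
SimpleNodeIterator iterate = img.elements();
while (iterate.hasMoreNodes())
{
    ImageTag tag = (ImageTag) iterate.nextNode();
    System.out.println(tag.getImageURL());
}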
Hope this helps!
I am currently working on an academic project, developing in Java and XML. The actual task is to parse XML and put the required values, preferably into a HashMap, for further processing. Here is a short snippet of the actual XML:
<root>
<BugReport ID = "1">
<Title>"(495584) Firefox - search suggestions passes wrong previous result to form history"</Title>
<Turn>
<Date>'2009-06-14 18:55:25'</Date>
<From>'Justin Dolske'</From>
<Text>
<Sentence ID = "3.1"> Created an attachment (id=383211) [details] Patch v.2</Sentence>
<Sentence ID = "3.2"> Ah. So, there's a ._formHistoryResult in the....</Sentence>
<Sentence ID = "3.3"> The simple fix it to just discard the service's form history result.</Sentence>
<Sentence ID = "3.4"> Otherwise it's trying to use a old form history result that no longer applies for the search string.</Sentence>
</Text>
</Turn>
<Turn>
<Date>'2009-06-19 12:07:34'</Date>
<From>'Gavin Sharp'</From>
<Text>
<Sentence ID = "4.1"> (From update of attachment 383211 [details])</Sentence>
<Sentence ID = "4.2"> Perhaps we should rename one of them to _fhResult just to reduce confusion?</Sentence>
</Text>
</Turn>
<Turn>
<Date>'2009-06-19 13:17:56'</Date>
<From>'Justin Dolske'</From>
<Text>
<Sentence ID = "5.1"> (In reply to comment #3)</Sentence>
<Sentence ID = "5.2"> > (From update of attachment 383211 [details] [details])</Sentence>
<Sentence ID = "5.3"> > Perhaps we should rename one of them to _fhResult just to reduce confusion?</Sentence>
<Sentence ID = "5.4"> Good point.</Sentence>
<Sentence ID = "5.5"> I renamed the one in the wrapper to _formHistResult. </Sentence>
<Sentence ID = "5.6"> fhResult seemed maybe a bit too short.</Sentence>
</Text>
</Turn>
.....
and so on
</BugReport>
There are many commenters like 'Justin Dolske' who have commented on this report, and what I am actually looking for is the list of commenters and all the sentences they have written in the whole XML file. Something like if(from == justin dolske) getHisAllSentences(). Similarly for all other commenters. I have tried many different ways to get the sentences only for 'Justin Dolske' or other commenters, even in a generic form for all, using XPath, SAX and DOM, but failed. I am quite new to these technologies, including Java, and don't know how to achieve it.
Can anyone guide me specifically how could I get it with any of above technologies or is there any other better strategy to do it?
(Note: later I want to put it in a HashMap(key, value), where key = name of the commenter (e.g. justin dolske) and value = all of their sentences.)
Urgent help will be highly appreciated.
There are several ways in which you can achieve your requirement.
One way would be to use JAXB. There are several tutorials available on the web, so feel free to refer to them.
You can also think of creating a DOM and then extracting data from it and then put it into your HashMap.
One reference implementation would be something like this:
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XMLReader {

    private HashMap<String, ArrayList<String>> namesSentencesMap;

    public XMLReader() {
        namesSentencesMap = new HashMap<String, ArrayList<String>>();
    }

    private Document getDocument(String fileName) {
        Document document = null;
        try {
            document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File(fileName));
        } catch (Exception exe) {
            //handle exception
        }
        return document;
    }

    private void buildNamesSentencesMap(Document document) {
        if (document == null) {
            return;
        }
        //Get each Turn block
        NodeList turnList = document.getElementsByTagName("Turn");
        String fromName = null;
        NodeList sentenceNodeList = null;
        for (int turnIndex = 0; turnIndex < turnList.getLength(); turnIndex++) {
            Element turnElement = (Element) turnList.item(turnIndex);
            //Assumption: <From> element
            Element fromElement = (Element) turnElement.getElementsByTagName("From").item(0);
            fromName = fromElement.getTextContent();
            //Extracting sentences - First check whether the map contains
            //an ArrayList corresponding to the name. If yes, then use that,
            //else create a new one
            ArrayList<String> sentenceList = namesSentencesMap.get(fromName);
            if (sentenceList == null) {
                sentenceList = new ArrayList<String>();
            }
            //Extract sentences from the Turn node
            try {
                sentenceNodeList = turnElement.getElementsByTagName("Sentence");
                for (int sentenceIndex = 0; sentenceIndex < sentenceNodeList.getLength(); sentenceIndex++) {
                    sentenceList.add(((Element) sentenceNodeList.item(sentenceIndex)).getTextContent());
                }
            } finally {
                sentenceNodeList = null;
            }
            //Put the list back in the map
            namesSentencesMap.put(fromName, sentenceList);
        }
    }

    public static void main(String[] args) {
        XMLReader reader = new XMLReader();
        reader.buildNamesSentencesMap(reader.getDocument("<your_xml_file>"));
        for (String names : reader.namesSentencesMap.keySet()) {
            System.out.println("Name: " + names + "\tTotal Sentences: " + reader.namesSentencesMap.get(names).size());
        }
    }
}
Note: this is just a demonstration and you would need to modify it to suit your needs. I've created it based on your XML to show one way of doing it.
I suggest using JAXB to create a data model reflecting your XML structure.
Once done, you can load the XML into Java instances.
Put each 'Turn' into a Map<String, List<Turn>>, using Turn.From as the key.
Once that's done, you'll be able to write:
List< Turn > justinsTurn = allTurns.get( "'Justin Dolske'" );
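For illustration, here is a minimal sketch of that JAXB approach, under my own assumptions: the class and field names (Root, BugReport, Turn, ...) are invented to match the sample XML rather than a tested model, "bugreport.xml" is a placeholder file name, and javax.xml.bind is assumed to be on the classpath (it was removed from the JDK after Java 8):

import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlElementWrapper;
import javax.xml.bind.annotation.XmlRootElement;

public class JaxbTurnsDemo {

    @XmlRootElement(name = "root")
    @XmlAccessorType(XmlAccessType.FIELD)
    public static class Root {
        @XmlElement(name = "BugReport")
        public List<BugReport> bugReports = new ArrayList<BugReport>();
    }

    @XmlAccessorType(XmlAccessType.FIELD)
    public static class BugReport {
        @XmlElement(name = "Title")
        public String title;
        @XmlElement(name = "Turn")
        public List<Turn> turns = new ArrayList<Turn>();
    }

    @XmlAccessorType(XmlAccessType.FIELD)
    public static class Turn {
        @XmlElement(name = "Date")
        public String date;
        @XmlElement(name = "From")
        public String from;
        @XmlElementWrapper(name = "Text")
        @XmlElement(name = "Sentence")
        public List<String> sentences = new ArrayList<String>();
    }

    public static void main(String[] args) throws Exception {
        // Unmarshal the file into the data model
        Root root = (Root) JAXBContext.newInstance(Root.class)
                .createUnmarshaller().unmarshal(new File("bugreport.xml"));

        // Group all Turns by their From value
        Map<String, List<Turn>> allTurns = new HashMap<String, List<Turn>>();
        for (BugReport report : root.bugReports) {
            for (Turn turn : report.turns) {
                List<Turn> turnsForName = allTurns.get(turn.from);
                if (turnsForName == null) {
                    turnsForName = new ArrayList<Turn>();
                    allTurns.put(turn.from, turnsForName);
                }
                turnsForName.add(turn);
            }
        }

        // Note: the sample XML keeps the single quotes inside the <From> element
        List<Turn> justinsTurns = allTurns.get("'Justin Dolske'");
        System.out.println(justinsTurns == null ? 0 : justinsTurns.size());
    }
}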