Get the link from an HTML file - Java

I use HtmlCleaner to parse HTML files. Here is an example from an HTML file:
.......<div class="name">Name</div>;......
I get the word Name using this construction in my code:
HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();
props.setAllowHtmlInsideAttributes(true);
props.setAllowMultiWordAttributes(true);
props.setRecognizeUnicodeChars(true);
props.setOmitComments(true);
TagNode rootNode = cleaner.clean(htmlPage);
List<TagNode> linkList = new ArrayList<TagNode>();
TagNode[] linkElements = rootNode.getElementsByName("div", true);
for (int i = 0; linkElements != null && i < linkElements.length; i++) {
    String classType = linkElements[i].getAttributeByName("class"); // read the div's class attribute
    if (classType != null && classType.equals(CSSClassname)) {
        linkList.add(linkElements[i]);
    }
    System.out.println("TagNode" + linkElements[i].getText());
}
and then add all of these names to a ListView using:
String name = linkElements[i].getText().toString();
But I don't understand how to get the link from my example. I want to get the link http://exxample.com, but I don't know what to do.
Please help me. I read the tutorial and used the functions, but I can't get it working.
P.S. Sorry for my bad English.

I don't use HtmlCleaner, but according to the javadoc you do it this way:
List<String> links = new ArrayList<String>();
for (TagNode aTag : linkElements[i].getElementListByName("a", false)) {
    String link = aTag.getAttributeByName("href");
    if (link != null && link.length() > 0) links.add(link);
}
P.S.: you posted clearly uncompilable code.
P.P.S.: why don't you use a library that creates an ordinary DOM tree from HTML? That way you'd be able to work with the parsed document through a well-known API.
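For illustration, a minimal sketch using jsoup (my own choice of library here, not one the answer names; it builds its own DOM-like tree rather than a w3c DOM) to pull every href out of the page:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.parse(htmlPage);   // htmlPage is the raw HTML string from the question
for (Element a : doc.select("a[href]")) { // every anchor in the page
    String link = a.attr("href");        // e.g. http://exxample.com
}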


iText 7.0.5: How to combine PDF and have existing bookmarks indented under new bookmarks for each document?

Problem:
com.itextpdf.kernel.PdfException: Pdf indirect object belongs to other PDF document. Copy object to current pdf document.
I want to combine PDF documents with an edited set of bookmarks that keeps a clear pairing of the bookmarks with each original document. I also want a new top-level bookmark describing the set as a whole, so the result combines cleanly with yet other documents later if the user chooses. The number of documents combined and the number of bookmarks in each is unknown, and some documents might not have any bookmarks.
For simplicity, assume I have two documents with two pages and a bookmark to the second page in each. I would want the combined document to have a bookmark structure like this, where "NEW" are the ones I am creating based on metadata I have about each source document and "EXISTING" are whatever I copy from the individual documents:
-- NEW Combined Document meta (page 1)
---- NEW Document one meta (page 1)
------ EXISTING Doc one link (page 2)
---- NEW Document two meta (page 3)
------ EXISTING Doc two link (page 4)
Code:
private static String combinePdf(List<String> allFile, LinkedHashMap<String, String> bookmarkMetaMap, Connection conn) throws IOException {
    System.out.println("=== combinePdf() ENTER"); // TODO REMOVE
    File outFile = File.createTempFile("combinePdf", ".pdf", new File(DocumentObj.TEMP_DIR_ON_SERVER));
    if (!outFile.exists() || !outFile.canWrite()) {
        throw new IOException("Unable to create writeable file in " + DocumentObj.TEMP_DIR_ON_SERVER);
    }
    if (bookmarkMetaMap == null || bookmarkMetaMap.isEmpty()) {
        bookmarkMetaMap = new LinkedHashMap<>(); // prevent NullPointer below
        bookmarkMetaMap.put("Documents", "Documents");
    }
    try (PdfDocument allPdfDoc = new PdfDocument(new PdfWriter(outFile))) {
        allPdfDoc.initializeOutlines();
        allPdfDoc.getCatalog().setPageMode(PdfName.UseOutlines);
        PdfMerger allPdfMerger = new PdfMerger(allPdfDoc, true, false); // build own outline
        Iterator<Map.Entry<String, String>> itr = bookmarkMetaMap.entrySet().iterator();
        PdfOutline rootOutline = allPdfDoc.getOutlines(false);
        PdfOutline mainOutline = rootOutline.addOutline(itr.next().getValue());
        mainOutline.addDestination(PdfExplicitDestination.createFit(allPdfDoc.getNumberOfPages() + 1));
        int fileNum = 0;
        for (String oneFile : allFile) {
            PdfDocument onePdfDoc = new PdfDocument(new PdfReader(oneFile));
            PdfAcroForm oneForm = PdfAcroForm.getAcroForm(onePdfDoc, false);
            if (oneForm != null) {
                oneForm.flattenFields();
            }
            allPdfMerger.merge(onePdfDoc, 1, onePdfDoc.getNumberOfPages());
            fileNum++;
            String bookmarkLabel = itr.hasNext() ? itr.next().getKey() : "Document " + fileNum;
            PdfOutline linkToDoc = mainOutline.addOutline(bookmarkLabel);
            linkToDoc.addDestination(PdfExplicitDestination.createFit(allPdfDoc.getNumberOfPages() + 1));
            PdfOutline srcDocOutline = onePdfDoc.getOutlines(false);
            if (srcDocOutline != null) {
                List<PdfOutline> outlineList = srcDocOutline.getAllChildren();
                if (!outlineList.isEmpty()) {
                    for (PdfOutline p : outlineList) {
                        linkToDoc.addOutline(p); // if I comment this out, no error, but links wrong order
                    }
                }
            }
            onePdfDoc.close();
        }
        System.out.println("=== combinePdf() DONE ADDING PAGES ==="); // TODO REMOVE
    }
    return outFile.getAbsolutePath();
}
Error occurs after the debug line "=== combinePdf() DONE ADDING PAGES ===" so the for loop completes as expected.
This means the error occurs when allPdfDoc is automagically closed.
If I remove the line linkToDoc.addOutline(p); I get all of my links and they go to the correct pages but they are not nested/ordered as I want:
-- NEW Combined Document meta(page 1)
---- NEW Document one meta (page 1)
---- NEW Document two meta (page 3)
-- EXISTING Doc one link (page 2)
-- EXISTING Doc two link (page 4)
With the aforementioned line commented out, I am not even sure how the EXISTING links are included at all. I have the mergeOutlines flag set to false in the PdfMerger constructor, since I thought I had to construct my own outline. I get similar results whether I pass true or false to getOutlines(), and also if I take out my arbitrary top-level new bookmark.
I know how to create a flattened list of new and existing bookmarks in the desired order. So my question is about how to get both the indenting and ordering as desired.
Thanks for taking a look!
Rather than shift bookmarks in the combined PDF, I did it in the component PDF before merging.
Feedback welcome, especially if something is horribly inefficient as PDF size increases:
private static void shiftPdfBookmarksUnderNewBookmark(PdfDocument pdfDocument, String bookmarkLabel) {
    if (pdfDocument == null || pdfDocument.getWriter() == null) {
        log.warn("shiftPdfBookmarksUnderNewBookmark(): no writer linked to PDFDocument, cannot modify bookmarks");
        return;
    }
    pdfDocument.initializeOutlines();
    try {
        PdfOutline rootOutline = pdfDocument.getOutlines(false);
        PdfOutline subOutline = rootOutline.addOutline(bookmarkLabel);
        subOutline.addDestination(PdfExplicitDestination.createFit(pdfDocument.getFirstPage())); // Not sure why this is needed, but problems if omitted.
        List<PdfOutline> pdfOutlineChildren = rootOutline.getAllChildren();
        if (pdfOutlineChildren.size() == 1) {
            return; // only the new bookmark exists; nothing to shift
        }
        for (PdfOutline p : pdfOutlineChildren) {
            if (p != subOutline) {
                if (p.getDestination() == null) {
                    continue;
                }
                subOutline.addOutline(p);
            }
        }
        rootOutline.getAllChildren().clear();
        rootOutline.addOutline(subOutline);
        subOutline.addDestination(PdfExplicitDestination.createFit(pdfDocument.getFirstPage())); // not sure why the duplicate of the line above seems to be needed
    } catch (Exception logAndIgnore) {
        log.warn("shiftPdfBookmarksUnderNewBookmark ignoring error and not shifting bookmarks: " + logAndIgnore, logAndIgnore);
    }
}
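For context, a minimal sketch of how this helper could slot into the merge loop from the question. The tempCopy name is hypothetical, and the assumption is that PdfMerger is then constructed with its mergeOutlines flag set to true so the already-nested outline is copied over as-is:

// Sketch only: tempCopy is a hypothetical name; bookmarkLabel is the label computed in the loop.
String tempCopy = oneFile + ".bm.pdf";
try (PdfDocument oneDoc = new PdfDocument(new PdfReader(oneFile), new PdfWriter(tempCopy))) {
    // the helper requires a writer, so the component PDF is opened in stamping mode
    shiftPdfBookmarksUnderNewBookmark(oneDoc, bookmarkLabel);
} // closing oneDoc writes the modified copy to tempCopy
// ...then merge tempCopy instead of oneFile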

Upload documents into Watson's Retrieve & Rank service

I'm implementing a solution using Watson's Retrieve & Rank service.
When I use the tooling interface, I upload my documents and they appear as a list, where I can click on any of them to open up all the titles that are inside the document (Answer Units), as you can see in Picture 1 and Picture 2.
When I try to upload documents via Java, it won't recognize the documents; they get uploaded in parts (answer units as documents), each part as a new document.
How can I upload my documents as entire documents and not only parts of them?
Here's the code for the upload functions in Java:
public Answers ConvertToUnits(File doc, String collection) throws ParseException, SolrServerException, IOException {
    DC.setUsernameAndPassword(USERNAME, PASSWORD);
    Answers response = DC.convertDocumentToAnswer(doc).execute();
    SolrInputDocument newdoc = new SolrInputDocument();
    WatsonProcessing wp = new WatsonProcessing();
    Collection<SolrInputDocument> newdocs = new ArrayList<SolrInputDocument>();
    for (int i = 0; i < response.getAnswerUnits().size(); i++) {
        String titulo = response.getAnswerUnits().get(i).getTitle();
        String id = response.getAnswerUnits().get(i).getId();
        newdoc.addField("title", titulo);
        for (int j = 0; j < response.getAnswerUnits().get(i).getContent().size(); j++) {
            String texto = response.getAnswerUnits().get(i).getContent().get(j).getText();
            newdoc.addField("body", texto);
        }
        wp.IndexDocument(newdoc, collection);
        newdoc.clear();
    }
    wp.ComitChanges(collection);
    return response;
}

public void IndexDocument(SolrInputDocument newdoc, String collection) throws SolrServerException, IOException {
    UpdateRequest update = new UpdateRequest();
    update.add(newdoc);
    UpdateResponse addResponse = solrClient.add(collection, newdoc);
}
You can specify config options in this line:
Answers response = DC.convertDocumentToAnswer(doc).execute();
I think something like this should do the trick:
String configAsString = "{ \"conversion_target\":\"answer_units\", \"answer_units\": { \"selector_tags\": [] } }";
JsonParser jsonParser = new JsonParser();
JsonObject customConfig = jsonParser.parse(configAsString).getAsJsonObject();
Answers response = DC.convertDocumentToAnswer(doc, null, customConfig).execute();
I've not tried it out, so might not have got the syntax exactly right, but hopefully this will get you on the right track.
Essentially, what I'm trying to do here is use the selector_tags option in the config (see https://www.ibm.com/watson/developercloud/doc/document-conversion/customizing.shtml#htmlau for the documentation on this) to specify which tags the document should be split on. By specifying an empty list with no tags in it, the document is not split at all and comes out as a single answer unit, as you want.
(Note that you can do this through the tooling interface, too - by unticking the "Split my documents up into individual answers for me" option when you upload the document)

Error: org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm cannot be cast to org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage

I am trying to extract images from a PDF using PDFBox. I took help from this post. It worked for some of the PDFs, but for most others it did not. For example, I am not able to extract the figures in this file.
After doing some research I found that PDResources.getImages is deprecated, so I am using PDResources.getXObjects(). With this, I am not able to extract any image from the PDF and instead get this message at the console:
org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm cannot be cast to org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage
Now I am stuck and unable to find a solution. Can anyone assist?
UPDATE (reply to comments):
I am using pdfbox-1.8.10
Here is the code:
public void getimg() throws Exception {
    try {
        String sourceDir = "C:/Users/admin/Desktop/pdfbox/mypdfbox/pdfbox/inputs/Yavaa.pdf";
        String destinationDir = "C:/Users/admin/Desktop/pdfbox/mypdfbox/pdfbox/outputs/";
        File oldFile = new File(sourceDir);
        if (oldFile.exists()) {
            PDDocument document = PDDocument.load(sourceDir);
            List<PDPage> list = document.getDocumentCatalog().getAllPages();
            String fileName = oldFile.getName().replace(".pdf", "_cover");
            int totalImages = 1;
            for (PDPage page : list) {
                PDResources pdResources = page.getResources();
                Map<String, PDXObject> pageImages = pdResources.getXObjects();
                if (pageImages != null) {
                    Iterator<String> imageIter = pageImages.keySet().iterator();
                    while (imageIter.hasNext()) {
                        String key = imageIter.next();
                        PDXObject obj = pageImages.get(key);
                        if (obj instanceof PDXObjectImage) {
                            PDXObjectImage pdxObjectImage = (PDXObjectImage) obj;
                            pdxObjectImage.write2file(destinationDir + fileName + "_" + totalImages);
                            totalImages++;
                        }
                    }
                }
            }
            document.close();
        } else {
            System.err.println("File does not exist");
        }
    } catch (Exception e) {
        System.err.println(e.getMessage());
    }
}
PARTIAL SOLUTION:
I have solved the problem of the error message and have updated the corrected code in the post above. However, the problem remains: I am still not able to extract the images from a few of the files, like the one I mentioned in this post. Any solution in that regard?
The first problem with the original code is that XObjects can be PDXObjectImage or PDXObjectForm, so you need to check which one you have. The second problem is that the code doesn't walk PDXObjectForm recursively; forms can have resources too. The third problem (only in 1.8) is that you used getResources() instead of findResources(); getResources() doesn't check higher levels.
Code for 1.8 can be found here:
https://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/ExtractImages.java?view=markup
Code for 2.0 can be found here:
https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/ExtractImages.java?view=markup&sortby=date
(Even these are not always perfect, see this answer)
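To make the recursive walk concrete, here is a minimal sketch against the 1.8 API (method and variable names are mine; the ExtractImages tools linked above are the authoritative versions). It would be called once per page with page.findResources(), per the third point:

// Walk a resource dictionary, writing out images and recursing into form XObjects.
// counter is a one-element array so the running image index survives the recursion.
private static void extractImages(PDResources resources, String prefix, int[] counter) throws IOException {
    if (resources == null) {
        return;
    }
    Map<String, PDXObject> xObjects = resources.getXObjects();
    if (xObjects == null) {
        return;
    }
    for (PDXObject xObject : xObjects.values()) {
        if (xObject instanceof PDXObjectImage) {
            // an actual embedded image: write it out and bump the running index
            ((PDXObjectImage) xObject).write2file(prefix + "_" + counter[0]++);
        } else if (xObject instanceof PDXObjectForm) {
            // a form XObject carries its own resources, so recurse into them
            extractImages(((PDXObjectForm) xObject).getResources(), prefix, counter);
        }
    }
}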
The fourth problem is that your file doesn't have any XObjects at all. All the "graphics" are really vector drawings; these can't be "extracted" like embedded images. All you can do is convert the PDF pages to images and then mark and cut what you need.

How to extract links from a webpage using jsp?

My requirement is to extract all links (from "a href" attributes) from a web page dynamically. I am using JSP. To be more specific, I am building a meta search engine in JSP, so when a user enters a query, I have to extract the links from the search results pages of Yahoo, Ask, Google, Mamma, etc.
For getting the pages in string format, the code I am using right now is:
try {
    String sUrl_yahoo = "http://www.mamma.com/result.php?type=web&q=hai+bird&j_q=&l=";

    String nextLine;
    String webPage;
    StringBuffer wPage;
    String sSql;
    java.net.URL siteURL = new java.net.URL(sUrl_yahoo);
    java.net.URLConnection siteConn = siteURL.openConnection();
    java.io.BufferedReader in = new java.io.BufferedReader(new java.io.InputStreamReader(siteConn.getInputStream()));
    wPage = new StringBuffer(30 * 1024);
    while ((nextLine = in.readLine()) != null) {
        wPage.append(nextLine);
    }
    in.close();
    webPage = wPage.toString();
    out.println(webPage);
} catch (Exception e) {
    out.println("Error" + e);
}
Now, my request: can you suggest some way to extract the links from the String webPage?
Or is there some other way to extract those links? I would prefer doing it without using any external packages.
One quick solution would be to use a regex Matcher object to pull the URLs out:
Pattern p = Pattern.compile("<a +href=\"([a-zA-Z0-9\\:\\-\\/\\.]+)\">");
Matcher m = p.matcher(webPage);
ArrayList<String> foundUrls = new ArrayList<String>();
while(m.find()) {
foundUrls.add(m.group(1));
}
You might have to play around with the URL pattern a little to make it more airtight, but this is a quick and dirty solution without using external libraries.
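For instance, if you need the pattern to be less brittle (the one above misses single-quoted attributes, extra attributes before href, and characters like ? or = in query strings), one way to loosen it while staying regex-only is:

// case-insensitive; tolerates other attributes before href and captures anything up to the closing quote
Pattern p = Pattern.compile("<a\\s+[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

It is still a heuristic, not a real HTML parser.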

Android: Parsing XML DOM parser. Converting childnodes to string

Again a question. This time I'm parsing XML messages I receive from a server.
Someone thought they were being smart and decided to place HTML pages inside an XML message. Now I'm facing problems, because I want to extract that HTML page as a string from the XML message.
Ok this is the XML message I'm parsing:
<AmigoRequest>
<From></From>
<To></To>
<MessageType>showMessage</MessageType>
<Param0>general message</Param0>
<Param1><html><head>test</head><body>Testhtml</body></html></Param1>
</AmigoRequest>
You see that in Param1 an HTML page is specified. I've tried to extract the message the following way:
public String getParam1(Document d) {
    if (d.getDocumentElement().getTagName().equals("AmigoRequest")) {
        NodeList results = d.getElementsByTagName("Param1");
        // Messagetype depends on what message we are reading.
        if (results.getLength() > 0 && results != null) {
            return results.item(0).getFirstChild().getNodeValue();
        }
    }
    return "";
}
Where d is the XML message in document form.
It always returns me a null value, because getNodeValue() returns null.
When I try results.item(0).getFirstChild().hasChildNodes() it returns true, because it sees there is a tag in the message.
How can I extract the HTML message <html><head>test</head><body>Testhtml</body></html> from Param1 as a string?
I'm using Android SDK 1.5 (well, almost Java) and a DOM parser.
Thanks for your time and replies.
Antek
You could take the content of Param1, like this:
public String getParam1(Document d) {
    if (d.getDocumentElement().getTagName().equals("AmigoRequest")) {
        NodeList results = d.getElementsByTagName("Param1");
        // Messagetype depends on what message we are reading.
        if (results != null && results.getLength() > 0) {
            // String extractHTMLTags(String s) is a function that you have
            // to implement in a way that extracts all the HTML tags inside a string.
            return extractHTMLTags(results.item(0).getTextContent());
        }
    }
    return "";
}
All you have to do is to implement a function:
String extractHTMLTags(String s)
that will remove all HTML tag occurrences from a string.
For that you can take a look at this post: Remove HTML tags from a String
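A naive sketch of such a function (regex-based; adequate for simple markup like the message above, not a general HTML solution):

static String extractHTMLTags(String s) {
    // strips anything that looks like a tag; does not handle comments, CDATA, or '>' inside attribute values
    return s.replaceAll("<[^>]+>", "");
}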
After a lot of checking and scratching my head thousands of times, I came up with a simple fix: change your API level to 8.
EDIT: I just saw your comment above about getTextContent() not being supported on Android. I'm going to leave this answer up in case it's useful to someone who's on a different platform.
If your DOM API supports it, you can call getTextContent(), as follows:
public String getParam1(Document d) {
    if (d.getDocumentElement().getTagName().equals("AmigoRequest")) {
        NodeList results = d.getElementsByTagName("Param1");
        // Messagetype depends on what message we are reading.
        if (results != null && results.getLength() > 0) {
            return results.item(0).getTextContent();
        }
    }
    return "";
}
However, getTextContent() is a DOM Level 3 API call; not all parsers are guaranteed to support it. Xerces-J does.
By the way, in your original example, your check for null is in the wrong place; it should be:
if (results != null && results.getLength() > 0) {
Otherwise, you'd get a NPE if results really does come back as null.
Since getTextContent() isn't available to you, another option would be to write it -- it isn't hard. In fact, if you're writing this solely for your own use -- or your employer doesn't have overly strict rules about open source -- you could look at Apache's implementation as a starting point; lines 610-646 seem to contain most of what you need. (Please be respectful of Apache's copyright and license.)
Otherwise, some rough pseudocode for the method would be:
String getTextContent(Node node) {
    if (node is a text node)
        return node's text;
    if (node has no children)
        return "";
    return getTextContent(node, new StringBuffer()).toString();
}
StringBuffer getTextContent(Node node, StringBuffer sb) {
    for each child of node {
        if (child is a text node) sb.append(child's text)
        else getTextContent(child, sb);
    }
    return sb;
}
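A concrete rendering of that pseudocode (a sketch only; Apache's implementation linked above covers more node types, e.g. CDATA sections):

static String getTextContent(Node node) {
    if (node.getNodeType() == Node.TEXT_NODE) {
        return node.getNodeValue();
    }
    if (!node.hasChildNodes()) {
        return "";
    }
    return getTextContent(node, new StringBuffer()).toString();
}

static StringBuffer getTextContent(Node node, StringBuffer sb) {
    // walk the children, appending text and recursing into elements
    for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
        if (child.getNodeType() == Node.TEXT_NODE) {
            sb.append(child.getNodeValue());
        } else {
            getTextContent(child, sb);
        }
    }
    return sb;
}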
Well, I was almost there with the code...
public String getParam1(Document d) {
    if (d.getDocumentElement().getTagName().equals("AmigoRequest")) {
        NodeList results = d.getElementsByTagName("Param1");
        // Messagetype depends on what message we are reading.
        if (results != null && results.getLength() > 0) {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder db;
            Element node = (Element) results.item(0); // get the value of Param1
            Document doc2 = null;
            try {
                db = dbf.newDocumentBuilder();
                doc2 = db.newDocument(); // create new document
                doc2.appendChild(doc2.importNode(node, true)); // import the <html>...</html> result into doc2
            } catch (ParserConfigurationException e) {
                Log.d(TAG, " Exception ", e);
            } catch (DOMException e) {
                Log.d(TAG, " Exception ", e);
            } catch (Exception e) {
                e.printStackTrace();
            }
            return doc2. ..... // All I'm missing is something to convert a Document to a String.
        }
    }
    return "";
}
As explained in the comment in my code, all I'm missing is a way to make a String out of a Document. You can't use the Transformer class in Android... doc2.toString() will just give you a serialization of the object.
But my next step is to write my own parser if this doesn't work out ;)
Not the best code, but a temporary solution:
public String getParam1(String b) {
    return b.substring(b.indexOf("<Param1>") + "<Param1>".length(), b.indexOf("</Param1>"));
}
Where String b is the XML document string.
