I am working on content parsing. I ran a sample program for this against a sample link; please visit the link below:
http://www.equitymaster.com/stockquotes/sector.asp?sector=0%2CSOFTL&utm_source=top-menu&utm_medium=website&utm_campaign=performance&utm_content=key-sector
From the above link I parsed the table data and stored it into a Java object.
BSE and NSE are not my exact requirement; I just took them as a sample example. The page above is built with tables, and they use no ids or classes, so in my example I parsed the data using XPath.
This is my XPath:
/html/body/table[4]/tbody/tr/td/table[2]/tbody/tr[2]/td[2]/font/table[2]
I selected it and the parsing works fine. The problem is that if they change the website structure in the future, my program will certainly break. Is there any other way to parse the data dynamically, store it in a database, and display results based on a condition even if the webpage structure changes? I used the Jsoup API for this. Please tell me about any other APIs that provide good support for this type of requirement.
If you're trying to parse a page without any clear id/class to select your nodes, you have to rely on something else. Spelling out the whole tree path is indeed the weakest way of doing it: if anything is added or changed, everything collapses.
You could try relying on color: //table[@bgcolor="#c9d0e0"], on the "GET MORE INFO" field: //table[tr/td//text()="GET MORE INFO"], or on the "More Info" that appears on every line: //table[.//td//text()=" More Info "]...
The idea is to find something ideally unique (if you can't find any unique criterion, table[color condition selecting a few tables][2] is still stronger than walking the whole tree), present every time, and use that as an id.
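For example, with Jsoup (which you are already using) you can express that kind of anchor as a CSS selector instead of an absolute path. A minimal sketch, assuming a recent Jsoup version; the exact selector text is an assumption to adjust to whatever stays stable on the real page:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SectorTableParser {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page (URL taken from the question).
        Document doc = Jsoup.connect(
                "http://www.equitymaster.com/stockquotes/sector.asp?sector=0%2CSOFTL").get();

        // Anchor on something semantic instead of the full tree path:
        // here, a table that contains a "More Info" cell.
        Element dataTable = doc.selectFirst("table:has(td:contains(More Info))");

        if (dataTable != null) {
            for (Element row : dataTable.select("tr")) {
                System.out.println(row.select("td").eachText());
            }
        }
    }
}

If you need full XPath support rather than CSS selectors, a library such as HtmlUnit lets you run expressions like the ones above against the parsed page.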
I'm trying to make a desktop app with Java to track changes made to a webpage, as a side project and also to monitor when my professors add content to their webpages. I did a bit of research, and my current approach is to use the Jsoup library to retrieve the webpage, run it through a hashing algorithm, and then compare the current hash value with a previous hash value.
Is this a recommended approach? I'm open to suggestions and ideas, since before I did any research I had no clue how to start or what Jsoup was.
One potential problem with your hashing method: if the page contains any dynamically generated content that changes on each refresh, as many modern websites do, your program will report that the page is constantly changing. Hashing the whole page will only work if the site does not employ any of this dynamic content (ads, hit counter, social media, etc.).
What specifically are you looking for that has changed? Perhaps new assignments being posted? You likely do not want to monitor the entire page for changes anyway. Therefore, you should use an HTML parser -- this is where Jsoup comes in.
First, parse the page into a Document object:
Document doc = Jsoup.parse(htmlString);
You can now call a number of methods on the Document object to traverse the HTML nodes. (See the Jsoup docs on DOM navigation methods.)
For instance, say there is a table on the site, and each row of the table represents a different assignment. The following code would get the table by its ID and each of its rows by selecting the table's <tr> tags.
Element assignTbl = doc.getElementById("assignmentTable");
Elements tblRows = assignTbl.getElementsByTag("tr");
for (Element tblRow : tblRows) {
    tblRow.html();
}
You will need to somehow view the webpage's source code (such as Inspect Element in Google Chrome) to figure out the page's structure and design your code accordingly. This way, not only would the algorithm be more reliable, but you could take it much further, such as extracting the details of the assignment that has changed. (If you would like assistance, please edit your question with the target page's HTML.)
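To tie this back to your hashing idea: once you can select just the part you care about, you can hash only that element's content, so dynamic content elsewhere on the page no longer causes false positives. A rough sketch (the URL handling and the "assignmentTable" id are carried over from the example above; adjust them to the real page):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AssignmentWatcher {

    // Hash only the assignment table, so ads, hit counters, etc. elsewhere
    // on the page cannot trigger false positives.
    static String assignmentsFingerprint(String url) throws Exception {
        Document doc = Jsoup.connect(url).get();
        Element assignTbl = doc.getElementById("assignmentTable");
        if (assignTbl == null) {
            return "";
        }
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(assignTbl.text().getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}

Store the previous fingerprint and compare it with the current one on each run; a difference means the table (and only the table) has changed.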
I'm new to Java and working on a simple application that monitors a URL and notifies me when a table is updated with new items. Looking at the entire page will not work, as there are advertisements that change all the time and would give false positives.
My thought was to fetch the URL line by line, looking for the elements. For each element I will check whether it is already in an ArrayList. If not, the element is added to the ArrayList and a notification is sent.
What I need support with is not the exact code but advice on whether this would be a good approach, and whether I should store the elements in an ArrayList or use a file instead, as there are two lines of text in each element.
It would also be good to get recommendations on which methods and libraries would be good to look at.
Thanks in advance
Sebastian
To check the site, it'd probably be more stable to parse the HTML and work with an object representation of the DOM. I've never had to do this, but in a question regarding how to do it another user suggested using JTidy; maybe you could have a look at that.
As for storing the information (what you currently do in your ArrayList): this really depends on what you use your application for. If you only want to be notified of changes that occur during the runtime of your program, this is perfectly fine. If you want the information to persist, you should find a way to store it in the file system or a database.
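Using Jsoup (mentioned elsewhere in this thread) rather than JTidy, a rough sketch of the "new row" check might look like this; the URL, the table selector, and the notification are placeholders:

import java.util.HashSet;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TableMonitor {

    // Keys of the rows we have already seen (in-memory only; write them to a
    // file or database if the check has to survive restarts).
    private final Set<String> seenRows = new HashSet<>();

    public void poll(String url) throws Exception {
        Document doc = Jsoup.connect(url).get();
        // "table#items" is a placeholder selector for the table you monitor.
        for (Element row : doc.select("table#items tr")) {
            String key = row.text(); // both lines of text collapse into one key
            if (seenRows.add(key)) {
                System.out.println("New item: " + key); // replace with your notification
            }
        }
    }
}

A Set is a better fit than an ArrayList here, because the "have I seen this before" check is exactly what you need.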
I have some content displayed using computed fields inside a repeat in my XPage.
I now need to be able to send out a newsletter (by email) every week with the content of this repeat. The content can be both plain text and HTML.
My site is also translated into different languages, so I need the code to be able to specify the language and return the content in that language.
I am thinking about creating a scheduled LotusScript or Java agent that somehow reads the content of the repeat. Is this possible? If so, some sample code to get me started would be great.
Edit: the content is only available to logged-in users.
thanks
Thomas
Use a Java agent, and instead of going to the content natively, open the page as if in a browser and then process the result. (You could also make a special version of the web page that hides all extraneous content, if you wanted.)
How is the data for the repeat evaluated? Can it be translated into a LotusScript database.search?
If so, it would be best to forget about the actual XPage and concentrate on working out how to get the same data via LotusScript, then write your scheduled agent to loop through the document collection and generate the email that way.
Going through the XPage would generate a lot of extra work: you need to be authenticated as the user (if the data in the repeat differs from one user to the next) to get exactly the data that this particular user would see, and then you have to parse the page to extract the data.
If your newsletter is complicated enough that you want to use an XPage rather than build the HTML yourself in the agent, you could build a single XPage that changes what it renders based on a special query string, then in your agent get the HTML from a URLConnection and pass it into the body of your email.
You could build the URL based on a view that shows documents with today's date.
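A minimal sketch of the agent side of that idea, using a plain URLConnection; the server URL, query-string parameters, and authentication handling are assumptions you would replace with your own:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class NewsletterFetcher {

    // Fetch the specially rendered XPage for one language; the URL and its
    // parameters are placeholders.
    static String fetchNewsletterHtml(String lang) throws Exception {
        URL url = new URL("http://yourserver/yourdb.nsf/newsletter.xsp?mode=newsletter&lang=" + lang);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // The content is only visible to logged-in users, so the agent also
        // needs to authenticate, e.g. via a session cookie or a basic
        // authentication header (not shown here).
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString(); // use this as the HTML body of the mail
    }
}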
I would solve this by giving the user a teaser on what to read and giving them a link to the full content.
You should check out my colleague Weihang Chen's article about rendering an XPage as MIME and sending it as a mail:
http://www.bleedyellow.com/blogs/weihang/entry/render_a_xpages_programmtically_and_send_it_as_a_mail?lang=en_us
We got this working in house and it is very convenient.
He describes 3 different approaches to the problem.
I've got an HTML website in which there is some data inside a table (I have no control over how the data is displayed on that website). I need to extract this table data.
For example, the nth row in the table has two columns; the first column's text is "last update time", and the next column has some datestamp value. Using jQuery I could say exactly that I want this table's nth row's second column's text, which would give me a timestamp string.
Is there something like this in Java? I will basically fetch the whole page and try to extract that information in the same manner as described above. Does Java have something similar?
Since JavaScript can be executed as a shell script as long as an interpreter is available, can something similar be done so that jQuery functions can be invoked from Java?
I think jsoup will help you.
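For instance, a minimal Jsoup sketch of the "nth row, second column" lookup you describe; the URL, the table selector, and the row index are placeholders:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class LastUpdateExtractor {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://example.com/status.html").get();

        int n = 5; // the row you are interested in (0-based)
        Elements rows = doc.select("table tr");
        if (rows.size() > n) {
            Element row = rows.get(n);
            Elements cells = row.select("td");
            if (cells.size() >= 2) {
                String label = cells.get(0).text();     // e.g. "last update time"
                String timestamp = cells.get(1).text(); // the datestamp value
                System.out.println(label + " = " + timestamp);
            }
        }
    }
}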
Java can parse a DOM and you can interpret it that way. For a brief explanation, check out http://tutorials.jenkov.com/java-xml/dom-document-object.html
It's certainly not going to be as simple as writing a jQuery function to do the same.
I am working on a project here that ingests internal resumes from people at my company, strips out the skills and relevant content from them and stores it in a database. This was all done using docx4j and Grails. This required the resumes to first be submitted via a template that formatted everything just right so that the ingest tool knew what to look for to strip the data.
The second portion of this is: what if we want to get a "reduced" resume out of the database? In other words, I want to search the uploaded content I now have and only print out new resumes for people who have, say, Java programming experience. So I can go into my database, find the people who originally had Java as a skill, and output a new set of resumes that are still in a nicely templated format and only have the relevant info in them, instead of ALL the content.
I have been writing some software to do this in Java that basically uses a docx template, overwriting the items in the custom XML that are bound to the content controls in the doc, so the new data shows up and can be saved as a new docx with that custom data.
This seems really cumbersome to me and has some limitations. For one, let's say my template has a place for 3 skills, and a particular person has 8 skills. There seems to be no good way to add those 5 additional skills to the docx other than painstakingly inserting the data with all of the formatting XML tags and such. This is a real pain, because if the template changes, I don't want to have to go back into my software and edit source code to change that additional data-input XML tag to bold instead of italic.
I was reading up on using InfoPath to create a form I could use to get the input, connecting to some SharePoint data source or something to store the stripped-out data. However, I can't seem to find out whether it is possible, using SharePoint, to get the data back out in a nicely formatted way. What would the general steps for this be? I couldn't find very much about this topic with any quick googling.
Thanks
You could set up the skills:
<skills>
  <skill>..</skill>
  <skill>..</skill>
</skills>
and use a "repeat" content control pointing to the container. This would handle any number of <skill> entries.
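If it helps, here is a rough docx4j sketch of swapping that custom XML part into the template before the bindings are applied; the file names and the custom XML item id are assumptions, and the OpenDoPE repeat/binding processing step itself is not shown:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;

import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.CustomXmlDataStoragePart;

public class ResumeBuilder {
    public static void main(String[] args) throws Exception {
        // Load the template (file name is a placeholder).
        WordprocessingMLPackage pkg =
                WordprocessingMLPackage.load(new File("resume-template.docx"));

        // Data file containing one <skill> element per actual skill.
        org.w3c.dom.Document skillsData = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("skills-data.xml"));

        // Replace the bound custom XML part; "{your-item-id}" is a placeholder
        // for the datastore item id used in your template.
        CustomXmlDataStoragePart xmlPart = (CustomXmlDataStoragePart)
                pkg.getCustomXmlDataStorageParts().get("{your-item-id}");
        xmlPart.getData().setDocument(skillsData);

        // docx4j's OpenDoPE binding/repeat processing (not shown) then expands
        // the repeat content control once per <skill> before you save.
        pkg.save(new File("resume-out.docx"));
    }
}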