Developing app to detect webpage change - java

I'm trying to make a desktop app in Java to track changes made to a webpage, both as a side project and to monitor when my professors add content to their webpages. I did a bit of research, and my current approach is to use the Jsoup library to retrieve the webpage, run it through a hashing algorithm, and then compare the current hash value with a previous hash value.
Is this a recommended approach? I'm open to suggestions and ideas, since before I did any research I had no clue how to start, nor what Jsoup was.

One potential problem with your hashing method: if the page contains any dynamically generated content that changes on each refresh, as many modern websites do, your program will report that the page is constantly changing. Hashing the whole page will only work if the site does not employ any of this dynamic content (ads, hit counter, social media, etc.).
What specifically are you looking for that has changed? Perhaps new assignments being posted? You likely do not want to monitor the entire page for changes anyway. Therefore, you should use an HTML parser -- this is where Jsoup comes in.
First, parse the page into a Document object:
Document doc = Jsoup.parse(htmlString);
You can now perform a number of methods on the Document object to traverse the HTML Nodes. (See Jsoup docs on DOM navigation methods)
For instance, say there is a table on the site, and each row of the table represents a different assignment. The following code gets the table by its ID and each of its rows by selecting the table's tr tags.
Element assignTbl = doc.getElementById("assignmentTable");
Elements tblRows = assignTbl.getElementsByTag("tr");
for (Element tblRow : tblRows) {
    // Inspect, hash, or store each row's HTML here
    String rowHtml = tblRow.html();
}
You will need to somehow view the webpage's source code (such as Inspect Element in Google Chrome) to figure out the page's structure and design your code accordingly. This way, not only would the algorithm be more reliable, but you could take it much further, such as extracting the details of the assignment that has changed. (If you would like assistance, please edit your question with the target page's HTML.)
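Putting the two ideas together, here is a minimal sketch of a watcher that hashes only the assignment table rather than the whole page, which sidesteps the dynamic-content problem above. The URL, element ID, and polling interval are placeholder assumptions; adapt them to the actual page.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class PageWatcher {
    public static void main(String[] args) throws Exception {
        String previousHash = null;
        while (true) {
            // Fetch the live page (URL and element ID are placeholders)
            Document doc = Jsoup.connect("https://example.edu/course/assignments").get();
            Element assignTbl = doc.getElementById("assignmentTable");
            if (assignTbl != null) {
                // Hash only the table's HTML, not the whole page
                String currentHash = sha256(assignTbl.html());
                if (previousHash != null && !currentHash.equals(previousHash)) {
                    System.out.println("Assignment table changed!");
                }
                previousHash = currentHash;
            }
            Thread.sleep(15 * 60 * 1000); // poll every 15 minutes
        }
    }

    private static String sha256(String input) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(input.getBytes(StandardCharsets.UTF_8))) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}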

Related

Siebel Open UI and Selenium; changing IDs/names

I am working on a project for a client who is going to upgrade to Siebel Open UI. With that upgrade, they also want to start implementing Selenium. The problem we are currently facing, or will face once we implement it, is that with each build the IDs/names of HTML elements in Siebel change. Because we are talking about a lot of views and applets, it is not practical to change the code manually each time.
What is a good solution for this problem? One solution that was offered is a correlation table in which we keep track of changes to the IDs.
XPath is also not a good option in this case because of the complicated structure of the views and applets.
I would suggest that you look into CSS Selectors. They are faster and less brittle than XPath. For ID/names that are dynamic, typically there is at least some portion of the ID that is static.
For instance,
<a id="somestatictext_12345">...
where "12345" is some dynamically generated number. In this case you can use a CSS selector like
driver.findElement(By.cssSelector("[id^='somestatictext']"));
Examples
"[id^='somestatictext']" - ID begins with "somestatictext"
"[id$='somestatictext']" - ID ends with "somestatictext"
"[id*='somestatictext']" - ID contains "somestatictext"
For more info, take a look at this CSS Selector reference.
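As a hedged illustration, a small helper like the one below wraps the partial-ID selectors so that only the static portion of each ID needs to be maintained in one place; the class and method names are hypothetical, and only the By.cssSelector calls come from the answer above.

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

// Hypothetical helper: centralizes partial-ID lookups.
public final class PartialIdLocator {
    private PartialIdLocator() {}

    // Matches elements whose ID begins with the static prefix
    public static WebElement byIdPrefix(WebDriver driver, String prefix) {
        return driver.findElement(By.cssSelector("[id^='" + prefix + "']"));
    }

    // Matches elements whose ID contains the static fragment anywhere
    public static WebElement byIdFragment(WebDriver driver, String fragment) {
        return driver.findElement(By.cssSelector("[id*='" + fragment + "']"));
    }
}

Usage would then look like: WebElement link = PartialIdLocator.byIdPrefix(driver, "somestatictext");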

Scraping issue (data-reactid)

I'm trying to scrape a website and compile a spreadsheet based on what data I pull.
The website I am trying to scrape is WEARVR.
I am not too experienced with scraping, but my approach would be to find unique attributes within html tags and use this to scrape what I want.
So for this website my approach would be firstly to scrape a list of URLs of the pages you are taken to upon clicking on one of the experiences, for example : https://www.wearvr.com/#game_id=game_1041, and then secondly, cycle through this list scraping the relevant attributes each time.
However, I am stuck at the first step: instead of working with simple "a href" tags, I come across "data-reactid" attributes, which confuse the matter.
I do my scraping with iMacros, but I'm pretty decent at Java now, so I would learn scraping in Java if need be (which seems likely, as iMacros is pretty limited).
My question is, how do these "data-reactid" tags work, and as such how can I utilise them for my scraping purposes?
Additionally if this is an XY problem, please let me know and suggest a better approach.
Thanks for reading!
The simplest way to approach scraping is to treat the page like a big string (because ultimately, that is what it is). You can search within that string for certain things (like href=) to grab links. You can also intelligently assume that whatever is in the a tags is relevant to the link and grab that.
You really don't have to understand HTML, and you don't have to understand how the page or any additional CSS or markup works; you just need to identify what sort of identifiable string combinations surround the text you want. I will say this is probably much easier to implement in Java than with iMacros, and probably more accurate.
The other way you can handle it, which requires a little more knowledge of HTML and XML, is to treat the entire page as an XML document. This doesn't always work with HTML, particularly if it is older or badly formed, so the string approach is easier. You get some utility out of the various XML mapping libraries that exist, but otherwise it's similar to the above.
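For what it's worth, here is a minimal sketch of the string approach in Java. The URL and regex are illustrative assumptions, and a naive href scan like this will miss links that React renders client-side, which is exactly the data-reactid complication the question mentions.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkGrabber {
    public static void main(String[] args) throws Exception {
        // Fetch the page as one big string (URL is a placeholder)
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://www.wearvr.com/")).build();
        String page = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

        // Naive string scan: pull out every href="..." value
        Pattern href = Pattern.compile("href=\"([^\"]+)\"");
        Matcher m = href.matcher(page);
        while (m.find()) {
            System.out.println(m.group(1));
        }
    }
}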

Notify when web content change

I'm new to Java and working on a simple application that monitors a URL and notifies me when a table is updated with new items. Looking at the entire page will not work, as there are advertisements that change all the time and would give false positives.
My thought was to fetch the URL line by line, looking for the elements. For each element I will check whether the element is already in an ArrayList. If not, the element is added to the ArrayList and a notification is sent.
What I need support with is not the exact code but advice on whether this would be a good approach, and whether I should store the elements in an ArrayList or use a file instead, as there are two lines of text in each element.
It would also be good to get recommendations on what methods and libraries would be good to look at.
Thanks in advance
Sebastian
To check the site, it'd probably be more stable to parse the HTML and work with an object representation of the DOM. I've never had to do this myself, but in a question about how to do it another user suggested JTidy; maybe you could have a look at that.
As for storing the information (what you currently do in your ArrayList): this really depends on what you use your application for. If you only want to be notified of changes that occur during the runtime of your program this is perfectly fine. If you want to have the information persist you should find a way to store the information in the file system or database.
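As a hedged sketch of the persistence idea, the class below keeps seen items in a HashSet and mirrors them to a plain text file so they survive restarts. The file name is a placeholder, and multi-line items are assumed to be flattened to a single line before being passed in.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

public class SeenItems {
    private final Path store = Path.of("seen-items.txt"); // placeholder file name
    private final Set<String> seen = new HashSet<>();

    public SeenItems() throws IOException {
        // Reload previously seen items, one per line
        if (Files.exists(store)) {
            seen.addAll(Files.readAllLines(store));
        }
    }

    // Returns true (and persists the item) if it has not been seen before.
    public boolean markIfNew(String item) throws IOException {
        if (seen.add(item)) {
            Files.writeString(store, item + System.lineSeparator(),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            return true;
        }
        return false;
    }
}

In the monitoring loop you would then call markIfNew(...) with each table row's text and send a notification whenever it returns true.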

Dynamic Content Parsing

I am working on content parsing. I executed a sample program against the following link:
http://www.equitymaster.com/stockquotes/sector.asp?sector=0%2CSOFTL&utm_source=top-menu&utm_medium=website&utm_campaign=performance&utm_content=key-sector
From that link I parsed the table data and stored it into a Java object. BSE and NSE are not my exact requirement; I just took them as a sample example. The tables in the page above do not use IDs or classes, so in my example I parsed the data using XPath.
This is my XPath:
/html/body/table[4]/tbody/tr/td/table[2]/tbody/tr[2]/td[2]/font/table[2]
I selected it and the parsing works fine. The problem is that if they change the website structure in the future, my program will surely stop working. Is there any other way to parse the data dynamically, store it in a database, and display results based on a condition even if they change the webpage structure? I used the Jsoup API for this. Are there any other APIs that provide better support for this type of requirement?
If you're trying to parse a page without any clear id/class to select your nodes, you have to try and rely on something else. Redefining the whole tree is indeed the weakest way of doing it, if anything is added/changed everything will collapse.
You could try relying on color: //table[@bgcolor="#c9d0e0"], the "GET MORE INFO" field: //table[tr/td//text()="GET MORE INFO"], or the "More Info" that appears on every line: //table[.//td//text()="&nbsp;More Info&nbsp;"]...
The idea is to find something that is ideally unique (and if you can't find any unique criterion, table[color condition selecting a few tables][2] is still more robust than walking the whole tree), present every time, and use that as an id.
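Since the question already uses Jsoup, here is a hedged sketch of the same idea expressed as Jsoup CSS selectors; the bgcolor value and text anchor are taken from the XPath suggestions above, and the URL is the one from the question.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class AttributeSelect {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect(
                "http://www.equitymaster.com/stockquotes/sector.asp?sector=0%2CSOFTL").get();

        // Anchor on a stable attribute value instead of an absolute tree path
        Elements byColor = doc.select("table[bgcolor=#c9d0e0]");

        // Or anchor on text the table is known to contain
        Elements byText = doc.select("table:contains(GET MORE INFO)");

        System.out.println(byColor.size() + " tables by color, "
                + byText.size() + " tables by text");
    }
}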

How to create templates from html page automatically?

I have a use case in which I need to render unformatted text in the format of a given web page programmatically in Java, i.e. the text should automatically be formatted like the web page, with styles, paragraphs, bullet points, etc.
As I see it, I will first have to analyze the piece of unformatted text to find candidates for paragraphs, bullet points, headings, etc. I intend to use Lucene analyzers/tokenizers for this task. Are there any alternatives?
The second problem is to convert the formatted web page into some kind of template (e.g. a Velocity template) with placeholders for various entities like titles, bullet points, etc.
Is there any text analysis/templating library in Java that can help me do this? Preferably open source.
Are there any other suggestions for doing this sort of task in a better way in Java?
Thanks for your help.
There are a lot of hard parts to what you're doing.
The user input
If you don't ask your user to provide any context, you're never going to guess the structure of the text. At least, you should ask them to provide a title and a series of paragraphs in your GUI.
Ideally, you could ask them to follow a well-known markup language (Markdown, Textile, etc.) and use an open source parser to extract the structure.
The external page
If any arbitrary page can be used, the only things you can rely on are the "structural markup". So if you know the title of the page should be "Hello World" and there is an "h1" element somewhere in the page, you can perhaps assume that this is where the header could go.
But if the page is a div tag-soup and only CSS is used to differentiate the rendering of the header from the bulk of the text, you're going to have to guess how the styling is done: that's plainly impossible if you don't know how the page is made.
I don't think Lucene would help for this (as far as I know, Lucene is made to create an index of the words used in a body of text; I don't think it can help you guess which part of the text is meant to be a title, a subtitle, etc.).
Generating templates from external page
Assuming you have "guessed" right, you could generate the content by:
copy-pasting the page
replacing the parts to change with tags of your template language of choice
storing the template somewhere the templating system can access it
configuring your template/view system (viewResolver for Velocity) to use the right template for the right person
That would of course pose terrible legal questions, since your templates would incorporate works by the original website author (most probably copyrighted material)
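To make the "replacing the parts to change" step concrete, here is a minimal Velocity sketch; the template string and variable names are invented for illustration.

import org.apache.velocity.VelocityContext;
import org.apache.velocity.app.Velocity;
import java.io.StringWriter;
import java.util.List;

public class TemplateDemo {
    public static void main(String[] args) {
        Velocity.init();

        // Invented template: $title and the #foreach loop stand in for the
        // parts of the copied page that were replaced with placeholders.
        String template =
                "<h1>$title</h1>\n"
              + "<ul>\n"
              + "#foreach($point in $bullets)"
              + "  <li>$point</li>\n"
              + "#end"
              + "</ul>\n";

        VelocityContext ctx = new VelocityContext();
        ctx.put("title", "Hello World");
        ctx.put("bullets", List.of("first point", "second point"));

        StringWriter out = new StringWriter();
        Velocity.evaluate(ctx, out, "demo", template);
        System.out.println(out);
    }
}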
A more realistic solution
I would suggest you constrain your problem to:
using input that has some structural information available (use a GUI to enter it, use a markup language, whatever)
using templates that you provide and know the structure of (and can reuse very easily)
Note that none of those points are related to the template system.
Otherwise, I'm afraid you're heading toward an unreasonable amount of work...
