Using jquery with java for HTML - java

I've got a HTML website, in which there is some kind of data inside a table(I have no control in which way data is displayed on that website). I need to get/extract this table data.
i.e. nth row in the table has 2 columns, first columns text is "last update time", and the next column has some datestamp value. Using jquery I could say exactly that I want to get this tables nth row second column text which would give me some timestamp string.
Is there something like this in java, I will basically get the whole site and try to extract that information in the same manner as I described above. Does java have something similar?
Since javascript can be executed as a shell script as long as there is interpreter available, can something similar be done so that jquery functions are possible to invoke from java?

I think jsoup will help you.

Java can parse a DOM and you can interpret it that way. For a brief explanation, check out http://tutorials.jenkov.com/java-xml/dom-document-object.html
Its certainly not going to be as simple as writing a jquery function to do the same.

Related

Is it possible to remove tags (or sequences) and relate or remember them as indexes?

I'm working with HTML tags, and I need to interpret HTML documents. Here's what I need to achieve:
I have to recognize and remove HTML tags without removing the
original content.
I have to store the index of the previously existing markups.
So here's a example. Imagine that I have the following markup:
This <strong>is a</strong> message.
In this example, we have a String sequence with 35 characters, and markedup with strong tag. As we know, an HTML markup has a start and an end, and if we interpret the start and end markup as a sequence of characters, each also has a start and an end (a character index).
Again, in the previous example, the beggining index of the open/start tag is 5 (starts at index 0), and the end index is 13. The same logic goes to the close tag.
Now, once we remove the markup, we end up with the following:
This is a message.
The question:
How can I remember with this sequence the places where I could enter the markup again?
For example, once the markup has been removed, how do I know that I have to insert the opening tag in the X position/index, and the closing tag in the Y position/index... Like so:
This is a message.
5 9
index 5 = <strong>
index 9 = </strong>
I must remember that it is possible to find the following situation:
<a>T<b attribute="value">h<c>i<d>s</a> <g>i<h>s</h></g> </b>a</c> <e>t</e>e<f>s</d>t</f>.
I need to implement this in Java. I've figured out how to get the start and end index of each tag in a document. For this, I'm using regular expressions (Pattern and Matcher), but I still do not know how to insert the tags again properly (as described). I would like a working example (if possible). It does not have to be the best example (the best solution) in the world, but only that it works the right way for any kind of situation.
If anyone has not understood my question, please comment that I will do it better.
Thanks in advance.
EDIT
People in the comments are saying that I should not use regular expressions to work with HTML. I do not care to use or not regular expressions to solve this problem, I just want to solve it, no matter how (But of course, in the most appropriate way).
I mentioned that I'm using regular expressions, but I do not mind using another approach that presents the same solution. I read that a XML parser could be the solution. Is that correct? Is there an XML parser capable of doing all this what I need?
Again, Thanks in advance.
EDIT 2
I'm doing this edition now to explain the applicability of my problem (as asked). Well, before I start, I want to say that what I'm trying to do is something I've never done before, it's not something on my area, so it may not be the most appropriate way to do it. Anyway...
I'm developing a site where users are allowed to read content but can not edit it (edit or remove text). However, users can still mark/highlight excerpts (ranges) of the content present (with some stylization). This is the big summary.
Now the problem is how to do this (in Java). On the client side, for now, I was thinking of using TinyMCE to enable styling of content without text editing. I could save stylized text to a database, but this would take up a lot of space, since every client is allowed to do this, given that they are many clients. So if a client marks snippets of a paragraph, saving the paragraph back in the database for each client in the system is somewhat costly in terms of memory.
So I thought of just saving the range (indexes) of the markups made by users in a database. It is much easier to save just a few numbers than all the text with the styling required. In the case, for example, I could save a line / record in a table that says:
In X paragraph, from Y to Z index, the user P defined a ABC
stylization.
This would require a translation / conversion, from database to HTML, and HTML to database. Setting a converter can be easy (I guess), but I do not know how to get the indexes (following this logic). And then we stop again at the beginning of my question.
Just to make it clear:
If someone offers a solution that will cost money, such as a paid API, tool, or something similar, unfortunately this option is not feasible for me. I'm sorry :/
In a similar way, I know it would be ideal to do this processing with JavaScript (client-side). It turns out that I do not have a specialized JavaScript team, so this needs to be done on the server side (unfortunately), which is written in Java. I can only use a JavaScript solution if it is already ready, easy and quick to use. Would you know of any ready-made, easy-to-use library that can do it in a simple way? Does it exist?
You can't use a regular expression to parse HTML. See this question (which includes this rather epic answer as well as several other interesting answers) for more information, but HTML isn't a regular language because it has a recursive structure.
Any language that allows recursion isn't regular by definition, so you can't parse it with a regex.
Keep in mind that HTML is a context-free languages (or, at least, pretty close to context-free). See also the Chomsky hierarchy.

Parsing html text to obtain input fields

So I currently have a big blob of html text, and I want to generate an input form based on what is contained in that text. For example, if the text contains '[%Name%]', I want to be able to read that in and recognize 'Name' is there, and so in turn enable a form field for name. There will be multiple tags ([%age%], [%height%], etc.)
I was thinking about using Regex, but after doing some research it seems that Regex is a horrible idea to parse html with. I came across parsing html pages with groovy, but it is not strictly applicable to my implementation. I am storing the html formatted text (which I am creating using ckeditor) in a database.
Is there a efficient way to do this in java/groovy? Or should I just create an algorithm similar to examples shown here (I'm not too sure how effective the given algorithms would be, as they seem to be constructed around relatively small strings, whereas my string to parse through may end up being quite large (a 15-20 page document)).
Thanks in advance
Instead of reimplementing the wheel I think it's better to use jsoup. It is an excellent tool for your task and would be easy to obtain anything in a html page using it's selector syntax. Check out examples of usage in their cookbook.

Notify when web content change

Im new to java and working on a simple application that monitor an url and notify me when a table is updated whit new items. Looking at the entire page will not work as there are commercials that change all the time and they would give false positives.
My thought was to fetch the url line by line looking for the elements. For each element I will check to see if the element is already in an arraylist. If not the element is added to the arraylist and a notification is send.
What I need support with is not the exact code but advice if this would be a good approach and if I should store the elements in an array list or if I should use a file instead as there are 2 lines of text in each element.
Also It would be good to get recomandation on what methods and libs there would be good to look at.
Thanks in advance
Sebastian
To check the site it'd probably be more stable to parse the HTML and work with an object representation of the DOM. I've never had to do this but in a question regarding how to do this another user suggested using JTidy, maybe you could have a look at that.
As for storing the information (what you currently do in your ArrayList): this really depends on what you use your application for. If you only want to be notified of changes that occur during the runtime of your program this is perfectly fine. If you want to have the information persist you should find a way to store the information in the file system or database.

Dynamic Content Parsing

I am working with content parsing I executed the sample program for this i have taken a sample link
please visit the below link
http://www.equitymaster.com/stockquotes/sector.asp?sector=0%2CSOFTL&utm_source=top-menu&utm_medium=website&utm_campaign=performance&utm_content=key-sector
or
Click Here
in the above link i parsed the table data and store into java object.
BSE and NSE are not my exact requirement just I am taken sample example. the above link is developed in the tables they are not used id's and classes. in my example I parsed data using XPath
this is my Xpath
/html/body/table[4]/tbody/tr/td/table[2]/tbody/tr[2]/td[2]/font/table[2]
I selected and parsing it is working fine . here is a problem in future if they changed website structure my program will not work for sure. tell me any other way to parse data dynamically and able to store in database. display the results based on the condition even if they changed the webpage structure I used for this JSOUP api for this. Tell me any other ApI's which provide best support for this type of requirement
If you're trying to parse a page without any clear id/class to select your nodes, you have to try and rely on something else. Redefining the whole tree is indeed the weakest way of doing it, if anything is added/changed everything will collapse.
You could try relying on color: //table[#bgcolor="#c9d0e0"], the "GET MORE INFO" field: //table[tr/td//text()="GET MORE INFO"], the "More Info" there is on every line: //table[.//td//text()="&nbspMore Info&nbsp"]...
The idea is to find something ideally unique (if you can't find any unique criteria, table[color condition selecting a few tables][2] is still stronger walking the whole tree), present every time, and use that as an id.

Parse javascript generated content using Java

http://support.xbox.com/en-us/contact-us uses javascript to create some lists. I want to be able to parse these lists for their text. So for the above page I want to return the following:
Billing and Subscriptions
Xbox 360
Xbox LIVE
Kinect
Apps
Games
I was trying to use JSoup for a while before noticing it was generated using javascript. I have no idea how to go about parsing a page for its javascript generated content.
Where do I begin?
You'll want to use an HTML+JavaScript library like Cobra. It'll parse the DOM elements in the HTML as well as apply any DOM changes caused by JavaScript.
you could always import the whole page and then perform a string separator on the page (using return, etc) and look for the string containing the information, then return the string you want and pull pieces out of that string. That is the dirty way of doing it, not sure if there is a clean way to do it.
I don't think that text is generated by javascript... If I disable javascript those options can be found inside the html at this location (a jquery selector just because it was easier to hand-write than figuring out the xpath without javascript enabled :))
'div#ShellNavigationBar ul.NavigationElements li ul li a'
Regardless in direct answer to your query, you'd have to evaluate the javascript within the scope of the document, which I expect would be rather complex in Java. You'd have more luck identifying the javascript file generating the relevant content and just parsing that directly.

Categories

Resources