Parsing html text to obtain input fields - java

So I currently have a big blob of html text, and I want to generate an input form based on what is contained in that text. For example, if the text contains '[%Name%]', I want to be able to read that in and recognize 'Name' is there, and so in turn enable a form field for name. There will be multiple tags ([%age%], [%height%], etc.)
I was thinking about using Regex, but after doing some research it seems that Regex is a horrible idea to parse html with. I came across parsing html pages with groovy, but it is not strictly applicable to my implementation. I am storing the html formatted text (which I am creating using ckeditor) in a database.
Is there a efficient way to do this in java/groovy? Or should I just create an algorithm similar to examples shown here (I'm not too sure how effective the given algorithms would be, as they seem to be constructed around relatively small strings, whereas my string to parse through may end up being quite large (a 15-20 page document)).
Thanks in advance

Instead of reimplementing the wheel I think it's better to use jsoup. It is an excellent tool for your task and would be easy to obtain anything in a html page using it's selector syntax. Check out examples of usage in their cookbook.

Related

Scraping issue (data-reactid)

I'm trying to scrape a website and compile a spreadsheet based on what data I pull.
The website I am trying to scrape is WEARVR.
I am not too experienced with scraping, but my approach would be to find unique attributes within html tags and use this to scrape what I want.
So for this website my approach would be firstly to scrape a list of URLs of the pages you are taken to upon clicking on one of the experiences, for example : https://www.wearvr.com/#game_id=game_1041, and then secondly, cycle through this list scraping the relevant attributes each time.
However I am stuck at the first step as instead of working with simple "a href" tags, I come across "data-reactid" tags which confuse the matter.
I do my scraping with iMacros but I'm pretty decent at Java now so would learn scraping in Java if need be (which seems likely as iMacros is pretty limited).
My question is, how do these "data-reactid" tags work, and as such how can I utilise them for my scraping purposes?
Additionally if this is an XY problem, please let me know and suggest a better approach.
Thanks for reading!
The simplest way to approach scraping is to treat the page like a big string (because ultimately, that is what it is). You can search within that string for certain things (like href=) to grab links. You can also intelligently assume that whatever is in the a tags is relevant to the link and grab that.
You really don't have to understand HTML, and you don't have to understand how the page or any additional css or markup work, you just need to identify what sort of identifiable string combinations are around the text you want. I will say this is probably much easier to implement in Java than using IMacro, and probably more accurate.
The other way you can handle it, which requires a little more knowledge of HTML and XML, is to treat the entire page as an XML document. This...doesn't always work with HTML, particularly if it is older or badly formed, so the string approach is easier. You get some utility out of the various XML map libraries that exist, but otherwise its similar to the above.

Get nodes in html document contains word

I want to write a script that checks a
document for keywords and specifies html document nodes in which they are contained (possibly
assign a unique identifier).
I am not a professional programmer and do not know the strength of low-level languages ​​and things as PLO.. I'm afraid of doing something very bad and unsupported.
How is it possible to isolate the desired nodes?
My experience - js and php - php only for very simple things. Also, I
do not want to use the opportunity to work
with js nodes. My thoughts:
to make a string of html
verify the existence of the words on the page
if the word on page exists: foreach node in body element I get first and last positions
(for example, we see opening tag for each character we initially know
position and therefore we calculate the first
position where the tag is opened and last where closed. And so on for all nodes).
We know the position of the word (eg 192,
199) and check in what range it got (in this
case, these bands - nodes html document).
I need ideas from experienced programmers.
It does not matter what language you are
programming (except for web-oriented)-
every opinion is important to me. It is likely
that there are libraries that solve such
problems. I very much hope that you will
understand me. English is not my native
language.
I always recommend Beautiful Soup for this kind of thing. It is a Python library that allows you to parse XML/HTML documents really quickly. You could quite quickly get something running that extracts the text from each div element I would have thought. Then using Pythons built-in string manipulation tools I'm sure searching for particular words would be fairly simple.
You need to use a html parser. Refer
Which HTML Parser is the best?
After that, you need to use xpath feature to extract whichever node.

Is HTML parsing (in Java/Android) then extracting data from it, an effective way of getting a webpage's content?

So, I'm using HTTP Post Requests in Android Java to log into a website, before extracting the entire HTML code. After that, I use Pattern/Matcher (regex) to find all the elements I need before extracting them from the HTML data, and deleting everything unnecessary. For instance when I extract this:
String extractions = <td>Good day sir</td>
Then I use:
extractions.replaceAll("<td>", "").replaceAll("</td>", "");
I do this multiple times until I have all the data needed from that site, before I display it in some kind of list.
I'm not particularly stuck on anything, but please, can you tell me if this is an effective/efficient/fast way of getting data from a page and processing it, or are there ways to do this faster? Because sometimes it's like my program takes a lot of time to get certain data (although mostly that's when I'm on 3G on my phone).
Like others have said, regex is not the best tool for this job. But in this case, the particular way you use regex is even more inefficient than it would normally be.
In any case, let me offer one more possible solution (depending on your use case).
It's called YQL (Yahoo Query Language).
http://developer.yahoo.com/yql/
Here is a console for it so you can play around with it.
http://developer.yahoo.com/yql/console/
YQL is the lazy developer's way to build your own api on the fly. The main inconvenience is that you have to use Yahoo as a go-between, but if you're ok with that, then I'd suggest you go that route. Using YQL is probably the quickest way to get that kind of work done (especially if the html you're targeting keeps on changing and if its html tags are not always valid).
Using regex to parse a website is always a bad idea:
How to use regular expressions to parse HTML in Java?
Using regular expressions to parse HTML: why not?
Have a look at the Apache Tika library for extracting text from HTML - there are many other parsers also available, such as PDF etc. : http://tika.apache.org/

Generate HTML from plain text using Java

I have to convert a .log file into a nice and pretty HTML file with tables. Right now I just want to get the HTML header down. My current method is to println to file every single line of the HTML file. for example
p.println("<html>");
p.println("<script>");
etc. there has to be a simpler way right?
How about using a JSP scriplet and JSTL?, you could create some custom object which holds all the important information and display it formatted using the Expression Language.
Printing raw HTML text as strings is probably the "easiest" (most straightforward) way to do what you're asking but it has its drawbacks (e.g. properly escaping the content text).
You could use the DOM (e.g. Document et al) interface provided by Java but that would hardly be "easy". Perhaps there are "DOM builder" type tools/libraries for Java that would simplify this task for you; I suggest looking at dom4j.
Look at this Java HTML Generator library (easy to use). It should make generating the actual HTML muuuch clearer. There are complications when creating HTML with Java Strings (what happens if you want to change something like a rowspan?) that can be avoided with this library. Especially when dealing with tables.
There are many templating engines available. Have a look at https://stackoverflow.com/questions/174204/suggestions-for-a-java-based-templating-engine
This way you can define a template in a txt file and have the java code fill in the variables.

How to detect different data types inside HTML page?

What is the best way to detect data types inside html page using Java facilities DOM API, regexp, etc?
I'd like to detect types like skype plugin does for the phone/skype numbers, similar for addresses, emails, time, etc.
'Types' is an inappropriate term for the kind of information you are referring to. Choice of DOM API or regex depends upon the structure of information within the page.
If you know the structure, (for example tables being used for displaying information, you already know from which cell you can find phone number and which cell you can find email address), it makes sense to go with a DOM API.
Otherwise, you should use regex on plain HTML text without parsing it.
I'd use regexes in the following order:
Extract only the BODY content
Remove all tags to leave just plain text
Match relevant patterns in text
Of course, this assumes that markup isn't providing hints, and that you're purely extracting data, not modifying page context.
Hope this helps,
Phil Lello

Categories

Resources