Scraping issue (data-reactid) - java

I'm trying to scrape a website and compile a spreadsheet based on what data I pull.
The website I am trying to scrape is WEARVR.
I am not too experienced with scraping, but my approach would be to find unique attributes within HTML tags and use these to scrape what I want.
So for this website my approach would be, firstly, to scrape a list of the URLs of the pages you are taken to upon clicking on one of the experiences, for example https://www.wearvr.com/#game_id=game_1041, and then, secondly, to cycle through this list, scraping the relevant attributes each time.
However, I am stuck at the first step: instead of working with simple "a href" tags, I come across "data-reactid" attributes, which confuse the matter.
I do my scraping with iMacros, but I'm pretty decent at Java now, so I would learn scraping in Java if need be (which seems likely, as iMacros is pretty limited).
My question is: how do these "data-reactid" attributes work, and how can I utilise them for my scraping purposes?
Additionally if this is an XY problem, please let me know and suggest a better approach.
Thanks for reading!

The simplest way to approach scraping is to treat the page like a big string (because ultimately, that is what it is). You can search within that string for certain things (like href=) to grab links. You can also intelligently assume that whatever is in the a tags is relevant to the link and grab that.
You really don't have to understand HTML, and you don't have to understand how the page or any additional CSS or markup works; you just need to identify what sort of identifiable string combinations surround the text you want. I will say this is probably much easier to implement in Java than in iMacros, and probably more accurate.
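For instance, a very rough sketch of the string approach (the class and method names are just illustrative, and it assumes you have already downloaded the page source into a String):

import java.util.ArrayList;
import java.util.List;

public class LinkExtractor {
    // Walk the raw page source and collect everything that follows href="
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        int index = 0;
        while ((index = html.indexOf("href=\"", index)) != -1) {
            int start = index + "href=\"".length();
            int end = html.indexOf('"', start); // closing quote of the attribute value
            if (end == -1) break;
            links.add(html.substring(start, end));
            index = end + 1;
        }
        return links;
    }
}

This is deliberately dumb: it assumes double quotes and no spaces around the =, which is exactly the trade-off of the string approach.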
The other way you can handle it, which requires a little more knowledge of HTML and XML, is to treat the entire page as an XML document. This doesn't always work with HTML, particularly if it is older or badly formed, so the string approach is easier. You get some utility out of the various XML mapping libraries that exist, but otherwise it's similar to the above.
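A sketch of that XML route, which only works when the page happens to parse as well-formed XHTML (the URL below is just a placeholder):

import java.io.InputStream;
import java.net.URL;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XmlLinkExtractor {
    public static void main(String[] args) throws Exception {
        // Fails with a parse exception if the page is not well-formed XML
        try (InputStream in = new URL("https://example.com/").openStream()) {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(in);
            NodeList anchors = doc.getElementsByTagName("a");
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                System.out.println(a.getAttribute("href") + " -> " + a.getTextContent());
            }
        }
    }
}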

Related

Parsing html text to obtain input fields

So I currently have a big blob of html text, and I want to generate an input form based on what is contained in that text. For example, if the text contains '[%Name%]', I want to be able to read that in and recognize 'Name' is there, and so in turn enable a form field for name. There will be multiple tags ([%age%], [%height%], etc.)
I was thinking about using regex, but after doing some research it seems that regex is a horrible idea for parsing HTML. I came across parsing HTML pages with Groovy, but it is not strictly applicable to my implementation. I am storing the HTML-formatted text (which I am creating using CKEditor) in a database.
Is there an efficient way to do this in Java/Groovy? Or should I just create an algorithm similar to the examples shown here? (I'm not too sure how effective the given algorithms would be, as they seem to be constructed around relatively small strings, whereas my string to parse may end up being quite large, a 15-20 page document.)
Thanks in advance
Instead of reinventing the wheel, I think it's better to use jsoup. It is an excellent tool for your task and makes it easy to obtain anything in an HTML page using its selector syntax. Check out the examples of usage in their cookbook.
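For instance, a rough sketch combining jsoup with a small regex for the [% %] tokens themselves (since those are plain text rather than markup); it assumes the placeholders look exactly like [%Name%], so adjust the pattern to your real convention:

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PlaceholderFinder {
    public static void main(String[] args) {
        String html = "<p>Hello [%Name%], you are [%age%] years old.</p>";
        // Let jsoup deal with the markup, then scan the visible text for [%...%] tokens
        Document doc = Jsoup.parse(html);
        String text = doc.text();
        Pattern placeholder = Pattern.compile("\\[%\\s*([^%\\]]+?)\\s*%\\]");
        Set<String> fields = new LinkedHashSet<>();
        Matcher m = placeholder.matcher(text);
        while (m.find()) {
            fields.add(m.group(1)); // e.g. "Name", "age"
        }
        System.out.println(fields);
    }
}

Each captured name then maps to one input field in the generated form.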

Java - Extracting plaintext from web-page source code (getting massive quantities of lyrics from website)

O community, I'm in the process of writing the pseudocode for an application that extracts song lyrics from a remote host (web-server, not my own) by reading the page's source code.
This is assuming that:
Lyrics are being displayed in plaintext
Portion of source code containing lyrics is readable by Java front-end application
I'm not looking for source code to answer the question, but what is the technical term used for querying a remote webpage for plaintext content?
If I can determine the webpage naming scheme, I could set the pointer of the URL object to the appropriate webpage, right? The only limitations would be irregular capitalization, and would only be effective if the plaintext was found in EXACTLY the same place.
Do you have any suggestions?
I was thinking something like this for "Buck 65", singing "I look good"
URL url = new URL("http://www.elyrics.net/read/b/buck-65-lyrics/i-look-good-lyrics.html");
I could substitute "buck-65-lyrics" & "i-look-good-lyrics" to reflect user input?
Input re-directed to PostgreSQL table
Current objective:
User will request name of {song, artist, album}, Java front-end will query remote webpage
Full source code (containing plaintext) will be extracted with Java front-end
Lyrics will be extracted from source code (somehow)
If the song is not currently indexed by the PostgreSQL server, it will be added to the table.
Operations will be made on the plaintext to suit the objectives of the program
I'm only looking for direction. If I'm headed completely in the wrong direction, please let me know. This is only for the pseudocode. I'm not looking for answers, or hand-outs, I need assistance in determining what I need to do. Are there external libraries for extracting plaintext that you know of? What technical names are there for what I'm trying to accomplish?
Thanks, Tyler
This approach is referred to as screen or data scraping. Note that employing it often breaks the target service's terms of service. Usually, this is not a robust approach, which is why API-like services with guarantees about how they operate are preferable.
Your approach sounds like it will work for the most part, but a few things to keep in mind.
If the web service you're interacting with requires a very precise URL scheme, you should not feed your user-provided data directly into it, since it is likely to be muddied by missing words, abbreviations, or misspellings. You might be better off doing some sort of search, first, and using that search's best result.
Reading HTML data is more complicated than you think. Use an existing library like jsoup to assist you.
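As a minimal jsoup sketch of that idea; the URL pattern mirrors the one in the question, but both it and the #inlyr selector are guesses you would have to verify against the real page (and check the site's terms of service before automating this):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LyricsFetcher {
    // Very naive "slug" builder: "Buck 65" -> "buck-65"
    static String slug(String s) {
        return s.trim().toLowerCase().replaceAll("[^a-z0-9]+", "-");
    }

    public static void main(String[] args) throws Exception {
        String artist = slug("Buck 65");
        String song = slug("I Look Good");
        String url = "http://www.elyrics.net/read/" + artist.charAt(0) + "/"
                + artist + "-lyrics/" + song + "-lyrics.html";
        Document doc = Jsoup.connect(url).userAgent("Mozilla/5.0").get();
        // "#inlyr" is a placeholder selector - inspect the page to find the real lyrics container
        String lyrics = doc.select("#inlyr").text();
        System.out.println(lyrics);
    }
}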
The technical term for extracting content from a site is web scraping; you can google that. There are a lot of libraries available online; for Java there is jsoup. Though it's easy to write your own regex.
The first thing I would do is use curl to get the content from the site, just for testing; this will give you a fair idea of what to do.
You will have to use an HTML parser. One of the most popular is jsoup.
Take care about the legal aspects of what you do ;)

Is HTML parsing (in Java/Android) then extracting data from it, an effective way of getting a webpage's content?

So, I'm using HTTP Post Requests in Android Java to log into a website, before extracting the entire HTML code. After that, I use Pattern/Matcher (regex) to find all the elements I need before extracting them from the HTML data, and deleting everything unnecessary. For instance when I extract this:
String extractions = "<td>Good day sir</td>";
Then I use:
extractions = extractions.replaceAll("<td>", "").replaceAll("</td>", "");
I do this multiple times until I have all the data needed from that site, before I display it in some kind of list.
I'm not particularly stuck on anything, but please, can you tell me if this is an effective/efficient/fast way of getting data from a page and processing it, or are there ways to do this faster? Because sometimes it's like my program takes a lot of time to get certain data (although mostly that's when I'm on 3G on my phone).
Like others have said, regex is not the best tool for this job. But in this case, the particular way you use regex is even more inefficient than it would normally be.
In any case, let me offer one more possible solution (depending on your use case).
It's called YQL (Yahoo Query Language).
http://developer.yahoo.com/yql/
Here is a console for it so you can play around with it.
http://developer.yahoo.com/yql/console/
YQL is the lazy developer's way to build your own api on the fly. The main inconvenience is that you have to use Yahoo as a go-between, but if you're ok with that, then I'd suggest you go that route. Using YQL is probably the quickest way to get that kind of work done (especially if the html you're targeting keeps on changing and if its html tags are not always valid).
Using regex to parse a website is always a bad idea:
How to use regular expressions to parse HTML in Java?
Using regular expressions to parse HTML: why not?
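For comparison, the replaceAll chain from the question could be handled by a parser such as jsoup instead (a sketch, not tailored to your actual page):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TdExtractor {
    public static void main(String[] args) {
        String html = "<table><tr><td>Good day sir</td><td>Another cell</td></tr></table>";
        Document doc = Jsoup.parse(html);
        // One pass over the parsed tree instead of repeated replaceAll calls on the raw string
        for (Element td : doc.select("td")) {
            System.out.println(td.text());
        }
    }
}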
Have a look at the Apache Tika library for extracting text from HTML; parsers for many other formats, such as PDF, are also available: http://tika.apache.org/
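A minimal Tika sketch for the plain-text case (the URL is just a placeholder):

import java.io.InputStream;
import java.net.URL;
import org.apache.tika.Tika;

public class TextExtractor {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // Tika detects the content type (HTML here) and strips the markup for you
        try (InputStream in = new URL("https://example.com/").openStream()) {
            String plainText = tika.parseToString(in);
            System.out.println(plainText);
        }
    }
}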

How to create templates from html page automatically?

I have a use case in which I need to render an unformatted text in the format of a given web page programmatically in Java. i.e. The text should automatically be formatted like the web page with styles, paragraphs, bullet points etc.
As I see it, I will first have to analyze the piece of unformatted text to find candidates for paragraphs, bullet points, headings, etc. I intend to use Lucene analyzers/tokenizers for this task. Are there any alternatives?
The second problem is to convert the formatted web page into some kind of template (e.g. a Velocity template) with placeholders for various entities like titles, bullet points, etc.
Is there any text analysis/templating library in Java that can help me do this? Preferably open source.
Are there any other suggestions for doing this sort of task in a better way in Java?
Thanks for your help.
There are a lot of hard parts to what you're doing.
The user input
If you don't ask your user to provide any context, you're never going to guess the structure of the text. At the least, you should ask them to provide a title and a series of paragraphs in your GUI.
Ideally, you could ask them to follow a well-known markup language (Markdown, Textile, etc.) and use an open source parser to extract the structure.
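For example, with Markdown and the commonmark-java parser the structure comes for free (a sketch; the input string stands in for whatever the user typed):

import org.commonmark.node.Node;
import org.commonmark.parser.Parser;
import org.commonmark.renderer.html.HtmlRenderer;

public class MarkdownToHtml {
    public static void main(String[] args) {
        String input = "# Hello World\n\nFirst paragraph.\n\n* a bullet point\n* another one";
        Parser parser = Parser.builder().build();
        Node document = parser.parse(input); // headings, paragraphs and lists are now explicit nodes
        String html = HtmlRenderer.builder().build().render(document);
        System.out.println(html); // prints the rendered HTML, e.g. <h1>Hello World</h1> ...
    }
}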
The external page
If an arbitrary page is used, the only things you can rely on are the "structural markup". So assuming you know the title of the page should be "Hello World", and there is an "h1" element somewhere in the page, you can maybe assume that this is where the header could go.
But if the page is a div tag-soup, and only CSS is used to differentiate the rendering of the header from the bulk of the text, you're going to have to guess how the styling is done: that's plain impossible if you don't know how the page is made.
I don't think Lucene would help for this (as far as I know, Lucene is made to create an index of the words used in a bulk of text; I don't think it can help you guess which part of the text is meant to be a title, a subtitle, etc.).
Generating templates from external page
Assuming you have "guessed" right, you could generate the content by
copy pasting the page
replacing the parts to change with tags of your template language of choice
storing the template somewhere the templating system can access it
configure your template/view system (a viewResolver for Velocity) to use the right template for the right person (a rough Velocity sketch of steps 2-4 follows below)
That would of course pose terrible legal questions, since your templates would incorporate works by the original website author (most probably copyrighted material)
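To make steps 2-4 concrete, here is a minimal Velocity sketch; page.vm is a hypothetical template, i.e. the copied page with $title and $paragraphs placed where the original content was, sitting in the working directory:

import java.io.StringWriter;
import java.util.Arrays;
import org.apache.velocity.Template;
import org.apache.velocity.VelocityContext;
import org.apache.velocity.app.VelocityEngine;

public class PageRenderer {
    public static void main(String[] args) {
        VelocityEngine engine = new VelocityEngine();
        engine.init(); // default file resource loader looks in the current directory

        // page.vm is the copied page with placeholders, e.g.:
        //   <h1>$title</h1>
        //   #foreach($p in $paragraphs) <p>$p</p> #end
        Template template = engine.getTemplate("page.vm");

        VelocityContext context = new VelocityContext();
        context.put("title", "Hello World");
        context.put("paragraphs", Arrays.asList("First paragraph.", "Second paragraph."));

        StringWriter out = new StringWriter();
        template.merge(context, out);
        System.out.println(out); // the user's content rendered inside the borrowed layout
    }
}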
A more realistic solution
I would suggest you constrain your problem to :
using input that has some structure information available (use a GUI to enter it, use a markup language, whatever)
using templates that you provide, know the structure of (and can reuse very easily)
Note that none of those points are related to the template system.
Otherwise, I'm afraid you're heading toward an unreasonable amount of work...

How to detect different data types inside HTML page?

What is the best way to detect data types inside an HTML page using Java facilities (DOM API, regexp, etc.)?
I'd like to detect types the way the Skype plugin does for phone/Skype numbers, and similarly for addresses, emails, times, etc.
'Types' is an inappropriate term for the kind of information you are referring to. Choice of DOM API or regex depends upon the structure of information within the page.
If you know the structure (for example, tables being used to display the information, where you already know which cell holds the phone number and which cell holds the email address), it makes sense to go with a DOM API.
Otherwise, you should use regex on plain HTML text without parsing it.
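For the structured case, a small jsoup sketch (the table layout and id here are made up; they stand in for whatever structure your page actually has):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class ContactTableReader {
    public static void main(String[] args) {
        String html = "<table id='contact'><tr><td>Phone</td><td>+1 555 0100</td></tr>"
                + "<tr><td>Email</td><td>someone@example.com</td></tr></table>";
        Document doc = Jsoup.parse(html);
        // When you know which cell holds what, you can address it directly
        Elements rows = doc.select("#contact tr");
        String phone = rows.get(0).select("td").get(1).text();
        String email = rows.get(1).select("td").get(1).text();
        System.out.println(phone + " / " + email);
    }
}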
I'd use regexes in the following order (a rough sketch follows below):
Extract only the BODY content
Remove all tags to leave just plain text
Match relevant patterns in text
Of course, this assumes that the markup isn't providing hints, and that you're purely extracting data, not modifying page content.
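A minimal Java sketch of those three steps (the email pattern is deliberately naive; real-world detection needs more careful patterns):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailDetector {
    public static void main(String[] args) {
        String html = "<html><head><title>x</title></head>"
                + "<body><p>Contact: jane.doe@example.com or call 555-0100.</p></body></html>";

        // 1. Extract only the BODY content
        Matcher bodyMatcher = Pattern.compile("(?is)<body[^>]*>(.*?)</body>").matcher(html);
        String body = bodyMatcher.find() ? bodyMatcher.group(1) : html;

        // 2. Remove all tags to leave just plain text
        String text = body.replaceAll("(?s)<[^>]+>", " ");

        // 3. Match relevant patterns in the text (naive email pattern)
        Matcher email = Pattern.compile("[\\w.%+-]+@[\\w.-]+\\.[A-Za-z]{2,}").matcher(text);
        while (email.find()) {
            System.out.println("email: " + email.group());
        }
    }
}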
Hope this helps,
Phil Lello
