Just wondering which is best here. I want to output data from a table in my DB and then put a lot of this data into an HTML table on the fly on my page. I'm working with Java on the server side. Basically I pull the results from the DB and have the raw data... just what next?
There is a chance I may want to take data from multiple tables in order to combine it into one table for my site.
I retrieve the results of the query from the DB; now do I build a JSON string from them, which I can parse as JSON using jQuery once it reaches my browser? (Kind of a sub-question of this question: is just using a StringBuilder the correct way to build a JSON object for output?)
Or..
Should I build the HTML as a string and output that to the browser instead?
Which is better and why?
I've built entire pages from JSON data on the client. It reduces the redundancy of repeating HTML and can lead to better performance, depending on the complexity of your HTML.
I had a large catalog that used multiple tabs for different sections. Sending it all to the client as JSON and generating the resulting HTML was way faster than downloading the equivalent HTML.
What you lose, of course, is SEO. Search engines won't be able to see the JavaScript-generated output. There are ways around this, using hash-URL techniques.
I used to be in favor of generating HTML on the server so that the client can be dumb and simply inject dynamic content. The pragmatic real-world advantage for our small team was that we needed to be expert in fewer technologies. We focused on the middle tier and back end and spent less time on the front end.
Lately, with tools like jQuery, it is easier and easier to do more robust client-side work without having to increase dev bandwidth much. From the client side, I can say that building dynamic HTML from JSON using jQuery isn't that hard.
From the server side, I'm sure there are tools to serialize to JSON. I wouldn't roll your own with StringBuilder. Sorry, I'm not a Java guy, so I don't have a recommendation.
I'd go with JSON if I knew I had anything more than just static views of the data in mind later on.
But if it's just so that you can see what the result was, and you don't care too much about the data, then I'd go with the straightforward HTML output.
For actually generating the JSON server-side, there are a number of libraries you can use. org.json is the canonical one, but I prefer Stringtree personally.
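For illustration, here is a minimal sketch using org.json; the column names are made up, so adjust them to your schema:

import java.sql.ResultSet;
import java.sql.SQLException;
import org.json.JSONArray;
import org.json.JSONObject;

public class JsonExport {
    // Turns a result set into a JSON array of objects, one object per row.
    static String toJson(ResultSet rs) throws SQLException {
        JSONArray rows = new JSONArray();
        while (rs.next()) {
            JSONObject row = new JSONObject();
            row.put("id", rs.getInt("id"));        // hypothetical columns;
            row.put("name", rs.getString("name")); // adjust to your schema
            rows.put(row);
        }
        return rows.toString(); // write this string to the HTTP response
    }
}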
I'm trying to scrape a website and compile a spreadsheet based on what data I pull.
The website I am trying to scrape is WEARVR.
I am not too experienced with scraping, but my approach would be to find unique attributes within HTML tags and use those to scrape what I want.
So for this website my approach would be firstly to scrape a list of URLs of the pages you are taken to upon clicking on one of the experiences, for example : https://www.wearvr.com/#game_id=game_1041, and then secondly, cycle through this list scraping the relevant attributes each time.
However I am stuck at the first step: instead of working with simple <a href> tags, I come across data-reactid attributes, which confuse the matter.
I do my scraping with iMacros but I'm pretty decent at Java now so would learn scraping in Java if need be (which seems likely as iMacros is pretty limited).
My question is: how do these data-reactid attributes work, and how can I utilise them for my scraping purposes?
Additionally if this is an XY problem, please let me know and suggest a better approach.
Thanks for reading!
The simplest way to approach scraping is to treat the page like a big string (because ultimately, that is what it is). You can search within that string for certain things (like href=) to grab links. You can also intelligently assume that whatever is in the a tags is relevant to the link and grab that.
You really don't have to understand HTML, and you don't have to understand how the page or any additional CSS or markup works; you just need to identify what sort of identifiable string combinations surround the text you want. I will say this is probably much easier to implement in Java than with iMacros, and probably more accurate.
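To make that concrete, here is a minimal sketch of the string approach in Java, assuming you already have the page in a String named html:

import java.util.ArrayList;
import java.util.List;

public class LinkExtractor {
    // Crude string search for href="..." values; no HTML parsing involved.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        int i = 0;
        while ((i = html.indexOf("href=\"", i)) != -1) {
            int start = i + "href=\"".length();
            int end = html.indexOf('"', start);
            if (end == -1) break;
            links.add(html.substring(start, end));
            i = end + 1;
        }
        return links;
    }
}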
The other way you can handle it, which requires a little more knowledge of HTML and XML, is to treat the entire page as an XML document. This... doesn't always work with HTML, particularly if it is older or badly formed, so the string approach is easier. You get some utility out of the various XML mapping libraries that exist, but otherwise it's similar to the above.
O community, I'm in the process of writing the pseudocode for an application that extracts song lyrics from a remote host (a web server, not my own) by reading the page's source code.
This is assuming that:
Lyrics are being displayed in plaintext
Portion of source code containing lyrics is readable by Java front-end application
I'm not looking for source code to answer the question, but what is the technical term used for querying a remote webpage for plaintext content?
If I can determine the webpage naming scheme, I could point the URL object at the appropriate webpage, right? The only limitations would be irregular capitalization, and it would only be effective if the plaintext was found in EXACTLY the same place.
Do you have any suggestions?
I was thinking something like this for "Buck 65" singing "I look good":
URL url = new URL("http://www.elyrics.net/read/b/buck-65-lyrics/i-look-good-lyrics.html");
I could substitute "buck-65-lyrics" & "i-look-good-lyrics" to reflect user input?
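Something like the following sketch could do that substitution; the slug rules here (lowercase, non-alphanumerics collapsed to hyphens) are my guess at how the site names its pages:

import java.net.MalformedURLException;
import java.net.URL;

public class LyricsUrl {
    // Assumed slug rules: lowercase, runs of non-alphanumerics become hyphens.
    static String slug(String s) {
        return s.toLowerCase().replaceAll("[^a-z0-9]+", "-").replaceAll("^-|-$", "");
    }

    static URL lyricsUrl(String artist, String song) throws MalformedURLException {
        String a = slug(artist); // "Buck 65" -> "buck-65"
        return new URL("http://www.elyrics.net/read/" + a.charAt(0) + "/"
                + a + "-lyrics/" + slug(song) + "-lyrics.html");
    }
}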
Input re-directed to PostgreSQL table
Current objective:
User will request name of {song, artist, album}, Java front-end will query remote webpage
Full source code (containing plaintext) will be extracted with Java front-end
Lyrics will be extracted from source code (somehow)
If the song is not currently indexed by the PostgreSQL server, it will be added to the table.
Operations will be made on the plaintext to suit the objectives of the program
I'm only looking for direction. If I'm headed completely in the wrong direction, please let me know. This is only for the pseudocode; I'm not looking for answers or hand-outs, I need assistance in determining what I need to do. Are there external libraries for extracting plaintext that you know of? What technical names are there for what I'm trying to accomplish?
Thanks, Tyler
This approach is referred to as screen or data scraping. Note that employing it often breaks the target service's terms of service. Usually, this is not a robust approach, which is why API-like services with guarantees about how they operate are preferable.
Your approach sounds like it will work for the most part, but a few things to keep in mind.
If the web service you're interacting with requires a very precise URL scheme, you should not feed your user-provided data directly into it, since it is likely to be muddied by missing words, abbreviations, or misspellings. You might be better off doing some sort of search first, and using that search's best result.
Reading HTML data is more complicated than you think. Use an existing library like jsoup to assist you.
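For example, a minimal jsoup sketch; the selector is hypothetical, since you'd have to inspect the real page to find the element that holds the lyrics:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LyricsFetcher {
    static String fetchLyrics(String url) throws IOException {
        Document doc = Jsoup.connect(url).get();
        // "#lyrics" is a made-up selector; find the real one in the page source.
        return doc.select("#lyrics").text();
    }
}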
The technical term for extracting content from a site is web scraping; you can Google that. There are a lot of libraries; for Java there is jsoup, though it's easy to write your own regex.
The first thing I would do is use curl to get the content from the site, just for testing; this will give you a fair idea of what to do.
You will have to use an HTML parser. One of the most popular is jsoup.
Take care about the legal aspect of what you do ;)
So, I'm using HTTP POST requests in Android Java to log into a website, before extracting the entire HTML code. After that, I use Pattern/Matcher (regex) to find all the elements I need, before extracting them from the HTML data and deleting everything unnecessary. For instance, when I extract this:
String extractions = "<td>Good day sir</td>";
Then I use:
extractions = extractions.replaceAll("<td>", "").replaceAll("</td>", "");
I do this multiple times until I have all the data needed from that site, before I display it in some kind of list.
I'm not particularly stuck on anything, but please, can you tell me if this is an effective/efficient/fast way of getting data from a page and processing it, or are there ways to do this faster? Because sometimes it's like my program takes a lot of time to get certain data (although mostly that's when I'm on 3G on my phone).
Like others have said, regex is not the best tool for this job. But in this case, the particular way you use regex is even more inefficient than it would normally be.
In any case, let me offer one more possible solution (depending on your use case).
It's called YQL (Yahoo Query Language).
http://developer.yahoo.com/yql/
Here is a console for it so you can play around with it.
http://developer.yahoo.com/yql/console/
YQL is the lazy developer's way to build your own API on the fly. The main inconvenience is that you have to use Yahoo as a go-between, but if you're OK with that, then I'd suggest you go that route. Using YQL is probably the quickest way to get that kind of work done (especially if the HTML you're targeting keeps changing and its tags are not always valid).
Using regex to parse a website is always a bad idea:
How to use regular expressions to parse HTML in Java?
Using regular expressions to parse HTML: why not?
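With a real parser, the <td> example from the question above becomes a one-liner. A sketch using jsoup:

import org.jsoup.Jsoup;

public class TdDemo {
    public static void main(String[] args) {
        String html = "<table><tr><td>Good day sir</td></tr></table>";
        // .text() returns the element's text with all markup stripped.
        System.out.println(Jsoup.parse(html).select("td").text()); // Good day sir
    }
}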
Have a look at the Apache Tika library for extracting text from HTML; parsers for many other formats, such as PDF, are also available: http://tika.apache.org/
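A minimal sketch with Tika's facade class, assuming the HTML has already been saved to a local file:

import java.io.File;
import org.apache.tika.Tika;

public class TextExtract {
    public static void main(String[] args) throws Exception {
        // parseToString detects the format and returns the plain text content.
        String text = new Tika().parseToString(new File("page.html"));
        System.out.println(text);
    }
}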
I'm looking for a good web framework for compositing multiple JSON sources, fetched with HTTP requests, into static HTML. I am not looking to do this on the client side (browser, JavaScript); I am looking for the best server-side solution.
So, what I need to do is:
Fetch several different JSON documents over HTTP
Format that JSON as HTML content, mostly with templates but some dynamic custom HTML
Basic login/logout/preferences customization, nothing major
Mostly stateless pages; what state there is, comes already in the JSON
User / search engine friendly / bookmarkable URLs; should be precisely customizable
How I'd like to do it:
A lean solution, perhaps just a template engine
HTML templates that have no custom syntax over HTML/XML, like Wicket and almost like Tapestry
Application server that is scalable and utilizes multiple CPUs properly (for example, a single Python process does not)
Preferably Java, but if Java doesn't have anything that fits, willing to consider others
As for the template part, if this were to be in JavaScript in the browser, something like PURE would be my tool of choice.
You might want to check out RESTx. That's an open-source platform for the easy creation of RESTful resources. It allows you to write custom data access and integration logic in either Java or Python. Getting data from multiple sources and combining them is what it's made for, so this should be a pretty close fit. Data output is rendered according to what the user requested: for example, as a further JSON data source, or as the same data rendered in HTML.
The HTML rendering is currently according to a built-in template. However, that should be easy enough to modify. I'm one of the developers on that project, so if you need some special template functionality, let me know and I will see what I can do.
To give you an example: Assume you have two JSON resources, in your code you would write this (I'm giving a Python example here, but the Java example would look very similar):
status, data_1 = accessResource("/resource/some_resource")
status, data_2 = accessResource("/resource/some_other_resource")
# data_1 and data_2 now hold a dictionary, list, etc., depending on the JSON
# that was returned.
# ... some code that combines and processes the data and produces a dict or list
# with the result. The data object you return here is then automatically rendered
# in either HTML or JSON, depending on the client request.
return Result.ok(data)
Also take a look at the example for some simple data integration here.
I think that the only framework you need is a library that reads JSON. The templates can very well be standard JSP pages.
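As a sketch of that idea, assuming org.json for parsing and a JSP at /WEB-INF/page.jsp (both names are mine, not a recommendation):

import java.io.IOException;
import java.net.URL;
import java.util.Scanner;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.json.JSONObject;

public class PageServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Fetch one JSON document over HTTP (the URL is a placeholder).
        String body = new Scanner(new URL("http://example.com/data.json").openStream(),
                "UTF-8").useDelimiter("\\A").next();
        // Parse it and let a plain JSP render the HTML.
        req.setAttribute("data", new JSONObject(body));
        req.getRequestDispatcher("/WEB-INF/page.jsp").forward(req, resp);
    }
}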
JSON stands for JavaScript Object Notation. But how come languages like PHP, Java, C, etc. can also communicate with each other using JSON?
What I want to know is: am I correct to say that JSON is not limited to JS only, but serves as a protocol for applications to communicate with each other over the network, which is the same purpose as XML?
JSON cannot handle complex data hierarchies like XML can (attributes, namespaces, etc.), but on the other hand you don't get the same overhead with JSON as you get with XML (if you don't need the complex data structures).
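For example, here is the same (made-up) record in both notations; the extra overhead in XML comes mostly from the closing tags:

{"artist": "Buck 65", "song": "I look good"}

<track>
  <artist>Buck 65</artist>
  <song>I look good</song>
</track>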
Since JSON is plain text with a special notation for JS to interpret, it's an easy protocol to adopt in other languages.
It is easy for a JS script to parse JSON, since it can be done using 'eval', in which the JS engine can use its full power.
On the other hand, it is more complicated to generate JSON from within JS. Usually one uses the JSON package from www.json.org, in which an object can easily be serialised using JSON.stringify, but it is implemented in JS so it's not running with optimal performance.
So serialising JSON is about the same complexity using JS as when using Java, PHP or any other server side language.
Therefore, in my opinion, JSON is best suited when there is asymmetry between producer and consumer, e.g. a web server that generates a lot of data that is consumed by the web application. Not the other way around.
But! When one chooses JSON as the data format, it should be used in both directions, not XML one way and JSON the other. The exception is when simple GET requests are used to retrieve JSON data.
Yes, JSON is also widely used as a data exchange protocol, much like XML.
Typically a program (not written in JavaScript) needs a JSON library to parse and create JSON objects (although you can probably create them even without one).
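For instance, a minimal Java sketch with the org.json library; the document and field names are invented:

import org.json.JSONObject;

public class ParseDemo {
    public static void main(String[] args) {
        // Parse a JSON document produced by any language; nothing JS-specific here.
        JSONObject obj = new JSONObject("{\"artist\":\"Buck 65\",\"song\":\"I look good\"}");
        System.out.println(obj.getString("artist")); // prints: Buck 65
    }
}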
You're right - it's a lightweight data interchange format; more details at: http://www.json.org
You are completely correct. JSON is a definition of how data should be formatted. It is more lightweight than XML and therefore well suited to things like AJAX, where you want to send data back and forth to the server quickly.