I am familiar with the Java programming language. I would like to extract data from a website and store it in a database running on my machine. Is that possible in Java? If so, which API should I use? For example, there are a number of schools listed on a website. How can I extract that data and store it in my database using Java?
What you're referring to is commonly called 'screen scraping'. There are a variety of ways to do this in Java; however, I prefer HtmlUnit. While it was designed as a way to test web functionality, you can use it to hit a remote web page and parse it.
I would recommend using an HTML parser with good error handling, such as TagSoup, to extract exactly what you're looking for from the HTML.
You definitely need a good parser like NekoHTML.
Here's an example of using NekoHTML, albeit using Groovy (a Java-based scripting language) rather than Java itself:
http://www.keplarllp.com/blog/2010/01/better-competitive-intelligence-through-scraping-with-groovy
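For a plain-Java illustration of the same idea, here is a minimal sketch that uses the lenient HTML parser bundled with the JDK (javax.swing.text.html.parser.ParserDelegator) instead of NekoHTML or TagSoup. It assumes, purely for illustration, that each school name sits in its own <li> element; the class and method names are made up, and a real page will need its own selection logic.

```java
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collects the text of every <li> element in a page.
class ListItemExtractor {

    static List<String> extractListItems(String html) {
        List<String> items = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private StringBuilder current; // non-null only while inside an <li>

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.LI) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (current != null) {
                    current.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.LI && current != null) {
                    String text = current.toString().trim();
                    if (!text.isEmpty()) {
                        items.add(text);
                    }
                    current = null;
                }
            }
        };

        try {
            // 'true' tells the parser to ignore charset declarations in the markup,
            // since the input is already a decoded String.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return items;
    }
}
```

The returned list can then be written to the database with ordinary JDBC inserts. A dedicated library such as the ones named above is likely to cope better with badly broken markup.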
You can use VietSpider XML from
http://sourceforge.net/projects/binhgiang/files/
Download VietSpider3_16_XML_Windows.zip or VietSpider3_16_XML_Linux.zip
VietSpider Web Data Extractor is software that crawls data from websites (a data scraper), formats it as standard XML (Text, CDATA), and then stores it in a relational database. It supports various RDBMSs such as Oracle, MySQL, SQL Server, H2, HSQL, Apache Derby, and Postgres. The VietSpider crawler supports sessions (login, query by form input), multi-downloading, JavaScript handling, and proxies (including multi-proxy by automatically scanning proxies from a website).
Depending on what you are really trying to do, you can use many different solutions.
If you just want to fetch the HTML code of a web page, then URL.getContent() may be your solution. Here is a little tutorial:
http://www.javacoffeebreak.com/books/extracts/javanotesv3/c10/s4.html
EDIT: I didn't realize he was looking for a way to parse the HTML code. Some tools have been suggested above. Sorry for that.
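For the fetch-only case, a minimal sketch could look like the following. It uses URL.openStream() rather than getContent(), because openStream() hands back the raw bytes directly, and it assumes the page is encoded as UTF-8. The class name is hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: downloads whatever the URL points at as text.
class PageFetcher {

    /** Reads the whole resource behind the URL and decodes it as UTF-8 (an assumption). */
    static String fetch(URL url) {
        try (InputStream in = url.openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException("Could not read " + url, e);
        }
    }
}
```

It would be called along the lines of PageFetcher.fetch(URI.create("http://www.example.com/schools").toURL()), where the address is only a placeholder. The returned string still has to go through an HTML parser before anything useful can be stored.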
Related
I would like to read data from an Oracle database and present it (as a report) in a browser (as HTML or another UI technology).
What would be the best approach to do this?
As of now I'm considering the following approaches:
1) JDBC (To read from Oracle) & Java Script to present
2) NodeJS (node-oracledb driver) & AngularJS to present
3) Python (cx_Oracle to interact with DB) & Flask to present it on browser
Which of these approaches do you think would be the most suitable and the fastest?
Any other approach to handle this scenario would be appreciated.
(EDIT: I have a Java background, and alongside it I want to learn a scripting language like Python, or a powerful client-side or server-side framework like AngularJS or Node.js, or a similar technology. So I'm open to anything, but it needs to be fast and scalable.)
JDBC (To read from Oracle) & Java Script to present
"Java Script" is one word: JavaScript.
What would be the best approach to do this?
"best" is relative and subjective. Let's start with this: Do you know Java, JavaScript, or Python? Start with the one you know best. Build something and then reevaluate based on what you know.
Don't know any of those? I'd start with Node.js - but I'm biased! :) The Node.js community is pretty awesome and it helps to only need one language in the front-end and mid-tier environments.
If you're new to JavaScript, reconsider AngularJS (the community favors TypeScript so you'll be pushed to adopt that too). Take a look at Vue.js instead: https://vuejs.org/
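For someone starting from a Java background, as the question describes, a first cut of option 1 might look roughly like this: read rows over JDBC and turn them into an HTML table that a servlet or any small HTTP endpoint can return. This is only a sketch; the class and method names are made up, and the query and connection details are left to the caller.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: query -> list of rows -> HTML table.
class ReportSketch {

    /** Runs the query and returns the column labels as the first row, followed by the data rows. */
    static List<List<String>> readRows(Connection conn, String sql) throws SQLException {
        List<List<String>> table = new ArrayList<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            ResultSetMetaData meta = rs.getMetaData();
            int columns = meta.getColumnCount();
            List<String> header = new ArrayList<>();
            for (int i = 1; i <= columns; i++) {
                header.add(meta.getColumnLabel(i));
            }
            table.add(header);
            while (rs.next()) {
                List<String> row = new ArrayList<>();
                for (int i = 1; i <= columns; i++) {
                    row.add(rs.getString(i)); // may be null for SQL NULL
                }
                table.add(row);
            }
        }
        return table;
    }

    /** Escapes the characters that would otherwise be read as markup. */
    static String escape(String s) {
        if (s == null) {
            return "";
        }
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Renders the first row as header cells and the remaining rows as data cells. */
    static String toHtmlTable(List<List<String>> table) {
        StringBuilder sb = new StringBuilder("<table>");
        for (int r = 0; r < table.size(); r++) {
            String cell = (r == 0) ? "th" : "td";
            sb.append("<tr>");
            for (String value : table.get(r)) {
                sb.append('<').append(cell).append('>')
                  .append(escape(value))
                  .append("</").append(cell).append('>');
            }
            sb.append("</tr>");
        }
        return sb.append("</table>").toString();
    }
}
```

The Connection would typically come from DriverManager.getConnection with an Oracle thin URL of the form jdbc:oracle:thin:@//host:1521/service (host and service name being placeholders) and the Oracle JDBC driver on the classpath.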
My task is to write a Java-based web application which will produce various charts such as WaferMap, Histogram, Overlay Chart, etc.
The front end is ExtJS and chart generation is handled by JFreeChart.
The data for the charts will be in multiple .CSV files stored in the file system.
My questions are:
The .CSV files will be gigabytes in size. Can I store these files in HDFS, query them at run time, and display the data in the front end?
Is using the Hadoop ecosystem a feasible solution for the above requirement? Should I also consider Apache Pig or Hive for querying the CSV files?
Yes, you can (Apache Hive).
It all depends, but Hive seems like what you're looking for. It was designed with a SQL-like feel and supports SQL clauses. It is widely used by major companies such as Facebook, Netflix, and FINRA. In your case, supporting SQL syntax also means that you can easily query the data in your CSV files from Java through Hive's JDBC driver.
http://www.tutorialspoint.com/hive/
Setting up Hive can be a bit difficult at first if you're not too familiar with the Hadoop environment. The link above is a good reference for understanding Hive better and getting started in the right direction.
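To give a rough idea of the JDBC side, here is a sketch that computes histogram bins in Hive rather than in the web application. It assumes a Hive table has already been defined over the CSV files in HDFS (for example as an external table); the table name, column name, and class name are all hypothetical. The Connection would come from the Hive JDBC driver using a URL of the form jdbc:hive2://host:10000/default.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper: lets Hive do the binning and returns bin start -> row count.
class HiveHistogramSketch {

    // Identifiers are concatenated into the query, so only plain names are allowed.
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    /** Builds a HiveQL query that counts rows per fixed-width bin of a numeric column. */
    static String buildHistogramQuery(String table, String column, int binWidth) {
        if (!IDENTIFIER.matcher(table).matches() || !IDENTIFIER.matcher(column).matches()) {
            throw new IllegalArgumentException("Unexpected identifier");
        }
        if (binWidth <= 0) {
            throw new IllegalArgumentException("binWidth must be positive");
        }
        String bin = "floor(" + column + " / " + binWidth + ") * " + binWidth;
        return "SELECT " + bin + " AS bin_start, count(*) AS n FROM " + table
                + " GROUP BY " + bin + " ORDER BY bin_start";
    }

    /** Runs the query and returns the bins in ascending order. */
    static Map<Long, Long> histogram(Connection conn, String table, String column, int binWidth)
            throws SQLException {
        Map<Long, Long> bins = new LinkedHashMap<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(buildHistogramQuery(table, column, binWidth))) {
            while (rs.next()) {
                bins.put(rs.getLong(1), rs.getLong(2));
            }
        }
        return bins;
    }
}
```

The resulting map is small enough to hand straight to JFreeChart, which keeps the gigabyte-sized files out of the web application's memory. Whether Hive's query latency is acceptable for interactive charts is something to measure on your own data.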
Hope this was helpful!
I need to create a document store with search capabilities. Sounds simple...
That means that I have documents which I need to store in a database. I thought about CouchDB, and about a few other document-oriented databases, but I'm still not sure what would be the best solution.
On the other hand, I thought about integrating Solr into some kind of web application which I'm going to use for uploading, indexing, searching, updating, and deleting documents.
And, of course, the main problem is that most of these documents are written using Cyrillic characters.
Maybe I'm trying to combine things that do not match together.
Could someone give me advice on the best way to implement a solution like this?
Best,
Joksimovic
Brother Serb/Montenegrin :)
I suggest you use MongoDB as your database and use Solr to get index/search capability.
I used Solr in my previous (government tender) project and it's GREAT.
No bugs, easy to use when you get into it and it's blindingly fast.
It looks like Thinking Sphinx could help with your needs. You could store documents in any database (SQL-oriented or not) and search them with Sphinx.
Sphinx supports Cyrillic characters out of the box, and it's also possible to use stemming, faceted search, fuzzy search, etc. Maybe it will help you.
Read more about sphinx here
I am also working on such a content management system. So far, my plan is to use a database to store the metadata.
Store the documents on the file system.
Don't store the documents in a database like SQL Server, since it has limitations and licensing costs. For search you can use Solr (it has better support and acceptance in the open-source community than Sphinx):
Choosing a stand-alone full-text search server: Sphinx or SOLR?
Either way, you need to populate the indexes and then call API methods to search.
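As a sketch of the "call API methods to search" step, the following queries Solr's HTTP select handler using only the JDK. The point it illustrates is that Cyrillic query text has to be percent-encoded as UTF-8 before it goes into the URL. The core name, the field name, and the class name are assumptions, and the address is the usual local default rather than anything from this thread.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: sends a query to Solr's /select handler and returns the raw JSON.
class SolrSearchSketch {

    /** Builds the select URL, percent-encoding the query as UTF-8 so Cyrillic text survives. */
    static URI buildSelectUri(String solrBase, String core, String query) {
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return URI.create(solrBase + "/" + core + "/select?q=" + q + "&wt=json");
    }

    /** Performs the HTTP GET and returns the response body decoded as UTF-8. */
    static String search(String solrBase, String core, String query)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(buildSelectUri(solrBase, core, query))
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));
        return response.body();
    }
}
```

A call such as SolrSearchSketch.search("http://localhost:8983/solr", "documents", "text:...") would return the matching documents as JSON. The SolrJ client library wraps the same requests in a typed API if you prefer not to build URLs by hand.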
I need to extract data from a Java web application. To be specific, I am looking to extract real-time stock data from Yahoo Market Tracker. Can anyone please suggest a method?
I'm not sure you can extract the data from Yahoo Market Tracker. Even if you can, you might not be allowed to - I can't see any obvious terms & conditions/licensing. I think (although I could be wrong, anyone got better info?) that you'll need to pay to get access to an API providing near realtime market data.
There is an HTTP-based Yahoo Stock Quote API you could use to get prices, described here. It is very simple and returns a comma-separated list of attributes for one or more stock symbols, for example:
http://finance.yahoo.com/d/quotes.csv?s=MSFT&f=snd1l1yr
It might not be realtime enough, but it might be the best you can do for free.
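Once a response line has been fetched, it still has to be split into fields. A naive split on commas breaks as soon as a quoted company name contains a comma, so here is a small sketch of a quote-aware splitter. It assumes the fields come back in the order requested by the f= parameter and that text fields may be wrapped in double quotes; the class name is made up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits one line of a CSV response into its fields.
class QuoteCsvSketch {

    /** Splits a CSV line, honouring double-quoted fields that may contain commas. */
    static List<String> splitCsvLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // doubled quote inside a quoted field
                    i++;
                } else {
                    inQuotes = !inQuotes; // opening or closing quote, not part of the value
                }
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field has no trailing comma
        return fields;
    }
}
```

Numeric fields can then be converted with Double.parseDouble, with a guard for placeholder values such as N/A if the service uses them.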
You can use the glorious HTTP protocol to do that. Use any language you are comfortable with (Java, C#, VB.NET, Python, Ruby, PHP) and crawl the website you are trying to get information from.
I need to extract data from a Java web application
From your standpoint, whether it is a Java webapp, a PHP one, or a set of static HTML pages doesn't change anything. The fact that Java is backing the webapp doesn't suddenly give you a "Java way" to extract the info.
In some cases an API is provided that lets you interact with the data on the website, but once again, whether or not the webapp is a Java one is of no importance.
I would like to populate the data store. Yet all the examples and instructions for populating the data store are concerned with Python projects. Is there a way to upload bulk data using the AppEngine Java tools? (At the moment the data is in CSV format, but I can easily reformat the data as needed.)
It would be especially useful if it could be done within the Eclipse IDE.
Thanks.
I'm having the same problem as you with this one. According to the discussion at http://groups.google.com/group/google-appengine-java/browse_thread/thread/72f58c28433cac26
there's no equivalent tool available for Java yet. However it looks like there's nothing stopping you from using the Python tool to just populate the datastore and then accessing that data as normal through your Java code, although this assumes you're comfortable with Python, which could be the problem.