I am working on a project in which one part needs to search for some information and get the results of that search from the internet. How can the data be fetched from the web page so I can use it?
Use Selenium - Here's a nice tutorial https://www.guru99.com/selenium-tutorial.html
Selenium is a Java library that supports all major browsers. The download and documentation links can be found on its website - https://www.seleniumhq.org/docs/
You can do that with the Python library BeautifulSoup.
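BeautifulSoup (a third-party package, `pip install beautifulsoup4`) gives a convenient API for this. The underlying idea can be sketched with only the standard library's `html.parser`; the HTML string below stands in for a page you would actually fetch over the network:

```python
from html.parser import HTMLParser
from urllib.request import urlopen  # would fetch a live page


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# In a real program: html = urlopen("https://example.com").read().decode()
html = '<html><body><a href="/page1">One</a> <a href="/page2">Two</a></body></html>'

parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # -> ['/page1', '/page2']
```

BeautifulSoup replaces the hand-written parser subclass with calls like `soup.find_all("a")`, which is why it is usually the more comfortable choice.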
How can I scrape a website with dynamically loaded content, like a forbes.com article, using Apache HttpClient but without a web driver (it's slow)?
I've tried fetching sitemap.xml, but their sitemap includes only the latest articles, and I want information from very old articles.
Also, I want a more generic solution; the web-driver approach (I use Selenium with PhantomJS now) is site-specific and slow.
I'd suggest you try ui4j. It's a wrapper around the JavaFX WebKit engine with a headless mode. It can help you speed things up.
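Another route that avoids rendering entirely: dynamically loaded pages often pull their content from a JSON endpoint, which you can find in the browser's network tab and call directly, with no JavaScript execution needed. A minimal sketch, where the endpoint URL and the payload shape are hypothetical:

```python
import json
from urllib.request import urlopen  # would fetch the live endpoint


def extract_titles(payload: str) -> list:
    """Pull article titles out of a JSON response (field names are hypothetical)."""
    data = json.loads(payload)
    return [item["title"] for item in data["articles"]]


# In a real program:
# payload = urlopen("https://example.com/api/articles?page=1").read().decode()
payload = '{"articles": [{"title": "Old article"}, {"title": "Older article"}]}'

print(extract_titles(payload))  # -> ['Old article', 'Older article']
```

When such an endpoint exists it is both faster than a web driver and less site-specific than scraping the rendered markup, since you only depend on the response shape.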
I've got a problem: I want to parse a page (e.g. this one) to collect information about the offered apps and save this information in a database.
Moreover, I am using crawler4j to visit every (available) page. The problem, as far as I can see, is that crawler4j needs links to follow in the page source.
But in this case the hrefs are generated by JavaScript, so crawler4j never gets new links to visit / pages to crawl.
So my idea was to use Selenium so that I can inspect elements as in a real browser like Chrome or Firefox (I'm quite new to this).
But, to be honest, I don't know how to get the "generated" HTML instead of the raw page source.
Can anybody help me?
To inspect elements you do not need the Selenium IDE; just use Firefox with the Firebug extension. Also, with the developer tools add-on you can view both a page's source and the generated source (this is mainly for PHP).
Crawler4J cannot handle JavaScript like this. It is better left to another, more advanced crawling library. See this response here:
Web Crawling (Ajax/JavaScript enabled pages) using java
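As for getting the "generated" HTML: once Selenium has loaded the page and the JavaScript has run, `driver.getPageSource()` (Java) or `driver.page_source` (Python) returns the rendered DOM, including JavaScript-built links. You can then extract the hrefs yourself and drive the crawl loop by hand instead of relying on crawler4j. A sketch of that loop using only the standard library, where `get_links` stands in for "render the page with Selenium and harvest its hrefs":

```python
from collections import deque
from urllib.parse import urljoin


def crawl_frontier(start_url, get_links):
    """Breadth-first visit: resolve relative hrefs, skip already-seen URLs."""
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)  # here you would parse the page and store app info
        for href in get_links(url):
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order


# get_links would call driver.get(url) and pull hrefs out of driver.page_source;
# a canned link graph stands in for the rendered pages here.
graph = {
    "http://ex.com/": ["/a", "/b"],
    "http://ex.com/a": ["/b"],
    "http://ex.com/b": [],
}
print(crawl_frontier("http://ex.com/", lambda u: graph.get(u, [])))
# -> ['http://ex.com/', 'http://ex.com/a', 'http://ex.com/b']
```

The `seen` set is what crawler4j normally maintains for you; everything else is just the rendered-HTML link extraction it cannot do on its own.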
Google Drive search is really amazing.
It builds a full-text search index instantly, right after I upload a document (PDF/MS Office).
Since I want to use this technology in my own GAE project, I was wondering:
1. Is there any existing API (from Google or others) that provides this function?
2. How could I implement it myself?
On App Engine you can use the Search API to do full-text indexing & querying.
You can use the following options:
Google Desktop, though I'm not sure about an API.
The IIS index server; you can query its index files too.
dtSearch is a tool with an API, but it is paid.
There is also an Apache API whose name I forgot; I will post it shortly.
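On point 2 (implementing it yourself): the heart of full-text search is an inverted index, a map from each token to the set of documents containing it. A toy sketch of the idea, nowhere near production quality (no stemming, ranking, or persistence):

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase token to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every token in the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


docs = {1: "full text search", 2: "text indexing on app engine"}
index = build_index(docs)
print(search(index, "text"))       # -> {1, 2}
print(search(index, "full text"))  # -> {1}
```

A real engine adds tokenization rules, relevance scoring, and an on-disk index format, which is exactly what the hosted Search API handles for you.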
I am trying to use the Google App Engine Java Search API, but it doesn't work as intended. In fact, the Python and Java search behavior differs.
Python:
Website
Source Code
Java:
Website
Source code
When I search for "tes", Python returns all documents containing "test", but Java does not.
Is this a bug in the Java SDK? I am using 1.7.4.
Unfortunately this is still not possible given the API.
See these posts for more information:
Partial matching GAE search API
GAE Full Text Search API phrase matching
This is also logged as an issue/defect here:
http://code.google.com/p/googleappengine/issues/detail?id=7689
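A common workaround until the API supports partial matching is to index prefix tokens yourself: when saving a document, store every prefix of each word ("t", "te", "tes", "test") in an extra field, so a query for "tes" matches that field exactly. A sketch of the tokenizer (the minimum prefix length is a design choice, not an API requirement):

```python
def prefix_tokens(text, min_len=1):
    """All prefixes of every whitespace-separated word, for an extra index field."""
    tokens = set()
    for word in text.lower().split():
        for end in range(min_len, len(word) + 1):
            tokens.add(word[:end])
    return tokens


tokens = prefix_tokens("test")
print(sorted(tokens))  # -> ['t', 'te', 'tes', 'test']
# A search for "tes" now hits this document, because "tes" itself is indexed.
```

The trade-off is index size: every word of length n contributes up to n tokens, so raising `min_len` (say, to 3) keeps the field from exploding on large documents.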
I need to build a little Java tool that gets the keyword suggestions and traffic estimates from the Google Keyword Tool at https://adwords.google.com/select/KeywordToolExternal .
The page is rendered with JavaScript, so simple scraping isn't possible. I have tried HtmlUnit, but it doesn't work (tried different browser versions, still no luck).
One way could be to embed a web browser in Java, but I haven't had any success with that either.
Any suggestions or alternatives?
Have a look at the AdWords API and its client libraries (for Java, Python, etc.).