Good afternoon dear community,
I have finally compiled a list of working XPaths required to scrape all of the information I need from a set of URLs.
I would like to ask for your suggestion: for a newbie in coding, what is the best way to scrape around 50k links using only XPaths (around 100 XPaths per link)?
Import.io is my best tool at the moment, along with SEO Tools for Excel, but they both have their limitations: Import.io is expensive, and SEO Tools for Excel isn't suited to extracting more than 1000 links.
I am willing to learn whatever system you suggest, so please recommend a good way of scraping for my project!
SOLVED! The SEO Tools crawler is actually super useful, and I believe I've found what I need. I guess I'll hold off on Python or Java until I encounter another tough obstacle.
Thank you all!
That strongly depends on what you mean by "scraping information". What exactly do you want to mine from the websites? All major languages (certainly Java and Python that you mentioned) have good solutions for connecting to websites, reading content, parsing HTML using a DOM and using XPath to extract certain fragments. For example, Java has JTidy, which allows you to parse even "dirty" HTML from websites into a DOM and manipulate it somewhat. However, the tools needed will depend on the exact data processing needs of your project.
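To make that concrete, here is a minimal Java sketch using JTidy plus the standard javax.xml.xpath API; the URL and the //h1 expression are placeholders for your own links and XPaths:

```java
import java.io.InputStream;
import java.net.URL;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class XPathScraper {
    public static void main(String[] args) throws Exception {
        // Fetch the page (placeholder URL).
        try (InputStream in = new URL("http://example.com/page.html").openStream()) {
            // JTidy cleans up "dirty" real-world HTML into a W3C DOM.
            Tidy tidy = new Tidy();
            tidy.setQuiet(true);
            tidy.setShowWarnings(false);
            Document doc = tidy.parseDOM(in, null);

            // Evaluate one of your XPaths against the cleaned document.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate("//h1", doc, XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                System.out.println(nodes.item(i).getTextContent());
            }
        }
    }
}
```

For 50k links with ~100 XPaths each, the same core just runs inside two loops (over URLs and over expressions), ideally with a politeness delay and error handling around each fetch.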
I would encourage you to use Python (I use 2.7.x) w/ Selenium. I routinely automate scraping and testing of websites with this combo (in both a headed and headless manner), and Selenium unlocks the opportunity to interact with scripted sites that do not have explicit webcalls for each and every page.
Here is a good, quick tutorial from the Selenium docs: 2. Getting Started
There are a lot of great sources out there, and it would take forever to post them all; but, you will find the Python community very helpful and you'll likely see that Python is a great language for this type of web interaction.
Good luck!
Sorry for this weird question.
We actually went with selenium-webdriver to make manual testing simpler, but what I felt is that finding each and every web element is itself a hectic job. I end up running any number of trial runs just to test my Selenium code.
So, how can I make this simpler?
Thanks in advance!!
With my current experience with Selenium testing:
write your own methods
If you have repetitive sequences of actions to check something, pack them into your own methods: find and click; find, get the text attribute, and assert on it; etc.
make use of loops
Need to assert that the text attribute is correct for N elements? Count the number of elements, put your "testing" method inside a loop for N repeats, and compare against control data stored in a list/array, etc.
use all what testing-framework can provide
In my case I test with the help of NUnit. If I have a set of similar or even identical tests, why not use [TestCase] instead of [Test]?
refactor / simplify
If you realize during or after test development that some parts of the code are redundant, just replace them with the corresponding methods you created. The code will quickly get much shorter and will be much easier to update if needed.
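The answer above uses NUnit terminology (C#), but the same ideas translate to any binding. Here is a rough Java sketch of the "write your own methods" and "make use of loops" points; the locators and expected values would be your own:

```java
import static org.junit.Assert.assertEquals;

import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

public class TestHelpers {
    private final WebDriver driver;

    public TestHelpers(WebDriver driver) {
        this.driver = driver;
    }

    // "find, click" packed into one reusable method.
    public void click(By locator) {
        driver.findElement(locator).click();
    }

    // "find, get text, assert" packed into one reusable method.
    public void assertText(By locator, String expected) {
        assertEquals(expected, driver.findElement(locator).getText());
    }

    // Loop variant: assert the text of N matching elements
    // against control data stored in a list.
    public void assertTexts(By locator, List<String> expected) {
        List<WebElement> elements = driver.findElements(locator);
        assertEquals(expected.size(), elements.size());
        for (int i = 0; i < elements.size(); i++) {
            assertEquals(expected.get(i), elements.get(i).getText());
        }
    }
}
```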
For me, the easiest way to find an element is by using CSS selectors. It is more natural, and you can easily test your CSS selectors in Chrome DevTools using the jQuery-like construct $('.classname').
BTW, I just created a small project to bootstrap a Selenium project. You might want to check it out, as it uses shortcuts for selectors, which I feel is more natural: e.g. if the selector starts with '#', it uses By.id; if it starts with '=', it uses By.name.
Here's the project url: https://github.com/codezombies/easytest
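I haven't dug into the project's internals, but the shortcut convention described above could look roughly like this with Selenium's Java By API (a sketch, not the project's actual code):

```java
import org.openqa.selenium.By;

public class Selectors {
    // Dispatch on a prefix: '#' means By.id, '=' means By.name,
    // anything else falls through to a plain CSS selector.
    public static By by(String selector) {
        if (selector.startsWith("#")) {
            return By.id(selector.substring(1));
        }
        if (selector.startsWith("=")) {
            return By.name(selector.substring(1));
        }
        return By.cssSelector(selector);
    }
}
```

Usage would then be something like driver.findElement(Selectors.by("#username")).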
You can use the Fire-IE browser tool; it will help with recognizing elements without navigating to the inspection page.
The tool can be downloaded from the link below:
http://toolsqa.com/selenium-webdriver/fire-ie-selenium-tool-ie-browser/
Another option is to use Selenium IDE to identify the element. This option is suggested only for non-technical users. Selenium IDE is now available even on Chrome.
Selenium WebDriver is the most mainstream automated test framework used in software development these days. Since it supports all the main programming languages, such as C#, Perl, Ruby, PHP, and Java, you are free to learn and write test code in any of them. Selenium WebDriver + Java is the combination used the most. Of course, knowledge of HTML, JavaScript, and CSS is important. Remember that a tester's job is hands-on: there are many good books on test automation, but the best training and skill set come from working on real projects.
Web development has grown threefold compared with what it was years back. Many organizations are choosing to build websites to establish their presence on the web, and testing is a significant part of the web development process. Agree? Of course it is; no doubt about that. More than a billion websites are live right now.
Many tools are being released to assist with the development and testing process, thanks to the web development specialists who keep researching and coming up with good ideas. You can find numerous tooling options on the market, with plenty of tools available to make your testing cycle simple and straightforward.
Of the many testing tools available out there, Selenium is perhaps the best to date. Its growing popularity is a result of the wide range of features it offers: Selenium is open source, free to download, and simple to understand and use. With Selenium you can write your tests in a variety of programming languages, such as C#, Java, and Python, using its WebDriver API, and apart from web browsers you can use it to automate mobile devices, including Android and iOS, through Appium.
Steps are as follows:
Choosing a Framework for Testing
Choosing a Programming Language
Choosing a Unit Test Framework
Designing the Architecture of your Framework
Choosing a Mechanism for Reporting
Building the “Selenium Test” Component
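To make the last step concrete, here is a minimal sketch of a "Selenium Test" component in Java with JUnit; the URL, expected title, and browser choice are placeholders, and it assumes chromedriver is on your PATH:

```java
import static org.junit.Assert.assertEquals;

import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class LoginPageTest {
    private WebDriver driver;

    @Before
    public void setUp() {
        // Assumes chromedriver is installed and on the PATH.
        driver = new ChromeDriver();
    }

    @Test
    public void titleIsCorrect() {
        driver.get("http://example.com/login");  // placeholder URL
        assertEquals("Login", driver.getTitle()); // placeholder expectation
    }

    @After
    public void tearDown() {
        driver.quit();
    }
}
```

In a real framework this class would sit on top of the reporting and page-object layers chosen in the earlier steps.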
I have a set of categorized text files. I want to categorize another large set of text files to use in my research. Is there a good way to compare them?
I think SVM-based methods are useful, but is there a simple and documented library for using such algorithms?
I don't know much about SVM, but LingPipe might be really helpful for you. The link is a tutorial specifically about categorization of documents (automatic or guided).
Also, look into the inter-related search products Lucene (a search library), Solr (search server app), and Carrot2 (for 'clustering' search results). There should be some interesting work in that space for you.
Mallet is another awesome library to look into. It has good commandline tools to help you get started and a Java API once you start getting into integrating it with the rest of your system.
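For a flavor of those command-line tools, training a classifier on a directory-per-category corpus looks roughly like this (flag names from memory, so double-check against Mallet's docs):

```
# Import a directory-per-class corpus into Mallet's binary format.
bin/mallet import-dir --input corpus/ --output corpus.mallet

# Train a classifier (Naive Bayes here) on 90% and evaluate on the rest.
bin/mallet train-classifier --input corpus.mallet \
    --trainer NaiveBayes --training-portion 0.9 \
    --output-classifier docs.classifier
```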
I have a natural language processing project, but for that I need to crawl many web articles from sources like Yahoo News, Google News, or blogs...
I'm a Java developer (so I'd rather use Java tools). I guess I could parse each source website on my own and extract the articles with HttpClient / XPath, but I'm a bit lazy :) Is there a way to avoid writing a parser per source?
(I'm not only interested in new articles, but in articles from 2000 to now too.)
The hardest part of NLP is getting data you can use. Everything else is just math.
It may be hard to find a large collection of news articles other than on each news source's website because of all the copyright issues involved. If you don't need recent news, your best bet is probably to look at the Linguistic Data Consortium's English Gigaword corpus; if you are at a university, there may already be an existing relationship for you to use the data for free.
If you need to actually crawl and parse websites, for now you'll probably find you have to write specific parsers for the various news websites to make sure you get the right text. However, once more websites start using HTML5, it will be easier to pull out the relevant text through the use of the article tag.
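As a small illustration of the article-tag point, here is a sketch with JSoup (a Java HTML parser); the URL is a placeholder, and real pages often nest navigation or ads that still need filtering:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ArticleExtractor {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; swap in a real news page.
        Document doc = Jsoup.connect("http://example.com/news/story.html").get();

        // On HTML5 pages the main text is often wrapped in <article>.
        for (Element article : doc.select("article")) {
            System.out.println(article.text());
        }
    }
}
```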
To do the actual crawling, this previous question can point you in some useful directions.
I am going to begin my work on article extraction.
The task I will be doing is to extract the hotel reviews that are posted on different web pages (e.g. 1. http://www.tripadvisor.ca/Hotel_Review-g32643-d1097955-Reviews-San_Mateo_County_Memorial_Park_Campground-Loma_Mar_California.html, 2. http://www.travelpod.com/hotel/Comfort_Suites_Sfo_Airport-San_Mateo.html).
I need to do the task in Java, and I have only been working with Java for the past couple of months.
Here come my questions regarding this:
Is there any possibility of extracting just the reviews from different web pages in a generic way?
Kindly let me know if there are any APIs that support this task in Java.
Also, let me know your thoughts or sources that would help me accomplish the task mentioned above.
UPDATE
If any related examples are available on the net, please post them, since they could be of great use.
You probably need a screen scraping utility for Java like TagSoup or NekoHTML. JSoup is also popular.
However, you also have a bigger legal consideration here when extracting data from a third-party website like TripAdvisor. Does their policy allow it?
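Assuming the policy does allow it, the JSoup route ends up looking something like the sketch below; note that the .review class here is purely hypothetical, and each target site needs its own selectors, discovered through the browser's developer tools:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ReviewScraper {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and a hypothetical ".review" class;
        // TripAdvisor, TravelPod, etc. each use different markup.
        Document doc = Jsoup.connect("http://example.com/hotel-page.html").get();
        for (Element review : doc.select(".review")) {
            System.out.println(review.text());
            System.out.println("----");
        }
    }
}
```

That per-site selector work is exactly why a fully generic extractor is hard.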
I am planning to build a simple document management system, preferably built around the Java platform. Are there any best practices around this? The requirements are:
Ability to upload documents
Ability to tag documents
Ability to version documents
Ability to comment on documents
There are a couple of options that I am currently considering. The first option would be a simple API on top of SVN or CVS, with a DB backend to track tags, uploader, comments, etc.
Another option is to use the filesystem. Version the documents as copies in a versions folder and work with filenames.
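The core of that filesystem option is small. Here is one possible sketch using java.nio.file, where the versions subfolder and numbered-copy naming are just one arbitrary layout choice:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class FileVersions {
    // Copy the current document to versions/<name>.<n> before overwriting it.
    public static Path snapshot(Path document) throws IOException {
        Path versionsDir = document.toAbsolutePath().getParent().resolve("versions");
        Files.createDirectories(versionsDir);
        int n = 1;
        Path target;
        do {
            target = versionsDir.resolve(document.getFileName() + "." + n++);
        } while (Files.exists(target));
        return Files.copy(document, target, StandardCopyOption.COPY_ATTRIBUTES);
    }
}
```

Tags, uploader, and comments would still live in the DB, keyed by path and version number.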
Or, if there is an open, non-GPL'ed document management system, we could customize it to our needs and package it in our application. Does anybody have any experience building something like this?
You may want to take a look at Content repository API for Java and the several implementations (some of them free).
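For a feel of what JCR code looks like, here is a rough sketch against Apache Jackrabbit's TransientRepository (one of the free implementations); the node and property names are arbitrary:

```java
import javax.jcr.Node;
import javax.jcr.Repository;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;
import org.apache.jackrabbit.core.TransientRepository;

public class JcrDemo {
    public static void main(String[] args) throws Exception {
        Repository repository = new TransientRepository();
        Session session = repository.login(
                new SimpleCredentials("admin", "admin".toCharArray()));
        try {
            // Store a "document" node with a tag property (names are arbitrary).
            Node root = session.getRootNode();
            Node doc = root.addNode("spec");
            doc.setProperty("tag", "draft");
            session.save();

            System.out.println(root.getNode("spec").getProperty("tag").getString());
        } finally {
            session.logout();
        }
    }
}
```

JCR also has versioning built into the spec, which maps nicely onto your requirements.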
Take a look at the many document-oriented database systems out there. I can't speak about MongoDB or any of the others, but my experience with CouchDB has been fantastic.
http://couchdb.apache.org/
The best part is that you communicate with it via a REST protocol.
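To illustrate: creating a database and storing a document are plain HTTP PUTs. A sketch with Java 11's java.net.http client, assuming a local CouchDB with no authentication configured (the database and document names are placeholders):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CouchDbDemo {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // PUT /docs creates the database (placeholder name).
        HttpRequest createDb = HttpRequest
                .newBuilder(URI.create("http://localhost:5984/docs"))
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
        System.out.println(client.send(createDb, HttpResponse.BodyHandlers.ofString()).body());

        // PUT /docs/spec-1 stores a JSON document under that id.
        String json = "{\"title\":\"Spec\",\"tags\":[\"draft\"]}";
        HttpRequest putDoc = HttpRequest
                .newBuilder(URI.create("http://localhost:5984/docs/spec-1"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json))
                .build();
        System.out.println(client.send(putDoc, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```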
The best way is to reuse the efforts of others; this particular wheel has been invented quite a few times.
Who will use this and for what purpose?