I have a site with the following source code:
<article id="post-438" class="post-438 post type-post status-publish format-standard has-post-thumbnail hentry category-history tag-africa tag-asia tag-europe tag-maps tag-middle-east tag-mongol-empire tag-ottoman-empire tag-rise-of-islam">
<header class="entry-header">
<div class="entry-meta smallPart">
<span class="posted-on"><i class="fa fa-clock-o spaceRight"></i><time class="entry-date published updated" datetime="2015-07-23T00:26:22+00:00">July 23, 2015</time></span><span class="byline"> <i class="fa fa-user spaceLeftRight"></i><span class="author vcard"><a class="url fn n" href="https://muslimmemo.com/author/sufyan/">Sufyan bin Uzayr</a></span></span><span class="comments-link"><i class="fa fa-comments-o spaceLeftRight"></i><span class="dsq-postid" data-dsqidentifier="438 https://muslimmemo.com/?p=438">Leave a comment</span></span> </div><!-- .entry-meta -->
<h1 class="entry-title">Map Showing The Rise of Islam Down The Ages</h1> </header><!-- .entry-header -->
<div class="entry-summary">
<p>This is a rather interesting map that shows the spread of Islam across Asia, Europe and Africa, down the ages. The earliest period is marked in shades of brown and red, followed by shades of yellow. South-east Asia is shown separately as an inset using shades of blue. While this map is far from perfect…</p>
</div><!-- .entry-summary -->
<footer class="entry-footer smallPart">
<div class="cruzy-bottom-content">
<span class="cat-links"><i class="fa fa-folder-open spaceRight"></i>History</span> <span class="read-link">
<a class="readMoreLink invertPart" href="https://muslimmemo.com/map-rise-islam/">Read More<i class="fa fa-angle-double-right spaceLeft"></i></a>
</span>
</div>
</footer><!-- .entry-footer -->
</article>
What I need to fetch from the site:
entry-title -> document.getElementsByClass("entry-title");
entry-link -> document.select("h1.entry-title > a[href]");
summary of the entry -> document.getElementsByClass("entry-summary");
author's link -> document.select("span.author > a[href]");
author's name -> document.getElementsByClass("author");
category -> document.getElementsByClass("cat-links");
category's link -> document.select("span.cat-links > a[href]");
posting date -> document.getElementsByClass("published");
I am doing it this way:
Document document = Jsoup.connect(url).get();
heading = document.getElementsByClass("entry-title");
headingLink = document.select("h1.entry-title > a[href]");
headingSummary = document.getElementsByClass("entry-summary");
author = document.getElementsByClass("author");
authorLinks = document.select("span.author > a[href]");
category = document.getElementsByClass("cat-links");
categoryLinks = document.select("span.cat-links > a[href]");
published = document.getElementsByClass("published");
It works correctly, but it is very slow. How should I change my code to speed it up? Please help me.
Some hints from luksch:
In my experience Jsoup does a pretty good job speed-wise, although a SAX-based approach should be a bit faster. Anyway, I use Jsoup a lot and have never found it slow. Network access, however, can be very slow, depending on a lot of parameters, some of which you don't have much control over. I advise you to check the connection over which you retrieve the data. Maybe this is the culprit and not Jsoup parsing.
Your Jsoup use seems okay to me. At least I don't see a way to speed it up by much. One thing you could try is to restrict the search for Elements by starting not at the document level but at a suitable inner node. element.select(".whatever") starts at element, not at the document. If your document is very big, this might help.
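To check whether the network or the parsing is the slow part, the two steps can be timed separately. The following is only a minimal sketch: StepTimer and timed are hypothetical names, not part of Jsoup, and the Jsoup calls appear only in comments as an assumed usage.

```java
import java.util.concurrent.Callable;

class StepTimer {
    /** Runs one step, prints how long it took, and returns the step's result. */
    static <T> T timed(String label, Callable<T> step) {
        long start = System.nanoTime();
        try {
            return step.call();
        } catch (Exception e) {
            // Wrap checked exceptions (e.g. IOException from a download) so callers stay simple.
            throw new RuntimeException(label + " failed", e);
        } finally {
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println(label + ": " + millis + " ms");
        }
    }

    // Assumed usage with Jsoup (not executed here):
    //   Connection.Response res = timed("download", () -> Jsoup.connect(url).execute());
    //   Document doc = timed("parse", res::parse);
    //   Elements heading = timed("select", () -> doc.select("article h1.entry-title"));
}
```

If "download" dominates the output, changing the selectors will not help much; if "select" dominates, starting the search from an inner element such as the article node is worth trying.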
Related
I'm trying to get contact information off a webpage. Each contact is listed within an info class. The information I want is found in an n, adr, and primary phone class. What I want to do is iterate through each info element, check if it has those 3 child elements, and if all 3 exist add it to an ArrayList.
Here's an example of the basic parent-child relationship in the HTML:
<div class = "info">
<h2 class = "n">Header</h2>
<div class = "info-section info-primary">
<p class = "adr"> address here </p>
<ul class = "phones"> phone# </ul>
</div>
</div>
Thanks to those who helped me, I was able to get only the child elements that I want. However, I need to check that each parent element contains all of those child elements before adding it to my list.
For Example: one contact could be
<div class = "info">
<h2 class = "n">Company Name</h2>
</div>
Since there is no phone or address listed, I don't want to take this contact from the webpage; I want to skip it and move on to the next one.
Since all the information you need is identified in the HTML by tags and classes, you just need to write a selector that filters it for you, rather than doing all the filtering in Java code.
Without the full HTML, it is hard to work out the best selector for you (a CSS selector or XPath still look like the better options here). So you can try the following CSS selector, or adapt it to what you need:
driver.findElements(By.cssSelector(".info .adr, .info .phones"));
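The requirement that a contact must have all three children can be put into a single XPath expression. The sketch below uses only the JDK's XML parser so it runs on its own, which assumes the markup is well-formed; CompleteContacts and completeContactNames are hypothetical names. With Selenium, the same expression could be passed to By.xpath instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CompleteContacts {
    // Selects only the info blocks that contain a name, an address and a phone list.
    static final String COMPLETE_INFO =
        "//div[@class='info'][.//*[@class='n'] and .//*[@class='adr'] and .//*[@class='phones']]";

    /** Returns the names of all contacts that have all three child elements. */
    static List<String> completeContactNames(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList infos = (NodeList) xpath.evaluate(COMPLETE_INFO, doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < infos.getLength(); i++) {
                // Evaluate relative to the matched info block to read its name.
                names.add(xpath.evaluate(".//*[@class='n']", infos.item(i)).trim());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```

Contacts that lack an address or a phone list never match the expression, so no extra check is needed in the loop.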
-
Hey folks,
I am building a Java tool that tries to automatically fill out some form input elements in an HTML page using Java and the Jaunt API.
the HTML-Code is like:
<fieldset class = "fieldsetlong">
<legend>searchprofile</legend>
<label for="reference">reference:</label>
<input maxlength="50" name="reference" id="reference" type="text" />
</fieldset>
<fieldset class = "fieldsetlong">
<legend>searchcriteria</legend>
<label for="surname">surname:</label>
<input name="searchprofile.surname" id="surname" type="text" />
</fieldset>
The Java-Code for filling in the "normal" Input-field reference (it works) looks like:
form.set("reference", "123Test");
Unfortunately, I am not able to fill out the fields that use the dot notation searchprofile.surname in the name attribute.
Here's a sample of what I've tried (without success):
form.set("surname", "TestPerson");
form.set("searchprofile.surname", "TestPerson");
form.set("name=\"searchprofile.surname\"", pers.getSurname());
form.set("id=\"surname\"", pers.getSurname());
For each of these commands I get a NotFoundException, and I don't know whether I can do this with Jaunt.
I would appreciate any kind of help in this regard.
Thanks in advance
Edit - is there a way to reach the dot-notated input-field searchprofile.surname with JSoup?
HTML allows dots in the name-Attribute, but does Jaunt accept this abc.name?
Not sure about Jaunt, I have never used it before. However, Jsoup seems to be a pretty decent library to use here. I have been using Jsoup for a fairly long time, and it has worked very well for scraping web pages, filling in and submitting input forms, and of course HTML parsing!
I've posted a step-by-step guide to filling in form input fields and submitting them to the server in the following answer: How to login with Jsoup
Basically it works very similarly to your code. A very brief example would be:
Connection.Response response = Jsoup.connect(url)
.data("Name", "Value")
.method(Method.POST).execute();
Today at work, the Jaunt solution with
form.set("searchprofile.surname", "TestPerson");
worked like a charm.
I don't know what the problem was earlier, but I am glad that it worked.
HTML allows dots, hyphens, etc. in names. I misinterpreted the dot as some kind of nested form or hierarchy, but searchprofile.surname is just a valid value for the name attribute in HTML.
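To illustrate that the dot has no special meaning, here is a minimal sketch of how such fields end up in a form-encoded request body. FormBody is a hypothetical helper that uses only the JDK; it is not part of Jaunt or Jsoup.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class FormBody {
    /** Builds an application/x-www-form-urlencoded body from name/value pairs. */
    static String encode(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8) + "="
                        + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("reference", "123Test");
        fields.put("searchprofile.surname", "TestPerson");
        // The dot is sent literally as part of the field name.
        System.out.println(encode(fields));
        // prints: reference=123Test&searchprofile.surname=TestPerson
    }
}
```

The server simply receives a parameter whose name happens to contain a dot, so any library that sets fields by their name attribute should be able to address it by the full string.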
I'm new to Vaadin 7 and have a small formatting issue.
I have spent a few hours on it, but no luck.
I have:
2 form layouts on my vertical layout.
2 labels on each form layout.
See the screenshot.
I want to format the label text as shown in the right part of the screenshot.
Can you please advise or share any thoughts or ideas?
I'm not 100% certain if I get what you're trying to do, but you might be able to achieve this through custom CSS.
It is hard to write out the exact CSS since it would require seeing the HTML generated by Vaadin and testing it with that, but it would be something like this for the labels:
.padded-form-layout .v-caption:first-child {
float: left;
padding-left: 30px; /* set desired padding used for each label */
}
Of course, you'll need something similar for the values as well.
Above, padded-form-layout is the class name you define for layouts that need this look. In Java:
formLayout.setStyleName("padded-form-layout");
To figure out which CSS changes are needed, I recommend opening the page in a browser (Chrome or Firefox will do) and using the dev tools to modify the CSS directly. I usually do this by typing a style attribute onto the element, something like this (in this example, style="XXXXX" would be added manually; this is possible at least with Chrome's developer tools):
<div class="v-formlayout v-layout v-widget v-has-width" style="width: 100%;">
<!-- ... -->
<td class="v-formlayout-captioncell">
<div class="v-caption v-caption-hasdescription">
<span id="gwt-uid-21" for="gwt-uid-22" style="XXXXX">First name:</span>
<span class="v-required-field-indicator" aria-hidden="true">*</span>
</div>
</td>
<!-- ... -->
</div>
To be able to use the CSS, you'll need to either add it to your theme and compile it (see the Vaadin documentation about themes), or load it with the @StyleSheet annotation.
I'm trying to parse XML on Android. My problem is that the XML is in a strange format. All of the data I'm trying to parse is located inside one element.
Here is an example:
<a name="3"></a>
<div class="series_alpha">
<h2 class="series_alpha">3</h2>
<ul class="series_alpha"><li>3 Banme no Kareshi<span class="mangacompleted">[Completed]</span></li>
<li>3 Gatsu no Lion</li>
<li>337 Byooshi</li>
<li>360 Degrees Material</li>
<li>37 Degrees Kiss<span class="mangacompleted">[Completed]</span></li>
<li>3x3 Eyes</li>
</ul>
<div class="clear"></div>
</div>
The XML is a piece of the source from this webpage. The data I'm trying to retrieve is found in the <li> tag, specifically the link reference and the manga name. But I don't know how I would separate the link from the title.
After looking it up, I found that the information inside the tags is known as an "attribute" (I'm a noob, I know), and with some Google searches I found that the attributes.getValue("name of attribute here") method is what I was looking for.
source: XML Parsing to get Attribute Value
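A minimal SAX sketch of that idea follows. It assumes each <li> wraps the title in an <a href="..."> element (the links are not visible in the excerpt above), and MangaLinkHandler is a hypothetical class name. It uses only the org.xml.sax classes, which are also available on Android.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class MangaLinkHandler extends DefaultHandler {
    final Map<String, String> titleToLink = new LinkedHashMap<>();
    private String currentHref;                       // href of the <a> being read, or null
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // The link is an attribute of the tag; anchors without href (e.g. <a name="3">) are skipped.
        if ("a".equals(qName) && attributes.getValue("href") != null) {
            currentHref = attributes.getValue("href");
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The title is the text between <a> and </a>.
        if (currentHref != null) text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("a".equals(qName) && currentHref != null) {
            titleToLink.put(text.toString().trim(), currentHref);
            currentHref = null;
        }
    }

    /** Parses well-formed XML and returns a title-to-link map in document order. */
    static Map<String, String> parse(String xml) {
        try {
            MangaLinkHandler handler = new MangaLinkHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.titleToLink;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```

The link comes from the attribute in startElement and the title from the character data, which is how the two are kept apart.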
I'm trying to adapt the code used in PSI Probe (or more generally, the idea of PSI Probe) for use inside my company's web application. I have most of what I need working, but I have become stuck on one bit of code: the 'Status' tab. One column of data is the processing time for the thread, data I would really like to have, but I can't figure out where it is coming from. Here's the relevant snippet:
<c:forEach items="${pools}" var="pool" varStatus="poolStatus">
<div class="poolInfo">
<h3>${pool.name}</h3>
<div class="processorInfo">
<span class="name">
<spring:message code="probe.jsp.status.processor.maxTime"/>
</span>
${pool.maxTime}
I can't figure out where the pools object is coming from! Does anyone have experience with this sort of thing? Thanks!
Looking at the source code (this being Google Code, a Google search finds it really quickly),
pools is populated in ListThreadPoolsController:
List pools = containerListenerBean.getThreadPools();
return new ModelAndView(getViewName())
.addObject("pools", pools);
A closer look at ContainerListenerBean
shows that the properties listed in status.jsp
<span class="name"><spring:message code="probe.jsp.status.currentThreadCount"/></span> ${pool.currentThreadCount}
<span class="name"><spring:message code="probe.jsp.status.currentThreadsBusy"/></span> ${pool.currentThreadsBusy}
<span class="name"><spring:message code="probe.jsp.status.maxThreads"/></span> ${pool.maxThreads}
<span class="name"><spring:message code="probe.jsp.status.maxSpareThreads"/></span> ${pool.maxSpareThreads}
<span class="name"><spring:message code="probe.jsp.status.minSpareThreads"/></span> ${pool.minSpareThreads}
are populated in the getThreadPools() method:
ThreadPool threadPool = new ThreadPool();
threadPool.setName(executorName.getKeyProperty("name"));
threadPool.setMaxThreads(JmxTools.getIntAttr(server, executorName, "maxThreads"));
threadPool.setMaxSpareThreads(JmxTools.getIntAttr(server, executorName, "largestPoolSize"));
threadPool.setMinSpareThreads(JmxTools.getIntAttr(server, executorName, "minSpareThreads"));
threadPool.setCurrentThreadsBusy(JmxTools.getIntAttr(server, executorName, "activeCount"));
threadPool.setCurrentThreadCount(JmxTools.getIntAttr(server, executorName, "poolSize"));
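These values are read from JMX MBeans. As a rough illustration of that kind of lookup, here is a sketch using only javax.management. It is an assumption about what a helper like JmxTools.getIntAttr does, not PSI Probe's actual code, and JmxIntReader is a hypothetical name. It reads the JVM's own Threading MBean rather than a Tomcat executor.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

class JmxIntReader {
    /** Reads a numeric MBean attribute as an int; returns fallback if it cannot be read. */
    static int getIntAttr(MBeanServer server, ObjectName name, String attribute, int fallback) {
        try {
            Object value = server.getAttribute(name, attribute);
            return value instanceof Number ? ((Number) value).intValue() : fallback;
        } catch (Exception e) {
            return fallback;
        }
    }

    /** Number of live threads in this JVM, or -1 if the MBean cannot be queried. */
    static int liveThreadCount() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName threading = new ObjectName("java.lang:type=Threading");
            return getIntAttr(server, threading, "ThreadCount", -1);
        } catch (Exception e) {
            return -1;
        }
    }
}
```

The same pattern applies to any attribute an MBean exposes, so a value shown on the status page can be traced back by finding which MBean name and attribute string are passed in.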
Generally, it can come from two places: a Servlet that is invoked before the JSP, or a Filter. Check all filters and the servlet mapped to the URL you are opening.