How to set the depth of a simple Java web crawler - java

I wrote a simple recursive web crawler that fetches URL links from a web page.
Now I am trying to figure out how to limit the crawler by depth. I can limit it to the top N links, but I am not sure how to stop it at a specific depth.
For example, depth 2 should fetch the parent links -> the children's links -> the children's links.
Any input is appreciated.
public class SimpleCrawler {

    static Map<String, String> retMap = new ConcurrentHashMap<String, String>();

    public static void main(String args[]) throws IOException {
        StringBuffer sb = new StringBuffer();
        Map<String, String> map = (returnURL("http://www.google.com"));
        recursiveCrawl(map);
        for (Map.Entry<String, String> entry : retMap.entrySet()) {
            sb.append(entry.getKey());
        }
    }

    public static void recursiveCrawl(Map<String, String> map)
            throws IOException {
        for (Map.Entry<String, String> entry : map.entrySet()) {
            String key = entry.getKey();
            Map<String, String> recurSive = returnURL(key);
            recursiveCrawl(recurSive);
        }
    }

    public synchronized static Map<String, String> returnURL(String URL)
            throws IOException {
        Map<String, String> tempMap = new HashMap<String, String>();
        Document doc = null;
        if (URL != null && !URL.equals("") && !retMap.containsKey(URL)) {
            System.out.println("Processing==>" + URL);
            try {
                URL url = new URL(URL);
                System.setProperty("http.proxyHost", "proxy");
                System.setProperty("http.proxyPort", "port");
                doc = Jsoup.connect(URL).get();
                if (doc != null) {
                    Elements links = doc.select("a");
                    String FinalString = "";
                    for (Element e : links) {
                        FinalString = "http:" + e.attr("href");
                        if (!retMap.containsKey(FinalString)) {
                            tempMap.put(FinalString, FinalString);
                        }
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            retMap.put(URL, URL);
        } else {
            System.out.println("****Skipping URL****" + URL);
        }
        return tempMap;
    }
}
EDIT 1:
I thought of using a worklist, so I modified the code. I am not exactly sure how to set the depth here either (I can set the number of web pages to crawl, but not the depth). Any suggestions would be appreciated.
public void startCrawl(String url) {
    while (this.pagesVisited.size() < 2) {
        String currentUrl;
        SpiderLeg leg = new SpiderLeg();
        if (this.pagesToVisit.isEmpty()) {
            currentUrl = url;
            this.pagesVisited.add(url);
        } else {
            currentUrl = this.nextUrl();
        }
        leg.crawl(currentUrl);
        System.out.println("pagesToVisit Size" + pagesToVisit.size());
        // SpiderLeg
        this.pagesToVisit.addAll(leg.getLinks());
    }
    System.out.println("\n**Done** Visited " + this.pagesVisited.size()
            + " web page(s)");
}

Based on the non-recursive approach:
Keep a worklist of URLs, pagesToCrawl, of type CrawlURL:
class CrawlURL {
    public String url;
    public int depth;

    public CrawlURL(String url, int depth) {
        this.url = url;
        this.depth = depth;
    }
}
initially (before entering the loop):
Queue<CrawlURL> pagesToCrawl = new LinkedList<>();
pagesToCrawl.add(new CrawlURL(rootUrl, 0)); //rootUrl is the url to start from
now the loop:
while (!pagesToCrawl.isEmpty()) { // will proceed at least once (for rootUrl)
    CrawlURL currentUrl = pagesToCrawl.remove();
    // analyze the url
    // update with crawled links
}
and the updating with links:
if (currentUrl.depth < 2) {
    for (String url : leg.getLinks()) { // referring to your analysis result
        pagesToCrawl.add(new CrawlURL(url, currentUrl.depth + 1));
    }
}
You could enhance CrawlURL with other meta data (e.g. link name, referrer, etc.), as sketched below.
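For instance (a purely illustrative extension; the extra fields are hypothetical):
class CrawlURL {
    public String url;
    public int depth;
    public String referrer; // page on which this link was found
    public String linkText; // the anchor text of the link

    public CrawlURL(String url, int depth, String referrer, String linkText) {
        this.url = url;
        this.depth = depth;
        this.referrer = referrer;
        this.linkText = linkText;
    }
}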
Alternative:
In my comment above I mentioned a generational approach. It's a bit more complex than this one. The basic idea is to keep two lists (currentPagesToCrawl and futurePagesToCrawl) together with a generation variable (starting at 0 and increasing every time currentPagesToCrawl becomes empty). All crawled urls are put into the futurePagesToCrawl queue, and when currentPagesToCrawl becomes empty, the two lists are swapped. This is done until the generation variable reaches 2.
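A minimal sketch of that generational variant (the crawl() helper, which fetches a page and returns its links, is a placeholder and not part of the code above):
Queue<String> currentPagesToCrawl = new LinkedList<>();
Queue<String> futurePagesToCrawl = new LinkedList<>();
int generation = 0;
currentPagesToCrawl.add(rootUrl);
while (generation < 2) { // stop once the generation variable reaches 2
    while (!currentPagesToCrawl.isEmpty()) {
        String url = currentPagesToCrawl.remove();
        futurePagesToCrawl.addAll(crawl(url)); // crawl(url): fetch the page, return its links
    }
    // current generation exhausted: swap the two queues and advance the generation
    Queue<String> tmp = currentPagesToCrawl;
    currentPagesToCrawl = futurePagesToCrawl;
    futurePagesToCrawl = tmp;
    generation++;
}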

You could add a depth parameter to the signature of your recursive method, e.g.
on your main
recursiveCrawl(map,0);
and
public static void recursiveCrawl(Map<String, String> map, int depth) throws IOException {
    if (depth++ < DESIRED_DEPTH) { // assuming initial depth = 0
        for (Map.Entry<String, String> entry : map.entrySet()) {
            String key = entry.getKey();
            Map<String, String> recurSive = returnURL(key);
            recursiveCrawl(recurSive, depth);
        }
    }
}

You can do something like this:
static int maxLevels = 10;

public static void main(String args[]) throws IOException {
    ...
    recursiveCrawl(map, 0);
    ...
}

public static void recursiveCrawl(Map<String, String> map, int level) throws IOException {
    for (Map.Entry<String, String> entry : map.entrySet()) {
        String key = entry.getKey();
        Map<String, String> recurSive = returnURL(key);
        if (level < maxLevels) {
            recursiveCrawl(recurSive, level + 1); // level + 1 (not ++level) so siblings share the same depth
        }
    }
}
Also, you can use a Set instead of a Map.
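For example, a sketch of the same visited-URL bookkeeping with a Set (requires Java 8+ for ConcurrentHashMap.newKeySet()):
static Set<String> visited = ConcurrentHashMap.newKeySet(); // thread-safe Set view

// in returnURL(), instead of retMap.containsKey(URL) / retMap.put(URL, URL):
if (URL != null && !URL.equals("") && !visited.contains(URL)) {
    // ... crawl the page ...
    visited.add(URL);
}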

Related

Collect all elements in a JSON file into a single list

I am using Gson 2.8.1+ (I can upgrade if needed).
If I have the JsonObject:
"config" : {
"option_one" : {
"option_two" : "result_one"
}
}
}
... how can I convert this efficiently to the form:
"config.option_one.option_two" : "result_one"
Simple example:
public static void main(String[] args) {
    String str = """
            {
              "config" : {
                "option_one" : {
                  "option_two" : "result_one"
                }
              }
            }""";
    var obj = JsonParser.parseString(str).getAsJsonObject();
    System.out.println(flatten(obj)); // {"config.option_one.option_two":"result_one"}
}

public static JsonObject flatten(JsonObject toFlatten) {
    var flattened = new JsonObject();
    flatten0("", toFlatten, flattened);
    return flattened;
}

private static void flatten0(String prefix, JsonObject toFlatten, JsonObject toMutate) {
    for (var entry : toFlatten.entrySet()) {
        var keyWithPrefix = prefix + entry.getKey();
        if (entry.getValue() instanceof JsonObject child) {
            flatten0(keyWithPrefix + ".", child, toMutate);
        } else {
            toMutate.add(keyWithPrefix, entry.getValue());
        }
    }
}
Algorithm
The simplest algorithm you can come up with is recursive folding. You first dive recursively to the bottom of the structure, then ask if there is only one element in the map (you have to parse the JSON with some framework to get a Map<String, Object> structure). If there is, you join the parent field's name with the property and set the parent's value to the value of that property. Then you move up and repeat the process until you are at the root. Of course, if a map has multiple fields, you move on to the parent and try again.
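A rough sketch of that idea over an already-parsed Map<String, Object> structure (it folds every nested map into dotted keys, not only single-entry ones):
// Folds nested maps into dotted keys, e.g. {config={option_one={option_two=result_one}}}
// becomes {config.option_one.option_two=result_one}.
static Map<String, Object> fold(Map<String, Object> node) {
    Map<String, Object> result = new LinkedHashMap<>();
    for (Map.Entry<String, Object> entry : node.entrySet()) {
        if (entry.getValue() instanceof Map) {
            @SuppressWarnings("unchecked")
            Map<String, Object> child = fold((Map<String, Object>) entry.getValue());
            for (Map.Entry<String, Object> sub : child.entrySet()) {
                result.put(entry.getKey() + "." + sub.getKey(), sub.getValue());
            }
        } else {
            result.put(entry.getKey(), entry.getValue());
        }
    }
    return result;
}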
Gson does not have anything like that built in, but it provides enough capabilities to build it on top: you can walk JSON streams (JsonReader) with a stack, and trees (JsonElement, not wrapped into a JsonReader) with a stack or recursively (streams may save a lot of memory).
I would create a generic tree-walking method so it can be adapted for other purposes.
public static void walk(final JsonElement jsonElement, final BiConsumer<? super Collection<?>, ? super JsonElement> consumer) {
    final Deque<Object> parents = new ArrayDeque<>();
    parents.push("$");
    walk(jsonElement, consumer, parents);
}

private static void walk(final JsonElement jsonElement, final BiConsumer<? super Collection<?>, ? super JsonElement> consumer, final Deque<Object> path) {
    if ( jsonElement.isJsonNull() ) {
        consumer.accept(path, jsonElement);
    } else if ( jsonElement.isJsonPrimitive() ) {
        consumer.accept(path, jsonElement);
    } else if ( jsonElement.isJsonObject() ) {
        for ( final Map.Entry<String, JsonElement> e : jsonElement.getAsJsonObject().entrySet() ) {
            path.addLast(e.getKey());
            walk(e.getValue(), consumer, path);
            path.removeLast();
        }
    } else if ( jsonElement.isJsonArray() ) {
        int i = 0;
        for ( final JsonElement e : jsonElement.getAsJsonArray() ) {
            path.addLast(i++);
            walk(e, consumer, path);
            path.removeLast();
        }
    } else {
        throw new AssertionError(jsonElement);
    }
}
Note that the method above also supports arrays. The walk method is push-driven: it uses callbacks to report the walk progress. Making it lazy by returning an iterator or a stream would probably be cheaper and would give pull semantics. Also, CharSequence views of the elements would probably save on creating many strings.
public static String toJsonPath(final Iterable<?> path) {
    final StringBuilder stringBuilder = new StringBuilder();
    final Iterator<?> iterator = path.iterator();
    if ( iterator.hasNext() ) {
        final Object next = iterator.next();
        stringBuilder.append(next);
    }
    while ( iterator.hasNext() ) {
        final Object next = iterator.next();
        if ( next instanceof Number ) {
            stringBuilder.append('[').append(next).append(']');
        } else if ( next instanceof CharSequence ) {
            stringBuilder.append('.').append(next);
        } else {
            throw new UnsupportedOperationException("Unsupported: " + next);
        }
    }
    return stringBuilder.toString();
}
Test:
final JsonElement jsonElement = Streams.parse(jsonReader);
final Collection<String> paths = new ArrayList<>();
JsonPaths.walk(jsonElement, (path, element) -> paths.add(JsonPaths.toJsonPath(path)));
for ( final String path : paths ) {
    System.out.println(path);
}
Assertions.assertIterableEquals(
        ImmutableList.of(
                "$.nothing",
                "$.number",
                "$.object.subobject.number",
                "$.array[0].string",
                "$.array[1].string",
                "$.array[2][0][0][0]"
        ),
        paths
);
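For reference, the expected paths above correspond to an input document shaped roughly like this (reconstructed from the assertions; the leaf values are made up):
{
    "nothing": null,
    "number": 1,
    "object": { "subobject": { "number": 2 } },
    "array": [ { "string": "a" }, { "string": "b" }, [ [ [ "x" ] ] ] ]
}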

Creating and assigning tasks to threads in java

I have a piece of code here that takes a long time to run. The code basically goes through each file in the file list and does some work. How do I create 4 threads and let each of them handle one file (since there are only 4 files)?
public static void run(String referenceFile, String dir) throws FileNotFoundException, IOException
{
    List<Gene> referenceGenes = ParseReferenceGenes(referenceFile);
    List<String> filenames = ListGenbankFiles(dir);
    for (String filename : filenames)
    {
        System.out.println(filename);
        GenbankRecord record = Parse(filename);
        for (Gene referenceGene : referenceGenes)
        {
            System.out.println(referenceGene.name);
            for (Gene gene : record.genes)
            {
                if (Homologous(gene.sequence, referenceGene.sequence)) {
                    NucleotideSequence upStreamRegion = GetUpstreamRegion(record.nucleotides, gene);
                    Match prediction = PredictPromoter(upStreamRegion);
                    if (prediction != null) {
                        consensus.get(referenceGene.name).addMatch(prediction);
                        consensus.get("all").addMatch(prediction);
                    }
                }
            }
        }
    }
    for (Map.Entry<String, Sigma70Consensus> entry : consensus.entrySet())
        System.out.println(entry.getKey() + " " + entry.getValue());
}
You can use Executors.newFixedThreadPool(4) to create an ExecutorService and use its execute() method to start your tasks, or alternatively invokeAll() when you need to return results.
For example, do something like this:
public static void run(String referenceFile, String dir)
        throws FileNotFoundException, IOException, InterruptedException, ExecutionException
{
    List<Gene> referenceGenes = ParseReferenceGenes(referenceFile);
    List<String> filenames = ListGenbankFiles(dir);
    ExecutorService executor = Executors.newFixedThreadPool(4);
    List<Callable<Map<String, Sigma70Consensus>>> tasks = new ArrayList<>(); // one task per file
    for (String filename : filenames)
    {
        tasks.add(new MyTask(filename));
    }
    for (Future<Map<String, Sigma70Consensus>> result : executor.invokeAll(tasks)) {
        for (Map.Entry<String, Sigma70Consensus> entry : result.get().entrySet())
            System.out.println(entry.getKey() + " " + entry.getValue());
    }
    executor.shutdown();
}
public class MyTask implements Callable<Map<String, Sigma70Consensus>> {
    private final String fileName;

    public MyTask(String fileName) { this.fileName = fileName; }

    public Map<String, Sigma70Consensus> call() {
        /* The file processing code you have in the loop here */
    }
}
If you need a more detailed explanation of this approach, please refer to this tutorial.
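If you don't need the results back on the calling thread, the execute() variant mentioned above would look roughly like this (processFile() is a placeholder for the per-file work; shutdown/awaitTermination are added so the program can finish cleanly):
ExecutorService executor = Executors.newFixedThreadPool(4);
for (String filename : filenames) {
    executor.execute(() -> processFile(filename)); // fire-and-forget task per file
}
executor.shutdown(); // no new tasks accepted
try {
    executor.awaitTermination(1, TimeUnit.HOURS); // wait for submitted tasks to finish
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}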

How to remove a query parameter from a query string

I am using UriBuilder to remove a parameter from a URI:
public static URI removeParameterFromURI(URI uri, String param) {
    UriBuilder uriBuilder = UriBuilder.fromUri(uri);
    return uriBuilder.replaceQueryParam(param, "").build();
}

public static String removeParameterFromURIString(String uriString, String param) {
    try {
        URI uri = removeParameterFromURI(new URI(uriString), param);
        return uri.toString();
    } catch (URISyntaxException e) {
        throw new RuntimeException(e);
    }
}
The above sort of works and modifies:
http://a.b.c/d/e/f?foo=1&bar=2&zar=3
… into:
http://a.b.c/d/e/f?bar=&foo=1&zar=3
But it has the following issues:
It messes up the order of the parameters. I know that the order is not relevant, but it still bothers me.
It doesn't fully remove the parameter; it just sets its value to the empty string. I would prefer it if the parameter were completely removed from the query string.
Is there some standard or commonly used library that can achieve the above neatly, without me having to parse and hack the query string myself?
In Android, without importing any library, I wrote a util method inspired by this answer: Replace query parameters in Uri.Builder in Android?
Hope it can help you. Code below:
public static Uri removeUriParameter(Uri uri, String key) {
    final Set<String> params = uri.getQueryParameterNames();
    final Uri.Builder newUri = uri.buildUpon().clearQuery();
    for (String param : params) {
        if (!param.equals(key)) {
            newUri.appendQueryParameter(param, uri.getQueryParameter(param));
        }
    }
    return newUri.build();
}
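Example usage (a hypothetical URL):
Uri stripped = removeUriParameter(Uri.parse("http://a.b.c/d/e/f?foo=1&bar=2&zar=3"), "bar");
// stripped.toString() -> "http://a.b.c/d/e/f?foo=1&zar=3"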
Using httpclient's URIBuilder would be much cleaner, if you can use it.
public String removeQueryParameter(String url, String parameterName) throws URISyntaxException {
    URIBuilder uriBuilder = new URIBuilder(url);
    List<NameValuePair> queryParameters = uriBuilder.getQueryParams();
    for (Iterator<NameValuePair> queryParameterItr = queryParameters.iterator(); queryParameterItr.hasNext();) {
        NameValuePair queryParameter = queryParameterItr.next();
        if (queryParameter.getName().equals(parameterName)) {
            queryParameterItr.remove();
        }
    }
    uriBuilder.setParameters(queryParameters);
    return uriBuilder.build().toString();
}
If you are on Android and want to remove all query parameters, you can use
Uri uriWithoutQuery = Uri.parse(urlWithQuery).buildUpon().clearQuery().build();
Using streams and URIBuilder from httpclient, it would look like this:
public String removeQueryParameter(String url, String parameterName) throws URISyntaxException {
    URIBuilder uriBuilder = new URIBuilder(url);
    List<NameValuePair> queryParameters = uriBuilder.getQueryParams()
            .stream()
            .filter(p -> !p.getName().equals(parameterName))
            .collect(Collectors.toList());
    if (queryParameters.isEmpty()) {
        uriBuilder.removeQuery();
    } else {
        uriBuilder.setParameters(queryParameters);
    }
    return uriBuilder.build().toString();
}
To fully remove the parameter, you can use
public static URI removeParameterFromURI(URI uri, String param) {
    UriBuilder uriBuilder = UriBuilder.fromUri(uri);
    return uriBuilder.replaceQueryParam(param, (Object[]) null).build();
}
The following piece of code worked for me:
Code:
import java.util.Arrays;
import java.util.stream.Collectors;

public class RemoveURL {

    public static void main(String[] args) {
        final String remove = "password";
        final String url = "http://testdomainxyz.com?username=john&password=cena&password1=cena";
        System.out.println(url);
        System.out.println(RemoveURL.removeParameterFromURL(url, remove));
    }

    public static String removeParameterFromURL(final String url, final String remove) {
        final String[] urlArr = url.split("\\?");
        final String params = Arrays.asList(urlArr[1].split("&")).stream()
                .filter(item -> !item.split("=")[0].equalsIgnoreCase(remove))
                .collect(Collectors.joining("&"));
        return String.join("?", urlArr[0], params);
    }
}
Output
http://testdomainxyz.com?username=john&password=cena&password1=cena
http://testdomainxyz.com?username=john&password1=cena
Based on the suggestion by JB Nizet, this is what I ended up doing (I added some extra logic to be able to assert whether I expect the parameter to be present, and if so, how many times):
public static URI removeParameterFromURI(URI uri, String parameter, boolean assertAtLeastOneIsFound, Integer assertHowManyAreExpected) {
    Assert.assertFalse("it makes no sense to expect 0 or less",
            (assertHowManyAreExpected != null) && (assertHowManyAreExpected <= 0));
    Assert.assertFalse("it makes no sense to not assert that at least one is found and at the same time assert a definite expected number",
            (!assertAtLeastOneIsFound) && (assertHowManyAreExpected != null));
    String queryString = uri.getQuery();
    if (queryString == null)
        return uri;
    Map<String, List<String>> params = parseQuery(queryString);
    Map<String, List<String>> paramsModified = new LinkedHashMap<>();
    boolean found = false;
    for (String key : params.keySet()) {
        if (!key.equals(parameter))
            Assert.assertNull(paramsModified.put(key, params.get(key)));
        else {
            found = true;
            if (assertHowManyAreExpected != null) {
                Assert.assertEquals((long) assertHowManyAreExpected, params.get(key).size());
            }
        }
    }
    if (assertAtLeastOneIsFound)
        Assert.assertTrue(found);
    UriBuilder uriBuilder = UriBuilder.fromUri(uri)
            .replaceQuery("");
    for (String key : paramsModified.keySet()) {
        List<String> values = paramsModified.get(key);
        uriBuilder = uriBuilder.queryParam(key, (Object[]) values.toArray(new String[values.size()]));
    }
    return uriBuilder.build();
}

public static String removeParameterFromURI(String uri, String parameter, boolean assertAtLeastOneIsFound, Integer assertHowManyAreExpected) {
    try {
        return removeParameterFromURI(new URI(uri), parameter, assertAtLeastOneIsFound, assertHowManyAreExpected).toString();
    } catch (URISyntaxException e) {
        throw new RuntimeException(e);
    }
}
private static Map<String, List<String>> parseQuery(String queryString) {
    try {
        final Map<String, List<String>> query_pairs = new LinkedHashMap<String, List<String>>();
        final String[] pairs = queryString.split("&");
        for (String pair : pairs) {
            final int idx = pair.indexOf("=");
            final String key = idx > 0 ? URLDecoder.decode(pair.substring(0, idx), StandardCharsets.UTF_8.name()) : pair;
            if (!query_pairs.containsKey(key)) {
                query_pairs.put(key, new ArrayList<String>());
            }
            final String value = idx > 0 && pair.length() > idx + 1 ? URLDecoder.decode(pair.substring(idx + 1), StandardCharsets.UTF_8.name()) : null;
            query_pairs.get(key).add(value);
        }
        return query_pairs;
    } catch (UnsupportedEncodingException e) {
        throw new RuntimeException(e);
    }
}
You can use a simpler method from Collection, based on @Flip's solution:
public String removeQueryParameter(String url, String parameterName) throws URISyntaxException {
    URIBuilder uriBuilder = new URIBuilder(url);
    List<NameValuePair> queryParameters = uriBuilder.getQueryParams();
    queryParameters.removeIf(param -> param.getName().equals(parameterName));
    uriBuilder.setParameters(queryParameters);
    return uriBuilder.build().toString();
}
I am not sure if there is a library to help, but I would just split the string on "?", take the second half, and split it on "&". Then I would rebuild the string accordingly.
public static void main(String[] args) {
    // TODO code application logic here
    System.out.println("original: http://a.b.c/d/e/f?foo=1&bar=2&zar=3");
    System.out.println("new     : " + fixString("http://a.b.c/d/e/f?foo=1&bar=2&zar=3"));
}

static String fixString(String original)
{
    String[] processing = original.split("\\?");
    String[] processing2ndHalf = processing[1].split("&");
    return processing[0] + "?" + processing2ndHalf[1] + "&" + processing2ndHalf[0] + "&" + processing2ndHalf[2];
}
Output:
original: http://a.b.c/d/e/f?foo=1&bar=2&zar=3
new     : http://a.b.c/d/e/f?bar=2&foo=1&zar=3
To remove a parameter, just omit it from the returned string.
public static String removeQueryParameter(String url, List<String> removeNames) {
    try {
        Map<String, String> queryMap = new HashMap<>();
        Uri uri = Uri.parse(url);
        Set<String> queryParameterNames = uri.getQueryParameterNames();
        for (String queryParameterName : queryParameterNames) {
            if (TextUtils.isEmpty(queryParameterName)
                    || TextUtils.isEmpty(uri.getQueryParameter(queryParameterName))
                    || removeNames.contains(queryParameterName)) {
                continue;
            }
            queryMap.put(queryParameterName, uri.getQueryParameter(queryParameterName));
        }
        // remove all params
        Uri.Builder uriBuilder = uri.buildUpon().clearQuery();
        for (String name : queryMap.keySet()) {
            uriBuilder.appendQueryParameter(name, queryMap.get(name));
        }
        return uriBuilder.build().toString();
    } catch (Exception e) {
        return url;
    }
}
@TTKatrina's answer worked for me, but I needed to remove the query param from the fragment too. So I extended it to handle fragments and came up with this:
fun Uri.removeQueryParam(key: String): Uri {
    // Create a new Uri builder with no query params.
    val builder = buildUpon().clearQuery()
    // Add all query params excluding the key we don't want back to the new Uri.
    queryParameterNames.filter { it != key }
        .onEach { builder.appendQueryParameter(it, getQueryParameter(it)) }
    // If the query param is in the fragment, remove it from there too.
    val fragmentUri = fragment?.toUri()
    if (fragmentUri != null) {
        builder.encodedFragment(fragmentUri.removeQueryParam(key).toString())
    }
    // Now this Uri doesn't have the query param for [key]
    return builder.build()
}

Matching Keys in a HashMap

I am attempting to do the following (in pseudocode):

1. Generate HashMapOne that will be populated by results found in a DICOM file (the key was manipulated for matching purposes).
2. Generate a second HashMapTwo that will be read from a text document.
3. Compare the keys of both HashMaps; on a match, add the value from HashMapOne to a new HashMapThree.

I am getting stuck adding the matched key's value to HashMapThree. It always stores a null value, despite my declaring it a public static variable. Can anyone tell me why this may be? Here are the code snippets:
public class viewDICOMTags {

    HashMap<String, String> dicomFile = new HashMap<String, String>();
    HashMap<String, String> dicomTagList = new HashMap<String, String>();
    HashMap<String, String> Result = new HashMap<String, String>();
    Iterator<org.dcm4che2.data.DicomElement> iter = null;
    DicomObject working;
    public static DicomElement element;
    DicomElement elementTwo;
    public static String result;
    File dicomList = new File("C:\\Users\\Ryan\\dicomTagList.txt");

    public void readDICOMObject(String path) throws IOException
    {
        DicomInputStream din = null;
        din = new DicomInputStream(new File(path));
        try {
            working = din.readDicomObject();
            iter = working.iterator();
            while (iter.hasNext())
            {
                element = iter.next();
                result = element.toString();
                String s = element.toString().substring(0, Math.min(element.toString().length(), 11));
                dicomFile.put(String.valueOf(s.toString()), element.vr().toString());
            }
            System.out.println("Collected tags, VR Code, and Description from DICOM file....");
        }
        catch (IOException e)
        {
            e.printStackTrace();
            return;
        }
        finally {
            try {
                din.close();
            }
            catch (IOException ignore) {
            }
        }
        readFromTextFile();
    }

    public void readFromTextFile() throws IOException
    {
        try
        {
            String dicomData = "DICOM";
            String line = null;
            BufferedReader bReader = new BufferedReader(new FileReader(dicomList));
            while ((line = bReader.readLine()) != null)
            {
                dicomTagList.put(line.toString(), dicomData);
            }
            System.out.println("Reading Tags from Text File....");
            bReader.close();
        }
        catch (FileNotFoundException e)
        {
            System.err.print(e);
        }
        catch (IOException i)
        {
            System.err.print(i);
        }
        compareDICOMSets();
    }

    public void compareDICOMSets() throws IOException
    {
        for (Entry<String, String> entry : dicomFile.entrySet())
        {
            if (dicomTagList.containsKey(entry.getKey()))
                Result.put(entry.getKey(), dicomFile.get(element.toString()));
            System.out.println(dicomFile.get(element.toString()));
        }
        SortedSet<String> keys = new TreeSet<String>(Result.keySet());
        for (String key : keys) {
            String value = Result.get(key);
            System.out.println(key);
        }
    }
}
This line of code looks very wrong
Result.put(entry.getKey(), dicomFile.get(element.toString()));
If you are trying to copy the key/value pair from HashMapOne, then this is not correct.
The value for each key added to Result will be null, because you are calling the get method on dicomFile with element.toString() as the lookup key, and element is just the last element that was read from your file.
I think you should be using
Result.put(entry.getKey(), entry.getValue());
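With that fix, the comparison loop from the question would look like this (a sketch; braces added so the println only runs on a match):
public void compareDICOMSets() throws IOException
{
    for (Entry<String, String> entry : dicomFile.entrySet())
    {
        if (dicomTagList.containsKey(entry.getKey())) {
            Result.put(entry.getKey(), entry.getValue()); // copy the matched key/value pair
            System.out.println(entry.getValue());
        }
    }
}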

How to get HtmlElements from a website

I am trying to get URLs and HTML elements from a website. I am able to get the URLs and HTML, but when one URL contains multiple elements (like multiple input elements or multiple textarea elements), I get only the last element. The code is below.
GetURLsAndElemens.java
public static void main(String[] args) throws FileNotFoundException,
        IOException, ParseException {
    Properties properties = new Properties();
    properties.load(new FileInputStream(
            "src//io//servicely//ci//plugin//SeleniumResources.properties"));
    Map<String, String> urls = gettingUrls(properties.getProperty("MAIN_URL"));
    GettingHTMLElements.getHTMLElements(urls);
    // System.out.println(urls.size());
    // System.out.println(urls);
}

public static Map<String, String> gettingUrls(String mainURL) {
    Document doc = null;
    Map<String, String> urlsList = new HashMap<String, String>();
    try {
        System.out.println("Main URL " + mainURL);
        // need http protocol
        doc = Jsoup.connect(mainURL).get();
        GettingHTMLElements.getInputElements(doc, mainURL);
        // get page title
        // String title = doc.title();
        // System.out.println("title : " + title);
        // get all links
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            // urlsList.clear();
            // get the value from href attribute and add it to the list
            if (link.attr("href").contains("http")) {
                urlsList.put(link.attr("href"), link.text());
            } else {
                urlsList.put(mainURL + link.attr("href"), link.text());
            }
            // System.out.println(urlsList);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    // System.out.println("Total urls are " + urlsList.size());
    // System.out.println(urlsList);
    return urlsList;
}
GettingHtmlElements.java
static Map<String, HtmlElements> urlList = new HashMap<String, HtmlElements>();

public static void getHTMLElements(Map<String, String> urls)
        throws IOException {
    getElements(urls);
}

public static void getElements(Map<String, String> urls) throws IOException {
    for (Map.Entry<String, String> entry1 : urls.entrySet()) {
        try {
            System.out.println(entry1.getKey());
            Document doc = Jsoup.connect(entry1.getKey()).get();
            getInputElements(doc, entry1.getKey());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    Map<String, HtmlElements> list = urlList;
    for (Map.Entry<String, HtmlElements> entry1 : list.entrySet()) {
        HtmlElements ele = entry1.getValue();
        System.out.println("url is " + entry1.getKey());
        System.out.println("input name " + ele.getInput_name());
    }
}
public static HtmlElements getInputElements(Document doc, String entry1) {
    HtmlElements htmlElements = new HtmlElements();
    Elements inputElements2 = doc.getElementsByTag("input");
    Elements textAreaElements2 = doc.getElementsByTag("textarea");
    Elements formElements3 = doc.getElementsByTag("form");
    for (Element inputElement : inputElements2) {
        String key = inputElement.attr("name");
        htmlElements.setInput_name(key);
        String key1 = inputElement.attr("type");
        htmlElements.setInput_type(key1);
        String key2 = inputElement.attr("class");
        htmlElements.setInput_class(key2);
    }
    for (Element inputElement : textAreaElements2) {
        String key = inputElement.attr("id");
        htmlElements.setTextarea_id(key);
        String key1 = inputElement.attr("name");
        htmlElements.setTextarea_name(key1);
    }
    for (Element inputElement : formElements3) {
        String key = inputElement.attr("method");
        htmlElements.setForm_method(key);
        String key1 = inputElement.attr("action");
        htmlElements.setForm_action(key1);
    }
    return urlList.put(entry1, htmlElements);
}
I want to capture the elements as a bean. For every URL I am getting the URLs and HTML elements, but when a URL contains multiple elements I get only the last one.
You use a class HtmlElements which is not part of JSoup as far as I know. I don't know its inner workings, but I assume it is some sort of list of html nodes or something.
However, you seem to use this class like this:
HtmlElements htmlElements = new HtmlElements();
htmlElements.setInput_name(key);
This indicates that only ONE html element is stored in the htmlElements variable. This would explain why you get only the last element stored - you simply overwrite the one instance all the time.
It is not really clear, since I don't know the HtmlElements class. Maybe something like this works, assuming that HtmlElement represents a single instance of HtmlElements and HtmlElements has an add method:
HtmlElements htmlElements = new HtmlElements();
...
for (Element inputElement : inputElements2) {
    HtmlElement e = new HtmlElement();
    htmlElements.add(e);
    String key = inputElement.attr("name");
    e.setInput_name(key);
}
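Alternatively, if you cannot change HtmlElements, you could keep one bean per element and collect them into a list per URL (a sketch; it reuses your bean and setters, with one instance per input tag so nothing gets overwritten):
static Map<String, List<HtmlElements>> urlList = new HashMap<String, List<HtmlElements>>();

public static void getInputElements(Document doc, String url) {
    List<HtmlElements> elements = new ArrayList<HtmlElements>();
    for (Element inputElement : doc.getElementsByTag("input")) {
        HtmlElements bean = new HtmlElements(); // fresh bean per element
        bean.setInput_name(inputElement.attr("name"));
        bean.setInput_type(inputElement.attr("type"));
        bean.setInput_class(inputElement.attr("class"));
        elements.add(bean);
    }
    urlList.put(url, elements);
}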
