java.lang.IllegalArgumentException: Must supply a valid URL - java

I'm trying to build a web crawler for my OOP class. The crawler needs to traverse 1000 Wikipedia pages and collect the titles and words from each page. My current code traverses a single page and collects the required information, but it also gives me the error "java.lang.IllegalArgumentException: Must supply a valid URL:". Here is my crawler's code. I've been using the jsoup library.
import java.util.HashMap;
import java.util.HashSet;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class crawler {
private static final int MAX_PAGES = 1000;
private final HashSet<String> titles = new HashSet<>();
private final HashSet<String> urlVisited = new HashSet<>();
private final HashMap<String, Integer> map = new HashMap<>();
public void getLinks(String startURL) {
if ((titles.size() < MAX_PAGES) && !urlVisited.contains(startURL)) {
urlVisited.add(startURL);
try {
Document doc = Jsoup.connect(startURL).get();
Elements linksFromPage = doc.select("a[href]");
String title = doc.select("title").first().text();
titles.add(title);
String text = doc.body().text();
CountWords(text);
for (Element link : linksFromPage) {
if(titles.size() <= MAX_PAGES) {
Thread.sleep(50);
getLinks(link.attr("a[href]"));
}
else {
System.out.println("URL couldn't visit");
System.out.println(startURL + ", " + urlVisited.size());
}
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
public void PrintAllTitles() {
for (String t : titles) {
System.out.println(t);
}
}
public void PrintAllWordsAndCount() {
for (String key : map.keySet()) {
System.out.println(key + " : " + map.get(key));
}
}
private void CountWords(String text) {
String[] lines = text.split(" ");
for (String word : lines) {
if (map.containsKey(word)) {
int val = map.get(word);
val += 1;
map.remove(word);
map.put(word, val);
} else {
map.put(word, 1);
}
}
}
}
The driver just calls c.getLinks("https://en.wikipedia.org/wiki/Computer")
with that page as the starting URL.

The issue is in this line:
getLinks(link.attr("a[href]"));
link.attr(attributeName) is a method for getting an element's attribute by name. But a[href] is a CSS selector. So that method call returns a blank String (as there is no attribute in the element named a[href]), which is not a valid URL, and so you get the validation exception.
Before you call connect, you should log the URL you are about to hit. That way you will see the error.
You should change the line to:
getLinks(link.attr("abs:href"));
That will get the absolute URL pointed to by the href attribute. Most of the hrefs on that page are relative, so it's important to make them absolute before they are made into a URL for connect().
You can see the URLs that the first a[href] selector will return here. You should also think about how to only fetch HTML pages and not images (e.g., maybe filter out by filetype).
There is more detail and examples of this area in the Working with URLs article of jsoup.
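To illustrate the points above (blank hrefs, relative links, and skipping non-HTML targets), here is a minimal sketch that uses only java.net.URI rather than jsoup. The class name LinkFilter, the method toCrawlableUrl, and the list of skipped extensions are all hypothetical choices for this example; in the crawler itself, link.attr("abs:href") already does the resolving step.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

class LinkFilter {
    // Hypothetical list of file types the crawler should not fetch.
    private static final Set<String> SKIPPED_EXTENSIONS =
            Set.of(".png", ".jpg", ".jpeg", ".gif", ".svg", ".pdf");

    // Resolves href against the page it was found on and returns it only
    // if it looks like an http(s) link to something other than an image.
    static Optional<String> toCrawlableUrl(String pageUrl, String href) {
        if (href == null || href.isBlank()) {
            return Optional.empty(); // this is the case that triggers "Must supply a valid URL"
        }
        URI resolved;
        try {
            resolved = URI.create(pageUrl).resolve(href.trim());
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // syntactically invalid link
        }
        String scheme = resolved.getScheme();
        if (scheme == null || !(scheme.equals("http") || scheme.equals("https"))) {
            return Optional.empty(); // e.g. mailto: or javascript: links
        }
        String path = resolved.getPath() == null ? "" : resolved.getPath().toLowerCase(Locale.ROOT);
        for (String ext : SKIPPED_EXTENSIONS) {
            if (path.endsWith(ext)) {
                return Optional.empty();
            }
        }
        return Optional.of(resolved.toString());
    }
}
```

Calling LinkFilter.toCrawlableUrl(startURL, link.attr("href")) before getLinks and logging the result would show which links are being dropped and why the original call received an empty string.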

Related

How to keep this code repeating more than once

My code pulls the links and adds them to the HashSet. I want the link to replace the original link and repeat the process till no more new links can be found to add. The program keeps running but the link isn't updating and the program gets stuck in an infinite loop doing nothing. How do I get the link to update so the program can repeat until no more links can be found?
package downloader;
import java.io.IOException;
import java.net.URL;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Stage2 {
public static void main(String[] args) throws IOException {
int q = 0;
int w = 0;
HashSet<String> chapters = new HashSet();
String seen = new String("/manga/manabi-ikiru-wa-fuufu-no-tsutome/i1778063/v1/c1");
String source = new String("https://mangapark.net" + seen);
// 0123456789
while( q == w ) {
String source2 = new String(source.substring(21));
String last = new String(source.substring(source.length() - 12));
String last2 = new String(source.substring(source.length() - 1));
chapters.add(seen);
for (String link : findLinks(source)) {
if(link.contains("/manga") && !link.contains(last) && link.contains("/i") && link.contains("/c") && !chapters.contains(link)) {
chapters.add(link);
System.out.println(link);
seen = link;
System.out.print(chapters);
System.out.println(seen);
}
}
}
System.out.print(chapters);
}
private static Set<String> findLinks(String url) throws IOException {
Set<String> links = new HashSet<>();
Document doc = Jsoup.connect(url)
.data("query", "Java")
.userAgent("Mozilla")
.cookie("auth", "token")
.timeout(3000)
.get();
Elements elements = doc.select("a[href]");
for (Element element : elements) {
links.add(element.attr("href"));
}
return links;
}
}
Your program doesn't stop because the while condition never changes:
while( q == w )
is always true, since neither q nor w is modified inside the loop. When I ran your code without the while loop, I got 2 links printed twice(!) and then the program stopped.
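One way to make the loop end on its own is to keep a queue of pages still to visit and stop when it is empty. The sketch below is an assumption about what the asker wants, not a fix of the original program: the class ChapterCollector and the findLinks parameter are hypothetical, and the link fetcher is passed in as a function so the loop can be tried without network access. Note also that in the question's code the variable source is assigned once and never updated, so every pass fetches the same page.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Function;

class ChapterCollector {
    // Visits start, then every link reachable from it, each exactly once.
    // The loop ends when no unvisited link is left in the queue.
    static Set<String> collect(String start, Function<String, Set<String>> findLinks) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        seen.add(start);
        pending.add(start);
        while (!pending.isEmpty()) {
            String current = pending.poll();
            for (String link : findLinks.apply(current)) {
                // add() returns false for links that were already recorded
                if (seen.add(link)) {
                    pending.add(link);
                }
            }
        }
        return seen;
    }
}
```

To use this with the question's findLinks, the IOException would have to be caught inside the function passed in, and the /manga, /i and /c filters from the original if statement would go in the same place.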
If you want the links to the other chapters, you have the same problem I do. In the element
Element element = doc.getElementById("sel_book_1");
the links come after the ::before pseudo-element, so they will not be in your Jsoup Document.
Here is my question on this topic:
How can I find a HTML tag with the pseudoElement ::before in jsoup

how to check if webelements are in order?

Using the code below, I am trying to open a page, go to the Mobile section, and sort the items by name. Now I want to check whether the mobile devices are sorted by name, i.e. alphabetically.
I tried to convert my List below to an ArrayList, but I am not able to check whether the printed elements are in ascending order. Kindly help.
package selflearning;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.support.ui.Select;
public class Guru99Ecommerce1 {
public static void main(String[] args) throws Exception {
System.setProperty("webdriver.gecko.driver","C:\\geckodriver\\geckodriver.exe");
WebDriver driver = new FirefoxDriver();
driver.get("http://live.guru99.com/index.php/");
String title=driver.getTitle();
String expectedTitle = "Home page";
System.out.println("The title of the webPage is " + title);
expectedTitle.equalsIgnoreCase(title);
System.out.println("Title is verified");
driver.findElement(By.xpath("//a[text()='Mobile']")).click();
String nextTitle = driver.getTitle();
System.out.println("The title of next page" + nextTitle);
String nextExpectedTitle = "pageMobile";
nextExpectedTitle.equalsIgnoreCase(nextTitle);
System.out.println("The next title is verified");
Select s = new Select(driver.findElement(By.xpath("//div[@class='category-products']//div/div[@class='sorter']/div/select[@title='Sort By']")));
s.selectByVisibleText("Name");
List<WebElement> element = driver.findElements(By.xpath("//div[@class='product-info']/h2/a"));
for(WebElement e: element)
{
String str = e.getText();
System.out.println("The items are " + str);
}
HashSet<WebElement> value = new
List<WebElement> list = new ArrayList<WebElement>(element);
list.addAll(element);
System.out.println("arrangement" + list);
}
}
The easiest way to do this is to just grab the list of products, loop through them, and see if the current product name (a String) is "greater" than the last product name using String#compareToIgnoreCase().
I would write some functions for the common tasks you are likely to repeat for this page.
public static void sortBy(String sortValue)
{
new Select(driver.findElement(By.cssSelector("select[title='Sort By']"))).selectByVisibleText(sortValue);
}
public static List<String> getProductNames()
{
List<String> names = new ArrayList<>();
List<WebElement> products = driver.findElements(By.cssSelector("ul.products-grid h2.product-name"));
for (WebElement product : products)
{
names.add(product.getText());
}
return names;
}
public static boolean isListSorted(List<String> list)
{
String last = list.get(0);
for (int i = 1; i < list.size(); i++)
{
String current = list.get(i);
if (last.compareToIgnoreCase(current) > 0)
{
return false;
}
last = current;
}
return true;
}
NOTE: You should be using JUnit or TestNG for your assertions instead of writing your own because it makes it much, much easier (and you don't have to write and debug your own which saves time). The code I wrote below is using TestNG. You can see how much shorter (and simpler) the code below is when using a library like TestNG.
String url = "http://live.guru99.com/index.php";
driver.navigate().to(url);
Assert.assertEquals(driver.getTitle(), "Home page");
driver.findElement(By.xpath("//nav[@id='nav']//a[.='Mobile']")).click();
Assert.assertEquals(driver.getTitle(), "Mobile");
sortBy("Name");
System.out.println(getProductNames());
System.out.println(isListSorted(getProductNames()));
Where getProductNames() returns
[IPHONE, SAMSUNG GALAXY, SONY XPERIA]
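As an alternative to walking the list pairwise, the same check can be written by comparing the list with a sorted copy of itself. This is a sketch, not part of the answer's code: SortCheck and isSortedIgnoreCase are hypothetical names, and it uses String.CASE_INSENSITIVE_ORDER, which orders strings the same way compareToIgnoreCase does. It needs no special handling for an empty list.

```java
import java.util.ArrayList;
import java.util.List;

class SortCheck {
    // True when names is already in ascending order, ignoring case.
    static boolean isSortedIgnoreCase(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        // List.sort is stable, so entries that differ only in case keep
        // their relative order and do not cause a false mismatch.
        sorted.sort(String.CASE_INSENSITIVE_ORDER);
        return sorted.equals(names);
    }
}
```

With TestNG, this would be used as Assert.assertTrue(SortCheck.isSortedIgnoreCase(getProductNames())). The copy costs a little extra memory, which is unlikely to matter for a page of product names.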

Parsing XML with StAX with non-unique tag paths, design suggestions

I need to parse a large XML file (probably going to use StAX in Java) and output it into a delimited text file and I have a couple of design questions. First here is an example of the XML
<demographic>
<value>001</value>
<question>Name?</question>
<value>Bob</value>
<question>Last Name?</question>
<value>Smith</value>
<followUpQuestions>
<question>Middle Init.</question>
<value>J</value>
</followUpQuestions>
</demographic>
this would need to be outputted (in the delimited output file) as
001~Bob~Smith~J
so here are my questions:
How can I distinguish between all the different "value" tags, since the tag names are not unique? Currently I try to resolve this with 'state' variables that turn on once the parser passes question text such as "Name?". However, this approach doesn't really work for the first value, since I have to check that the 'name' and 'lastName' states are off to be sure I'm getting the first value.
Every time the client changes the text of the questions (which happens), I have to change the code and recompile it. Is there any way to avoid this? Maybe save the question text in a text file that the program reads in?
Can this be scalable? I need to extract over 100 values, and the XML files are usually about 2 GB in size.
Thank you, in advance, for your help (from a Java and XML newbie)!!
UPDATE: here is my attempt to code the solution, can someone please help to streamline? There has to be a less messy way to do this:
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.*;
class TestJavaForStackOverflow{
boolean nameState = false,
lastNameState = false,
middleInitState = false;
String name = "",
lastName = "",
middleInit = "",
value = "";
public void parse() throws IOException, XMLStreamException{
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader streamReader = factory.createXMLStreamReader(
new FileReader("/n04/data/revmgmt/anthony/scripts/Java_Programs/TestJavaForStackOverflow.xml"));
while(streamReader.hasNext()){
streamReader.next();
if(streamReader.getEventType() == XMLStreamReader.START_ELEMENT){
if("demographic".equals(streamReader.getLocalName())){
parseDemographicInformation(streamReader);
}
}
}
System.out.println(value + "~" + name + "~" + lastName + "~" + middleInit);
}
public void parseDemographicInformation(XMLStreamReader streamReader) throws XMLStreamException {
while(streamReader.hasNext()){
streamReader.next();
if(streamReader.getEventType() == XMLStreamReader.END_ELEMENT){
if("demographic".equals(streamReader.getLocalName())){
return;
}
}
else if(streamReader.getEventType() == XMLStreamReader.START_ELEMENT){
if("question".equals(streamReader.getLocalName())){
streamReader.next();
if("Name?".equals(streamReader.getText())){
nameState = true;
}
else if("Last Name?".equals(streamReader.getText())){
lastNameState = true;
}
else if("Middle Init.".equals(streamReader.getText())){
middleInitState = true;
}
}
else if("value".equals(streamReader.getLocalName())){
streamReader.next();
if(nameState){
name = streamReader.getText();
nameState = false;
}
else if (lastNameState){
lastName = streamReader.getText();
lastNameState = false;
}
else if (middleInitState){
middleInit = streamReader.getText();
middleInitState = false;
}
else {
value = streamReader.getText();
}
}
}
}
}
public static void main(String[] args){
TestJavaForStackOverflow t = new TestJavaForStackOverflow();
try{t.parse();}
catch(IOException e1){}
catch(XMLStreamException e2){}
}
}
I think the flags are not very scalable if you have a lot of different questions to parse, and neither are the global variables to hold the results... if you have 100 questions then you'll need 100 variables, and when they change over time it will be a bear to keep them up to date. I would use a map structure to hold the result, and another one to hold the correspondence between each question text and the corresponding field you are trying to capture (this is not actual Java, just an approximation):
public Map parseDemographicInformation(XmlStream xml, Map questionMap) {
Map record = new Map();
String field = "id";
while((elem = xml.getNextElement())) {
if(elem.tagName == "question") {
field = questionMap[elem.value];
} else if(elem.tagName == "value") {
record[field] = elem.value;
}
}
return record;
}
Then you have something like this to output the result:
String[] fieldsToOutput = { "id", "firstName", "lastName" }; // ideally read this from a file too so it can be changed dynamically
// ...
for(int i=0; i < fieldsToOutput.length; i++){
if(i > 0)
System.out.print("~");
System.out.print(record[fieldsToOutput[i]]);
}
System.out.println();
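The approximation above can be turned into real Java with StAX roughly as follows. This is a sketch under assumptions: the class name DemographicExport, the field names, and the choice to treat the first unlabelled value as "id" are illustrative, and the XML is taken from a String here only to keep the example short.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class DemographicExport {

    // Reads one <demographic> record; the reader must be on its start tag.
    static Map<String, String> parseRecord(XMLStreamReader reader, Map<String, String> questionToField)
            throws XMLStreamException {
        Map<String, String> record = new HashMap<>();
        String field = "id"; // assumption: the first <value> has no question and is the id
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.END_ELEMENT && "demographic".equals(reader.getLocalName())) {
                break;
            }
            if (event != XMLStreamConstants.START_ELEMENT) {
                continue;
            }
            if ("question".equals(reader.getLocalName())) {
                // Unknown question text maps to null, so its value is skipped.
                field = questionToField.get(reader.getElementText().trim());
            } else if ("value".equals(reader.getLocalName())) {
                String text = reader.getElementText().trim();
                if (field != null) {
                    record.put(field, text);
                }
                field = null; // a value is consumed by at most one question
            }
        }
        return record;
    }

    // Converts every <demographic> element into one "~"-delimited line.
    static List<String> toLines(String xml, Map<String, String> questionToField, List<String> fields) {
        List<String> lines = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "demographic".equals(reader.getLocalName())) {
                    Map<String, String> record = parseRecord(reader, questionToField);
                    List<String> cells = new ArrayList<>();
                    for (String f : fields) {
                        cells.add(record.getOrDefault(f, ""));
                    }
                    lines.add(String.join("~", cells));
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML", e);
        }
        return lines;
    }
}
```

Because the question-to-field map and the output field list are plain data, both could be loaded from a text or properties file, so a change in question wording would not require recompiling. For files around 2 GB, the reader would be created from a file stream and each line written out as soon as it is built instead of being collected in a list.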

Parse Company Info

I was wondering if anyone knows how to successfully parse the company name "Alcoa Inc." shown in the URL below. It would be much easier to show a picture but I do not have enough reputation. Any help would be appreciated.
http://www.google.com/finance?q=NYSE%3AAA&ei=LdwVUYC7Fp_YlgPBiAE
This is what I have tried so far using jsoup to parse the div class:
<div class="appbar-snippet-primary">
<span>Alcoa Inc.</span>
</div>
public Elements htmlParser(String url, String element, String elementType, String returnElement){
try {
Document doc = Jsoup.connect(url).get();
Document parse = Jsoup.parse(doc.html());
if (returnElement == null){
return parse.select(elementType + "." + element);
}
else {
return parse.select(elementType + "." + element + " " + returnElement);
}
}
public String htmlparseGoogleStocks(String url){
String pr = "pr";
String appbar_center = "appbar-snippet-primary";
String val = "val";
String span = "span";
String div = "div";
String td = "td";
Elements price_data;
Elements title_data;
Elements more_data;
price_data = htmlParser(url, pr, span, null);
title_data = htmlParser(url, appbar_center, div, span);
//more_data = htmlParser(url, val, td, null);
//String stockprice = price_data.text().toString();
String title = title_data.text().toString();
//System.out.println(more_data.text());
return title;
Myself, I'd analyze the page of interest's source HTML, and then just use JSoup to extract the information. For instance, using a very small JSoup program like so:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class GoogleFinance {
public static final String PAGE = "https://www.google.com/finance?q=NASDAQ:XONE";
public static void main(String[] args) throws IOException {
Document doc = Jsoup.connect(PAGE).get();
Elements title = doc.select("title");
System.out.println(title.text());
}
}
You get in return:
ExOne Co: NASDAQ:XONE quotes & news - Google Finance
It doesn't get much easier than that.

In Java, how do I parse an xml schema (xsd) to learn what's valid at a given element?

I'd like to be able to read in an XML schema (i.e. xsd) and from that know what are valid attributes, child elements, values as I walk through it.
For example, let's say I have an xsd that this xml will validate against:
<root>
<element-a type="something">
<element-b>blah</element-b>
<element-c>blahblah</element-c>
</element-a>
</root>
I've tinkered with several libraries and I can confidently get <root> as the root element. Beyond that I'm lost.
Given an element I need to know what child elements are required or allowed, attributes, facets, choices, etc. Using the above example I'd want to know that element-a has an attribute type and may have children element-b and element-c...or must have children element-b and element-c...or must have one of each...you get the picture I hope.
I've looked at numerous libraries such as XSOM, Eclipse XSD, Apache XmlSchema and found they're all short on good sample code. My search of the Internet has also been unsuccessful.
Does anyone know of a good example or even a book that demonstrates how to go through an XML schema and find out what would be valid options at a given point in a validated XML document?
clarification
I'm not looking to validate a document; rather, I'd like to know the options at a given point to assist in creating or editing a document. If I know "I am here" in a document, I'd like to determine what I can do at that point: "Insert one of element A, B, or C" or "attach attribute 'description'".
This is a good question. Although it is old, I did not find an acceptable answer. The existing libraries I am aware of (XSOM, Apache XmlSchema) are designed as object models. Their implementors did not intend to provide utility methods, so you should consider implementing them yourself using the provided object model.
Let's see how querying context-specific elements can be done by means of Apache XmlSchema.
You can use their tutorial as a starting point. In addition, the Apache CXF framework provides the XmlSchemaUtils class with lots of handy code examples.
First of all, read the XmlSchemaCollection as illustrated by the library's tutorial:
XmlSchemaCollection xmlSchemaCollection = new XmlSchemaCollection();
xmlSchemaCollection.read(inputSource, new ValidationEventHandler());
Now, XML Schema defines two kinds of data types:
Simple types
Complex types
Simple types are represented by the XmlSchemaSimpleType class. Handling them is easy. Read the documentation: https://ws.apache.org/commons/XmlSchema/apidocs/org/apache/ws/commons/schema/XmlSchemaSimpleType.html. But let's see how to handle complex types. Let's start with a simple method:
@Override
public List<QName> getChildElementNames(QName parentElementName) {
XmlSchemaElement element = xmlSchemaCollection.getElementByQName(parentElementName);
XmlSchemaType type = element != null ? element.getSchemaType() : null;
List<QName> result = new LinkedList<>();
if (type instanceof XmlSchemaComplexType) {
addElementNames(result, (XmlSchemaComplexType) type);
}
return result;
}
An XmlSchemaComplexType may represent either a real type or an extension element. See the public static QName getBaseType(XmlSchemaComplexType type) method of the XmlSchemaUtils class.
private void addElementNames(List<QName> result, XmlSchemaComplexType type) {
XmlSchemaComplexType baseType = getBaseType(type);
XmlSchemaParticle particle = baseType != null ? baseType.getParticle() : type.getParticle();
addElementNames(result, particle);
}
When you handle XmlSchemaParticle, consider that it can have multiple implementations. See: https://ws.apache.org/commons/XmlSchema/apidocs/org/apache/ws/commons/schema/XmlSchemaParticle.html
private void addElementNames(List<QName> result, XmlSchemaParticle particle) {
if (particle instanceof XmlSchemaAny) {
} else if (particle instanceof XmlSchemaElement) {
} else if (particle instanceof XmlSchemaGroupBase) {
} else if (particle instanceof XmlSchemaGroupRef) {
}
}
The other thing to bear in mind is that elements can be either abstract or concrete. Again, the JavaDocs are the best guidance.
Many of the solutions for validating XML in java use the JAXB API. There's an extensive tutorial available here. The basic recipe for doing what you're looking for with JAXB is as follows:
Obtain or create the XML schema to validate against.
Generate Java classes to bind the XML to using xjc, the JAXB compiler.
Write java code to:
Open the XML content as an input stream.
Create a JAXBContext and Unmarshaller
Pass the input stream to the Unmarshaller's unmarshal method.
The parts of the tutorial you can read for this are:
Hello, world
Unmarshalling XML
I see you have tried Eclipse XSD. Have you tried Eclipse Modeling Framework (EMF)? You can:
Generating an EMF Model using XML Schema (XSD)
Create a dynamic instance from your metamodel (3.1 With the dynamic instance creation tool)
This is for exploring the XSD. You can create a dynamic instance of the root element, then right-click the element and create a child element. There you will see what the possible child elements are, and so on.
As for saving the created EMF model to XSD-compliant XML: I would have to look it up. I think you can use JAXB for that (How to use EMF to read XML file?).
Some refs:
EMF: Eclipse Modeling Framework, 2nd Edition (written by creators)
Eclipse Modeling Framework (EMF)
Discover the Eclipse Modeling Framework (EMF) and Its Dynamic Capabilities
Creating Dynamic EMF Models From XSDs and Loading its Instances From XML as SDOs
This is a fairly complete sample on how to parse an XSD using XSOM:
import java.io.File;
import java.util.Iterator;
import java.util.Vector;
import org.xml.sax.ErrorHandler;
import com.sun.xml.xsom.XSComplexType;
import com.sun.xml.xsom.XSElementDecl;
import com.sun.xml.xsom.XSFacet;
import com.sun.xml.xsom.XSModelGroup;
import com.sun.xml.xsom.XSModelGroupDecl;
import com.sun.xml.xsom.XSParticle;
import com.sun.xml.xsom.XSRestrictionSimpleType;
import com.sun.xml.xsom.XSSchema;
import com.sun.xml.xsom.XSSchemaSet;
import com.sun.xml.xsom.XSSimpleType;
import com.sun.xml.xsom.XSTerm;
import com.sun.xml.xsom.impl.Const;
import com.sun.xml.xsom.parser.XSOMParser;
import com.sun.xml.xsom.util.DomAnnotationParserFactory;
public class XSOMNavigator
{
public static class SimpleTypeRestriction
{
public String[] enumeration = null;
public String maxValue = null;
public String minValue = null;
public String length = null;
public String maxLength = null;
public String minLength = null;
public String[] pattern = null;
public String totalDigits = null;
public String fractionDigits = null;
public String whiteSpace = null;
public String toString()
{
String enumValues = "";
if (enumeration != null)
{
for(String val : enumeration)
{
enumValues += val + ", ";
}
enumValues = enumValues.substring(0, enumValues.lastIndexOf(','));
}
String patternValues = "";
if (pattern != null)
{
for(String val : pattern)
{
patternValues += "(" + val + ")|";
}
patternValues = patternValues.substring(0, patternValues.lastIndexOf('|'));
}
String retval = "";
retval += minValue == null ? "" : "[MinValue = " + minValue + "]\t";
retval += maxValue == null ? "" : "[MaxValue = " + maxValue + "]\t";
retval += minLength == null ? "" : "[MinLength = " + minLength + "]\t";
retval += maxLength == null ? "" : "[MaxLength = " + maxLength + "]\t";
retval += pattern == null ? "" : "[Pattern(s) = " + patternValues + "]\t";
retval += totalDigits == null ? "" : "[TotalDigits = " + totalDigits + "]\t";
retval += fractionDigits == null ? "" : "[FractionDigits = " + fractionDigits + "]\t";
retval += whiteSpace == null ? "" : "[WhiteSpace = " + whiteSpace + "]\t";
retval += length == null ? "" : "[Length = " + length + "]\t";
retval += enumeration == null ? "" : "[Enumeration Values = " + enumValues + "]\t";
return retval;
}
}
private static void initRestrictions(XSSimpleType xsSimpleType, SimpleTypeRestriction simpleTypeRestriction)
{
XSRestrictionSimpleType restriction = xsSimpleType.asRestriction();
if (restriction != null)
{
Vector<String> enumeration = new Vector<String>();
Vector<String> pattern = new Vector<String>();
for (XSFacet facet : restriction.getDeclaredFacets())
{
if (facet.getName().equals(XSFacet.FACET_ENUMERATION))
{
enumeration.add(facet.getValue().value);
}
if (facet.getName().equals(XSFacet.FACET_MAXINCLUSIVE))
{
simpleTypeRestriction.maxValue = facet.getValue().value;
}
if (facet.getName().equals(XSFacet.FACET_MININCLUSIVE))
{
simpleTypeRestriction.minValue = facet.getValue().value;
}
if (facet.getName().equals(XSFacet.FACET_MAXEXCLUSIVE))
{
simpleTypeRestriction.maxValue = String.valueOf(Integer.parseInt(facet.getValue().value) - 1);
}
if (facet.getName().equals(XSFacet.FACET_MINEXCLUSIVE))
{
simpleTypeRestriction.minValue = String.valueOf(Integer.parseInt(facet.getValue().value) + 1);
}
if (facet.getName().equals(XSFacet.FACET_LENGTH))
{
simpleTypeRestriction.length = facet.getValue().value;
}
if (facet.getName().equals(XSFacet.FACET_MAXLENGTH))
{
simpleTypeRestriction.maxLength = facet.getValue().value;
}
if (facet.getName().equals(XSFacet.FACET_MINLENGTH))
{
simpleTypeRestriction.minLength = facet.getValue().value;
}
if (facet.getName().equals(XSFacet.FACET_PATTERN))
{
pattern.add(facet.getValue().value);
}
if (facet.getName().equals(XSFacet.FACET_TOTALDIGITS))
{
simpleTypeRestriction.totalDigits = facet.getValue().value;
}
if (facet.getName().equals(XSFacet.FACET_FRACTIONDIGITS))
{
simpleTypeRestriction.fractionDigits = facet.getValue().value;
}
if (facet.getName().equals(XSFacet.FACET_WHITESPACE))
{
simpleTypeRestriction.whiteSpace = facet.getValue().value;
}
}
if (enumeration.size() > 0)
{
simpleTypeRestriction.enumeration = enumeration.toArray(new String[] {});
}
if (pattern.size() > 0)
{
simpleTypeRestriction.pattern = pattern.toArray(new String[] {});
}
}
}
private static void printParticle(XSParticle particle, String occurs, String absPath, String indent)
{
boolean repeats = particle.isRepeated();
occurs = " MinOccurs = " + particle.getMinOccurs() + ", MaxOccurs = " + particle.getMaxOccurs() + ", Repeats = " + Boolean.toString(repeats);
XSTerm term = particle.getTerm();
if (term.isModelGroup())
{
printGroup(term.asModelGroup(), occurs, absPath, indent);
}
else if(term.isModelGroupDecl())
{
printGroupDecl(term.asModelGroupDecl(), occurs, absPath, indent);
}
else if (term.isElementDecl())
{
printElement(term.asElementDecl(), occurs, absPath, indent);
}
}
private static void printGroup(XSModelGroup modelGroup, String occurs, String absPath, String indent)
{
System.out.println(indent + "[Start of Group " + modelGroup.getCompositor() + occurs + "]" );
for (XSParticle particle : modelGroup.getChildren())
{
printParticle(particle, occurs, absPath, indent + "\t");
}
System.out.println(indent + "[End of Group " + modelGroup.getCompositor() + "]");
}
private static void printGroupDecl(XSModelGroupDecl modelGroupDecl, String occurs, String absPath, String indent)
{
System.out.println(indent + "[GroupDecl " + modelGroupDecl.getName() + occurs + "]");
printGroup(modelGroupDecl.getModelGroup(), occurs, absPath, indent);
}
private static void printComplexType(XSComplexType complexType, String occurs, String absPath, String indent)
{
System.out.println();
XSParticle particle = complexType.getContentType().asParticle();
if (particle != null)
{
printParticle(particle, occurs, absPath, indent);
}
}
private static void printSimpleType(XSSimpleType simpleType, String occurs, String absPath, String indent)
{
SimpleTypeRestriction restriction = new SimpleTypeRestriction();
initRestrictions(simpleType, restriction);
System.out.println(restriction.toString());
}
public static void printElement(XSElementDecl element, String occurs, String absPath, String indent)
{
absPath += "/" + element.getName();
String typeName = element.getType().getBaseType().getName();
if(element.getType().isSimpleType() && element.getType().asSimpleType().isPrimitive())
{
// We have a primitive type - So use that instead
typeName = element.getType().asSimpleType().getPrimitiveType().getName();
}
boolean nillable = element.isNillable();
System.out.print(indent + "[Element " + absPath + " " + occurs + "] of type [" + typeName + "]" + (nillable ? " [nillable] " : ""));
if (element.getType().isComplexType())
{
printComplexType(element.getType().asComplexType(), occurs, absPath, indent);
}
else
{
printSimpleType(element.getType().asSimpleType(), occurs, absPath, indent);
}
}
public static void printNameSpace(XSSchema s, String indent)
{
String nameSpace = s.getTargetNamespace();
// We do not want the default XSD namespaces or a namespace with nothing in it
if(nameSpace == null || Const.schemaNamespace.equals(nameSpace) || s.getElementDecls().isEmpty())
{
return;
}
System.out.println("Target namespace: " + nameSpace);
Iterator<XSElementDecl> jtr = s.iterateElementDecls();
while (jtr.hasNext())
{
XSElementDecl e = (XSElementDecl) jtr.next();
String occurs = "";
String absPath = "";
XSOMNavigator.printElement(e, occurs, absPath,indent);
System.out.println();
}
}
public static void xsomNavigate(File xsdFile)
{
ErrorHandler errorHandler = new ErrorReporter(System.err);
XSSchemaSet schemaSet = null;
XSOMParser parser = new XSOMParser();
try
{
parser.setErrorHandler(errorHandler);
parser.setAnnotationParser(new DomAnnotationParserFactory());
parser.parse(xsdFile);
schemaSet = parser.getResult();
}
catch (Exception exp)
{
exp.printStackTrace(System.out);
}
if(schemaSet != null)
{
// iterate each XSSchema object. XSSchema is a per-namespace schema.
Iterator<XSSchema> itr = schemaSet.iterateSchema();
while (itr.hasNext())
{
XSSchema s = (XSSchema) itr.next();
String indent = "";
printNameSpace(s, indent);
}
}
}
public static void printFile(String fileName)
{
File fileToParse = new File(fileName);
if (fileToParse != null && fileToParse.canRead())
{
xsomNavigate(fileToParse);
}
}
}
And for your Error Reporter use:
import java.io.OutputStream;
import java.io.PrintStream;
import java.text.MessageFormat;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
public class ErrorReporter implements ErrorHandler {
private final PrintStream out;
public ErrorReporter( PrintStream o ) { this.out = o; }
public ErrorReporter( OutputStream o ) { this(new PrintStream(o)); }
public void warning(SAXParseException e) throws SAXException {
print("[Warning]",e);
}
public void error(SAXParseException e) throws SAXException {
print("[Error ]",e);
}
public void fatalError(SAXParseException e) throws SAXException {
print("[Fatal ]",e);
}
private void print( String header, SAXParseException e ) {
out.println(header+' '+e.getMessage());
out.println(MessageFormat.format(" line {0} at {1}",
new Object[]{
Integer.toString(e.getLineNumber()),
e.getSystemId()}));
}
}
For your main use:
public class WDXSOMParser
{
public static void main(String[] args)
{
String fileName = null;
if(args != null && args.length > 0 && args[0] != null)
fileName = args[0];
else
fileName = "C:\\xml\\CollectionComments\\CollectionComment1.07.xsd";
//fileName = "C:\\xml\\PropertyListingContractSaleInfo\\PropertyListingContractSaleInfo.xsd";
//fileName = "C:\\xml\\PropertyPreservation\\PropertyPreservation.xsd";
XSOMNavigator.printFile(fileName);
}
}
It's a good bit of work, depending on how complex your XSD is, but basically:
if you had
<Document>
<Header/>
<Body/>
</Document>
and you wanted to find the allowable children of Header, then (taking account of namespaces)
XPath would have you look for something like '/element[@name="Document"]/element[@name="Header"]'
After that it depends on how much you want to do. You might find it easier to write or find something that loads an XSD into a DOM-type structure.
Of course, you are likely to find all sorts of things under that element in the XSD: choice, sequence, any, attributes, complexType, simpleContent, annotation.
Loads of time consuming fun.
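A rough sketch of the DOM-plus-XPath idea, using only the JDK's built-in XML APIs. It is deliberately limited and the limits are assumptions of this example: it only looks at elements declared with an inline anonymous complexType whose sequence, choice or all directly contains element declarations. Named types, group references, extensions and attributes are not handled. The names XsdChildLookup and allowedChildren are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XsdChildLookup {
    // Relative path from an element declaration to its directly declared children.
    // local-name() is used so no namespace prefix mapping is needed.
    private static final String CHILDREN =
            "*[local-name()='complexType']"
            + "/*[local-name()='sequence' or local-name()='choice' or local-name()='all']"
            + "/*[local-name()='element']";

    // Returns the names (or refs) of child elements declared inline under parentName.
    static List<String> allowedChildren(String xsd, String parentName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for local-name() to work
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xsd)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList decls = (NodeList) xpath.evaluate(
                    "//*[local-name()='element']", doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < decls.getLength(); i++) {
                Element decl = (Element) decls.item(i);
                if (!parentName.equals(decl.getAttribute("name"))) {
                    continue;
                }
                NodeList kids = (NodeList) xpath.evaluate(CHILDREN, decl, XPathConstants.NODESET);
                for (int j = 0; j < kids.getLength(); j++) {
                    Element kid = (Element) kids.item(j);
                    names.add(kid.hasAttribute("name") ? kid.getAttribute("name") : kid.getAttribute("ref"));
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read schema", e);
        }
    }
}
```

For anything beyond simple schemas, an object model such as XSOM or Apache XmlSchema is likely to be less work than extending these XPath expressions, since those libraries already resolve type references and groups.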
Have a look at this.
How to parse schema using XOM Parser.
Also, here is the project home for XOM
