Implementing XML parser in Java for educational purposes

I want to validate whether an XML document (held in a String object) is well formed. Like this:
"<root> Hello StackOverflow! <a> Something here </a> Goodbye StackOverflow </root>"
It should also validate attributes, but I'm a long way from that right now. I just want to make sure I have the logic right. Here's what I've got so far, but I'm stuck and need some help.
public boolean isWellFormed( String str )
{
    boolean retorno = true;
    if ( str == null )
    {
        throw new NullPointerException();
    }
    else
    {
        this.chopTheElements( str );
        this.chopTags();
    }
    return retorno;
}
private void chopTags()
{
    for ( String element : this.elements )
    {
        this.tags.add( element.substring( 1, element.length()-1 ) );
    }
}
public void chopTheElements( String str )
{
    for ( int i = 0; i < str.length(); i++ )
    {
        if ( str.charAt( i ) == '<' )
        {
            elements.add( getNextToken( str.substring( i ) ) );
        }
    }
}
private String getNextToken( String str )
{
    String retStr = "";
    if ( str.indexOf( ">" ) != -1 )
    {
        retStr = str.substring( 0, str.indexOf( ">" ) + 1 );
    }
    return retStr;
}
So far I have chopped the elements, like "<root>", into one list, and then the tags into another, like this: root, /root.
But I don't know how to proceed or whether I'm going in the right direction. I have been assigned to solve this without regex.
Any advice? I'm lost here. Thanks.

Starting by breaking the string when you see a "<" is not the way to go about it, because the chunks you identify will be unrelated to the hierarchic structure of the XML. For example, if you have as input:
<a>xxx<b>...</b>yyy</a>
then one of your chunks will be "/b>yyy", which isn't a useful thing to break up further.
You need to structure your code according to the structure of the grammar. If the grammar says that an element consists of a start tag then a sequence of (elements or characters) then an end tag, then you need a method that matches that sequence, and calls other methods to process its components. Because the grammar is recursive, your code will be recursive, so this is known as recursive descent parsing. It's something that is often taught in computer science courses so you'll find excellent coverage of the topic in textbooks.
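A minimal sketch of that idea, assuming a deliberately tiny grammar (element := start tag, then any mix of text and elements, then a matching end tag) with no attributes, comments, or self-closing tags. The class name WellFormedChecker and its helper methods are made up for illustration, and no regex is used:

```java
final class WellFormedChecker {
    private final String input;
    private int pos;

    private WellFormedChecker(String input) {
        this.input = input;
    }

    /** Returns true if xml is exactly one well-formed element of the simplified grammar. */
    static boolean isWellFormed(String xml) {
        if (xml == null) {
            throw new NullPointerException("xml");
        }
        WellFormedChecker parser = new WellFormedChecker(xml.trim());
        return parser.element() && parser.pos == parser.input.length();
    }

    // element := '<' name '>' content '</' name '>'
    private boolean element() {
        if (!consume("<")) {
            return false;
        }
        String name = name();
        if (name.isEmpty() || !consume(">")) {
            return false;
        }
        if (!content()) {
            return false;
        }
        // The end tag must carry the same name as the start tag.
        return consume("</") && name.equals(name()) && consume(">");
    }

    // content := (text | element)*
    private boolean content() {
        while (pos < input.length()) {
            if (input.startsWith("</", pos)) {
                return true; // the caller matches the end tag
            }
            if (input.charAt(pos) == '<') {
                if (!element()) { // recursion mirrors the recursive grammar
                    return false;
                }
            } else {
                pos++; // plain character data
            }
        }
        return false; // input ended before the end tag
    }

    // Reads a run of letters and digits as a tag name (may be empty).
    private String name() {
        int start = pos;
        while (pos < input.length() && Character.isLetterOrDigit(input.charAt(pos))) {
            pos++;
        }
        return input.substring(start, pos);
    }

    private boolean consume(String token) {
        if (input.startsWith(token, pos)) {
            pos += token.length();
            return true;
        }
        return false;
    }
}
```

Each grammar rule gets its own method, so supporting attributes later would mean adding one more method called from element() rather than reworking the scan.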

If you're not dealing with a huge XML file, consider a DOM parser. I would suggest looking at the DocumentBuilder class: call one of its parse() methods (the source can be a File, an InputStream, or any other InputSource).
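A short sketch of that suggestion for a String source, assuming it is acceptable to treat any SAXException from parse() as "not well formed". The class name DomWellFormedCheck is hypothetical; the JAXP calls are the standard ones:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class DomWellFormedCheck {
    /** Returns true if the DOM parser accepts the string, false if it reports a parse error. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Wrap the String in a Reader so it can be handed over as an InputSource.
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The default error handler may also print the parse error to standard error; that does not affect the returned value.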

Related

Making a string of postorder data of a given tree

As part of my assignment, I am given an expression tree and I need to convert it to an infix expression in O(n) run time.
For example,
the tree in question should be converted to "( ( 1 V ( 2 H ( 3 V 4 ) ) ) H ( 5 V 6 ) )".
I couldn't think of a way to convert it straight to infix, so I thought of first converting it to postfix and then to infix. (If there's a better way, please tell me.)
Now, my problem is with converting the tree to postorder.
I have tried the following:
private String treeToPost(Node node, String post) {
    if (node != null) {
        treeToPost(node.left, post);
        treeToPost(node.right, post);
        post = post + node.data.getName();
    }
    return post;
}
Now I have two problems with this method. The first is that it doesn't work, because it only saves the last node it visited. The second is that I'm not sure it will run in O(n), because it has to create a new string each time.
I found a solution to this issue here, but it used StringBuilder, which I am not allowed to use. I thought of using an array of chars to store the data, but because I don't know the size of the tree, I can't know the size of the array needed.
Thank you for your time :)

Going directly to infix is probably easier: just always add parentheses.
Secondly, doing something like this will save all nodes:
private String treeToPost(Node node) {
    String returnString = "";
    if (node != null) {
        returnString += treeToPost(node.left);
        returnString += treeToPost(node.right);
        returnString += node.data.getName();
    }
    return returnString;
}
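For infix, this should work (leaves are printed bare, every operator node is wrapped in parentheses):
private String treeToInfix(Node node) {
    if (node == null) {
        return "";
    }
    // A leaf is an operand: print it as-is so "1" does not become "( 1 )".
    if (node.left == null && node.right == null) {
        return node.data.getName();
    }
    String returnString = "( " + treeToInfix(node.left);
    returnString += " " + node.data.getName() + " ";
    returnString += treeToInfix(node.right) + " )";
    return returnString;
}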
Both of these create new String objects at every step, so I think they are technically O(n^2), because the string grows each time; no professor of mine would deduct points for that, though.
However, if you want to avoid this behaviour and can't use StringBuilder, you can use a CharArrayWriter, which is a buffer that grows dynamically. You can then write two methods: one that appends to the buffer and returns nothing, and one that returns a String. The String-returning method calls the buffer-appending one.
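The two-method arrangement with a CharArrayWriter could look roughly like this. The Node class here is a stand-in with a plain name field (the question's node.data.getName() is not reproduced), and the spacing follows the sample output given in the question:

```java
import java.io.CharArrayWriter;

final class InfixPrinter {
    /** Hypothetical minimal tree node; the assignment's real Node class may differ. */
    static final class Node {
        final String name;
        final Node left;
        final Node right;

        Node(String name, Node left, Node right) {
            this.name = name;
            this.left = left;
            this.right = right;
        }
    }

    /** Public entry point: owns the buffer and returns the finished String. */
    static String toInfix(Node root) {
        CharArrayWriter buffer = new CharArrayWriter();
        appendInfix(root, buffer);
        return buffer.toString();
    }

    /** Appends the infix form of node to the shared buffer; returns nothing. */
    private static void appendInfix(Node node, CharArrayWriter buffer) {
        if (node == null) {
            return;
        }
        if (node.left == null && node.right == null) {
            buffer.append(node.name); // operand: no parentheses
            return;
        }
        buffer.append("( ");
        appendInfix(node.left, buffer);
        buffer.append(' ').append(node.name).append(' ');
        appendInfix(node.right, buffer);
        buffer.append(" )");
    }
}
```

Every node writes its own text into the buffer exactly once, so the work is linear in the size of the output apart from the buffer's occasional internal growth.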

Regex to print out all the matches inside ? and , or )

I have a collection of strings stored in a 2D array.
Each string has the shape of a Horn clause, and a complete string can take the form patient(?x) as well as hasdoctor(?x,?y).
If I set ?x=alex and ?y=john, the strings above become
patient(alex)
hasdoctor(alex, john)
Now the question is: when I use the code below, it finds the ?x, but in hasdoctor(?x,?y) it skips the ?y.
void find_var(String[][] temp)
{
    System.out.println(temp.length);
    System.out.println(temp[0].length);
    for (int i = 0; i < temp.length; i++)
        for (int j = 1; j < temp[0].length - 1; j++)
        {
            String text_to_parse = temp[i][j];
            Pattern y = Pattern.compile("[?]\\w[,)]");
            Matcher z = y.matcher(text_to_parse);
            if (z.find())
            {
                System.out.println("Found at::" + temp[i][j]);
                System.out.println(z.group());
            }
            else
            {
                System.out.println("Not found at::" + temp[i][j]);
            }
        }
}
The pseudocode for what I want in Java is:
if ([?]\\w[,)] is found in array[][])
    if ([?]\\w[,)] is already in other_array[])
        then skip;
    else
        save [?]\\w[,)] to other_array[]

Can't say that I completely understand what you're trying to achieve, but I think the problem is that you're using
if (z.find()) { /* ... */ }
instead of
while (z.find()) { /* ... */ }
With if, only the first match in each string is reported; the rest of the string is never searched, so a second variable such as ?y is missed.
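Putting the while loop together with the "skip if already saved" pseudocode from the question might look like the sketch below. The class and method names are hypothetical, a LinkedHashSet stands in for other_array[], and the pattern is a slight variation of the question's (it allows multi-character names and uses a lookahead so the trailing , or ) is not part of the match):

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class VariableFinder {
    // "?" followed by word characters, only when a ',' or ')' comes next.
    private static final Pattern VARIABLE = Pattern.compile("[?]\\w+(?=[,)])");

    /** Collects each distinct variable once, in order of first appearance. */
    static Set<String> findVariables(String[][] clauses) {
        Set<String> variables = new LinkedHashSet<>();
        for (String[] row : clauses) {
            for (String clause : row) {
                if (clause == null) {
                    continue;
                }
                Matcher matcher = VARIABLE.matcher(clause);
                while (matcher.find()) { // keeps going until the string is exhausted
                    variables.add(matcher.group()); // a Set ignores duplicates
                }
            }
        }
        return variables;
    }
}
```

Unlike the question's loop, this sketch visits every cell of the array; restrict the inner loop again if the first and last columns really should be skipped.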

regex expression for nested structures

I have the following string :
bla {{bla {{bla bla {{afsaasg}} }} blabla}} {{bla bla}} bla
I would like to match
{{bla {{bla bla {{afsaasg}} }} blabla}}
with a regex.
but my regex
{{(.*?)}}
matches
{{bla {{bla bla}}
Can anyone help?
Additional info: I do not expect more than 2 brackets at the same time.
In the end I solved this with my own Java function. Perhaps this will help someone:
public static ArrayList<String> getRecursivePattern(String sText, String sBegin, String sEnd) {
    ArrayList<String> alReturn = new ArrayList<String>();
    boolean ok1 = true;
    boolean ok2 = true;
    int iStartCount = 0;
    int iEndCount = 0;
    int iStartSearching = 0;
    while (ok1) {
        int iAnfang = sText.indexOf(sBegin, iStartSearching);
        ok2 = true;
        if (iAnfang > -1) {
            while (ok2) {
                int iStartCharacter = sText.indexOf(sBegin, iStartSearching);
                int iEndCharacter = sText.indexOf(sEnd, iStartSearching);
                if (iEndCharacter == -1) {
                    // Nothing found . stop
                    ok2 = false;
                    ok1 = false;
                } else if (iStartCharacter < iEndCharacter && iStartCharacter != -1) {
                    // found startpattern
                    iStartCount = iStartCount + 1;
                    iStartSearching = iStartCharacter + sBegin.length();
                } else if (iStartCharacter > iEndCharacter && iEndCharacter != -1 || (iStartCharacter == -1 && iEndCharacter != -1)) {
                    iEndCount = iEndCount + 1;
                    iStartSearching = iEndCharacter + sEnd.length();
                } else {
                    if (iStartCharacter < 0) {
                        // No End found . stop
                        ok2 = false;
                    }
                }
                if (iEndCount == iStartCount) {
                    // found the pattern
                    ok2 = false;
                    // cut
                    int iEnde = iStartSearching;// +sEnd.length();
                    String sReturn = sText.substring(iAnfang, iEnde);
                    alReturn.add(sReturn);
                }
            }
        } else {
            ok1 = false;
        }
    }
    return alReturn;
}
I call it:
ArrayList<String> alTest=getRecursivePattern("This {{ is a {{Test}} bla }}","{{","}}");
System.out.println(" sTest : " + alTest.get(0));

.NET has special support for nested item matching, so {{(?>[^\{\}]+|\{(?<DEPTH>)|\}(?<-DEPTH>))*(?(DEPTH)(?!))}} would do what you wanted in C# to any level of nesting, but not Java.

Don't you need to escape the curly braces? I do in notepad++. Anyway, this should do it
\{\{[^{]+\{\{[^{}]+\}\}[^}]+\}\}

You can't do this with regular expressions. It is a consequence of the pumping lemma. You need to use context-free grammars, or perhaps dedicated tools (like XML/DOM/... parsers).
You can indeed parse this for, say, three levels deep, but you can't make it work for an arbitrary number of levels. Even then, it's better to use context-free grammars (like an LALR compiler compiler), simply because "these are the tools designed to parse such structures".
In other words, if one day someone enters {{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{ bla }}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}}, and this is supposed to be valid, the regex will most likely fail.
One side note:
If the nesting is at most i levels deep, you can use a regex like:
for 1: .*?(.*?\{\{.*?\}\}.*?)*.*?
for 2: .*?(.*?\{\{.*?(.*?\{\{.*?\}\}.*?)*.*?\}\}.*?)*.*?
...
But as you can see, the deeper you go, the longer the regex gets, and there is no way to parse arbitrary depth.
See also this discussion for people who want to parse XML/HTML (another recursive language) with regexes.
As you noted, some regular expression toolkits do provide ways to count things. These can be found in the P-languages (PHP, Perl, ...). Strictly speaking they aren't regular expressions (as defined by Kleene, see this Wikipedia article about what a real regex is) but simplified parsers, because they don't describe a regular language. Such features are currently not available in most regex libraries, including Java's. Some libraries even provide Turing-complete parsers, which can parse anything that can be parsed algorithmically, but using them for advanced tasks is not really recommended...
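In Java, the counting that a regular expression cannot do takes only a few lines of ordinary code. The sketch below is one possible way to do it (the class and method names are invented here): it keeps a depth counter while scanning and returns every top-level {{...}} group, whatever the nesting depth.

```java
import java.util.ArrayList;
import java.util.List;

final class NestedBraces {
    /** Returns each outermost {{...}} group, including any groups nested inside it. */
    static List<String> topLevelGroups(String text) {
        List<String> groups = new ArrayList<>();
        int depth = 0;   // number of currently open "{{"
        int start = -1;  // index where the current outermost group began
        int i = 0;
        while (i < text.length()) {
            if (text.startsWith("{{", i)) {
                if (depth == 0) {
                    start = i;
                }
                depth++;
                i += 2;
            } else if (depth > 0 && text.startsWith("}}", i)) {
                depth--;
                i += 2;
                if (depth == 0) {
                    groups.add(text.substring(start, i));
                }
            } else {
                i++; // ordinary character, or a stray "}}" outside any group
            }
        }
        return groups; // an unclosed group at the end of the text is dropped
    }
}
```

For the string in the question this yields two groups: the nested one the asker wanted, followed by {{bla bla}}.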

java replace HTML_Escapecodes

I need to develop a new method that replaces all umlauts (ä, ö, ü) in an input string with the corresponding HTML escape codes, and it has to be fast. According to statistics, only 5% of all input strings contain umlauts. As the method is expected to be used extensively, any unnecessary instantiation should be avoided.
Could someone show me a way to do it?

These are the HTML escape codes. Additionally, HTML features arbitrary escaping with numeric codes of the format &#<decimal>; and equivalently &#x<hexadecimal>;
A simple string-replace is not going to be efficient with so many strings to replace. I suggest you split the string by entity matches, such as this:
// The -1 limit keeps trailing empty parts, so an entity at the very end of the string is still handled.
String[] parts = str.split("&([A-Za-z]+|#[0-9]+|#x[A-Fa-f0-9]+);", -1);
if (parts.length <= 1) return str; // No matched entities.
Then you can re-build the string with the replaced parts inserted.
StringBuilder result = new StringBuilder(str.length());
result.append(parts[0]); // First part always exists.
int pos = parts[0].length() + 1; // Skip past the first part and the ampersand.
for (int i = 1; i < parts.length; i++) {
    String entityName = str.substring(pos, str.indexOf(';', pos));
    if (entityName.matches("#x[A-Fa-f0-9]+") && entityName.length() <= 6) {
        // Hexadecimal character reference: drop the "#x" prefix and parse base 16.
        result.append((char) Integer.parseInt(entityName.substring(2), 16));
    } else if (entityName.matches("#[0-9]+")) {
        // Decimal character reference: drop the "#" prefix.
        result.append((char) Integer.parseInt(entityName.substring(1)));
    } else {
        switch (entityName) {
            case "euml": result.append('ë'); break;
            case "auml": result.append('ä'); break;
            ...
            default: result.append("&" + entityName + ";"); // Unknown entity. Give the original string.
        }
    }
    result.append(parts[i]); // Append the text after the entity.
    pos += entityName.length() + parts[i].length() + 2; // Skip past the entity name, the semicolon, the following part and the next ampersand.
}
return result.toString();
Rather than copy-pasting this code, type it in your own project by hand. This gives you the opportunity to look at how the code actually works. I didn't run this code myself, so I can't guarantee it being correct. It can also be made slightly more efficient by pre-compiling the regular expressions.
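The code above turns entities back into characters. For the direction the question asks about (umlauts to entities), a sketch that allocates nothing for the roughly 95% of inputs without umlauts could look like this. The class name UmlautEscaper is made up, and the set of handled characters (ä, ö, ü and their upper-case forms) is an assumption:

```java
final class UmlautEscaper {
    private UmlautEscaper() {
    }

    /** Returns the HTML entity for c, or null if c needs no escaping. */
    private static String entityFor(char c) {
        switch (c) {
            case '\u00e4': return "&auml;"; // a with diaeresis
            case '\u00f6': return "&ouml;"; // o with diaeresis
            case '\u00fc': return "&uuml;"; // u with diaeresis
            case '\u00c4': return "&Auml;"; // A with diaeresis
            case '\u00d6': return "&Ouml;"; // O with diaeresis
            case '\u00dc': return "&Uuml;"; // U with diaeresis
            default: return null;
        }
    }

    /** Escapes umlauts; returns the very same String object when there is nothing to escape. */
    static String escape(String s) {
        StringBuilder out = null; // created lazily, on the first umlaut only
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            String entity = entityFor(c);
            if (entity != null) {
                if (out == null) {
                    out = new StringBuilder(s.length() + 16);
                    out.append(s, 0, i); // copy the untouched prefix in one go
                }
                out.append(entity);
            } else if (out != null) {
                out.append(c);
            }
        }
        return out == null ? s : out.toString();
    }
}
```

The common case is a single pass over the characters with no object creation; the StringBuilder only appears once an umlaut has actually been seen.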

Multiple arguments for !somearray.contains

Is it possible to have multiple arguments for a .contains? I am searching an array to ensure that each string contains one of several characters. I've hunted all over the web, but found nothing useful.
for (String s : fileContents) {
    if (!s.contains(syntax1) && !s.contains(syntax2)) {
        found.add(s);
    }
}
for (String s : found) {
    System.out.println(s); // print array to cmd
    JOptionPane.showMessageDialog(null, "Note: Syntax errors found.");
}
How can I do this with multiple arguments? I've also tried a bunch of ||s on their own, but that doesn't seem to work either.

No, it can't have multiple arguments, but the || should work.
!s.contains(syntax1+"") || !s.contains(syntax2+"") means s doesn't contain syntax1 or it doesn't contain syntax2.
This is just a guess but you might want s contains either of the two:
s.contains(syntax1+"") || s.contains(syntax2+"")
or maybe s contains both:
s.contains(syntax1+"") && s.contains(syntax2+"")
or maybe s contains neither of the two:
!s.contains(syntax1+"") && !s.contains(syntax2+"")
If syntax1 and syntax2 are already strings, you don't need the +""'s.
I believe s.contains("") should always return true, so you can remove it.

It seems that what you described can be done with a regular expression.
In a regular expression, the operator | means that one of several alternatives must match.
For example, the regex (a|b) means a or b.
The regex ".*(a|b).*" matches a string that contains a or b, whatever else it contains (it assumes a single-line string, but that can be dealt with easily as well if needed).
Code example:
String s = "abc";
System.out.println(s.matches(".*(a|d).*"));
s = "abcd";
System.out.println(s.matches(".*(a|d).*"));
s = "fgh";
System.out.println(s.matches(".*(a|d).*"));
Regular expressions are a powerful tool that I recommend learning. Have a look at this tutorial; you might find it helpful.

There is no such thing as a multiple-argument contains.
If you need to check that every string in a list is contained in some other string, you must iterate through them all and check each one.
public static boolean containsAll(String input, String... items) {
    if (input == null) throw new IllegalArgumentException("Input must not be null"); // We validate the input
    if (input.length() == 0) {
        return items.length == 0; // if empty contains nothing then true, else false
    }
    boolean result = true;
    for (String item : items) {
        result = result && input.contains(item);
    }
    return result;
}
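The question itself is closer to "contains at least one of" than "contains all of". A companion helper in the same varargs style might look like the sketch below (the names ContainsChecks and containsAny are made up for illustration):

```java
final class ContainsChecks {
    private ContainsChecks() {
    }

    /** Returns true as soon as input contains any one of the given items. */
    static boolean containsAny(String input, String... items) {
        if (input == null) {
            throw new IllegalArgumentException("Input must not be null");
        }
        for (String item : items) {
            if (input.contains(item)) {
                return true; // no need to look at the remaining items
            }
        }
        return false; // also the result when no items are given
    }
}
```

With this helper, the filter in the question could be written as if (!ContainsChecks.containsAny(s, syntax1, syntax2)) found.add(s);, assuming syntax1 and syntax2 are Strings.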
