Fixing malformed XML with Java

I have several files containing the following XML element:
<table cellpadding="0" cellspacing="0" border="0"style="width:100%">
The part that says border="0"style=" needs a space between the 0 value and style attribute.
Unfortunately there are too many files with this issue to make manually going and inserting the space a viable option.
I can edit attributes and their values by using an XPath expression that returns the table elements as a NodeList and then reading each node's attributes, but how would I add a space between an attribute value and the attribute that follows it?

You could always just use String.split("\""), i.e. split on the double quotes.
Here, try this:
/** In reality, you would probably read the file into a String,
* or read it line by line; either way it is an easy fix.
*/
String input = "<table cellpadding=\"0\" cellspacing=\"0\" border=\"0\"style=\"width:100%\">";
// StringUtils is assumed to be the Apache Commons Lang class
String xmlTag = StringUtils.substringBetween(input, "<", ">");
After the split, the array contains the following:
INDEX 0 contains the XML tag name followed by the first attribute name.
ODD INDICES ~ 1, 3, 5, and so on, contain an attribute value.
EVEN INDICES ~ 2, 4, 6, and so on, contain the next attribute name (ending in =).
String[] xmlCharValPairs = xmlTag.split("\"");
StringBuilder sb = new StringBuilder("<");
for (int i = 0; i < xmlCharValPairs.length; i++) {
    if (i % 2 == 0) {
        // attribute name: trim it and separate it from the previous value with one space
        if (i > 0) sb.append(' ');
        sb.append(xmlCharValPairs[i].trim());
    } else {
        // attribute value: put back the quotes that split() removed
        sb.append('"').append(xmlCharValPairs[i]).append('"');
    }
}
sb.append('>');
String returnXMLFormat = sb.toString();
This will leave you with an XML String in your requested format :)

If the attribute is always at the same position, all you need to write is a simple string parser that inserts an extra space at position X.
If the position varies, I would check each character: when it is a ", look at the character before it and see whether the pair is =" or (some letter or digit)", for example a".
width="100" vs width="100" anotherparam="...
This tells you whether the quote is the beginning or the end of a value. If it is the end, simply add a space after it.
Obviously you could also check whether the quote is followed by a letter or by a space, to know whether a space is already there.
width="100" param2="..." vs width="100"param2=""
If you have, let's say, 200 files to edit, you could use something similar to this:
File folder = new File("your/path");
File[] listOfFiles = folder.listFiles();
Then simply open the files in a loop, edit them and save them to new files with their original names, or just overwrite the current files. It's up to you.
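The loop described above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the class name BatchFileEditor and the method rewriteAll are hypothetical, the files are assumed to be UTF-8, and only files directly inside the folder are visited. The actual fix is passed in as a function, so any of the string-based repairs from this thread can be plugged in.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class BatchFileEditor {

    /**
     * Applies fix to every regular file directly inside dir and overwrites
     * the files whose content changed. Returns the number of rewritten files.
     * Assumption: all files are UTF-8 text.
     */
    public static int rewriteAll(Path dir, UnaryOperator<String> fix) throws IOException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(dir)) {
            files = listing.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        int changed = 0;
        for (Path file : files) {
            String original = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            String fixed = fix.apply(original);
            if (!fixed.equals(original)) {
                // overwrite in place; write to a different path to keep the originals
                Files.write(file, fixed.getBytes(StandardCharsets.UTF_8));
                changed++;
            }
        }
        return changed;
    }
}
```

A call such as BatchFileEditor.rewriteAll(Paths.get("your/path"), s -> s.replace("\"style=", "\" style=")) would then patch every file in the folder. Since the files are overwritten, it is probably wise to try it on a copy of the folder first.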

Your file isn't well-formed XML, so you will need a tool that can handle files that aren't well-formed XML. That rules out anything in the XSLT/XQuery/XPath family.
You can probably fix nearly all occurrences of the problem, with low risk of adverse side effects, by using a regular expression that inserts a space after any occurrence of " that isn't immediately preceded by =. (This will add some unnecessary spaces, but the XML parser will ignore them.)
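As a rough illustration of that regular expression in Java: the negative lookbehind (?<!=) matches a double quote only when the character before it is not =. The class and method names are hypothetical, and the sketch assumes that attribute values are always written as ="..." with no whitespace between = and the opening quote, and that no double quotes appear in text content.

```java
import java.util.regex.Pattern;

public class AttributeSpacer {

    // A double quote that is NOT immediately preceded by '=',
    // which under the stated assumptions is always a closing quote.
    private static final Pattern CLOSING_QUOTE = Pattern.compile("(?<!=)\"");

    /** Inserts a space after every closing quote. */
    public static String addSpaces(String markup) {
        return CLOSING_QUOTE.matcher(markup).replaceAll("\" ");
    }

    public static void main(String[] args) {
        String broken =
            "<table cellpadding=\"0\" cellspacing=\"0\" border=\"0\"style=\"width:100%\">";
        System.out.println(addSpaces(broken));
    }
}
```

Closing quotes that already had a space after them end up with two, and the last attribute gets a space before the closing >. Both are harmless to an XML parser, which is the trade-off the answer describes.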

Related

How can I check every comma-separated value in a .txt file separately?

I need to improve this code, which reads from a text file named file.txt:
sergy,many,mani,kserder
I would like to keep using this form:
file = login.getText();
if (file.equals("sergy"))
I just need something that reads every value in the text file separately, ignoring the "," sign (or whatever other separator is used).
You could split the string on the , character and check whether any of the array's elements is equal to the value you're looking for:
file = login.getText();
if (Arrays.asList(file.split(",")).contains("sergy")) {
    // do something...
}

How to solve the IllegalDataException in jdom2 library?

I am using JDOM 2.0.6 and received this IllegalDataException:
Error in setText for tokenization:
It fails when calling the setText() method.
Element text = new Element("Text");
text.setText(doc.getText());
It seems that it doesn't accept some of the characters in the text. Two examples:
Originally Posted by Yvette H( 45) Odd socks, yes, no undies yes, no coat yes, no shoes odd. 🏻
ParryOtter said: Posted
Should I specify an encoding somewhere, or is there some other reason?
In fact, you just have to wrap the text that contains illegal characters in CDATA:
Element text = new Element("Text");
text.setContent(new CDATA(doc.getText()));
The reverse operation (reading text escaped with CDATA) is transparent in JDOM2; you won't have to unescape it.
For my tests I added an illegal character at the end of my text, creating it from the hex value 0x2 like this:
String text = doc.getText();
int hex = 0x2;
text += (char) hex;
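A different technique from the CDATA wrapper above is to remove the offending characters before the text ever reaches setText(). The sketch below uses plain Java only and filters on the Char production of the XML 1.0 specification (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). The class name XmlCharFilter is hypothetical, and whether dropping characters is acceptable for the asker's data is an assumption.

```java
public class XmlCharFilter {

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns s without the code points that XML 1.0 does not allow. */
    public static String stripIllegal(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        // codePoints() keeps valid surrogate pairs (e.g. emoji) together and
        // reports an unpaired surrogate as a value in D800-DFFF, which is dropped.
        s.codePoints().filter(XmlCharFilter::isXmlChar).forEach(sb::appendCodePoint);
        return sb.toString();
    }
}
```

With this in place the call from the question would become text.setText(XmlCharFilter.stripIllegal(doc.getText())). The 0x2 test character from the answer is removed, while supplementary characters such as emoji are preserved.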

Java remove unwanted spaces in a text file and replace with character

I have the following text file. I want to remove the blank lines and spaces so that the text file has a clear delimiter to process. I cannot think of any way to remove the gaps between lines; is there a way?
Desired result:
Student+James Smith+Status: Current Student+Student+James Fits+Status: Not a current Student
Text file:
Student
James Smith
Status: Current Student
Student
James Fits
Status: Not a current Student
I know that this
a.replaceAll("\\s+","");
removes whitespace.
You could remove end-of-line characters in a similar fashion:
a.replaceAll("\n","");
where a is a String.
Use a regex: read the whole text into a String and
String txt = "whole String";
String formatted = txt.replaceAll("[^A-Za-z0-9]", "-");
This replaces every character that is not a letter or a digit (spaces, line breaks, the + sign, and also the colon) with a "-" sign, so now you have a specific delimiter.
Something like: find \s*\r?\n\s* and replace it with +
This trims the whitespace around each line break and adds the delimiter '+'.
Result:
Student+James Smith+Status: Current Student+Student+James Fits+Status: Not a current Student
Try using this one:
\n+\s*
Just use it like this:
yourStrVar.replaceAll("\n+ *", "+")
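Put together as a small Java helper, the find/replace suggested above might look like the sketch below. The class name LineJoiner is hypothetical. The leading trim() is an addition of this sketch, so that a trailing line break at the end of the file does not produce a trailing +.

```java
public class LineJoiner {

    /**
     * Collapses every run of line breaks (plus the whitespace around it)
     * into a single '+' delimiter. Spaces inside a line are left alone.
     */
    public static String joinLines(String text) {
        return text.trim().replaceAll("\\s*\\r?\\n\\s*", "+");
    }

    public static void main(String[] args) {
        String file = "Student\n\nJames Smith\n\nStatus: Current Student\n\n"
                    + "Student\n\nJames Fits\n\nStatus: Not a current Student\n";
        System.out.println(joinLines(file));
    }
}
```

The pattern only matches around a line break, so "James Smith" keeps its inner space while the blank lines between records disappear.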

Why is this regex not giving the expected output?

I have a string that contains the value given below. I want to replace the HTML img tags containing a specific customerId with some new text. I tried a small Java program, but it does not give me the expected output. Here are the details.
My input string is
String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
+ "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";
The regex is
String regex = "(?s)\\<img.*?customerId=3340.*?>";
The new text I want to put into the input string is
String newText = "<img src=\"getCustomerNew.do\">";
Now I am doing
String outputText = inputText.replaceAll(regex, newText);
The output is
Starting here.. Replacing Text ..Ending here
but my expected output is
Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p><p>someText</p>Replacing Text ..Ending here
Please note that in my expected output only the img tag containing customerId=3340 is replaced with Replacing Text. I don't understand why both img tags are replaced in the actual output.
You've got "wildcard"/"any" patterns (.*?) in there that can match any character, including >. The match therefore starts at the first <img, and the first .*? keeps expanding, past the end of the first tag, until customerId=3340 is found in the second tag; only then does the final > end the match.
You should be able to fix this by changing the .*? parts to something like [^>]* so that the match cannot span past a > character.
Parsing HTML with regular expressions is bound to cause pain.
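For illustration, the [^>] variant could look like the sketch below. The class and method names are hypothetical, and it assumes that no attribute value inside an img tag contains a > character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgTagReplacer {

    // [^>]* cannot cross a '>', so a match always stays inside one <img ...> tag.
    private static final Pattern TARGET_IMG =
            Pattern.compile("<img[^>]*customerId=3340[^>]*>");

    /** Replaces only the img tag(s) whose markup contains customerId=3340. */
    public static String replaceTarget(String html, String replacement) {
        // quoteReplacement keeps any '$' or '\' in the replacement literal
        return TARGET_IMG.matcher(html).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
                + "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";
        System.out.println(replaceTarget(inputText, "<img src=\"getCustomerNew.do\">"));
    }
}
```

Note that customerId=3340 would also match an id such as 33401; adding a negative lookahead like (?!\d) after the number is one way to tighten it.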
As other people have told you in the comments, HTML is not a regular language so using regex for manipulating it is usually painful. Your best option is to use an HTML parser. I haven't used Jsoup before, but googling a little bit it seems you need something like:
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
public class MyJsoupExample {
public static void main(String args[]) {
String inputText = "<html><head></head><body><p><img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123\"/></p>"
+ "<p>someText <img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456\"/></p></body></html>";
Document doc = Jsoup.parse(inputText);
Elements myImgs = doc.select("img[src*=customerId=3340]");
for (Element element : myImgs) {
element.replaceWith(new TextNode("my replaced text", ""));
}
System.out.println(doc.toString());
}
}
Basically the code gets the list of img nodes with a src attribute containing a given string
Elements myImgs = doc.select("img[src*=customerId=3340]");
then loop over the list and replace those nodes with some text.
UPDATE
If you don't want to replace the whole img node with text but instead you need to give a new value to its src attribute then you can replace the block of the for loop with:
element.attr("src", "my new value");
or if you want to change just a part of the src value then you can do:
String srcValue = element.attr("src");
element.attr("src", srcValue.replace("getCustomers.do", "getCustomerNew.do"));
which is very similar to what I posted in this thread.
What happens is that your regex starts matching at the first img tag, then consumes everything (regardless of whether the quantifier is greedy or not) until it finds customerId=3340, and then continues consuming until it finds >.
If you want it to consume just the img with customerId=3340, think about what distinguishes this tag from the other tags that it may match.
In this particular case, one possible solution is to look at what is behind that img tag using a look-behind operator (which doesn't consume a match). This regex will work:
String regex = "(?<=</p>)<img src=\".*?customerId=3340.*?>";

How to efficiently parse concatenated XML documents from a file

I have a file that consists of concatenated valid XML documents. I'd like to separate individual XML documents efficiently.
Contents of the concatenated file will look like this, thus the concatenated file is not itself a valid XML document.
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
Each individual XML document is around 1-4 KB, but there are potentially a few hundred of them. All XML documents conform to the same XML Schema.
Any suggestions or tools? I am working in the Java environment.
Edit: I am not sure if the xml-declaration will be present in documents or not.
Edit: Let's assume that the encoding for all the xml docs is UTF-8.
Don't split! Add one big tag around it! Then it becomes one XML file again:
<BIGTAG>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
</BIGTAG>
Now, using /BIGTAG/someData would give you all the XML roots.
If processing instructions are in the way, you can always use a RegEx to remove them. It's easier to just remove all processing instructions than to use a RegEx to find all root nodes.
If the declared encoding differs between documents, then remember this: the file as a whole must have been written in a single encoding, so all the XML documents it includes use that same encoding, no matter what each header is telling you. If the big file is encoded as UTF-16, then it doesn't matter if the XML processing instructions say the XML itself is UTF-8. It won't be UTF-8, since the whole file is UTF-16. The encoding in those XML processing instructions is therefore invalid.
By merging them into one file, you've altered the encoding...
By RegEx, I mean regular expressions. You just have to remove all text that's between a <? and a ?> which should not be too difficult with a regular expression and slightly more complicated if you're trying other string manipulation techniques.
As Eamon says, if you know the <?xml> thing will always be there, just break on that.
Failing that, look for the ending document-level tag. That is, scan the text counting how many levels deep you are. Every time you see a tag that begins with "<" but not "</" and that does not end with "/>", add 1 to the depth count. Every time you see a tag that begins "</", subtract 1. Every time you subtract 1, check if you are now at zero. If so, you've reached the end of an XML document.
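The depth-counting scan described above could be sketched like this. It is a simplified sketch under loud assumptions: the class name DepthSplitter is hypothetical, the input is already a String, no attribute value contains a > character, and comments or CDATA sections do not contain markup that looks like tags. Declarations and other <? or <! constructs are skipped and are not part of the returned documents.

```java
import java.util.ArrayList;
import java.util.List;

public class DepthSplitter {

    /** Splits concatenated XML into its root elements by tracking tag depth. */
    public static List<String> split(String text) {
        List<String> docs = new ArrayList<>();
        int depth = 0;
        int start = -1;   // index where the current root element begins
        int i = 0;
        while ((i = text.indexOf('<', i)) >= 0) {
            int close = text.indexOf('>', i);
            if (close < 0) break;                    // truncated tag: stop scanning
            String tag = text.substring(i, close + 1);
            if (tag.startsWith("<?") || tag.startsWith("<!")) {
                // declaration, processing instruction, comment, DOCTYPE: not counted
            } else if (tag.startsWith("</")) {
                depth--;
                if (depth == 0 && start >= 0) {      // back at level zero: document ends
                    docs.add(text.substring(start, close + 1));
                    start = -1;
                }
            } else {
                if (depth == 0) start = i;           // a new root element begins here
                if (tag.endsWith("/>")) {
                    if (depth == 0) {                // self-closing root element
                        docs.add(tag);
                        start = -1;
                    }
                } else {
                    depth++;
                }
            }
            i = close + 1;
        }
        return docs;
    }
}
```

Each returned string can then be handed to a regular XML parser on its own. For anything beyond well-behaved input, a real parser is the safer choice.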
Since you're not sure the declaration will always be present, you can strip all declarations (a regex such as <\?xml version.*\?> can find these), prepend <doc-collection>, append </doc-collection>, such that the resultant string will be a valid xml document. In it, you can retrieve the separate documents using (for instance) the XPath query /doc-collection/*. If the combined file can be large enough that memory consumption becomes an issue, you may need to use a streaming parser such as Sax, but the principle remains the same.
In a similar scenario which I encountered, I simply read the concatenated document directly using an xml-parser: Although the concatenated file may not be a valid xml document, it is a valid xml fragment (barring the repeated declarations) - so, once you strip the declarations, if your parser supports parsing fragments, then you can also just read the result directly. All top-level elements will then be the root elements of the concatenated documents.
In short, if you strip all declarations, you'll have a valid xml fragment which is trivially parseable either directly or by surrounding it with some tag.
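In Java, the strip-and-wrap approach might be sketched as follows with the JDK's built-in DOM parser. The class name ConcatenatedXmlReader is hypothetical, the regular expression used to remove the declarations is this sketch's own choice rather than the one quoted above, and the whole input is assumed to fit in memory as a String.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class ConcatenatedXmlReader {

    /** Returns the root element of every document in the concatenated input. */
    public static List<Element> roots(String concatenated) throws Exception {
        // 1. Strip every XML declaration (<?xml ...?>), wherever it occurs.
        String body = concatenated.replaceAll("<\\?xml\\s[^>]*\\?>", "");
        // 2. Wrap the remaining fragment in a single artificial root element.
        String wrapped = "<doc-collection>" + body + "</doc-collection>";
        // 3. Parse it as one document.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(wrapped)));
        // 4. Each element child of the wrapper is one original document root.
        List<Element> result = new ArrayList<>();
        for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                result.add((Element) n);
            }
        }
        return result;
    }
}
```

This is equivalent to the XPath query /doc-collection/* mentioned above. For very large files a streaming parser would be the better fit, as the answer notes.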
This is my answer for the C# version. very ugly code that works :-\
public List<T> ParseMultipleDocumentsByType<T>(string documents)
{
var cleanParsedDocuments = new List<T>();
var serializer = new XmlSerializer(typeof(T));
var flag = true;
while (flag)
{
if(documents.Contains(typeof(T).Name))
{
var startingPoint = documents.IndexOf("<?xml");
var endingString = "</" +typeof(T).Name + ">";
var endingPoing = documents.IndexOf(endingString) + endingString.Length;
var document = documents.Substring(startingPoint, endingPoing - startingPoint);
var singleDoc = (T)XmlDeserializeFromString(document, typeof(T));
cleanParsedDocuments.Add(singleDoc);
documents = documents.Remove(startingPoint, endingPoing - startingPoint);
}
else
{
flag = false;
}
}
return cleanParsedDocuments;
}
public static object XmlDeserializeFromString(string objectData, Type type)
{
var serializer = new XmlSerializer(type);
object result;
using (TextReader reader = new StringReader(objectData))
{
result = serializer.Deserialize(reader);
}
return result;
}
I don't have a Java answer, but here's how I solved this problem with C#.
I created a class named XmlFileStreams to scan the source document for the XML document declaration and break it up logically into multiple documents:
class XmlFileStreams {
List<int> positions = new List<int>();
byte[] bytes;
public XmlFileStreams(string filename) {
bytes = File.ReadAllBytes(filename);
for (int pos = 0; pos < bytes.Length - 5; ++pos)
if (bytes[pos] == '<' && bytes[pos + 1] == '?' && bytes[pos + 2] == 'x' && bytes[pos + 3] == 'm' && bytes[pos + 4] == 'l')
positions.Add(pos);
positions.Add(bytes.Length);
}
public IEnumerable<Stream> Streams {
get {
if (positions.Count > 1)
for (int i = 0; i < positions.Count - 1; ++i)
yield return new MemoryStream(bytes, positions[i], positions[i + 1] - positions[i]);
}
}
}
To use XmlFileStreams:
foreach (Stream stream in new XmlFileStreams(@"c:\tmp\test.xml").Streams) {
using (var xr = XmlReader.Create(stream, new XmlReaderSettings() { XmlResolver = null, ProhibitDtd = false })) {
// parse file using xr
}
}
There are a couple of caveats.
It reads the entire file into memory for processing. This could be a problem if the file is really big.
It uses a simple brute force search to look for the XML document boundaries.
