I am in the process of adding some html based formatting features to a JTextPane. The idea is that the user would select text in the JTextPane, then click a button (bold, italic etc) to insert the html tags at the appropriate locations. I can do this without difficulty using the JTextPane.getSelectionStart() and .getSelectionEnd() methods.
My problem is that I also want to scan each character in the JTextPane to index all the html tag locations - this is so the software can detect where the JTextPane caret is in relation to the html tags. This information is then used if the user wants to remove the formatting tags.
I am having difficulty synchronising this character index with the caret position in the JTextPane. Here is the code I have been using:
public void scanHTML(){
try {
boolean blnDocStartFlag = false;
alTagRecords = new ArrayList(25);
alTextOnlyIndex = new ArrayList();
String strTagBuild = "";
int intTagIndex = 0; // The index for a tag pair record in alTagRecords.
int intTextOnlyCount = 0; // Counts each text character, ignoring all html tags.
// Loop through HTMLDoc character array:
for (int i = 0; i <= strHTMLDoc.length() -1; i ++){
// Look for the "<" angle bracket enclosing the tag keyword ...
if (strHTMLDoc.charAt(i) == '<'){// It is a html tag ...
int intTagStartLocation = i; // this value will go into alTagFields(?,0) later ...
while (strHTMLDoc.charAt(i) != '>'){
strTagBuild += strHTMLDoc.charAt(i);
i ++; // continue incrementing the iterator whilst in this sub loop ...
}
strTagBuild += '>'; // makes sure the closing tag is not missed from the string
if (!strTagBuild.startsWith("</")){
// Create new tag record:
ArrayList<Integer> alTagFields = new ArrayList(3);
alTagFields.add(0, intTagStartLocation); // Tag start location index ...
alTagFields.add(1, -1); // Tag end not known at this stage ...
alTagFields.add(2, getTagType(strTagBuild));
alTagRecords.add(intTagIndex, alTagFields); // Tag Type
System.out.println("Tag: " + strTagBuild);
intTagIndex ++; // Increment the tag records index ...
} else { // find corresponding start tag and store its location in the appropriate field of alTagFields:
int intManipulatedTagIndex = getMyOpeningTag(getTagType(strTagBuild));
ArrayList<Integer> alManipulateTagFields = alTagRecords.get(intManipulatedTagIndex);
alManipulateTagFields.set(1, (intTagStartLocation + strTagBuild.length() -1) ); // store the position of the end angled bracket of the closing tag ...
alTagRecords.set(intManipulatedTagIndex, alManipulateTagFields);
System.out.println("Tag: " + strTagBuild);
}
strTagBuild = "";
} else {
// Create the text index:
if (blnDocStartFlag == false){
int intAscii = (int) strHTMLDoc.charAt(i);
if (intAscii >= 33){ // Ascii character 33 is an exclamation mark(!). It is the first character after a space.
blnDocStartFlag = true;
}
}
// Has the first non space text character has been reached? ...
if (blnDocStartFlag == true){ // Index the character if it has ...
alTextOnlyIndex.add(i);
intTextOnlyCount ++;
}
}
}
} catch (Exception ex){
System.err.println("Error at HTMLTagIndexer.scanHTML: " + ex);
}
}
The problem with the code above is that the string variable strHTMLDoc is obtained using JTextPane.getText, and this appears to have inserted some extra space characters within the string. Consequently this has put it out of sync with the corresponding caret position in the text pane.
Can anybody suggest an alternative way to do what I am trying to achieve?
Many thanks
Related
I have a .docx template with placeholders to be filled, such as ${programming_language}, ${education}, etc.
The placeholder keywords must be easily distinguished from the other plain words, hence they are enclosed with ${ }.
for (XWPFTable table : doc.getTables()) {
for (XWPFTableRow row : table.getRows()) {
for (XWPFTableCell cell : row.getTableCells()) {
for (XWPFParagraph paragraph : cell.getParagraphs()) {
for (XWPFRun run : paragraph.getRuns()) {
System.out.println("run text: " + run.text());
/** replace text here, etc. */
}
}
}
}
}
I want to extract the placeholders together with the enclosing ${ } characters. The problem is that the enclosing characters seem to be treated as separate runs:
run text: ${
run text: programming_language
run text: }
run text: Some plain text here
run text: ${
run text: education
run text: }
Instead, I would like to achieve the following effect:
run text: ${programming_language}
run text: Some plain text here
run text: ${education}
I have tried using other enclosing characters, such as: { }, < >, # #, etc.
I do not want to do some weird concatenations of runs, etc. I want to have it in a single XWPFRun.
If I cannot find a proper solution, I think I will just name them like this instead: VAR_PROGRAMMING_LANGUAGE, VAR_EDUCATION.
Apache POI (4.1.2 at the time of writing) provides TextSegment to deal with these Word text-run issues. XWPFParagraph.searchText searches for a string in a paragraph and returns a TextSegment. This provides access to the begin run and the end run of that text in that paragraph (BeginRun and EndRun). It also provides access to the start character position in the begin run and the end character position in the end run (BeginChar and EndChar).
It additionally provides access to the index of the text element in the text run (BeginText and EndText). This always should be 0, because default text runs only have one text element.
Having this, we can do the following:
Replace the found partial string in begin run by the replacement. To do so, get the text part which was before the searched string and concatenate the replacement to it. After that the begin run fully contains the replacement.
Delete all text runs between the begin run and the end run, as they contain parts of the searched string that are no longer needed.
Keep only the text part after the searched string in the end run.
This way we can replace text that is spread across multiple text runs.
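The three steps can be tried out without POI. Below is a minimal sketch that models each run as a plain String in a List. That is an assumption made purely for illustration, since real XWPFRun objects also carry formatting. The names RunReplaceSketch and replaceAcrossRuns are made up for this sketch and are not part of POI.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RunReplaceSketch {

    /**
     * Replaces the first occurrence of textToFind in the concatenated run texts,
     * editing the list of run texts in place. Returns true if something was replaced.
     */
    static boolean replaceAcrossRuns(List<String> runs, String textToFind, String replacement) {
        if (textToFind.isEmpty()) {
            return false;
        }
        String whole = String.join("", runs);
        int start = whole.indexOf(textToFind);
        if (start < 0) {
            return false;
        }
        int end = start + textToFind.length() - 1; // inclusive index of the last matched char

        // Locate the begin run, the end run and the character offsets inside them.
        int beginRun = -1, beginChar = -1, endRun = -1, endChar = -1;
        int offset = 0;
        for (int i = 0; i < runs.size(); i++) {
            int len = runs.get(i).length();
            if (beginRun < 0 && start < offset + len) {
                beginRun = i;
                beginChar = start - offset;
            }
            if (end < offset + len) {
                endRun = i;
                endChar = end - offset;
                break;
            }
            offset += len;
        }

        String textBefore = runs.get(beginRun).substring(0, beginChar);
        String textAfter = runs.get(endRun).substring(endChar + 1);

        if (beginRun == endRun) {
            // Everything is in one run: before + replacement + after.
            runs.set(beginRun, textBefore + replacement + textAfter);
        } else {
            runs.set(beginRun, textBefore + replacement); // begin run gets the replacement
            runs.set(endRun, textAfter);                  // end run keeps only the text after
            for (int i = endRun - 1; i > beginRun; i--) { // runs in between are removed
                runs.remove(i);
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> runs = new ArrayList<>(
                Arrays.asList("Skill: ${", "programming_language", "} and more"));
        replaceAcrossRuns(runs, "${programming_language}", "Java");
        System.out.println(runs);
    }
}
```

With real documents the same logic runs against the XWPFRun objects of a paragraph.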
Following example shows this.
import java.io.*;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.*;
public class WordReplaceTextSegment {
static public void replaceTextSegment(XWPFParagraph paragraph, String textToFind, String replacement) {
TextSegment foundTextSegment = null;
PositionInParagraph startPos = new PositionInParagraph(0, 0, 0);
while((foundTextSegment = paragraph.searchText(textToFind, startPos)) != null) { // search all text segments having text to find
System.out.println(foundTextSegment.getBeginRun()+":"+foundTextSegment.getBeginText()+":"+foundTextSegment.getBeginChar());
System.out.println(foundTextSegment.getEndRun()+":"+foundTextSegment.getEndText()+":"+foundTextSegment.getEndChar());
// maybe there is text before textToFind in begin run
XWPFRun beginRun = paragraph.getRuns().get(foundTextSegment.getBeginRun());
String textInBeginRun = beginRun.getText(foundTextSegment.getBeginText());
String textBefore = textInBeginRun.substring(0, foundTextSegment.getBeginChar()); // we only need the text before
// maybe there is text after textToFind in end run
XWPFRun endRun = paragraph.getRuns().get(foundTextSegment.getEndRun());
String textInEndRun = endRun.getText(foundTextSegment.getEndText());
String textAfter = textInEndRun.substring(foundTextSegment.getEndChar() + 1); // we only need the text after
if (foundTextSegment.getEndRun() == foundTextSegment.getBeginRun()) {
textInBeginRun = textBefore + replacement + textAfter; // if we have only one run, we need the text before, then the replacement, then the text after in that run
} else {
textInBeginRun = textBefore + replacement; // else we need the text before followed by the replacement in begin run
endRun.setText(textAfter, foundTextSegment.getEndText()); // and the text after in end run
}
beginRun.setText(textInBeginRun, foundTextSegment.getBeginText());
// runs between begin run and end run needs to be removed
for (int runBetween = foundTextSegment.getEndRun() - 1; runBetween > foundTextSegment.getBeginRun(); runBetween--) {
paragraph.removeRun(runBetween); // remove not needed runs
}
}
}
public static void main(String[] args) throws Exception {
XWPFDocument doc = new XWPFDocument(new FileInputStream("source.docx"));
String textToFind = "${This is the text to find}"; // might be in different runs
String replacement = "Replacement text";
for (XWPFParagraph paragraph : doc.getParagraphs()) { //go through all paragraphs
if (paragraph.getText().contains(textToFind)) { // paragraph contains text to find
replaceTextSegment(paragraph, textToFind, replacement);
}
}
FileOutputStream out = new FileOutputStream("result.docx");
doc.write(out);
out.close();
doc.close();
}
}
The code above does not work in all cases because XWPFParagraph.searchText has bugs. So here is a better searchText method:
/**
 * Parses the paragraph and searches for the given string.
 * If the string is found, returns a TextSegment describing where it begins and ends;
 * otherwise returns null.
 *
 * @param paragraph the paragraph to search in
 * @param searched  the string to search for
 * @param startPos  the position at which to start searching
 */
static TextSegment searchText(XWPFParagraph paragraph, String searched, PositionInParagraph startPos) {
int startRun = startPos.getRun(),
startText = startPos.getText(),
startChar = startPos.getChar();
int beginRunPos = 0, candCharPos = 0;
boolean newList = false;
//CTR[] rArray = paragraph.getRArray(); //This does not contain all runs. It lacks hyperlink runs for ex.
java.util.List<XWPFRun> runs = paragraph.getRuns();
int beginTextPos = 0, beginCharPos = 0; //must be outside the for loop
//for (int runPos = startRun; runPos < rArray.length; runPos++) {
for (int runPos = startRun; runPos < runs.size(); runPos++) {
//int beginTextPos = 0, beginCharPos = 0, textPos = 0, charPos; //int beginTextPos = 0, beginCharPos = 0 must be outside the for loop
int textPos = 0, charPos;
//CTR ctRun = rArray[runPos];
CTR ctRun = runs.get(runPos).getCTR();
XmlCursor c = ctRun.newCursor();
c.selectPath("./*");
try {
while (c.toNextSelection()) {
XmlObject o = c.getObject();
if (o instanceof CTText) {
if (textPos >= startText) {
String candidate = ((CTText) o).getStringValue();
if (runPos == startRun) {
charPos = startChar;
} else {
charPos = 0;
}
for (; charPos < candidate.length(); charPos++) {
if ((candidate.charAt(charPos) == searched.charAt(0)) && (candCharPos == 0)) {
beginTextPos = textPos;
beginCharPos = charPos;
beginRunPos = runPos;
newList = true;
}
if (candidate.charAt(charPos) == searched.charAt(candCharPos)) {
if (candCharPos + 1 < searched.length()) {
candCharPos++;
} else if (newList) {
TextSegment segment = new TextSegment();
segment.setBeginRun(beginRunPos);
segment.setBeginText(beginTextPos);
segment.setBeginChar(beginCharPos);
segment.setEndRun(runPos);
segment.setEndText(textPos);
segment.setEndChar(charPos);
return segment;
}
} else {
candCharPos = 0;
}
}
}
textPos++;
} else if (o instanceof CTProofErr) {
c.removeXml();
} else if (o instanceof CTRPr) {
//do nothing
} else {
candCharPos = 0;
}
}
} finally {
c.dispose();
}
}
return null;
}
This will be called like:
...
while((foundTextSegment = searchText(paragraph, textToFind, startPos)) != null) {
...
As someone commented on your question, you cannot control where or when Word splits a paragraph into runs. If the other answer didn't help you, here is how I got around it:
First of all, this "solution" has a big problem, but I will still post it here in case someone can solve that.
public void mainMethod(XWPFParagraph paragraph) {
    if (paragraph.getRuns().size() > 1) {
        String myRun = unifyRuns(paragraph.getRuns());
        // check myRun for placeholders ${...} here
        // write to text position 0, because setText(String) adds a new text element to the run
        paragraph.getRuns().get(0).setText(myRun, 0);
        while (paragraph.getRuns().size() > 1) {
            paragraph.removeRun(1);
        }
    }
}
private String unifyRuns(List<XWPFRun> runElements) {
    StringBuilder unifiedRun = new StringBuilder();
    for (XWPFRun run : runElements) {
        unifiedRun.append(run.text()); // append the run's text explicitly
    }
    return unifiedRun.toString();
}
The code may contain some errors, since I'm writing it from memory.
The problem here is that Word does not split paragraphs into runs for no reason: when parts of the text have different formatting (such as font family or font size), Word puts them in different runs.
In the text "Here's my bold text", Word will split the text to separate the bold and normal parts. So the code above is a bad solution if you are using POI to create large documents with mixed formatting. In that case you would first need to check whether a run is actually bold (or otherwise formatted differently) before merging it, and only then handle the placeholders.
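To make that check concrete, here is a minimal sketch that merges only neighbouring runs with the same formatting. It does not use POI: Run is a hypothetical stand-in with just a text and a bold flag, and mergeSameFormat is a name invented for this sketch. A real implementation would have to compare every formatting property that matters (font, size, colour and so on).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MergeRunsSketch {

    /** Hypothetical stand-in for a Word run: some text plus one formatting flag. */
    static final class Run {
        final String text;
        final boolean bold;

        Run(String text, boolean bold) {
            this.text = text;
            this.bold = bold;
        }
    }

    /** Merges adjacent runs that share the same bold flag; differently formatted runs stay apart. */
    static List<Run> mergeSameFormat(List<Run> runs) {
        List<Run> merged = new ArrayList<>();
        for (Run run : runs) {
            int last = merged.size() - 1;
            if (last >= 0 && merged.get(last).bold == run.bold) {
                // Same formatting as the previous run: glue the texts together.
                merged.set(last, new Run(merged.get(last).text + run.text, run.bold));
            } else {
                merged.add(run);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
                new Run("${", false), new Run("education", false),
                new Run("}", false), new Run(" bold part", true));
        for (Run run : mergeSameFormat(runs)) {
            System.out.println((run.bold ? "[bold] " : "[normal] ") + run.text);
        }
    }
}
```

A placeholder that itself spans a formatting change would still end up in two runs with this approach.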
Again, this is a "solution" that I found, and it's not complete yet. Sorry for any errors in my English; I'm using Google Translate to write this answer.
The Android app reads paragraphs and some of their properties from an MS Word document with the Aspose.Words for Android library. It gets the paragraph text, the style name and the style separator value. Some words in a paragraph have a hyperlink. How can I get the start and end boundaries of the hyperlinked words? For example:
This is an inline hyperlink paragraph example where the start bound is 18 and the end bound is 27.
public static ArrayList<String[]> GetBookLinesByTag(String file) {
ArrayList<String[]> bookLines = new ArrayList<>();
try {
Document doc = new Document(file);
ParagraphCollection paras = doc.getFirstSection().getBody().getParagraphs();
for(int i = 0; i < paras.getCount(); i++){
String styleName = paras.get(i).getParagraphFormat().getStyleName().trim();
String isStyleSeparator = Integer.toString(paras.get(i).getBreakIsStyleSeparator() ? 1 : 0);
String content = paras.get(i).toString(SaveFormat.TEXT).trim();
bookLines.add(new String[]{content, styleName, isStyleSeparator});
}
} catch (Exception e){}
return bookLines;
}
Edit:
Thanks, Alexey Noskov; solved with your help.
public static ArrayList<String[]> GetBookLinesByTag(String file) {
ArrayList<String[]> bookLines = new ArrayList<>();
try {
Document doc = new Document(file);
ParagraphCollection paras = doc.getFirstSection().getBody().getParagraphs();
for(int i = 0; i < paras.getCount(); i++){
String styleName = paras.get(i).getParagraphFormat().getStyleName().trim();
String isStyleSeparator = Integer.toString(paras.get(i).getBreakIsStyleSeparator() ? 1 : 0);
String content = paras.get(i).toString(SaveFormat.TEXT).trim();
for (Field field : paras.get(i).getRange().getFields()) {
if (field.getType() == FieldType.FIELD_HYPERLINK) {
FieldHyperlink hyperlink = (FieldHyperlink) field;
String urlId = hyperlink.getSubAddress();
String urlText = hyperlink.getResult();
// Reformat linked text: urlText:urlId
content = urlText + ":" + urlId;
}
}
bookLines.add(new String[]{content, styleName, isStyleSeparator});
}
} catch (Exception e){}
return bookLines;
}
Hyperlinks in MS Word documents are represented as fields. If you press Alt+F9 in MS Word you will see something like this
{ HYPERLINK "https://aspose.com" }
Follow the link to learn more about fields in Aspose.Words document model and in MS Word.
https://docs.aspose.com/display/wordsjava/Introduction+to+Fields
In your case you need to locate the position of FieldStart, which is the start position. Then measure the length of the content between FieldSeparator and FieldEnd; the start position plus that length is the end position.
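The arithmetic can be sketched without the library. The code below does not use the Aspose.Words API: Kind, Node and firstFieldBounds are hypothetical names that model a paragraph as a flat list of text pieces and field markers. Text between the field start and the separator is treated as the field code and is not counted as visible text. Nested fields are ignored for simplicity.

```java
import java.util.Arrays;
import java.util.List;

class FieldBoundsSketch {

    enum Kind { TEXT, FIELD_START, FIELD_SEPARATOR, FIELD_END }

    /** Hypothetical stand-in for a paragraph child node. */
    static final class Node {
        final Kind kind;
        final String text;

        Node(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    /** Returns {start, end} of the first field's visible result, or null if there is no field. */
    static int[] firstFieldBounds(List<Node> nodes) {
        int visible = 0;        // visible characters counted so far
        int start = -1;         // visible offset at which the field starts
        boolean inCode = false; // true between FIELD_START and FIELD_SEPARATOR
        for (Node node : nodes) {
            switch (node.kind) {
                case FIELD_START:
                    start = visible;
                    inCode = true;
                    break;
                case FIELD_SEPARATOR:
                    inCode = false;
                    break;
                case FIELD_END:
                    // start plus the length of the result text counted since the separator
                    return new int[] { start, visible };
                case TEXT:
                    if (!inCode) {
                        visible += node.text.length();
                    }
                    break;
            }
        }
        return null;
    }

    /** The example sentence from the question, with the word "hyperlink" as the field result. */
    static List<Node> exampleParagraph() {
        return Arrays.asList(
                new Node(Kind.TEXT, "This is an inline "),
                new Node(Kind.FIELD_START, ""),
                new Node(Kind.TEXT, " HYPERLINK \"https://aspose.com\" "),
                new Node(Kind.FIELD_SEPARATOR, ""),
                new Node(Kind.TEXT, "hyperlink"),
                new Node(Kind.FIELD_END, ""),
                new Node(Kind.TEXT, " paragraph example"));
    }

    public static void main(String[] args) {
        int[] bounds = firstFieldBounds(exampleParagraph());
        System.out.println(bounds[0] + ".." + bounds[1]); // prints 18..27
    }
}
```

In a real document the same walk would be done over the child nodes of the paragraph, using the node types the library reports.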
Disclosure: I work at Aspose.Words team.
I need to remove the text rise property (set with setTextRise) from a Text element if the rise, positive or negative, moves the text outside the page area.
PdfDocument pdfDoc = new PdfDocument(pdfWriter);
Document doc = new Document(pdfDoc, PageSize.A5);
doc.setMargins(0,0,0,36);
for (int i = 0; i <50 ; i++) {
Text t = new Text("hello " + i);
if(i ==0){
t.setTextRise(7);
}
if(i==31){
t.setTextRise(-35);
}
Paragraph p = new Paragraph(t);
p.setNextRenderer(new ParagraphRen(p,doc));
p.setFixedLeading(fixedLeading);
doc.add(p);
}
doc.close();
}
class ParagraphRen extends ParagraphRenderer{
private float heightDoc;
private float marginTop;
private float marginBot;
public ParagraphRen(Paragraph modelElement, Document doc) {
super(modelElement);
this.heightDoc =doc.getPdfDocument().getDefaultPageSize().getHeight();
this.marginTop = doc.getTopMargin();
this.marginBot = doc.getBottomMargin();
}
@Override
public void drawChildren(DrawContext drawContext) {
super.drawChildren(drawContext);
Rectangle rect = this.getOccupiedAreaBBox();
List<IRenderer> childRenderers = this.getChildRenderers();
//check first line
if(rect.getTop()<=heightDoc- marginTop) {
for (IRenderer iRenderer : childRenderers) {
if (iRenderer.getModelElement().hasProperty(72)) {
Object property = iRenderer.getModelElement().getProperty(72);
float v = (Float) property + rect.getTop();
//check text more AreaPage
if(v >heightDoc){
iRenderer.getModelElement().deleteOwnProperty(72);
}
}
}
}
//check last line
if(rect.getBottom()-marginBot-rect.getHeight()*2<0){
for (IRenderer iRenderer : childRenderers) {
if (iRenderer.getModelElement().hasProperty(72)) {
Object property = iRenderer.getModelElement().getProperty(72);
//if setRise(-..) more margin bottom setRise remove
if(rect.getBottom()-marginBot-rect.getHeight()+(Float) property<0)
iRenderer.getModelElement().deleteOwnProperty(72);
}
}
}
}
}
Here I check whether the text rise in the first lines goes beyond the page area; if so, I remove the text rise property.
And if setTextRise(-35) in the last lines goes below the bottom margin, I remove it as well.
But it doesn't work. The properties are not removed.
Your problem is as follows: drawChildren method gets called after rendering has been done. At this stage iText usually doesn't consider properties of any elements: it just places the element in its occupied area, which has been calculated before, at layout() stage.
You can overcome it with layout emulation.
Let's add all your paragraphs to a div rather than directly to the document. Then emulate adding this div to the document:
LayoutResult result = div.createRendererSubTree().setParent(doc.getRenderer()).layout(new LayoutContext(new LayoutArea(0, PageSize.A5)));
In the snippet above I've tried to lay out our div on an A5-sized document.
Now you can consider the result of layout and change some elements, which will be then processed for real with Document#add. For example, to get the 30th layouted paragraph one can use:
((DivRenderer)result.getSplitRenderer()).getChildRenderers().get(30);
Some more tips:
The split renderer represents the part of the content which iText can place on the area; the overflow renderer represents the content which overflows.
I entered text in an EditText while designing a form-filling app. Now, if I select part of that text and want to delete or modify it, I can't figure out how to do it. All the options I found show how to clear the entire text box. How do I clear just the selected text?
EditText inputText=(EditText)findViewById(R.id.edit);
String inputString=inputText.getText().toString();
//To get the selected String
int selectionStart=inputText.getSelectionStart();
int selectionEnd=inputText.getSelectionEnd();
String selectedText = inputString.substring(selectionStart, selectionEnd);
if(!selectedText.isEmpty())
{
//Modify the selected StringHere
String modifiedString="...your modification logic here...";
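//If you wish to delete the selected text (use the offsets; String.replace would also change other occurrences of the same text)
String selectionDeletedString=inputString.substring(0, selectionStart) + inputString.substring(selectionEnd);
inputText.setText(selectionDeletedString);
//Or, if you wish to modify the selected text instead
String selectionModifiedString=inputString.substring(0, selectionStart) + modifiedString + inputString.substring(selectionEnd);
inputText.setText(selectionModifiedString);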
}
You need to extract the string with
String fromEditText = editText.getText().toString();
Now you can do whatever you want with the string and then put it back like
editText.setText(myString);
For string operations, search for working with strings and chars in Java.
Try this..
public void modifyText(View view ) {
if (view instanceof EditText) {
EditText editText = (EditText) view; // View itself has no getText()/setText()
if(editText.getText().toString().equals("Selected text")){
editText.setText("Your Text");
}
}
if (view instanceof ViewGroup) {
for (int i = 0; i < ((ViewGroup) view).getChildCount(); i++) {
View innerView = ((ViewGroup) view).getChildAt(i);
modifyText(innerView);
}
}
}
call this in your activity modifyText(findViewById(R.id.rootView));
This will modify all EditText in the current activity
I found this to be the best solution.
String contents = editText.getText().toString(), newText;
newText = contents.substring(0, editText.getSelectionStart()) +
contents.substring(editText.getSelectionEnd(), contents.length());
editText.setText(newText);
Edit Selection
An easy (and maybe the best) way of modifying the selected text.
int start = descriptionBox.getSelectionStart();
int end = descriptionBox.getSelectionEnd();
String modifiedText = "*" + descriptionBox.getText().subSequence(start, end) + "*";
descriptionBox.getText().replace(start, end, modifiedText);
Input Text ( '|' indicate selection )
hello |world|
Output Text
hello *world*
Delete Selection
All you have to do is replace the range between the start and end of the selection with an empty String.
int start = descriptionBox.getSelectionStart();
int end = descriptionBox.getSelectionEnd();
descriptionBox.getText().replace(start, end, "");
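One thing to watch for: getSelectionStart() returns -1 when there is no selection or cursor, and the start can be greater than the end when the selection was made backwards. Below is a minimal plain-Java sketch of the same range replacement with those cases handled. It uses StringBuilder in place of Android's Editable so it runs anywhere; SelectionEditSketch and replaceSelection are names invented for this sketch.

```java
class SelectionEditSketch {

    /**
     * Returns the text with the selected range replaced.
     * Pass "" as the replacement to delete the selection.
     */
    static String replaceSelection(CharSequence text, int selStart, int selEnd, String replacement) {
        if (selStart < 0 || selEnd < 0) {
            return text.toString(); // no selection or cursor: leave the text alone
        }
        // Normalise the range so that start <= end and both stay inside the text.
        int start = Math.min(Math.min(selStart, selEnd), text.length());
        int end = Math.min(Math.max(selStart, selEnd), text.length());
        return new StringBuilder(text).replace(start, end, replacement).toString();
    }

    public static void main(String[] args) {
        String text = "hello world";
        System.out.println(replaceSelection(text, 6, 11, "*world*")); // modify the selection
        System.out.println(replaceSelection(text, 11, 6, ""));        // delete a backwards selection
    }
}
```

On Android the normalised start and end can be passed straight to getText().replace(start, end, ...).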
I've got a method which returns a Map from an XML file. I've converted that map into separate Lists of keys and values.
However, I'm noticing there are newline characters in the values list. How can I strip out the newlines and replace them with a space or leave them blank?
Code:
@Test
public void testGetXMLModelData() throws Exception {
File f = new File("xmlDir/example.xml");
Model m = getXMLModelData(f);
logger.debug("Models Keys: "+m.getInputs());
logger.debug("Models Values: "+m.getValues());
}
public Model getXMLModelData(File f) throws Exception {
Model model = new Model();
Map<String,String> map = p(f);
List<String> listKeys = new ArrayList<String>(map.keySet());
List<String> listValues = new ArrayList<String>(map.values());
model.setInputs(listKeys);
model.setValues(listValues);
return model;
}
public Map<String, String> p(File file) throws Exception {
Map<String, String> map = new HashMap<String,String>();
XMLStreamReader xr = XMLInputFactory.newInstance().createXMLStreamReader(new FileInputStream(file));
while(xr.hasNext()) {
int e = xr.next();
if (e == XMLStreamReader.START_ELEMENT) {
String name = xr.getLocalName();
xr.next();
String value = null;
try {
value = xr.getText();
} catch (IllegalStateException exep) {
exep.printStackTrace();
}
map.put(name, value);
}
}
return map;
}
Output:
2015-08-19 20:13:52,327 : Models Keys: [IRS1095A, MonthlyPlanPremiumAmtPP, WagesSalariesAndTipsAmt, MonthlyAdvancedPTCAmtPP, MonthCdPP, ReturnData, IndividualReturnFilingStatusCd, PrimaryResidentStatesInfoGrpPP, MonthlyPTCInformationGrpPP, IRS1040, ResidentStateInfoPP, SelfSelectPINGrp, MonthlyPremiumSLCSPAmtPP, Filer, ResidentStateAbbreviationCdPP, PrimaryBirthDt, Return, ReturnHeader, TotalExemptionsCnt, AdjustedGrossIncomeAmt, PrimarySSN]
2015-08-19 20:13:52,328 : Models Values: [
, 136, 22000, 125, SEPTEMBER,
, 1,
,
,
,
,
, 250,
, CA, 1970-01-01,
,
, 1, 22000, 555-11-2222]
Any help or assistance would be much appreciated. Thanks in advance
Edit:
XML file
<Return xmlns="http://www.irs.gov/efile">
<ReturnData>
<IRS1095A uuid="a77f40a2-af31-4404-a27d-4c1eaad730c2">
<MonthlyPTCInformationGrpPP uuid="69dc9dd5-5415-4ee4-a199-19b2dbb701be">
<MonthlyPlanPremiumAmtPP>136</MonthlyPlanPremiumAmtPP>
<MonthlyAdvancedPTCAmtPP>125</MonthlyAdvancedPTCAmtPP>
<MonthCdPP>SEPTEMBER</MonthCdPP>
<MonthlyPremiumSLCSPAmtPP>250</MonthlyPremiumSLCSPAmtPP>
</MonthlyPTCInformationGrpPP>
</IRS1095A>
<IRS1040>
<IndividualReturnFilingStatusCd>1</IndividualReturnFilingStatusCd>
<WagesSalariesAndTipsAmt>22000</WagesSalariesAndTipsAmt>
<TotalExemptionsCnt>1</TotalExemptionsCnt>
<AdjustedGrossIncomeAmt>22000</AdjustedGrossIncomeAmt>
</IRS1040>
</ReturnData>
<ReturnHeader>
<SelfSelectPINGrp>
<PrimaryBirthDt>1970-01-01</PrimaryBirthDt>
</SelfSelectPINGrp>
<Filer>
<PrimarySSN>555-11-2222</PrimarySSN>
<PrimaryResidentStatesInfoGrpPP>
<ResidentStateInfoPP uuid="a77f40a2-af31-4404-a27d-4c1eaad730c2">
<ResidentStateAbbreviationCdPP>CA</ResidentStateAbbreviationCdPP>
</ResidentStateInfoPP>
</PrimaryResidentStatesInfoGrpPP>
</Filer>
</ReturnHeader>
</Return>
Set value = xr.getText().trim(). That will trim extraneous characters from the beginning and end of the values.
To then prevent adding the value, wrap the map.put(name, value) with an if (value != null && !value.isEmpty())
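As a rough sketch of that advice, the two helpers below trim a value, collapse any inner newlines or repeated spaces into a single space, and skip values that end up empty. The names WhitespaceSketch, normalize and putIfMeaningful are hypothetical, and the null check covers the case where getText() threw and the value was never set.

```java
import java.util.HashMap;
import java.util.Map;

class WhitespaceSketch {

    /** Trims the value and collapses inner runs of whitespace (including newlines) to one space. */
    static String normalize(String raw) {
        if (raw == null) {
            return "";
        }
        return raw.trim().replaceAll("\\s+", " ");
    }

    /** Stores the pair only when the normalized value is not empty. */
    static void putIfMeaningful(Map<String, String> map, String name, String raw) {
        String value = normalize(raw);
        if (!value.isEmpty()) {
            map.put(name, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        putIfMeaningful(map, "ReturnData", "\n    "); // layout whitespace only: skipped
        putIfMeaningful(map, "MonthCdPP", "SEPTEMBER");
        System.out.println(map);
    }
}
```

In the parser from the question, the map.put(name, value) call would be replaced by putIfMeaningful(map, name, value).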
Your code is extracting the element name and the text immediately following the start element, ignoring any text following an end element.
So, it collects:
Return = <newline><space><space>
ReturnData = <newline><space><space><space><space>
IRS1095A = <newline><space><space><space><space><space><space>
MonthlyPTCInformationGrpPP = <newline><space><space><space><space><space><space><space><space>
MonthlyPlanPremiumAmtPP = 136
...
And then you add those to a HashMap, which shuffles the key/value pairs in random order, making it difficult to see what happened.
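If the goal is just to see the pairs in document order while debugging, one option (not part of the original code) is java.util.LinkedHashMap, which iterates in insertion order. A minimal sketch, with InsertionOrderSketch and keysInInsertionOrder as made-up names:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class InsertionOrderSketch {

    /** Puts the names into a LinkedHashMap and returns the keys in iteration order. */
    static List<String> keysInInsertionOrder(String... names) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String name : names) {
            map.put(name, "");
        }
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        // Element names in the order the parser meets them in the XML file.
        System.out.println(keysInInsertionOrder(
                "Return", "ReturnData", "IRS1095A", "MonthlyPTCInformationGrpPP"));
    }
}
```

This only changes the iteration order; a repeated element name still overwrites the earlier entry.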
Updated
I'm not going to write the code for you, but if you want "value elements" then you need to:
Remember start element when seen
Collect any text, concatenating with other text already collected, e.g. when you see <text><cdata><text>
When seeing a start element and a start element is remembered, verify text is empty or all whitespace, then discard text
When seeing an end element:
if start element is remembered, add elementName/text to result, then forget start element and discard text. Note: Don't use map if same element name can occur more than once.
if start element is not remembered (was forgotten), verify text is empty or all whitespace, then discard text
This will collect just the leaf elements, ignoring any "layout".
Code exactly as written above
Well, I did add missing resource cleanup.
Map<String, String> map = new HashMap<>();
try (FileInputStream in = new FileInputStream(file)) {
XMLStreamReader xr = XMLInputFactory.newInstance().createXMLStreamReader(in);
try {
String elementName = null;
StringBuilder textBuf = new StringBuilder();
while (xr.hasNext()) {
switch (xr.next()) {
case XMLStreamConstants.START_ELEMENT:
// 3. When seeing a start element and a start element is remembered
if (elementName != null) {
// verify text is empty or all whitespace
if (! textBuf.toString().trim().isEmpty())
throw new IllegalArgumentException("Found text mixed with elements");
// then discard text
textBuf.setLength(0);
}
// 1. Remember start element when seen
elementName = xr.getLocalName();
break;
case XMLStreamConstants.CHARACTERS:
case XMLStreamConstants.CDATA:
case XMLStreamConstants.SPACE:
// 2. Collect any text
textBuf.append(xr.getText());
break;
case XMLStreamConstants.END_ELEMENT: // 4. When seeing an end element
if (elementName != null) { // 1. if start element is remembered
// add elementName/text to result
map.put(elementName, textBuf.toString());
// then forget start element
elementName = null;
// and discard text
textBuf.setLength(0);
} else { // 2. if start element is not remembered (was forgotten)
// verify text is empty or all whitespace
if (! textBuf.toString().trim().isEmpty())
throw new IllegalArgumentException("Found text mixed with elements");
// then discard text
textBuf.setLength(0);
}
break;
default:
// ignore
}
}
} finally {
xr.close();
}
}
return map;