Parsing a very large XML file in Java on Android

I have categories in one XML file that map, by category id, to subcategories in another, very large XML file. The file that holds only category ids and names loads quickly, but the file with the subcategories (image paths, descriptions, latitude/longitude, and so on) takes a long time to load.
I am using the javax.xml and org.w3c.dom packages.
The list action reloads the file on every click to look up the subcategories.
Is there any way to make this whole process faster?
Edit-1
Here is the code I am using to fetch subcategories:
Document doc = this.builder.parse(inStream, null);
doc.getDocumentElement().normalize();
NodeList pageList = doc.getElementsByTagName("page");
final int length = pageList.getLength();
for (int i = 0; i < length; i++)
{
    boolean inCategory = false;
    Element categories = (Element) getChild(pageList.item(i), "categories");
    if (categories != null)
    {
        NodeList categoryList = categories.getElementsByTagName("category");
        for (int j = 0; j < categoryList.getLength(); j++)
        {
            if (Integer.parseInt(categoryList.item(j).getTextContent()) == catID)
            {
                inCategory = true;
                break;
            }
        }
    }
    if (inCategory)
    {
        final NamedNodeMap attr = pageList.item(i).getAttributes();
        //get Page ID
        final int categoryID = Integer.parseInt(getNodeValue(attr, "id"));
        //get Page Name
        final String categoryName = (getChild(pageList.item(i), "title") != null) ? getChild(pageList.item(i), "title").getTextContent() : "Untitled";
        //get ThumbNail
        final NamedNodeMap thumb_attr = getChild(pageList.item(i), "thumbnail").getAttributes();
        final String categoryImage = "placethumbs/" + getNodeValue(thumb_attr, "file");
        //final String categoryImage = "androidicon.png";
        Category category = new Category(categoryName, categoryID, categoryImage);
        this.list.add(category);
        Log.d(tag, category.toString());
    }
}

Use a SAX-based parser. DOM is not a good fit for large XML files because it builds the whole document tree in memory.

A SAX processor may be quicker, assuming your app is slowing down because of the memory requirements of a DOM-style approach.
Article on processing XML on android
SOF question about SAX processing on Android
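As a rough illustration of the SAX suggestion, here is a minimal sketch that streams the file and keeps only the pages listed under one category. The element and attribute names (page, id, title, thumbnail, file, categories, category) are taken from the question's code; the class name PageSaxSketch, the method name, and the "id|title|thumbnail" result format are assumptions made for this example, standing in for the question's Category objects.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the question's code base.
final class PageSaxSketch {

    /** Returns "id|title|thumbnail" for every page that lists catId as a category. */
    static List<String> pagesInCategory(String xml, final int catId) throws Exception {
        final List<String> result = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private String id;
            private String title;
            private String thumb;
            private boolean inCategory;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                text.setLength(0); // start collecting text for the new element
                if ("page".equals(qName)) {
                    id = attrs.getValue("id");
                    title = "Untitled"; // same fallback as the DOM version
                    thumb = null;
                    inCategory = false;
                } else if ("thumbnail".equals(qName)) {
                    thumb = "placethumbs/" + attrs.getValue("file");
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // may be called several times per text node
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("title".equals(qName)) {
                    title = text.toString().trim();
                } else if ("category".equals(qName)) {
                    if (Integer.parseInt(text.toString().trim()) == catId) {
                        inCategory = true;
                    }
                } else if ("page".equals(qName) && inCategory) {
                    result.add(id + "|" + title + "|" + thumb);
                }
                text.setLength(0);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return result;
    }
}
```

Since the question mentions that the file is reloaded on every click, it may also help to parse once and keep the results grouped by category id, so later clicks become simple lookups.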

Related

How to get hyperlink boundaries of inline words with Aspose Words for Android?

The Android app reads paragraphs and some of their properties from an MS Word document using the Aspose Words for Android library. It gets each paragraph's text, style name, and style-separator flag. Some words in a paragraph line are hyperlinks. How can I get the start and end boundaries of a hyperlinked word? For example:
In "This is an inline hyperlink paragraph example", the word "hyperlink" is the link, so the start bound is 18 and the end bound is 27.
public static ArrayList<String[]> GetBookLinesByTag(String file) {
    ArrayList<String[]> bookLines = new ArrayList<>();
    try {
        Document doc = new Document(file);
        ParagraphCollection paras = doc.getFirstSection().getBody().getParagraphs();
        for (int i = 0; i < paras.getCount(); i++) {
            String styleName = paras.get(i).getParagraphFormat().getStyleName().trim();
            String isStyleSeparator = Integer.toString(paras.get(i).getBreakIsStyleSeparator() ? 1 : 0);
            String content = paras.get(i).toString(SaveFormat.TEXT).trim();
            bookLines.add(new String[]{content, styleName, isStyleSeparator});
        }
    } catch (Exception e) {}
    return bookLines;
}
Edit:
Thanks to Alexey Noskov, this is solved. The working code:
public static ArrayList<String[]> GetBookLinesByTag(String file) {
    ArrayList<String[]> bookLines = new ArrayList<>();
    try {
        Document doc = new Document(file);
        ParagraphCollection paras = doc.getFirstSection().getBody().getParagraphs();
        for (int i = 0; i < paras.getCount(); i++) {
            String styleName = paras.get(i).getParagraphFormat().getStyleName().trim();
            String isStyleSeparator = Integer.toString(paras.get(i).getBreakIsStyleSeparator() ? 1 : 0);
            String content = paras.get(i).toString(SaveFormat.TEXT).trim();
            for (Field field : paras.get(i).getRange().getFields()) {
                if (field.getType() == FieldType.FIELD_HYPERLINK) {
                    FieldHyperlink hyperlink = (FieldHyperlink) field;
                    String urlId = hyperlink.getSubAddress();
                    String urlText = hyperlink.getResult();
                    // Reformat linked text: urlText:urlId
                    content = urlText + ":" + urlId;
                }
            }
            bookLines.add(new String[]{content, styleName, isStyleSeparator});
        }
    } catch (Exception e) {}
    return bookLines;
}
Hyperlinks in MS Word documents are represented as fields. If you press Alt+F9 in MS Word, you will see something like this:
{ HYPERLINK "https://aspose.com" }
Follow the link to learn more about fields in the Aspose.Words document model and in MS Word:
https://docs.aspose.com/display/wordsjava/Introduction+to+Fields
In your case, you need to locate the position of FieldStart, which is the start position, then measure the length of the content between FieldSeparator and FieldEnd. The start position plus that length is the end position.
Disclosure: I work on the Aspose.Words team.
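The position arithmetic in this answer can be sketched without the Aspose API. The Node and Kind types below are hypothetical stand-ins for a paragraph's text runs and field markers; they are not Aspose.Words classes. The sketch also assumes that fields are not nested and that the text between FieldStart and FieldSeparator (the field code) does not count toward the visible paragraph text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a paragraph; not the Aspose.Words document model.
final class HyperlinkBounds {

    enum Kind { TEXT, FIELD_START, FIELD_SEPARATOR, FIELD_END }

    static final class Node {
        final Kind kind;
        final String text;

        Node(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }

        static Node text(String s) { return new Node(Kind.TEXT, s); }
        static Node marker(Kind kind) { return new Node(kind, ""); }
    }

    /** Returns {start, end} offsets, measured in visible text, for each field result. */
    static List<int[]> find(List<Node> nodes) {
        List<int[]> bounds = new ArrayList<>();
        int visible = 0;             // visible characters seen so far
        int start = -1;              // offset where the current field began, or -1
        boolean inFieldCode = false; // true between FieldStart and FieldSeparator
        for (Node node : nodes) {
            switch (node.kind) {
                case FIELD_START:
                    start = visible;
                    inFieldCode = true;
                    break;
                case FIELD_SEPARATOR:
                    inFieldCode = false;
                    break;
                case FIELD_END:
                    if (start >= 0) {
                        bounds.add(new int[] { start, visible });
                    }
                    start = -1;
                    inFieldCode = false;
                    break;
                default:
                    // Field code text such as HYPERLINK "..." is not visible, so skip it.
                    if (!inFieldCode) {
                        visible += node.text.length();
                    }
            }
        }
        return bounds;
    }
}
```

With a real document, the same walk would presumably run over the paragraph's child nodes, adding run text lengths and reacting to the field marker nodes.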

Getting more than one element with Jsoup in Java on Android

I am trying to get a list of items to form a playlist, but I am only able to retrieve one of the items. Here is the code in my RecyclerView adapter's onBindViewHolder:
@Override
public void onBindViewHolder(PlaylistViewHolder holder, int position)
{
    try
    {
        String url = "https://www.c895.org/playlist";
        Document document = Jsoup.connect(url).get();
        Element playlist = document.select("#playlist").first();
        List<TrackInfo> tracks = new ArrayList<>();
        for (Element track : playlist.children())
        {
            long time = Long.parseLong(track.dataset().get("ts"));
            String title = track.select(".title").text();
            String artist = track.select(".artist").text();
            tracks.add(new TrackInfo(new Date(time * 1000), title, artist));
        }
        for (int i = 0; i < tracks.size() - 1; i++)
        {
            holder.titlesView.setText(tracks.get(i).toString());
        }
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
}
Ideally I'd like to get about 10-20 results. Is there any way I could do this?
It's because the HTML you need is inside the following tag:
<div id="playlist">
</div>
So you can't use the following:
Element playlist = document.select("#playlist").first();
Instead, you need to use div#playlist to get all the playlist items. Note that select returns Elements, not a single Element:
Elements playlist = document.select("div#playlist");

How to set an HTML node value using WebView with JavaFX

I am trying to set the values of HTML form elements after the page has loaded into a WebView. I tried setting them using
org.w3c.dom.Document doc = webEngine.getDocument();
HTMLFormElement form = (HTMLFormElement) doc.getElementsByTagName("form").item(0);
NodeList nodes = form.getElementsByTagName("input");
nodes.item(1).setNodeValue("yadayada"); //this is where i am setting the value
but with no success. Can anybody help me out? Here is my code:
org.w3c.dom.Document doc = webEngine.getDocument();
if (doc != null && doc.getElementsByTagName("form").getLength() > 0) {
    HTMLFormElement form = (HTMLFormElement) doc.getElementsByTagName("form").item(0);
    String username = null;
    String password = null;
    NodeList nodes = form.getElementsByTagName("input");
    for (int i = 0; i < nodes.getLength(); i++) {
        if (nodes.item(i).hasAttributes()) {
            NamedNodeMap attr = nodes.item(i).getAttributes();
            for (int j = 0; j < attr.getLength(); j++) {
                Attr atribute = (Attr) attr.item(j);
                if (atribute.getValue().equals("password")) {
                    System.out.println("Password detected");
                    nodes.item(i).setNodeValue("123456");
                }
            }
        }
    }
}
I found the solution after searching the web. The problem was that I was using setNodeValue, but the values of input tags are set through HTMLInputElement. This link was valuable for me:
Performing an automated form post of login using webview
For example:
HTMLInputElement password = (HTMLInputElement) nodes.item(0);
password.setValue("yadayada");
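To see why setNodeValue had no visible effect, the following sketch uses a plain org.w3c.dom element built with the JDK's XML parser rather than a WebView page. That substitution is an assumption made so the example runs without JavaFX, and NodeValueDemo is a hypothetical class name. In the DOM, an Element's nodeValue is defined as null and setting it does nothing, while attributes are changed through setAttribute.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class; uses a stand-alone DOM document, not a WebView page.
final class NodeValueDemo {

    /** Builds a stand-alone <input type="password"> element. */
    static Element newPasswordInput() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element input = doc.createElement("input");
        input.setAttribute("type", "password");
        doc.appendChild(input);
        return input;
    }

    /** Returns the "value" attribute after setNodeValue and then after setAttribute. */
    static String[] run() throws Exception {
        Element input = newPasswordInput();

        input.setNodeValue("123456"); // no effect: an Element has no node value
        String afterSetNodeValue = input.getAttribute("value");

        input.setAttribute("value", "123456"); // attributes are set explicitly
        String afterSetAttribute = input.getAttribute("value");

        return new String[] { afterSetNodeValue, afterSetAttribute };
    }
}
```

In a live page the value attribute only holds the field's default, which is likely why casting to HTMLInputElement and calling setValue, as in the answer, is the more reliable way to change what the form submits.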

Unable to parse element attribute with XOM

I'm attempting to parse an RSS feed using the XOM Java library. Each entry's image URL is stored as an attribute of the <img> element, as seen below.
<rss version="2.0">
  <channel>
    <item>
      <title>Decision Paralysis</title>
      <link>https://xkcd.com/1801/</link>
      <description>
        <img src="https://imgs.xkcd.com/comics/decision_paralysis.png"/>
      </description>
      <pubDate>Mon, 20 Feb 2017 05:00:00 -0000</pubDate>
      <guid>https://xkcd.com/1801/</guid>
    </item>
  </channel>
</rss>
Attempting to read <img src=""> with .getFirstChildElement("img") returns null, so my code crashes with a NullPointerException when I try to retrieve the src attribute. Why is my program failing to read in the <img> element, and how can I read it in properly?
import nu.xom.*;

public class RSSParser {
    public static void main(String[] args) {
        try {
            Builder parser = new Builder();
            Document doc = parser.build("https://xkcd.com/rss.xml");
            Element rootElement = doc.getRootElement();
            Element channelElement = rootElement.getFirstChildElement("channel");
            Elements itemList = channelElement.getChildElements("item");
            // Iterate through itemList
            for (int i = 0; i < itemList.size(); i++) {
                Element item = itemList.get(i);
                Element descElement = item.getFirstChildElement("description");
                Element imgElement = descElement.getFirstChildElement("img");
                // Crashes with NullPointerException
                String imgSrc = imgElement.getAttributeValue("src");
            }
        }
        catch (Exception error) {
            error.printStackTrace();
            System.exit(1);
        }
    }
}
There is no img element in the item. Try:
if (imgElement != null) {
    String imgSrc = imgElement.getAttributeValue("src");
}
What the item contains is this:
<description><img
src="http://imgs.xkcd.com/comics/us_state_names.png"
title="Technically DC isn't a state, but no one is too
pedantic about it because they don't want to disturb the snakes
."
alt="Technically DC isn't a state, but no one is too pedantic about it because they don't want to disturb the snakes." />
</description>
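That's not an img element. It's plain text: the markup inside <description> is escaped in the feed, so the parser reports it as text content rather than as a child element.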
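The difference between an element and escaped text can be checked with a small experiment. This sketch uses the JDK's built-in DOM parser instead of XOM, and the feed string is a shortened sample that assumes the description is escaped with &lt; and &gt;. Both choices are assumptions for illustration, as is the class name EscapedDescriptionDemo.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class using the JDK DOM parser, not XOM.
final class EscapedDescriptionDemo {

    // Shortened sample feed; the img markup is escaped, so it is character data.
    static final String FEED =
            "<rss version=\"2.0\"><channel><item><description>"
            + "&lt;img src=\"https://imgs.xkcd.com/comics/decision_paralysis.png\"/&gt;"
            + "</description></item></channel></rss>";

    /** Parses the sample feed and returns its single description element. */
    static Element description() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(FEED)));
        return (Element) doc.getElementsByTagName("description").item(0);
    }

    /** Number of img child elements the parser sees under description. */
    static int imgElementCount() throws Exception {
        return description().getElementsByTagName("img").getLength();
    }

    /** The description's text content, with the entities already decoded. */
    static String descriptionText() throws Exception {
        return description().getTextContent();
    }
}
```

Here the parser finds no img element, and the markup only appears in the description's text, which is why it has to be extracted from the string.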
I managed to come up with a somewhat hacky solution using regex and pattern matching.
// Iterate through itemList
for (int i = 0; i < itemList.size(); i++) {
    Element item = itemList.get(i);
    String descString = item.getFirstChildElement("description").getValue();
    // Parse image URL (hacky)
    String imgSrc = "";
    Pattern pattern = Pattern.compile("src=\"[^\"]*\"");
    Matcher matcher = pattern.matcher(descString);
    if (matcher.find()) {
        imgSrc = descString.substring( matcher.start()+5, matcher.end()-1 );
    }
}
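A slightly tidier variant of the same regex idea is to use a capturing group, so the URL comes straight from the match instead of from offset arithmetic. ImgSrcExtractor and firstSrc are hypothetical names for this sketch, and like the original it assumes the src value is wrapped in double quotes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a variant of the regex approach above.
final class ImgSrcExtractor {

    // Group 1 captures everything between the double quotes after src=.
    private static final Pattern SRC = Pattern.compile("src=\"([^\"]*)\"");

    /** Returns the first src value found in the description text, or "" if there is none. */
    static String firstSrc(String description) {
        Matcher matcher = SRC.matcher(description);
        return matcher.find() ? matcher.group(1) : "";
    }
}
```

Inside the loop this would replace the substring call, for example imgSrc = ImgSrcExtractor.firstSrc(descString).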

Parsing with htmlcleaner

I developed a method that extracts items of a specific class using HtmlCleaner. Now I was wondering:
How would you extract the body and all the elements inside it from an HTML document using HtmlCleaner?
public String htmlParser(String html) {
    TagNode rootNode;
    HtmlCleaner html_cleaner = new HtmlCleaner();
    rootNode = html_cleaner.clean(html);
    TagNode[] items = rootNode.getElementsByName("body", true);
    ParseBody(items[0]);
    html = item_found;
    return html;
}

String item_found;

public void ParseBody(TagNode root) {
    if (root.getAllElements(true).length > 0) {
        for (TagNode node : root.getAllElements(true)) {
            ParseBody(node);
        }
    } else {
        item_found = item_found + root.toString(); // root.toString() only brings out the first name inside TagNode
        // Here I wanted just the text of all items in the body, but a complete answer would still be beneficial for everyone.
        //if(root.getText().toString() != null || !(root.getText().toString().equals("null"))){
        //item_found = item_found + root.getText().toString();
        //}
    }
}
