Analyzing an InputStream containing XML in Java

I have an InputStream containing XML like the following:
InputStream is = asStream("<TransactionList>\n" +
" <Transaction type=\"C\" amount=\"1000\" narration=\"salary\" />\n" +
" <Transaction type=\"X\" amount=\"400\" narration=\"rent\"/>\n" +
" <Transaction type=\"D\" amount=\"750\" narration=\"other\"/>\n" +
"</TransactionList>");
xmlTransactionProcessor.importTransactions(is);
I'm trying to parse this and store the values in an ArrayList of (user-defined) Transaction objects, but I have not been able to get it working.
I have tried several approaches without success.
I have read about parsing XML files, but I still can't work out how to handle an InputStream like this.
Can anybody help? This is my latest attempt, but it still fails somewhere.
// TODO Auto-generated method stub
BufferedReader inputReader = new BufferedReader(new InputStreamReader(is));
StringBuilder sb = new StringBuilder();
String inline = "";
try {
while ((inline = inputReader.readLine()) != null) {
sb.append(inline);
}
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
SAXBuilder builder = new SAXBuilder();
try {
org.jdom2.Document document = (org.jdom2.Document) builder.build(new ByteArrayInputStream(sb.toString().getBytes()));
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}

You don't have to parse the XML yourself with a SAX parser. Several libraries support XML binding: serializing and deserializing XML documents to and from custom POJO classes (or collections of them).
There is even a standard for XML binding, called JAXB. It shipped with the JDK up to Java 10; since Java 11 it has to be added as a separate dependency. You can use annotations to map XML element names to the properties of your custom POJO.
Here's an example with my personal favorite library, Jackson. It is primarily designed to process JSON, but it has an extension that supports XML (and JAXB annotations).
import java.util.*;
import com.fasterxml.jackson.databind.*;
import com.fasterxml.jackson.dataformat.xml.*;
public class XMLTest
{
public static void main(String[] args)
{
String input =
"<TransactionList>\n" +
" <Transaction type=\"C\" amount=\"1000\" narration=\"salary\" />\n" +
" <Transaction type=\"X\" amount=\"400\" narration=\"rent\"/>\n" +
" <Transaction type=\"D\" amount=\"750\" narration=\"other\"/>\n" +
"</TransactionList>";
try {
XmlMapper xmlMapper = new XmlMapper();
xmlMapper.setDefaultUseWrapper(false);
// this is how we tell Jackson the target type of the deserialization
JavaType transactionListType = xmlMapper.getTypeFactory().constructCollectionType(List.class, Transaction.class);
List<Transaction> transactionList = xmlMapper.readValue(input, transactionListType );
System.out.println(transactionList);
} catch (Exception e) {
e.printStackTrace();
}
}
public static class Transaction
{
public String type;
public int amount;
public String narration;
@Override
public String toString() {
return String.format("{ type:%s, amount:%d, narration:%s }", type, amount, narration);
}
}
}

As Sharon Ben Asher explained, you could use annotation-based data binding with JAXB, or with Jackson and its XML data format module. That would be easier.
If you want to fix your existing code, which uses JDOM's SAXBuilder, here's how.
You have to iterate over the Document object, as in the code below.
public static void main(String[] args) {
InputStream is = new ByteArrayInputStream(("<TransactionList>\n" +
" <Transaction type=\"C\" amount=\"1000\" narration=\"salary\" />\n" +
" <Transaction type=\"X\" amount=\"400\" narration=\"rent\"/>\n" +
" <Transaction type=\"D\" amount=\"750\" narration=\"other\"/>\n" +
"</TransactionList>").getBytes(StandardCharsets.UTF_8));
ArrayList<Transaction> transactions = importTransactions(is);
}
In the importTransactions method, use getRootElement to get the root TransactionList element. Then iterate over its Transaction child elements using getChildren and a for-each loop.
public static ArrayList<Transaction> importTransactions(InputStream is){
ArrayList<Transaction> transactions = new ArrayList<>();
BufferedReader inputReader = new BufferedReader(new InputStreamReader(is));
StringBuilder sb = new StringBuilder();
String inline = "";
try {
while ((inline = inputReader.readLine()) != null) {
sb.append(inline);
}
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
SAXBuilder builder = new SAXBuilder();
try {
org.jdom2.Document document = builder.build(new ByteArrayInputStream(sb.toString().getBytes()));
Element transactionsElement = document.getRootElement();
List<Element> transactionList = transactionsElement.getChildren();
for (Element transaction:transactionList) {
Transaction t = new Transaction();
t.setType(transaction.getAttribute("type").getValue());
t.setAmount(transaction.getAttribute("amount").getValue());
transactions.add(t);
}
} catch (Exception e) {
// Log the error....
e.printStackTrace();
}
return transactions;
}
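For comparison, here is a minimal sketch of the same idea using only the DOM parser that ships with the JDK, so it needs no extra library. The Tx class is a hypothetical stand-in, because the question does not show its own Transaction class. Note that the stream is handed to the parser directly; as far as I know JDOM's SAXBuilder also offers a build(InputStream) overload, so copying the stream into a StringBuilder first should not be necessary there either.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DomTransactionImporter {

    // Hypothetical value class: the question does not show its real Transaction type.
    static final class Tx {
        final String type;
        final int amount;
        final String narration;

        Tx(String type, int amount, String narration) {
            this.type = type;
            this.amount = amount;
            this.narration = narration;
        }

        @Override
        public String toString() {
            return String.format("{ type:%s, amount:%d, narration:%s }", type, amount, narration);
        }
    }

    // Parses the stream directly; there is no need to copy it into a String first.
    static List<Tx> importTransactions(InputStream is) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.parse(is);
            NodeList nodes = document.getDocumentElement().getElementsByTagName("Transaction");
            List<Tx> transactions = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element element = (Element) nodes.item(i);
                transactions.add(new Tx(
                        element.getAttribute("type"),
                        Integer.parseInt(element.getAttribute("amount")),
                        element.getAttribute("narration")));
            }
            return transactions;
        } catch (Exception e) {
            // Wrapped so callers do not have to handle the parser's checked exceptions.
            throw new IllegalStateException("Could not parse transaction XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<TransactionList>\n"
                + " <Transaction type=\"C\" amount=\"1000\" narration=\"salary\" />\n"
                + " <Transaction type=\"X\" amount=\"400\" narration=\"rent\"/>\n"
                + " <Transaction type=\"D\" amount=\"750\" narration=\"other\"/>\n"
                + "</TransactionList>";
        InputStream is = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(importTransactions(is));
    }
}
```

DOM keeps the whole document in memory, which is fine for a small transaction list like this one; for very large inputs a streaming parser would be the safer choice.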

Related

Spring Jaxb2: How to append batch data to an XML file without reading it into memory?

I need to write data to XML in batches.
There are the following domain objects:
@XmlRootElement(name = "country")
public class Country {
@XmlElements({@XmlElement(name = "town", type = Town.class)})
private Collection<Town> towns = new ArrayList<>();
....
}
And:
@XmlRootElement(name = "town")
public class Town {
@XmlElement
private String townName;
// etc
}
I'm marshalling objects with Jaxb2. The configuration is as follows:
marshaller = new Jaxb2Marshaller();
marshaller.setClassesToBeBound(Country.class, Town.class);
Simple marshalling, such as marshaller.marshal(fileName, country), doesn't work here because it produces malformed XML.
Is there a way to tweak the marshaller so that it creates the file with all the marshalled data if it doesn't exist, or appends to the end of the XML file if it does?
Also, since these files are potentially large, I don't want to read the whole file into memory, append the data, and then write it back to disk.
I've used StAX for XML processing because it is stream based, consumes less memory than DOM, and can both read and write, whereas SAX can only parse XML data and cannot write it.
This is the approach I came up with:
public enum StAXBatchWriter {
INSTANCE;
private static final Logger LOGGER = LoggerFactory.getLogger(StAXBatchWriter.class);
public void writeUrls(File original, Collection<Town> towns) {
XMLEventReader eventReader = null;
XMLEventWriter eventWriter = null;
try {
String originalPath = original.getPath();
File from = new File(original.getParent() + "/old-" + original.getName());
boolean isRenamed = original.renameTo(from);
if (!isRenamed)
throw new IllegalStateException("Failed to rename file: " + original.getPath() + " to " + from.getPath());
File to = new File(originalPath);
XMLInputFactory inFactory = XMLInputFactory.newInstance();
eventReader = inFactory.createXMLEventReader(new FileInputStream(from));
XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
eventWriter = outFactory.createXMLEventWriter(new FileWriter(to));
XMLEventFactory eventFactory = XMLEventFactory.newInstance();
while (eventReader.hasNext()) {
XMLEvent event = eventReader.nextEvent();
eventWriter.add(event);
if (event.getEventType() == XMLEvent.START_ELEMENT && event.asStartElement().getName().toString().contains("country")) {
for (Town town : towns) {
writeTown(eventWriter, eventFactory, town);
}
}
}
boolean isDeleted = from.delete();
if (!isDeleted)
throw new IllegalStateException("Failed to delete old file: " + from.getPath());
} catch (IOException | XMLStreamException e) {
LOGGER.error(e.getMessage(), e);
throw new RuntimeException(e);
} finally {
try {
if (eventReader != null)
eventReader.close();
} catch (XMLStreamException e) {
LOGGER.error(e.getMessage(), e);
}
try {
if (eventWriter != null)
eventWriter.close();
} catch (XMLStreamException e) {
LOGGER.error(e.getMessage(), e);
}
}
}
private void writeTown(XMLEventWriter eventWriter, XMLEventFactory eventFactory, Town town) throws XMLStreamException {
eventWriter.add(eventFactory.createStartElement("", null, "town"));
// write town id
eventWriter.add(eventFactory.createStartElement("", null, "id"));
eventWriter.add(eventFactory.createCharacters(town.getId()));
eventWriter.add(eventFactory.createEndElement("", null, "id"));
//write town name
if (StringUtils.isNotEmpty(town.getName())) {
eventWriter.add(eventFactory.createStartElement("", null, "name"));
eventWriter.add(eventFactory.createCharacters(town.getName()));
eventWriter.add(eventFactory.createEndElement("", null, "name"));
}
// write other fields
eventWriter.add(eventFactory.createEndElement("", null, "town"));
}
}
It's not the best approach: although it is stream based and it works, it has some overhead, because the old file has to be re-read every time a batch is added.
It would be nice to be able to append the data at some point in the file (like "append data to this file after line 4"), but it seems this can't be done.
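One way around re-reading the whole file, offered only as a rough sketch and not something the StAX API provides: open the file with RandomAccessFile, look for the closing root tag in the last few kilobytes, overwrite from there with the new fragment, and write the closing tag again. Everything here is an assumption for illustration: the class and method names are made up, the file is assumed to be UTF-8 with an ASCII closing tag such as </country> near its end, and the fragment is assumed to be already-serialized, well-formed XML (for example produced by marshalling each Town as a fragment).

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class XmlTailAppender {

    // How many bytes at the end of the file are searched for the closing tag (assumed value).
    private static final int TAIL_BYTES = 4096;

    static void appendBeforeClosingTag(Path file, String closingTag, String fragment) {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            int tailSize = (int) Math.min(raf.length(), TAIL_BYTES);
            long tailStart = raf.length() - tailSize;
            byte[] tail = new byte[tailSize];
            raf.seek(tailStart);
            raf.readFully(tail);

            // ISO-8859-1 maps one byte to one char, so the string index equals the byte offset.
            int index = new String(tail, StandardCharsets.ISO_8859_1).lastIndexOf(closingTag);
            if (index < 0) {
                throw new IllegalStateException("Closing tag not found near end of file: " + closingTag);
            }

            // Overwrite from the closing tag onwards: new data first, then the tag again.
            raf.seek(tailStart + index);
            raf.write((fragment + closingTag).getBytes(StandardCharsets.UTF_8));
            raf.setLength(raf.getFilePointer());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Small round trip on a temporary file; returns the file content after the append.
    static String demo() {
        try {
            Path file = Files.createTempFile("country", ".xml");
            Files.write(file, "<country><town><name>Paris</name></town></country>\n"
                    .getBytes(StandardCharsets.UTF_8));
            appendBeforeClosingTag(file, "</country>", "<town><name>Lyon</name></town>");
            String result = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            Files.delete(file);
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The trade-off is that nothing validates the result: the method trusts that the fragment is well-formed and that the last occurrence of the closing tag really is the end of the root element. Anything after that tag (such as a trailing newline) is dropped.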

How to write a Java CSV parser using opencsv

I have to parse a CSV file.
The number of columns is variable.
I have written the following code for a fixed number of columns.
I used the CsvToBean and MappingStrategy APIs for parsing.
How can I create the mappings dynamically?
public class OpencsvExecutor2 {
public static void main(String[] args) throws IOException {
// TODO Auto-generated method stub
CsvToBean csv = new CsvToBean();
String csvFilename="C:\\Users\\ersvvwa\\Desktop\\taks\\supercsv\\20160511-0750--MaS_GsmrRel\\20160511-0750--MaS_GsmrRel.txt";
CSVReader csvReader = null;
List objList=new ArrayList<DataBean>();
try {
FileInputStream fis = new FileInputStream(csvFilename);
BufferedReader myInput = new BufferedReader(new InputStreamReader(fis));
csvReader = new CSVReader(new InputStreamReader(new FileInputStream(csvFilename), "UTF-8"), ' ', '\'', 1);
} catch (FileNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
csvReader.getRecordsRead();
//Set column mapping strategy
List<DataBean> list = csv.parse(setColumMapping(csvReader), csvReader);
for (Object object : list) {
DataBean obj = (DataBean) object;
// System.out.println(obj.Col1);
objList.add(obj);
}
csvReader.close();
System.out.println("list size "+list.size());
System.out.println("objList size "+objList.size());
String outFile="C:\\Users\\ersvvwa\\Desktop\\taks\\supercsv\\20160511-0750--MaS_GsmrRel\\20160511-0750--MaS_GsmrRel.csv";
try {
CSVWriter csvWriter = null;
csvWriter = new CSVWriter(new FileWriter(outFile),CSVWriter.DEFAULT_SEPARATOR,CSVWriter.NO_QUOTE_CHARACTER);
//csvWriter = new CSVWriter(out,CSVWriter.DEFAULT_SEPARATOR,CSVWriter.NO_QUOTE_CHARACTER);
String[] columns = new String[] {"col1","col2","col3","col4"};
// Writer w= new FileWriter(out);
BeanToCsv bc = new BeanToCsv();
List ls;
csvWriter.writeNext(columns);
//bc.write(setColumMapping(), csvWriter, objList);
System.out.println("complete");
csvWriter.close();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
private static MappingStrategy setColumMapping(CSVReader csvReader) throws IOException {
// TODO Auto-generated method stub
ColumnPositionMappingStrategy strategy = new ColumnPositionMappingStrategy();
strategy.setType(DataBean2.class);
String[] columns = new String[] {"col1","col2","col3","col4"};
strategy.setColumnMapping(columns);
return strategy;
}
}
If I understood correctly, you can read the file line by line and use split.
Example of reading a CSV file (taken from mkyong):
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
public class ReadCVS {
public static void main(String[] args) {
ReadCVS obj = new ReadCVS();
obj.run();
}
public void run() {
String csvFile = "/Users/mkyong/Downloads/GeoIPCountryWhois.csv";
BufferedReader br = null;
String line = "";
String cvsSplitBy = ",";
try {
br = new BufferedReader(new FileReader(csvFile));
while ((line = br.readLine()) != null) {
// use comma as separator
String[] country = line.split(cvsSplitBy);
System.out.println("Country [code= " + country[4]
+ " , name=" + country[5] + "]");
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} finally {
if (br != null) {
try {
br.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
System.out.println("Done");
}
}
Example of writing a CSV file (taken from mkyong):
import java.io.FileWriter;
import java.io.IOException;
public class GenerateCsv
{
public static void main(String [] args)
{
generateCsvFile("c:\\test.csv");
}
private static void generateCsvFile(String sFileName)
{
try
{
FileWriter writer = new FileWriter(sFileName);
writer.append("DisplayName");
writer.append(',');
writer.append("Age");
writer.append('\n');
writer.append("MKYONG");
writer.append(',');
writer.append("26");
writer.append('\n');
writer.append("YOUR NAME");
writer.append(',');
writer.append("29");
writer.append('\n');
//generate whatever data you want
writer.flush();
writer.close();
}
catch(IOException e)
{
e.printStackTrace();
}
}
}
However, I would recommend using a library. There are many (e.g., opencsv, Apache Commons CSV, Jackson Dataformat CSV). You don't have to reinvent the wheel.
The opencsv website has a lot of examples that you can use.
If you Google "opencsv read example" you will find many examples that use the opencsv library (e.g., "Parse / Read / write CSV files : OpenCSV tutorial").
Hopefully this helps!
Assuming that your code works, I would try to use generics for your setColumMapping method.
The setType method takes a Class parameter. Accept the same parameter in your own setColumMapping method, e.g. (CSVReader csvReader, Class type). This way you can pass DataBean2.class, or any other class, to the method. Furthermore, you need a variable column-to-bean mapping, because {"col1","col2","col3","col4"} will not fit every bean, as you know. Think about how you can make this dynamic (for example, you can pass a String[] to your setColumMapping method).
You will apparently also need to adjust the List usage inside your main method.
I suggest looking for a brief tutorial on Java generics before you start programming.
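If the set of columns is only known at run time, a different option from one bean per layout is to key every row by the header names. The sketch below is plain JDK code, not opencsv; the class name, method name and sample data are made up for illustration, and it assumes the first line is a header and that fields contain no quoted separators (for real CSV quoting a library is still the better tool).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DynamicCsvReader {

    // Reads the header line first, then maps each later line to header -> value.
    static List<Map<String, String>> read(Reader source, String separatorRegex) {
        try (BufferedReader reader = new BufferedReader(source)) {
            List<Map<String, String>> rows = new ArrayList<>();
            String headerLine = reader.readLine();
            if (headerLine == null) {
                return rows; // empty input
            }
            // limit -1 keeps trailing empty columns
            String[] header = headerLine.split(separatorRegex, -1);
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                String[] cells = line.split(separatorRegex, -1);
                Map<String, String> row = new LinkedHashMap<>();
                for (int i = 0; i < header.length; i++) {
                    // short rows get an empty string for the missing columns
                    row.put(header[i], i < cells.length ? cells[i] : "");
                }
                rows.add(row);
            }
            return rows;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String data = "NW,MSC,BSC,CELL\nn1,m1,b1,c1\nn2,m2,b2\n";
        for (Map<String, String> row : read(new StringReader(data), ",")) {
            System.out.println(row);
        }
    }
}
```

For a space-separated file like the one in the question, the separator argument would be " " instead of ",". The number of columns never appears in the code, so adding a column to the file only adds a key to each map.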
Finally, I was able to parse the CSV and write it in the desired format, like this:
csvWriter = new CSVWriter(new FileWriter(outFile),CSVWriter.DEFAULT_SEPARATOR,CSVWriter.NO_QUOTE_CHARACTER);
csvReader = new CSVReader(new InputStreamReader(new FileInputStream(csvFilename), "UTF-8"), ' ');
String header = "NW,MSC,BSC,CELL,CELL_0";
List<String> headerList = new ArrayList<String>();
headerList.add(header);
csvWriter.writeNext(headerList.toArray(new String[headerList.size()]));
while ((nextLine = csvReader.readNext()) != null) {
// nextLine[] is an array of values from the line
for(int j=0;j< nextLine.length;j++){
// System.out.println("next " +nextLine[1]+" "+nextLine [2]+ " "+nextLine [2]);
if(nextLine[j].contains("cell")||
nextLine[j].equalsIgnoreCase("NW") ||
nextLine[j].equalsIgnoreCase("MSC") ||
nextLine[j].equalsIgnoreCase("BSC") ||
nextLine[j].equalsIgnoreCase("CELL")){
hm.put(nextLine[j], j);
}
}
break;
}
String[] out=null;
while ((row = csvReader.readNext()) != null) {
String [] arr=new String[4];
outList = new ArrayList<>();
innerList = new ArrayList<>();
finalList=new ArrayList<String[]>();
String[] str=null;
int x=4;
for(int y=0; y<hm.size()-10;y++){
// use && here: with || the condition is always true, so "NULL" and blank cells are never skipped
if(!row[x].equalsIgnoreCase("NULL") && !row[x].equals(" ")){
System.out.println("x "+x);
str=new String[]{row[0],row[1],row[2],row[3],row[x]};
}
finalList.add(str);
x=x+3;
}
csvWriter.writeAll(finalList);
break;
}
csvReader.close();
csvWriter.close();
}

How to write a unit test for an XML parser I wrote in Java

The context is as follows:
I've got objects that represent Tweets (from Twitter). Each object has an id, a date and the id of the original tweet (if there was one).
I receive a file of tweets (where each tweet is in the format 05/04/2014 12:00:00, tweetID, originalID and is on its own line) and I want to save them as an XML file where each field has its own tag.
I want to then be able to read the file and return a list of Tweet objects corresponding to the Tweets from the XML file.
After writing the XML parser that does this I want to test that it works correctly. I've got no idea how to test this.
The XML Parser:
public class TweetToXMLConverter implements TweetImporterExporter {
//there is a single file used for the tweets database
static final String xmlPath = "src/main/resources/tweetsDataBase.xml";
//some "defines", as we like to call them ;)
static final String DB_HEADER = "tweetDataBase";
static final String TWEET_HEADER = "tweet";
static final String TWEET_ID_FIELD = "id";
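// XML element and attribute names cannot contain spaces (JDOM rejects them with an IllegalNameException)
static final String TWEET_ORIGIN_ID_FIELD = "originalTweet";
static final String TWEET_DATE_FIELD = "tweetDate";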
static File xmlFile;
static boolean initialized = false;
@Override
public void createDB() {
try {
Element tweetDB = new Element(DB_HEADER);
Document doc = new Document(tweetDB);
doc.setRootElement(tweetDB);
XMLOutputter xmlOutput = new XMLOutputter();
// pretty-print the output
xmlOutput.setFormat(Format.getPrettyFormat());
xmlOutput.output(doc, new FileWriter(xmlPath));
xmlFile = new File(xmlPath);
initialized = true;
} catch (IOException io) {
System.out.println(io.getMessage());
}
}
@Override
public void addTweet(Tweet tweet) {
if (!initialized) {
//TODO throw an exception? should not come to pass!
return;
}
SAXBuilder builder = new SAXBuilder();
try {
Document document = (Document) builder.build(xmlFile);
Element newTweet = new Element(TWEET_HEADER);
newTweet.setAttribute(new Attribute(TWEET_ID_FIELD, tweet.getTweetID()));
newTweet.setAttribute(new Attribute(TWEET_DATE_FIELD, tweet.getDate().toString()));
if (tweet.isRetweet())
newTweet.addContent(new Element(TWEET_ORIGIN_ID_FIELD).setText(tweet.getOriginalTweet()));
document.getRootElement().addContent(newTweet);
} catch (IOException io) {
System.out.println(io.getMessage());
} catch (JDOMException jdomex) {
System.out.println(jdomex.getMessage());
}
}
//break glass in case of emergency
@Override
public void addListOfTweets(List<Tweet> list) {
for (Tweet t : list) {
addTweet(t);
}
}
@Override
public List<Tweet> getListOfTweets() {
if (!initialized) {
//TODO throw an exception? should not come to pass!
return null;
}
try {
SAXBuilder builder = new SAXBuilder();
Document document;
document = (Document) builder.build(xmlFile);
List<Tweet> $ = new ArrayList<Tweet>();
for (Object o : document.getRootElement().getChildren(TWEET_HEADER)) {
Element rawTweet = (Element) o;
String id = rawTweet.getAttributeValue(TWEET_ID_FIELD);
String original = rawTweet.getChildText(TWEET_ORIGIN_ID_FIELD);
Date date = new Date(rawTweet.getAttributeValue(TWEET_DATE_FIELD));
$.add(new Tweet(id, original, date));
}
return $;
} catch (JDOMException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return null;
}
}
Some usage:
private TweetImporterExporter converter;
List<Tweet> tweetList = converter.getListOfTweets();
for (String tweetString : lines)
converter.addTweet(new Tweet(tweetString));
How can I make sure that the XML file I read (which contains tweets) corresponds to the file I receive (in the form stated above)?
How can I make sure the tweets I add to the file correspond to the ones I tried to add?
Assuming that you have the following model:
public class Tweet {
private Long id;
private Date date;
private Long originalTweetid;
//getters and setters
}
The process would be the following:
1. Create an instance of TweetToXMLConverter.
2. Create a list of the Tweet instances that you expect to get back after parsing the file.
3. Feed the converter the list you created.
4. Compare the list obtained by parsing the file with the list you created at the start of the test.
public class MainTest {
private TweetToXMLConverter converter;
private List<Tweet> tweets;
@Before
public void setup() {
// the list has to be created first, otherwise tweets.add() throws a NullPointerException
tweets = new ArrayList<>();
Tweet tweet = new Tweet(1, "05/04/2014 12:00:00", 2);
Tweet tweet2 = new Tweet(2, "06/04/2014 12:00:00", 1);
Tweet tweet3 = new Tweet(3, "07/04/2014 12:00:00", 2);
tweets.add(tweet);
tweets.add(tweet2);
tweets.add(tweet3);
converter = new TweetToXMLConverter();
// addTweet() returns without doing anything until createDB() has been called
converter.createDB();
converter.addListOfTweets(tweets);
}
@Test
public void testParse() {
List<Tweet> parsedTweets = converter.getListOfTweets();
// JUnit's assertEquals takes the expected value first, then the actual value
Assert.assertEquals(tweets.size(), parsedTweets.size());
for (int i = 0; i < parsedTweets.size(); i++) {
// assuming that both lists are in the same order and that Tweet implements equals()
Assert.assertEquals(tweets.get(i), parsedTweets.get(i));
}
}
}
I am using JUnit for the actual testing.

Parse an XML string in Java?

How do you parse XML stored in a Java String object?
Java's XMLReader only parses XML documents from a URI or an InputStream. Is it not possible to parse from a String containing XML data?
Right now I have the following:
try {
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser sp = factory.newSAXParser();
XMLReader xr = sp.getXMLReader();
ContactListXmlHandler handler = new ContactListXmlHandler();
xr.setContentHandler(handler);
xr.p
} catch (ParserConfigurationException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (SAXException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
And in my handler I have this:
public class ContactListXmlHandler extends DefaultHandler implements Resources {
private List<ContactName> contactNameList = new ArrayList<ContactName>();
private ContactName contactItem;
private StringBuffer sb;
public List<ContactName> getContactNameList() {
return contactNameList;
}
@Override
public void startDocument() throws SAXException {
// TODO Auto-generated method stub
super.startDocument();
sb = new StringBuffer();
}
@Override
public void startElement(String uri, String localName, String qName,
Attributes attributes) throws SAXException {
// TODO Auto-generated method stub
super.startElement(uri, localName, qName, attributes);
if(localName.equals(XML_CONTACT_NAME)){
contactItem = new ContactName();
}
sb.setLength(0);
}
@Override
public void characters(char[] ch, int start, int length){
// TODO Auto-generated method stub
try {
super.characters(ch, start, length);
} catch (SAXException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
sb.append(ch, start, length);
}
@Override
public void endDocument() throws SAXException {
// TODO Auto-generated method stub
super.endDocument();
}
/**
* where the real stuff happens
*/
@Override
public void endElement(String uri, String localName, String qName)
throws SAXException {
// TODO Auto-generated method stub
//super.endElement(arg0, arg1, arg2);
if(contactItem != null){
if (localName.equalsIgnoreCase("title")) {
contactItem.setUid(sb.toString());
Log.d("handler", "setTitle = " + sb.toString());
} else if (localName.equalsIgnoreCase("link")) {
contactItem.setFullName(sb.toString());
} else if (localName.equalsIgnoreCase("item")){
Log.d("handler", "adding rss item");
contactNameList.add(contactItem);
}
sb.setLength(0);
}
}
}
Thanks in advance
The SAXParser can read from an InputSource.
An InputSource can take a Reader in its constructor.
So you can parse an XML string via a StringReader:
new InputSource(new StringReader("... your xml here...."));
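To show the whole path from String to handler callbacks, here is a small self-contained sketch using only JDK classes. The class name, the elementTexts helper and the sample XML are made up for illustration; in the question's code the anonymous handler would be replaced by ContactListXmlHandler.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxStringParser {

    // Collects the text content of every element with the given name.
    static List<String> elementTexts(String xml, String elementName) {
        List<String> texts = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder sb = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                sb.setLength(0); // start collecting text for the new element
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                sb.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // The default factory is not namespace aware, so compare against qName.
                if (qName.equals(elementName)) {
                    texts.add(sb.toString().trim());
                }
                sb.setLength(0);
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // The String is wrapped in a StringReader, which is wrapped in an InputSource.
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML string", e);
        }
        return texts;
    }

    public static void main(String[] args) {
        String xml = "<contacts><name>Ann</name><name>Bob</name></contacts>";
        System.out.println(elementTexts(xml, "name"));
    }
}
```

Because the input is already a String of characters, no byte encoding is involved, which is one reason to prefer a StringReader over converting the String to bytes and wrapping it in an InputStream.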
Try jcabi-xml (see this blog post) with a one-liner:
XML xml = new XMLDocument("<document>...</document>");
Your XML might be simple enough to parse manually using the DOM or SAX API, but I'd still suggest using an XML serialization API such as JAXB, XStream, or Simple instead because writing your own XML serialization/deserialization code is a drag.
Note that the XStream FAQ erroneously claims that you must use generated classes with JAXB:
How does XStream compare to JAXB (Java API for XML Binding)?
JAXB is a Java binding tool. It generates Java code from a schema and
you are able to transform from those classes into XML matching the
processed schema and back. Note, that you cannot use your own objects,
you have to use what is generated.
It seems this was true at one time, but JAXB 2.0 no longer requires you to use Java classes generated from a schema.
If you go this route, be sure to check out the side-by-side comparisons of the serialization/marshalling APIs I've mentioned:
http://blog.bdoughan.com/2010/10/how-does-jaxb-compare-to-xstream.html
http://blog.bdoughan.com/2010/10/how-does-jaxb-compare-to-simple.html
Take a look at this: http://www.rgagnon.com/javadetails/java-0573.html
import javax.xml.parsers.*;
import org.xml.sax.InputSource;
import org.w3c.dom.*;
import java.io.*;
public class ParseXMLString {
public static void main(String arg[]) {
String xmlRecords =
"<data>" +
" <employee>" +
" <name>John</name>" +
" <title>Manager</title>" +
" </employee>" +
" <employee>" +
" <name>Sara</name>" +
" <title>Clerk</title>" +
" </employee>" +
"</data>";
try {
DocumentBuilderFactory dbf =
DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputSource is = new InputSource();
is.setCharacterStream(new StringReader(xmlRecords));
Document doc = db.parse(is);
NodeList nodes = doc.getElementsByTagName("employee");
// iterate the employees
for (int i = 0; i < nodes.getLength(); i++) {
Element element = (Element) nodes.item(i);
NodeList name = element.getElementsByTagName("name");
Element line = (Element) name.item(0);
System.out.println("Name: " + getCharacterDataFromElement(line));
NodeList title = element.getElementsByTagName("title");
line = (Element) title.item(0);
System.out.println("Title: " + getCharacterDataFromElement(line));
}
}
catch (Exception e) {
e.printStackTrace();
}
/*
output :
Name: John
Title: Manager
Name: Sara
Title: Clerk
*/
}
public static String getCharacterDataFromElement(Element e) {
Node child = e.getFirstChild();
if (child instanceof CharacterData) {
CharacterData cd = (CharacterData) child;
return cd.getData();
}
return "?";
}
}

Text Extraction from HTML Java

I'm working on a program that downloads HTML pages, selects some of the information, and writes it to another file.
I want to extract the information between the paragraph tags, but I can only get one line of each paragraph. My code is as follows:
FileReader fileReader = new FileReader(file);
BufferedReader buffRd = new BufferedReader(fileReader);
BufferedWriter out = new BufferedWriter(new FileWriter(newFile.txt));
String s;
while ((s = br.readLine()) !=null) {
if(s.contains("<p>")) {
try {
out.write(s);
} catch (IOException e) {
}
}
}
I was trying to add another while loop, which would tell the program to keep writing to the file until the line contains the </p> tag:
while ((s = br.readLine()) !=null) {
if(s.contains("<p>")) {
while(!s.contains("</p>") {
try {
out.write(s);
} catch (IOException e) {
}
}
}
}
But this doesn't work. Could someone please help?
jsoup
Another HTML parser I really liked using was jsoup. You can get all the <p> elements in two lines of code.
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements ps = doc.select("p");
Then write it out to a file in one more line
out.write(ps.text()); //it will append all of the p elements together in one long string
or if you want them on separate lines you can iterate through the elements and write them out separately.
Jericho is one of several possible HTML parsers that could make this task both easy and safe.
JTidy can represent an HTML document (even a malformed one) as a document model, which makes extracting the contents of a <p> tag rather more elegant than manually working through the raw text.
Try (if you don't want to use a HTML parser library):
FileReader fileReader = new FileReader(file);
BufferedReader buffRd = new BufferedReader(fileReader);
BufferedWriter out = new BufferedWriter(new FileWriter("newFile.txt"));
String s;
// true while we are between a <p> and its closing </p>
boolean inParagraph = false;
while ((s = buffRd.readLine()) != null)
{
if (s.contains("<p>"))
{
inParagraph = true;
}
if (inParagraph)
{
// write every line of the paragraph exactly once, including the lines that hold the tags
out.write(s);
out.newLine();
}
if (s.contains("</p>"))
{
inParagraph = false;
}
}
// closing the writer flushes the buffered output to the file
out.close();
I've had success using TagSoup & XPath to parse HTML.
http://home.ccil.org/~cowan/XML/tagsoup/
Use a ParserCallback. It's a simple class that's included with the JDK. It notifies you every time a new tag is found, and then you can extract the text of the tag. Simple example:
import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
public class ParserCallbackTest extends HTMLEditorKit.ParserCallback
{
private int tabLevel = 1;
private int line = 1;
public void handleComment(char[] data, int pos)
{
displayData(new String(data));
}
public void handleEndOfLineString(String eol)
{
System.out.println( line++ );
}
public void handleEndTag(HTML.Tag tag, int pos)
{
tabLevel--;
displayData("/" + tag);
}
public void handleError(String errorMsg, int pos)
{
displayData(pos + ":" + errorMsg);
}
public void handleMutableTag(HTML.Tag tag, MutableAttributeSet a, int pos)
{
displayData("mutable:" + tag + ": " + pos + ": " + a);
}
public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet a, int pos)
{
displayData( tag + "::" + a );
// tabLevel++;
}
public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos)
{
displayData( tag + ":" + a );
tabLevel++;
}
public void handleText(char[] data, int pos)
{
displayData( new String(data) );
}
private void displayData(String text)
{
for (int i = 0; i < tabLevel; i++)
System.out.print("\t");
System.out.println(text);
}
public static void main(String[] args)
throws IOException
{
ParserCallbackTest parser = new ParserCallbackTest();
// args[0] is the file to parse
Reader reader = new FileReader(args[0]);
// URLConnection conn = new URL(args[0]).openConnection();
// Reader reader = new InputStreamReader(conn.getInputStream());
try
{
new ParserDelegator().parse(reader, parser, true);
}
catch (IOException e)
{
System.out.println(e);
}
}
}
So all you need to do is set a boolean flag when the paragraph tag is found. Then in the handleText() method you extract the text.
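As a rough sketch of that flag idea (the class name, the extract helper and the sample HTML are made up for illustration), the callback below only collects text while it is inside a <p> element:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class ParagraphExtractor extends HTMLEditorKit.ParserCallback {

    private final List<String> paragraphs = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private boolean inParagraph = false;

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos) {
        if (tag == HTML.Tag.P) {
            inParagraph = true;   // raise the flag at <p>
            current.setLength(0);
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (inParagraph) {
            current.append(data); // text of nested inline tags such as <b> is included
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.P && inParagraph) {
            paragraphs.add(current.toString());
            inParagraph = false;  // lower the flag at </p>
        }
    }

    // Parses the HTML and returns the text of each paragraph.
    static List<String> extract(Reader html) {
        ParagraphExtractor callback = new ParagraphExtractor();
        try {
            new ParserDelegator().parse(html, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.paragraphs;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>world</b></p><div>skip</div><p>Second</p></body></html>";
        for (String paragraph : extract(new StringReader(html))) {
            System.out.println(paragraph);
        }
    }
}
```

Because the parser works on tags and not on lines, it does not matter whether a paragraph spans one line or many in the downloaded file.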
Try this.
public static void main( String[] args )
{
String url = "http://en.wikipedia.org/wiki/Big_data";
Document document;
try {
document = Jsoup.connect(url).get();
Elements paragraphs = document.select("p");
// a for-each loop prints every paragraph and also copes with a page that has no <p> elements
for (Element p : paragraphs) {
System.out.println("* " + p.text());
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
You may just be using the wrong tool for the job:
perl -ne "print if m|<p>| .. m|</p>|" infile.txt >outfile.txt
