I am writing a crawler/parser that should be able to process different types of content, being RSS, Atom and just plain html files. To determine the correct parser, I wrote a class called ParseFactory, which takes an URL, tries to detect the content-type, and returns the correct parser.
Unfortunately, checking the content-type using the provided in method in URLConnection doesn't always work. For example,
String contentType = url.openConnection().getContentType();
doesn't always provide the correct content-type (e.g "text/html" where it should be RSS) or doesn't allow to distinguish between RSS and Atom (e.g. "application/xml" could be both an Atom or a RSS feed). To solve this problem, I started looking for clues in the InputStream. Problem is that I am having trouble coming up an elegant class design, where I need to download the InputStream only once. In my current design I have wrote a separate class first that determines the correct content-type, next the ParseFactory uses this information to create an instance of the corresponding parser, which in turn, when the method 'parse()' is called, downloads the entire InputStream a second time.
public Parser createParser(){
InputStream inputStream = null;
String contentType = null;
String contentEncoding = null;
ContentTypeParser contentTypeParser = new ContentTypeParser(this.url);
Parser parser = null;
try {
inputStream = new BufferedInputStream(this.url.openStream());
contentTypeParser.parse(inputStream);
contentType = contentTypeParser.getContentType();
contentEncoding = contentTypeParser.getContentEncoding();
assert (contentType != null);
inputStream = new BufferedInputStream(this.url.openStream());
if (contentType.equals(ContentTypes.rss))
{
logger.info("RSS feed detected");
parser = new RssParser(this.url);
parser.parse(inputStream);
}
else if (contentType.equals(ContentTypes.atom))
{
logger.info("Atom feed detected");
parser = new AtomParser(this.url);
}
else if (contentType.equals(ContentTypes.html))
{
logger.info("html detected");
parser = new HtmlParser(this.url);
parser.setContentEncoding(contentEncoding);
}
else if (contentType.equals(ContentTypes.UNKNOWN))
logger.debug("Unable to recognize content type");
if (parser != null)
parser.parse(inputStream);
} catch (IOException e) {
e.printStackTrace();
} finally {
try {
inputStream.close();
} catch (IOException e) {
e.printStackTrace();
}
}
return parser;
}
Basically, I am looking for a solution that allows me to eliminate the second "inputStream = new BufferedInputStream(this.url.openStream())".
Any help would be greatly appreciated!
Side note 1: Just for the sake of being complete, I also tried using the URLConnection.guessContentTypeFromStream(inputStream) method, but this returns null way too often.
Side note 2: The XML-parsers (Atom and Rss) are based on SAXParser, the Html-parser on Jsoup.
Can you just call mark and reset?
inputStream = new BufferedInputStream(this.url.openStream());
inputStream.mark(2048); // Or some other sensible number
contentTypeParser.parse(inputStream);
contentType = contentTypeParser.getContentType();
contentEncoding = contentTypeParser.getContentEncoding();
inputstream.reset(); // Let the parser have a crack at it now
Perhaps your ContentTypeParser should cache the content internally and feed it to the appropiate ContentParser instead of reacquiring data from InputStream.
Related
I am trying to parse a xml using stax but the error I get is:
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[414,47]
Message: The reference to entity "R" must end with the ';' delimiter.
Which get stuck on the line 414 which has P&Rinside the xml file. The code I have to parse it is:
public List<Vild> getVildData(File file){
XMLInputFactory factory = XMLInputFactory.newFactory();
try {
ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(Files.readAllBytes(file.toPath()));
XMLStreamReader reader = factory.createXMLStreamReader(byteArrayInputStream, "iso8859-1");
List<Vild> vild = saveVild(reader);
reader.close();
return vild;
} catch (IOException e) {
e.printStackTrace();
} catch (XMLStreamException e) {
e.printStackTrace();
}
return Collections.emptyList();
}
private List<Vild> saveVild(XMLStreamReader streamReader) {
List<Vild> vildList = new ArrayList<>();
try{
Vild vild = new Vild();
while (streamReader.hasNext()) {
streamReader.next();
//Creating list with data
}
}catch(XMLStreamException | IllegalStateException ex) {
ex.printStackTrace();
}
return Collections.emptyList();
}
I read online that the & is invalid xml code but I don't know how to change it before it throws this error inside the saveVild method. Does someone know how to do this efficiently?
Change the question: you're not trying to parse an XML file, you're trying to parse a non-XML file. For that, you need a non-XML parser, and to write such a parser you need to start with a specification of the language you are trying to parse, and you'll need to agree the specification of this language with the other partners to the data interchange.
How much work you could all save by conforming to standards!
Treat broken XML arriving in your shop the way you would treat any other broken goods coming from a supplier: return it to sender marked "unfit for purpose".
The problem here, as you mention is that the parser finds the & and it expects also the ;
This gets fixed escaping the character, so that the parser finds & instead.
Take a look here for further reference
I go through this link for java nlp https://www.tutorialspoint.com/opennlp/index.htm
I tried below code in android:
try {
File file = copyAssets();
// InputStream inputStream = new FileInputStream(file);
ParserModel model = new ParserModel(file);
// Creating a parser
Parser parser = ParserFactory.create(model);
// Parsing the sentence
String sentence = "Tutorialspoint is the largest tutorial library.";
Parse topParses[] = ParserTool.parseLine(sentence, parser,1);
for (Parse p : topParses) {
p.show();
}
} catch (Exception e) {
}
i download file **en-parser-chunking.bin** from internet and placed in assets of android project but code stop on third line i.e ParserModel model = new ParserModel(file); without giving any exception. Need to know how can this work in android? if its not working is there any other support for nlp in android without consuming any services?
The reason the code stalls/breaks at runtime is that you need to use an InputStream instead of a File to load the binary file resource. Most likely, the File instance is null when you "load" it the way as indicated in line 2. In theory, this constructor of ParserModelshould detect this and an IOException should be thrown. Yet, sadly, the JavaDoc of OpenNLP is not precise about this kind of situation and you are not handling this exception properly in the catch block.
Moreover, the code snippet you presented should be improved, so that you know what actually went wrong.
Therefore, loading a POSModel from within an Activity should be done differently. Here is a variant that takes care for both aspects:
AssetManager assetManager = getAssets();
InputStream in = null;
try {
in = assetManager.open("en-parser-chunking.bin");
POSModel posModel;
if(in != null) {
posModel = new POSModel(in);
if(posModel!=null) {
// From here, <posModel> is initialized and you can start playing with it...
// Creating a parser
Parser parser = ParserFactory.create(model);
// Parsing the sentence
String sentence = "Tutorialspoint is the largest tutorial library.";
Parse topParses[] = ParserTool.parseLine(sentence, parser,1);
for (Parse p : topParses) {
p.show();
}
}
else {
// resource file not found - whatever you want to do in this case
Log.w("NLP", "ParserModel could not initialized.");
}
}
else {
// resource file not found - whatever you want to do in this case
Log.w("NLP", "OpenNLP binary model file could not found in assets.");
}
}
catch (Exception ex) {
Log.e("NLP", "message: " + ex.getMessage(), ex);
// proper exception handling here...
}
finally {
if(in!=null) {
in.close();
}
}
This way, you're using an InputStream approach and at the same time you take care for proper exception and resource handling. Moreover, you can now use a Debugger in case something remains unclear with the resource path references of your model files. For reference, see the official JavaDoc of AssetManager#open(String resourceName).
Note well:
Loading OpenNLP's binary resources can consume quite a lot of memory. For this reason, it might be the case that your Android App's request to allocate the needed memory for this operation can or will not be granted by the actual runtime (i.e., smartphone) environment.
Therefore, carefully monitor the amount of requested/required RAM while posModel = new POSModel(in); is invoked.
Hope it helps.
I'm trying to create a dynamic Rest client, where I can set the HTTP Method(GET-POST-PUT-DELETE), Query Params and body(Json, plain, XML), this is basically what I need, for the request I think i know how I can do it, but my concern is for reading the answer, since I know what I should get ( format) but I dont know how to read it properly, so far I return an object, below the code (only for POST, but the idea is the same):
Response responseRest = null;
Client client = null;
try {
client = new ResteasyClientBuilder().establishConnectionTimeout(TIME_OUT, TimeUnit.MILLISECONDS).socketTimeout(TIME_OUT, TimeUnit.MILLISECONDS).build();
WebTarget target = client.target(request.getUrlTarget());
MediaType type = assignResponseType(request.getTypeResponse());
switch (request.getProtocol()) {
case POST: {
if (request.getParamQuery() != null) {
for (VarRequestDTO varRequest : request.getParamQuery()) {
target = target.queryParam(varRequest.getName(), varRequest.getValue());
}
}
responseRest = target.request().post(Entity.entity(new ResponseWrapper(), type));
break;
}
default:
//HTTP METHOD No supported
}
Object result = responseRest.readEntity(Object.class);
}
catch (Exception e) {
response.setError(Boolean.TRUE);
response.setMessage(e.getMessage());
e.printStackTrace();
} finally {
if (responseRest != null) {
responseRest.close();
}
if (client != null) {
client.close();
}
}
What I basically I need is to return the object in the format needed, and where is called it's supposed to do a cast to the correct format, I just need it to be dynamic and used for any service.
Thanks
Every request that a ReST client makes to a ReST service, it passes an "Accept" header.
This is to indicate to the service the MIME-type of the resource the client is willing to accept.
In the above case, what are the acceptable formats (json/ plain text/ etc.) for you?
Depending on the "accept" format you choose, and the "Content-type" header that you receive, you can write a deserializer to accept that data and process.
Also, instead of returning an Object which is too generic, consider returning a readable Stream to the caller.
I have an issue when displaying strings received from a server in a JTable. Some specific characters appear as little white squares instead of "é" or "à" etc. I tried a lot of things but none of them fixed my problem. I'm working with Eclipse under Windows. The server was developped using Visual Studio 2010.
The server sends an XML file using tinyXML2, the client uses JDom to read it. The font used is "Dialog". The server takes the strings from an Oracle database.
I assume this is an encoding problem, but I haven't been able to fix it yet.
Does anyone have an idea ?
Thx
Arnaud
EDIT : As requested, this is how I use JDom
public static Player fromXML(Element e)
{
Player result = new Player();
String e_text = null;
try
{
e_text = e.getChildText(XMLTags.XML_Player_playerId);
if (e_text != null) result.setID(Integer.parseInt(e_text));
e_text = e.getChildText(XMLTags.XML_Player_lastName);
if (e_text != null) result.setName(e_text);
e_text = e.getChildText(XMLTags.XML_Player_point_scored);
if (e_text != null) result.addSpecial(STAT_SCORED, Double.parseDouble(e_text));
e_text = e.getChildText(XMLTags.XML_Player_point_scored_last);
if (e_text != null) result.addSpecial(STAT_SCORED_LAST, Double.parseDouble(e_text));
}
catch (Exception ex) {
ex.printStackTrace();
}
return result;
}
public static Document load(String filename) {
File XMLFile = new File(CLIENT_TO_SERVER, filename);
SAXBuilder sxb = new SAXBuilder();
Document document = new Document();
try
{
document = sxb.build(new File(XMLFile.getPath()));
} catch(Exception e){e.printStackTrace();}
return document;
}
read the file using correct encoding, something like:
document = sxb.build(new BufferedReader(new InputStreamReader(new FileInputStream(XMLFile.getPath()), "UTF8")));
Note: 1. 1st determine which char encoding used in that file. specify that charset instead of UTF8 above.
Incase encoding is not known or it's being generated from various systems with different encoding, you may use 'encoding detector library of Mozilla'. #see https://code.google.com/p/juniversalchardet/
need to handle UnsupportedEncodingException
I am new to org.xhtmlrenderer.pdf.ITextRenderer and have this problem:
The PDFs that my test servlet streams to my Downloads folder are in fact empty files.
The relevant method, streamAndDeleteTheClob, is shown below.
The first try block is definitely not a problem.
The server spends a lot of time in the second try block. No exception thrown.
Can anyone suggest a solution to this problem or a good approach to to debugging it?
Can anyone point me to essentially similar code that really works?
Any help would be much appreciated.
res.setContentType("application/pdf");
ServletOutputStream out = res.getOutputStream();
...
private boolean streamAndDeleteTheClob(int pageid,
Connection con,
ServletOutputStream out) throws IOException, ServletException {
Statement statement;
Clob htmlpage;
StringBuffer pdfbuf = new StringBuffer();
final String pageToSendQuery = "SELECT text FROM page WHERE pageid = " + pageid;
// create xhtml file as a CLOB (Oracle large character object) and stream it into StringBuffer pdfbuf
try { // definitely no problem in this block
statement = con.createStatement();
resultSet = statement.executeQuery(pageToSendQuery);
if (resultSet.next()) {
htmlpage = resultSet.getClob(1);
} else {
return true;
}
final Reader in = htmlpage.getCharacterStream();
final char[] buffer = new char[4096];
while ((in.read(buffer)) != -1) {
pdfbuf.append(buffer);
}
} catch (Exception ex) {
out.println("buffering CLOB failed: " + ex);
}
// create pdf from StringBuffer
try {
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(new InputSource(new StringReader(pdfbuf.toString())));
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(doc, null);
renderer.layout();
renderer.createPDF(out);
out.close();
} catch (Exception ex) {
out.println("streaming of pdf failed: " + ex);
}
deleteClob(con, pageid);
return false;
}
Using the DocumentBuilder.parse this way will try to resolve the DTD referenced in the XHTML page. It takes a really long time. The easyest way to aviod that if you are using the Flying Saucer (xhtmlrenderer), is to create the document this way:
Document myDocument = XMLResource.load(myInputStream).getDocument();
Note that you can use XMLResource.load with a Reader too.
Two things I can think of.
1) If the iText document is not closed, it'll be empty. Looks like renderer.finish() will work, but createPDF(out) should do that already.
2) If there are no pages, you could get an empty doc as well... so an empty input could result in a 0-byte PDF.
3) You might be getting a perfectly valid PDF that's not being streamed properly. Try writing to a ByteArrayOutputStream and checking the length there.
4) An almost fanatical dedication to the Pope!