FileNotFound while File is there - java

I am using getClassLoader().getResources to find the path for Jsoup to parse.
String path = JsoupDemo1.class.getClassLoader().getResource("student.xml").getPath();
Document document = Jsoup.parse(new File(path), "utf-8");
Elements names = document.getElementsByTag("name");
System.out.println(names.size());
My student.xml has been placed under the src folder in my module "day11_xml" and this code snippet comes from the class JsoupDemo1 in the package cn.itcast.xml.jsoup under the same module of "day11_xml". The error messages reads as follows:
java.io.FileNotFoundException:/Users/dingshun/Downloads/New%20Java%20Projects/demo/out/production/day11_xml/student.xml (No such file or directory)
I don't get it, as I can find the exact file in the given path. I'm confused, but could you guys help me out? Also, I'm new to both Java programming and this forum and if this question sounds silly or my question format is not right, please let me know.

What you're doing looks good. Maybe use the stream version JSoup.parse.
URL url = JsoupDemo1.class.getClassLoader().getResource("student.xml");
InputStream stream = JsoupDemo1.class.getClassLoader().getResourceAsStream("student.xml");
document = Jsoup.parse(stream, "utf-8", url.toURI()toString());
The documentation linked seems to imply it will work with html not xml, so maybe you need to use the other argument which provides a parser?

Actually, it turned out that Jsoup could not find my file because the path name "New%20Java%20Projects" has spaces between them. When I reload the file in a folder which has no spaces in its name, it works out just fine. So it can parse xml using parse​(File in, String charsetName) method. It seems it cannot parse path name which has spaces in it.

Related

OWL Parsing From EFO

I have been trying endlessly to parse the Experimental Factor Ontology (EFO) file, but I am not able to parse it. The file I have opens fine in Protege, but I cannot seem to get it to load in Java. I have looked at a few sets of example code, and I am copying them seemingly exactly, but I do not understand why parsing fails. Here is my code:
System.setProperty("entityExpansionLimit","100000000");
OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
URI uri = URI.create("file:~/efo.owl");
IRI iri = IRI.create(uri);
OWLOntology ontology = manager.loadOntologyFromOntologyDocument(iri);
And here are the errors I get:
Could not load ontology: Problem parsing
file:/~/efo.owl
Could not parse ontology. Either a suitable parser could not be found, or
parsing failed. See parser logs below for explanation.
The following parsers were tried:
Thank you, I know some similar posts have been made, but I have been unable to figure it out and am quite desperate! I can provide the stack trace if necessary, but it is quite long as there is a trace for each parser.
File URI need to be absolute for OWLAPI to parse them, but as you have a local file you can just create a File instance and pass that to IRI.create().
Alternatively pass the File instance to OWLOntologyManager::loadOntologyFromOntologyDocument()
There must be something wrong with the local, downloaded file. Loading the ontology directly from the ontology IRI worked.

Java saving a file with special characters in file name

I'm having a problem on Java file encoding.
I have a Java program will save a input stream as a file with a given file name, the code snippet is like:
File out = new File(strFileName);
Files.copy(inStream, out.toPath());
It works fine on Windows unless the file name contains some special characters like Ö, with these characters in the file name, the saved file will display a garbled file name on Windows.
I understand that by applying JVM option -Dfile.encoding=UTF-8 this issue can be fixed, but I would have a solution in my code rather than ask all my users to change their JVM options.
While debugging the program I can see the file name string always shows the correct character, so I guess the problem is not about internal encoding.
Could someone please explain what went wrong behind the scene? and is there a way to avoid this problem programmatically? I tried get the bytes from the string and change the encoding but it doesn't work.
Thanks.
Using the URLEncoder class would work:
String name = URLEncoder.encode("fileName#", "UTF-8");
File output = new File(name);

Microdata extraction from HTML in Java

I really need help to extract Mircodata which is embedded in HTML5. My purpose is to get structured data from a webpage just like this tool of google: http://www.google.com/webmasters/tools/richsnippets. I have searched a lot but there is no possible solution.
Currently, I use the any23 library but I can’t find any documentation, just only javadocs which dont provide enough information for me.
I use any23's Microdata Extractor but getting stuck at the third parameter: "org.w3c.dom.Document in". I can't parse a HTML content to be a w3cDom. I have used JTidy as well as JSoup but the DOM objects in these library are not fixed with the Extractor constructor. In addition, I also doubt about the 2nd parameter of the Microdata Extractor.
I hope that anyone can help me to do with any23 or suggest another library can solve this extraction issues.
Edit: I found solution myself by using the same way as any23 command line tool did. Here is the snippet of code:
HTTPDocumentSource doc = new HTTPDocumentSource(DefaultHTTPClient.createInitializedHTTPClient(), value);
InputStream documentInputInputStream = doc.openInputStream();
TagSoupParser tagSoupParser = new TagSoupParser(documentInputInputStream, doc.getDocumentURI());
Document document = tagSoupParser.getDOM();
ByteArrayOutputStream byteArrayOutput = new ByteArrayOutputStream();
MicrodataParser.getMicrodataAsJSON(tagSoupParser.getDOM(),new PrintStream(byteArrayOutput));
String result = byteArrayOutput.toString("UTF-8");
These line of code only extract microdata from HTML and write them in JSON format. I tried to use MicrodataExtractor which can change the output format to others(Rdf, turtle, ...) but the input document seems to only accept XML format. It throws "Document didn't start" when I put in a HTML document.
If anyone found the way to use MicrodataExtractor, please leave the answer here.
Thank you.
xpath is generally the way to consume html or xml.
have a look at: How to read XML using XPath in Java

How to fix Invalid byte 1 of 1-byte UTF-8 sequence

I am trying to fetch the below xml from db using a java method but I am getting an error
Code used to parse the xml
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputSource is = new InputSource(new ByteArrayInputStream(cond.getBytes()));
Document doc = db.parse(is);
Element elem = doc.getDocumentElement();
// here we expect a series of <data><name>N</name><value>V</value></data>
NodeList nodes = elem.getElementsByTagName("data");
TableID jobId = new TableID(_processInstanceId);
Job myJob = Job.queryByID(_clientContext, jobId, true);
if (nodes.getLength() == 0) {
log(Level.DEBUG, "No data found on condition XML");
}
for (int i = 0; i < nodes.getLength(); i++) {
// loop through the <data> in the XML
Element dataTags = (Element) nodes.item(i);
String name = getChildTagValue(dataTags, "name");
String value = getChildTagValue(dataTags, "value");
log(Level.INFO, "UserData/Value=" + name + "/" + value);
myJob.setBulkUserData(name, value);
}
myJob.save();
The Data
<ContactDetails>307896043</ContactDetails>
<ContactName>307896043</ContactName>
<Preferred_Completion_Date>
</Preferred_Completion_Date>
<service_address>A-End Address: 1ST HELIERST HELIERJT2 3XP832THE CABLES 1 POONHA LANEST HELIER JE JT2 3XP</service_address>
<ServiceOrderId>315473043</ServiceOrderId>
<ServiceOrderTypeId>50</ServiceOrderTypeId>
<CustDesiredDate>2013-03-20T18:12:04</CustDesiredDate>
<OrderId>307896043</OrderId>
<CreateWho>csmuser</CreateWho>
<AccountInternalId>20100333</AccountInternalId>
<ServiceInternalId>20766093</ServiceInternalId>
<ServiceInternalIdResets>0</ServiceInternalIdResets>
<Primary_Offer_Name action='del'>MyMobile Blue £44.99 [12 month term]</Primary_Offer_Name>
<Disc_Reason action='del'>8</Disc_Reason>
<Sup_Offer action='del'>80000257</Sup_Offer>
<Service_Type action='del'>A-01-00</Service_Type>
<Priority action='del'>4</Priority>
<Account_Number action='del'>0</Account_Number>
<Offer action='del'>80000257</Offer>
<msisdn action='del'>447797142520</msisdn>
<imsi action='del'>234503184</imsi>
<sim action='del'>5535</sim>
<ocb9_ARM action='del'>false</ocb9_ARM>
<port_in_required action='del'>
</port_in_required>
<ocb9_mob action='del'>none</ocb9_mob>
<ocb9_mob_BB action='del'>
</ocb9_mob_BB>
<ocb9_LandLine action='del'>
</ocb9_LandLine>
<ocb9_LandLine_BB action='del'>
</ocb9_LandLine_BB>
<Contact_2>
</Contact_2>
<Acc_middle_name>
</Acc_middle_name>
<MarketCode>7</MarketCode>
<Acc_last_name>Port_OUT</Acc_last_name>
<Contact_1>
</Contact_1>
<Acc_first_name>.</Acc_first_name>
<EmaiId>
</EmaiId>
The ERROR
org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
I read in some threads it's because of some special characters in the xml.
How to fix this issue ?
How to fix this issue ?
Read the data using the correct character encoding. The error message means that you are trying to read the data as UTF-8 (either deliberately or because that is the default encoding for an XML file that does not specify <?xml version="1.0" encoding="somethingelse"?>) but it is actually in a different encoding such as ISO-8859-1 or Windows-1252.
To be able to advise on how you should do this I'd have to see the code you're currently using to read the XML.
Open the xml in notepad
Make sure you dont have extra space at the beginning and end of the document.
Select File -> Save As
select save as type -> All files
Enter file name as abcd.xml
select Encoding - UTF-8 -> Click Save
Try:
InputStream inputStream= // Your InputStream from your database.
Reader reader = new InputStreamReader(inputStream,"UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
saxParser.parse(is, handler);
If it's anything else than UTF-8, just change the encoding part for the good one.
I was getting the xml as a String and using xml.getBytes() and getting this error. Changing to xml.getBytes(Charset.forName("UTF-8")) worked for me.
I had the same problem in my JSF application which was having a comment line containing some special characters in the XMHTL page. When I compared the previous version in my eclipse it had a comment,
//Some �  special characters found
Removed those characters and the page loaded fine. Mostly it is related to XML files, so please compare it with the working version.
I had this problem, but the file was in UTF-8, it was just that somehow on character had come in that was not encoded in UTF-8. To solve the problem I did what is stated in this thread, i.e. I validated the file:
How to check whether a file is valid UTF-8?
Basically you run the command:
$ iconv -f UTF-8 your_file -o /dev/null
And if there is something that is not encoded in UTF-8 it will give you the line and row numbers so that you can find it.
I happened to run into this problem because of an Ant build.
That Ant build took files and applied filterchain expandproperties to it. During this file filtering, my Windows machine's implicit default non-UTF-8 character encoding was used to generate the filtered files - therefore characters outside of its character set could not be mapped correctly.
One solution was to provide Ant with an explicit environment variable for UTF-8.
In Cygwin, before launching Ant: export ANT_OPTS="-Dfile.encoding=UTF-8".
This error comes when you are trying to load jasper report file with the extension .jasper
For Example
c://reports//EmployeeReport.jasper"
While you should load jasper report file with the extension .jrxml
For Example
c://reports//EmployeeReport.jrxml"
[See Problem Screenshot ][1] [1]: https://i.stack.imgur.com/D5SzR.png
[See Solution Screenshot][2] [2]: https://i.stack.imgur.com/VeQb9.png
I had a similar problem.
I had saved some xml in a file and when reading it into a DOM document, it failed due to special character. Then I used the following code to fix it:
String enco = new String(Files.readAllBytes(Paths.get(listPayloadPath+"/Payload.xml")), StandardCharsets.UTF_8);
Document doc = builder.parse(new ByteArrayInputStream(enco.getBytes(StandardCharsets.UTF_8)));
Let me know if it works for you.
I have met the same problem and after long investigation of my XML file I found the problem: there was few unescaped characters like « ».
Those like me who understand character encoding principles, also read Joel's article which is funny as it contains wrong characters anyway and still can't figure out what the heck (spoiler alert, I'm Mac user) then your solution can be as simple as removing your local repo and clone it again.
My code base did not change since the last time it was running OK so it made no sense to have UTF errors given the fact that our build system never complained about it....till I remembered that I accidentally unplugged my computer few days ago with IntelliJ Idea and the whole thing running (Java/Tomcat/Hibernate)
My Mac did a brilliant job as pretending nothing happened and I carried on business as usual but the underlying file system was left corrupted somehow. Wasted the whole day trying to figure this one out. I hope it helps somebody.
I had the same issue. My problem was it was missing “-Dfile.encoding=UTF8” argument under the JAVA_OPTION in statWeblogic.cmd file in WebLogic server.
You have a library that needs to be erased
Like the following library
implementation 'org.apache.maven.plugins:maven-surefire-plugin:2.4.3'
This error surprised me in production...
The error is because the char encoding is wrong, so the best solution is implement a way to auto detect the input charset.
This is one way to do it:
...
import org.xml.sax.InputSource;
...
InputSource inputSource = new InputSource(inputStream);
someReader(
inputSource.getByteStream(), inputSource.getEncoding()
);
Input sample:
<?xml version="1.0" encoding="utf-16"?>
<rss xmlns:dc="https://purl.org/dc/elements/1.1/" version="2.0">
<channel>
...

Get real file extension -Java code

I would like to determine real file extension for security reason.
How can I do that?
Supposing you really mean to get the true content type of a file (ie it's MIME type) you should refer to this excellent answer.
You can get the true content type of a file in Java using the following code:
File file = new File("filename.asgdsag");
InputStream is = new BufferedInputStream(new FileInputStream(file));
String mimeType = URLConnection.guessContentTypeFromStream(is);
There are a number of ways that you can do this, some more complicated (and more reliable) than others. The page I linked to discusses quite a few of these approaches.
Not sure exactly what you mean, but however you do this it is only going to work for the specific set of file formats which are known to you
you could exclude executables (are you talking windows here?) - there's some file header information here http://support.microsoft.com/kb/65122 - you could scan and block files which look like they have an exe header - is this getting close to what you mean when you say 'real file extension'?

Categories

Resources