Reading an HTML page with MATLAB - Java

I would like to read some HTML using MATLAB.
I've already tried urlread, but got the URL read error described in "Getting data into MATLAB from HTTPS".
So I tried using Java, following "Handling an invalid security certificate using MATLAB's urlread command".
Unfortunately, I don't know how to use Java with MATLAB.
I tried this code, and it seems to work:
url = 'https://stackoverflow.com/questions/11053664/use-java-in-matlab';
is = java.net.URL([], url ).openConnection().getInputStream();
br = java.io.BufferedReader(java.io.InputStreamReader(is));
str = char(br.readLine());
However, it only reads the first line, and I would like to get the whole HTML page so I can use regexp on it.
My kingdom for some help!

There is a function in MATLAB that does that. Its name is urlread!
See http://www.mathworks.com/matlabcentral/answers/973
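If urlread keeps failing on the HTTPS address, the Java route from the question can be extended to read every line rather than only the first. Below is a minimal sketch in plain Java. The class name PageReader and its methods are made up for illustration, and UTF-8 is an assumed page encoding. The same calls can be issued from MATLAB one by one, as the question already does for readLine.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads a whole text source line by line.
final class PageReader {

    // Collects every line from the reader, ending each with '\n'.
    static String readAll(Reader source) {
        StringBuilder page = new StringBuilder();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            // readLine() returns null once the stream is exhausted.
            while ((line = br.readLine()) != null) {
                page.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return page.toString();
    }

    // Opens the address and reads the full response body.
    // Assumption: the page is encoded as UTF-8.
    static String readUrl(String address) {
        try {
            return readAll(new InputStreamReader(
                    URI.create(address).toURL().openStream(),
                    StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this, PageReader.readUrl(url) would return the page as one string, ready for regexp. Certificate problems like the one in the linked article are not handled here.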

Related

Using the created document through FPDF with PHP/Java

I created a PDF document with PHP using FPDF. The next thing I want to do is print the document silently, without downloading the PDF file to the computer.
I've made the following code:
$pdfprintable = $pdf->Output(''.'.pdf','S');
$printcmd = "java -classpath jPDFPrint.jar;pdfprintcli.jar cli.PDFPrintCLI $pdfprintable";
exec($printcmd);
And it returns the following error message:
Warning: exec(): NULL byte detected. Possible attack in C:\Users\Jordy\Desktop\XAMPP\htdocs\php\stickers\pdf.php on line 392
If I echo the $pdfprintable in PHP it shows a lot of weird characters.

Are you sure the java command is supposed to receive the PDF contents as a string? With the 'S' option, Output returns the whole document as a string of raw bytes, which is why exec() complains about a NULL byte.
Use the 'F' option instead, which saves the PDF to a file:
$pdfprintable = $pdf->Output('USEAFULLPATHTOFILE.pdf','F');
With the above, the PDF is written to disk, and you can then try to print that file with the Java application, if that one works.
Also, if you are loading the PDF correctly in FPDF, you should be able to use the 'D' option in ->Output, which sends the file to the browser as a download:
$pdfprintable = $pdf->Output('USEAFULLPATHTOFILE.pdf','D');
Use this to verify that the PDF is loaded and handled correctly by FPDF.
Also note that your example code is very limited.
If you need more troubleshooting, please show the Java code and the full PHP source relevant to the printing operation and to the loading or creation of the PDF in FPDF.
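The underlying point is that raw PDF bytes cannot be passed as a command-line argument, while a file path can. As a rough sketch of that idea in Java: write the bytes to a temporary file and hand the path over as its own argument. The class name PdfPrintCommand is hypothetical, the jar and class names are copied from the question's command line, and whether cli.PDFPrintCLI really takes a file path as its only argument is an assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: stores PDF bytes on disk and builds a print command.
final class PdfPrintCommand {

    // Writes the PDF bytes to a temporary file and returns its path.
    static Path saveToTempFile(byte[] pdfBytes) {
        try {
            Path file = Files.createTempFile("print-", ".pdf");
            Files.write(file, pdfBytes);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds the argument list. Each element is one argument, so no
    // binary data and no shell quoting is involved.
    // Assumption: the CLI accepts the PDF path as its single argument.
    static List<String> buildCommand(Path pdfFile) {
        return List.of(
                "java",
                "-classpath", "jPDFPrint.jar;pdfprintcli.jar",
                "cli.PDFPrintCLI",
                pdfFile.toAbsolutePath().toString());
    }
}
```

The resulting list could then be started with new ProcessBuilder(command).inheritIO().start(). Nothing is executed in the sketch itself.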

iMacros - read a text file using JavaScript

I want to read a text file using JavaScript, without using an iMacros loop.
I tried to read it using Java, with no luck:
function frdln(n) {
    var fr, s = '';
    try {
        fr = new java.io.BufferedReader(new java.io.FileReader(n));
        s = fr.readLine();
        if (s == null) { s = ''; } else { s = '' + s; }
        fr.close();
        fr = null;
    } catch (e) {
        alert('' + e);
    }
    return s;
}
It gives me the error message "ReferenceError: java is not defined".
Note: I installed the latest version of Java and the same error appears.
Is there any other way to read a text file, or a way to fix my code? I have no idea.

Using JavaScript, this can work with XMLHttpRequest, but XMLHttpRequest is no longer available by default in Firefox 15+. You have to define it yourself:
const XMLHttpRequest = Components.Constructor("@mozilla.org/xmlextras/xmlhttprequest;1");
var request = XMLHttpRequest();

Install Firefox 15 and don't update it, and that function will work. Java is not supported in FF16+, but it works in FF15.

Getting wrong answer from MediaWiki engine when passing Hebrew parameters

I'm making an Android version of a Hebrew website that uses the MediaWiki engine, but when I try to get some data via its API using Hebrew title names, I get a wrong answer.
For example, if I request this URL:
http://www.some-web-site.co.il/w/he/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles="HEBREW_TITLE"
the API responds that the title is missing. However, if I pass a string like this:
%D7%A1%D7%99%D7%95%D7%A2_%D7%91%D7%A8%D7%9B%D7%99%D7%A9%D7%AA_%D7%9E%D7%9B%D7%A9%D7%99%D7%A8%D7%99_%D7%94%D7%9C%D7%99%D7%9B%D7%94
I get the right response. I got this string by copy-pasting the URL from the browser. So my question is: how can I convert a Hebrew topic name to a string in this format using Java?
Thanks

Try
String title = "THE_HEBREW_TITLE";
String encodedTitle = URLEncoder.encode(title , "UTF-8");
and use encodedTitle to compose the URL you are using to query the web service.
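To make the suggestion concrete, here is a small sketch that percent-encodes a title as UTF-8 and appends it to the query URL from the question. The class name WikiTitleEncoder is invented for illustration, and the endpoint is the placeholder address from the question. Replacing spaces with underscores is an assumption based on the underscore-separated string in the question; URLEncoder on its own would turn a space into "+".

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds a MediaWiki API query for one page title.
final class WikiTitleEncoder {

    // Placeholder endpoint copied from the question.
    static final String API =
            "http://www.some-web-site.co.il/w/he/api.php"
            + "?action=query&prop=revisions&rvprop=content&format=xml&titles=";

    // Percent-encodes the title as UTF-8. Underscores are left untouched
    // by URLEncoder, so spaces are mapped to underscores first.
    static String encodeTitle(String title) {
        try {
            return URLEncoder.encode(title.replace(' ', '_'), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Full request URL for the given (unencoded) title.
    static String queryUrl(String title) {
        return API + encodeTitle(title);
    }
}
```

Each Hebrew letter becomes two percent-encoded bytes, which matches the shape of the working string copied from the browser.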

Non-English characters are decoded incorrectly on Android with HtmlCleaner

I'm using HtmlCleaner to scrape an ISO-8859-1 encoded web site in Android.
I've implemented this in an external jar file that I import into my Android app.
When I run the unit tests in Eclipse, it handles Norwegian letters (æ, ø, å) correctly (I can verify that in the debugger), but in the Android app these characters look like inverted question marks.
If I attach the debugger to my Android app I can see that these letters are not correct in the exact same places they were good when running unit test from Eclipse, so it's not a display/render/view issue in the Android app.
When I copy the text from the debuggers I get these results:
Java Process (Unit Test): «Blårek», «Benny»
Android Process (In emulator): «Bl�rek», «Benny»
I would expect these strings to be equal, but notice how the "å" is replaced by an inverted question mark on Android.
I have tried running htmlCleaner.getProperties().setRecognizeUnicodeChars(true) without any luck. Also, I found no way of forcing UTF-8 or ISO-8859-1 encoding in HtmlCleaner, but I'm not sure whether that would have made a difference.
Here is the code I run:
HtmlCleaner htmlCleaner = new HtmlCleaner();
// connect to url and get root TagNode from HtmlCleaner
InputStream is = new URL( url ).openConnection().getInputStream();
TagNode rootNode = htmlCleaner.clean( is );
// navigate through some TagNodes, getting the ContentNode
ContentNode cn = rootNode...
// This String contains the incorrectly decoded characters on Android.
// Good in Oracle JVM though..
String value = cn.toString().trim();
Does anyone know what could cause the decoding behavior to be different on Android? I guess the main difference between the two environments is that the Android app uses Android's java.io stack, while my unit tests use Sun/Oracle's stack.
Thanks,
Geir

HtmlCleaner can't tell what encoding to use; you are passing only the body of the response in the InputStream, but the encoding is in the "content-type" header.
You can set the character encoding on the properties of the HtmlCleaner to the correct encoding from the HTTP connection. But that would require you to parse the correct parameter from the content-type header. Alternatively, you can pass a URL instance to HtmlCleaner and let it manage the connection. Then, it will have access to all the information it needs to decode properly.
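The first option, parsing the charset parameter out of the content-type header, might look roughly like the sketch below. It uses only the standard library. The class name ResponseCharset is hypothetical, and the ISO-8859-1 fallback is an assumption chosen because the site in the question uses that encoding. How the resulting charset is handed to HtmlCleaner (for example through an overload that accepts a charset name or a Reader) depends on the library version and is not shown.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper: picks the charset named in a Content-Type header.
final class ResponseCharset {

    // Returns the charset from e.g. "text/html; charset=ISO-8859-1",
    // or the fallback when none is given or the name is unknown.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length())
                        .replace("\"", "").trim();
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // Decodes a response body with the charset taken from the header.
    // Assumption: ISO-8859-1 is used when the header names no charset.
    static String decode(byte[] body, String contentType) {
        Charset cs = fromContentType(contentType, StandardCharsets.ISO_8859_1);
        return new String(body, cs);
    }
}
```

In the question's code, the header value would come from the URLConnection's getContentType() before the stream is passed on, so that decoding no longer depends on the platform default.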

Running a JavaScript command from MATLAB to fetch a PDF file

I'm currently writing some MATLAB code to interact with my company's internal reports database. So far I can access the HTML abstract page using code which looks like this:
import com.mathworks.mde.desk.*;
wb=com.mathworks.mde.webbrowser.WebBrowser.createBrowser;
wb.setCurrentLocation(ReportURL(8:end));
pause(1);
s={};
while isempty(s)
s=char(wb.getHtmlText);
pause(.1);
end
desk=MLDesktop.getInstance;
desk.removeClient(wb);
I can extract out various bits of information from the HTML text which ends up in the variable s, however the PDF of the report is accessed via what I believe is a JavaScript command (onClick="gotoFulltext('','[Report Number]')").
Any ideas as to how I execute this JavaScript command and get the contents of the PDF file into a MATLAB variable?
(MATLAB sits on top of Java, so I believe a Java solution would work...)

I think you should take a look at the JavaScript that is being called and see what the final request to the webserver looks like.
You can do this quite easily in Firefox using the FireBug plugin.
https://addons.mozilla.org/en-US/firefox/addon/1843
Once you have found the real server request then you can just request this URL or post to this URL instead of trying to run the JavaScript.

Once you have gotten the correct URL (a la the answer from pjp), your next problem is to "get the contents of the PDF file into a MATLAB variable". Whether or not this is possible may depend on what you mean by "contents"...
If you want to get the raw data in the PDF file, I don't think there is a way currently to do this in MATLAB. The URLREAD function was the first thing I thought of to read content from a URL into a string, but it has this note in the documentation:
s = urlread('url') reads the content at a URL into the string s. If the server returns binary data, s will be unreadable.
Indeed, if you try to read a PDF as in the following example, s contains some text intermingled with mostly garbage:
s = urlread('http://samplepdf.com/sample.pdf');
If you want to get the text from the PDF file, you have some options. First, you can use URLWRITE to save the contents of the URL to a file:
urlwrite('http://samplepdf.com/sample.pdf','temp.pdf');
Then you should be able to use one of two submissions on The MathWorks File Exchange to extract the text from the PDF:
Extract text from a PDF document by Dimitri Shvorob
PDF Reader by Tom Gaudette
If you simply want to view the PDF, you can just open it in Adobe Acrobat with the OPEN function:
open('temp.pdf');
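Since MATLAB can call Java classes, one possible route to the raw bytes is to read the stream in Java as bytes instead of going through a string. The following is only a sketch of that idea, not something taken from the answer above. The class name BinaryFetcher is hypothetical, and using it from MATLAB assumes the compiled class has been added to the Java class path.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;

// Hypothetical helper: downloads a resource without any text decoding.
final class BinaryFetcher {

    // Copies the stream into a byte array and closes the stream.
    static byte[] readAllBytes(InputStream in) {
        try (InputStream source = in;
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            // read() returns -1 at end of stream.
            while ((n = source.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Fetches the content at the address as raw bytes.
    static byte[] fetch(String address) {
        try {
            return readAllBytes(URI.create(address).toURL().openStream());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because nothing is decoded as text, bytes such as 0x00 or 0xFF survive unchanged, which is where a string-based read goes wrong for binary data.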

wb=com.mathworks.mde.webbrowser.WebBrowser.createBrowser;
wb.executeScript('javascript:alert(''Some code from a link'')');
desk=com.mathworks.mde.desk.MLDesktop.getInstance;
desk.removeClient(wb);
