I have the following contents in the same PDF page, in different ObjectX:
First:
[(some text)] TJ ET Q
[(some other text)] TJ ET Q
Very simple and basic so far...
The second:
[( H T M L E x a m p l e)] TJ ET Q
[( S o m e s p e c i a l c h a r a c t e r s : < ¬ ¬ ¬ & ט ט © > \\ s l a s h \\ \\ d o u b l e - s l a s h \\ \\ \\ t r i p l e - s l a s h )] TJ ET Q
NOTE: It is not noticeable in the text above, but:
'H T M L E x a m p l e' is actually 0H0T0M0L0[32]0E0x0a0m0p0l0e, where each 0 is the literal value 0 == ((char)0). If I ignore all the 0 values, this turns out to be just like the first example.
Some Bytes:
htmlexample == [0, 72, 0, 84, 0, 77, 0, 76, 0, 32, 0, 69, 0, 120, 0, 97, 0, 109, 0, 112, 0, 108, 0, 101]
<content> == [0, 32, 32, -84, 0, 32, 32, -84, 0, 32, 32, -84, 0, 32, 0, 38, 0, 32, 0, -24, 0, 32, 0, -24, 0, 32, 0, -87, 0, 32, 0]
But in the next line I need to combine every two bytes into a char because of the following:
< ¬ ¬ ¬...> is actually <0[32][32]¬0[32][32]¬0[32][32]¬...> where the combination of [32]¬ is €
The problem I'm facing is not the conversion itself; for that I use:
new String(sb.toString().getBytes("UTF-8"),"UTF-16BE")
The problem is to know when to apply it and when to keep the UTF-8.
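To make the byte pairing above concrete, here is a minimal Java sketch. The class name Utf16BeDemo and the helper decode are made up for illustration. It decodes the raw bytes directly as UTF-16BE instead of going through a String and getBytes("UTF-8"), on the assumption that the raw bytes of the string operand are available; the round trip through UTF-8 can alter byte values above 127.

```java
import java.nio.charset.StandardCharsets;

class Utf16BeDemo {
    // Interprets every two bytes as one big-endian UTF-16 code unit.
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // Bytes copied from the "htmlexample" dump above.
        byte[] htmlExample = {0, 72, 0, 84, 0, 77, 0, 76, 0, 32, 0, 69, 0, 120,
                              0, 97, 0, 109, 0, 112, 0, 108, 0, 101};
        // [32][-84] is 0x20 0xAC, i.e. U+20AC.
        byte[] euro = {32, -84};

        System.out.println(decode(htmlExample));
        System.out.println(decode(euro));
    }
}
```

This only shows the conversion itself; it does not answer when the two-byte interpretation is the right one.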
== UPDATE ==
The font used for the problematic Object is:
#7 0# {
'Name' : "F4"
'BaseFont' : "AAAAAE+DejaVuSans-Bold"
'Subtype' : "Type0"
'ToUnicode' : #41 0# {
'Filter' : "FlateDecode"
'Length' : 1679.0f
} + Stream(5771 bytes)
'Encoding' : "Identity-H"
'DescendantFonts' : [#42 0# {
'FontDescriptor' : #43 0# {
'MaxWidth' : 2016.0f
'AvgWidth' : 573.0f
'FontBBox' : [-1069.0f, -415.0f, 1975.0f, 1174.0f]
'MissingWidth' : 600.0f
'FontName' : "AAAAAE+DejaVuSans-Bold"
'Type' : "FontDescriptor"
'CapHeight' : 729.0f
'StemV' : 60.0f
'Leading' : 0.0f
'FontFile2' : #34 0# {
'Filter' : "FlateDecode"
'Length1' : 83036.0f
'Length' : 34117.0f
} + Stream(83036 bytes)
'Ascent' : 928.0f
'Descent' : -236.0f
'XHeight' : 547.0f
'StemH' : 26.0f
'Flags' : 32.0f
'ItalicAngle' : 0.0f
}
'Subtype' : "CIDFontType2"
'W' : [32.0f, [348.0f, 456.0f, 521.0f, 838.0f, 696.0f, 1002.0f, 872.0f, 306.0f, 457.0f, 457.0f, 523.0f, 838.0f, 380.0f, 415.0f, 380.0f, 365.0f], 48.0f, 57.0f, 696.0f, 58.0f, 59.0f, 400.0f, 60.0f, 62.0f, 838.0f, 63.0f, [580.0f, 1000.0f, 774.0f, 762.0f, 734.0f, 830.0f, 683.0f, 683.0f, 821.0f, 837.0f, 372.0f, 372.0f, 775.0f, 637.0f, 995.0f, 837.0f, 850.0f, 733.0f, 850.0f, 770.0f, 720.0f, 682.0f, 812.0f, 774.0f, 1103.0f, 771.0f, 724.0f, 725.0f, 457.0f, 365.0f, 457.0f, 838.0f, 500.0f, 500.0f, 675.0f, 716.0f, 593.0f, 716.0f, 678.0f, 435.0f, 716.0f, 712.0f, 343.0f, 343.0f, 665.0f, 343.0f, 1042.0f, 712.0f, 687.0f, 716.0f, 716.0f, 493.0f, 595.0f, 478.0f, 712.0f, 652.0f, 924.0f, 645.0f, 652.0f, 582.0f, 712.0f, 365.0f, 712.0f, 838.0f], 160.0f, [348.0f, 456.0f, 696.0f, 696.0f, 636.0f, 696.0f, 365.0f, 500.0f, 500.0f, 1000.0f, 564.0f, 646.0f, 838.0f, 415.0f, 1000.0f, 500.0f, 500.0f, 838.0f, 438.0f, 438.0f, 500.0f, 736.0f, 636.0f, 380.0f, 500.0f, 438.0f, 564.0f, 646.0f], 188.0f, 190.0f, 1035.0f, 191.0f, 191.0f, 580.0f, 192.0f, 197.0f, 774.0f, 198.0f, [1085.0f, 734.0f], 200.0f, 203.0f, 683.0f, 204.0f, 207.0f, 372.0f, 208.0f, [838.0f, 837.0f], 210.0f, 214.0f, 850.0f, 215.0f, [838.0f, 850.0f], 217.0f, 220.0f, 812.0f, 221.0f, [724.0f, 738.0f, 719.0f], 224.0f, 229.0f, 675.0f, 230.0f, [1048.0f, 593.0f], 232.0f, 235.0f, 678.0f, 236.0f, 239.0f, 343.0f, 240.0f, [687.0f, 712.0f, 687.0f, 687.0f, 687.0f, 687.0f, 687.0f], 247.0f, [838.0f, 687.0f], 249.0f, 252.0f, 712.0f, 253.0f, [652.0f, 716.0f]]
'Type' : "Font"
'BaseFont' : "AAAAAE+DejaVuSans-Bold"
'CIDSystemInfo' : {
'Supplement' : 0.0f
'Ordering' : "Identity" + Stream(8 bytes)
'Registry' : "Adobe" + Stream(5 bytes)
}
'DW' : 600.0f
'CIDToGIDMap' : #44 0# {
'Filter' : "FlateDecode"
'Length' : 10200.0f
} + Stream(131072 bytes)
}]
'Type' : "Font"
}
There is no indication of the font's encoding type.
== Update ==
As for the ToUnicode object: in the case of this font it is unnecessary. It should have been Identity-H, but instead it is an X == X mapping. Here are some example entries; they run from 0000 up to FFFF:
<0000> <00ff> <0000>
<0100> <01ff> <0100>
<0200> <02ff> <0200>
<0300> <03ff> <0300>
<0400> <04ff> <0400>
<0500> <05ff> <0500>
<0600> <06ff> <0600>
<0700> <07ff> <0700>
<0800> <08ff> <0800>
<0900> <09ff> <0900>
<0a00> <0aff> <0a00>
<0b00> <0bff> <0b00>
<0c00> <0cff> <0c00>
<0d00> <0dff> <0d00>
<0e00> <0eff> <0e00>
<0f00> <0fff> <0f00>
<1000> <10ff> <1000>
<1100> <11ff> <1100>
....
....
....
<fc00> <fcff> <fc00>
<fd00> <fdff> <fd00>
<fe00> <feff> <fe00>
<ff00> <ffff> <ff00>
So the mapping is not in the ToUnicode object, but still other renderers can render it well!
Any Ideas?
I use: new String(sb.toString().getBytes("UTF-8"),"UTF-16BE")
The problem is to know when to apply it and when to keep the UTF-8.
The OP assumes, probably after examining some sample PDF files, that strings in PDF content streams are encoded using either UTF-8 or UTF-16BE.
This assumption is wrong.
PDF allows some standard single-byte encodings (MacRomanEncoding, MacExpertEncoding, and WinAnsiEncoding), none of which is UTF-8. (Because these encodings overlap, especially ASCII, Latin1, and UTF-8, they are easily confused with each other when you only look at a limited sample.) Furthermore, numerous predefined multi-byte encodings are also allowed, some of which are indeed UTF-16-related.
But PDF allows completely custom encodings, both single-byte and multi-byte, to be used, too!
E.g. this text drawing operation
(ABCCD) Tj
for a simple font with this encoding:
<<
/Type /Encoding
/Differences [ 65 /H /e /l /o ]
>>
displays the word Hello!
And while this may look like an artificially constructed example, the procedure to create a custom encoding like this (i.e. by assigning codes from some start value upwards to glyphs in the order in which they first occur on the page or in the document) is fairly often used.
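The effect of such a /Differences array can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions: the names DifferencesDemo, buildEncoding and decode are hypothetical, only a single run of the array is handled, and each glyph name is taken to be the character itself, whereas real code would look glyph names up in a glyph list.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class DifferencesDemo {
    // Builds a code -> glyph name table from one /Differences run:
    // a start code followed by consecutive glyph names.
    static Map<Integer, String> buildEncoding(int startCode, String... glyphNames) {
        Map<Integer, String> encoding = new HashMap<>();
        int code = startCode;
        for (String name : glyphNames) {
            encoding.put(code++, name);
        }
        return encoding;
    }

    // Maps each single-byte code of the string operand through the table.
    static String decode(byte[] codes, Map<Integer, String> encoding) {
        StringBuilder text = new StringBuilder();
        for (byte b : codes) {
            String glyph = encoding.get(b & 0xFF);
            // Assumption: the glyph name is the character itself (true for /H /e /l /o).
            text.append(glyph != null ? glyph : "?");
        }
        return text.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> encoding = buildEncoding(65, "H", "e", "l", "o");
        byte[] operand = "ABCCD".getBytes(StandardCharsets.US_ASCII);
        System.out.println(decode(operand, encoding));
    }
}
```

The bytes of (ABCCD) are never interpreted as ASCII text here; they are only indices into the font's encoding.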
Furthermore, the OP's current solution
If your font object has a CMap, then you treat it as a UTF-16, otherwise not.
will only work for a very few documents because
a) simple fonts (using single-byte encodings) may also supply a ToUnicode CMap and
b) the CMaps of composite fonts need not be UTF-like either, but can instead use a mixed multi-byte encoding.
Thus, there is no way around an in-depth analysis of the used font information, cf. 9.5..9.9 of the PDF specification ISO 32000-1.
PS On some comments by the OP:
this: new String(sb.toString().getBytes("UTF-8"),"UTF-16BE") was an example to the how the problem is solved not a solution! The solution is done while fetching the glyphs whether I treat the data as 16-bit or 8-bit
and
the ToUnicode map is 16-bit(The only ones I've seen) per key,
The data may be mixed data. E.g. have a look at the Adobe CMap and CIDFont Files Specification; there, CMap example 9 contains the section
4 begincodespacerange
<00> <80>
<8140> <9ffc>
<a0> <de>
<e040> <fbec>
endcodespacerange
which is explained to mean
Figure 6 shows how the codespace definition in this example comprises two single-byte linear ranges of codes (<00> to <80> and <A0> to <DF>) and two double-byte rectangular ranges of codes (<8140> to <9FFC> and <E040> to <FBFC>). The first two-byte region comprises all codes bounded by first-byte values of 81 through 9F and second-byte values of 40 through FC. Thus, the input code <86A9> is within the region because both bytes are within bounds. That code is valid. The input code <8210> is not within the region, even though its first byte is between 81 and 9F, because its second byte is not within bounds. That code is invalid. The second two-byte region is similarly bounded.
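A rough Java sketch of how a parser might split a string operand into codes using such codespace ranges follows. The class and method names are hypothetical, the range table is copied from the begincodespacerange listing above, and the matching is simplified to "the first range whose per-byte bounds fit wins".

```java
import java.util.ArrayList;
import java.util.List;

class CodespaceDemo {
    // Each entry is {low bytes, high bytes}; values taken from the listing above.
    static final int[][][] RANGES = {
        {{0x00}, {0x80}},
        {{0x81, 0x40}, {0x9F, 0xFC}},
        {{0xA0}, {0xDE}},
        {{0xE0, 0x40}, {0xFB, 0xEC}},
    };

    // Returns the byte length of the code starting at pos, or -1 if no range matches.
    static int codeLength(byte[] data, int pos) {
        for (int[][] range : RANGES) {
            int len = range[0].length;
            if (pos + len > data.length) {
                continue;
            }
            boolean inside = true;
            for (int i = 0; i < len; i++) {
                int b = data[pos + i] & 0xFF;
                // Rectangular range: every byte must be within its own bounds.
                if (b < range[0][i] || b > range[1][i]) {
                    inside = false;
                    break;
                }
            }
            if (inside) {
                return len;
            }
        }
        return -1;
    }

    // Splits the operand into one- and two-byte codes.
    static List<Integer> split(byte[] data) {
        List<Integer> codes = new ArrayList<>();
        int pos = 0;
        while (pos < data.length) {
            int len = codeLength(data, pos);
            if (len < 0) {
                throw new IllegalArgumentException("invalid code at offset " + pos);
            }
            int code = 0;
            for (int i = 0; i < len; i++) {
                code = (code << 8) | (data[pos + i] & 0xFF);
            }
            codes.add(code);
            pos += len;
        }
        return codes;
    }

    public static void main(String[] args) {
        byte[] operand = {0x41, (byte) 0x86, (byte) 0xA9, (byte) 0xA5};
        System.out.println(split(operand));
    }
}
```

The point of the sketch is that the code length is decided per code from the font's CMap, not once per string or per document.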
OK, so this turned out to look complicated, and the reason for the bug was stupid, especially on my end, but there is a lesson to be learned about when to treat the chars as UTF-16 and when not to.
My problem was not in parsing the fonts, but in rendering them. According to the details specified in the Font object, you can determine the type of the font and apply the correct logic to it.
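As a rough illustration of that decision, the following Java sketch picks the code width from the Subtype and Encoding entries of a font dictionary. The class and method names are hypothetical and only the simplest cases are covered: simple fonts use one byte per code, and a Type0 font with Identity-H or Identity-V uses two. Any other composite-font CMap would need its codespace ranges inspected, which this sketch deliberately refuses to guess.

```java
class CodeWidthDemo {
    // Returns how many bytes make up one character code for the given font entries.
    static int bytesPerCode(String subtype, String encoding) {
        switch (subtype) {
            case "Type1":
            case "MMType1":
            case "TrueType":
            case "Type3":
                // Simple fonts: always single-byte codes.
                return 1;
            case "Type0":
                // Composite font: the identity CMaps use fixed two-byte codes.
                if ("Identity-H".equals(encoding) || "Identity-V".equals(encoding)) {
                    return 2;
                }
                throw new IllegalArgumentException(
                        "code length depends on the CMap's codespace ranges: " + encoding);
            default:
                throw new IllegalArgumentException("unknown font subtype: " + subtype);
        }
    }

    public static void main(String[] args) {
        // Values from the font dictionary shown in the question.
        System.out.println(bytesPerCode("Type0", "Identity-H"));
        System.out.println(bytesPerCode("TrueType", "WinAnsiEncoding"));
    }
}
```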
I am very new to Matlab and Matlab Builder JA. I am trying to build a simple Java web application that calls Matlab functions. I have a model.m file containing the functions, which I compiled into a library to run on a server. It takes two user inputs, 'Time' (e.g. 20) and 'Dose' (e.g. 1 2 0 0). I have tried to integrate all the code and include the Matlab library, but unfortunately only the .html file runs. Below are my code and the error I get. I hope someone can help me check whether I am compiling the model.m file correctly and whether the Java file is coded correctly. For your information, I am using Eclipse Kepler and Tomcat 7.0.54. Thanks in advance!
First of all, this is the error that I got when I run the application.
EDITED
HTTP Status 500-
java.lang.NullPointerException
com.mathworks.toolbox.javabuilder.internal.MWMCR.mclFeval(Native Method)
com.mathworks.toolbox.javabuilder.internal.MWMCR.access$600(MWMCR.java:23)
com.mathworks.toolbox.javabuilder.internal.MWMCR$6.mclFeval(MWMCR.java:833)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
java.lang.reflect.Method.invoke(Unknown Source)
com.mathworks.toolbox.javabuilder.internal.MWMCR$5.invoke(MWMCR.java:731)
com.sun.proxy.$Proxy11.mclFeval(Unknown Source)
com.mathworks.toolbox.javabuilder.internal.MWMCR.invoke(MWMCR.java:406)
runPKmodelV1.Function.runPKmodelV1(Function.java:217)
SecondServlet.doGet(SecondServlet.java:78)
javax.servlet.http.HttpServlet.service(HttpServlet.java:621)
javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:51)
This is the model.m file that has all the functions needed, with an output figure.
MATLAB model.m
function w = model(tf_sim,y0)
%%--------------------------------------------------------------------
% Running ODE model - Initial Conditions and setting up ode solver
%---------------------------------------------------------------------
%Initial Conditions:
% Dose_Central = 1;
% Drug_Central = 0;
% Dose_Peripheral = 0;
% Drug_Peripheral = 0;
%tf_sim=10; %Simulation time
% y0(1) = 1; % Dose in Central Compartment
% y0(2) = 0; % Drug in Central Compartment
% y0(3) = 0; % Dose in Peripheral Compartment
% y0(4) = 0; % Drug in Peripheral Compartment
options = odeset('RelTol',1e-4,'AbsTol',1e-5);
%%% Solving ODEs from t = 0 to t = tf_sim %%%
[t,y]=ode15s(@PK_Model_v1,[0 tf_sim],y0,options);
%%% Plotting state variables %%%
f = figure;
subplot(2,2,1);
plot(t,y(:,1),'r-');
subplot(2,2,2);
plot(t,y(:,2),'b-');
subplot(2,2,3);
plot(t,y(:,3),'m-');
subplot(2,2,4);
plot(t,y(:,4),'k-');
w = webfigure(f);
close(f);
end
function dy = PK_Model_v1(time,y)
%Parameter Values:
Tk0_Central = 1;
TLag_Central = 1;
Km_Central = 1;
Vm_Central = 30;
Tk0_Peripheral = 1;
TLag_Peripheral = 1;
Q12 = 1;
k12 = 1;
k21 = 1;
Central = 1;
Peripheral = 1;
ka_Central = 0.5;
ka_Peripheral = 0.1;
%Fluxes:
ReactionFlux1 = ka_Central*y(1);
ReactionFlux2 = Vm_Central*y(2)/(Km_Central+y(2));
ReactionFlux3 = ka_Peripheral*y(3);
ReactionFlux4 = (k12*y(2))*Central-(k21*y(4))*Peripheral;
dy(1,1) = -ReactionFlux1;
dy(2,1) = 1/Central*(ReactionFlux1 - ReactionFlux2 - ReactionFlux4);
dy(3,1) = -ReactionFlux3;
dy(4,1) = 1/Peripheral*(ReactionFlux3 + ReactionFlux4);
end
This is the welcome page, with all the variables to be input by users.
Eclipse page.html
<form action="SecondServlet">
<p> </p>
<p>PK Model Example</p>
<p> </p>
<p>Time</p>
<input type="text" name="tf_sim" value="" />
<p>Dosage</p>
<input type="text" name="y0" value="" />
<!-- Submit -->
<input type="submit" value="Display" name="DoPLot" />
<p> </p>
</form>
The index.jsp page contains nothing but the web figure, which is shown after the submit button on page.html is clicked.
Eclipse index.jsp
<div ALIGN="CENTER">
<wf:web-figure root="WebFigures" name="Project_Figure" scope="session" />
</div>
This servlet reads the data entered by the user, calculates the result using the Matlab library, and dispatches the result to index.jsp.
EDITED
Eclipse SecondServlet.java
protected void doGet(HttpServletRequest request,
HttpServletResponse response) throws ServletException, IOException {
Object[] param = new MWArray[2];
Object[] result = null;
param[0] = new MWNumericArray(Integer.parseInt(request
.getParameter("tf_sim")), MWClassID.DOUBLE);
String[] str_elems = request.getParameter("y0").split("\\s+");
int[] numbers = new int[str_elems.length];
for (int i = 0; i < str_elems.length; i++) {
numbers[i] = Integer.parseInt(str_elems[i]);
}
try {
result = pkModel.model(1, param);
WebFigure webFig = (WebFigure) ((MWJavaObjectRef) result[0]).get();
// Set the figure scope to session
request.getSession().setAttribute("Project_Figure", webFig);
// Bind the figure's lifetime to session
request.getSession().setAttribute("Project_Figure_Binder",
new MWHttpSessionBinder(webFig));
updateSession(request.getSession(), result);
RequestDispatcher dispatcher = request
.getRequestDispatcher("/index.jsp");
dispatcher.forward(request, response);
} catch (MWException e) {
e.printStackTrace();
} finally {
MWArray.disposeArray(result);
}
}
public void updateSession(HttpSession session, Object[] param) {
int outputCount = param.length - 1;
session.setAttribute("numOutputs", outputCount);
}
Really hope someone could help. Thanks!
Take a look at the error message:
java.lang.NumberFormatException: For input string: "1 2 0 0"
java.lang.NumberFormatException.forInputString(Unknown Source)
java.lang.Integer.parseInt(Unknown Source)
java.lang.Integer.parseInt(Unknown Source)
SecondServlet.doGet(SecondServlet.java:70)
The first line states that the problem is the number format, the third tells us that the problem was encountered in the parseInt method, and the last line tells us it happened in the doGet method.
Taking a quick look at the doGet method, we see:
param[1] = new MWNumericArray(Integer.parseInt(request
.getParameter("y0")), MWClassID.DOUBLE);
Now, judging by your earlier comments on the data format, y0 is meant to be an array of four numbers. The problem then is that Integer.parseInt is meant for parsing single numbers, not arrays of them.
To solve this you would need to add a separate step to split your input into individual numbers and parse them one at a time. Something like this:
String[] str_elems = request.getParameter("y0").split(" ");
List<Integer> int_elems = new LinkedList<Integer>();
for (int i = 0; i < str_elems.length; i++)
int_elems.add(Integer.parseInt(str_elems[i]));
Or something similar, modified to account for the input preferences of MWNumericArray, documentation for which I couldn't find.
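A self-contained variant of that idea is sketched below; it produces a plain double[] since the parameters are declared as MWClassID.DOUBLE. The class name DoseParser and the method parseDoubles are made up for illustration. How the resulting array should be handed to MWNumericArray is not shown, because that needs to be checked against the MATLAB Builder JA documentation.

```java
import java.util.Arrays;

class DoseParser {
    // Splits on any run of whitespace and parses each token as a double.
    // Throws NumberFormatException for empty input or non-numeric tokens.
    static double[] parseDoubles(String raw) {
        String[] parts = raw.trim().split("\\s+");
        double[] values = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            values[i] = Double.parseDouble(parts[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        // Same sample input as in the error message above.
        System.out.println(Arrays.toString(parseDoubles("1 2 0 0")));
    }
}
```

Splitting on "\\s+" after trim() tolerates leading, trailing and repeated spaces in the form field, which a plain split(" ") does not.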
I wrote this R function that uses the xlsx package to write one or more data frames out to a .xlsx file. When given the same input (3 data frames: 6185 obs of 23 variables, 4 of 17 and 2 of 3) it throws an error most of the time, but not all of the time.
Can anyone tell me how to optimize my code, get the same outcome more elegantly or continue in spite of the error?
Here is the console output:
Running: WriteToFile()
WriteToFile:1
WriteToFile:4
Error in: Example Report
java.lang.OutOfMemoryError: GC overhead limit exceededReturning from: Example Report
and here is the function:
WriteToFile <- function() {
# Write a data frame(s) out to .xlsx file
if(debug > 2) {message("WriteToFile:1")}
# If the file already exists today, then delete it
if(file.exists(paste0(report.name, '(', today(), ')', ".xlsx"))) {
if(debug > 2) {message("WriteToFile:2")}
writeLines(paste0("File '", report.name, '(', today(), ')', ".xlsx", "' already exists and will be replaced."))
flush.console()
file.remove(paste0(report.name, '(', today(), ')', ".xlsx"))
}
# Get tab names that were generated in ProcessOutputs()
tabs <- ls(pattern=paste0("\\.data$"), name=.GlobalEnv)
# If no tabs, send fail-mail
if(length(tabs) == 0) {
if(debug > 2) {message("WriteToFile:3")}
SendPerlMail(fail.mail=TRUE, fail.msg="No data for tabs in WriteToFile()")
return(-1)
}
# Write first tab to first sheet, then write any remaining tabs to additional sheets
else {
if(debug > 2) {message("WriteToFile:4")}
write.xlsx2(x=get(tabs[1]), file=paste0(report.name, ' (', today(), ')', ".xlsx"),
sheetName=substr(x=tabs[1], start=0, stop=nchar(tabs[1])-5), col.names=TRUE, row.names=FALSE, append=FALSE)
if(length(tabs) > 1) {
if(debug > 2) {message("WriteToFile:5")}
for(t in mget(tabs[2:length(tabs)])) {
write.xlsx2(x=t, file=paste0(report.name, ' (', today(), ')', ".xlsx"),
sheetName=substr(x=substitute(t), start=0, stop=nchar(t)-5), col.names=TRUE, row.names=FALSE, append=TRUE)
}
}
}
email.file <<- paste0(getwd(), "/", report.name, ' (', today(), ')', ".xlsx")
return(1)
}
The outside variables referenced in the function are:
options(java.parameters = "-Xms2048m -Xmx4096m", "-XX:-UseGCOverheadLimit") (I set these Java parameters because they worked for someone in another post with the same GC error)
debug = 3
report.name = "Example Report"
Example Query 1.data = data frame of 6185x23
Example Query 2.data = data frame of 4x17
Example Query 3.data = data frame of 2x3
Sometimes it outputs WriteToFile:5 to the console and sometimes it succeeds without throwing the error. Any help is greatly appreciated--I've been trying to figure out why this doesn't work reliably for a few hours now.