I'm implementing an API that reads data from a JSON response and writes the resulting objects to CSV.
Is there a way to convert an object in Java to a table format (rows and columns)?
E.g. assume I have these classes:
public class Test1 {
    private int a;
    private String b;
    private Test2 c;
    private List<String> d;
    private List<Test2> e;
    // getters-setters ...
}
public class Test2 {
    private int x;
    private String y;
    private List<String> z;
    // getters-setters ...
}
Let's say I have an instance with the following values:
Test1 c1 = new Test1();
c1.setA(11);
c1.setB("12");
c1.setC(new Test2(21, "21", Arrays.asList(new String[] {"211", "212"}) ));
c1.setD(Arrays.asList(new String[] {"111", "112"}));
c1.setE(Arrays.asList(new Test2[] {
    new Test2(31, "32"),
    new Test2(41, "42")
}));
I would like to see something like this returned as a List<Map<String, Object>> or some other object:
a    b    c.x  c.y  c.z  d    e.x  e.y
---- ---- ---- ---- ---- ---- ---- ----
11   12   21   21   211  111  31   32
11   12   21   21   211  111  41   42
11   12   21   21   211  112  31   32
11   12   21   21   211  112  41   42
11   12   21   21   212  111  31   32
11   12   21   21   212  111  41   42
11   12   21   21   212  112  31   32
11   12   21   21   212  112  41   42
I have already implemented something to achieve this result using reflection, but my solution is too slow for larger objects.
I was thinking of using an in-memory database: convert the object into a database table and then select the result, with something like MongoDB or ObjectDB. But I think that's overkill, and maybe slower than my approach. Also, these two do not support in-memory databases, and I do not want to use another disk database, since I'm already using MySQL with Hibernate. Using a RAM disk is not an option, since my server has only limited RAM. Is there an in-memory OODBMS that can do this?
I would prefer an algorithm as a solution or, even better, an existing library that can convert any object to a row-column format, something like Jackson or JAXB, which convert data to/from other formats.
Thanks for the help
Finally, after one week of banging my head against every possible thing available in my house, I managed to find a solution.
I shared the code on GitHub so that anyone who encounters this problem again can avoid a couple of migraines :)
You can get the code from here:
https://github.com/Sebb77/Denormalizer
Note: I had to use the getType() function and the FieldType enum for my specific problem.
In the future I will try to speed up the code with some caching, or something else :)
Note 2: this is just sample code that should be used only for reference. Lots of improvements can be made.
Anyone is free to use the code, just send me a thank you email :)
Any suggestions, improvements or bug reports are very welcome.
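The row multiplication shown in the question's table can also be sketched without reflection. This is a minimal sketch, not the code from the linked repository: it assumes the object has already been turned into nested Map and List values (for example by a JSON parser), and the class name Flattener and its methods are hypothetical names chosen for illustration. Nested maps extend the column name with a dot, and every list multiplies the rows built so far.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not the Denormalizer code from the repository.
final class Flattener {

    // Flattens a nested value (Map, List or scalar) into a list of rows.
    static List<Map<String, Object>> flatten(String prefix, Object value) {
        List<Map<String, Object>> rows = new ArrayList<>();
        if (value instanceof Map) {
            // Start with one empty row and cross-join it with each field in turn.
            rows.add(new LinkedHashMap<>());
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                String key = e.getKey().toString();
                String name = prefix.isEmpty() ? key : prefix + "." + key;
                rows = crossJoin(rows, flatten(name, e.getValue()));
            }
        } else if (value instanceof List) {
            // Each element contributes its own rows under the same column prefix.
            for (Object element : (List<?>) value) {
                rows.addAll(flatten(prefix, element));
            }
        } else {
            // Scalar: a single row with a single column.
            Map<String, Object> row = new LinkedHashMap<>();
            row.put(prefix, value);
            rows.add(row);
        }
        return rows;
    }

    // Combines every row on the left with every row on the right.
    private static List<Map<String, Object>> crossJoin(List<Map<String, Object>> left,
                                                       List<Map<String, Object>> right) {
        if (right.isEmpty()) {
            // An empty list adds no columns instead of wiping out all rows.
            return left;
        }
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> l : left) {
            for (Map<String, Object> r : right) {
                Map<String, Object> merged = new LinkedHashMap<>(l);
                merged.putAll(r);
                out.add(merged);
            }
        }
        return out;
    }
}
```

Calling Flattener.flatten("", root) on a map that mirrors the c1 instance from the question gives eight rows, in the same order as the table in the question. The row count is the product of all list sizes, so large nested lists grow the result quickly.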
Not sure why I'm getting this error. I installed Hadoop 2.7.3 via brew on my MBP. I think I'm running it in single-node mode.
Everything I'm asking about is from this Hadoop tutorial site. I'm getting a NumberFormatException, but it says it's "null".
First, here's the input file:
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45
There is only one space between each integer. The only odd thing is the single-digit number, but that's not null.
Next, here's the error message I get when running the program:
snip snip
snip snip
17/03/06 17:21:40 WARN mapred.LocalJobRunner: job_local1731001664_0001
java.lang.Exception: java.lang.NumberFormatException: null
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.NumberFormatException: null // complains something is null here
at java.lang.Integer.parseInt(Integer.java:454)
at java.lang.Integer.parseInt(Integer.java:527)
at com.servicenow.bigdata.ProcessUtil$E_EMapper.map(ProcessUtil.java:35)
at com.servicenow.bigdata.ProcessUtil$E_EMapper.map(ProcessUtil.java:16)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
snip snip
snip snip
Lastly, here's a snippet from the offending line/function above:
public void map(LongWritable key, Text value, // offending line #16 here
        OutputCollector<Text, IntWritable> output,
        Reporter reporter) throws IOException
{
    String line = value.toString();
    String lasttoken = null;
    StringTokenizer s = new StringTokenizer(line,"\t");
    String year = s.nextToken();
    while(s.hasMoreTokens())
    {
        lasttoken=s.nextToken();
    }
    int avgprice = Integer.parseInt(lasttoken); // offending line #35 here
    output.collect(new Text(year), new IntWritable(avgprice));
Thanks in advance for your help. Hopefully I'm not wasting people's time if this is a simple mistake.
It seems s.hasMoreTokens() is false from the start, so lasttoken remains null, hence the NumberFormatException: null when trying to parse it.
Also, if there is a space ' ' between each number and you are trying to split the tokens on a tab '\t', the whole line comes back as a single token and there won't be any further tokens.
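The effect described above can be reproduced outside Hadoop with plain Java. This is a small sketch, assuming a hypothetical class TokenizerDemo whose lastToken method mirrors the loop from the question; only the delimiter changes between the two calls.

```java
import java.util.StringTokenizer;

// Hypothetical demo class; lastToken mirrors the tokenizer loop from the question.
final class TokenizerDemo {

    // Returns the last token after the first one, or null if there is no second token.
    static String lastToken(String line, String delimiters) {
        StringTokenizer s = new StringTokenizer(line, delimiters);
        String last = null;
        s.nextToken(); // the year
        while (s.hasMoreTokens()) {
            last = s.nextToken();
        }
        return last;
    }

    public static void main(String[] args) {
        String line = "1979 23 23 2 43 24 25 26 26 26 26 25 26 25";
        // Tab delimiter on space-separated data: the whole line is one token.
        System.out.println(lastToken(line, "\t")); // null
        // Space delimiter: the last column is found.
        System.out.println(lastToken(line, " "));  // 25
    }
}
```

With the tab delimiter the result is null, and passing that to Integer.parseInt produces the NumberFormatException: null from the stack trace.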
TutorialsPoint has outdated code. It tells you to download Hadoop 1.2.1? That is several years old... Go check the official Hadoop MapReduce tutorials.
You have no tabs in your data that you copied, just spaces.
You can test that same exact code outside of MapReduce.
You can replace all that string handling with this:
if (value == null) return;
String[] splits = value.toString().split("\\s+");
String year = splits[0];
String lasttoken = splits[splits.length - 1];
Make sure that your text file uses only spaces as the delimiter.
Changing the code as follows also works:
StringTokenizer s = new StringTokenizer(line, " ");
I posted earlier today about an error I was getting with using the predict function. I was able to get that corrected, and thought I was on the right path.
I have a number of observations (actuals) and I have a few data points that I want to extrapolate or predict. I used lm to create a model, then I tried to use predict with the actual value that will serve as the predictor input.
This code is all repeated from my previous post, but here it is:
df <- read.table(text = '
Quarter Coupon Total
1 "Dec 06" 25027.072 132450574
2 "Dec 07" 76386.820 194154767
3 "Dec 08" 79622.147 221571135
4 "Dec 09" 74114.416 205880072
5 "Dec 10" 70993.058 188666980
6 "Jun 06" 12048.162 139137919
7 "Jun 07" 46889.369 165276325
8 "Jun 08" 84732.537 207074374
9 "Jun 09" 83240.084 221945162
10 "Jun 10" 81970.143 236954249
11 "Mar 06" 3451.248 116811392
12 "Mar 07" 34201.197 155190418
13 "Mar 08" 73232.900 212492488
14 "Mar 09" 70644.948 203663201
15 "Mar 10" 72314.945 203427892
16 "Mar 11" 88708.663 214061240
17 "Sep 06" 15027.252 121285335
18 "Sep 07" 60228.793 195428991
19 "Sep 08" 85507.062 257651399
20 "Sep 09" 77763.365 215048147
21 "Sep 10" 62259.691 168862119', header=TRUE)
str(df)
'data.frame': 21 obs. of 3 variables:
$ Quarter : Factor w/ 24 levels "Dec 06","Dec 07",..: 1 2 3 4 5 7 8 9 10 11 ...
$ Coupon: num 25027 76387 79622 74114 70993 ...
$ Total: num 132450574 194154767 221571135 205880072 188666980 ...
Code:
model <- lm(df$Total ~ df$Coupon, data=df)
> model
Call:
lm(formula = df$Total ~ df$Coupon)
Coefficients:
(Intercept) df$Coupon
107286259 1349
Predict code (based on previous help):
(These are the predictor values I want to use to get the predicted value)
Quarter = c("Jun 11", "Sep 11", "Dec 11")
Total = c(79037022, 83100656, 104299800)
Coupon = data.frame(Quarter, Total)
Coupon$estimate <- predict(model, newdate = Coupon$Total)
Now, when I run that, I get this error message:
Error in `$<-.data.frame`(`*tmp*`, "estimate", value = c(60980.3823396919, :
replacement has 21 rows, data has 3
My original data frame that I used to build the model had 21 observations in it. I am now trying to predict 3 values based on the model.
I either don't truly understand this function, or have an error in my code.
Help would be appreciated.
Thanks
First, you want to use
model <- lm(Total ~ Coupon, data=df)
not model <-lm(df$Total ~ df$Coupon, data=df).
Second, by saying lm(Total ~ Coupon), you are fitting a model that uses Total as the response variable, with Coupon as the predictor. That is, your model is of the form Total = a + b*Coupon, with a and b the coefficients to be estimated. Note that the response goes on the left side of the ~, and the predictor(s) on the right.
Because of this, when you ask R to give you predicted values for the model, you have to provide a set of new predictor values, i.e. new values of Coupon, not Total.
Third, judging by your specification of newdata, it looks like you're actually after a model to fit Coupon as a function of Total, not the other way around. To do this:
model <- lm(Coupon ~ Total, data=df)
new.df <- data.frame(Total=c(79037022, 83100656, 104299800))
predict(model, new.df)
Thanks Hong, that was exactly the problem I was running into. The error you get suggests that the number of rows is wrong, but the problem is actually that the model was fitted with a formula that ends up with the wrong variable names.
This is a critical detail that is entirely non-obvious for lm and similar functions. Some tutorials use lines like lm(olive$Area ~ olive$Palmitic), which end up with variable names of olive$Area, NOT Area, so a new data frame created with anewdata <- data.frame(Palmitic=2) can't then be used. If you use lm(Area ~ Palmitic, data=olive), the variable names are right and prediction works.
The real problem is that the error message does not indicate the problem at all:
Warning message: 'anewdata' had 1 rows but variable(s) found to have X rows
You are using newdate instead of newdata in your predict code; check that. Then just use Coupon$estimate <- predict(model, Coupon)
It will work.
To avoid the error, an important point about the new dataset is the name of the independent variable: it must be the same as in the model. Another way is to nest the two functions without creating a new dataset:
model <- lm(Coupon ~ Total, data=df)
predict(model, data.frame(Total=c(79037022, 83100656, 104299800)))
Pay attention to how the model is specified. The next two commands look similar, but with the predict function the first works and the second doesn't.
model <- lm(Coupon ~ Total, data=df) # OK
model <- lm(df$Coupon ~ df$Total) # not OK
Strange test failure after converting code from Lucene 3.6 to Lucene 4.1
public void testIndexPuid() throws Exception {
    addReleaseOne();
    RAMDirectory ramDir = new RAMDirectory();
    createIndex(ramDir);
    IndexReader ir = IndexReader.open(ramDir);
    Fields fields = MultiFields.getFields(ir);
    Terms terms = fields.terms("puid");
    TermsEnum termsEnum = terms.iterator(null);
    termsEnum.next();
    assertEquals("efd2ace2-b3b9-305f-8a53-9803595c0e38", termsEnum.term());
}
returns:
Expected :efd2ace2-b3b9-305f-8a53-9803595c0e38
Actual :[65 66 64 32 61 63 65 32 2d 62 33 62 39 2d 33 30 35 66 2d 38 61 35 33 2d 39 38 30 33 35 39 35 63 30 65 33 38]
It seems to be adding the field as a binary field rather than a text field, but I checked and the field is being added using the deprecated
new Field("puid", value, Field.Index.NOT_ANALYZED_NO_NORMS, new KeywordAnalyzer())
so shouldn't that work the same way as before?
Doh, my bad: I was missing utf8ToString(). The line should be: assertEquals("efd2ace2-b3b9-305f-8a53-9803595c0e38", termsEnum.term().utf8ToString()); – Paul Taylor Feb 19 at 22:20
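The "Actual" value in the failure output is not a different term: in Lucene 4, term() returns a BytesRef, and the bracketed numbers are its bytes printed in hexadecimal. As a quick check that uses only the JDK (the class HexBytesDemo and its decode method are hypothetical names, not part of Lucene), decoding those bytes as UTF-8 gives back the expected string:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not use Lucene, only the bytes from the test output.
final class HexBytesDemo {

    // Parses space-separated hexadecimal byte values and decodes them as UTF-8.
    static String decode(String hexBytes) {
        String[] parts = hexBytes.trim().split("\\s+");
        byte[] bytes = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            bytes[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String actual = "65 66 64 32 61 63 65 32 2d 62 33 62 39 2d 33 30 35 66 2d "
                + "38 61 35 33 2d 39 38 30 33 35 39 35 63 30 65 33 38";
        System.out.println(decode(actual)); // efd2ace2-b3b9-305f-8a53-9803595c0e38
    }
}
```

So the field was indexed as text as before; only the comparison needs the conversion back to a String.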
I have a weird problem.
I have an application that crawls a web page to get a list of names. This list is then passed to another application that uses those names to request information from a site through its API.
When I compare strings from the first web page with strings retrieved through the API, I often get wrong results.
I tried printing the character values letter by letter and got this:
Rocco De Nicola
82 111 99 99 111 160 68 101 32 78 105 99 111 108 97 1st web page
82 111 99 99 111 32 68 101 32 78 105 99 111 108 97 2nd
As you can see, in the first string a space is encoded as 160 (non-breaking space) instead of 32.
How can I encode the first set of strings correctly?
I have also tried setting the charset to UTF-8, but it didn't work.
Maybe I just have to replace 160 with 32?
I would first trim the strings to compare and replace the complicated characters in them, and only then call equals. This also helps when your text needs language-specific replacements. It's also a good idea to convert the resulting strings to lower case.
Normally I use something like this:
private String removeExtraCharsAndToLower(String str) {
    str=str.toLowerCase();
    str=str.replaceAll("ä", "ae");
    str=str.replaceAll("ö", "oe");
    str=str.replaceAll("ü", "ue");
    str=str.replaceAll("ß", "ss");
    return str.toLowerCase().replaceAll("[^a-z]","");
}
Using brute force: this lists all the character sets that convert 160 to 32 when encoding.
String s = "" + (char) 160;
for (Map.Entry<String, Charset> stringCharsetEntry : Charset.availableCharsets().entrySet()) {
    try {
        ByteBuffer bytes = stringCharsetEntry.getValue().encode(s);
        if (bytes.get(0) == 32)
            System.out.println(stringCharsetEntry.getKey());
    } catch (Exception ignored) {
    }
}
prints nothing.
If I change the condition to
if (bytes.get(0) != (byte) 160)
    System.out.println(stringCharsetEntry.getKey()+" "+new String(bytes.array(), 0));
I get quite a few examples.
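Since no charset maps 160 to 32, the replacement suggested at the end of the question is the direct route. Code 160 is the non-breaking space U+00A0, which in crawled pages often comes from an &nbsp; entity in the HTML (an assumption about this particular page). A minimal sketch, with SpaceNormalizer and normalizeSpaces as hypothetical names:

```java
// Hypothetical helper for illustration.
final class SpaceNormalizer {

    // Replaces every non-breaking space (code 160) with a regular space (code 32).
    static String normalizeSpaces(String s) {
        return s.replace('\u00A0', ' ');
    }

    public static void main(String[] args) {
        String fromWebPage = "Rocco\u00A0De Nicola"; // as crawled, contains code 160
        String fromApi = "Rocco De Nicola";          // as returned by the API
        System.out.println(fromWebPage.equals(fromApi));                  // false
        System.out.println(normalizeSpaces(fromWebPage).equals(fromApi)); // true
    }
}
```

Unlike the lower-casing approach above, this keeps case and non-letter characters intact and touches only the one character that differs.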
I need a hint on how to convert a huge (300-400 MB) ASCII file to a CSV file.
My ASCII file is a database with a lot of products (about 600,000 pcs = 55,200,000 lines in the file).
Below is ONE product. It is like a table row in a database, with 88 columns.
If you count the lines below, there are 92 lines.
Every '00I' followed by CR+LF indicates the start of a new row/product.
Each line ends with CR+LF.
A whole product/row ends with the following three lines:
A00
A10
A21
as shown below.
Between the starting line '00I' (plus CR+LF) and the three ending lines, we have lines starting with 2 digits (the column name); what comes after those digits is the data for the column.
If we take the first line below the starting line '00I', we see:
'0109321609'. 01 indicates that it is the column named 01, and the rest is the data stored in that column: '09321609'.
I want to strip out the two digits indicating each column name/line number, so the first line (after the starting indication '00I'), 0109321609, comes out as "09321609".
Putting it together with the next line (02), it should give output like:
"09321609","15274", etc.
At the end of a product, we want a new row.
The first line '00I' and the three last lines 'A00', 'A10' and 'A21' should not be included in the output file.
Here is what a row looks like (every line ends with CR+LF):
00I
0109321609
0215274
032
0419685
05
062
072
081
09
111
121
15
161
17
1814740
1920120401
2020120401
2120120401
22
230
240
251
26BLAHBLAH 1000MG
27
281
29
30
31BLAHBLAH 1000 mg Filmtablets Hursutacinzki
32
3336
341
350
361
371
401
410
420
43
445774
45FTA
46
47AN03AX14
48BLAHBLAH00000000000000000000010
491
501
512
522
5317
542
552
561
572
581
591
60
61
62
631
641
65
66
67
681
69
721
74884
761
771
780
790
801
811
831
851474
86
871
880
891
901
911
922
930
941
951
961
97
98
990
A00
A10
A21
Does anyone have a hint on how it can be converted?
The file is too big for a web server with PHP and MySQL to handle. My thought was to put the file in a directory on my local server, read the file, strip out the line numbers, and insert the data directly into a MySQL database on the fly, but the file is too big and the server stalls.
I'm able to run under Linux (Ubuntu) and Windows 7.
Maybe Python or Java would be recommended? I can run both; my experience with them is limited, but I'm a quick learner, so can someone give me a hint? :-)
Best Regards
Bjarke :-)
If you are absolutely certain that each entry is 92 lines long:
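import csv

# newline='' lets the csv module control the line endings of the output
with open('data.txt') as inf, open('data.csv', 'w', newline='') as outf:
    # drop the two-character column prefix and the line ending
    lines = (line[2:].rstrip() for line in inf)
    # zip over 92 references to the same iterator to read 92 lines per product,
    # then keep the 88 data lines (skip '00I' and the three trailer lines)
    rows = (data[1:89] for data in zip(*([lines] * 92)))
    csv.writer(outf).writerows(rows)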
In Python it should look like this:
import csv

with open('eg.txt', 'r') as f, open('out.csv', 'w', newline='') as out:
    fo = csv.writer(out)
    for line in f:
        # every product starts with a '00I' line
        assert line[:3] == '00I'
        buf = []
        for i in range(88):
            # read the 88 data lines, dropping the two-digit column prefix
            line = next(f)
            buf.append(line.strip()[2:])
        # every product ends with the three trailer lines
        line = next(f)
        assert line[:3] == 'A00'
        line = next(f)
        assert line[:3] == 'A10'
        line = next(f)
        assert line[:3] == 'A21'
        fo.writerow(buf)
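Since the question also mentions Java, the same conversion can be sketched there as a line-by-line stream, so the whole file is never held in memory. This is a sketch under assumptions: the class RecordToCsv and its methods are hypothetical, the file is read as ISO-8859-1, and records are delimited by the '00I' and 'A21' marker lines instead of by counting 92 lines. Every field is quoted, as in the output the question asks for.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical converter for illustration.
final class RecordToCsv {

    // Reads product records line by line and writes one quoted CSV row per record.
    static void convert(BufferedReader reader, PrintWriter writer) {
        List<String> fields = new ArrayList<>();
        Iterator<String> lines = reader.lines().iterator();
        while (lines.hasNext()) {
            String line = lines.next();
            if (line.startsWith("00I")) {
                fields.clear(); // start of a new product
            } else if (line.startsWith("A21")) {
                // last trailer line: the row is complete
                writer.print(String.join(",", fields));
                writer.print("\r\n");
            } else if (line.startsWith("A00") || line.startsWith("A10")) {
                // other trailer lines are skipped
            } else if (line.length() >= 2) {
                // drop the two-digit column name, keep the data (possibly empty)
                fields.add(quote(line.substring(2)));
            }
        }
        writer.flush();
    }

    // Wraps a value in double quotes, doubling any embedded quotes.
    static String quote(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Usage: java RecordToCsv.java input.txt output.csv
    public static void main(String[] args) throws IOException {
        // ISO-8859-1 is an assumption; adjust it to the real encoding of the file.
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.ISO_8859_1);
             PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get(args[1]), StandardCharsets.ISO_8859_1))) {
            convert(in, out);
        }
    }
}
```

Like the Python versions, this places fields by position, so it only lines up as a table if every product contains the same set of column lines.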