I want to read a Python dictionary string using Java. Example string:
{'name': u'Shivam', 'otherInfo': [[0], [1]], 'isMale': True}
This is not valid JSON. I want to convert it into proper JSON using Java code.
Well, the best way would be to pass it through a Python script that reads that data and outputs valid JSON:
>>> json.dumps(ast.literal_eval("{'name': u'Shivam', 'otherInfo': [[0], [1]], 'isMale': True}"))
'{"name": "Shivam", "otherInfo": [[0], [1]], "isMale": true}'
so you could create a script that only contains:
import sys, json, ast; print(json.dumps(ast.literal_eval(sys.argv[1])))
then you can make it a python oneliner like so:
python -c "import sys, ast, json ; print(json.dumps(ast.literal_eval(sys.argv[1])))" "{'name': u'Shivam', 'otherInfo': [[0], [1]], 'isMale': True}"
that you can run from your shell, meaning you can run it from within java the same way:
String PythonData = "{'name': u'Shivam', 'otherInfo': [[0], [1]], 'isMale': True}";
String[] cmd = {
    "python", "-c", "import sys, ast, json ; print(json.dumps(ast.literal_eval(sys.argv[1])))",
    PythonData
};
Runtime.getRuntime().exec(cmd);
and as output you'll have a proper JSON string.
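To actually get that JSON string back into Java, the child's stdout has to be read. The following is a minimal sketch rather than a definitive implementation: the class name PythonDictToJson and its methods are made up for illustration, and it assumes a python executable is on the PATH.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class PythonDictToJson {
    // Same one-liner as above; the dict string travels as argv[1], never through a shell.
    static final String SCRIPT =
        "import sys, ast, json ; print(json.dumps(ast.literal_eval(sys.argv[1])))";

    // Builds the argument vector (assumption: "python" resolves on the PATH).
    static List<String> buildCommand(String pythonData) {
        return Arrays.asList("python", "-c", SCRIPT, pythonData);
    }

    // Runs the interpreter once and returns whatever it printed.
    static String toJson(String pythonData) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(pythonData));
        pb.redirectErrorStream(true); // merge stderr so the child cannot block on a full pipe
        Process p = pb.start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                out.append(line);
            }
        }
        int rc = p.waitFor();
        if (rc != 0) {
            throw new IOException("python exited with " + rc + ": " + out);
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
            toJson("{'name': u'Shivam', 'otherInfo': [[0], [1]], 'isMale': True}"));
    }
}
```

Passing the data as a separate argument, instead of pasting it into the script text, is what keeps the quoting simple here.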
This solution is the most reliable way I can think of: it safely parses any Python literal syntax (since it uses the Python parser to do so) without opening a window for code injection.
But I wouldn't recommend using it, because you'd be spawning a Python process for each string you parse, which would be a performance killer.
As an improvement on that first answer, you could use Jython to run that Python code in the JVM for a bit more performance.
PythonInterpreter interpreter = new PythonInterpreter();
// exec, not eval: the import and the assignment are statements, not expressions
interpreter.exec("import json, ast");
interpreter.exec("to_json = lambda d: json.dumps(ast.literal_eval(d))");
PyObject ToJson = interpreter.get("to_json");
PyObject result = ToJson.__call__(new PyString(PythonData));
String realResult = (String) result.__tojava__(String.class);
The above is untested (so it's likely to fail and spawn dragons 👹) and I'm pretty sure you can make it more elegant. It's loosely adapted from this answer. I'll leave it up to you as an exercise to see how you can include the Jython environment in your Java runtime ☺.
P.S.: Another solution would be to try and fix every pattern you can think of using a gigantic regexp or multiple ones. Even if that might work in simpler cases, I would advise against it: regex is the wrong tool for the job, as it won't be expressive enough and you'll never be comprehensive. It's only a good way to plant the seed of a bug that'll kill you at some point in the future.
P.S.2: Whenever you need to parse code from an external source, always make sure that data is sanitized and safe. Never forget about Little Bobby Tables.
In addition to the other answer: it is straightforward to simply invoke that Python one-liner to "translate" a Python dict string into a standard JSON string.
But spawning a new Process for each row in your database might quickly turn into a performance killer.
Thus there are two options that you should consider on top of that:
Establish a small "Python server" that keeps running; its only job is to do that translation for JVMs that connect to it.
Look into Jython, meaning: simply enable your JVM to run Python code. In other words, instead of writing your own Python dict string parser, you add "Python powers" to your JVM and rely on existing components to do that translation for you.
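The first option can be sketched without any networking: a single Python child process is kept alive and fed one dict string per line over its stdin. This is only an illustration under assumptions: PythonTranslator, WORKER and workerCommand are hypothetical names, python is assumed to be on the PATH, and each input must fit on a single line.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class PythonTranslator implements AutoCloseable {
    // Worker loop: read a line, parse it as a Python literal, print JSON, flush.
    static final String WORKER = String.join("\n",
        "import sys, ast, json",
        "for line in iter(sys.stdin.readline, ''):",
        "    print(json.dumps(ast.literal_eval(line.strip())))",
        "    sys.stdout.flush()");

    static List<String> workerCommand() {
        return Arrays.asList("python", "-u", "-c", WORKER);
    }

    private final Process process;
    private final BufferedWriter toPython;
    private final BufferedReader fromPython;

    PythonTranslator() throws IOException {
        process = new ProcessBuilder(workerCommand())
            .redirectError(ProcessBuilder.Redirect.INHERIT) // tracebacks go to our stderr
            .start();
        toPython = new BufferedWriter(
            new OutputStreamWriter(process.getOutputStream(), StandardCharsets.UTF_8));
        fromPython = new BufferedReader(
            new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8));
    }

    // One request, one response line; synchronized so callers cannot interleave.
    synchronized String toJson(String pythonData) throws IOException {
        toPython.write(pythonData);
        toPython.newLine();
        toPython.flush();
        String json = fromPython.readLine();
        if (json == null) {
            throw new IOException("python worker exited (bad input?)");
        }
        return json;
    }

    @Override
    public void close() throws IOException {
        toPython.close();  // EOF on stdin ends the worker loop
        process.destroy();
    }

    public static void main(String[] args) throws Exception {
        try (PythonTranslator t = new PythonTranslator()) {
            System.out.println(t.toJson("{'name': u'Shivam', 'isMale': True}"));
            System.out.println(t.toJson("{'otherInfo': [[0], [1]]}"));
        }
    }
}
```

In this sketch a malformed input makes the worker exit, so a real service would probably want to catch the exception on the Python side and report it per request.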
My problem is somewhat similar to this question.
I try to communicate from Python with a Jython program (that needs to keep running because it communicates with a Java API).
However, I can't get the output in real time, whatever I try:
p2=Popen([path_jython],stdin=PIPE, stdout=PIPE, stderr=PIPE, shell=False)
p2.stdin.write('print test\n')
p2.stdin.flush()
for line in p2.stdout:
    print line
Nothing happens, the program blocks. It doesn't change when I iterate over p2.stdout.readlines() or iter(p2.stdout.readline, '') or when I repeatedly call p2.stdout.read(1) as suggested in the linked question.
If I add print p2.communicate() after the flush(), however, I get the desired output:
>>> ('hallo\r\n', '\r\n')
but the program terminates afterwards...
Does anyone have a solution for this problem? Or are there alternatives to communicate effectively with a running Jython process? Happy for any advice!
EDIT: Python 2.7.5, Jython 2.7.1b
Try to make Jython's stdout/stderr unbuffered: pass the -u command-line argument.
If you set stderr=PIPE then you should read it, otherwise a deadlock may occur if the child process fills the OS pipe buffer corresponding to its stderr. You could redirect it to stdout: stderr=STDOUT.
Set Python-side buffering explicitly: bufsize=1 (line-buffered).
Use iter(p.stdout.readline, '') to avoid the read-ahead bug on Python 2.
from subprocess import Popen, PIPE, STDOUT
p = Popen(['jython', '-u', ...], stdin=PIPE, stdout=PIPE, stderr=STDOUT,
          bufsize=1)
print >>p.stdin, 'test="hallo"' #NOTE: it uses `os.linesep`
print >>p.stdin, 'print test'
p.stdin.close() # + implicit .flush()
for line in iter(p.stdout.readline, ''):
    print line, #NOTE: comma -- softspace hack to avoid duplicate newlines
rc = p.wait()
This is similar to:
Printing to the console vs writing to a file (speed)
I was confused because there are two conflicting answers. I wrote a simple java program
for(int i=0; i<1000000; i++){
and ran it with /usr/bin/time -v java test to measure the time to output to stdout, then I tried /usr/bin/time -v java test > file and /usr/bin/time -v java test > /dev/null. Writing to the console was slowest (10 seconds), then the file (6 seconds), and /dev/null was fastest (2 seconds). Why?
Because writing to console needs to refresh the screen each time something is written, which takes time.
Writing to a file needs to write bytes on disk, which takes time, but less time than refreshing the screen.
And writing to /dev/null doesn't write anything anywhere, which takes much less time.
Another problem with System.out.println is that System.out is in autoflush mode by default, so println practically switches off buffering. Try this:
PrintWriter a = new PrintWriter(System.out, false);
for (int i = 0; i < 1000000; i++) {
and you will see that output to a file becomes ten times faster.
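The effect of autoflush can be made visible without timing anything, by counting how often the underlying stream gets flushed. This is a small sketch, not a benchmark: FlushCount and CountingSink are hypothetical names, and the sink simply discards its bytes.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.PrintWriter;

class FlushCount {
    // Hypothetical sink: throws the bytes away but counts flush() calls.
    static class CountingSink extends OutputStream {
        int flushes = 0;
        @Override public void write(int b) { /* discard */ }
        @Override public void flush() { flushes++; }
    }

    // A PrintStream with autoflush on, which is the mode the answer attributes to System.out.
    static int flushesWithAutoFlush(int lines) {
        CountingSink sink = new CountingSink();
        PrintStream out = new PrintStream(sink, true);
        for (int i = 0; i < lines; i++) {
            out.println(i);
        }
        out.flush();
        return sink.flushes;
    }

    // A PrintWriter with autoflush off: output is buffered until flush() is called.
    static int flushesWithBufferedWriter(int lines) {
        CountingSink sink = new CountingSink();
        PrintWriter out = new PrintWriter(sink, false);
        for (int i = 0; i < lines; i++) {
            out.println(i);
        }
        out.flush();
        return sink.flushes;
    }

    public static void main(String[] args) {
        int n = 1000000;
        System.out.println("autoflush PrintStream: " + flushesWithAutoFlush(n) + " flushes");
        System.out.println("buffered PrintWriter:  " + flushesWithBufferedWriter(n) + " flushes");
    }
}
```

With a real file or terminal behind the stream, each of those flushes can turn into a system call, which is where the time goes.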
I have a Fortran executable that takes an input file and produces an output file by manipulating the input. I can run the command in a Linux terminal (I think a Fortran compiler is available on the machine). How can I run this Fortran executable from Java on a Linux machine?
What I attempted is:
String cmd="fortranExe arg1 arg2";
//fortranExe=exe path
//arg1,arg2 are arguments to fortran executable program
Process p=Runtime.getRuntime().exec(cmd);
But I am not getting any output. When I run Linux commands such as ls or dir the same way, they do give output. Is anything else required to run Fortran code from Java?
Try using something like this
Process process = new ProcessBuilder("C:\\PathToExe\\fortran.exe","param1","param2").start();
No, if it's a regular binary for the platform you are running your JVM on, it shouldn't matter.
How do you run the binary when you start it from the console?
Once the process was generated successfully, you can read its stdout like this:
BufferedReader br = new BufferedReader(new InputStreamReader(process.getInputStream()));
String line;
while ((line = br.readLine()) != null)
    System.out.println(line);
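If the process could not be started at all, e.g. because the binary is missing or you have no permission to run it, start() or exec() throws an IOException. When the command goes through a shell instead, such failures typically show up as a process.exitValue() of 126 or 127.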
Seeing the other comments, I can see that you are using redirected input/output with your binary.
So in fact there are no parameters; you need to open InputFileName.txt and use the process.getOutputStream() object to write its contents to your process. There is no need to pass OutputFilename.txt, because you read the output from the InputStream and can write it to a file yourself if necessary.
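As an alternative to copying the streams by hand, ProcessBuilder can attach files to the child's stdin and stdout directly, which mirrors the shell's < and > redirection. A minimal sketch, assuming the executable reads stdin and writes stdout; RunFortran, configure and run are hypothetical names, and the paths are supplied by the caller.

```java
import java.io.File;
import java.io.IOException;

class RunFortran {
    // Prepares, but does not start, the process with file redirection in place.
    static ProcessBuilder configure(String exePath, File inputFile, File outputFile) {
        ProcessBuilder pb = new ProcessBuilder(exePath);
        pb.redirectInput(inputFile);    // like "< InputFileName.txt" in the shell
        pb.redirectOutput(outputFile);  // like "> OutputFilename.txt" in the shell
        pb.redirectErrorStream(true);   // error messages end up in the same file
        return pb;
    }

    // Starts the executable and blocks until it exits; returns its exit code.
    static int run(String exePath, File inputFile, File outputFile)
            throws IOException, InterruptedException {
        return configure(exePath, inputFile, outputFile).start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Usage: java RunFortran <exe path> <input file> <output file>
        int rc = run(args[0], new File(args[1]), new File(args[2]));
        System.out.println("exit code: " + rc);
    }
}
```

Because the operating system does the copying, there is no pipe buffer on the Java side that could fill up and stall the child.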
You can even call individual pieces, like subroutines of your Fortran program, directly from Java. JNA is generally used to invoke C/C++ libraries and is well suited to this type of use case, and I guess your use case fits well here.
I am looking for a Java implementation of a sorting algorithm. The file could be HUGE, say 20000*600=12,000,000 lines of records. Each line is comma delimited with 37 fields, and we use 5 fields as keys. Is it possible to sort it quickly, say in 30 minutes?
If you have an approach other than Java, it is welcome as long as it can be easily integrated into a Java system, for example a Unix utility.
Edit: The lines that need to be sorted are spread across 600 files with 20000 lines each, 4 MB per file. In the end I would like them to be 1 big sorted file.
I am trying to time a Unix sort and will update afterwards.
I appended all the files into a big one and tried the Unix sort function; it is pretty good. The time to sort a 2 GB file is 12-13 minutes. The append action required 4 minutes for 600 files.
sort -t ',' -k 1,1 -k 4,7 -k 23,23 -k 2,2r big.txt -o sorted.txt
How does the data get in the CSV format? Does it come from a relational database? You can make it such that whatever process creates the file writes its entries in the right order so you don't have to solve this problem down the line.
If you are doing a simple lexicographic order you can try the unix sort, but I am not sure how that will perform on a file with that size.
Calling the Unix sort program should be efficient. It does multiple passes to ensure it is not a memory hog. You can fork a process with Java's Runtime, but the outputs of the process are redirected, so you have to do some juggling to get the redirect to work right:
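import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;

public static void sortInUnix(File fileIn, File sortedFile)
        throws IOException, InterruptedException {
    String[] cmd = {
            "cmd", "/c",
            // above should be changed to "sh", "-c" if on Unix system
            "sort " + fileIn.getAbsolutePath() + " > "
                    + sortedFile.getAbsolutePath() };
    Process sortProcess = Runtime.getRuntime().exec(cmd);
    // capture error messages (if any)
    BufferedReader reader = new BufferedReader(new InputStreamReader(
            sortProcess.getErrorStream()));
    String outputS = reader.readLine();
    while (outputS != null) {
        System.err.println(outputS);
        outputS = reader.readLine();
    }
    // wait for the sort to finish before the caller touches sortedFile
    sortProcess.waitFor();
}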
Use the Java library big-sorter, which is published to Maven Central and has an optional dependency on commons-csv for CSV processing. It handles files of any size by splitting them into intermediate files, then sorting and merging the intermediate files repeatedly until there is only one left. Note also that the max group size for a merge is configurable (useful when there are a large number of input files).
Here's an example:
Given the CSV file below, we will sort on the second column (the "number" column):
WIPER BLADE,35,12.55
ALLEN KEY 5MM,27,3.80
Serializer<CSVRecord> serializer = Serializer.csv(
Comparator<CSVRecord> comparator = (x, y) -> {
    int a = Integer.parseInt(x.get("number"));
    int b = Integer.parseInt(y.get("number"));
    return Integer.compare(a, b);
};
The result is:
ALLEN KEY 5MM,27,3.80
WIPER BLADE,35,12.55
I created a CSV file with 12 million rows and 37 columns and filled the grid with random integers between 0 and 100,000. I then sorted the 2.7GB file on the 11th column using big-sorter and it took 8 mins to do single-threaded on an i7 with SSD and max heap set at 512m (-Xmx512m).
See the project README for more details.
Java Lists can be sorted, you can try starting there.
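As a starting point for that, here is a sketch of an in-memory sort with a multi-key Comparator. It assumes the whole data set fits in the heap and that no field contains a quoted comma; CsvKeySort, byField and QUESTION_ORDER are hypothetical names. QUESTION_ORDER only approximates the key order of the question's sort command (field 1, fields 4 to 7, field 23, then field 2 descending), since it compares plain strings field by field.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class CsvKeySort {
    // Compares two split rows on one field, using a 1-based index like sort -k.
    static Comparator<String[]> byField(int oneBasedIndex) {
        return (a, b) -> a[oneBasedIndex - 1].compareTo(b[oneBasedIndex - 1]);
    }

    // Approximation of: -k 1,1 -k 4,7 -k 23,23 -k 2,2r
    static final Comparator<String[]> QUESTION_ORDER = byField(1)
        .thenComparing(byField(4)).thenComparing(byField(5))
        .thenComparing(byField(6)).thenComparing(byField(7))
        .thenComparing(byField(23))
        .thenComparing(byField(2).reversed());

    // Splits each line once, sorts the rows, and joins them back into lines.
    static List<String> sortLines(List<String> lines, Comparator<String[]> order) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            rows.add(line.split(",", -1)); // -1 keeps trailing empty fields
        }
        rows.sort(order);
        List<String> sorted = new ArrayList<>();
        for (String[] row : rows) {
            sorted.add(String.join(",", row));
        }
        return sorted;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java CsvKeySort <csv file with at least 23 fields per line>
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (String line : sortLines(lines, QUESTION_ORDER)) {
            System.out.println(line);
        }
    }
}
```

For 12 million lines this needs a generous heap, so it is mostly useful for the individual 20000-line files or for machines with plenty of RAM.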
Python on a big server.
import csv
def sort_key( aRow ):
    return aRow['this'], aRow['that'], aRow['the other']
with open('some_file.csv','rb') as source:
    rdr= csv.DictReader( source )
    data = [ row for row in rdr ]
    data.sort( key=sort_key )
    fields= rdr.fieldnames
with open('some_file_sorted.csv', 'wb') as target:
    wtr= csv.DictWriter( target, fields )
    wtr.writeheader()
    wtr.writerows( data )
This should be reasonably quick. And it's very flexible.
On a small machine, break this into three passes: decorate, sort, undecorate
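import csv
def sort_key( aRow ):
    return aRow['this'], aRow['that'], aRow['the other']
with open('some_file.csv','rb') as source:
    rdr= csv.DictReader( source )
    with open('temp.txt','w') as target:
        for row in rdr:
            # row is a dict, so rebuild the CSV line first;
            # assumes no field contains ',', '|' or a newline
            line = ",".join( row[f] for f in rdr.fieldnames )
            target.write( "|".join( map(str,sort_key(row)) ) + "|" + line + "\n" )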
Part 2 is the operating system sort using "|" as the field separator
with open('sorted_temp.txt','r') as source:
    with open('sorted.csv','w') as target:
        for row in source:
            keys, _, data = row.rpartition('|')
            target.write( data )
You don't mention the platform, so it is hard to judge the time you specified. 12x10^6 records isn't that many, but sorting is a pretty intensive task. At 37 fields and, say, 100 bytes/field, that would be about 45GB. That's a bit much for most machines, but if the records average 10 bytes/field (about 4.5GB) your server should be able to fit the entire file in RAM, which would be ideal.
My suggestion: break the file into chunks that are 1/2 the available RAM, sort each chunk, then merge-sort the resulting sorted chunks. This lets you do all of your sorting in memory rather than hitting swap, which is what I suspect would cause any slow-down.
Say (1G chunks, in a directory you can play around in):
split --line-bytes=1000000000 original_file chunk
for each in chunk*
do
  sort "$each" > "$each.sorted"
done
sort -m chunk*.sorted > original_file.sorted
Your data set is huge, as you have mentioned, so sorting it all in one go will be time consuming, depending on your machine (if you try QuickSort).
Since you would like it done within 30 minutes, I would suggest that you have a look at MapReduce using
Apache Hadoop as your application server.
Please keep in mind it's not an easy approach, but in the long run you can easily scale up depending on your data size.
I am also pointing you to an excellent link on Hadoop setup
Work your way through single node setup and move to Hadoop cluster.
I would be glad to help you if you get stuck anywhere.
You really do need to make sure you have the right tools for the job. (Today, I am hoping to get a 3.8 GHz PC with 24 GB memory for home use. It's been a while since I bought myself a new toy. ;)
However, if you want to sort these lines and you don't have enough hardware, you don't need to break up the data because it's in 600 files already.
Sort each file individually, then do a 600-way merge sort (you only need to keep 600 lines in memory at once). It's not as simple as doing them all at once, but you could probably do it on a mobile phone. ;)
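The k-way merge step can be sketched with a priority queue that holds exactly one pending line per input. This is an illustration under assumptions, not a tuned implementation: KWayMerge and Head are hypothetical names, and the inputs are plain iterators so the example stays self-contained. For files, an iterator such as Files.lines(path).iterator() could be supplied per input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Consumer;

class KWayMerge {
    // The current head line of one input, plus the iterator it came from.
    private static final class Head {
        final String line;
        final Iterator<String> source;
        Head(String line, Iterator<String> source) {
            this.line = line;
            this.source = source;
        }
    }

    // Merges already-sorted inputs; the heap never holds more than one line per input.
    static void merge(List<Iterator<String>> sortedInputs,
                      Comparator<String> order,
                      Consumer<String> sink) {
        PriorityQueue<Head> heap =
            new PriorityQueue<>((a, b) -> order.compare(a.line, b.line));
        for (Iterator<String> input : sortedInputs) {
            if (input.hasNext()) {
                heap.add(new Head(input.next(), input));
            }
        }
        while (!heap.isEmpty()) {
            Head smallest = heap.poll();
            sink.accept(smallest.line);
            // Refill from the same input the emitted line came from.
            if (smallest.source.hasNext()) {
                heap.add(new Head(smallest.source.next(), smallest.source));
            }
        }
    }

    public static void main(String[] args) {
        List<Iterator<String>> inputs = Arrays.asList(
            Arrays.asList("a", "d").iterator(),
            Arrays.asList("b", "c", "e").iterator());
        List<String> merged = new ArrayList<>();
        merge(inputs, Comparator.naturalOrder(), merged::add);
        System.out.println(merged); // prints [a, b, c, d, e]
    }
}
```

The comparator passed to merge has to be the same one the individual files were sorted with, otherwise the output is not sorted.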
Since you have 600 smaller files, it could be faster to sort all of them concurrently. This will eat up 100% of the CPU. That's the point, correct?
for f in ${SOURCE}/*
do
  sort -t ',' -k 1,1 -k 4,7 -k 23,23 -k 2,2r -o ${f}.srt ${f} &
  waitlist="$waitlist $!"
done
wait $waitlist
LIST=`echo $SOURCE/*.srt`
sort --merge -t ',' -k 1,1 -k 4,7 -k 23,23 -k 2,2r -o sorted.txt ${LIST}
This will sort 600 small files all at the same time and then merge the sorted files. It may be faster than trying to sort a single large file.
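The "sort everything at once, then wait" part of that script has a direct counterpart inside a JVM. The sketch below sorts in-memory chunks concurrently on the common fork-join pool; it is an illustration only, ConcurrentChunkSort and sortAll are hypothetical names, and reading the files and the final merge are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class ConcurrentChunkSort {
    // Starts one sort job per chunk, then waits for all of them (like "&" and "wait").
    static List<List<String>> sortAll(List<List<String>> chunks, Comparator<String> order) {
        List<CompletableFuture<List<String>>> jobs = new ArrayList<>();
        for (List<String> chunk : chunks) {
            jobs.add(CompletableFuture.supplyAsync(() -> {
                List<String> copy = new ArrayList<>(chunk); // leave the input untouched
                copy.sort(order);
                return copy;
            }));
        }
        List<List<String>> sorted = new ArrayList<>();
        for (CompletableFuture<List<String>> job : jobs) {
            sorted.add(job.join()); // blocks until this chunk is done
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<List<String>> chunks = Arrays.asList(
            Arrays.asList("pear", "apple"),
            Arrays.asList("fig", "date", "cherry"));
        System.out.println(sortAll(chunks, Comparator.naturalOrder()));
    }
}
```

The number of jobs that really run in parallel is bounded by the pool size, so unlike 600 background sort processes this does not oversubscribe the CPU.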
Use Map/Reduce on Hadoop to do the sorting. I recommend Spring Data Hadoop (Java).
Well, since you're talking about HUGE datasets, you'll need some external sorting algorithm anyhow. There are some for Java and pretty much any other language out there. Since the result will have to be stored on disk anyway, which language you use is pretty uninteresting.