Using Java Spark to read large text files line by line - java

I am attempting to read a large text file (2 to 3 GB). I need to read the file line by line and convert each line into a JSON object. I have tried using .collect() and .toLocalIterator() to read through the text file. collect() is fine for small files but will not work for large files. I know that .toLocalIterator() gathers data scattered around the cluster onto a single node. According to the documentation, .toLocalIterator() is ineffective when dealing with large RDDs as it will run into memory issues. Is there an efficient way to read large text files in a multi-node cluster?
Below is a method with my various attempts at reading through the file and converting each line into JSON.
public static void jsonConversion() {
    JavaRDD<String> lines = sc.textFile(path);

    // Reads the first line of the text file
    String newrows = lines.first();

    // Reading through with toLocalIterator()
    Iterator<String> newstuff = lines.toLocalIterator();
    System.out.println("line 1 " + newstuff.next());
    System.out.println("line 2 " + newstuff.next());

    // Inserting lines in a list.
    // Note: .collect() is appropriate for small files only.
    List<String> rows = lines.collect();

    // Sets loop limit based on the number of lines in the text file.
    int count = (int) lines.count();
    System.out.println("Number of lines are " + count);

    // Using Google's Gson library to create a JSON builder.
    GsonBuilder gsonBuilder = new GsonBuilder();
    Gson gson = new GsonBuilder().setLenient().create();

    // Array list to hold the JSON strings.
    ArrayList<String> jsonList = new ArrayList<>();

    // Converting each line of the text file into a JSON-formatted string
    // and inserting it into the array list 'jsonList'
    for (int i = 0; i <= count - 1; i++) {
        String JSONObject = gson.toJson(rows.get(i));
        Gson prettyGson = new GsonBuilder().setPrettyPrinting().create();
        String prettyJson = prettyGson.toJson(rows.get(i));
        jsonList.add(prettyJson);
    }

    // Printing out all the JSON objects
    int lineNumber = 1;
    for (int i = 0; i <= count - 1; i++) {
        System.out.println("line " + lineNumber + "-->" + jsonList.get(i));
        lineNumber++;
    }
}
Below is a list of libraries that I am using
//Spark Libraries
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
//Java Libraries
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
//Json Builder Libraries
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;

You can try to use the map function on the RDD instead of collecting all the results.
JavaRDD<String> lines = sc.textFile(path);
JavaRDD<String> jsonList = lines.map(line -> <<all your json transformations>>);
That way you will achieve a distributed transformation of your data. See the Spark documentation for more about the map function.
Converting the data to a list or array forces collection of the data onto one node. If you want distributed computation in Spark, you need to use an RDD, DataFrame, or Dataset.
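For illustration, here is a minimal sketch of that idea using the Gson classes already imported in the question. outputPath is a placeholder, and building one Gson per partition via mapPartitions is only one way to avoid creating it for every single line:

JavaRDD<String> lines = sc.textFile(path);

// Convert each line on the executors instead of collecting to the driver
JavaRDD<String> jsonLines = lines.mapPartitions(iter -> {
    Gson gson = new GsonBuilder().setPrettyPrinting().create(); // one Gson instance per partition
    List<String> out = new ArrayList<>();
    while (iter.hasNext()) {
        out.add(gson.toJson(iter.next()));
    }
    return out.iterator();
});

// Persist the converted lines without bringing them to the driver
jsonLines.saveAsTextFile(outputPath);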

JavaRDD<String> lines = sc.textFile(path);
JavaRDD<String[]> tokens = lines.map(line -> line.split("/"));
Or you can define a new method inside the map:
JavaRDD<String> jsonList = lines.map(line -> {
    String newline = line.replace("", "");
    return newline;
});
// Then convert the JavaRDD to a DataFrame (see: Converting JavaRDD to DataFrame in Spark Java)
dfTobeSaved.write().format("json").save("/root/data.json");
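A rough Java sketch of how dfTobeSaved could be built from the JavaRDD, assuming a SparkSession named spark is available; the single-column schema is a placeholder, so adjust it to your data:

// Wrap each string in a Row and build a one-column DataFrame
StructType schema = new StructType().add("value", DataTypes.StringType);
JavaRDD<Row> rowRdd = jsonList.map(line -> RowFactory.create(line));
Dataset<Row> dfTobeSaved = spark.createDataFrame(rowRdd, schema);
// dfTobeSaved can now be written with the write().format("json").save(...) call above;
// Spark produces a directory of part files with one JSON object per line, e.g. {"value": "..."}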

Related

How to convert a large JSONL file with unknown JSON properties into CSV using Apache Beam, Google Dataflow and Java
Here is my scenario:
A large JSONL file is in Google Cloud Storage
The JSON properties are unknown, so a schema cannot be defined in the Beam pipeline
Use Apache Beam, Google Dataflow and Java to convert the JSONL to CSV
Once the transformation is done, store the CSV in Google Cloud Storage (the same bucket where the JSONL is stored)
Notify by some means, like transformation_done=true if possible (REST API or event)
Any help or guidance would be helpful, as I am new to Apache Beam, though I am reading the Apache Beam docs.
I have edited the question with example JSONL data:
{"Name":"Gilbert", "Session":"2013", "Score":"24", "Completed":"true"}
{"Name":"Alexa", "Session":"2013", "Score":"29", "Completed":"true"}
{"Name":"May", "Session":"2012B", "Score":"14", "Completed":"false"}
{"Name":"Deloise", "Session":"2012A", "Score":"19", "Completed":"true"}
The JSON keys are present in the input file, but they are not known at the time of the transformation.
I'll explain that with an example: suppose I have three clients, and each has its own Google Storage bucket, so each uploads its own JSONL file with different JSON properties.
Client 1: Input Jsonl File
{"city":"Mumbai", "pincode":"2012A"}
{"city":"Delhi", "pincode":"2012N"}
Client 2: Input Jsonl File
{"Relation":"Finance", "Code":"2012A"}
{"Relation":"Production", "Code":"20XXX"}
Client 3: Input Jsonl File
{"Name":"Gilbert", "Session":"2013", "Score":"24", "Completed":"true"}
{"Name":"Alexa", "Session":"2013", "Score":"29", "Completed":"true"}
Question: How could I write a generic Beam pipeline that transforms all three as shown below?
Client 1: Output CSV file
["city", "pincode"]
["Mumbai","2012A"]
["Delhi", "2012N"]
Client 2: Output CSV file
["Relation", "Code"]
["Finance", "2012A"]
["Production","20XXX"]
Client 3: Output CSV file
["Name", "Session", "Score", "true"]
["Gilbert", "2013", "24", "true"]
["Alexa", "2013", "29", "true"]
Edit: Removed the previous answer as the question has been modified with examples.
There is no ready-made, generic way to achieve such a result. You have to write the logic yourself, depending on your requirements and how you are handling the pipeline.
Below are some examples, but you need to verify them for your case, as I have only tried them on a small JSONL file.
TextIO
Approach 1
If you can collect the header value of the output CSV beforehand, it will be much easier. But getting the header beforehand is itself another challenge.
// pipeline
pipeline.apply("ReadJSONLines",
        TextIO.read().from("FILE URL"))
    .apply(ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processLines(@Element String line, OutputReceiver<String> receiver) {
            String values = getCsvLine(line, false);
            receiver.output(values);
        }
    }))
    .apply("WriteCSV",
        TextIO.write().to("FileName")
            .withSuffix(".csv")
            .withoutSharding()
            .withDelimiter(new char[] { '\r', '\n' })
            .withHeader(getHeader()));

private static String getHeader() {
    String header = "";
    // your logic to get the header line.
    return header;
}
Probable ways to get the header line (these are only assumptions and may not work in your case):
You can have a text file in GCS which stores the header of a particular JSON file. In your logic you can fetch the header by reading that file; check this SO thread about how to read files from GCS.
You can try to pass the header as a runtime argument, but that depends on how you are configuring and executing your pipeline (a minimal sketch follows below).
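Purely as an illustration of the runtime-argument idea, a minimal sketch using a custom options interface; the names CsvOptions and csvHeader are made up for this example and are not from the original post:

// Custom options so the header can be passed as --csvHeader=... when launching the pipeline
public interface CsvOptions extends PipelineOptions {
    @Description("Comma separated header line for the output CSV")
    @Default.String("")
    String getCsvHeader();
    void setCsvHeader(String header);
}

// when building the pipeline
CsvOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(CsvOptions.class);
Pipeline pipeline = Pipeline.create(options);
// ... read and transform as in Approach 1, then write with
// TextIO.write().to("FileName").withSuffix(".csv").withHeader(options.getCsvHeader())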
Approach 2
This is a workaround I found for small JSON files (~10k lines). The example below may not work for large files, because the DoFn runs in parallel across bundles and workers, so the counter is not global and the header is not guaranteed to be emitted exactly once or first.
final int[] count = { 0 };
pipeline.apply(/* read file */)
    .apply(ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processLines(@Element String line, OutputReceiver<String> receiver) {
            // check if it is the first processed element; if yes, emit the header first
            if (count[0] == 0) {
                String header = getCsvLine(line, true);
                receiver.output(header);
                count[0]++;
            }
            String values = getCsvLine(line, false);
            receiver.output(values);
        }
    }))
    .apply(/* write file */);
FileIO
As mentioned by Saransh in the comments, with FileIO all you have to do is read the JSONL file line by line manually and then convert each line into comma-separated format. E.g.:
pipeline.apply(FileIO.match().filepattern("FILE PATH"))
    .apply(FileIO.readMatches())
    .apply(FlatMapElements
        .into(TypeDescriptors.strings())
        .via((FileIO.ReadableFile f) -> {
            List<String> output = new ArrayList<>();
            try (BufferedReader br = new BufferedReader(Channels.newReader(f.open(), "UTF-8"))) {
                String line = br.readLine();
                while (line != null) {
                    if (output.size() == 0) {
                        String header = getCsvLine(line, true);
                        output.add(header);
                    }
                    String result = getCsvLine(line, false);
                    output.add(result);
                    line = br.readLine();
                }
            } catch (IOException e) {
                throw new RuntimeException("Error while reading", e);
            }
            return output;
        }))
    .apply(/* write to gcs */);
In the above examples I have used a getCsvLine method (created for code reusability) which takes a single line from the file and converts it into comma-separated format. To parse the JSON object I have used Gson.
/**
 * @param line     a single JSONL line
 * @param isHeader true: returns output combining the JSON keys
 *                 false: returns output combining the JSON values
 **/
public static String getCsvLine(String line, boolean isHeader) {
    List<String> values = new ArrayList<>();
    // convert the line into a JsonObject
    JsonObject jsonObject = JsonParser.parseString(line).getAsJsonObject();
    // iterate the JSON object and collect all keys or values
    for (Map.Entry<String, JsonElement> entry : jsonObject.entrySet()) {
        if (isHeader)
            values.add(entry.getKey());
        else
            values.add(entry.getValue().getAsString());
    }
    String result = String.join(",", values);
    return result;
}

Mapping a particular column of a CSV file to a particular POJO field

I have to map particular CSV columns, based on index, to particular POJO attributes. The mapping is driven by a JSON file containing a column index and an attribute name, meaning that a particular column index from the CSV file has to be mapped to a particular attribute of the POJO class.
Below is a sample of the JSON file which shows the column mapping strategy onto POJO attributes.
[{"index":0,"columnname":"date"},{"index":1,"columnname":"deviceAddress"},{"index":7,"columnname":"iPAddress"},{"index":3,"columnname":"userName"},{"index":10,"columnname":"group"},{"index":5,"columnname":"eventCategoryName"},{"index":6,"columnname":"message"}]
I have tried the OpenCSV library, but the challenge I faced is that I am not able to read a partial set of columns with it. As you can see in the JSON above, we skip indexes 2 and 4 when reading from the CSV file. Below is the code using OpenCSV.
public static List<BaseDataModel> readCSVFile(String filePath, List<String> columnListBasedOnIndex) {
    List<BaseDataModel> csvDataModels = null;
    File myFile = new File(filePath);
    try (FileInputStream fis = new FileInputStream(myFile)) {
        final ColumnPositionMappingStrategy<BaseDataModel> strategy = new ColumnPositionMappingStrategy<BaseDataModel>();
        strategy.setType(BaseDataModel.class);
        strategy.setColumnMapping(columnListBasedOnIndex.toArray(new String[0]));
        final CsvToBeanBuilder<BaseDataModel> beanBuilder = new CsvToBeanBuilder<>(new InputStreamReader(fis));
        beanBuilder.withMappingStrategy(strategy);
        csvDataModels = beanBuilder.build().parse();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return csvDataModels;
}
List<ColumnIndexMapping> columnIndexMappingList = dataSourceModel.getColumnMappingStrategy();
List<String> columnNameList = columnIndexMappingList.stream()
        .map(ColumnIndexMapping::getColumnname)
        .collect(Collectors.toList());
List<BaseDataModel> dataModels = Utility
        .readCSVFile(file.getAbsolutePath() + File.separator + fileName, columnNameList);
I have also tried univocity, but with this library how can I map CSV columns to particular attributes? Below is the code:
CsvParserSettings settings = new CsvParserSettings();
settings.detectFormatAutomatically(); //detects the format
settings.getFormat().setLineSeparator("\n");
//extracts the headers from the input
settings.setHeaderExtractionEnabled(true);
settings.selectIndexes(0, 2); //rows will contain only values of columns at position 0 and 2
CsvRoutines routines = new CsvRoutines(settings); // Can also use TSV and Fixed-width routines
routines.parseAll(BaseDataModel.class, new File("/path/to/your.csv"));
List<String[]> rows = new CsvParser(settings).parseAll(new File("/path/to/your.csv"), "UTF-8");
Please have a look if someone can help me in this case.
Author of univocity-parsers here. You can define mappings to your class attributes in code instead of annotations. Something like this:
public class BaseDataModel {
    private String a;
    private int b;
    private String c;
    private Date d;
}
Then on your code, map the attributes to whatever column names you need:
ColumnMapper mapper = routines.getColumnMapper();
mapper.attributeToColumnName("a", "col1");
mapper.attributeToColumnName("b", "col2");
mapper.attributeToColumnName("c", "col3");
mapper.attributeToColumnName("d", "col4");
You can also use mapper.attributeToIndex("d", 3); to map attributes to a given column index.
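To tie that back to the mapping JSON in the question, a hedged sketch; it assumes ColumnIndexMapping also exposes a getIndex() accessor (not shown in the question code) and that BaseDataModel has attributes named like the columnname values:

CsvParserSettings settings = new CsvParserSettings();
settings.detectFormatAutomatically();

CsvRoutines routines = new CsvRoutines(settings);
ColumnMapper mapper = routines.getColumnMapper();

// map each POJO attribute to its CSV column position; indexes 2 and 4 simply are not mapped
for (ColumnIndexMapping mapping : columnIndexMappingList) {
    mapper.attributeToIndex(mapping.getColumnname(), mapping.getIndex());
}

List<BaseDataModel> dataModels = routines.parseAll(BaseDataModel.class, new File("/path/to/your.csv"));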
Hope this helps.

Modifying JSON output for two different functions

I have two functions that each take in an ArrayList of descriptors. I am trying to print different JSON output for each respective function. I am using the Gson library to accomplish this task, and a ClientData model object to format the JSON correctly. Attached below are the getters and setters for it.
import java.util.ArrayList;
import java.util.List;

import com.google.gson.annotations.SerializedName;

public class ClientData {

    @SerializedName("TrialCountryCodes")
    private List<String> trialCountryCodes;

    @SerializedName("CancerGenePanel")
    private String cancerGenePanel;

    public ClientData() {
        this.trialCountryCodes = new ArrayList<String>();
    }

    public List<String> getTrialCountryCodes() {
        return trialCountryCodes;
    }

    public void setTrialCountryCodes(List<String> trialCountryCodes) {
        this.trialCountryCodes = trialCountryCodes;
    }

    public String getCancerGenePanel() {
        return cancerGenePanel;
    }

    public void setCancerGenePanel(String cancerGenePanel) {
        this.cancerGenePanel = cancerGenePanel;
    }
}
The problem comes in with the trial country codes. When I call one function, I want TrialCountryCodes to be visible in the JSON output; when I call the other one, I don't want it to be visible. Attached below are the two functions: one takes in one file and the other takes in two files. When the function takes one file, I don't want TrialCountryCodes to be visible; when it takes two files, I do want TrialCountryCodes to be visible.
descriptors = HelperMethods.getBreastCarcinomaDescriptorsFromCsvFile("/Users/edgarjohnson/eclipse-workspace/CsvToJson/src/in.csv");
descriptors = HelperMethods.getBreastCarcinomaDescriptorsFromCsvFile("/Users/edgarjohnson/eclipse-workspace/CsvToJson/src/in.csv", "/Users/edgarjohnson/eclipse-workspace/CsvToJson/src/EU.csv");
HelperMethods.writeJsonFile(descriptors, "JsonOutput.json");
More background info: I am getting these values from a CSV file; I read the CSV file and write the JSON output to multiple files. This is the code that I use to format my JSON:
public static List<BreastCarcinomaDescriptor> getBreastCarcinomaDescriptorsFromCsvFile(String fileName, String fileName2) {
    List<BreastCarcinomaDescriptor> descriptorsAndCountrycodes = new ArrayList<BreastCarcinomaDescriptor>();
    BufferedReader bufferedCsvFile = HelperMethods.getCsvFileBuffer(fileName);
    BufferedReader bufferedCsvFile2 = HelperMethods.getCsvFileBuffer(fileName2);

    List<String> lines = new ArrayList<String>();
    List<String> line2 = new ArrayList<String>();
    HelperMethods.readCsvToStrings(lines, bufferedCsvFile);
    HelperMethods.readCsvToStrings(line2, bufferedCsvFile2);

    List<String> countryList = new ArrayList<String>();
    System.out.println(line2);

    // populate the country list using file2
    countryList = Arrays.asList(line2.get(0).split(","));
    System.out.println(countryList);

    for (String line : lines) {
        BreastCarcinomaDescriptor descriptor = getBreastCarcinomaDescriptorFromCsvLine(line);
        // enrich this object with the country code property
        descriptor.getClientData().setTrialCountryCodes(countryList);
        descriptorsAndCountrycodes.add(descriptor);
    }
    return descriptorsAndCountrycodes;
}

private static BreastCarcinomaDescriptor getBreastCarcinomaDescriptorFromCsvLine(String line) {
    BreastCarcinomaDescriptor breastCarcinomaDescriptor = new BreastCarcinomaDescriptor();
    String[] data = line.split(",");
    breastCarcinomaDescriptor.setBatchName(data[0]);
    breastCarcinomaDescriptor.getMetadata().setCharset("utf-8");
    breastCarcinomaDescriptor.getMetadata().setSchemaVersion("1.5");
    if (data.length > 5) {
        breastCarcinomaDescriptor.getSampleInfo().setAge(new Integer(data[5].trim()));
    }
    breastCarcinomaDescriptor.getSampleInfo().setCancerType(data[3].trim());
    if (data.length > 4) {
        breastCarcinomaDescriptor.getSampleInfo().setGender(data[4].trim());
    }
    breastCarcinomaDescriptor.getFiles().add(data[1].concat(".*"));
    // breastCarcinomaDescriptor.getClientData().getTrialCountryCodes().add(descriptorsAndCountrycodes[]);
    // breastCarcinomaDescriptor.getClientData().getTrialCountryCodes().add("20");
    breastCarcinomaDescriptor.getClientData().setCancerGenePanel("");
    breastCarcinomaDescriptor.setCaseName(data[1]);
    return breastCarcinomaDescriptor;
}
What I've tried: I tried using custom serialization to control whether TrialCountryCodes is displayed depending on which function is called, but I am having trouble with this.
Does anyone have any ideas how I can accomplish this task? I feel like the solution is trivial; however, I don't know the Gson library too well and I am new to Java.
How the formatted output should look for the function that takes in 1 file: (screenshot in the original post)
How the formatted output should look for the function that takes in 2 files: (screenshot in the original post)
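Roughly, based on the ClientData fields above, the ClientData portion of the two outputs should differ only in the TrialCountryCodes property; the values below are made up purely for illustration:

One input file (TrialCountryCodes omitted):
{"CancerGenePanel": "somePanel"}
Two input files (TrialCountryCodes included):
{"TrialCountryCodes": ["US", "DE"], "CancerGenePanel": "somePanel"}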
You can register two different TypeAdapters which serialize into the format you want, depending on which function gets called. Then each of your functions uses its own type adapter and can control the details of the transformation.
First function
GsonBuilder builder = new GsonBuilder();
builder.registerTypeAdapter(ClientData.class, new ClientDataWithCancerGenePanelAdapter());
Gson gson = builder.create();
Second function:
GsonBuilder builder = new GsonBuilder();
builder.registerTypeAdapter(ClientData.class, new ClientDataWithTrialCountryCodesAdapter());
Gson gson = builder.create();
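The adapter implementations themselves are not shown in this answer. Purely as a sketch, the single-file variant could be a Gson JsonSerializer that simply never writes the TrialCountryCodes property; the class name matches the registration above, but the body is an assumption:

// uses com.google.gson.JsonSerializer, JsonElement, JsonObject, JsonSerializationContext
// and java.lang.reflect.Type
public class ClientDataWithCancerGenePanelAdapter implements JsonSerializer<ClientData> {
    @Override
    public JsonElement serialize(ClientData src, Type typeOfSrc, JsonSerializationContext context) {
        JsonObject json = new JsonObject();
        // only CancerGenePanel is emitted; TrialCountryCodes is left out entirely
        json.addProperty("CancerGenePanel", src.getCancerGenePanel());
        return json;
    }
}

The two-file variant would additionally write the list, e.g. json.add("TrialCountryCodes", context.serialize(src.getTrialCountryCodes())).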

Convert .prn file to CSV file format in Java

I need your help to convert a .prn file to a CSV file using Java.
Below is my .prn file (shown as an image in the original post).
I would like to make it look like this (also shown as an image).
Thank you so much.
In your example you have four entries as input, each in its own row. In your result table they are all in one row. I assume the input describes one complete prn set, so if a file contained n prn sets, it would have n * 4 rows.
To map the prn set to a CSV file you have to:
read in the entries from the input file
write a header row (with eight titles)
extract from each entry the relevant values
combine the extracted values of four entries in sequence into one CSV row
write the row
repeat steps 3 to 5 as long as there are further entries
Here is my suggestion:
public class PrnToCsv {
private static final String DILIM_PRN = " ";
private static final String DILIM_CSV = ",";
private static final Pattern PRN_SPLITTER = Pattern.compile(DILIM_PRN);
public static void main(String[] args) throws URISyntaxException, IOException {
List<String> inputLines = Files.readAllLines(new File("C://Temp//csv/input.prn").toPath());
List<String[]> inputValuesInLines = inputLines.stream().map(l -> PRN_SPLITTER.split(l)).collect(Collectors.toList());
try (BufferedWriter bw = Files.newBufferedWriter(new File("C://Temp//csv//output.csv").toPath())) {
// header
bw.append("POL1").append(DILIM_CSV).append("POL1_Time").append(DILIM_CSV).append("OLV1").append(DILIM_CSV).append("OLV1_Time").append(DILIM_CSV);
bw.append("POL2").append(DILIM_CSV).append("POL2_Time").append(DILIM_CSV).append("OLV2").append(DILIM_CSV).append("OLV2_Time");
bw.newLine();
// data
for (int i = 0; i + 3 < inputValuesInLines.size(); i = i + 4) {
String[] firstValues = inputValuesInLines.get(i);
bw.append(getId(firstValues)).append(DILIM_CSV).append(getDateTime(firstValues)).append(DILIM_CSV);
String[] secondValues = inputValuesInLines.get(i + 1);
bw.append(getId(secondValues)).append(DILIM_CSV).append(getDateTime(secondValues)).append(DILIM_CSV);
String[] thirdValues = inputValuesInLines.get(i + 2);
bw.append(getId(thirdValues)).append(DILIM_CSV).append(getDateTime(thirdValues)).append(DILIM_CSV);
String[] fourthValues = inputValuesInLines.get(i + 3);
bw.append(getId(fourthValues)).append(DILIM_CSV).append(getDateTime(fourthValues));
bw.newLine();
}
}
}
public static String getId(String[] values) {
return values[1];
}
public static String getDateTime(String[] values) {
return values[2] + " " + values[3];
}
}
Some remarks to the code:
Using the nio-API you can read the whole file with one line of code.
To extract the values of an entry line I used a Pattern to split the line into an array with each single word as a value.
Then it is easy to get the relevant values of an entry using the appropriate array indexes.
To write the csv file line by line (without additional libs) you can use a BufferedWriter.
The file you're writing to is a resource. It is recommended to handle resources with the try-with-resources statement.
I hope I could answer your question.

Redis Java API for inserting multiple values

I have a csv file named abc.csv which contains data as follows :
All,Friday,0:00,315.06,327.92,347.24
All,Friday,1:00,316.03,347.73,370.55
and so on .....
I wish to import the data into Redis. How can I do this through the Java API?
Please suggest the steps to do this.
I wish to run the jar and get the data imported into the Redis db.
Any help on mass insert would also be helpful in case the Java option is not possible.
You can do it by using Jedis (https://github.com/xetorthio/jedis), a Java client for Redis. Just create a class with a main method, create a connection, and set the keys.
void processData() throws IOException {
    List<String> lines = Files.readAllLines(
            Paths.get("\\Path\\abc.csv"), Charset.forName("UTF-8"));
    Jedis connection = new Jedis("host", port);
    Pipeline p = connection.pipelined();
    for (String line : lines) {
        String key = getKey(line);
        String value = getValue(line);
        p.set(key, value);
    }
    p.sync();
}
If the file is big, you can create an input stream over it and read line by line instead of loading the whole file. You should then also call p.sync() in batches: just keep a counter and do a modulo with the batch size, as sketched below.
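A rough sketch of that streaming-plus-batching idea, reusing the placeholder getKey/getValue helpers from the snippet above; the batch size, path and connection details are arbitrary:

int batchSize = 1000;
int counter = 0;
try (BufferedReader reader = Files.newBufferedReader(Paths.get("\\Path\\abc.csv"), StandardCharsets.UTF_8);
     Jedis connection = new Jedis("host", port)) {
    Pipeline p = connection.pipelined();
    String line;
    while ((line = reader.readLine()) != null) {
        p.set(getKey(line), getValue(line));
        if (++counter % batchSize == 0) {
            p.sync();   // flush every batchSize commands
        }
    }
    p.sync();           // flush the remainder
}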
Redisson allows you to combine multiple commands into one. This is called Redis pipelining. Here is an example:
List<String> lines = ...
RBatch batch = redisson.createBatch();
RList<String> list = batch.getList("yourList");
for (String line : lines) {
    String key = getKey(line);
    batch.getBucket(key).set(line);
}
// send to Redis as a single command
batch.execute();
You can use the Jedis library. Here I have extended my Java object to provide the key and the hash to set. In your case, you can have a list of parsed CSV values and keys sent directly as a list.
private static final int BATCH_SIZE = 1000;

public void setRedisHashInBatch(List<? extends RedisHashMaker> objects) {
    try (Jedis jedis = jedisPool.getResource()) {
        log.info("Connection IP -{}", jedis.getClient().getSocket().getInetAddress());
        Pipeline p = jedis.pipelined();
        log.info("Setting Records in Cache. Size: " + objects.size());
        int j = 0;
        for (RedisHashMaker obj : objects) {
            p.hmset(obj.getUniqueKey(), obj.getRedisHashStringFromObject());
            j++;
            if (j == BATCH_SIZE) {
                p.sync();
                j = 0;
            }
        }
        p.sync();
    }
}
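RedisHashMaker is this answer's own abstraction and its definition is not shown. Purely as an illustration, an implementation for the abc.csv rows from the question might look like the following; the key choice and field names are invented:

public class CsvRowHash implements RedisHashMaker {
    private final String[] fields;  // e.g. [All, Friday, 0:00, 315.06, 327.92, 347.24]

    public CsvRowHash(String csvLine) {
        this.fields = csvLine.split(",");
    }

    @Override
    public String getUniqueKey() {
        // day plus time as the key; use whatever is actually unique in your data
        return fields[1] + ":" + fields[2];
    }

    @Override
    public Map<String, String> getRedisHashStringFromObject() {
        Map<String, String> hash = new HashMap<>();
        hash.put("v1", fields[3]);
        hash.put("v2", fields[4]);
        hash.put("v3", fields[5]);
        return hash;
    }
}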

Categories

Resources