Reading JSON file with BigQuery to make table - java

I'm new to Google Dataflow and can't get it to work with JSON. I've read through the documentation but can't solve my problem.
Following the WordCount example, I figured out how data is loaded from a .csv file with the following line:
PCollection<String> input = p.apply(TextIO.Read.from(options.getInputFile()));
where inputFile is a .csv file in my gcloud bucket. I can transform the lines read from the .csv with:
PCollection<TableRow> table = input.apply(ParDo.of(new ExtractParametersFn()));
(ExtractParametersFn is defined by me.) So far so good!
But then I realized my .csv file was too big and I had to convert it to JSON (https://cloud.google.com/bigquery/preparing-data-for-bigquery).
Since BigQueryIO is supposedly better for reading JSON, I tried with the following code:
PCollection<TableRow> table = p.apply(BigQueryIO.Read.from(options.getInputFile()));
(inputFile is now a JSON file, and the output when reading with BigQueryIO is a PCollection of TableRows.) I tried TextIO too (which returns a PCollection of Strings), and neither of the two IO options works.
What am I missing? The documentation is not detailed enough to find an answer there, but perhaps someone has already dealt with this problem?
Any suggestions would be much appreciated. :)

BigQueryIO.Read reads from a BigQuery table rather than from a file, so passing it a file path will not work. I believe there are two options to consider:
Use TextIO with TableRowJsonCoder to ingest the JSON files (e.g., as is done in the TopWikipediaSessions example);
Import the JSON files into a BigQuery table (https://cloud.google.com/bigquery/loading-data-into-bigquery), and then use BigQueryIO.Read to read from the table.

How to keep a leading zero in a string when exporting data with the opencsv library

I am using the opencsv library in Java to export CSV, but I have a problem. When a string begins with a zero, such as 0123456, the export drops the 0 and my CSV shows 123456. The zero is missing. I tried this approach:
"\"\t"+"0123456"+ "\""; but then the exported CSV shows "0123456". I don't want that; I want 0123456. I don't want to fix it in Excel, because some end users don't know how. How can I export CSV using opencsv and keep the leading 0? Please help.
I think the problem is not in generating the CSV but in the way Excel treats the data when the file is opened via Explorer.
I tried this code and viewed the CSV in a text editor (not Excel). It shows up correctly there, but when the file is opened in Excel, the leading 0s are lost!
CSVWriter writer = new CSVWriter(new FileWriter("yourfile.csv"));
// feed in your array (or convert your data to an array)
String[] entries = "0123131#21212#021213".split("#");
List<String[]> a = new ArrayList<>();
a.add(entries);
//don't apply quotes
writer.writeAll(a,false);
writer.close();
If you really want Excel to show the leading 0s for numeric values when the user opens the file, then each cell entry must be in the ="dataHere" format; see the code below:
CSVWriter writer = new CSVWriter(new FileWriter("yourfile.csv"));
// feed in your array (or convert your data to an array)
String[] entries = "=\"0123131\"#=\"21212\"#=\"021213\"".split("#");
List<String[]> a = new ArrayList<>();
a.add(entries);
writer.writeAll(a);
writer.close();
Excel now shows the leading 0s when the file is opened from Windows Explorer (by double-clicking).
But if we view the CSV in a text editor, the data modified to "suit" Excel shows up with the ="..." wrapper around each value instead of the bare numbers.
Also see:
format-number-as-text-in-csv-when-open-in-both-excel-and-notepad
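The ="..." wrapping can also be applied programmatically instead of being hard-coded into the data. The following is a minimal sketch in plain Java with no opencsv dependency. The class and method names (ExcelTextCells, asExcelText, protectLeadingZero, protectRow) are hypothetical, and the rule "wrap only all-digit values that start with 0" is an assumption, not something the answer specifies.

```java
import java.util.Arrays;

// Hypothetical helper (not part of opencsv): prepares cell values so that
// Excel treats digit strings with a leading zero as text.
class ExcelTextCells {

    // Wraps a value in the ="..." form, e.g. 0123456 becomes ="0123456".
    // Embedded double quotes are doubled so the string literal stays valid.
    static String asExcelText(String value) {
        return "=\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Assumption: only all-digit values that start with 0 need protection.
    static String protectLeadingZero(String value) {
        return value.matches("0\\d+") ? asExcelText(value) : value;
    }

    // Returns a new row with every cell passed through protectLeadingZero.
    static String[] protectRow(String[] row) {
        return Arrays.stream(row)
                .map(ExcelTextCells::protectLeadingZero)
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String[] row = protectRow("0123131#21212#021213".split("#"));
        // Print the prepared cells; in real use the row goes to the CSV writer.
        System.out.println(String.join(" | ", row));
    }
}
```

The array returned by protectRow would be added to the list passed to writer.writeAll, exactly as in the snippets above. Values without a leading zero are left untouched, so the file stays readable in a text editor wherever possible.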
Have you tried a string like "'"+"0123456"? The ' character marks the number as text when it is parsed into Excel.
For me OpenCSV works correctly (version 5.6).
For example, my CSV file has a row like the following extract:
"999739059";;;"abcdefgh";"001024";
and opencsv reads the field "001024" correctly as 001024, keeping the leading zeros. Of course, I have mapped the field to a String, not to a Double.
But if you still have problems, you can grab a simple yet powerful parser that fully adheres to the RFC 4180 standard:
mykong.com
Mykong shows some examples that use opencsv directly and, at the end, writes a simple parser to use if you don't want to import OpenCSV. The parser works very well, and its source code is easy to understand, so you can modify or customize it for your needs.

How to read a CSV uploaded via a Spring REST handler using spark?

I am new to Spark and Dataframes. I came across the below piece of code provided by the databricks library to read CSV from a specified path in the file system.
sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.load("../Downlos/2017.csv")
Is there any API in the databricks CSV library that parses a byte array from an HTTP request instead of reading from a file system?
The use case is to read a multipart (CSV) file, uploaded through a Spring REST handler, using Spark DataFrames. I'm looking for a DataFrame API that can take a file or byte array as input instead of reading from the file system.
From the file read, I need to select only those columns in each row that match a given condition (e.g. any column value that is not equal to the string "play" in each parsed row) and save only those fields back to the database.
Can anyone tell me whether this use case is feasible in Spark using RDDs or DataFrames? Any suggestions would be a great help.
You cannot load it directly; you have to convert the content to a String first, and then you can create an RDD from it.
See: URL contents to a String or file
val html = scala.io.Source.fromURL("https://spark.apache.org/").mkString
val list = html.split("\n").filter(_ != "")
val rdds = sc.parallelize(list)
val count = rdds.filter(_.contains("Spark")).count()
Reference: the Scala API documentation for Source.fromURL.
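For the upload use case in the question, the same idea (get the content into memory as text, then hand the lines to Spark) can be sketched in Java without touching the file system. This is a minimal sketch, not a databricks or Spark API: the class and method names are hypothetical, UTF-8 encoding is assumed, and fields are split naively on commas with no support for quoted fields.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns an uploaded CSV byte array into lines and
// filters the fields of a line. No Spark dependency is needed for this step.
class UploadedCsv {

    // Decodes the upload as UTF-8 (assumption) and returns the non-empty lines.
    static List<String> toLines(byte[] uploaded) {
        String text = new String(uploaded, StandardCharsets.UTF_8);
        List<String> lines = new ArrayList<>();
        // \R matches any line break, including \r\n as a single unit.
        for (String line : text.split("\\R")) {
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Keeps only the fields of one line that are not equal to the excluded value.
    // Naive comma split: quoted fields containing commas are not handled.
    static List<String> fieldsNotEqualTo(String line, String excluded) {
        List<String> kept = new ArrayList<>();
        for (String field : line.split(",", -1)) {
            if (!field.equals(excluded)) {
                kept.add(field);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        byte[] upload = "a,play,b\nplay,c\n".getBytes(StandardCharsets.UTF_8);
        for (String line : toLines(upload)) {
            System.out.println(fieldsNotEqualTo(line, "play"));
        }
    }
}
```

In a Spring handler the byte array would typically come from MultipartFile.getBytes(). The list returned by toLines could then be passed to JavaSparkContext.parallelize, mirroring sc.parallelize in the Scala snippet, with the field filter applied inside a map step.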

Java OpenCSV - 2 List comparison and duplication

I am going to make an application that compares two .csv lists using OpenCSV. It should work like this:
Open two .csv files (each file has the columns Name, Emails)
Save the results (and here is the problem: I don't know whether they should be saved to a table or something else)
Compare the values of the "Emails" column in List1 and List2.
If an email from List1 appears in List2, delete it (from List1)
Export the results to a new .csv file
I don't know if this is a good algorithm. Please tell me which option for storing the results of reading the .csv files is best in this case.
Kind Regards
You can get around this more easily with univocity-parsers as it can read your data into columns:
CsvParserSettings parserSettings = new CsvParserSettings(); //parser config with many options, check the tutorial
parserSettings.setHeaderExtractionEnabled(true); // uses the first row as headers
// To get the values of all columns, use a column processor
ColumnProcessor rowProcessor = new ColumnProcessor();
parserSettings.setRowProcessor(rowProcessor);
CsvParser parser = new CsvParser(parserSettings);
//This will parse everything and pass the data to the column processor
parser.parse(new FileReader(new File("/path/to/your/file.csv")));
//Finally, we can get the column values:
Map<String, List<String>> columnValues = rowProcessor.getColumnValuesAsMapOfNames();
Let's say you parsed the second CSV with that. Just grab the emails and create a set:
Set<String> emails = new HashSet<>(columnValues.get("Emails")); // "Emails" is the header name given in the question
Now just iterate over the first CSV and check if the emails are in the emails set.
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
If you have a hard requirement to use openCSV, then here is what I believe is the easiest solution:
First off, I like Jeronimo's suggestion about the HashSet. Read the second csv file first using CsvToBean and save the email addresses in a HashSet.
Then create a filter class that implements the CsvToBeanFilter interface. Pass the set into the constructor, and in the allowLine method look up the email address and return true if it is not in the set (so you have a quick lookup).
Then pass the filter to CsvToBean.parse when reading the first file, and all you will get are the records from the first file whose email addresses are not in the second file. The CsvToBeanFilter javadoc has a good example that shows how this works.
Lastly, use BeanToCsv to create a file from the filtered list.
In the interest of fairness: I am the maintainer of the openCSV project, and it is also open source and free (Apache V2.0 license).
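The set-based comparison that both answers rely on can be sketched independently of either CSV library. This is a minimal sketch under stated assumptions: each row is already parsed into a {name, email} array, the class and method names are hypothetical, and emails are compared after trimming and lowercasing, which is an assumption the question does not state.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: removes from list 1 every row whose email
// also appears in list 2. Rows are {name, email} arrays (assumption).
class EmailListDiff {

    // Assumption: emails match regardless of case and surrounding spaces.
    static String normalize(String email) {
        return email.trim().toLowerCase(Locale.ROOT);
    }

    static List<String[]> removeMatches(List<String[]> list1, List<String[]> list2) {
        // Collect the emails of the second list once for constant-time lookups.
        Set<String> emailsInList2 = new HashSet<>();
        for (String[] row : list2) {
            emailsInList2.add(normalize(row[1]));
        }
        // Keep only the rows of the first list whose email is not in the set.
        List<String[]> kept = new ArrayList<>();
        for (String[] row : list1) {
            if (!emailsInList2.contains(normalize(row[1]))) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String[]> list1 = new ArrayList<>();
        list1.add(new String[] {"Ann", "ann@example.org"});
        list1.add(new String[] {"Bob", "bob@example.org"});
        List<String[]> list2 = new ArrayList<>();
        list2.add(new String[] {"Robert", "bob@example.org"});
        for (String[] row : removeMatches(list1, list2)) {
            System.out.println(String.join(",", row));
        }
    }
}
```

Holding the second file's emails in a HashSet answers the storage question from the original post: nothing needs to be saved to a table, and the filtered rows can be written straight to the new .csv file.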

can't save clustering result to an arff file

I'm using Java code to save a clustering result to an .arff file.
I've followed the instructions on this site:
http://weka.wikispaces.com/Visualizing+cluster+assignments
but I get an error in the line:
PlotData2D predData = ClustererPanel.setUpVisualizableInstances(train, eval);
saying that:
The method setUpVisualizableInstances(Instances, ClusterEvaluation) is undefined for the type ClustererPanel
I've tried to google it, but I couldn't find a solution.
Judging from the current code:
http://grepcode.com/file/repo1.maven.org/maven2/nz.ac.waikato.cms.weka/weka-dev/3.7.12/weka/gui/explorer/ClustererPanel.java#ClustererPanel
I assume you have to call setInstances instead of setUpVisualizableInstances now.
But: Why do you use a visualization tutorial?
Visualization won't produce an .arff file.

Failing for Larger Input Files Only: FileServiceFactory getBlobKey throws IllegalArgumentException

I have a Google App Engine app that converts XML to CSV files. It works fine for small XML inputs, but refuses to finalize the files for larger XML inputs. The XML is read from, and the resulting CSV files are written to, many times before finalization, over a long-running (multi-day) task. My problem is different from this one: FileServiceFactory getBlobKey throws IllegalArgumentException, since my code works fine both in production and development with small input files. So it's not that I'm neglecting to write to the file before closing/finalizing. However, it fails when I attempt to read from a larger XML file. The input XML file is ~150MB, and each of the 5 resulting CSV files is much smaller (perhaps 10MB each). I persisted the file URLs for the new CSV files, and even tried to close them with some static code, but I just reproduce the same error, which is
java.lang.IllegalArgumentException: creation_handle: String properties must be 500 characters or less. Instead, use com.google.appengine.api.datastore.Text, which can store strings of any length.
at com.google.appengine.api.datastore.DataTypeUtils.checkSupportedSingleValue(DataTypeUtils.java:242)
at com.google.appengine.api.datastore.DataTypeUtils.checkSupportedValue(DataTypeUtils.java:207)
at com.google.appengine.api.datastore.DataTypeUtils.checkSupportedValue(DataTypeUtils.java:173)
at com.google.appengine.api.datastore.Query$FilterPredicate.<init>(Query.java:900)
at com.google.appengine.api.datastore.Query$FilterOperator.of(Query.java:75)
at com.google.appengine.api.datastore.Query.addFilter(Query.java:351)
at com.google.appengine.api.files.FileServiceImpl.getBlobKey(FileServiceImpl.java:329)
But I know that it's not a String/Text data type issue, since I am already using similar length file service urls for the previous successful attempts with smaller files. It also wasn't an issue for the other stackoverflow post I linked above. I also tried putting one last meaningless write before finalizing, just in case it would help as it did for the other post, but it made no difference. So there's really no way for me to debug this... Here is my file closing code that is not working. It's pretty similar to the Google how-to example at http://developers.google.com/appengine/docs/java/blobstore/overview#Writing_Files_to_the_Blobstore .
log.info("closing out file 1");
try {
//locked set to true
FileWriteChannel fwc1 = fileService.openWriteChannel(csvFile1, true);
fwc1.closeFinally();
} catch (IOException ioe) {ioe.printStackTrace();}
// You can't get the blob key until the file is finalized
BlobKey blobKeyCSV1 = fileService.getBlobKey(csvFile1);
log.info("csv blob storage key is:" + blobKeyCSV1.getKeyString());
csvUrls[i-1] = blobKeyCSV1.getKeyString();
break;
At this point, I just want to finalize my new blob files, for which I have the URLs, but I cannot. How can I get around this issue, and what may be the cause? Again, my code works for small files (~60 kB), but fails for the ~150MB input file. Thank you for any advice on what is causing this or how to get around it! Also, how long will my unfinalized files stick around before being deleted?
This issue was a bug in the Java MapReduce and Files API, which was recently fixed by Google. Read the announcement here: groups.google.com/forum/#!topic/google-appengine/NmjYYLuSizo
