I'm trying to load an ISO-8859-1 file into BigQuery using Dataflow. I've built a template with Apache Beam Java. Everything works well, but when I check the content of the BigQuery table I see that some characters like 'ñ' or accented vowels like 'á', 'é', etc. haven't been stored properly; they have been stored as �.
I've tried several charset conversions before writing into BigQuery. I've also created a special ISOCoder and passed it to the pipeline using the method setCoder(), but nothing works.
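Roughly, the coder I tried looks like this (a simplified sketch; the real class has more error handling, and the length-prefixing scheme here is just one way of doing it):

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.coders.AtomicCoder;

// Simplified sketch of the ISO-8859-1 coder I passed to setCoder().
public class ISOCoder extends AtomicCoder<String> {

  @Override
  public void encode(String value, OutputStream outStream) throws IOException {
    byte[] bytes = value.getBytes(StandardCharsets.ISO_8859_1);
    // Length-prefix the payload so elements can be read back from a stream.
    new DataOutputStream(outStream).writeInt(bytes.length);
    outStream.write(bytes);
  }

  @Override
  public String decode(InputStream inStream) throws IOException {
    DataInputStream in = new DataInputStream(inStream);
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    return new String(bytes, StandardCharsets.ISO_8859_1);
  }

  @Override
  public void verifyDeterministic() {}
}
```

I set it on the PCollection right after the read, e.g. lines.setCoder(new ISOCoder()), but the � characters still end up in BigQuery.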
Does anyone know if it is possible to load this kind of file into BigQuery using Apache Beam, or is only UTF-8 supported?
Thanks in advance for your help.
This feature is currently not available in the Java SDK of Beam. In Python it seems to be possible by using the additional_bq_parameters argument of WriteToBigQuery, see: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L177
Related
I have created a Python script for predictive analytics using pandas, numpy, etc. I want to send my result set to a Java application. Is there a simple way to do it? I found we can use Jython for Java-Python integration, but it doesn't support many of the data analysis libraries. Any help will be great. Thank you.
Have you tried using XML to transfer the data between the two applications?
My next suggestion would be to output the data in JSON format to a text file and then call the Java application, which will read the JSON from the text file.
A better approach here is to use Java pipe input, like python pythonApp.py | java read. The output of the Python application can be used as input for the Java application as long as the data format is consistent and known. The above solutions of creating a file and then reading it also work, but are more error-prone.
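To illustrate, here is a minimal sketch of the Java side of that pipe; the class name ReadFromPython stands in for whatever read program you invoke, and the per-line parsing is left as a placeholder:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Reads whatever the Python script prints, one line at a time, from standard input,
// e.g. invoked as: python pythonApp.py | java ReadFromPython
public class ReadFromPython {
  public static void main(String[] args) throws Exception {
    BufferedReader in = new BufferedReader(
        new InputStreamReader(System.in, StandardCharsets.UTF_8));
    String line;
    while ((line = in.readLine()) != null) {
      // Parse the line here (CSV, JSON, or whatever format both sides agreed on).
      System.out.println("received: " + line);
    }
  }
}
```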
Has anything changed recently in the way Google Dataflow reads compressed files from Google Cloud Storage? I am working on a project that reads compressed CSV log files from GCS and uses these files as the source for a Dataflow pipeline. Until recently this worked perfectly, with and without specifying the compression type of the file.
Currently the processElement method in my DoFn is only called once (for the CSV header row) although the file has many rows. If I use the same source file uncompressed then everything works as expected (the processElement method is called for every row). As suggested here https://stackoverflow.com/a/27775968/6142412, setting the Content-Encoding to gzip does work, but I did not have to do this previously.
I am experiencing this issue when using DirectPipelineRunner or DataflowPipelineRunner. I am using version 1.5.0 of the Cloud Dataflow SDK.
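For reference, the read is set up roughly like this (a simplified sketch; the bucket path and the DoFn body are placeholders for my actual code):

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

public class CompressedCsvPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(TextIO.Read.named("ReadLogs")
            .from("gs://my-bucket/logs/*.csv.gz")              // placeholder path
            .withCompressionType(TextIO.CompressionType.GZIP)) // same behaviour with AUTO or omitted
     .apply(ParDo.of(new DoFn<String, String>() {
       @Override
       public void processElement(ProcessContext c) {
         // With the gzipped file this only ever fires for the header row.
         c.output(c.element());
       }
     }));

    p.run();
  }
}
```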
We identified a problem (BEAM-167) reading from concatenated gzip files. It has been fixed in the Apache Beam GitHub repository by PR 114 and in the Dataflow SDK GitHub repository by PR 180. It will be part of the next release.
Until then, a workaround is to use the SDK built from GitHub, or to compress the entire file as a single gzip part rather than concatenating separately compressed parts.
Is there any Java API similar to the Open XML SDK 2.0? I just need to convert an Office XML Excel file to a .xlsx file.
I'm creating the Office XML Excel file using XML and XSLT. I tried Apache POI to read the XML Excel file, but I am getting an invalid header format exception.
Thanks.
Well, I believe the best API out there to handle *.xlsx files is Apache POI (it has *.xlsx support since 3.7 or so).
Some alternatives:
There was a project called JExcel API, but there has not been much activity there in the last three or so years (and I'm not sure whether it handles the *.xlsx format or only *.xls, but I might be wrong).
I'm not sure, but the OpenOffice UDK might also help you. Unfortunately it is only a binding and requires an installed implementation (i.e., you have to install OpenOffice in order to use it), which is not always acceptable on the server side if you do not have any X servers there.
Another option is accessing it through Jacob via COM. The pro is that you are able to access all of the data; the con is COM itself: you need Excel installed on your machine (and of course it is a Windows-specific solution).
I believe the best way is to stick with Apache POI; it is usually perfectly sufficient if you just want to read/write cell data.
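For illustration, a minimal POI sketch (assuming POI 3.7+ with the poi-ooxml jar on the classpath; the file name and cell contents are just examples) that writes a couple of cells to an .xlsx file:

```java
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class XlsxDemo {
  public static void main(String[] args) throws Exception {
    // XSSFWorkbook is the .xlsx implementation (HSSFWorkbook is the old .xls one).
    Workbook wb = new XSSFWorkbook();
    Sheet sheet = wb.createSheet("Data");
    Row header = sheet.createRow(0);
    header.createCell(0).setCellValue("Name");
    header.createCell(1).setCellValue("Value");
    Row row = sheet.createRow(1);
    row.createCell(0).setCellValue("example");
    row.createCell(1).setCellValue(42.0);

    try (FileOutputStream out = new FileOutputStream("demo.xlsx")) {
      wb.write(out);
    }
  }
}
```

Reading an existing .xlsx back is symmetric: open it with new XSSFWorkbook(inputStream) and iterate over the sheets, rows and cells.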
I want to convert an .mxd file into a .pdf file. I have googled this topic but ended up with nothing. I want to know whether I can convert .mxd to .pdf directly or whether I need intermediate conversions.
Any help would be appreciated.
Thank you.
Typically, .mxd files are mapping files created with ESRI ArcGIS. ArcMap has a tool to export a specific section to a PDF.
If you must do this programmatically (not by using a manual tool), I believe you can do it by publishing the MXD as a map service and then using the JavaScript (or other) APIs to make the conversion.
Well, I found this:
http://arcscripts.esri.com/details.asp?dbid=15139
This is a question that seems to have quite a few options for Python, but none for Java, after googling for two days. I could really use some help. All I have found so far is a recommendation to use gaeVFS to build an Excel file from the XML components and then zip it all together, which sounds like a slap in the face. Oh yes, and if you were wondering, I am questioning my use of Java rather than Python, but at 5,000 lines of code it would be insane to turn back now...
Other things you might find useful:
Client: GWT
Server: Servlets running on Google App Engine, storing data in the Google Datastore
Excel file: mandatory, CSV isn't good enough; no need to save the file, just to be able to "serve" it to the client, i.e. open a "Save As" box.
Have you checked out this API already: Java Excel API?
You could also take a look at the Apache POI project. You can read and write MS Excel documents with this library.
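As a rough sketch (not a drop-in implementation), this is how a servlet could serve a POI-generated .xlsx so the browser opens a "Save As" dialog; the class name, sheet name and file name are placeholders, and everything stays in memory, so nothing has to be written to the App Engine file system:

```java
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class ExportServlet extends HttpServlet {
  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
    // Build the workbook in memory (fill it from the datastore in the real app).
    Workbook wb = new XSSFWorkbook();
    wb.createSheet("Report").createRow(0).createCell(0).setCellValue("Hello from GAE");

    // The content type plus Content-Disposition: attachment triggers the "Save As" box.
    resp.setContentType("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
    resp.setHeader("Content-Disposition", "attachment; filename=\"report.xlsx\"");
    wb.write(resp.getOutputStream());
  }
}
```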
Take a look at this post.
It's a step-by-step tutorial on how to generate Excel files on Google App Engine.
Try this:
http://code.google.com/p/gwt-table-to-excel/
Google App Engine does not allow writing to the local file system with the standard file I/O classes; you need to use the Google App Engine virtual file system (gaeVFS) or build the file in memory.