Camel: How to skip multiple header lines in CSV files

I'm going to process CSV files using Apache Camel. My files have multiple header lines. In Camel I can only find skipFirstLine or skipHeaderRecord (which is not clear to me), but how do I skip more than one line?

You can use the tokenize method on the body before processing it:
tokenize(String token, int group, boolean skipFirst)
Note that skipFirst only drops the first token (here, the first line).
Example:
`from("filePath")
    .split(body().tokenize("\n", 1, true))
    .streaming()
    .process(exchange -> {....})
    .to("filePath");`

If the number of lines to skip is fixed, you can use the Simple language to skip the first X lines. You will likely need to convert the message to a String first:
.convertBodyTo(String.class)
.transform(simple("${skip(3)}"))
See more about the skip function at: http://camel.apache.org/simple
This requires Camel 2.19 or later.
With older releases you would need to write some custom code yourself to skip the lines.
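For the custom-code route on older releases, a minimal sketch in plain Java might look like the following. The class and method names (HeaderSkipper, skipLines) are hypothetical, and it assumes the whole file fits in memory as a String and that Java 11+ is available for String.lines(). It could be called from a Camel bean or processor, but that wiring is not shown here.

```java
import java.util.stream.Collectors;

public class HeaderSkipper {

    // Returns the body without its first `linesToSkip` lines.
    // Assumption: the body is small enough to hold in memory as one String.
    public static String skipLines(String body, int linesToSkip) {
        return body.lines()
                .skip(linesToSkip)
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // Hypothetical file content with three header lines.
        String csv = "report title\ngenerated 2019\nid,name\n1,foo\n2,bar";
        System.out.println(skipLines(csv, 3));
    }
}
```

If fewer lines exist than are skipped, the result is simply an empty string.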

Related

How is shift-enter represented in a Word doc?

I'm using Java and Apache POI to read a Word document template and generate a new document from it. The original document has line breaks entered with "shift-enter"; I thought this would allow a line break while continuing the paragraph. But as I iterate through the runs, I seem to get an empty string at that point. There are 'flags' on the run; do they indicate the line break somehow? I want to keep it in the resulting document; I think what's happening is that I detect it as an empty string and leave it out. How can I detect its presence so I can keep it in the resulting document after I've processed the template?
As a side note, are those flags documented anywhere?

I suspect you are talking about XWPF, the part of Apache POI that handles the Office Open XML file format *.docx.
All Office Open XML file formats are ZIP archives containing XML files and other files in a special directory structure. So one can simply unzip a *.docx file and have a look inside.
For an explicit line break (Shift+Enter) you will find the following XML in /word/document.xml in that ZIP archive:
...
<w:r ...>
<w:br/>
</w:r>
...
So it is a run element (w:r) containing one or more break elements (w:br).
The run element (w:r) is the low-level source for an XWPFRun in Apache POI. It is represented by an org.openxmlformats.schemas.wordprocessingml.x2006.main.CTR, which can be obtained via XWPFRun.getCTR.
So if you have an XWPFRun run, you can get the explicit line breaks like this:
...
for (int i = 0; i < run.getCTR().getBrList().size(); i++) {
System.out.println("<BR />");
}
...
Is this documented anywhere?
There is ECMA-376 for Office Open XML.
The org.openxmlformats.schemas.wordprocessingml.x2006.main.* classes are auto-generated from this specification. Unfortunately, there is no publicly available API documentation. So one needs to download the sources from ooxml-schemas (up to Apache POI 4) or poi-ooxml-full (from Apache POI 5 on) and then generate the javadoc from them.
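The unzip-and-inspect approach described above can also be sketched with only the JDK, without Apache POI. The class name DocxBreakCounter and the helper sampleDocx are hypothetical; the helper builds a tiny in-memory archive for the demo and is not a complete, valid .docx file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DocxBreakCounter {

    // WordprocessingML namespace that the w: prefix is bound to.
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Counts all w:br elements in word/document.xml. Page and column
    // breaks are also w:br elements (with a w:type attribute).
    public static int countBreaks(InputStream docx) {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().equals("word/document.xml")) {
                    continue;
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                // Copy the entry first so the parser cannot close the ZIP stream.
                Document doc = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(zip.readAllBytes()));
                return doc.getElementsByTagNameNS(W_NS, "br").getLength();
            }
            return 0;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read docx", e);
        }
    }

    // Builds a minimal demo archive with two line breaks (assumed sample data).
    public static byte[] sampleDocx() {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p>"
                + "<w:r><w:t>first</w:t></w:r>"
                + "<w:r><w:br/></w:r>"
                + "<w:r><w:t>second</w:t><w:br/></w:r>"
                + "</w:p></w:body></w:document>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(countBreaks(new ByteArrayInputStream(sampleDocx())));
    }
}
```

This is only useful for inspection; for rewriting the document, the getCTR().getBrList() approach stays within POI's own object model.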

Apache Camel: How to look inside body to determine file format

We receive .csv files (both via ftp and email) each of which can be one of a few different formats (that can be determined by looking at the top line of the file). I am fairly new to Apache Camel but want to implement a content based router and unmarshal each to the relevant class.
My current solution is to break the files down into lists of strings, manually use the first line to determine the type of file, and then use the rest of the strings to create the relevant entity instances.
Is there a cleaner and better way?

You could use a POJO to implement the type check in whatever way works best for your files.
public String checkFileType(@Body File file) {
    return determineFileType(file);
}
private String determineFileType(File file) {...}
This keeps your route clean by separating the file type check from the rest of the processing, since the check is just metadata enrichment.
For example, you could set the return value as a message header by calling the bean:
.setHeader("fileType", method(fileTypeChecker))
Then you can easily route the files by type using the message header.
.choice()
.when(header("fileType").isEqualTo("foo"))
...
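The body of determineFileType is left open above. One possible sketch, assuming the format can be recognised from the start of the first line, is shown below. The header prefixes ("order_id,", "customer_id,") and the returned type names are hypothetical placeholders, and the method takes a Reader rather than a File so that it can be tried without touching the file system.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

public class FileTypeChecker {

    // Reads only the first line and maps it to a type name.
    // The header prefixes below are assumptions; replace them with
    // the real first lines of your own formats.
    public static String determineFileType(Reader content) {
        try (BufferedReader reader = new BufferedReader(content)) {
            String firstLine = reader.readLine();
            if (firstLine == null) {
                return "empty";
            }
            String header = firstLine.trim().toLowerCase();
            if (header.startsWith("order_id,")) {
                return "orders";
            }
            if (header.startsWith("customer_id,")) {
                return "customers";
            }
            return "unknown";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned string is what would end up in the fileType header and drive the choice() branches.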

How to ignore a record in the last line of the CSV file using Apache Commons CSV java?

I'm using Apache Commons CSV to read a CSV file. The last line of the file contains information about the file itself (date and time of generation).
|XXXX |XXXXX|XXXXX|XXXX|
|XXXX |XXXXX|XXXXX|XXXX|
|File generation: 21/01/2019 17.34.00| | | |
So while parsing the file, I'm getting this as a record (obviously).
I'm wondering whether there is any way to exclude it from parsing, and whether Apache Commons CSV has any provision to handle it.

It's a while loop, and you won't know you have reached the end until you get there. You have two options:
Bad option: Read the file once to count the number of lines, and then on the second pass break out of the loop when you reach line (counter-1).
Good option: Your files appear to be pipe-delimited, so when processing line by line simply check that line.trim().split("\\|").length > 1; in other words, only do the work while the number of fields per line is greater than 1. This ensures you don't apply your logic to lines with just one column, which in your case is the last row, i.e. the footer.
Example taken from the Apache Commons CSV documentation and modified a little:
Reader in = new FileReader("path/to/file.csv");
Iterable<CSVRecord> records = CSVFormat.RFC4180.parse(in);
for (CSVRecord record : records) {
    // every line except the last yields more than one field
    if (record.size() > 1) {
        // do your work here
        String columnOne = record.get(0);
        String columnTwo = record.get(1);
    }
}

Apache Commons CSV provides a function to ignore the header (https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.html#withSkipHeaderRecord--), but does not offer a way to ignore the footer. You can, however, simply read all records and manually ignore the last one.
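Manually ignoring the last record can be done in a single pass by holding one record back while iterating. The sketch below is generic plain Java with hypothetical names (FooterSkipper, allButLast); since it accepts any Iterable, it should also work with the records returned by the parser, though that combination is not shown here.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class FooterSkipper {

    // Returns every element except the last one. One element is held
    // back in `pending` and only emitted once a successor is seen.
    public static <T> List<T> allButLast(Iterable<T> records) {
        List<T> result = new ArrayList<>();
        Iterator<T> it = records.iterator();
        if (!it.hasNext()) {
            return result;
        }
        T pending = it.next();
        while (it.hasNext()) {
            result.add(pending);
            pending = it.next();
        }
        // The final `pending` value is the footer and is dropped.
        return result;
    }
}
```

Unlike the two-pass counting approach, this never needs to know the total number of lines in advance.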

Apache Camel Java DSL add newline to body

So I have a netty4 socket route set up in Java DSL that looks like the following:
@Override
public void configure() throws Exception {
    String dailyDataUri = "{{SOCKET.daily.file}}" + "&fileName=SocketData-${date:now:yyyyMMdd}.txt";
    from(socketLocation).routeId("thisRoute")
        .transform()
        .simple("${in.body}\n")
        .wireTap(dailyDataUri)
        .to(destination);
}
Where both the wireTap and the destination are sending their data to two separate files. And the data collection in the destination file is separated by a \n (line break)... or at least it should be.
When viewing the files created, the \n is never added.
The equivalent idea in the Spring DSL worked before I switched to Java:
<transform>
<simple>${in.body}\n</simple>
</transform>
After using that and opening the files created during the route, the lines of data that came in through the socket would be separated by a newline.
What am I doing wrong in the Java DSL that doesn't allow the newline to be appended to the socket data as it comes in?
I feel like it's something obvious that I just don't see.
The data that is coming in is just a CSV-like line of text.

I found a solution. I'm never sure what can be translated almost word for word from Spring to Java. Apparently the transform/simple combination has some issue that keeps it from working for me in the Java DSL.
So a possible solution (there may be more solutions) is to do this:
@Override
public void configure() throws Exception {
    String dailyDataUri = "{{SOCKET.daily.file}}" + "&fileName=SocketData-${date:now:yyyyMMdd}.txt";
    from(socketLocation).routeId("thisRoute")
        .transform(body().append("\n"))
        .wireTap(dailyDataUri)
        .to(destination);
}
Where instead of using the Simple language to manipulate the body, I just call on the body and append a String of \n to it. And that solves my issue.

Update: In Camel 3.x and above, the File component provides an option to append the characters you want.
Since you are writing the file with the File component (producer), you can use:
appendChars (producer)
Used to append characters (text) after writing files. This can for example be used to add new lines or other separators when writing and appending new files or existing files. To specify new-line (slash-n or slash-r) or tab (slash-t) characters then escape with an extra slash, eg slash-slash-n.

Apache Camel's ${file:ext} picks up everything after the first dot, instead of extension only

As the title says, I am trying to get file extension using Camel's File Language to specify the correct route.
choice().
when().simple("${file:ext} in 'xml'").
unmarshal(coreIt("jaxb[Core]")).
beanRef(connectorName()+coreIt("[Core]ImportConnector"), "processXml").
when().simple("${file:ext} in 'zip,7z'").
beanRef(connectorName()+coreIt("[Core]ImportConnector"), "extractZip").
endChoice();
The problem is that the client provides us with an xml file that has a date in the filename, separated by dots. For some reason Camel treats everything after the first dot as the extension. If I do:
when().simple("${file:ext} in '09.16.xml'").
it works...
Is there any solution or workaround apart from creating a separate folder to import xml files? Thanks for your time.

Well, it's tough, as some files may have a dot in the extension, such as '.tar.gz'. So ideally they should not use dots in the file name. To work around this, you would need to use some other Simple expression for the check. You can use ends with:
${file:name} ends with 'xml'
And then you can use or:
${file:name} ends with 'zip' || ${file:name} ends with '7z'
See more details at: http://camel.apache.org/simple
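The difference between "everything after the first dot" and "everything after the last dot" is easy to see in plain Java. The sketch below uses hypothetical names (FileExtensions, afterFirstDot, afterLastDot) and a made-up file name; it only illustrates the two behaviours and is not Camel's own implementation.

```java
public class FileExtensions {

    // Everything after the FIRST dot, which is the behaviour the
    // question describes for ${file:ext}.
    public static String afterFirstDot(String name) {
        int i = name.indexOf('.');
        return i < 0 ? "" : name.substring(i + 1);
    }

    // Everything after the LAST dot, which is what an
    // "ends with 'xml'" check effectively relies on.
    public static String afterLastDot(String name) {
        int i = name.lastIndexOf('.');
        return i < 0 ? "" : name.substring(i + 1);
    }

    public static void main(String[] args) {
        String name = "data.09.16.xml"; // hypothetical file name
        System.out.println(afterFirstDot(name));
        System.out.println(afterLastDot(name));
    }
}
```

Taking the last dot handles dated file names, at the cost of reporting 'gz' rather than 'tar.gz' for archives, which is the trade-off the answer points out.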
