talend: Merge multiple (complex) flat files into single JSON file

I'm testing Talend for potential use in a project. Basic tasks are completed easily, but I'm struggling with the following situation:
We have multiple flat files, all of which combine to describe various items. For my testing, I would simply like to merge two of these files (for now) into a JSON format. The catch here is that one of the files contains 1 or more rows per item;
For example:
File 1: id, category
1, A
2, A
3, B
File 2: id, language, colour
1, en_GB, Red
1, de_DE, Rot
2, en_GB, Blue
3, en_GB, Green
3, de_DE, Grün
3, es_ES, Verde
The result should look something like this:
{
"items": [{
"id": 1,
"category": "A",
"colours": [{
"language": "en_GB",
"colour": "Red"
}, {
"language": "de_DE",
"colour": "Rot"
}]
},
...
]
}
What I have tried so far is:
tMap to merge the files/rows together, then tAggregate to group by the id's. This does not quite work, as it results in the language and colour attributes being formatted individually as comma separated lists:
ie.
"language": "en_GB, de_DE",
"colour": "Red, Rot"
This is not what we require.
Is it possible to achieve what we need in talend? If so, how?

Here's a solution I put together using a Java JSON library, since the built-in JSON components do not handle such a complex structure.
First, load json-java.jar using a tLibraryLoad. Then join the data with a tMap (on the id column, returning all matches), aggregate it by id with a tAggregateRow, and output a list of objects for language and colour. Finally, in a tJavaFlex, loop over the rows to construct the final JSON.
This gives the following output, based on your example:
{
items: [{
"id": 1,
"category": "A",
"colours": [{
"colour": "Red",
"language": "en_GB"
}, {
"colour": "Rot",
"language": "de_DE"
}
]
}, {
"id": 2,
"category": "A",
"colours": [{
"colour": "Blue",
"language": "en_GB"
}
]
}, {
"id": 3,
"category": "B",
"colours": [{
"colour": "Green",
"language": "en_GB"
}, {
"colour": "Grün",
"language": "de_DE"
}, {
"colour": "Verde",
"language": "es_ES"
}
]
}
]
}
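
As a rough sketch of what the join, the aggregation and the tJavaFlex loop do together, here is the same logic in plain Java. It assumes numeric ids, takes both files as already-parsed rows, and builds the JSON text by hand so that it runs without json-java.jar. The names ItemJsonMerger, merge and quote are made up for illustration; this is not Talend-generated code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the tMap + tAggregateRow + tJavaFlex chain.
final class ItemJsonMerger {

    /** Merges rows of file 1 (id, category) and file 2 (id, language, colour) into nested JSON text. */
    static String merge(List<String[]> categories, List<String[]> colours) {
        // Group the file 2 rows by id, keeping language and colour paired per row
        // (this is what avoids the "en_GB, de_DE" / "Red, Rot" comma-separated lists).
        Map<String, List<String[]>> coloursById = new LinkedHashMap<>();
        for (String[] row : colours) {
            coloursById.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
        }

        StringBuilder json = new StringBuilder("{\"items\":[");
        for (int i = 0; i < categories.size(); i++) {
            String[] item = categories.get(i);
            if (i > 0) {
                json.append(',');
            }
            // Assumption: ids are numeric, so they are written without quotes.
            json.append("{\"id\":").append(item[0])
                .append(",\"category\":").append(quote(item[1]))
                .append(",\"colours\":[");
            List<String[]> matches = coloursById.getOrDefault(item[0], new ArrayList<>());
            for (int j = 0; j < matches.size(); j++) {
                if (j > 0) {
                    json.append(',');
                }
                json.append("{\"language\":").append(quote(matches.get(j)[1]))
                    .append(",\"colour\":").append(quote(matches.get(j)[2]))
                    .append('}');
            }
            json.append("]}");
        }
        return json.append("]}").toString();
    }

    /** Minimal JSON string quoting; only escapes backslash and double quote. */
    static String quote(String s) {
        return '"' + s.replace("\\", "\\\\").replace("\"", "\\\"") + '"';
    }

    public static void main(String[] args) {
        List<String[]> categories = new ArrayList<>();
        categories.add(new String[] {"1", "A"});
        categories.add(new String[] {"2", "A"});

        List<String[]> colours = new ArrayList<>();
        colours.add(new String[] {"1", "en_GB", "Red"});
        colours.add(new String[] {"1", "de_DE", "Rot"});
        colours.add(new String[] {"2", "en_GB", "Blue"});

        System.out.println(merge(categories, colours));
    }
}
```

In a real job a JSON library would handle the escaping and formatting; the point here is only that each (language, colour) pair stays together as one object inside the item's colours array.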

Related

How to get rid of "col1" aliases from aggregate struct objects in Spark?

I'm trying to aggregate JSON objects into a JSON list, dynamically creating struct objects with varying numbers of fields. Each time, I create an aggregate using the snippet below:
batched = dataset.select(col(asteriskChar), row_number()
.over(Window.orderBy(order)).alias(rowNumAlias))
.withColumn(batchAlias, functions.ceil(col(rowNumAlias).divide(batchSize)))
.groupBy(col(batchAlias))
.agg(functions.collect_list(struct(structCol)).alias(batchedColAlias));
I would like to have object batches like below:
[
{
"id": 1,
"first": "John",
"last": "Thomas",
"score": 88
},
{
"id": 2,
"first": "Anne",
"last": "Jacobs",
"score": 32
}
]
but I get this instead:
[
{
"col1": {
"id": 1,
"first": "John",
"last": "Thomas",
"score": 88
}
},
{
"col1": {
"id": 2,
"first": "Anne",
"last": "Jacobs",
"score": 32
}
}
]
How can I get rid of the "col1" fields and make each of those JSON values a single object within the array? Thank you in advance.

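Most probably you don't need the struct there. If structCol is already a struct column, wrapping it in struct(...) nests it under an auto-generated col1 field, so collect the column directly: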
.groupBy(col(batchAlias))
.agg(functions.collect_list(structCol).alias(batchedColAlias));

How can I store and update the nested json object into couchbase using java sdk

I am using Couchbase Community Edition 5.0.1 and java-client 2.7.4. I want to store the following nested JSON object in Couchbase, and later update the same object without affecting its other fields.
For example:
I want to add one more player object to a players array.
I want to add one more group, say 'Z Group', to the group array.
How can I achieve this without affecting the other fields?
{
"doctype": "config:sample",
"group": [{
"name": "X Group",
"id": 1,
"players": [{
"name": "Roger Federer",
"number": 3286,
"keyword": "tennies"
},
{
"name": "P. V. Sindhu",
"number": 4723,
"keyword": "badminton"
}
]
},
{
"name": "Y Group",
"id": "2",
"players": [{
"name": "Jimmy Connors",
"number": 5623,
"keyword": "tennies"
},
{
"name": "Sachin",
"number": 8756,
"keyword": "Cricket"
}
]
}
]
}

N1QL has a huge variety of functions to operate on arrays:
https://docs.couchbase.com/server/current/n1ql/n1ql-language-reference/arrayfun.html
In your case, you could simply use ARRAY_INSERT or ARRAY_PREPEND

Check out update/update-for syntax (last example) https://docs.couchbase.com/server/current/n1ql/n1ql-language-reference/update.html
UPDATE default AS d
SET d.group = ARRAY_APPEND(d.group, {......})
WHERE .....;
UPDATE default AS d
SET g.players = ARRAY_APPEND(g.players, {......}) FOR g IN d.group WHEN g.id = 2 END
WHERE .....;

If you know which document IDs you want to update you can use the key-value subdocument API, which will generally be faster than going via N1QL for a single document update.
This will add a new player to the end of X Group's "players" array:
bucket.mutateIn(docId)
.arrayAppend("group[0].players",
JsonObject.create()
.put("name", "John Smith"))
// ... other player JSON
.execute();
And this will add a new Group Z to the "group" array:
bucket.mutateIn(docId)
.arrayAppend("group",
JsonObject.create()
.put("name", "Z Group"))
// ... other group JSON
.execute();

Failure to parse this json field

I can get other fields correctly, but I cannot parse the "title" or "title_full" values; I always receive an empty string. I am using the org.json library. Below are my code and the JSON. What's the trick?
try {
title = jsonDoc.getString("title_full");
} catch (JSONException e) {
log.info("no full title: " + docString);
}
{
"organizations": [],
"uuid": "d0adc516c9012113774557365f9847da99b228e7",
"thread": {
"site_full": "www.fark.com",
"main_image": "http://img.fark.net/images/cache/orig/5/51/fark_514Jh7VFpynQw4MyN2xcK1jwCxk.png?t=RQrnhq8EGZiUuElMitgLOQ&f=1488776400",
"site_section": "http://www.fark.com/discussion/",
"section_title": "FARK.com: Discussion links",
"url": "http://www.fark.com/comments/9500577/I-want-to-support-work-that-NY-Times-Washington-Post-are-doing-I-can-only-afford-one-subscription-Who-do-you-recommend-I-throw-my-support-to?cpp=1",
"country": "US",
"domain_rank": 3382,
"title": "(9500577) I want to support the work that the NY Times and Washington Post are doing. I can only afford one subscription. Who do you recommend I throw my support to?",
"performance_score": 0,
"site": "fark.com",
"participants_count": 31,
"title_full": "FARK.com: (9500577) I want to support the work that the NY Times and Washington Post are doing. I can only afford one subscription. Who do you recommend I throw my support to?",
"spam_score": 0.0,
"site_type": "discussions",
"published": "2017-03-03T12:00:00.000+02:00",
"replies_count": 2,
"uuid": "67213179a24931106e75cd588386bd30fb3bbdc8"
},
"author": "EbolaNYC",
"url": "http://www.fark.com/comments/9500577/I-want-to-support-work-that-NY-Times-Washington-Post-are-doing-I-can-only-afford-one-subscription-Who-do-you-recommend-I-throw-my-support-to?cpp=1#c107765048",
"ord_in_thread": 1,
"title": "",
"locations": [],
"entities": {
"persons": [],
"locations": [],
"organizations": []
},
"highlightText": "",
"language": "english",
"persons": [],
"text": "dionysusaur : Either the NY Post or the WA Times.\nOnly asshats read the NY Post.",
"external_links": [],
"published": "2017-03-03T15:58:00.000+02:00",
"crawled": "2017-03-03T17:05:26.049+02:00",
"highlightTitle": "",
"social": {
"gplus": {"shares": 0},
"pinterest": {"shares": 0},
"vk": {"shares": 0},
"linkedin": {"shares": 0},
"facebook": {"likes": 0, "shares": 0, "comments": 0},
"stumbledupon": {"shares": 0}
}
}

Your JSON looks like this:
{
"main": {
"key": "value"
}
}
So first fetch the main JSON object, and then the key.
The code should look like this:
String something = jsonDoc.getJSONObject("main").get("key").toString();
There are two title values in your JSON, so check which title you need before fetching.

After I formatted the json code, the problem becomes obvious:
title_full is only available inside the thread node, and a non-empty title is also only inside the thread node. So you'll first have to access the thread node and then access title and title_full inside that node.
Using the org.json library, you can access the fields like this:
String fullTitle = jsonDoc.getJSONObject("thread").getString("title_full");
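
To make the nesting explicit, here is a small sketch that uses plain nested maps as a stand-in for org.json's JSONObject, so it runs without the library. The class NestedTitleLookup and its getPath and sampleDoc helpers are hypothetical, and the document is trimmed to the relevant fields with a shortened title. A top-level lookup of title_full finds nothing, while going through thread first succeeds.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Nested maps stand in for org.json's JSONObject here; names are illustrative only.
final class NestedTitleLookup {

    /** Follows the given keys through nested maps; returns null when any step is missing. */
    static Object getPath(Map<String, Object> doc, String... keys) {
        Object current = doc;
        for (String key : keys) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }

    /** Builds a trimmed-down copy of the document from the question (titles shortened). */
    static Map<String, Object> sampleDoc() {
        Map<String, Object> thread = new LinkedHashMap<>();
        thread.put("title", "(9500577) ...");
        thread.put("title_full", "FARK.com: (9500577) ...");

        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("thread", thread);
        doc.put("author", "EbolaNYC");
        doc.put("title", ""); // the top-level title really is an empty string
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = sampleDoc();
        System.out.println("top-level title_full: " + getPath(doc, "title_full"));
        System.out.println("thread title_full:    " + getPath(doc, "thread", "title_full"));
    }
}
```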

If you take a look at the JSON, you will see that the "title" and "title_full" fields are inside the thread field.
So try reading that field and then parsing the field into a new JSONObject, and you should be able to get them.

Show JSON data in grid in vaadin

I am a newbie to Vaadin. I have to show the data from a JSON file (which is fetched from a MySQL DB) in a Grid/Table (Vaadin). I am able to show the data in the table if the JSON is in the format below.
[
{
"id": "ex-wardrobe",
"productId": "ex-wardrobe",
"name": "exWardrobe",
"desc": "Some description",
"dimension": "WxDxH 148\" X 24\" X 112\" ",
"category": "Bedroom",
"subcategory": "Wardrobe",
"categoryId": "bedroom",
"subcategoryId": "wardrobe",
"tags": "all, Space Design Bedroom, Space Details Wardrobe",
"designer": "hb",
"curr": "INR",
"popularity": "1",
"relevance": "1",
"shortlisted": "1",
"likes": "1",
"createDt": "",
"pageId": "ex-wardrobe",
"styleName": "Fresh",
"styleId": "cfresh",
"priceRange": "Premium",
"priceId": "premium",
"defaultPrice": "123",
"defaultMaterial": "MDF ",
"defaultFinish": "LAMINATE"
}
]
But if I get the JSON (the data relates to the same product) in the format below, I am unable to add the data to the table.
[
{
"id": "ex-wardrobe",
"productId": "ex-wardrobe",
"name": "exWardrobe",
"desc": "Some description",
"dimension": "WxDxH 148\" X 24\" X 112\" ",
"category": "Bedroom",
"subcategory": "Wardrobe",
"categoryId": "bedroom",
"subcategoryId": "wardrobe",
"tags": "all, Space Design Bedroom, Space Details Wardrobe",
"designer": "hb",
"curr": "INR",
"popularity": "1",
"relevance": "1",
"shortlisted": "1",
"likes": "1",
"createDt": "",
"pageId": "ex-wardrobe",
"styleName": "Fresh",
"styleId": "cfresh",
"priceRange": "Premium",
"priceId": "premium",
"defaultPrice": "123",
"defaultMaterial": "MDF ",
"defaultFinish": "LAMINATE",
"mf": [
{
"basePrice": "123",
"material": "MDF ",
"finish": "LAMINATE"
}
],
"images": [
"066___ex_WARDROBE_Dim.jpg",
"067___ex_WARDROBE_close_door.jpg",
"068___ex_DOVE_dim.jpg"
],
"components": [],
"accessories": []
}
]
This is the code I am using to show the JSON data in the table:
Table grid = new Table();
root.addComponent(grid);
grid.setStyleName("iso3166");
grid.setPageLength(6);
grid.setSizeFull();
grid.setSelectable(true);
grid.setMultiSelect(false);
grid.setImmediate(true);
grid.setColumnReorderingAllowed(true);
grid.setColumnCollapsingAllowed(true);
try {
JSONArray products = productsDataProvider.getCatalogs();
JsonContainer dataSource =
JsonContainer.Factory.newInstance(products.toString());
grid.setContainerDataSource(dataSource);
grid.setColumnReorderingAllowed(true);
grid.setWidth("98%");
grid.addStyleName(ChameleonTheme.TABLE_STRIPED);
} catch (IllegalArgumentException ignored) {
}
grid.setWidth("100%");
grid.setHeight("100%");
root.addComponent(grid);
I am stuck on this and have had sleepless nights over it. A million thanks in advance. I hope you gurus can help me with this :)

Sorry, I'm not a Vaadin expert; I'm seeing it for the first time and I like it. I guess your problem is the arrays inside your object. I mean this:
"mf": [
{
"basePrice": "173881",
"material": "MDF ",
"finish": "LAMINATE"
}
],
"images": [
"066___ex_WARDROBE_Dim.jpg",
"067___ex_WARDROBE_close_door.jpg",
"068___ex_DOVE_dim.jpg"
],
"components": [],
"accessories": []
No idea how the component should display this. Have you tried it without mf, images, components and accessories?

You are using a very simple JSONContainer. As can be seen in the source code, this implementation does not support nested / compound elements and arrays.
First you have to ask yourself how these complex objects need to be displayed (UX), especially the "mf" field.
UPDATE
Simple compound objects (like "simpleCompound": {"name": "foo", "number": 123}) can be shown in a table column (not supported by the JSONContainer you use, but similar functionality is available in the BeanItemContainer, so look there for how to implement it).
The array fields are more problematic from a UX standpoint. Mostly this information is only shown on demand or in separate panels. The Vaadin Grid component offers the possibility to show a details view, see the wiki. Maybe that will fit your requirements.

Currently, nested JSON data is not supported unless you create your own advanced template for it. There is an online JSON container demo application that tests whether the data is plain JSON or a JSON array (but not nested) and then displays it in a grid table, so you can use it to verify your data.
You can also get the application template with source code on GitHub.

Displaying JSON in a ListView with separators

I want to be able to take the JSON data and format it into a ListView with each of the outermost objects as the headings. For example, there should be a divider for "Company A" with all of its projects under the divider. Then there should be the "Company B" divider with its projects under that header. Here's an example of a JSON response I'll be working with. I know how to parse the JSON, just not how to display it.
{
"Company A": {
"name": "Company A",
"id": "1145",
"projects": [
{
"name": "Test Project - DELETE",
"id": "39771",
"amount": "0.00",
"billingType": "HOURLY",
"date": "2012-07-09 15:38:06",
"u_id": "25445",
"itemID": "3"
},
{
"name": "TEST",
"id": "39905",
"amount": "0.00",
"billingType": "FIXED",
"date": "2012-07-10 13:19:10",
"u_id": "25455",
"itemID": "1"
},
{
"name": "Test Project - DELETE",
"id": "39771",
"amount": "0.00",
"billingType": "HOURLY",
"date": "2012-07-09 15:38:06",
"u_id": "25445",
"itemID": "4"
}
]
},
"Company B": {
"name": "Company B",
"id": "5569",
"projects": [
{
"name": "Type Test",
"id": "39657",
"amount": "0.00",
"billingType": "FIXED",
"date": "2012-07-12 10:14:30",
"u_id": "25479",
"itemID": "1"
}
]
}
}
Is there an easy way to achieve this kind of formatting?

Yes and no.
You can easily convert each set (header with content) into an object, and the content itself into sub-objects (if you need help, ask :)); the hard part is configuring the ListView if you aren't familiar with using multiple item types.
I think the answer to this question will be of use to you.
To summarize: basically, ListView can be made to use multiple item types; so your header would be one item type and each data item would be of a second type. Just implement the glue logic so that you get the right view type for the right object, and the right object for the right ListView "position".
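
A minimal sketch of that glue logic in plain Java, leaving the Android classes out so it runs on its own: the parsed data is flattened into a single list of rows, each tagged as a header or an item. SectionedRows, Row, flatten and the two TYPE_ constants are hypothetical names. In an actual adapter, getItemViewType(position) would presumably return rows.get(position).type and getViewTypeCount() would return 2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical row model for a ListView with two item types.
final class SectionedRows {
    static final int TYPE_HEADER = 0;
    static final int TYPE_ITEM = 1;

    /** One ListView position: either a company header or a project item. */
    static final class Row {
        final int type;
        final String label;

        Row(int type, String label) {
            this.type = type;
            this.label = label;
        }
    }

    /** Flattens company -> project names into one list: a header row, then that company's items. */
    static List<Row> flatten(Map<String, List<String>> projectsByCompany) {
        List<Row> rows = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : projectsByCompany.entrySet()) {
            rows.add(new Row(TYPE_HEADER, entry.getKey()));
            for (String project : entry.getValue()) {
                rows.add(new Row(TYPE_ITEM, project));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the companies in insertion order.
        Map<String, List<String>> data = new LinkedHashMap<>();
        data.put("Company A", Arrays.asList("Test Project - DELETE", "TEST", "Test Project - DELETE"));
        data.put("Company B", Arrays.asList("Type Test"));

        for (Row row : flatten(data)) {
            System.out.println((row.type == TYPE_HEADER ? "== " : "   ") + row.label);
        }
    }
}
```

With the rows flattened this way, the ListView position maps directly to an index in the list, and the row's type decides which layout to inflate.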
