How Cascading TextDelimited the log file

I am following the guide of Cascading on its website. I have the following TSV format input:
doc_id text
doc01 A rain shadow is a dry area on the lee back side of a mountainous area.
doc02 This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03 A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04 This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05 Two Women. Secrets. A Broken Land. [DVD Australia]
I use the following code to process it:
Tap docTap = new Hfs(new TextDelimited(true, "\t"), inPath);
...
Fields token = new Fields("token");
Fields text = new Fields("text");
RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
// only returns "token"
Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);
It looks like only the second part of each line is split (the doc_id part is ignored). How does Cascading skip the doc_id part and process only the second part? Is that because of TextDelimited?

Look at the pipe statement:
Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);
The second argument selects the fields that are passed to the splitter function. Here it is the 'text' field, so only the text is sent to the splitter, which returns the tokens.
The Javadoc for this Each constructor explains it:
Each
@ConstructorProperties(value={"name","argumentSelector","function","outputSelector"})
public Each(String name,
            Fields argumentSelector,
            Function function,
            Fields outputSelector)
Only pass argumentFields to the given function, only return fields selected by the outputSelector.
Parameters:
name - name for this branch of Pipes
argumentSelector - field selector that selects Function arguments from the input Tuple
function - Function to be applied to each input Tuple
outputSelector - field selector that selects the output Tuple from the input and Function results Tuples

The answer is in these two lines:
1. The Tap is created with the header flag set to true, which tells TextDelimited that the first line of the file contains the field names.
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
2. The field selector names the column "text". If you look closely at your input file, "text" is the header of the column that holds the data you are basing your word count on.
Fields text = new Fields( "text" );
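As a rough plain-Java analogy of what the two answers above describe (this is not Cascading's implementation, and the class and method names are made up for illustration): the header line gives each column a name, the selector "text" picks one column, and only that column is handed to the splitting regex. The sketch drops empty tokens for readability, which may differ from what RegexSplitGenerator emits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Cascading.
class HeaderFieldDemo {

    // Looks up fieldName in the tab-separated header and splits only that
    // column of the data line, using the same regex as the question.
    static List<String> tokensOf(String headerLine, String dataLine, String fieldName) {
        List<String> header = Arrays.asList(headerLine.split("\t"));
        int index = header.indexOf(fieldName);
        String[] columns = dataLine.split("\t");
        if (index < 0 || index >= columns.length) {
            throw new IllegalArgumentException("could not select field: " + fieldName);
        }
        List<String> tokens = new ArrayList<>();
        for (String token : columns[index].split("[ \\[\\]\\(\\),.]")) {
            if (!token.isEmpty()) {   // assumption: skip empty tokens
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String header = "doc_id\ttext";
        String row = "doc05\tTwo Women. Secrets. A Broken Land. [DVD Australia]";
        // The doc_id column is never passed to the regex.
        System.out.println(tokensOf(header, row, "text"));
    }
}
```

The doc_id value never reaches the splitter because it sits in a different named column, which is the effect of passing the "text" selector as the argument selector of Each.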

Related

Replace data between special character

My text file has a pattern and it's just like the following:
1;Mary Yeah;John Freeman;(12)3456-7890;iammary@gmail.com
2;Ash Wilson;One Two Three;(99)1111-2222;lorddragon@hotmail.com
3;Xin Zhao;Street Address 55;(11)0101-0202;lolyourface@gmail.com
4;My Name;My Address;My Phone;myemail@mail.com
I want to be able to type the line number, the type of data I want to replace(e-mail, phone, name), and the string I want to replace them with. The program overwrites the text.
How could I code this in Java?

The issue of how to find a given row based on the line number depends on many things, most importantly it depends on code you haven't shown us. But as for what you can do once you have found a given line, you may try the following:
String line = "2;Ash Wilson;One Two Three;(99)1111-2222;lorddragon@hotmail.com";
String[] parts = line.split(";");
parts[4] = "some.address@mail.com"; // index 4 is the email field
// now join the parts back into a single line
line = String.join(";", parts);
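Putting the line lookup and the field replacement together could look like the sketch below. It assumes the whole file fits in memory as a list of lines, that line numbers are 1-based, and that the field positions are 0 = id, 1 = name, 2 = address, 3 = phone, 4 = email as in the sample data. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration.
class FieldReplacer {

    // Returns a copy of lines in which one field of one line is replaced.
    // lineNumber is 1-based; fieldIndex is 0-based.
    static List<String> replaceField(List<String> lines, int lineNumber,
                                     int fieldIndex, String newValue) {
        List<String> result = new ArrayList<>(lines);
        // The limit -1 keeps trailing empty fields, so the column count is preserved.
        String[] parts = result.get(lineNumber - 1).split(";", -1);
        parts[fieldIndex] = newValue;
        result.set(lineNumber - 1, String.join(";", parts));
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "1;Mary Yeah;John Freeman;(12)3456-7890;iammary@gmail.com",
            "2;Ash Wilson;One Two Three;(99)1111-2222;lorddragon@hotmail.com");
        // Replace the email (field 4) on line 2.
        for (String line : replaceField(lines, 2, 4, "some.address@mail.com")) {
            System.out.println(line);
        }
    }
}
```

To overwrite the file on disk, the list could be read with java.nio.file.Files.readAllLines and written back with Files.write.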

Java - xgboost DMatrix input

When creating a DMatrix in Java with the xgboost4j package, I first succeeded in creating the matrix from a file path:
DMatrix trainMat = new DMatrix("...\\xgb_training_input.csv");
But when I try to train the model:
Booster booster = XGBoost.train(trainMat, params, round, watches, null, null);
I get the following error:
...regression_obj.cc:108: label must be in [0,1] for logistic regression
My data is sound; I have checked it with an XGBoost model built in Python.
I'm guessing the problem is somehow with the data format.
Currently the format is as follows:
x1,x2,x3,x4,x5,y
where x1-x5 are real numbers and y is either 0 or 1. The file extension is .csv.
Maybe the separator shouldn't be ','?

DMatrix expects a file in LibSVM format, which can easily be created with Python.
A LibSVM line looks like this:
target 0:column1 1:column2 2:column3 ... and so on
So the target comes first, and every other column (predictor) follows as index:value, with the index increasing from one column to the next.
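Since the question is in Java, the conversion can also be done there. The following is a minimal sketch that assumes the CSV layout from the question (features first, label in the last column, no header line, no quoted fields) and the 0-based indices shown above. The class and method names are hypothetical.

```java
// Hypothetical converter for illustration.
class CsvToLibSvm {

    // Turns "x1,x2,...,xn,y" into "y 0:x1 1:x2 ... (n-1):xn".
    static String toLibSvmLine(String csvLine) {
        String[] cols = csvLine.trim().split(",");
        // The label is the last CSV column and goes first in LibSVM format.
        StringBuilder sb = new StringBuilder(cols[cols.length - 1].trim());
        for (int i = 0; i < cols.length - 1; i++) {
            sb.append(' ').append(i).append(':').append(cols[i].trim());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toLibSvmLine("0.5,1.2,3,4.1,5,1"));
    }
}
```

Applying this to every line of the CSV and writing the result to a new file gives an input that matches the layout described in this answer.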

how to use Google Line Chart API for passing dynamic data in Blackberry application?

I want to pass my array data to the URL.
My code is:
String[] pointArray=(String[]) hashtable.get("point");
// All values come from the hashtable returned by my web service. I want to pass this array as the chart data for the line graph.
BrowserFieldConfig myBrowserFieldConfig = new BrowserFieldConfig();
myBrowserFieldConfig.setProperty(BrowserFieldConfig.NAVIGATION_MODE,BrowserFieldConfig.NAVIGATION_MODE_POINTER);
BrowserField browserField = new BrowserField(myBrowserFieldConfig);
add(browserField);
String url="http://chart.apis.google.com/chart?&cht=lc&chco=000000&chds=0,10&chdlp=b&chxt=x,y" +
"&chg=1.04,0,5,1&chds=0,30&chco=3072F3,ff0000,00aaaa&chls=2,4,1&chm=s,FF0000,0,-1,0|s,0000ff,1,-1,0|s,00aa00,2,-1,0" +
"&chs=480x280&chof=validate&chd=t:100,200,300,400,500,600,700&chd=t:"+point;
browserField.requestContent(url);
But it gives me this error :
The parameter 'chd=t:[Ljava.Lang.String@d297c570f' does not match the expected format.
I want to pass my array to this URL for chart data. How to solve this problem?

In this URL:
String url="http://chart.apis.google.com/chart?&cht=lc&chco=000000&chds=0,10&chdlp=b&chxt=x,y"
+"&chg=1.04,0,5,1&chds=0,30&chco=3072F3,ff0000,00aaaa&chls=2,4,1&chm=s,FF0000,0,-1,0|s,0000ff,1,-1,0|s,00aa00,2,-1,0"
+"&chs=480x280&chof=validate&chd=t:100,200,300,400,500,600,700&chd=t:"
+point;
point is converted with point.toString() and appended to the URL in this part:
"&chd=t:"+point;
Because point is an array, toString() produces the type-and-hash text shown in the error message, not the values. This is also the second time chd appears in the URL, so in effect only the last chd value is considered and the first one, chd=t:100,200,300,400,500,600,700, is ignored.
If you want to pass chd, it has to be in the format chd=t:val,val,val, where each val is a value from your data, for example chd=t:-5,30,-30,50,80,200.
chd should also appear only once in the URL.
So your code for the URL would be this:
String url="http://chart.apis.google.com/chart?&cht=lc&chco=000000&chds=0,10&chdlp=b&chxt=x,y"
+ "&chg=1.04,0,5,1&chds=0,30&chco=3072F3,ff0000,00aaaa&chls=2,4,1&chm=s,FF0000,0,-1,0|s,0000ff,1,-1,0|s,00aa00,2,-1,0"
+ "&chs=480x280&chof=validate&chd=t:"+<comma separated list of values>;
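One way to build the comma-separated list from the String array is a simple loop. This is a sketch in standard Java; the class and method names are hypothetical, and on the BlackBerry Java platform StringBuffer may be needed in place of StringBuilder.

```java
// Hypothetical helper for illustration.
class ChartData {

    // Builds "chd=t:v1,v2,..." from the array values.
    static String toChartData(String[] values) {
        StringBuilder sb = new StringBuilder("chd=t:");
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(',');   // separator between values, none at the end
            }
            sb.append(values[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] pointArray = {"10", "20", "30"};
        // Appended once, in place of the hard-coded chd parameter.
        System.out.println("&" + toChartData(pointArray));
    }
}
```

The result is then appended to the URL once, for example `"&chs=480x280&chof=validate&" + toChartData(pointArray)`.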

Cascading tutorial word count example error

I am learning Cascading. I am working through the second tutorial on its official website, the Word Count example. I copied the code from it and tried to run it, but it always gives me the following errors:
Exception in thread "main" cascading.flow.planner.PlannerException: could not build flow from assembly: [[token][com.starscriber.cascadingtest.Main.main(Main.java:44)]
unable to resolve argument selector: [{1}:'text'], with incoming: [{1}:'doc01 A rain shadow is a dry area on the lee back side of a mountainous area.']] at cascading.flow.planner.FlowPlanner.handleExceptionDuringPlanning(FlowPlanner.java:576)
at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:263)
at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:80)
at cascading.flow.FlowConnector.connect(FlowConnector.java:459)
at com.starscriber.cascadingtest.Main.main(Main.java:58)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: cascading.pipe.OperatorException: [token][com.starscriber.cascadingtest.Main.main(Main.java:44)]
unable to resolve argument selector: [{1}:'text'], with incoming: [{1}:'doc01 A rain shadow is a dry area on the lee back side of a mountainous area.']
at cascading.pipe.Operator.resolveArgumentSelector(Operator.java:345)
at cascading.pipe.Each.outgoingScopeFor(Each.java:368)
at cascading.flow.planner.ElementGraph.resolveFields(ElementGraph.java:628)
at cascading.flow.planner.ElementGraph.resolveFields(ElementGraph.java:610)
at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:248)
... 8 more
Caused by: cascading.tuple.FieldsResolverException:
could not select fields: [{1}:'text'], from: [{1}:'doc01 A rain shadow is a dry area on the lee back side of a mountainous area.']
at cascading.tuple.Fields.indexOf(Fields.java:1008)
at cascading.tuple.Fields.select(Fields.java:1064)
at cascading.pipe.Operator.resolveArgumentSelector(Operator.java:341)
... 12 more
How come? I copied exactly the same code from the official GitHub repository and did not change anything:
String docPath = args[0];
String wcPath = args[1];
Properties properties = new Properties();
AppProps.setApplicationJarClass(properties, Main.class);
HadoopFlowConnector flowConnector = new HadoopFlowConnector(properties);
// create source and sink taps
Tap docTap = new Hfs(new TextDelimited(true, "\t"), docPath);
Tap wcTap = new Hfs(new TextDelimited(true, "\t"), wcPath);
// specify a regex operation to split the "document" text lines into a token stream
Fields token = new Fields("token");
Fields text = new Fields("text");
RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
// only returns "token"
Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);
// determine the word counts
Pipe wcPipe = new Pipe("wc", docPipe);
wcPipe = new GroupBy(wcPipe, token);
wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
.setName("wc")
.addSource(docPipe, docTap)
.addTailSink(wcPipe, wcTap);
// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect(flowDef);
wcFlow.writeDOT("dot/wc.dot");
wcFlow.complete();
Where is the problem?
And this is the input file:
doc01 A rain shadow is a dry area on the lee back side of a mountainous area.
doc02 This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03 A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04 This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05 Two Women. Secrets. A Broken Land. [DVD Australia]

Check whether there is a tab between the two fields, doc_id and text, in the input file. The program expects two tab-separated fields, but in your case it is reading the whole line into one field.

As other people have already mentioned, your input needs the same headers the example expects. Instead of copying the code, try cloning the repository so that you won't get any errors related to file formatting.
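Both points can be checked before running the flow. The sketch below is plain Java with hypothetical names and is not part of Cascading. It assumes the first line must be a tab-separated header containing the field the pipe selects, and that every data line must have as many tab-separated columns as the header.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical diagnostic for illustration.
class TsvCheck {

    // Returns a description of the first problem found, or null if none.
    static String firstProblem(List<String> lines, String requiredField) {
        if (lines.isEmpty()) {
            return "file is empty";
        }
        List<String> header = Arrays.asList(lines.get(0).split("\t", -1));
        if (!header.contains(requiredField)) {
            return "header line has no field named '" + requiredField + "'";
        }
        for (int i = 1; i < lines.size(); i++) {
            int columns = lines.get(i).split("\t", -1).length;
            if (columns != header.size()) {
                return "line " + (i + 1) + " has " + columns
                        + " column(s), expected " + header.size();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // No header and no tabs, like the input shown in the question.
        List<String> bad = Arrays.asList("doc01 A rain shadow is a dry area.");
        System.out.println(firstProblem(bad, "text"));
    }
}
```

The lines could be loaded with java.nio.file.Files.readAllLines on the same path that is passed to the Tap.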

How to trim text out of an AutoCompleteTextView after it is selected by a user

Here is what I'm trying to do. I have a list of stock symbols in the string.xml file of an Android project. The list looks something like this:
ACE - ACE Limited
ABT - Abbott Laboratories
ANF - Abercrombie and Fitch Company etc...etc.
I have this list set up in the android Main as an AutoComplete array. The problem is that when the user selects one of the dropdown stocks, the box fills in the STOCK SYMBOL + the COMPANY NAME. I need to "trim" off the "company name" when the user selects it so only the stock "symbol" appears in the box. Is there a simple function or command to do this? I get confused trying to convert the array back to a string and then back again. Any help would be appreciated!

Try separating them within your XML file. For example, instead of something like
<company>ACE - ACE Limited</company>
Use
<company><symbol>ACE</symbol><name>ACE Limited</name></company>
or
<company symbol="ACE" name="ACE Limited" />
Then you can read the individual attributes or sub-tags with your XML parser.
EDIT: If this would require too much work, you could simply split the strings (assuming they all share the delimiter " - ").
String xmlData = "ACE - ACE Limited";
// Split on the first " - " only, so a name that itself contains " - " stays intact
String[] splitData = xmlData.split(" - ", 2);
String symbol = splitData[0];
// The name is everything after the first delimiter (empty if there is none)
String name = splitData.length > 1 ? splitData[1] : "";
This sets symbol to the first part of xmlData (or the whole string if " - " is not found) and name to the rest of it, including any further occurrences of " - ".
(Of course, you'll only want to do either of these once the user selects the item. I'm assuming your question is about the parsing of the String rather than the click event.)
