I am getting a text file with contents like the one below. I want to retrieve the data present between start_word=Tax% and end_word="ErrorMessage".
ParsedText:
Tax%
63 2 .90 0.00 D INTENS SH 80ML(48) 9.00% 9.00%
23 34013090 0.0 DS PURE WHIT 1 COG (24) 9.00% 9.00%
"ErrorMessage":"","ErrorDetails":""
After retrieving, the output would be:
63 2 .90 0.00 D INTENS SH 80ML(48) 9.00% 9.00%
23 34013090 0.0 DS PURE WHIT 1 COG (24) 9.00% 9.00%
Please help.
I am using Camel to read the text; then I want to retrieve the data to process it further as per my requirement.
import org.apache.camel.Exchange;
import org.apache.camel.Processor;

public class DataExtractor implements Processor {

    @Override
    public void process(Exchange exchange) throws Exception {
        String textContent = (String) exchange.getIn().getBody();
        System.out.println("TextContents >>>>>>" + textContent);
    }
}
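For context, a minimal sketch of how such a processor might be wired into a Camel route (the file and log endpoint URIs here are hypothetical, not from the original post):

import org.apache.camel.builder.RouteBuilder;

public class DataExtractorRoute extends RouteBuilder {
    @Override
    public void configure() throws Exception {
        // Poll text files from a directory (hypothetical path) and pass
        // each file's contents through the DataExtractor processor.
        from("file:data/inbox?noop=true")
            .process(new DataExtractor())
            .to("log:extracted");
    }
}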
In the text content I am getting the content that I have given above. I need help retrieving the data in Java.
Below is the code snippet to extract the desired output:
// Split the content into lines, scan for the start/end markers,
// and collect everything in between.
String[] strArr = textContent.split("\\r?\\n");
StringBuilder stringBuilder = new StringBuilder();
boolean appendLines = false;
for (String strLine : strArr) {
    if (strLine.contains("Tax%")) {
        appendLines = true;   // start collecting from the next line
        continue;
    }
    if (strLine.contains("\"ErrorMessage\"")) {
        break;                // stop at the end marker
    }
    if (appendLines) {
        stringBuilder.append(strLine);
        stringBuilder.append(System.getProperty("line.separator"));
    }
}
textContent = stringBuilder.toString();
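If this snippet runs inside the process method shown earlier, the extracted text can be set back on the exchange so the rest of the route sees only the table rows (a one-line sketch):

// Replace the message body with the extracted lines.
exchange.getIn().setBody(textContent);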
I would like to write a simple Spark SQL program that reads a file called u.data containing movie ratings, creates a Dataset of Rows, and then prints the first rows of the Dataset.
My premise was to read the file into a JavaRDD and map the RDD according to a ratingsObject (the object has two parameters, movieID and rating). I just want to print the first rows of this Dataset.
I'm using the Java language and Spark SQL.
public static void main(String[] args) {
    App obj = new App();
    SparkSession spark = SparkSession.builder().appName("Java Spark SQL basic example").getOrCreate();
    Map<Integer, String> movieNames = obj.loadMovieNames();
    JavaRDD<String> lines = spark.read().textFile("hdfs:///ml-100k/u.data").javaRDD();
    JavaRDD<MovieRatings> movies = lines.map(line -> {
        String[] parts = line.split(" ");
        MovieRatings ratingsObject = new MovieRatings();
        ratingsObject.setMovieID(Integer.parseInt(parts[1].trim()));
        ratingsObject.setRating(Integer.parseInt(parts[2].trim()));
        return ratingsObject;
    });
    Dataset<Row> movieDataset = spark.createDataFrame(movies, MovieRatings.class);
    Encoder<Integer> intEncoder = Encoders.INT();
    Dataset<Integer> HUE = movieDataset.map(
        new MapFunction<Row, Integer>() {
            private static final long serialVersionUID = -5982149277350252630L;

            @Override
            public Integer call(Row row) throws Exception {
                return row.getInt(0);
            }
        }, intEncoder
    );
    HUE.show();
    // Stop the session
    spark.stop();
}
I've tried a lot of possible solutions that I found, but all of them produced the same error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, localhost, executor 1): java.lang.ArrayIndexOutOfBoundsException: 1
at com.ericsson.SparkMovieRatings.App.lambda$main$1e634467$1(App.java:63)
at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
And here is the sample of the u.data file:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013
62 257 2 879372434
286 1014 5 879781125
200 222 5 876042340
210 40 3 891035994
224 29 3 888104457
303 785 3 879485318
122 387 5 879270459
194 274 2 879539794
Where the first column represents the UserID, the second the MovieID, the third the rating, and the last one the timestamp.
As mentioned before, your data are not space-separated (the u.data file is tab-delimited), which is why split(" ") produces too few tokens and the index lookup fails.
I'll show you two possible solutions: the first based on RDDs and the second based on Spark SQL, which is, in general, the better choice in terms of performance.
RDD (you should use built-in types to reduce the overhead):
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkDriver {
    public static void main(String[] args) {
        // Create a configuration object and set the name of the application
        SparkConf conf = new SparkConf().setAppName("application_name");
        // Create a Spark context object
        JavaSparkContext context = new JavaSparkContext(conf);
        // Create the final RDD (suppose you have a text file);
        // u.data columns are: userID, movieID, rating, timestamp
        JavaPairRDD<Integer, Integer> movieRatingRDD = context
            .textFile("u.data.txt")
            .mapToPair(line -> {
                String[] tokens = line.split("\\s+");
                int movieID = Integer.parseInt(tokens[1]);
                int rating = Integer.parseInt(tokens[2]);
                return new Tuple2<>(movieID, rating);
            });
        // Keep in mind that take returns the first n elements
        // in the order of the file.
        List<Tuple2<Integer, Integer>> list = new ArrayList<>(movieRatingRDD.take(10));
        System.out.println("MovieID\tRating");
        for (Tuple2<Integer, Integer> tuple : list) {
            System.out.println(tuple._1 + "\t" + tuple._2);
        }
        context.close();
    }
}
SQL
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkDriver {
    public static void main(String[] args) {
        // Create the Spark session
        SparkSession session = SparkSession.builder().appName("Spark app sql version").getOrCreate();
        // u.data is tab-separated: userID, movieID, rating, timestamp
        Dataset<MovieRatings> ratings = session.read()
            .format("csv")
            .option("header", false)
            .option("inferSchema", true)
            .option("delimiter", "\t")
            .load("u.data.txt")
            .map((MapFunction<Row, MovieRatings>) row -> {
                int movieID = row.getInt(1);
                int rating = row.getInt(2);
                return new MovieRatings(movieID, rating);
            }, Encoders.bean(MovieRatings.class));
        // Print the first rows
        ratings.show(10);
        // Stop the session
        session.stop();
    }
}
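The MovieRatings class itself is not shown in the question; for Encoders.bean (and createDataFrame) to work it must be a JavaBean, so an assumed minimal shape would be:

// Hypothetical bean; the real MovieRatings class is not shown in the question.
public class MovieRatings implements java.io.Serializable {
    private int movieID;
    private int rating;

    public MovieRatings() { }  // no-arg constructor required by Encoders.bean

    public MovieRatings(int movieID, int rating) {
        this.movieID = movieID;
        this.rating = rating;
    }

    public int getMovieID() { return movieID; }
    public void setMovieID(int movieID) { this.movieID = movieID; }
    public int getRating() { return rating; }
    public void setRating(int rating) { this.rating = rating; }
}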
I am trying to read mainframe files, and all of them work except the COMP-3 file. The program below gives strange values: it is not able to read the salary value (a packed decimal), and it prints values like 2020202020.20. I don't know what I am missing. Please help me find it.
Program:
public final class Readcopybook {

    private String dataFile = "EMPFILE.txt";
    private String copybookName = "EMPCOPYBOOK.txt";

    public Readcopybook() {
        super();
        AbstractLine line;
        try {
            ICobolIOBuilder iob = JRecordInterface1.COBOL.newIOBuilder(copybookName)
                    .setFileOrganization(Constants.IO_BINARY_IBM_4680)
                    .setSplitCopybook(CopybookLoader.SPLIT_NONE);
            AbstractLineReader reader = iob.newReader(dataFile);
            while ((line = reader.read()) != null) {
                System.out.println(line.getFieldValue("EMP-NO").asString() + " "
                        + line.getFieldValue("EMP-NAME").asString() + " "
                        + line.getFieldValue("EMP-ADDRESS").asString() + " "
                        + line.getFieldValue("EMP-SALARY").asString() + " "
                        + line.getFieldValue("EMP-ZIPCODE").asString());
            }
            reader.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        new Readcopybook();
    }
}
EMPCOPYBOOK:
001700 01 EMP-RECORD.
001900 10 EMP-NO PIC 9(10).
002000 10 EMP-NAME PIC X(30).
002100 10 EMP-ADDRESS PIC X(30).
002200 10 EMP-SALARY PIC S9(8)V9(2) COMP-3.
002200 10 EMP-ZIPCODE PIC 9(4).
EMPFILE:
0000001001suneel kumar r bangalore e¡5671
0000001002JOSEPH WHITE FIELD rrn4500
Output:
1001 suneel kumar r bangalore 20200165a10 5671
2020202020.20
2020202020.20
2020202020.20
2020202020.20
2020202020.20
2020202020.20
2020202020.20
2020202020.20
0.00
1002 JOSEPH WHITE FIELD 202072726e0 4500
One problem is that you have done an EBCDIC-to-ASCII conversion on the file.
The 2020... is a dead giveaway: x'20' is the ASCII space character.
This answer deals with the problems caused by doing an EBCDIC-to-ASCII conversion.
You need to do a binary transfer from the mainframe and read the file as EBCDIC. You will need to check the RECFM on the mainframe. If the RECFM is:
FB - no problems, just transfer
VB - either convert to FB on the mainframe or include the RDW (Record Descriptor Word) option in the transfer.
Other - convert to FB/VB on the mainframe
Updated Java code:
int fileOrg = Constants.IO_FIXED_LENGTH_RECORDS; // or Constants.IO_VB

ICobolIOBuilder iob = JRecordInterface1.COBOL
        .newIOBuilder(copybookName)
        .setFileOrganization(fileOrg)
        .setFont("Cp037")   // US EBCDIC code page
        .setSplitCopybook(CopybookLoader.SPLIT_NONE);
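With that builder, the rest of the original read loop can stay as it was; a minimal sketch of the corrected wiring, reusing the field names and dataFile variable from the question:

AbstractLineReader reader = iob.newReader(dataFile);  // dataFile transferred in binary
AbstractLine line;
while ((line = reader.read()) != null) {
    // COMP-3 salary now decodes correctly from the EBCDIC bytes
    System.out.println(line.getFieldValue("EMP-SALARY").asString());
}
reader.close();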
Note: IO_BINARY_IBM_4680 is for IBM 4690 cash registers.
There is a wiki entry here, or see this question:
How do you generate java~jrecord code for a Cobol copybook
Hi, I am running an application which reads records from HBase and writes them into text files.
I have used a combiner in my application, and a custom partitioner as well.
I have used 41 reducers in my application because I need to create 40 reducer output files that satisfy my condition in the custom partitioner.
Everything works fine, but when I use the combiner in my application it creates map output files per region, or per mapper.
For example, I have 40 regions in my application, so 40 mappers are initiated, and they create 40 map-output files.
But the reducer is not able to combine all the map-outputs and generate the final reducer output files (which should be 40 reducer output files).
The data in the files are correct, but the number of files has increased.
Any idea how I can get only the reducer output files?
job.setCombinerClass(CommonReducer.class); // combiner class
job.setReducerClass(CommonReducer.class);  // reducer class
Below are my job details:
Submitted: Mon Apr 10 09:42:55 CDT 2017
Started: Mon Apr 10 09:43:03 CDT 2017
Finished: Mon Apr 10 10:11:20 CDT 2017
Elapsed: 28mins, 17sec
Diagnostics:
Average Map Time 6mins, 13sec
Average Shuffle Time 17mins, 56sec
Average Merge Time 0sec
Average Reduce Time 0sec
Here is my combiner logic:
import java.io.IOException;

import org.apache.log4j.Logger;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class CommonCombiner extends Reducer<NullWritable, Text, NullWritable, Text> {

    private Logger logger = Logger.getLogger(CommonCombiner.class);
    private MultipleOutputs<NullWritable, Text> multipleOutputs;
    String strName = "";
    private static final String DATA_SEPERATOR = "\\|\\!\\|";

    public void setup(Context context) {
        logger.info("Inside Combiner.");
        multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    public void reduce(NullWritable Key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            final String valueStr = value.toString();
            StringBuilder sb = new StringBuilder();
            if ("".equals(strName) && strName.length() == 0) {
                String[] strArrFileName = valueStr.split(DATA_SEPERATOR);
                String strFullFileName[] = strArrFileName[1].split("\\|\\^\\|");
                strName = strFullFileName[strFullFileName.length - 1];
                String strArrvalueStr[] = valueStr.split(DATA_SEPERATOR);
                if (!strArrvalueStr[0].contains(HbaseBulkLoadMapperConstants.FF_ACTION)) {
                    sb.append(strArrvalueStr[0] + "|!|");
                }
                multipleOutputs.write(NullWritable.get(), new Text(sb.toString()), strName);
                context.getCounter(Counters.FILE_DATA_COUNTER).increment(1);
            }
        }
    }

    public void cleanup(Context context) throws IOException, InterruptedException {
        multipleOutputs.close();
    }
}
I replaced
multipleOutputs.write(NullWritable.get(), new Text(sb.toString()), strName);
with
context.write()
and I got the correct output. A combiner runs on intermediate map output, so its results have to be handed back to the framework through context.write() to be shuffled on to the reducers; writing through MultipleOutputs emits extra files directly from the map side, which is why the file count increased.
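A minimal sketch of the corrected reduce method, assuming the rest of the class stays as above (the filtering logic is elided and unchanged; only the output call differs):

@Override
public void reduce(NullWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    for (Text value : values) {
        // ... same filtering/assembly logic as before ...
        // Hand the combined record back to the framework so it is
        // shuffled on to the reducers instead of written to a file here.
        context.write(NullWritable.get(), value);
    }
}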
I have multiple text files that contain information about different programming languages' popularity in different countries, based on Google searches. I have one text file for each year from 2004 to 2015. I also have a text file that breaks this down into each week (called iot.txt), but this file does not include the country.
Example data from 2004.txt:
Region java c++ c# python JavaScript
Argentina 13 14 10 0 17
Australia 22 20 22 64 26
Austria 23 21 19 31 21
Belgium 20 14 17 34 25
Bolivia 25 0 0 0 0
etc
example from iot.txt:
Week java c++ c# python JavaScript
2004-01-04 - 2004-01-10 88 23 12 8 34
2004-01-11 - 2004-01-17 88 25 12 8 36
2004-01-18 - 2004-01-24 91 24 12 8 36
2004-01-25 - 2004-01-31 88 26 11 7 36
2004-02-01 - 2004-02-07 93 26 12 7 37
My problem is that I am trying to write code that will output the number of countries that have exhibited 0 interest in Python.
This is my current code for reading the text files, but I'm not sure of the best way to count the regions that have 0 interest in Python across all the years 2004-2015. At first I thought the best way would be to create a list from all the text files (not including iot.txt) and then search it for any entries that have 0 interest in Python, but I have no idea how to do that.
Can anyone suggest a way to do this?
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.*;

public class Starter {

    public static void main(String[] args) throws Exception {
        BufferedReader fh =
                new BufferedReader(new FileReader("iot.txt"));
        // First line contains the language names
        String s = fh.readLine();
        List<String> langs =
                new ArrayList<>(Arrays.asList(s.split("\t")));
        langs.remove(0); // Throw away the first word - "week"
        Map<String, HashMap<String, Integer>> iot = new TreeMap<>();
        while ((s = fh.readLine()) != null) {
            String[] wrds = s.split("\t");
            HashMap<String, Integer> interest = new HashMap<>();
            for (int i = 0; i < langs.size(); i++)
                interest.put(langs.get(i), Integer.parseInt(wrds[i + 1]));
            iot.put(wrds[0], interest);
        }
        fh.close();

        HashMap<Integer, HashMap<String, HashMap<String, Integer>>>
                regionsByYear = new HashMap<>();
        for (int i = 2004; i < 2016; i++) {
            BufferedReader fh1 =
                    new BufferedReader(new FileReader(i + ".txt"));
            String s1 = fh1.readLine(); // Throw away the first line
            HashMap<String, HashMap<String, Integer>> year = new HashMap<>();
            while ((s1 = fh1.readLine()) != null) {
                String[] wrds = s1.split("\t");
                HashMap<String, Integer> langMap = new HashMap<>();
                for (int j = 1; j < wrds.length; j++) {
                    langMap.put(langs.get(j - 1), Integer.parseInt(wrds[j]));
                }
                year.put(wrds[0], langMap);
            }
            regionsByYear.put(i, year);
            fh1.close();
        }
    }
}
Create a Map<String, Integer> using a HashMap, and each time you find a new country while scanning the incoming data, add it to the map as country -> 0. Each time you find a usage of python, increment the value.
At the end, loop through the entrySet of the map, and for each entry whose getValue() is zero, output its getKey().
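A minimal sketch of that approach, assuming the regionsByYear map from the question has already been populated and that the language key is "python", exactly as in the file headers:

// Total python interest per region across all years.
Map<String, Integer> pythonByCountry = new HashMap<>();
for (HashMap<String, HashMap<String, Integer>> year : regionsByYear.values()) {
    for (Map.Entry<String, HashMap<String, Integer>> region : year.entrySet()) {
        int interest = region.getValue().getOrDefault("python", 0);
        pythonByCountry.merge(region.getKey(), interest, Integer::sum);
    }
}
// Regions whose total is still zero never showed any interest in python.
for (Map.Entry<String, Integer> e : pythonByCountry.entrySet()) {
    if (e.getValue() == 0) {
        System.out.println(e.getKey());
    }
}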
package chapterreader;

import java.util.Scanner;
import java.io.File;

public class ChapterReader {

    public static void main(String[] args) throws Exception {
        Chapter myChapter = new Chapter();
        File chapterFile = new File("toc.txt");
        Scanner chapterScanner;
        // Check to see if the file exists to read the data
        if (chapterFile.exists()) {
            System.out.printf("%7Chapter %14Title %69Page %80Length");
            chapterScanner = new Scanner(chapterFile);
            // Set delimiter as ';' & 'new line'
            chapterScanner.useDelimiter(";|\r\n");
            while (chapterScanner.hasNext()) {
                // Reads all the data from the file and sets it on the Chapter object
                myChapter.setChapterNumber(chapterScanner.nextInt());
                myChapter.setChapterTitle(chapterScanner.next());
                myChapter.setStartingPageNumber(chapterScanner.nextInt());
                myChapter.setEndingPageNumber(chapterScanner.nextInt());
                displayProduct(myChapter);
            }
            chapterScanner.close();
        } else {
            System.out.println("Missing Chapter File");
        }
    }

    // Display the chapter information in the correct format
    public static void displayProduct(Chapter reportProduct) {
        System.out.printf("%7d", reportProduct.getChapterNumber());
        System.out.printf("%-60s", reportProduct.getChapterTitle());
        System.out.printf("%-6d", reportProduct.getStartingPageNumber());
        System.out.printf("%-7d%n", reportProduct.getEndingPageNumber());
    }
}
But then I got an error:

run:
Exception in thread "main" java.util.UnknownFormatConversionException: Conversion = 'ti'
    at java.util.Formatter$FormatSpecifier.checkDateTime(Formatter.java:2915)
    at java.util.Formatter$FormatSpecifier.<init>(Formatter.java:2678)
    at java.util.Formatter.parse(Formatter.java:2528)
    at java.util.Formatter.format(Formatter.java:2469)
    at java.io.PrintStream.format(PrintStream.java:970)
    at java.io.PrintStream.printf(PrintStream.java:871)
    at chapterreader.ChapterReader.main(ChapterReader.java:17)
Java Result: 1
BUILD SUCCESSFUL (total time: 0 seconds)
What is wrong here? Please help!
Your statement below is not a valid format string; that is why it throws UnknownFormatConversionException:
System.out.printf("%7Chapter %14Title %69Page %80Length");
If you want to print those words, then use the following instead:
System.out.printf("%7s %14s %69s %80s", "Chapter", "Title", "Page", "Length");
Instead of
System.out.printf("%7Chapter %14Title %69Page %80Length");
I think you wanted something like
System.out.printf("%7s %14s %69s %80s%n", "Chapter", "Title", "Page", "Length");
Your message is telling you that your format string isn't valid (%14Ti). The Formatter#syntax javadoc says (in part):
't', 'T' date/time Prefix for date and time conversion characters. See Date/Time Conversions.
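Note that the header widths should also line up with the widths used later in displayProduct; the following is a suggestion (the exact widths are assumptions, chosen to match %7d, %-60s, %-6d, and %-7d):

// Header aligned with displayProduct's column widths.
System.out.printf("%7s%-60s%-6s%-7s%n", "Chapter", "Title", "Page", "Length");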