Flink - Create BroadcastStream from database properties - java

I have to create a BroadcastStream so that I can change properties in the database and see those properties applied to the running application in real time.
I have two problems:
1) When I read the database I need all the rows at the same time, via a ResultSet, HashMap or anything else that can hold a key-value structure, because some properties depend on other properties, so I cannot process them one at a time.
The structure of my MapStateDescriptor will be:
//String = topic name
//TopicProperties = object containing all the topic properties
MapStateDescriptor<String, TopicProperties> propertiesStateDescriptor =
        new MapStateDescriptor<String, TopicProperties>("properties",
                BasicTypeInfo.STRING_TYPE_INFO,
                TypeInformation.of(new TypeHint<TopicProperties>() {}));

BroadcastStream<Tuple2<String, TopicProperties>> propertiesBroadcastStream = env.createInput(JDBCInputFormat)
        .map(new TopicPropertiesDbMapper())
        .broadcast(propertiesStateDescriptor);
TopicPropertiesDbMapper converts the rows returned by the JDBCInputFormat into the (String, TopicProperties) structure.
The problem is that the rows are processed one at a time, but I need to process them all together, as mentioned above.
2) I need to re-read the properties and update the BroadcastStream once an hour.
Note that I have already built a working version of the above that reads the properties from a file, using:
readFile(FileInputFormat, filePath, FileProcessingMode, re-read interval in milliseconds)
It works, and in the file version I solved the two problems listed above by:
1) Setting the "unsplittable" flag of the FileInputFormat class to true;
2) Using FileProcessingMode.PROCESS_CONTINUOUSLY.
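For the database case, one possible approach (just a sketch, not tested against the actual schema) is a custom source that re-reads the whole properties table, builds a single Map so that dependent properties can be resolved together, and repeats on a timer to cover the hourly refresh. The JDBC URL, credentials, query and the TopicProperties.fromResultSet helper below are hypothetical placeholders; TopicProperties is the class from the question.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext;

public class TopicPropertiesSource extends RichSourceFunction<Map<String, TopicProperties>> {
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Map<String, TopicProperties>> ctx) throws Exception {
        while (running) {
            // Read the whole table into one Map so dependent properties can be resolved together
            Map<String, TopicProperties> snapshot = new HashMap<>();
            try (Connection conn = DriverManager.getConnection("jdbc:...", "user", "password"); // placeholders
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT * FROM topic_properties")) {          // placeholder query
                while (rs.next()) {
                    snapshot.put(rs.getString("topic"), TopicProperties.fromResultSet(rs));     // hypothetical helper
                }
            }
            ctx.collect(snapshot);          // one element per read = the complete key-value view
            Thread.sleep(60 * 60 * 1000L);  // problem 2: re-read once an hour
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}

// Broadcast the whole Map instead of individual rows, reusing the descriptor from the question:
// BroadcastStream<Map<String, TopicProperties>> propertiesBroadcastStream =
//         env.addSource(new TopicPropertiesSource()).broadcast(propertiesStateDescriptor);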

Related

Accessing Statistics/MemberInfo of ZFile (JZOS)

I am using the IBM JZOS API to access PDS members and now I need some information about the members. There is the class PdsDirectory.MemberInfo.Statistics, so that I can create a PdsDirectory, iterate over it and get the Statistics of each member (e.g. modification date, last editing user,...) like so:
PdsDirectory dir = new PdsDirectory(args[0]);
for (Iterator iter = dir.iterator(); iter.hasNext(); ) {
    PdsDirectory.MemberInfo info = (PdsDirectory.MemberInfo) iter.next();
    System.out.println(info);
}
But I need those statistics for only one single file. Is there a way, with
ZFile zFile = new ZFile("//DD:INPUT", "rb,type=record,noseek");
or by creating a reader, to access that information? Or is the only way to create the directory and find the file I need?
The only information you can get for a data set is from the catalog. You can use the JZOS CatalogSearch class to do that from Java. There is a sample on GitHub.
PDS member statistics are usually only present if you edit members using ISPF. ISPF stores statistics in the PDS directory user data field. Any application can use this field for whatever it likes, but it's usually only used by ISPF. There are no such statistics in the catalog: no last-edited userid, record count, etc. There is a creation date, a last-referenced date and lots of other useful metadata. You may not find what you are looking for, but most of the interesting stuff is in the Format 1 DSCB.
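If it is the ISPF member statistics you are after, then, as the question suspects, opening the directory and picking out the one member is the straightforward route. A rough sketch along the lines of the snippet above (the dataset and member names are hypothetical, and the getName()/getStatistics() accessors on MemberInfo are assumptions to check against your JZOS version):

PdsDirectory dir = new PdsDirectory("//'MY.PDS.DATASET'");       // hypothetical dataset name
PdsDirectory.MemberInfo wanted = null;
for (Iterator iter = dir.iterator(); iter.hasNext(); ) {
    PdsDirectory.MemberInfo info = (PdsDirectory.MemberInfo) iter.next();
    if ("MYMEMBER".equalsIgnoreCase(info.getName())) {            // hypothetical member name
        wanted = info;
        break;
    }
}
dir.close();
if (wanted != null) {
    System.out.println(wanted.getStatistics());                   // only populated if ISPF statistics exist
}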

IBM RTC API - Adding files to change sets

Basically I'm experimenting with the IBM Rational Team Concert Plain Java Client API, and I'm stuck at adding operations to change sets.
I create a new change set, retrieve the Operation factory and then I'd like to add a new file from the local machine file system (might be a new file of a project).
val changeSetHandle = workspaceConnection.createChangeSet(component, null)
val operationFactory = workspaceConnection.configurationOpFactory()
val saveOperation = operationFactory.save(...)
I do not understand how to obtain an IVersionable handle to submit to the save() method.
You can refer to this thread which shows an example of IVersionable:
// Create a new file and give it some content
IFileItem file = (IFileItem) IFileItem.ITEM_TYPE.createItem();
file.setName("file.txt");
file.setParent(projectFolder);
// Create file content.
IFileContentManager contentManager = FileSystemCore.getContentManager(repository);
IFileContent fileContent = contentManager.storeContent(
        "UTF-8",
        FileLineDelimiter.LINE_DELIMITER_LF,
        new VersionedContentManagerByteArrayInputStreamPovider(BYTE_ARRAY),
        null,
        null);
file.setContent(fileContent);
file.setContentType(IFileItem.CONTENT_TYPE_TEXT);
file.setFileTimestamp(new Date());
workspaceConnection.configurationOpFactory().save(file);
However, this is not enough:
IConfigurationOpFactory is used to update a repository workspace by adding changes to a change set.
The usage pattern is to get a workspace connection, create a bunch of save operations, then run IWorkspaceConnection#commit() on those ops.
Calling save() without committing the change drops the op onto the stack for the garbage collector to gobble up. ;)
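To complete the picture, a rough sketch of the commit step. The exact IWorkspaceConnection.commit overload used here (change-set handle, a collection of ops, a progress monitor) is an assumption; check it against your version of the Plain Java Client API.

// Continuing the snippets above: a save op only takes effect once it is committed into the change set.
workspaceConnection.commit(
        changeSetHandle,
        java.util.Collections.singletonList(workspaceConnection.configurationOpFactory().save(file)),
        null /* IProgressMonitor; pass a real monitor in production code */);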

How to process multiple files separately after SparkContext.wholeTextFiles?

I'm trying to use wholeTextFiles to read all the file names in a folder and process them one by one separately (for example, I'm trying to get the SVD vector of each data set, and there are 100 sets in total). The data are saved in .txt files, split by spaces and arranged in different lines (like a matrix).
The problem I came across is that after I use wholeTextFiles("path with all the text files"), it's really difficult to read and parse the data, and I can't use the method I used when reading only one file. That method works fine when I read just one file and it gives me the correct output. Could someone please let me know how to fix it here? Thanks!
public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("whole text files").setMaster("local[2]").set("spark.executor.memory", "1g");
    JavaSparkContext jsc = new JavaSparkContext(sparkConf);
    JavaPairRDD<String, String> fileNameContentsRDD = jsc.wholeTextFiles("/Users/peng/FMRITest/regionOutput/");
    JavaRDD<String[]> lineCounts = fileNameContentsRDD.map(new Function<Tuple2<String, String>, String[]>() {
        @Override
        public String[] call(Tuple2<String, String> fileNameContent) throws Exception {
            String content = fileNameContent._2();
            String[] sarray = content.split(" ");
            double[] values = new double[sarray.length];
            for (int i = 0; i < sarray.length; i++) {
                values[i] = Double.parseDouble(sarray[i]);
            }
            // pd is not defined in this snippet; RowMatrix(pd.rdd()) expects an RDD of Vectors
            pd.cache();
            RowMatrix mat = new RowMatrix(pd.rdd());
            SingularValueDecomposition<RowMatrix, Matrix> svd = mat.computeSVD(84, true, 1.0E-9d);
            Vector s = svd.s();
            return sarray; // the declared return type is String[]
        }
    });
}
Quoting the scaladoc of SparkContext.wholeTextFiles:
wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)] Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
In other words, wholeTextFiles might not simply be what you want.
Since by design "Small files are preferred" (see the scaladoc), you could mapPartitions or collect (with filter) to grab a subset of the files to apply the parsing to.
Once you have the files per partition in your hands, you could use Scala's Parallel Collections API and schedule Spark jobs to execute in parallel:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
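As for the parsing difficulty itself: since each file arrives as a single (path, content) record, one way is to split each file's content on newlines first and then on whitespace, keeping each file's matrix as a plain local array (an RDD cannot be created inside a map function). A rough sketch of the parsing step only, reusing fileNameContentsRDD and the Spark 1.x Function style from the question:

// Parse each whole-file string into a numeric matrix (one row per line, values separated by whitespace).
JavaRDD<double[][]> matrices = fileNameContentsRDD.map(
        new Function<Tuple2<String, String>, double[][]>() {
            @Override
            public double[][] call(Tuple2<String, String> fileNameContent) throws Exception {
                String[] lines = fileNameContent._2().split("\\r?\\n");
                double[][] matrix = new double[lines.length][];
                for (int i = 0; i < lines.length; i++) {
                    String[] tokens = lines[i].trim().split("\\s+");
                    matrix[i] = new double[tokens.length];
                    for (int j = 0; j < tokens.length; j++) {
                        matrix[i][j] = Double.parseDouble(tokens[j]);
                    }
                }
                return matrix;
            }
        });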

Save a spark RDD using mapPartition with iterator

I have some intermediate data that I need to store in HDFS and locally as well. I'm using Spark 1.6. As the intermediate form in HDFS I'm getting data in /output/testDummy/part-00000 and /output/testDummy/part-00001. I want to save these partitions locally using Java/Scala, either as /users/home/indexes/index.nt (by merging both locally) or separately as /users/home/indexes/index-0000.nt and /home/indexes/index-0001.nt.
Here is my code:
Note: testDummy is the same as test, and the output has two partitions. I want to store them locally, either separately or combined, as index.nt files. I prefer to store them separately on two data nodes. I'm using a cluster and submit the Spark job on YARN. I also added some comments about how many times and what data I'm getting. How could I do this? Any help is appreciated.
val testDummy = outputFlatMapTuples.coalesce(Constants.INITIAL_PARTITIONS).saveAsTextFile(outputFilePathForHDFS+"/testDummy")
println("testDummy done") //1 time print
def savesData(iterator: Iterator[(String)]): Iterator[(String)] = {
  println("Inside savesData") // now 4 times when coalesce(Constants.INITIAL_PARTITIONS)=2
  println("iter size" + iterator.size) // 2 735 2 735 values
  val filenamesWithExtension = outputPath + "/index.nt"
  println("filenamesWithExtension " + filenamesWithExtension.length) //4 times
  var list = List[(String)]()
  val fileWritter = new FileWriter(filenamesWithExtension, true)
  val bufferWritter = new BufferedWriter(fileWritter)
  while (iterator.hasNext) { //iterator.hasNext is false
    println("inside iterator") //0 times
    val dat = iterator.next()
    println("datadata " + iterator.next())
    bufferWritter.write(dat + "\n")
    bufferWritter.flush()
    println("index files written")
    val dataElements = dat.split(" ")
    println("dataElements") //0
    list = list.::(dataElements(0))
    list = list.::(dataElements(1))
    list = list.::(dataElements(2))
  }
  bufferWritter.close() //closing
  println("savesData method end") //4 times when coal=2
  list.iterator
}
println("before saving data into local") //1
val test = outputFlatMapTuples.coalesce(Constants.INITIAL_PARTITIONS).mapPartitions(savesData)
println("testRDD partitions "+test.getNumPartitions) //2
println("testRDD size "+test.collect().length) //0
println("after saving data into local") //1
PS: I followed this and this, but it's not exactly what I'm looking for; I tried it anyway, but I'm not getting anything in index.nt.
A couple of things:
Never call Iterator.size if you plan to use the data later. Iterators are TraversableOnce. The only way to compute an Iterator's size is to traverse all its elements, and after that there is no more data to be read.
Don't use transformations like mapPartitions for side effects. If you want to perform some kind of IO, use actions like foreach / foreachPartition. Using a transformation for this is bad practice and doesn't guarantee that a given piece of code will be executed only once.
A local path inside an action or transformation is a local path on a particular worker. If you want to write directly on the client machine, you should fetch the data first with collect or toLocalIterator (see the sketch below). It could be better, though, to write to distributed storage and fetch the data later.
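A minimal sketch of the collect-then-write option, in Java for consistency with the other snippets here (the RDD type and the output path are illustrative placeholders):

import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;

public class LocalIndexWriter {
    // Pulls every row back to the driver (the client machine) and writes one local index file.
    public static void writeIndex(JavaRDD<String> rows, String localPath) throws Exception {
        List<String> collected = rows.collect();   // executes on the driver, not on the workers
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get(localPath), StandardCharsets.UTF_8)) {
            for (String row : collected) {
                out.write(row);
                out.newLine();
            }
        }
    }
}

// e.g. LocalIndexWriter.writeIndex(someJavaRdd, "/users/home/indexes/index.nt");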
Java 7 provides a means to watch directories:
https://docs.oracle.com/javase/tutorial/essential/io/notification.html
The idea is to create a watch service, register it with the directory of interest (specifying the events you care about, such as file creation or deletion), and then watch: you will be notified of any such events and can take whatever action you want.
You will have to depend heavily on the Java HDFS API wherever applicable.
Run the program in the background, since it waits for events forever. (You can add logic to quit after you have done whatever you want.)
On the other hand, shell scripting will also help.
Be aware of the HDFS coherency model while reading files.
Hope this gives you some ideas.
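For reference, a minimal WatchService sketch of the idea described above (the watched directory is a placeholder; this watches a local directory, so for HDFS output you would still need the HDFS API as noted):

import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class DirectoryWatcher {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("/users/home/indexes");   // placeholder directory to watch
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE, StandardWatchEventKinds.ENTRY_DELETE);

        while (true) {                                  // runs forever; add your own exit condition
            WatchKey key = watcher.take();              // blocks until an event arrives
            for (WatchEvent<?> event : key.pollEvents()) {
                System.out.println(event.kind() + ": " + event.context());
                // take whatever action you want here, e.g. copy the new part file somewhere
            }
            if (!key.reset()) {
                break;                                  // directory no longer accessible
            }
        }
    }
}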

Java how to read parameters from both file and cli?

I'm writing a tool in Java and I need to provide some parameters that the user can set.
I thought it would be good to have the ability to save all parameters in a file (and just run the .jar) and to override the saved parameters through the command line.
So I need to somehow handle parameters from two sources (priority, validity, etc.). Currently I use Apache commons-cli to read the CLI-provided parameters and java.util.Properties for the file-provided properties, and then I combine these properties together (adding some defaults if needed). But I don't like the result; it seems over-complicated to me.
So the code is something like this:
Properties fromFile = new Properties();
fromFile.load(new FileInputStream("settings.properties"));

Options cliOptions = new Options();
cliOptions.addOption(shortName, longName, hasArg, description);
// add more options
CommandLineParser parser = new DefaultParser();
CommandLine fromCli = parser.parse(cliOptions, args);
// at this point I have two different objects with the properties I need,
// and I need to get every property from fromCli, check it's not empty,
// and if it is empty, get it from fromFile, etc.
So the question is: is there any library to handle properties from different sources (cli, file, defaults)? I tried googling, but did not succeed. Sorry if my googling skills are just not enough.
I'd like the code to be something like this:
import org.supertools.allPropsLib;
allPropsLib.PropsHandler handler = new allPropsLib.PropsHandler();
handler.addOptions(name, shortName, hasArg, description, defaultsTo);
handler.addSource(allPropsLib.Sources.CLI);
handler.addSource(allPropsLib.Sources.FILE);
handler.addSource(allPropsLib.Sources.DEFAULTS);
handler.setFileSource("filename");
allPropsLib.PropsContainer properties = handler.readAllProps();
// and at this point container should contain properties combined
// maybe there should be some handler function to tell the priorities,
// but I don't need to decide from where each properties should be taken
After you define the properties, load them into a java.util.Properties container regardless of the source. Then call the logic and pass it the container as a parameter.
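A minimal sketch of that suggestion using the same two libraries from the question (the option name and file name are illustrative; file values act as defaults and CLI values override them):

import java.io.FileInputStream;
import java.util.Properties;
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;

public class Settings {
    public static Properties load(String[] args) throws Exception {
        // File-provided properties become the defaults.
        Properties fromFile = new Properties();
        try (FileInputStream in = new FileInputStream("settings.properties")) {
            fromFile.load(in);
        }

        // CLI-provided options (a hypothetical example option).
        Options options = new Options();
        options.addOption("n", "name", true, "example option");
        CommandLine fromCli = new DefaultParser().parse(options, args);

        // CLI values override file values; anything not given on the CLI falls back to the file.
        Properties merged = new Properties(fromFile);
        for (Option opt : fromCli.getOptions()) {
            if (opt.getValue() != null) {
                merged.setProperty(opt.getLongOpt(), opt.getValue());
            }
        }
        return merged;
    }
}

// The rest of the tool then only ever sees the single merged Properties container.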
