To be clear, to all of the guys who rush and say that these type of posts are duplicate without even reading it: this is not a type of question in which i ask what null is and how can i manage these exceptions, here i ask why twitter's API returns to my method A null object seemingly random.
I am creating a java application that interacts with Twitter API using the library Twitter4J. I want to download a big amount of tweets, and then do the statistics on the offline data. Tweets are saved in a NoSQL database (elasticsearch).
My code was doing fine when it started printing the tweets only on the console for testing. When my program hit the limit of max tweets it slept until the reset of twitter limitation (more than 1.000.000 was printed and got zero errors), the problem came up after i started saving the tweets in my database, after some loops, i get a java.lang.NullPointerException in this exact statement if (searchTweetsRateLimit.getRemaining() == 0). Any suggestions?
public static void main(String[] args) throws TwitterException {
int totalTweets = 0;
long maxID = -1;
twitter4j.Twitter twitter = getTwitter();
RestClient restclient = RestClient.builder(
new HttpHost("localhost",9200,"http"),
new HttpHost("localhost",9201,"http")).build();
Map<String, RateLimitStatus> rateLimitStatus = twitter.getRateLimitStatus("search");
// This finds the rate limit specifically for doing the search API call we use in this program
RateLimitStatus searchTweetsRateLimit = rateLimitStatus.get("/search/tweets");
System.out.printf("You have %d calls remaining out of %d, Limit resets in %d seconds\n",
searchTweetsRateLimit.getRemaining(),
searchTweetsRateLimit.getLimit(),
searchTweetsRateLimit.getSecondsUntilReset());
int i = 10;
// This is the loop that retrieve multiple blocks of tweets from Twitter
for (int queryNumber=0;queryNumber < MAX_QUERIES; queryNumber++)
{
System.out.printf("\n\n!!! Starting loop %d\n\n", queryNumber);
// Do we need to delay because we've already hit our rate limits?
if (searchTweetsRateLimit.getRemaining() == 0)
{
// Yes we do, unfortunately ...
System.out.printf("!!! Sleeping for %d seconds due to rate limits\n", searchTweetsRateLimit.getSecondsUntilReset());
// If you sleep exactly the number of seconds, you can make your query a bit too early
// and still get an error for exceeding rate limitations
Thread.sleep((searchTweetsRateLimit.getSecondsUntilReset()+2) * 1000l);
}
Query q = new Query(SEARCH_TERM); // Search for tweets that contains this term
q.setCount(TWEETS_PER_QUERY); // How many tweets, max, to retrieve
q.resultType(null); // Get all tweets
q.setLang("en"); // English language tweets, please
Related
Im trying to copy the exrcise about halfway down the page on this link:
https://d2l.ai/chapter_recurrent-neural-networks/sequence.html
The exercise uses a sine function to create 1000 data points between -1 through 1 and use a recurrent network to approximate the function.
Below is the code I used. I'm going back to study more why this isn't working as it doesn't make much sense to me now when I was easily able to use a feed forward network to approximate this function.
//get data
ArrayList<DataSet> list = new ArrayList();
DataSet dss = DataSetFetch.getDataSet(Constants.DataTypes.math, "sine", 20, 500, 0, 0);
DataSet dsMain = dss.copy();
if (!dss.isEmpty()){
list.add(dss);
}
if (list.isEmpty()){
return;
}
//format dataset
list = DataSetFormatter.formatReccurnent(list, 0);
//get network
int history = 10;
ArrayList<LayerDescription> ldlist = new ArrayList<>();
LayerDescription l = new LayerDescription(1,history, Activation.RELU);
ldlist.add(l);
LayerDescription ll = new LayerDescription(history, 1, Activation.IDENTITY, LossFunctions.LossFunction.MSE);
ldlist.add(ll);
ListenerDescription ld = new ListenerDescription(20, true, false);
MultiLayerNetwork network = Reccurent.getLstm(ldlist, 123, WeightInit.XAVIER, new RmsProp(), ld);
//train network
final List<DataSet> lister = list.get(0).asList();
DataSetIterator iter = new ListDataSetIterator<>(lister, 50);
network.fit(iter, 50);
network.rnnClearPreviousState();
//test network
ArrayList<DataSet> resList = new ArrayList<>();
DataSet result = new DataSet();
INDArray arr = Nd4j.zeros(lister.size()+1);
INDArray holder;
if (list.size() > 1){
//test on training data
System.err.println("oops");
}else{
//test on original or scaled data
for (int i = 0; i < lister.size(); i++) {
holder = network.rnnTimeStep(lister.get(i).getFeatures());
arr.putScalar(i,holder.getFloat(0));
}
}
//add originaldata
resList.add(dsMain);
//result
result.setFeatures(dsMain.getFeatures());
result.setLabels(arr);
resList.add(result);
//display
DisplayData.plot2DScatterGraph(resList);
Can you explain the code I would need for a 1 in 10 hidden and 1 out lstm network to approximate a sine function?
Im not using any normalization as function is already -1:1 and Im using the Y input as the feature and the following Y Input as the label to train the network.
You notice i am building a class that allows for easier construction of nets and I have tried throwing many changes at the problem but I am sick of guessing.
Here are some examples of my results. Blue is data red is result
This is one of those times were you go from wondering why was this not working to how in the hell were my original results were as good as they were.
My failing was not understanding the documentation clearly and also not understanding BPTT.
With feed forward networks each iteration is stored as a row and each input as a column. An example is [dataset.size, network inputs.size]
However with recurrent input its reversed with each row being a an input and each column an iteration in time necessary to activate the state of the lstm chain of events. At minimum my input needed to be [0, networkinputs.size, dataset.size] But could also be [dataset.size, networkinputs.size, statelength.size]
In my previous example I was training the network with data in this format [dataset.size, networkinputs.size, 1]. So from my low resolution understanding the lstm network should never have worked at all but somehow produced at least something.
There may have also been some issue with converting the dataset to a list as I also changed how I feed the network but but I think the bulk of the issue was a data structure issue.
Below are my new results
Hard to tell what is going on without seeing the full code. For a start I don't see an RnnOutputLayer specified. You could take a look this which shows you how to build an RNN in DL4J.
If your RNN setup is correct this could be a tuning issue. You can find more on tuning here. Adam is probably a better choice for an updater than RMSProp. And tanh probably is a good choice for the activation for your output layer since it's range is (-1,1). Other things to check/tweak - learning rate, number of epochs, set up of your data (like are you trying to predict to far out?).
My question might cause some confusion so please see Description first. It might be helpful to identify my problem. I will add my Code later at the end of the question (Any suggestions regarding my code structure/implementation is also welcomed).
Thank you for any help in advance!
My question:
How to define multiple sinks in Flink Batch processing without having it get data from one source repeatedly?
What is the difference between createCollectionEnvironment() and getExecutionEnvironment() ? Which one should I use in local environment?
What is the use of env.execute()? My code will output the result without this sentence. if I add this sentence it will pop an Exception:
-
Exception in thread "main" java.lang.RuntimeException: No new data sinks have been defined since the last execution. The last execution refers to the latest call to 'execute()', 'count()', 'collect()', or 'print()'.
at org.apache.flink.api.java.ExecutionEnvironment.createProgramPlan(ExecutionEnvironment.java:940)
at org.apache.flink.api.java.ExecutionEnvironment.createProgramPlan(ExecutionEnvironment.java:922)
at org.apache.flink.api.java.CollectionEnvironment.execute(CollectionEnvironment.java:34)
at org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:816)
at MainClass.main(MainClass.java:114)
Description:
New to programming. Recently I need to process some data (grouping data, calculating standard deviation, etc.) using Flink Batch processing.
However I came to a point where I need to output two DataSet.
The structure was something like this
From Source(Database) -> DataSet 1 (add index using zipWithIndex())-> DataSet 2 (do some calculation while keeping index) -> DataSet 3
First I output DataSet 2, the index is e.g. from 1 to 10000;
And then I output DataSet 3 the index becomes from 10001 to 20000 although I did not change the value in any function.
My guessing is when outputting DataSet 3 instead of using the result of
previously calculated DataSet 2 it started from getting data from database again and then perform the calculation.
With the use of ZipWithIndex() function it does not only give the wrong index number but also increase the connection to db.
I guess that this is relevant to the execution environment, as when I use
ExecutionEnvironment env = ExecutionEnvironment.createCollectionsEnvironment();
will give the "wrong" index number (10001-20000)
and
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
will give the correct index number (1-10000)
The time taken and number of database connections is different and the order of print will be reversed.
OS, DB, other environment details and versions:
IntelliJ IDEA 2017.3.5 (Community Edition)
Build #IC-173.4674.33, built on March 6, 2018
JRE: 1.8.0_152-release-1024-b15 amd64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
Windows 10 10.0
My Test code(Java):
public static void main(String[] args) throws Exception {
ExecutionEnvironment env = ExecutionEnvironment.createCollectionsEnvironment();
//Table is used to calculate the standard deviation as I figured that there is no such calculation in DataSet.
BatchTableEnvironment tableEnvironment = TableEnvironment.getTableEnvironment(env);
//Get Data from a mySql database
DataSet<Row> dbData =
env.createInput(
JDBCInputFormat.buildJDBCInputFormat()
.setDrivername("com.mysql.cj.jdbc.Driver")
.setDBUrl($database_url)
.setQuery("select value from $table_name where id =33")
.setUsername("username")
.setPassword("password")
.setRowTypeInfo(new RowTypeInfo(BasicTypeInfo.DOUBLE_TYPE_INFO))
.finish()
);
// Add index for assigning group (group capacity is 5)
DataSet<Tuple2<Long, Row>> indexedData = DataSetUtils.zipWithIndex(dbData);
// Replace index(long) with group number(int), and convert Row to double at the same time
DataSet<Tuple2<Integer, Double>> rawData = indexedData.flatMap(new GroupAssigner());
//Using groupBy() to combine individual data of each group into a list, while calculating the mean and range in each group
//put them into a POJO named GroupDataClass
DataSet<GroupDataClass> groupDS = rawData.groupBy("f0").combineGroup(new GroupCombineFunction<Tuple2<Integer, Double>, GroupDataClass>() {
#Override
public void combine(Iterable<Tuple2<Integer, Double>> iterable, Collector<GroupDataClass> collector) {
Iterator<Tuple2<Integer, Double>> it = iterable.iterator();
Tuple2<Integer, Double> var1 = it.next();
int groupNum = var1.f0;
// Using max and min to calculate range, using i and sum to calculate mean
double max = var1.f1;
double min = max;
double sum = 0;
int i = 1;
// The list is to store individual value
List<Double> list = new ArrayList<>();
list.add(max);
while (it.hasNext())
{
double next = it.next().f1;
sum += next;
i++;
max = next > max ? next : max;
min = next < min ? next : min;
list.add(next);
}
//Store group number, mean, range, and 5 individual values within the group
collector.collect(new GroupDataClass(groupNum, sum / i, max - min, list));
}
});
//print because if no sink is created, Flink will not even perform the calculation.
groupDS.print();
// Get the max group number and range in each group to calculate average range
// if group number start with 1 then the maximum of group number equals to the number of group
// However, because this is the second sink, data will flow from source again, which will double the group number
DataSet<Tuple2<Integer, Double>> rangeDS = groupDS.map(new MapFunction<GroupDataClass, Tuple2<Integer, Double>>() {
#Override
public Tuple2<Integer, Double> map(GroupDataClass in) {
return new Tuple2<>(in.groupNum, in.range);
}
}).max(0).andSum(1);
// collect and print as if no sink is created, Flink will not even perform the calculation.
Tuple2<Integer, Double> rangeTuple = rangeDS.collect().get(0);
double range = rangeTuple.f1/ rangeTuple.f0;
System.out.println("range = " + range);
}
public static class GroupAssigner implements FlatMapFunction<Tuple2<Long, Row>, Tuple2<Integer, Double>> {
#Override
public void flatMap(Tuple2<Long, Row> input, Collector<Tuple2<Integer, Double>> out) {
// index 1-5 will be assigned to group 1, index 6-10 will be assigned to group 2, etc.
int n = new Long(input.f0 / 5).intValue() + 1;
out.collect(new Tuple2<>(n, (Double) input.f1.getField(0)));
}
}
It's fine to connect a source to multiple sink, the source gets executed only once and records get broadcasted to the multiple sinks. See this question Can Flink write results into multiple files (like Hadoop's MultipleOutputFormat)?
getExecutionEnvironment is the right way to get the environment when you want to run your job. createCollectionEnvironment is a good way to play around and test. See the documentation
The exception error message is very clear: if you call print or collect your data flow gets executed. So you have two choices:
Either you call print/collect at the end of your data flow and it gets executed and printed. That's good for testing stuff. Bear in mind you can only call collect/print once per data flow, otherwise it gets executed many time while it's not completely defined
Either you add a sink at the end of your data flow and call env.execute(). That's what you want to do once your flow is in a more mature shape.
I am using Twitter4j to build a client to fetch tweets for the input search term. I am also trying to provide the facility to the user to enter the number of tweets he wants in the result.
I know that we can set the number of tweets to be returned per page with the Query's setCount() method:
Query q = new Query(searchTerm);
q.setCount(maxTweets);
But if I am giving the value 1 as maxTweets, it returns 2 tweets.
Update: After further research I observed that it is returning 1 extra tweet per search. So I am giving 1 as maxTweets value it is returning 2 tweets. If I am giving 2 as maxTweets value it is returning 3 tweets and so on.
I am not sure where I am wrong but please let me know if there is a way by which I can get the fixed number of tweets using twitter4j.
Any guidance will be helpful.
When you write
Query q = new Query(searchTerm);
Think of it as one tabled page which contains an amount of result matching your query. But there might be more multiple pages.
When you set
q.setCount(maxTweets);
it will bring you maxTweets amount of tweets per page. In your case, 2, because there were two pages matching your query and you selected one tweet per page.
What you can do, try to handle it with a do - while loop.
Query q = new Query(searchTerm);
QueryResult result;
int tempUSerInput = 0; //keep a temp value
boolean flag = false;
do {
result = twitter.search(query);
List<Status> tweets = result.getTweets();
tempUSerInput = tempUSerInput + tweets.size();
if(tempUSerInput >= realyourUserInput) // you have already matched the number
flag = true; //set the flag
}
while ((query = result.nextQuery()) != null && !flag);
// Here Take only realyourUserInput number
// as you might have taken more than required
List<Status> finaltweets = new ArrayList();
for(int i=0; i<realyourUserInput; i++)
finaltweets.add( tweets.get(i) ); //add them to your final list
I have an application which accesses about 2 million tweets from a MySQL database. Specifically one of the fields holds a tweet of text (with maximum length of 140 characters). I am splitting every tweet into an ngram of words ngrams where 1 <= n <= 3. For example, consider the sentence:
I am a boring sentence.
The corresponding nGrams are:
I
I am
I am a
am
am a
am a boring
a
a boring
a boring sentence
boring
boring sentence
sentence
With about 2 million tweets, I am generating a lot of data. Regardless, I am surprised to get a heap error from Java:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2145)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1922)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3423)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:483)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3118)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2288)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2709)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2728)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2678)
at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1612)
at twittertest.NGramFrequencyCounter.FreqCount(NGramFrequencyCounter.java:49)
at twittertest.Global.main(Global.java:40)
Here is the problem code statement (line 49) as given by the above output from Netbeans:
results = stmt.executeQuery("select * from tweets");
So, if I am running out of memory it must be that it is trying to return all the results at once and then storing them in memory. What is the best way to solve this problem? Specifically I have the following questions:
How can I process pieces of results rather than the whole set?
How would I increase the heap size? (If this is possible)
Feel free to include any suggestions, and let me know if you need more information.
EDIT
Instead of select * from tweets I partitioned the table into equally sized subsets of about 10% of the total size. Then I tried running the program. It looked like it was working fine but it eventually gave me the same heap error. This is strange to me because I have ran the same program in the past, successfully with 610,000 tweets. Now I have about 2,000,000 tweets or roughly 3 times as much more data. So if I split the data into thirds it should work, but I went further and split the subsets into size 10%.
Is some memory not being freed? Here is the rest of the code:
results = stmt.executeQuery("select COUNT(*) from tweets");
int num_tweets = 0;
if(results.next())
{
num_tweets = results.getInt(1);
}
int num_intervals = 10; //split into equally sized subets
int interval_size = num_tweets/num_intervals;
for(int i = 0; i < num_intervals-1; i++) //process 10% of the data at a time
{
results = stmt.executeQuery( String.format("select * from tweets limit %s, %s", i*interval_size, (i+1)*interval_size));
while(results.next()) //for each row in the tweets database
{
tweetID = results.getLong("tweet_id");
curTweet = results.getString("tweet");
int colPos = curTweet.indexOf(":");
curTweet = curTweet.substring(colPos + 1); //trim off the RT and retweeted
if(curTweet != null)
{
curTweet = removeStopWords(curTweet);
}
if(curTweet == null)
{
continue;
}
reader = new StringReader(curTweet);
tokenizer = new StandardTokenizer(Version.LUCENE_36, reader);
//tokenizer = new StandardFilter(Version.LUCENE_36, tokenizer);
//Set stopSet = StopFilter.makeStopSet(Version.LUCENE_36, stopWords, true);
//tokenizer = new StopFilter(Version.LUCENE_36, tokenizer, stopSet);
tokenizer = new ShingleFilter(tokenizer, 2, 3);
charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);
while(tokenizer.incrementToken()) //insert each nGram from each tweet into the DB
{
insertNGram.setInt(1, nGramID++);
insertNGram.setString(2, charTermAttribute.toString().toString());
insertNGram.setLong(3, tweetID);
insertNGram.executeUpdate();
}
}
}
Don't get all rows from table. Try to select partial
data based on your requirement by setting limits to query. You are using MySQL database your query would be select * from tweets limit 0,10. Here 0 is starting row id and 10 represents 10 rows from start.
You can always increase the heap size available to your JVM using the -Xmx argument. You should read up on all the knobs available to you (e.g. perm gen size). Google for other options or read this SO answer.
You probably can't do this kind of problem with a 32-bit machine. You'll want 64 bits and lots of RAM.
Another option would be to treat it as a map-reduce problem. Solve it on a cluster using Hadoop and Mahout.
Have you considered streaming the result set? Halfway down the page is a section on result set, and it addresses your problem (I think?) Write the n grams to a file, then process the next row? Or, am I misunderstanding your problem?
http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html
I'm working on an app using Twitter4j.
I'm trying to import tweets with a certain hashtag (ex: weather)
Then, I want to categorize the Tweets with that hashtag, by searching for keywords.
For example:
Some of the Tweets imported could be
- OMG, I hate this rain #weather
- This sunshine makes me feel so happy #weather
- Such strange #weather! One moment it rains, the next the sun shines. Confusing!
- Rain makes me sad #weather
- I love the sunshine! #weather
Then, I want to categorize these tweets as:
- hate, Confusing, sad,... are negative
- happy, love,... are positive
PositiveTweets would be:
- This sunshine makes me feel so happy #weather
- I love the sunshine! #weather
NegativeTweets would be:
- OMG, I hate this rain #weather
- Such strange #weather! One moment it rains, the next the sun shines. Confusing!
- Rain makes me sad #weather
So, NegativeTweets=3 and the PositiveTweets=2
Can anyone help me with this or point me towards something similar?
You can query for the #weather hashtag, then separate the tweets into separate lists based on whether or not they contain any of the keywords you designate for either good or bad weather.
public static void main(String[] args) throws TwitterException {
List<Tweet> goodWeather = new ArrayList<Tweet>();
List<Tweet> badWeather = new ArrayList<Tweet>();
Twitter twitter = new TwitterFactory().getInstance();
System.out.println("Fetching Weather Data...");
// get the 1000 most recent tweets tagged #weather
for (int page = 1; page <= 10; page++) {
Query query = new Query("#weather");
query.setRpp(100); // 100 results per page
query.setPage(page);
QueryResult qr = twitter.search(query);
List<Tweet> qrTweets = qr.getTweets();
// break out if there are no more tweets
if(qrTweets.size() == 0) break;
// separate tweets into good and bad bins
for(Tweet t : qrTweets) {
if (t.getText().toLowerCase().contains("happy") ||
t.getText().toLowerCase().contains("love")) {
goodWeather.add(t);
}
if (t.getText().toLowerCase().contains("sad") ||
t.getText().toLowerCase().contains("hate")) {
badWeather.add(t);
}
}
}
System.out.println("Good Weather: " + goodWeather.size());
for (Tweet good : goodWeather) {
System.out.println(good.getCreatedAt() + ": " + good.getText());
}
System.out.println("\nBad Weather: " + badWeather.size());
for (Tweet bad : badWeather) {
System.out.println(bad.getCreatedAt() + ": " + bad.getText());
}
}
I think what you want to do is Sentiment Analysis to see how many of the tweets you retrieve are positive and how many are negative, right? A good start would be to look into SentiWordNet it has a lot of words already stored with their polarities of how positive or negative a word is, it's only a text file containing all of this data. You would need to parse it and store the data in some data structure. Once you have all this done then you simply need to scan the tweets and match the words and get the scores and then tag the tweets. It's not as hard as it sounds, get searching on SentiWordNet first. I believe this is the better way to go, as it will help you more in the long run :)
hope this helped