I am trying to get all the follower IDs from a Twitter account with about 150,000 followers. I later want to map their locations, but first I need all those IDs.
At the moment I am using this code:
long lCursorIDs = -1;
long[] fArray = new long[100];
do
{
    fArray = twitter.getFollowersIDs(name, lCursorIDs).getIDs();
} while (twitter.getFollowersIDs(name, lCursorIDs).hasNext());
try
{
    PrintWriter pr = new PrintWriter(filenameOutput);
    for (int i = 0; i < fArray.length; i++)
    {
        pr.println(fArray[i]);
    }
    pr.close();
    System.out.println("Follower IDs collected and saved to file: " + filenameOutput);
}
catch (Exception e)
{
    e.printStackTrace();
    System.out.println("No such file exists.");
}
This works for users with fewer followers, but with that many it always returns an error message: rate limit exceeded.
I was thinking about getting only a certain number of follower IDs per hour, but I am not sure how to do that without starting over from the first follower every hour. I am also not sure how many followers I can get with one request. Maybe it is 100, as with the "lookupUser" method, but I am not sure. Any ideas/suggestions?
EDIT: I just tried to get the follower IDs of an account with 2,700 followers and it stored them correctly in the text file. It also only "cost" one request. Then I changed the account name to an account with 15,500 followers and it crashes again with a rate limit exceeded message. I don't get why, since it's only roughly 6 times as many followers, but all the remaining requests get spent. Any ideas on what I'm doing wrong?
The answer:
int numberOfFollowers = user.getFollowersCount();
// CREATE ARRAYS FOR FOLLOWER IDS
long cursor = -1;
long[] fArray = new long[numberOfFollowers];
long[] local;
IDs ids;
int j = 0;
int x = 5000;
int durchgang = 1;                           // "Durchgang" = pass/round
int d_anzahl = 1 + numberOfFollowers / 5000; // number of passes needed
// STORE FOLLOWER IDS IN ARRAYS
do
{
    ids = twitter.getFollowersIDs(name, cursor);   // one API request per page of up to 5000 IDs
    local = ids.getIDs();                          // reuse the response instead of requesting the same page again
    System.out.println("Durchgang: " + durchgang + " / " + d_anzahl);
    System.arraycopy(local, 0, fArray, j * x, local.length);
    j++;
    durchgang++;
    cursor = ids.getNextCursor();
} while (ids.hasNext());
This builds an array with all follower IDs of any Twitter user. It calculates the number of loops needed to get all follower IDs and copies each batch of up to 5,000 IDs into the final array, which holds all IDs at the end.
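If the account is large enough that even these paged calls exhaust the 15-minute window, one option is to sleep until the window resets and keep the cursor so a restart can resume instead of beginning again. This is only a sketch, not part of the answer above; it assumes Twitter4J (where the IDs response exposes its RateLimitStatus), the twitter and name values from the code above, and twitter4j.* plus java.util.* imports:
static List<Long> collectFollowerIds(Twitter twitter, String name)
        throws TwitterException, InterruptedException {
    List<Long> allIds = new ArrayList<>();
    long cursor = -1L;                      // -1 starts at the first page
    IDs ids;
    do {
        ids = twitter.getFollowersIDs(name, cursor);
        for (long id : ids.getIDs()) {
            allIds.add(id);
        }
        cursor = ids.getNextCursor();       // persist this value to a file if you want to resume after a crash
        RateLimitStatus status = ids.getRateLimitStatus();
        if (status != null && status.getRemaining() == 0 && ids.hasNext()) {
            // wait out the current rate-limit window before requesting the next page
            Thread.sleep((status.getSecondsUntilReset() + 5) * 1000L);
        }
    } while (ids.hasNext());
    return allIds;
}
Writing the cursor to disk between pages would let a restarted run continue where it left off rather than spending requests on pages it already has.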
Related
We created a program to make the use of the database easier in other programs, so the code I'm showing is used in multiple other programs.
One of those programs gets about 10,000 records from one of our clients and has to check whether these are already in our database. If not, we insert them into the database (they can also change and then have to be updated).
To make this easy, we load all the entries from our whole table (at the moment 120,000), create an object for every entry we get, and put all of them into a HashMap.
Loading the whole table this way takes around 5 minutes. We also sometimes have to restart the program because we run into a GC overhead error, since we work on limited hardware. Do you have an idea of how we can improve the performance?
Here is the code to load all entries (we have a global limit of 10,000 entries per query, so we use a loop):
public Map<String, IMasterDataSet> getAllInformationObjects(ISession session) throws MasterDataException {
    IQueryExpression qe;
    IQueryParameter qp;
    // our main SDP class
    Constructor<?> constructorForSDPbaseClass = getStandardConstructor();
    SimpleDateFormat itaTimestampFormat = new SimpleDateFormat("yyyyMMddHHmmssSSS");
    // search in standard time range (modification date!)
    Calendar cal = Calendar.getInstance();
    cal.set(2010, Calendar.JANUARY, 1);
    Date startDate = cal.getTime();
    Date endDate = new Date();
    Long startDateL = Long.parseLong(itaTimestampFormat.format(startDate));
    Long endDateL = Long.parseLong(itaTimestampFormat.format(endDate));
    IDescriptor modDesc = IBVRIDescriptor.ModificationDate.getDescriptor(session);
    // count once before to determine initial capacities for hash map/set
    IBVRIArchiveClass SDP_ARCHIVECLASS = getMasterDataPropertyBag().getSDP_ARCHIVECLASS();
    qe = SDP_ARCHIVECLASS.getQueryExpression(session);
    qp = session.getDocumentServer().getClassFactory()
            .getQueryParameterInstance(session, new String[] {SDP_ARCHIVECLASS.getDatabaseName(session)}, null, null);
    qp.setExpression(qe);
    qp.setHitLimitThreshold(0);
    qp.setHitLimit(0);
    int nrOfHitsTotal = session.getDocumentServer().queryCount(session, qp, "*");
    int initialCapacity = (int) (nrOfHitsTotal / 0.75 + 1);
    // MD sets; and objects already done (here: document ID)
    HashSet<String> objDone = new HashSet<>(initialCapacity);
    HashMap<String, IMasterDataSet> objRes = new HashMap<>(initialCapacity);
    qp.close();
    // do queries until hit count is smaller than 10,000
    // use modification date
    boolean keepGoing = true;
    while (keepGoing) {
        // construct query expression
        // - basic part: Modification date & class type
        // a. doc. class type
        qe = SDP_ARCHIVECLASS.getQueryExpression(session);
        // b. ID
        qe = SearchUtil.appendQueryExpressionWithANDoperator(session, qe,
                new PlainExpression(modDesc.getQueryLiteral() + " BETWEEN " + startDateL + " AND " + endDateL));
        // 2. Query Parameter: set database; set expression
        qp = session.getDocumentServer().getClassFactory()
                .getQueryParameterInstance(session, new String[] {SDP_ARCHIVECLASS.getDatabaseName(session)}, null, null);
        qp.setExpression(qe);
        // order by modification date; hitlimit = 0 -> no hitlimit, but the usual 10,000 max
        qp.setOrderByExpression(session.getDocumentServer().getClassFactory().getOrderByExpressionInstance(modDesc, true));
        qp.setHitLimitThreshold(0);
        qp.setHitLimit(0);
        // Do not sort by modification date;
        qp.setHints("+NoDefaultOrderBy");
        keepGoing = false;
        IInformationObject[] hits = null;
        IDocumentHitList hitList = null;
        hitList = session.getDocumentServer().query(qp, session);
        IDocument doc;
        if (hitList.getTotalHitCount() > 0) {
            hits = hitList.getInformationObjects();
            for (IInformationObject hit : hits) {
                String objID = hit.getID();
                if (!objDone.contains(objID)) {
                    // do something with this object and the class
                    // here: construct a new SDP sub class object and give it back via interface
                    doc = (IDocument) hit;
                    IMasterDataSet mdSet;
                    try {
                        mdSet = (IMasterDataSet) constructorForSDPbaseClass.newInstance(session, doc);
                    } catch (Exception e) {
                        // cause for this
                        String cause = (e.getCause() != null) ? e.getCause().toString() : MasterDataException.ERRMSG_PART_UNKNOWN;
                        throw new MasterDataException(MasterDataException.ERRMSG_NOINSTANCE_POSSIBLE, this.getClass().getSimpleName(), e.toString(), cause);
                    }
                    objRes.put(mdSet.getID(), mdSet);
                    objDone.add(objID);
                }
            }
            doc = (IDocument) hits[hits.length - 1];
            Date lastModDate = ((IDateValue) doc.getDescriptor(modDesc).getValues()[0]).getValue();
            startDateL = Long.parseLong(itaTimestampFormat.format(lastModDate));
            keepGoing = (hits.length >= 10000 || hitList.isResultSetTruncated());
        }
        qp.close();
    }
    return objRes;
}
Loading 120,000 rows (and more) each time will not scale very well, and your solution may not work in the future as the number of records grows. Instead, let the database server handle the problem.
Your table needs to have a primary key or unique key based on the columns of the records. Iterate through the 10,000 records, performing a JDBC SQL update to modify all field values, with a WHERE clause that exactly matches the primary/unique key.
update BLAH set COL1 = ?, COL2 = ? where PKCOL = ?; // ... AND PKCOL2 =? ...
This modifies an existing row or does nothing at all, and JDBC executeUpdate() will return 0 or 1, indicating the number of rows changed. If the number of rows changed is zero, you have detected a new record that does not exist yet, so perform an insert for that new record only.
insert into BLAH (COL1, COL2, ... PKCOL) values (?,?, ..., ?);
You can decide whether to run 10,000 updates followed by however many inserts are needed, or to do an update plus an optional insert per record. Remember that JDBC batch statements and turning auto-commit off may help speed things up.
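A rough sketch of that update-then-optional-insert pattern with auto-commit off. BLAH, COL1, COL2 and PKCOL are just the placeholder names from the statements above, and ClientRecord with its getters is an assumed placeholder for your record class; a java.sql.Connection named conn is also assumed:
conn.setAutoCommit(false);   // one transaction around the whole batch
try (PreparedStatement update = conn.prepareStatement(
             "update BLAH set COL1 = ?, COL2 = ? where PKCOL = ?");
     PreparedStatement insert = conn.prepareStatement(
             "insert into BLAH (COL1, COL2, PKCOL) values (?, ?, ?)")) {
    for (ClientRecord r : records) {            // ClientRecord: placeholder for your record class
        update.setString(1, r.getCol1());
        update.setString(2, r.getCol2());
        update.setString(3, r.getKey());
        if (update.executeUpdate() == 0) {      // 0 rows changed -> the record is new
            insert.setString(1, r.getCol1());
            insert.setString(2, r.getCol2());
            insert.setString(3, r.getKey());
            insert.executeUpdate();             // or insert.addBatch() plus executeBatch() for speed
        }
    }
    conn.commit();
} catch (SQLException e) {
    conn.rollback();
    throw e;
}
This way only the 10,000 incoming records travel over the wire, instead of the whole 120,000-row table being loaded into memory first.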
I have been wondering if there is a way to access the full list of a Twitter account's followers.
We have tried calling the REST API via twitter4j:
public List<User> getFriendList() {
    List<User> friendList = null;
    try {
        friendList = mTwitter.getFollowersList(mTwitter.getId(), -1);
    } catch (IllegalStateException e) {
        e.printStackTrace();
    } catch (TwitterException e) {
        e.printStackTrace();
    }
    return friendList;
}
But it returns only a list of 20 followers.
I tried using the same call in a loop, but it causes a rate limit exception, saying we are not allowed to make too many requests in a small interval of time.
Do we have a way around this?
You should definitely use getFollowersIDs. As the documentation says, it returns the numeric follower IDs, with the list broken into pages of around 5,000 IDs at a time. To begin paging, provide a value of -1 as the cursor. The response from the API will include a previous_cursor and next_cursor to allow paging back and forth.
The tricky part is to handle the cursor. If you can do this, then you will not have the problem of getting only 20 followers.
The first call to getFollowersIDs will need to be given a cursor of -1. For subsequent calls, you need to update the cursor value, by getting the next cursor, as done in the while part of the loop.
long cursor = -1L;
IDs ids;
do {
    ids = twitter.getFollowersIDs(cursor);
    for (long userID : ids.getIDs()) {
        friendList.add(userID);
    }
} while ((cursor = ids.getNextCursor()) != 0);
Here is a very good reference:
https://github.com/yusuke/twitter4j/blob/master/twitter4j-examples/src/main/java/twitter4j/examples/friendsandfollowers/GetFriendsIDs.java
Now, if the user has more than around 75000 followers, you will have to do some waiting (see Vishal's answer).
The first 15 calls will yield you around 75000 IDs. Then you will have to sleep for 15 minutes. Then make another 15 calls, and so on till you get all the followers. This can be done using a simple Thread.sleep(time_in_milliseconds) outside the for loop.
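For example, a rough variant of the loop above that pauses after every 15 calls (a sketch only; exception handling for TwitterException and InterruptedException is omitted for brevity):
long cursor = -1L;
int calls = 0;
IDs ids;
do {
    ids = twitter.getFollowersIDs(cursor);
    for (long userID : ids.getIDs()) {
        friendList.add(userID);
    }
    calls++;
    if (calls % 15 == 0 && ids.hasNext()) {
        Thread.sleep(15 * 60 * 1000L);   // sleep through the rest of the 15-minute window
    }
} while ((cursor = ids.getNextCursor()) != 0);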
Just change it like this and try; this works for me:
try {
    Log.i("act twitter...........", "ModifiedCustomTabBarActivity.class");
    // final JSONArray twitterFriendsIDsJsonArray = new JSONArray();
    long cursor = -1;
    IDs ids;
    do {
        ids = mTwitter.mTwitter.getFriendsIDs(cursor);
        for (long id : ids.getIDs()) {
            String ID = "followers ID #" + id;
            String[] firstname = ID.split("#");
            String first_Name = firstname[0];
            String Id = firstname[1];
            Log.i("split...........", first_Name + Id);
            String Name = mTwitter.mTwitter.showUser(id).getName();
            String screenname = mTwitter.mTwitter.showUser(id).getScreenName();
            // Log.i("id.......", "followers ID #" + id);
            // Log.i("Name..", mTwitter.mTwitter.showUser(id).getName());
            // Log.i("Screen_Name...", mTwitter.mTwitter.showUser(id).getScreenName());
            // Log.i("image...", mTwitter.mTwitter.showUser(id).getProfileImageURL());
        }
        cursor = ids.getNextCursor();   // advance to the next page so the loop terminates
    } while (ids.hasNext());
} catch (Exception e) {
    e.printStackTrace();
}
Try This...
ConfigurationBuilder confbuilder = new ConfigurationBuilder();
confbuilder.setOAuthAccessToken(accessToken)
        .setOAuthAccessTokenSecret(secretToken)
        .setOAuthConsumerKey(TwitterOAuthActivity.CONSUMER_KEY)
        .setOAuthConsumerSecret(TwitterOAuthActivity.CONSUMER_SECRET);
Twitter twitter = new TwitterFactory(confbuilder.build()).getInstance();
PagableResponseList<User> followersList;
ArrayList<String> list = new ArrayList<String>();
long cursor = -1;   // start with the first page of followers
try
{
    followersList = twitter.getFollowersList(screenName, cursor);
    for (int i = 0; i < followersList.size(); i++)
    {
        User user = followersList.get(i);
        String name = user.getName();
        list.add(name);
        System.out.println("Name" + i + ":" + name);
    }
    listView.setAdapter(new ArrayAdapter<String>(this, android.R.layout.simple_list_item_1, list));
    listView.setVisibility(View.VISIBLE);
    friend_list.setVisibility(View.INVISIBLE);
    post_feeds.setVisibility(View.INVISIBLE);
    twit.setVisibility(View.INVISIBLE);
}
catch (TwitterException e)
{
    e.printStackTrace();
}
This is a tricky one.
You should specify whether you're using application or per user tokens and the number of users you're fetching followers_ids for.
You get just 15 calls per 15 minutes in case of an application token. You can fetch a maximum of 5000 followers_ids per call. That gives you a maximum of 75K followers_ids per 15 minutes.
If any of the users you're fetching followers_ids for has over 75K followers, you'll get the rate_limit error immediately. If you're fetching for more than 1 user, you'll need to build strong rate_limit handling in your code with sleeps and be very patient.
The same applies for friends_ids.
I've not had to deal with fetching more than 75K followers/friends for a given user but come to think of it, I don't know if it's even possible anymore.
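One hedged way to build that kind of rate_limit handling is to react to the TwitterException itself, which in Twitter4J carries the rate limit status. This is only a sketch, and fetchFollowerIdsPage is a made-up wrapper name:
// Retry the same cursor after the window resets instead of failing the whole run.
static IDs fetchFollowerIdsPage(Twitter twitter, long userId, long cursor)
        throws TwitterException, InterruptedException {
    while (true) {
        try {
            return twitter.getFollowersIDs(userId, cursor);
        } catch (TwitterException e) {
            if (e.exceededRateLimitation() && e.getRateLimitStatus() != null) {
                // sleep until the 15-minute window resets, then retry the same page
                Thread.sleep((e.getRateLimitStatus().getSecondsUntilReset() + 5) * 1000L);
            } else {
                throw e;   // anything other than a rate limit is a real error
            }
        }
    }
}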
I am trying to fetch all the documents from a database without knowing the exact URIs. I got one query:
DocumentPage documents = docMgr.read();
while (documents.hasNext()) {
    DocumentRecord document = documents.next();
    System.out.println(document.getUri());
}
But I do not have specific URIs; I want all the documents.
The first step is to enable your uris lexicon on the database.
You could eval some XQuery and run cts:uris() (or server-side JS and run cts.uris()):
ServerEvaluationCall call = client.newServerEval()
        .xquery("cts:uris()");
for (EvalResult result : call.eval()) {
    String uri = result.getString();
    System.out.println(uri);
}
Two drawbacks are: (1) you'd need a user with privileges and (2) there is no pagination.
If you have a small number of documents, you don't need pagination. But for a large number of documents pagination is recommended. Here's some code using the search API and pagination:
// do the next eight lines just once
String options =
    "<options xmlns='http://marklogic.com/appservices/search'>" +
    " <values name='uris'>" +
    " <uri/>" +
    " </values>" +
    "</options>";
QueryOptionsManager optionsMgr = client.newServerConfigManager().newQueryOptionsManager();
optionsMgr.writeOptions("uriOptions", new StringHandle(options));
// run the following each time you need to list all uris
QueryManager queryMgr = client.newQueryManager();
long pageLength = 10000;
queryMgr.setPageLength(pageLength);
ValuesDefinition query = queryMgr.newValuesDefinition("uris", "uriOptions");
// the following "and" query just matches all documents
query.setQueryDefinition(new StructuredQueryBuilder().and());
int start = 1;
boolean hasMore = true;
Transaction transaction = client.openTransaction();
try {
    while (hasMore) {
        CountedDistinctValue[] uriValues =
            queryMgr.values(query, new ValuesHandle(), start, transaction).getValues();
        for (CountedDistinctValue uriValue : uriValues) {
            String uri = uriValue.get("string", String.class);
            //System.out.println(uri);
        }
        start += uriValues.length;
        // this is the last page if uriValues is smaller than pageLength
        hasMore = uriValues.length == pageLength;
    }
} finally {
    transaction.commit();
}
The transaction is only necessary if you need a guaranteed "snapshot" list isolated from adds/deletes happening concurrently with this process. Since it adds some overhead, feel free to remove it if you don't need such exactness.
Find out the page length; in the queryMgr you can specify the starting point to access. Keep increasing the starting point and loop through all the URIs. I was able to fetch all URIs this way. This may not be the best approach, but it works.
List<String> uriList = new ArrayList<>();
QueryManager queryMgr = client.newQueryManager();
StructuredQueryBuilder qb = new StructuredQueryBuilder();
StructuredQueryDefinition querydef = qb.and(qb.collection("xxxx"), qb.collection("whatever"), qb.collection("whatever")); // outputs 241152
SearchHandle results = queryMgr.search(querydef, new SearchHandle(), 10);
long pageLength = results.getPageLength();
long totalResults = results.getTotalResults();
System.out.println("Total Results: " + totalResults);
long timesToLoop = totalResults / pageLength;
for (int i = 0; i < timesToLoop; i = (int) (i + pageLength)) {
    System.out.println("Printing Results from: " + (i) + " to: " + (i + pageLength));
    results = queryMgr.search(querydef, new SearchHandle(), i);
    MatchDocumentSummary[] summaries = results.getMatchResults(); // 10 results because page length is 10
    for (MatchDocumentSummary summary : summaries) {
        // System.out.println("Extracted from URI-> " + summary.getUri());
        uriList.add(summary.getUri());
    }
    if (i >= 1000) { // number of URIs to store/retrieve, plus 10
        break;
    }
}
uriList = uriList.stream().distinct().collect(Collectors.toList());
return uriList;
I've installed HBase 0.94.0. I need to improve my read performance through Scan. I've inserted 100,000 random records.
When I set setCaching(100), my scan took 16 seconds for 100,000 records.
When I set it to setCaching(50), it took 90 seconds for 100,000 records.
When I set it to setCaching(10), it took 16 seconds for 100,000 records.
public class Test {

    public static void main(String[] args) {
        long start, middle, end;

        HTableDescriptor descriptor = new HTableDescriptor("Student7");
        descriptor.addFamily(new HColumnDescriptor("No"));
        descriptor.addFamily(new HColumnDescriptor("Subject"));

        try {
            HBaseConfiguration config = new HBaseConfiguration();
            HBaseAdmin admin = new HBaseAdmin(config);
            admin.createTable(descriptor);

            HTable table = new HTable(config, "Student7");
            System.out.println("Table created !");

            start = System.currentTimeMillis();
            for (int i = 1; i < 100000; i++) {
                String s = Integer.toString(i);
                Put p = new Put(Bytes.toBytes(s));
                p.add(Bytes.toBytes("No"), Bytes.toBytes("IDCARD"), Bytes.toBytes("i+10"));
                p.add(Bytes.toBytes("No"), Bytes.toBytes("PHONE"), Bytes.toBytes("i+20"));
                p.add(Bytes.toBytes("No"), Bytes.toBytes("PAN"), Bytes.toBytes("i+30"));
                p.add(Bytes.toBytes("No"), Bytes.toBytes("ACCT"), Bytes.toBytes("i+40"));
                p.add(Bytes.toBytes("Subject"), Bytes.toBytes("English"), Bytes.toBytes("50"));
                p.add(Bytes.toBytes("Subject"), Bytes.toBytes("Science"), Bytes.toBytes("60"));
                p.add(Bytes.toBytes("Subject"), Bytes.toBytes("History"), Bytes.toBytes("70"));
                table.put(p);
            }
            middle = System.currentTimeMillis();

            Scan s = new Scan();
            s.setCaching(100);
            ResultScanner scanner = table.getScanner(s);
            try {
                for (Result rr = scanner.next(); rr != null; rr = scanner.next()) {
                    System.out.println("Found row: " + rr);
                }
                end = System.currentTimeMillis();
            } finally {
                scanner.close();
            }
            System.out.println("TableCreation-Time: " + (middle - start));
            System.out.println("Scan-Time: " + (middle - end));
        } catch (IOException e) {
            System.out.println("IOError: cannot create Table.");
            e.printStackTrace();
        }
    }
}
Why is this happening?
Why would you want to return every record in your 100,000-record table? You're doing a full table scan, and just as in any large database, this is slow.
Try thinking about a more useful use case in which you would like to return some columns of a record or a range of records.
HBase has only one index on its table: the row key. Make use of that. Try defining your row key so that you can get the data you need just by specifying the row key.
Let's say you would like to know the value of Subject:History for the rows with a row key between 80000 and 80100. (Note that setCaching(100) means HBase will fetch 100 records per RPC, so this scan needs just one. Fetching 100 rows obviously requires more memory than fetching, say, one row. Keep that in mind in a large multi-user environment.)
Long start, end;
start = System.currentTimeMillis();

Scan s = new Scan(String.valueOf(80000).getBytes(), String.valueOf(80100).getBytes());
s.setCaching(100);
s.addColumn("Subject".getBytes(), "History".getBytes());

ResultScanner scanner = table.getScanner(s);
try {
    for (Result rr = scanner.next(); rr != null; rr = scanner.next()) {
        System.out.println("Found row: " + Bytes.toString(rr.getRow())
                + " value: " + Bytes.toString(rr.getValue("Subject".getBytes(), "History".getBytes())));
    }
    end = System.currentTimeMillis();
} finally {
    scanner.close();
}
System.out.println("Scan: " + (end - start));
This might look odd: how would you know which rows you need just from an integer? Well, exactly, and that's why you need to design a row key according to what you're about to query, instead of just using an incremental value as you would in a traditional database.
Try this example. It should be fast.
Note: I didn't run the example. I just typed it here. Maybe there are some small syntax errors you should correct but I hope the idea is clear.
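One more note on the row key design point above (my addition, not part of the answer): with plain Integer.toString keys, HBase's lexicographic ordering means a scan from "80000" to "80100" also picks up keys such as "801" and "8001". If the key really is a number, a common trick is to zero-pad it to a fixed width so string order matches numeric order, which makes range scans like the one above reliable. A sketch using the same table and column names as the question:
// Writing: zero-pad the numeric key to a fixed width.
String rowKey = String.format("%08d", i);          // e.g. 80000 -> "00080000"
Put p = new Put(Bytes.toBytes(rowKey));
p.add(Bytes.toBytes("Subject"), Bytes.toBytes("History"), Bytes.toBytes("70"));
table.put(p);

// Reading: a range scan over the padded keys then returns exactly rows 80000..80099 (stop row is exclusive).
Scan s = new Scan(Bytes.toBytes(String.format("%08d", 80000)),
                  Bytes.toBytes(String.format("%08d", 80100)));
s.addColumn(Bytes.toBytes("Subject"), Bytes.toBytes("History"));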
This seems like a very strange problem. I'm stress testing my neo4j graph database, and so one of my tests requires creating a lot of users (in this specific test, 1000). So the code for that is as follows,
// Creates n users and measures the time taken to add another
n = 1000;
tx = graphDb.beginTx();
try {
    for (int i = 0; i < n; i++) {
        dataService.createUser(BigInteger.valueOf(i));
    }
    start = System.nanoTime();
    dataService.createUser(BigInteger.valueOf(n));
    end = System.nanoTime();
    time = end - start;
    System.out.println("The time taken for createUser with " + n + " users is " + time + " nanoseconds.");
    tx.success();
} finally {
    tx.finish();
}
And the code for dataService.createUser() is,
public User createUser(BigInteger identifier) throws ExistsException {
    // Verify that user doesn't already exist.
    if (this.nodeIndex.get(UserWrapper.KEY_IDENTIFIER, identifier)
            .getSingle() != null) {
        throw new ExistsException("User with identifier '"
                + identifier.toString() + "' already exists.");
    }
    // Create new user.
    final Node userNode = graphDb.createNode();
    final User user = new UserWrapper(userNode);
    user.setIdentifier(identifier);
    userParent.getNode().createRelationshipTo(userNode, NodeRelationships.PARENT);
    return user;
}
Now I need to call dataService.getUser() after I've made these Users. The code for getUser() is as follows,
public User getUser(BigInteger identifier) throws DoesNotExistException {
    // Search for the user.
    Node userNode = this.nodeIndex.get(UserWrapper.KEY_IDENTIFIER,
            identifier).getSingle();
    // Return the wrapped user, if found.
    if (userNode != null) {
        return new UserWrapper(userNode);
    } else {
        throw new DoesNotExistException("User with identifier '"
                + identifier.toString() + "' was not found.");
    }
}
So everything is going fine until I create the 129th user. I'm following along in the debugger and watching the values of dataService.getUser(BigInteger.valueOf(1)) (the second node), dataService.getUser(BigInteger.valueOf(127)) (the 128th node), and dataService.getUser(BigInteger.valueOf(i-1)) (the last node created). The debugger is telling me that after node 128 is created, node 129 and above aren't created: getUser() throws a DoesNotExistException for those nodes, but still returns values for node 2 and node 128.
The user id I'm passing to createUser() is autoindexed.
Any idea why it isn't making more nodes (or not indexing these nodes)?
It sounds suspiciously like a byte value conversion which flips around at 128. Could you make sure there isn't anything like that going on in your code?
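For reference, a minimal illustration of the flip being suspected (my example, not from the question's code): a Java byte is signed and 8 bits wide, so any narrowing of the identifier to a byte wraps to negative exactly at 128.
// byte holds -128..127, so 128 wraps around when narrowed.
int i = 128;
byte b = (byte) i;
System.out.println(b);                                  // prints -128
System.out.println(BigInteger.valueOf(i).byteValue());  // also -128: the low-order byte of the identifier
If anything in the indexing path (or in UserWrapper.setIdentifier) narrows the BigInteger like this, lookups for identifiers 128 and above would miss even though the nodes were created.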