How to process files in batch with PDF OCR? - java

I would like to process 20,000 PDFs in batch asynchronously with Google Cloud Vision OCR, but I did not find documentation related to it. I already tried using the client.asyncBatchAnnotateFilesAsync function:
// Note: `feature` is assumed to be defined elsewhere (e.g. DOCUMENT_TEXT_DETECTION).
List<AsyncAnnotateFileRequest> requests = new ArrayList<>();
for (MultipartFile file : files) {
    GcsSource gcsSource = GcsSource.newBuilder()
            .setUri(gcsSourcePath + file.getOriginalFilename()).build();
    InputConfig inputConfig = InputConfig.newBuilder()
            .setMimeType("application/pdf").setGcsSource(gcsSource).build();
    GcsDestination gcsDestination = GcsDestination.newBuilder()
            .setUri(gcsDestinationPath + file.getOriginalFilename()).build();
    OutputConfig outputConfig = OutputConfig.newBuilder()
            .setBatchSize(2).setGcsDestination(gcsDestination).build();
    AsyncAnnotateFileRequest request = AsyncAnnotateFileRequest.newBuilder()
            .addFeatures(feature)
            .setInputConfig(inputConfig)
            .setOutputConfig(outputConfig)
            .build();
    requests.add(request);
}
AsyncBatchAnnotateFilesRequest request = AsyncBatchAnnotateFilesRequest.newBuilder()
        .addAllRequests(requests)
        .build();
AsyncBatchAnnotateFilesResponse response = client.asyncBatchAnnotateFilesAsync(request).get();
System.out.println("Waiting for the operation to finish.");
But what I get is an error message
io.grpc.StatusRuntimeException: INVALID_ARGUMENT: At this time, only single requests are supported for asynchronous processing.
If Google does not support batch processing, why do they provide asyncBatchAnnotateFilesAsync? Am I using an old version? Does the asyncBatchAnnotateFilesAsync function work in another beta version?

Multiple requests on a single call are not supported by the Vision service.
This may be confusing because, according to the RPC API documentation, you can indeed provide multiple requests in a single service call (one file per request). Still, according to this issue tracker, there is a known limitation in the Vision service: it can currently only take one request at a time.

Since the service limits you to one file request per call, can you just send 20k separate requests? They are async requests, so it should be pretty fast to send them all in.
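A rough sketch of that approach, reusing the objects from the question (client, feature, gcsSourcePath, gcsDestinationPath), might look like this:
// Sketch only: submit one async operation per file instead of batching them all
// into a single AsyncBatchAnnotateFilesRequest.
List<OperationFuture<AsyncBatchAnnotateFilesResponse, OperationMetadata>> operations = new ArrayList<>();
for (MultipartFile file : files) {
    GcsSource gcsSource = GcsSource.newBuilder()
            .setUri(gcsSourcePath + file.getOriginalFilename()).build();
    InputConfig inputConfig = InputConfig.newBuilder()
            .setMimeType("application/pdf").setGcsSource(gcsSource).build();
    GcsDestination gcsDestination = GcsDestination.newBuilder()
            .setUri(gcsDestinationPath + file.getOriginalFilename()).build();
    OutputConfig outputConfig = OutputConfig.newBuilder()
            .setBatchSize(2).setGcsDestination(gcsDestination).build();
    AsyncAnnotateFileRequest fileRequest = AsyncAnnotateFileRequest.newBuilder()
            .addFeatures(feature)
            .setInputConfig(inputConfig)
            .setOutputConfig(outputConfig)
            .build();
    // Exactly one file request per call, which is the limit described above.
    AsyncBatchAnnotateFilesRequest request = AsyncBatchAnnotateFilesRequest.newBuilder()
            .addRequests(fileRequest)
            .build();
    operations.add(client.asyncBatchAnnotateFilesAsync(request));
}
// Each operation writes its JSON output to the GCS destination when it finishes;
// block here only if you actually need to wait for all of them.
for (OperationFuture<AsyncBatchAnnotateFilesResponse, OperationMetadata> op : operations) {
    op.get();
}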

Related

parallel stream with kafka consumer records

I have kafka records:
ConsumerRecords<String, Events> records = kafkaConsumer.poll(POLL_TIMEOUT);
I want to run the below code using parallel streams, not multithreading.
records.forEach((record) -> {
    Event event = record.value();
    HTTPSend.send(event);
});
I tried with multithreading but I want to try a parallel stream:
for (ConsumerRecord<String, Event> record : records) {
    executor.execute(new Runnable() {
        @Override
        public void run() {
            HTTPSend.send(record.value());
        }
    });
}
Actually I'm facing an issue with HTTPSend.send with multithreading (even with a thread pool of one thread). I'm getting
"Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target".
This is a request over HTTPS. The error occurs only the first time a request is made; afterwards, the exception vanishes.
For multithreading I'm using:
int threadCount = 1;
BlockingQueue<Runnable> queue = new ArrayBlockingQueue<Runnable>(threadCount, true);
RejectedExecutionHandler handler = new ThreadPoolExecutor.CallerRunsPolicy();
ExecutorService executor = new ThreadPoolExecutor(threadCount, threadCount, 0L, TimeUnit.MILLISECONDS, queue, handler);
HTTPSend.send() is:
long sizeSend = 0;
SSLContext sc = null;
try {
    sc = SSLContext.getInstance("TLS");
    sc.init(null, TRUST_ALL_CERTS, new SecureRandom());
} catch (NoSuchAlgorithmException | KeyManagementException e) {
    LOGGER.error("Failed to create SSL context", e);
}
// Ignore differences between given hostname and certificate hostname
HostnameVerifier hv = (hostname, session) -> true;
// Create the REST client and configure it to connect meta
Client client = ClientBuilder.newBuilder()
        .hostnameVerifier(hv)
        .sslContext(sc)
        .build();
WebTarget baseTarget = client.target(getURL()).path(HTTP_PATH);
Response jsonResponse = null;
try {
    StringBuilder eventsBatchString = new StringBuilder();
    eventsBatchString.append(this.getEvent(event));
    Entity<String> entity = Entity.entity(eventsBatchString.toString(), MediaType.APPLICATION_JSON_TYPE);
    Invocation.Builder builder = baseTarget.request();
    LOGGER.debug("about to send the event {} and URL {}", entity, getURL());
    jsonResponse = builder.header(HTTP_ACK_CHANNEL, guid.toString())
            .header("Content-type", MediaType.APPLICATION_JSON)
            .header("Authorization", String.format("Meta %s", eventsModuleConfig.getSecretKey()))
            .post(entity);
I see what you want to do, and I'm not sure that's the best idea (I'm also not sure it's not).
The poll / commit model of Kafka allows simple backpressure and retention of the last item processed if you crashed. By returning to your poll loop "immediately" you are telling Kafka "I am ready for more", and committing the offset (manually or automatically) tells Kafka that you have successfully read up to that point.
What you seem to want to do is read off Kafka as fast as possible, commit the offsets, put the Kafka records into an executor queue, and then balance your requests per second, etc., from that queue.
I'm not 100% sure that's a good idea: what happens if your app crashes? You may have committed some Kafka messages that actually didn't make it upstream. If you do really want to do this, I would suggest manually committing the offset (via commitSync) upon completion of the Runnable, instead of letting the high level consumer do it for you.
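A minimal, synchronous sketch of that idea (assuming auto-commit is disabled, and reusing the kafkaConsumer, Event and HTTPSend names from your snippets):
// Sketch: commit manually only after the record has actually been sent upstream.
ConsumerRecords<String, Event> records = kafkaConsumer.poll(POLL_TIMEOUT);
for (ConsumerRecord<String, Event> record : records) {
    HTTPSend.send(record.value());
    // Commit the offset of the record we just processed (+1 = next offset to read).
    kafkaConsumer.commitSync(Collections.singletonMap(
            new TopicPartition(record.topic(), record.partition()),
            new OffsetAndMetadata(record.offset() + 1)));
}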
Why might you want to use a thread executor? I think the same goals can be accomplished with Kafka too.
You may want to post multiple messages to the web server at the same time. A well-partitioned Kafka topic will let multiple consumers / consumer groups consume multiple partitions, and thus - assuming a perfectly scaling HTTP server - would let you parallelize the posting of messages to your server. Yay for process-based concurrency!
Maybe the web server is not perfectly scalable, or is slow for this request (say each request takes 1 second): then you need to limit the number of requests per second the web server takes. If you have a queue, you might have a couple of threads posting while not backing up Kafka.
In this case you can set max.poll.records to a value your web server can keep up with. There's probably a better way to do this too, although it's escaping me at the moment.
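For example (a sketch; the broker address and group id are placeholders):
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
props.put(ConsumerConfig.GROUP_ID_CONFIG, "events-http-forwarder");   // placeholder group id
// Cap how many records one poll() returns, so a single loop iteration never
// produces more HTTP requests than the web server can absorb.
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "10");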
If your web server takes a long time to respond you may get errors related to failing heartbeats. In that case I direct you to this SO answer on the timeout / heartbeat topic.
Instead of using a thread executor, thus making synchronous HTTP requests appear to be async, I would use an evented HTTP client like Netty, thus achieving parallelism without thread-based concurrency.
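As one illustration of that style, here is a sketch using the JDK's built-in java.net.http.HttpClient (Java 11+) rather than Netty; the endpoint URL and the toJson helper are placeholders:
// Sketch: one non-blocking POST per record; no thread-per-request executor needed.
HttpClient httpClient = HttpClient.newHttpClient();
records.forEach(record -> {
    String json = toJson(record.value()); // hypothetical serialization helper
    HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://example.com/events")) // placeholder endpoint
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(json))
            .build();
    httpClient.sendAsync(request, HttpResponse.BodyHandlers.ofString())
            .thenAccept(response -> LOGGER.debug("status {}", response.statusCode()));
});
// Note: with async sends you would still want to wait for completion before committing offsets.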
For solving a "slow consumer" use case where you're doing I/O processing, you should use something like Parallel Consumer (PC) to avoid the "head of line blocking" problem you're describing.
By using PC, you can process all your keys in parallel, regardless of how long your I/O takes.
It also comes with a non blocking Vert.x module which more efficiently uses the CPU.
PC directly solves this by sub-partitioning the input partitions by key and processing each key in parallel.
It also tracks per-record acknowledgement. Check out Parallel Consumer on GitHub (it's open source BTW, and I'm the author).

proper way of connecting to twitter using twitter-hbc in multithreaded system

I have a use case in which I am consuming the Twitter filter stream in a multithreaded environment using Twitter HBC.
I receive keywords to track from the user in each request, process them, and save them to the database.
Example
Request 1: track keywords [apple, google]
Request 2: track keywords [yahoo, microsoft]
What I am currently doing is opening a new thread for each request and processing it there.
I am making a connection for every endpoint as below (I am following this official HBC example):
endpoint.trackTerms(Lists.newArrayList("twitterapi", "#yolo"));
Authentication auth = new OAuth1(consumerKey, consumerSecret, token, secret);

// Create a new BasicClient. By default gzip is enabled.
Client client = new ClientBuilder()
        .hosts(Constants.STREAM_HOST)
        .endpoint(endpoint)
        .authentication(auth)
        .processor(new StringDelimitedProcessor(queue))
        .build();

// Establish a connection
client.connect();
The problem I am facing is that Twitter warns me about having more than one connection.
But according to the above code and my use case, I have to make a new Client instance for every request I receive, as my endpoints (trackTerms) are different for every request.
What is the proper way to connect to Twitter for my use case (a multithreaded system) to avoid the too-many-connections and rate-limit warnings?
Current warnings from twitter:
2017-02-01 13:47:33 INFO TwitterStreamImpl:62 - 420:Returned by the Search and Trends API when you are being rate limited (https://dev.twitter.com/docs/rate-limiting).
Returned by the Streaming API:
Too many login attempts in a short period of time.
Running too many copies of the same application authenticating with the same account name.
Exceeded connection limit for user
Thanks.

App engine slow response time and optimization

I have a project on Google App Engine, and there are several functions using the Drive Java API.
Also, I'm using com.google.appengine.api.users.User.
When I'm using some function, for example createDocument:
public FileResponse createDocument(FileRequest file, @Named("visibility") @Nullable String visibility, User user)
        throws IOException, OAuthRequestException, BadRequestException {
    Utils.validateAuthenticatedUser(user);
    file.setValidator(new FileRequestValidator(FileRequestValidator.FileRequestType.CREATE));
    file.validate(file);
    Drive drive = new Drive.Builder(Globals.httpTransport, Globals.jsonFactory,
            Authenticator.credential(Constants.DRIVE_SCOPE, file.getDomainUser()))
            .setApplicationName("My - APP").build();
    File newFile = null;
    try {
        Drive.Files.Insert insert = drive.files().insert(file.getFile());
        if (visibility != null) insert.setVisibility(visibility);
        newFile = insert.execute();
        return new FileResponse(newFile);
    } catch (Exception e) {
        logger.severe("An error occurred: " + e.getMessage());
        throw new OAuthRequestException(e.getMessage());
    }
}
This function is working, but it takes over 920 ms. Is there a way I can optimize it, even if it means paying Google more?
We can see that about 700 ms of the time belongs to the urlFetch call.
You can use Appstats to profile the Remote Procedure Call (RPC) performance of your application. RPC calls can make your application slow.
To keep your application fast, you need to know:
Is your application making unnecessary RPC calls?
Should it cache data instead of making repeated RPC calls to get the same data?
Will your application perform better if multiple requests are executed in parallel rather than serially?
Appstats verifies whether your application is using RPC calls in the most efficient way by letting you profile them. It allows you to trace all RPC calls for a given request and reports the time and cost of each call.

REST Service Task

I am following an example REST Service Task
I start my process engine using
val configuration = new StandaloneProcessEngineConfiguration()
configuration.setProcessEngineName(processEngineName)
Here is my bpmn file snippet
<process id="approve-loan" name="Loan Approval" isExecutable="true">
  <serviceTask id="process_task" activiti:class="com.noggin.bpm.loan.ProcessRequestDelegate"
               activiti:exclusive="true" name="compute Task">
    <extensionElements>
      <activiti:connector>
        <activiti:connectorId>http-connector</activiti:connectorId>
        <activiti:inputOutput>
          <activiti:inputParameter name="url">http://127.0.0.1:5004/Hello/sayhello</activiti:inputParameter>
          <activiti:inputParameter name="method">POST</activiti:inputParameter>
          <activiti:inputParameter name="headers">
            <activiti:map>
              <activiti:entry key="Accept">application/json</activiti:entry>
              <activiti:entry key="Content-type">application/json</activiti:entry>
            </activiti:map>
          </activiti:inputParameter>
          <activiti:inputParameter name="payload"><![CDATA[{"bundleId":"101","script":"def greet = {\n \"Hello World\"\n }\n greet()"}]]></activiti:inputParameter>
          <activiti:outputParameter name="isActive">Result</activiti:outputParameter>
        </activiti:inputOutput>
      </activiti:connector>
    </extensionElements>
I start the process like this
val processEngine = ProcessEngines.getProcessEngine(processEngineName)
val runtime = processEngine.getRuntimeService
val processInstance = runtime.startProcessInstanceByKey(processInstanceKey)
I am successfully able to send the payload to http://127.0.0.1:5004/Hello/sayhello.
My question is how to retrieve the response message from where I started the instance, since the response will be a JSON message that should be sent back to the process initiator.
I believe I saw a similar question from you posted to the Camunda forum yesterday.
Either way, I believe the question and answer is the same.
Let me make sure I understand what you are asking.
1. You are starting the instance using the Java API
2. Your process definition includes a single Service Task that makes a REST call.
3. Your JavaDelegate class populates the "Result" process variable with the response of the REST call.
4. You want to capture the response.
If I have captured your requirement, then I think the problem is in your understanding of how the BPMN engine works.
With the process as you have it modeled, the process instance will start, make the REST call, populate the Response variable and then immediately end.
As you have currently modeled the process, you will not be able to capture the response during process execution.
Your options:
1. Change your model to either send the "Result" using a message service of some sort, or add a wait state where you can retrieve the response.
2. Use the Historical query REST API (or the equivalent Java API) to retrieve the Result payload from the completed instance.
It really depends on your use case as to the most appropriate option to take.
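For option 2, a rough sketch in Java of the equivalent history API (assuming the connector output ends up in the "isActive" variable, as your outputParameter suggests, and that history is enabled at a sufficient level):
// Sketch: read the connector response back from history after the instance has ended.
HistoryService historyService = processEngine.getHistoryService();
HistoricVariableInstance result = historyService.createHistoricVariableInstanceQuery()
        .processInstanceId(processInstance.getId())
        .variableName("isActive") // the variable your output parameter writes to
        .singleResult();
if (result != null) {
    Object responseJson = result.getValue(); // the JSON returned by the REST call
    System.out.println("Connector response: " + responseJson);
}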
Cheers,
Greg

Java Bloomberg API - how to generate a Request without a Service

I am using the Bloomberg API to grab data. Currently, I have 3 processes which get data in the typical way as per the developer's guide. Something like:
Service refDataService = session.getService("//blp/refdata");
Request request = refDataService.createRequest("ReferenceDataRequest");
request.append("securities", "IBM US Equity");
request.append("fields", "PX_LAST");
cid = session.sendRequest(request, null);
That works. Now I would like to expand the logic to something more like an update queue. I would like each process to send its Request to an update-queue process, which would in turn be responsible for creating the session and service, and then sending the requests. However, I don't see any way to create the request without the Service. Also, since the request types (referenceData, historical data, intraday ticks) are so varied and have such different properties, it is not trivial to create a container object which my update queue could read.
Any ideas on how to accomplish this? My ultimate goal is to have a process (I'm calling update queue) which takes in a list of requests, removes any duplicates, and goes out to Bloomberg for the data in 30 second intervals.
Thank you!
I have updated the jBloomberg library to include tick data. You can submit different types of query to a BloombergSession which acts as a queue. So if you want to submit different types of request you can write something like:
RequestBuilder<IntradayTickData> tickRequest =
        new IntradayTickRequestBuilder("SPX Index",
                DateTime.now().minusHours(2),
                DateTime.now());

RequestBuilder<IntradayBarData> barRequest =
        new IntradayBarRequestBuilder("SPX Index",
                DateTime.now().minusHours(2),
                DateTime.now())
                .period(5, TimeUnit.MINUTES);

RequestBuilder<ReferenceData> refRequest =
        new ReferenceRequestBuilder("SPX Index", "NAME");

Future<IntradayTickData> ticks = session.submit(tickRequest);
Future<IntradayBarData> bars = session.submit(barRequest);
Future<ReferenceData> name = session.submit(refRequest);
More examples available in the javadoc.
If you need to fetch the same information regularly, you can reuse a builder and use it in combination with a ScheduledThreadPoolExecutor for example.
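For example, a sketch of that (the 30-second interval comes from the question; the rest reuses the session and refRequest from the snippet above):
// Sketch: resubmit the same reusable builder on a fixed schedule.
ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
scheduler.scheduleAtFixedRate(() -> {
    Future<ReferenceData> latest = session.submit(refRequest); // reuse the builder above
    // hand `latest` to whatever consumes the refreshed data
}, 0, 30, TimeUnit.SECONDS);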
Note: the library is still in beta, so don't use it blindly in a black box that trades automatically!
