Distributed ETL question

Distributed ETL question - java

Looking for any recommendations for an ETL system for 200+ distributed systems (Windows, AS400, Linux etc).
We collect data each month from all of our customers (regardless of system type), bring it back, process it all together and send the aggregate solutions back to them. I'm tasked with automating this system - any suggestions on how to do this robustly, I really don't want to re-invent the wheel. I don't own any of the systems I'm pulling data from, which has made this task more difficult but can install a client.
I've created a prototype client/server architecture in Java with FTP for transport but it feels brittle to me. I should note that all of the extract/transformation code for the different systems already exists in Java (albeit legacy).
I should mention we pull data once per month currently, but working towards weekly.
Any insight is appreciated.

I think it would depend on how the project will become. If this porject will be adding more requirement and there is some money involved, the ETL tool might be good idea.
However, if you have fixed output(the report) now and it is not intended to go anywhere, the custom ETL might be worth it. The reason is the most ETL tools have various output format(Chart, text file etc) and convinience to use the tool but the bottom line is Data moving part is almost universal for all the tools. Even with any other ETL tool, you need to implement same query you are doing now, plus you need to learn the tool. Who knows? Some tool might involved in 200+ site installation.
Recently, our company spent a lot of money to buy report tools & servers & human resource to build good ETL since our in-house ETL has been critisized for the slowness and not professional looking(You know it is not using popular ETL tools. It is bunch of script command). With all the money spending, the project faced on almost dead end.
One more thing. I don't understand how Java & FTP is involved in this process.
Can you directly connect the DB in your customer system using SQL?
If you could, using SQL & stored procedure is always better idea than using JAVA & FTP.
Hope it would help.

Related

Performance test for salesforce

I am working on salesforce UI’s.
How can I ensure that whatever UI’s created are able to handle a load.
Do I need to do performance testing?
Please suggest tools , process so that 500 user load / performance testing if any?

Tool choice depends on many factors, i.e.
ability to integrate flawlessly into your existing SDLC (i.e. if you have Java stack a Python-based tool would be an "alien)
support of all required use cases (nearly all load testing tools are suitable for web applications testing, however when it comes to more "exotic" tasks like communicating with database, WebSockets, handle AJAX requests, etc.) the choice might be less wide
bugdet (there are some proprietary solutions which cost millions of coins and free/open source options)
skills (the tool might require coding skills which you and your team may or may not have)
You can find quite a list for instance here: http://www.opensourcetesting.org/category/testing-tools-overview/performance/?menu-page=overview
Given you have java tag I would recommend to stick to one of the following:
The Grinder
Gatling
Apache JMeter
Check out Open Source Load Testing Tools: Which One Should You Use? article which highlights main features of the above tools followed by comparison matrix, sample scripts, reports, etc., hopefully it will help you to choose the right option.

Tools/techniques to Load/Soak test a deployed Java EE application?

I'm quite new to performance testing and am looking to be pointed in the right direction.
I have a Java project which contains two parts, deployed seperately:
A service-broker, published as a webservice; which has service and db wrappers.
A front end, which has a service-broker facade, business logic and a Spring MVC UI.
It is deployed on tomcat, which is running on a fresh install of Windows server 2008.
I need to do basic soak testing on this project, to highlight major memory leaks performance issues.
I've been told SOAP UI is the tool I need to do this.
Now for my questions:
Soap UI (Load UI) is only appropriate as the load-generator for testing the service-broker aspect of the project, right?
What additional tools would be helpful (Something to visualize garbage collection, memory use, heap/stack size etc?)
Can I use Load UI as a load generator for a Spring MVC Front end? If not, what's an appropriate alternative?
Thanks a lot.

Here is my opinions
SoapUI is good enough for microbenchmark test but not good at huge scale of testing. So i recommend to use other load testing tool. LoadUI can be a solution. But i want to recommend nGrinder. I used it, it works very well. Apache Jmeter is common tool. But it is JVM based so Jmeter itself needs a tuning.
To monitoring application during perfomance testing. easiest way is use VisualVM. It can monitor all that u mentioned. But it can show u a data in just Java Virtual machine perspective. I rather to recommend to use APM (Application Performance Monitoring). AppDynamic will be good solution.
About UX testing, big difference is, it needs record and play feature. U can do it by using Load UI. but nGrinder can also cover that by implementing HTTP resquest in coding. (It is a reason why we use such a expensive tools like LoadRunner etc).
I hope this will be useful to u.
Cheers

There are four sets of requirements you need to cover in a tool
Can it exercise my interface (Any HTTP test tool will do this for a web services application)
Can it monitor my infrastructure. Now you are getting into the details of if your underlying OS and Virtual Machine can be monitored in an integrated fashion. Not all tools allow for this and you need to be very explicit as to the level of detail you are interested in.
Will it report appropriate to my requirements and in a way which allows for easier identification of system bottlenecks? This is a mix of objective and subjective items. You have not indicated what level of reporting you need
Does my user community have the skills to use the tool? Get the top three right and miss this one and even a free as in beer tool goes to a negative ROI almost immediately.
It's time to button up the requirements or just hire a firm with a set of tools included to do the job.

Hadoop-Hive-HBase Advice for Web Analytics

The team I work on is fortunate enough to have management that recognizes the need to enhance our skills and learn new technologies. As a result, whenever we have a little downtime between major projects, we are encouraged to use that time to stretch our minds a bit and learn something new. We often tackle a large research project as a team so that everyone benefits from the knowledge. For example, we built a spec-compliant Kerberos authentication server to get familiar with the ins and outs of the protocol. We wrote our own webserver to learn about efficient design strategies for networked applications.
Recently, we've been very curious about Map-Reduce, specifically Hadoop and the various supporting components (HBase, HDFS, Pig, Hive, etc.). To learn a bit more about it, we would like to write a web analytics service. It will use Javascript page tagging to gather the metrics, and Hadoop and something to make analytics and reports available via a web interface.
The non-Hadoop side of the architecture is easy. A Java servlet will parse the parameters from a Javascript tag (easy enough -- we're a Java shop). The servlet will then send out a JMS message for asynchronous processing (again, easy).
My question is... What next? We've researched things like Hive a bit, and it sounds like a great fit for querying the datastore for the various metrics we're looking for. But, it's high latency. We're fortunate enough to be able to drop this onto a website that gets a few million hits per month. We'd really like to get relatively quick metrics using the web interface for our analytics tool. Latency is not our friend. So, what is the best way to accomplish this? Would it be to run the queries as a scheduled job and then store the results somewhere with lower latency (PostgreSQL, etc.) and retrieve them from there? If that's the case, where should the component listening for the JMS messages store the data? Can Hive get its data from HBase directly? Should we store it in HDFS somewhere and read it in Hive?
Like I said, we're a very technical team and love learning new technologies. This, though, is way different from anything we've learned before, so we'd like to get a sense of what the "best practices" would be here. Any advice or opinions you can give are GREATLY appreciated!
EDIT : I thought I'd add some clarification as to what I'm looking for. I'm seeking advice on architecture and design for a solution such as this. We'll collect 20-30 different metrics on a site that gets several million page views per month. This will be a lot of data, and we'd like to be able to get metrics in as close to realtime as possible. I'm looking for best practices and advice on the architecture of such a solution, because I don't want us to come up with something on our own that is really bad that will leave us thinking we're "Hadoop experts" just because it works.

Hive, as you mentioned, has high latency for queries. It can be pointed at HBase (see https://cwiki.apache.org/Hive/hbaseintegration.html), but the integration results in HBase having tables that are forced into a mostly-rectangular, relational-like schema that is not optimal for HBase. Plus, the overhead of doing it is extremely costly- hive queries against hbase are, on my cluster, at least an order of magnitude slower than against plain HDFS files.
One good strategy is to store the raw metrics in HBase or on plain HDFS (Might want to look at Flume if these metrics are coming from log files) and run periodic MapReduce jobs (even every 5 minutes) to create pre-aggregated results that you can store in plain rectangular files that you can query through Hive. When you are just reading a file and Hive doesn't have to do anything fancy (e.g. sorting, joining, etc), then Hive is actually reasonably low latency- it doesn't run MapReduce, it just streams the file's contents out to you.
Finally, another option is to use something like Storm (which runs on Hadoop) to collect and analyze data in real time, and store the results for querying as mentioned above, or storing them in HBase for display through a custom user interface that queries HBase directly.

(.Net + SQL Server +Azure ) vs (Java + Oracle +Google App Engine) which is better for this project specifically?

I have to develop an ERP System for a 2,000+ end users organisation.
Could you please suggest me with comparable points that among (Java or .Net)
in which technology I should invest money and time? Although I have done
some average projects in both, but this project is going to be very big in near
future in terms of scalability.
I want to know your experiences and some tips from you people, so that I can develop
and deploy this project efficiently.
I rate .Net > Java for this project only due to less development time available.
We have to use some Rapid App Development technology.
I have to deploy this on Cloud (Azure or Google App engine).
It will be better if I got answers from those people who works in both (.Net and Java).
I will appreciate answers from your experiences.

I would suggest creating a very small proof-of-concept project in both technologies, which do something real - like allow people to log in, see messages, and allow them to type in new messages, and log out again.
Even if the project is laughably small, if you do it well, you will have a finished product on each platform which have shown you by experience how things works and if you like the way you had to do them. You will be able to see if you can debug in the cloud, if you can profile when load testing, if you can do fast work inhouse which then works well when deployed to the cloud.
And you will need to figure out things. Are the online resources good? How responsive is the StackOverflow community for each platform when you ask questions?
Personally, I consider the ".NET is Windows-only" to be important. Except for that I do not believe there is any technical showstopper for either platform.

I think both approaches can be used to deliver this successfully. I would expect you to have the same amount of success/pain with either choice. When it comes to making a decision you should base it on the amount of expertise that you have to hand. That is, your own and that of your existing colleagues and the resources that you can acquire (new recruits, contractors, consultants etc.).
That said a couple of technical notes:
The Java approach tends to have more freedom, i.e. more frameworks and choice of technologies for various solutions (although GAE will bring in some restrictions).
There is less choice in the .NET space, but that is not always a bad thing. E.g. you tend not end up in tireless debates about the logging frameworks.
Java is starting to age as a language and C# is a bit nicer, however there a number of newer languages that run on the Java VM (Scala, Groovy, Ruby, Clojure).

Setting up a new Java development shop

I'm setting up a Java development shop, currently just for myself as the only developer, but with hopes of needing to hire others as the business grows. Obviously I'm hoping to set it up right so that as more people come in, they can be productive right away. Please help suggest things I want to do, and tools to do them.
Here's what I think I need:
Distributed source code/revision control (Subversion?)
Bug tracking (does Trac do this?)
documentation (both internal and customer facing)
team communication
frequent automated building
maybe something to make sure automatic tests pass as part of the check-in process?

I like Hudson for Continuous Integration builds, and I like JIRA for issue tracking. Eclipse has plugins for both.
Hudson can watch software repositories and rebuild those projects that use the changed resources.
If you need more documentation than javadoc can cover (which is quite a lot) then consider a Wiki. Easy to use, and with a bit of structure you can massage it into a PDF.
Source control is a bugger. Too many to choose from. For a small development team start with either subversion or CVS (which is old but has supreme IDE support) and when you outgrow that and know your needs, then migrate to a better one. Most have migration tools from svn or cvs. It is harder to move from e.g. git to Mercurial, and you defintively want one with more than one implementation. Remember to have good backups of the source control repository - it IS your business. Frequent rsyncs, often tapes.
EDIT: You also want decent hardware. For the Continuous Integration server, the fastest build machine you can afford. For yourself the largest monitor you can afford (not in size, in resolution) for your primary monitor and as many extra monitors as you can afford to have (including adapters to your computer). I have found that Mac's use the pixels better than Windows, so that might also be a point.
My primary monitor is pivoted 90 degrees. This allows me to see many lines at once instead of a few long lines. (For some reason tradition says that editing areas should be wide and short, which may work in word but not in code where lines should not be wider than 72 characters)
Note on Eclipse: Use the source repository to have a single workspace per project! Use the Java Editor Save Function to reformat your code everytime you save - this makes it more readable up front, and goes better with the source repository as changes are marked in the correct version.
Edit: The reason for the CI server needing to be better than your development machine is because it will run all your tests every time you check stuff into your source repository. After a while, that WILL take time.
Personally I have found tests working well for library routines. They specify what works and what doesn't. It is harder to write good tests for whole applications, but you may want to look into that from the beginning, as it allows you to ensure that everything works for every check in. Write a comment if you are not familiar with the concept.
Whatever you choose for the individual parts, you will be glad if they can work together. Hudson knows how to talk to JIRA for instance. JIRA knows how to look in CVS.

To see what a big amount of people thinks to make a good software development environment read this article on the 12 points of joels test
This list does not include the things more important for you. Getting clients getting them to pay and manage legal stuff and taxes.

Most important, the right staff:
get great people who find work and handle customers (aka sales)
get software engineers who are smart and get things done (http://www.joelonsoftware.com/items/2007/06/05.html)
get someone who knows about accounting and the local legal and tax regulations, so you don't get any surprises
Tools / Processes:
use a distributed version control system like git or mercurial
jira or the like for bug/issue tracking
continuous integration with hudson or cruisecontrol
wiki system to share the teams tangible knowledge
unit tests, clover, checkstyle, findbugs, ...
From a managerial point of view i would try
daily standup meetings (checkout scrum) to keep the team updated members to commit by saying what the did, do, and will do
timeboxed meetings, everything else sucks.
plan iterations/sprints
let team do task time estimations
pair programming (gets you better code)
code reviews (builds trust)
weekly in house "techtalks" to build a strong sense for the team
twitter like communication tool to keep all insync and informed with minimal distraction
develop team towards dynamic languages (groovy, scala, ...)
yourself, listen to what the guys at http://manager-tools.com/ have to say...
good luck!

We have been using:
version control:
subversion - Its not distributed but it is accessible over a few different protocols if firewalls are an issue. I'm not sure if distributed version control is necessary for us and reading Eric Sink's take is entertaining at least
issue tracking :
Fogbugz - You get some team discussion and communication for free with it because of the built-in wiki and discussion boards.
continuous integration:
CruiseControl - we had been talking about switching to Hudson, but Cruise is working really well right now - it runs our unit tests.
dev environment:
Netbeans and Eclipse - There is really no reason to pay for a Java IDE. An import point for getting going fast is that Netbeans and Eclipse both store all of their project data as text files which version control nicely. See this question. We had giant headaches when using an IDE which used binary project files.
profiler:
JDK VisualVM - Its free and it works. I used to really like YourKit, but VisualVM does so much now.
documention:
Combination of javadoc and fogbugz wiki pages plus the cruise dashboard for internal. For external we are using RoboHelp and we dislike it.
other tools:
Findbugs - huge help in catching things that are sometimes really stupid and sometimes amazing quirks that you'd have never realized. PMD is good for some of this as well.
We find chat tools to be really helpful for communication. We used to have access to Sametime and it had a giant conferencing feature that was really great. That was taken away for an unknown reason by the overlords though.

Here is what development stack our team of five developers is using over a year now:
Eclipse IDE (worked better for us than NetBeans)
Maven as a project comprehension and build tool -- simply a must tool!
Nexus: The Maven Repository Manager -- serves as a local Maven repo for proxying, and for managing internal and 3rd party libs; simple in use and is really necessary if you're going to use Maven
Subversion for source versioning -- was chosen mainly due to very good IDE support (Subclipse for Eclipse IDE)
Trac as a bug tracking and requirement management tool -- it nicely integrates with Subversion, has very useful plugins including blog and discussion plugins; also it can be integrated with Eclipse Mylyn.
Hudson as continuous integration, which nicely integrates with Subversion, Maven and Trac -- very valuable even for a small team.
Sonar code quality management platform -- a tool which integrates a large number of code quality matrix with intuitive web interface supporting code review and drill-down facility for analysis of the problems; integrates with Maven.
In our case this development stack is running under Ubuntu (workstation components: Eclipse IDE, Maven) and CentOS (server components: Maven, Nexus, Subversion, Trac, Hudson, Sonar).
As for the documentation, LaTeX (TexLive and Kile under Ubuntu) works just great supporting high quality PDF generation. The documentation source can be managed by Subversion the same way as the application source. Allows making of simple several page document and large multi-chapter books.
Hope this helps.

Distributed source code/revision control (Subversion?)
Subversion is good enough unless you want to use this as an opportunity to learn Git or Mercurial.
Bug tracking (does Trac do this?)
Trac works fine if you don't mind CamelCaseTurningIntoWikiWords. JIRA is fancier but (as noted) not free, and I find its data-ink ratio annoyingly low. Good once you learn which parts of the UI to ignore, though.
documentation (both internal and customer facing)
Internally, I would prefer communication over documentation unless you're really willing to commit to spending time keeping internal documentation updated. Javadoc your integration points (internal/external APIs), but try to keep your code and build scripts as self-documenting as you can.
That said, requirements docs are useful. I'd suggest using one tracking system for bugs and requirements. JIRA lets you create hierarchical issues, but I wouldn't bother with that till you need it.
Trac includes a Wiki, which can be handy although too much content will get stale.
Customer-facing docs are all over the map -- depends a lot on what you're doing. If you need big fat manuals (either printed or HTML or on-line help) FrameMaker + the DITA Open Toolkit is a nice combo, but a FrameMaker license is going to set you back $1000.
At a certain point this shades into corporate communications (marcom and PR), which are kind of a different animal.
team communication
Google Talk (or the instant messaging framework of your choice), email, and the occasional Skype teleconference are probably all you need.
An internal wiki and/or internal blogs might be nice but you probably don't need them right away, even once you've added a few more developers. Trac includes a wiki.
frequent automated building
Switching from CruiseControl to Hudson has made everyone here much happier -- two teams made the same decision independently and both are pleased with it. Hudson's flexible, easy to configure, and shiny.
I have yet to be convinced by Maven. I suspect its value depends on how much you want automated downloads of the latest versions of all your open-source libraries.
Edited (Feb. 2010) to add: Several months of Maven 2 experience now and I'm still not convinced. I think (like many frameworks, e.g. Rails) it may work well if you're starting from scratch and structuring your project according to its conventions, but if you already have your own structure in place it's a lot less advantageous.
maybe something to make sure automatic tests pass as part of the check-in process?
I'd advise against it. Most of the time the tests will be passing (or should be, anyway) As long as continuous integration notifies you quickly of failures, and you also have stable automated builds (a couple of days' worth of nightly builds, tagged iteration builds, etc.), you want more and faster checkins, not fewer and slower.

Subversion is not a distributed approach -- go for Mercurial, Bazaar or git instead!
Yes, Trac does do bug tracking (among other things -- check it out!).
Documentation is indeed a must but I'm not sure what you're asking -- tools for it? Why not just javadoc?
For communication you can have many tools, such as skype, email, IM of many kinds, and so forth -- you need to express your specific issues better to get specific advice, I think. Google Wave once it matures may be just great, but it's not production-ready yet.
For continuous build check out CruiseControl -- of course it also run tests &c.
You can write "triggers" for any of the build systems I've mentioned (and even good old svn;-) to run some test suite and reject the commit if it fails.

A few more items:
IDE
Billing Software - will you be charging by the
hour? If so you might want to track
what your time is going towards.
offsite backup of some kind.
Each one of your bullet points is probably worthy of a community wiki by themselves. Though in the end you might not care so much about best of breed in each area, but care more about how well they all integrate with your IDE or with each other.
Also, if you really want to get new teammates up and running quickly, consider putting as much of your dev environment into source control as you can, so you can just checkout your "dev-env" project onto a new computer and be up and running instantly!

One of your specs (in your question) says:
maybe something to make sure automatic
tests pass as part of the check-in
process?
I would suggest this is essential. Check out this matrix of continuous integration servers to see which one fits your requirements.

If you're starting off with just yourself as the only developer but scaling up at some point, depending on what it is you're developing, it might be worth to check out a Platform as a Service (PaaS) offering, like https://www.openshift.com or https://www.heroku.com/, both which offer developer accounts, but you can scale up to paid accounts as and when needed.
OpenShift have CI cartridges for Jenkins so you can spin up DVCS (git), CI using Jenkins, and an app server like JBoss AS or WildFly all in one go in a matter of minutes... depending on your needs and how much time you want to dedicate to setting this up yourself locally, I'd be more inclined to look at using a PaaS.

Rather than use Trac, consider Redmine. It allows you to have multiple projects stored in the one issue tracking system. You can then use the same setup for each project, rather than having to have N instances of Trac.

For starters, SVN will be enough for you. Definitely get a bugtracker. JIRA is good but isn't free. Enforce a rule "No commits without a bugtrecker ticket". This way you will be able to track the development. Do a cruise control and run a build + unit tests after every checkin into the main branch. Bigger changes should be made on a separate branch and then merged into the main branch.
Invest in a good IDE (I recommend intelliJ IDEA) and a good profiler (I recommend JProfiler). They're not free, but they are definitely worth their price.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.