DFD1 Prep Post: Cloudera

Cloudera sells a commercial bundle of Apache Hadoop and a few other related open source projects, packaged together to create what Cloudera call an Enterprise Data Hub.

Hadoop itself is a big collection of technologies that can work together, but in the same way that the Internet is a big collection of technologies that can work together. It’s far from simple, and the exact shape it takes depends a lot on what you want to do with it.

There is the Hadoop Distributed File System (HDFS), for example, which is a distributed storage engine plus filesystem, similar in many ways to what Hedvig provides, though not as simple to use or operate, and without the same number of connectors.
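To make "distributed storage engine plus filesystem" a bit more concrete, here's a toy sketch in Python of the basic idea: a file gets chopped into fixed-size blocks, and each block is copied onto several machines. This is not how HDFS is actually implemented. The function names and node names are made up for illustration, and the round-robin placement is my simplification (real HDFS placement is rack-aware). The 128 MB block size and three replicas are, as far as I know, the usual defaults in recent Hadoop versions.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes of the blocks a file of file_size bytes is cut into.

    Hypothetical helper: every block is block_size bytes except
    possibly the last one, which holds whatever is left over.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes.

    Toy round-robin placement (an assumption for illustration only):
    block b starts at node b and takes the next nodes in order.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(300 * MB)
    layout = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
    for block, holders in layout.items():
        print(f"block {block} ({blocks[block] // MB} MB) -> {holders}")
```

Lose any one node in this sketch and every block still has at least two copies elsewhere, which is the whole point of spreading the replicas around.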

There are multiple database options in and around the ecosystem, such as HBase (a non-relational database that runs on top of HDFS, designed for large, sparse datasets) and Cassandra (a non-relational, scale-out, eventually consistent database designed for large datasets, which manages its own storage rather than sitting on HDFS). There are various other bits, like Hive (for adding a SQL-like interface back on top of data stored in Hadoop), Spark (in-memory analytics and stream processing), Mahout (machine learning library), Pig (data-flow language and execution framework), ZooKeeper (a coordination service that helps keep the distributed pieces in sync), and YARN (the resource manager and job scheduler for the whole thing).

Whew.

This is a gross over-simplification of what Hadoop is. It is a large and complex beast that requires a lot of specialist knowledge to figure out what bits should be used where. Configuring it well takes a bunch of time. Honestly, if you think you might need it, you don’t, because you don’t understand your problem well enough yet.

What Is Cloudera, Exactly?

I’m not entirely sure yet, but it seems that it’s a commercial version of Hadoop, meaning you can pay for support and professional services to help you get it all running. They also look like they’ve tried to make it easier to get started with Hadoop by building an installer/manager to hide away a lot of the complexity, particularly around managing upgrades of all the components. That smells a lot like what Platform9 were pitching at Virtualisation Field Day 4.

With something as large and complex as Hadoop, there is real value there, if it makes it easy for an organisation to get the outcomes they’re after without having to develop in-house expertise in the technology.

I spent some time attempting to get to know Cloudera, but I hit several unfortunate hurdles.

Easy Isn’t So Easy

There is an online “test it out” version called Cloudera Live. Ideal for my purposes in preparing for DFD1, I thought. Ah, but it’s hosted at GoGrid, and they require a credit card before they’ll let you do a free trial. Boo. No way am I risking getting charged for a ‘free trial’ if there are any issues with cancelling my ‘free’ subscription. Honestly, this is just poor form. An online trial should, at most, require a free signup so you get added to a leads database. Demand generation is a thing, and most of us are fine with a follow-up email to see how the trial went, but demanding a credit card up front is just going to increase your abandonment rate.

As it was, someone from GoGrid contacted me because I tweeted about the issues I had, but there was no solution (like, oh, you’re press/analyst etc. having a good faith look at things, sure no problem, have a free trial!) so I didn’t get to try out the software.

But you can download a quickstart VM, which is a pre-built copy of the software as a virtual machine image. Yay! Alas, the download didn’t work for me because of some sort of web failure. I could enter my details, but clicking on the Download bit did nothing. Now, I run a bunch of privacy filtering things in my browser (like Adblock Plus and Ghostery), which might have been the problem, so I turned them off (as I sometimes have to do). I still couldn’t get it to work. Again, too hard.

Plan C was to download the software itself from the Cloudera software repositories. This was successful, finally, thanks to the magic of Ubuntu and apt-get. I spun up a virtual machine in my lab, ready to take on the role of a single-node installation of Hadoop. The documentation for how to do this is a little industrial, but it did work.

The Cloudera installation manager software is quite slick. It runs in a web interface, and guides you through the steps of what you need to set up. It should, in theory at least, manage the installation of the software components you choose (there are a lot of choices!), and will help you keep them up to date in future. This kind of management overlay is important for organisational use of software, because most places don’t want to have herds of nerds mucking out the stables all the time.

Alas, when it came to installing things on my single node, it failed on several attempts. At this stage, I’d sunk enough hours of my free time into trying to get to play with the software, so I’ve not done anything more with it. This is a shame, because it looks like there could be some goodness there. It’s just proven too much of a challenge for me to get things running, for whatever reason.

Who Is Cloudera For?

Because I haven’t had a decent play with it, I can’t really see what Cloudera is particularly good at.

“Hadoop for the Enterprise” is too vague a pitch for me. Hadoop, as we’ve seen, is a lot of different things which can be brought to bear on a range of data storage and analysis problems. Similarly, enterprises are quite different from one another, whether by industry, problem domain, or division within them. Their data storage and analysis needs are many and varied.

The customer list on Cloudera’s website has an impressive range of brands on it, but many of these organisations are very large and that means they tend to have a bit of everything. It’s entirely possible that they sold some stuff into one area of the business, while another area is using a competitor’s product.

What have these companies actually bought from Cloudera? What are they actually doing with it? What is it about Cloudera that makes them special? I just don’t know at this point, so I hope I can find out next week.

My experience so far has been pretty shrug-worthy, but I’m conscious that it hasn’t really been a fair hearing. I could have learned more if I’d been more willing to hand over a credit card, or fire up a wide-open browser. Probably.

My challenge is to reset my expectations so I can give Cloudera a fair hearing next week, and I’ll be doing my level best to do so. If I get time, I might even give the VM download thing another go, and then I can maybe take it with me on my laptop for a play in airports or on planes on my way to the US.
