Pivotal, IBM and Hortonworks announced today the “Open Data Platform” (ODP) – an attempt to standardize Hadoop. This move seems to be backed up by IBM, Teradata and others that appear as sponsors on the initiative site.
This move has a lot of potential and a few possible downsides.
ODP promises standardization – Cloudera’s Mike Olson downplays the importance of this “Every vendor shipping a Hadoop distribution builds off the Hadoop trunk. The APIs, data formats and semantics of trunk are stable. The project is a decade old, now, and the global Hadoop community exercises its governance obligations responsibly. There’s simply no fundamental incompatibility among the core Hadoop components shipped by the various vendors.”
I disagree. While it is true that there are no “fundamental incompatibility” there is a lot of non-fundamental ones. Each release by each vendor includes backport of features that are … Read More »
I began blogging in 2005, back then I managed to post something new almost everyday. Now, 10 years after, I hardly post anything. I was beginning to think I don’t have anything left to say but I recently noticed I have quite a few posts in various states of “draft”. I guess that I am spending too much thinking about how to get a polished idea out there, rather than just go on and write what’s on my mind. This post is an attempt to change that by putting some thought I have (on big data in this case) without worrying too much on how complete and polished they are.
Anyway, here we go:
All data is time-series – When data is added to the big data store (Hadoop or otherwise) it is already historical i.e. it is being imported from a transactional system, … Read More »
One of the projects we’re currently running in my group (Amdocs’ Technology Research) is an evaluation the current state of different option for reporting on top of and near Hadoop (I hope I’ll be able to publish the results when we’d have them). Anyway, part of the preparations for the benchmark includes ingesting a lot of events (CDRs) into the system and creating different aggregations on top of them for instance, for voice call billing events we create yearly, monthly, weekly and daily and hourly aggregations on the subscriber level which include measures like : count of calls, average duration, sum of pricing, median balance, hourly distribution of calls, popular destinations etc.
We are using spark to do the ingestion and I thought that there are two interesting aspects I can share, which I haven’t seen too many examples on the … Read More »
I presented big data to Amdocs’ product group last week. One of the sessions I did was recorded so I might be able to add here later. Meanwhile you can check out the slides.
Note that trying to keep the slide visual I put some of the information is in the slide notes and not on the slides themselves.
Big data Overview from Arnon Rotem-Gal-Oz
Google’s Jeffrey Dean and Sanjay Ghemawat filed the patent request and published the map/reduce paper 10 year ago (2004). According to WikiPedia Doug Cutting and Mike Cafarella created Hadoop, with its own implementation of Map/Reduce, one year later at Yahoo – both these implementations were done for the same purpose – batch indexing of the web.
Back than, the web began its “web 2.0″ transition, pages became more dynamic , people began to create more content – so an efficient way to reprocess and build the web index was needed and map/reduce was it. Web Indexing was a great fit for map/reduce since the initial processing of each source (web page) is completely independent from any other – i.e. a very convenient map phase and you need to combine the results to build the reverse index. That said, even the core google algorithm – the famous pagerank is … Read More »
I’ve been working with Hadoop for a few years now and the platform and ecosystems has been advancing at an amazing pace with new features and additional capabilities appearing almost on a daily basis. Some changes are small like better scheduling in Oozie; some are still progressing like support for NFS some are cool like full support for CPython in Pig but, in my opinion, the most important change is the introduction of YARN in Hadoop 2.0.
Hadoop was created with HDFS, a distributed file system, and Map/Reduce framework – a distributed processing platform. With YARN hadoop moves from being a distributed processing framework into a distributed operating system.
“operating system”, that sounded a little exaggerated when I wrote it, so just for fun, I picked up a copy of Tanenbaum’s “Modern Operating Systems”*, I have lying around from my days as … Read More »
The NoSQL moniker that was coined circa 2009 marked a move from the “traditional” relational model. There were quite a few non-relational databases around prior to 2009, but in the last few years we’ve seen an explosion of new offerings (you can see,for example, the “NoSQL landscape” in a previous post I made). Generally speaking, and everything here is a wild generalization, since not all solutions are created equal and there are many types of solutions – NoSQL solutions mostly means some relaxation of ACID constraints, and, as the name implies, the removal of the “Structured Query Language” (SQL) both as a data definition language, and more importantly, as a data manipulation language, in particular SQL’s query capabilities.
ACID and SQL are a lot to lose and NoSQL solutions offer a few benefits to augment them mainly:
Scalability – either as relative scalability, … Read More »
In the last few years, we see the advent of highly distributed systems. Systems that have clusters with lots of servers are no longer the sole realm of the googles’ and facebooks’ of the world and we begin to see multi-node and big data systems in enterprises. e.g. I don’t think a company such as Nice (the company I work for) would release an hadoop based analytics platform and solutions, something we did just last week, 5-6 years ago.
So now that large(r) clusters are more prevalent, I thought it would be a good time to reflect on the fallacies of distributed computing and how/if they are relevant; should they be changed.
If you don’t know about the fallacies you can see the list and read the article I wrote about them at the link mentioned above. In a few words … Read More »
the past week or so we got some new data that we had to process quickly . There are quite a few technologies out there to quickly churn map/reduce jobs on Hadoop (Cascading, Hive, Crunch, Jaql to name a few of many) , my personal favorite is Apache Pig. I find that the imperative nature of pig makes it relatively easy to understand what’s going on and where the data is going and that it produces efficient enough map/reduces. On the down side pig lacks control structures so working with pig also mean you need to extend it with user defined functions (UDFs) or Hadoop streaming. Usually I use Java or Scala for writing UDFs but it is always nice to try something new so we decided to checkout some other technologies – namely perl and python. This post highlights some of … Read More »