Blog


Hadoop and the OpenDataPlatform

Posted on February 17th, by Arnon Rotem-Gal-Oz in Big Data, Blog, Featured Posts. No Comments

Pivotal, IBM and Hortonworks announced today the “Open Data Platform” (ODP) – an attempt to standardize Hadoop. The move seems to be backed by IBM, Teradata and others that appear as sponsors on the initiative's site.

This move has a lot of potential and a few possible downsides.

ODP promises standardization – Cloudera’s Mike Olson downplays the importance of this: “Every vendor shipping a Hadoop distribution builds off the Hadoop trunk. The APIs, data formats and semantics of trunk are stable. The project is a decade old, now, and the global Hadoop community exercises its governance obligations responsibly. There’s simply no fundamental incompatibility among the core Hadoop components shipped by the various vendors.”

I disagree. While it is true that there is no “fundamental incompatibility”, there are a lot of non-fundamental ones. Each release by each vendor includes backports of features that are … Read More »



Random thoughts on big data

Posted on February 10th, by Arnon Rotem-Gal-Oz in Big Data, Blog, Featured Posts. No Comments

I began blogging in 2005, and back then I managed to post something new almost every day. Now, 10 years later, I hardly post anything. I was beginning to think I don’t have anything left to say, but I recently noticed I have quite a few posts in various states of “draft”. I guess that I am spending too much time thinking about how to get a polished idea out there, rather than just going ahead and writing what’s on my mind. This post is an attempt to change that by putting down some thoughts I have (on big data in this case) without worrying too much about how complete and polished they are.

Anyway, here we go:

All data is time-series – When data is added to the big data store (Hadoop or otherwise) it is already historical, i.e. it is being imported from a transactional system, … Read More »



Apache Spark, ETL and Parquet

Posted on September 14th, by Arnon Rotem-Gal-Oz in Big Data, Blog, Featured Posts. 1 Comment

One of the projects we’re currently running in my group (Amdocs’ Technology Research) is an evaluation of the current state of the different options for reporting on top of and near Hadoop (I hope I’ll be able to publish the results once we have them). Anyway, part of the preparation for the benchmark includes ingesting a lot of events (CDRs) into the system and creating different aggregations on top of them. For instance, for voice call billing events we create yearly, monthly, weekly, daily and hourly aggregations on the subscriber level, which include measures like: count of calls, average duration, sum of pricing, median balance, hourly distribution of calls, popular destinations etc.

We are using Spark to do the ingestion and I thought that there are two interesting aspects I can share, which I haven’t seen too many examples of on the … Read More »
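As a rough illustration of the subscriber-level aggregations described above, here is a minimal sketch in plain Python rather than Spark. The record layout, field names and sample values are hypothetical and not taken from the actual benchmark, and only a few of the measures (count of calls, average duration, sum of pricing, median balance) are covered.

```python
from collections import defaultdict, namedtuple
from datetime import datetime
from statistics import mean, median

# Hypothetical CDR layout, assumed for illustration only.
Cdr = namedtuple("Cdr", "subscriber start duration_sec price balance")

cdrs = [
    Cdr("sub-1", datetime(2014, 9, 1, 10, 5), 60, 0.50, 20.0),
    Cdr("sub-1", datetime(2014, 9, 1, 10, 40), 120, 1.00, 19.0),
    Cdr("sub-1", datetime(2014, 9, 1, 18, 15), 30, 0.25, 18.75),
    Cdr("sub-2", datetime(2014, 9, 2, 9, 0), 300, 2.50, 7.5),
]


def period_key(ts, level):
    """Map a timestamp to the label of the period it falls in."""
    if level == "yearly":
        return ts.strftime("%Y")
    if level == "monthly":
        return ts.strftime("%Y-%m")
    if level == "weekly":
        iso = ts.isocalendar()  # (ISO year, ISO week, weekday)
        return f"{iso[0]}-W{iso[1]:02d}"
    if level == "daily":
        return ts.strftime("%Y-%m-%d")
    if level == "hourly":
        return ts.strftime("%Y-%m-%d %H:00")
    raise ValueError(f"unknown aggregation level: {level}")


def aggregate(records, level):
    """Group CDRs by (subscriber, period) and compute a few measures."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec.subscriber, period_key(rec.start, level))].append(rec)
    return {
        key: {
            "call_count": len(recs),
            "avg_duration": mean(r.duration_sec for r in recs),
            "total_price": sum(r.price for r in recs),
            "median_balance": median(r.balance for r in recs),
        }
        for key, recs in groups.items()
    }


daily = aggregate(cdrs, "daily")
hourly = aggregate(cdrs, "hourly")
```

In a real pipeline the grouping step would be distributed across the cluster; the sketch only shows how one set of events feeds several time granularities through the same grouping key.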



Introduction to big data presentation

Posted on August 8th, by Arnon Rotem-Gal-Oz in Big Data, Blog, Featured Posts. No Comments

I presented big data to Amdocs’ product group last week. One of the sessions I did was recorded, so I might be able to add it here later. Meanwhile you can check out the slides.

Note that, to keep the slides visual, I put some of the information in the slide notes and not on the slides themselves.

Big data Overview from Arnon Rotem-Gal-Oz


Is there a future for Map/Reduce?

Posted on June 3rd, by Arnon Rotem-Gal-Oz in Big Data, Blog, Featured Posts. 3 comments

Google’s Jeffrey Dean and Sanjay Ghemawat filed the patent request and published the map/reduce paper 10 years ago (2004). According to Wikipedia, Doug Cutting and Mike Cafarella created Hadoop, with its own implementation of Map/Reduce, one year later at Yahoo – both these implementations were done for the same purpose – batch indexing of the web.

Back then, the web began its “web 2.0” transition: pages became more dynamic and people began to create more content – so an efficient way to reprocess and build the web index was needed, and map/reduce was it. Web indexing was a great fit for map/reduce since the initial processing of each source (web page) is completely independent of any other – i.e. a very convenient map phase – and you then need to combine the results to build the reverse index. That said, even the core Google algorithm – the famous PageRank is … Read More »
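To make the fit between web indexing and map/reduce concrete, here is a minimal single-process sketch in Python: each page is mapped independently, and the results are then combined into a reverse index. The function names, the sample pages and the whitespace tokenization are assumptions made for illustration; this is not how Google or Hadoop implement it.

```python
from collections import defaultdict


def map_page(url, text):
    """Map phase: emit (word, url) pairs for one page, independently of all others."""
    for word in set(text.lower().split()):
        yield word, url


def reduce_word(word, urls):
    """Reduce phase: combine every URL emitted for a word into one posting list."""
    return word, sorted(set(urls))


def build_reverse_index(pages):
    # Shuffle step: group the mapped pairs by key (the word).
    grouped = defaultdict(list)
    for url, text in pages.items():
        for word, page_url in map_page(url, text):
            grouped[word].append(page_url)
    return dict(reduce_word(word, urls) for word, urls in grouped.items())


# Hypothetical pages, used only to exercise the sketch.
pages = {
    "a.example/1": "big data on Hadoop",
    "a.example/2": "Hadoop map reduce",
}
index = build_reverse_index(pages)
```

Because `map_page` never looks at any other page, the map calls could run on different machines in any order; only the grouping by word requires moving data between them.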



Hadoop YARN overview

Posted on May 27th, by Arnon Rotem-Gal-Oz in Big Data, Blog, Featured Posts. No Comments

I gave a short overview of Hadoop YARN to our big data development team. The presentation covers the motivation for YARN, how it works and its major weaknesses.

You can watch or download it on SlideShare.


Looking for developers to join my group

Posted on April 29th, by Arnon Rotem-Gal-Oz in Blog, Featured Posts. No Comments

With Amdocs TeraScale, my previous project, moving into production, I moved to a new role within Amdocs and took over the Technology Research group, which is part of the big data and strategic initiatives business unit.

Now it is time to expand the group and I am looking for a developer-architect and/or senior developer to join my group.

 

If you are a technologist at heart and like learning new stuff every day
If you can pick up a new technology and be up and running with it in a day or two
If you want to tinker with the latest and greatest technologies (big data, in-memory grids, cloud management systems, NFV, columnar databases etc.)
If you want to help shape the technology roadmap of a large corporation

I am looking for you

The positions are located in Ra'anana, Israel. If you’re interested you can contact me … Read More »


Services, Microservices, Nanoservices – oh my!

Posted on March 25th, by Arnon Rotem-Gal-Oz in Blog, SOA Patterns. 2 comments

Apparently there’s this new distributed architecture thing called microservices out and about – so last week I went ahead and read Martin Fowler and James Lewis’s extensive article on the subject, and my reaction to it was basically:

I guess it is easier to use a new name (Microservices) rather than say that this is what SOA actually meant – re http://t.co/gvhxDfDWLG

— Arnon Rotem-Gal-Oz (@arnonrgo) March 16, 2014

Similar arguments (nothing new here) were also expressed after Martin’s tweet of his article e.g. Clemens Vasters’ comment:

@martinfowler @boicy but these are the very principles of SOA before vendors does pushed the hub in the middle, i.e. ESB

— Clemens Vasters (@clemensv) March 16, 2014

Or Steve Jones’ post “Microservices is SOA, for those who know what SOA is.”

Autonomy, smart endpoints, events etc. that the article talks about are all SOA concepts – If … Read More »



and with YARN the game changes

Posted on October 17th, by Arnon Rotem-Gal-Oz in Big Data, Blog, Featured Posts. 2 comments

I’ve been working with Hadoop for a few years now, and the platform and its ecosystem have been advancing at an amazing pace, with new features and additional capabilities appearing almost on a daily basis. Some changes are small, like better scheduling in Oozie; some are still progressing, like support for NFS; some are cool, like full support for CPython in Pig. In my opinion, though, the most important change is the introduction of YARN in Hadoop 2.0.

Hadoop was created with HDFS, a distributed file system, and the Map/Reduce framework, a distributed processing platform. With YARN, Hadoop moves from being a distributed processing framework to being a distributed operating system.
“Operating system” sounded a little exaggerated when I wrote it, so, just for fun, I picked up the copy of Tanenbaum’s “Modern Operating Systems”* that I have lying around from my days as … Read More »



A couple of SOA Q&As

Posted on September 6th, by Arnon Rotem-Gal-Oz in Blog. No Comments

Every now and then I get questions by email. I usually just answer them directly, but considering I got 2 such questions this week and that I haven’t blogged for a while (I do have a post about YARN which I hope to finish soon), I thought I’d also publish my replies here.

Question #1 from Simon:

In your very interesting article “Bridging the Impedance Mismatch Between Business Intelligence and Service-Oriented Architecture” you highlight the challenges for BI and SOA to co-exist – that was 6 or so years ago – have you seen any advances that would cause you to revise that view?

I think the gap and dissonance between SOA needs and BI needs is still there. However, in addition to event publishing mentioned in the article, I see the approach to getting to BI on SOA getting more standardized. … Read More »
