Big Data


Apache Spark, ETL and Parquet

Posted on September 14th, by Arnon Rotem-Gal-Oz in Big Data, Blog, Featured Posts. 1 Comment

One of the projects we’re currently running in my group (Amdocs’ Technology Research) is an evaluation of the current state of the different options for reporting on top of and near Hadoop (I hope I’ll be able to publish the results when we have them). Anyway, part of the preparation for the benchmark includes ingesting a lot of events (CDRs) into the system and creating different aggregations on top of them. For instance, for voice call billing events we create yearly, monthly, weekly, daily and hourly aggregations on the subscriber level, which include measures like count of calls, average duration, sum of pricing, median balance, hourly distribution of calls, popular destinations, etc.

We are using Spark to do the ingestion, and I thought there are two interesting aspects I can share, which I haven’t seen too many examples of on the … Read More »
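To make the shape of these aggregations concrete, here is a minimal sketch in plain Python (not Spark, and not the code from the post). The record layout, the field order and the `aggregate_daily` helper are all assumptions made for illustration; it computes the daily, per-subscriber version of a few of the measures listed above.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, median

# Hypothetical CDR records (layout assumed for illustration):
# (subscriber_id, call_start, duration_seconds, price, balance_after_call)
cdrs = [
    ("sub-1", datetime(2014, 9, 1, 9, 15), 120, 0.50, 20.0),
    ("sub-1", datetime(2014, 9, 1, 9, 40), 60, 0.25, 19.5),
    ("sub-1", datetime(2014, 9, 1, 18, 5), 300, 1.25, 18.0),
    ("sub-2", datetime(2014, 9, 1, 11, 0), 30, 0.10, 5.0),
]


def aggregate_daily(records):
    """Group CDRs by (subscriber, calendar day) and compute summary measures."""
    groups = defaultdict(list)
    for subscriber, start, duration, price, balance in records:
        groups[(subscriber, start.date())].append((start, duration, price, balance))

    result = {}
    for key, calls in groups.items():
        # Hourly distribution of calls: hour of day -> number of calls
        calls_by_hour = defaultdict(int)
        for start, _duration, _price, _balance in calls:
            calls_by_hour[start.hour] += 1

        result[key] = {
            "call_count": len(calls),
            "avg_duration": mean(c[1] for c in calls),
            "total_price": sum(c[2] for c in calls),
            "median_balance": median(c[3] for c in calls),
            "calls_by_hour": dict(calls_by_hour),
        }
    return result


daily = aggregate_daily(cdrs)
```

The same grouping key could be coarsened to week, month or year to get the other aggregation levels; in a distributed engine the per-group step would run in parallel across partitions rather than in a single loop.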



Introduction to big data presentation

Posted on August 8th, by Arnon Rotem-Gal-Oz in Big Data, Blog, Featured Posts. No Comments

I presented big data to Amdocs’ product group last week. One of the sessions I did was recorded, so I might be able to add it here later. Meanwhile, you can check out the slides.

Note that, to keep the slides visual, I put some of the information in the slide notes and not on the slides themselves.

Big data Overview from Arnon Rotem-Gal-Oz


Hadoop YARN overview

Posted on May 27th, by Arnon Rotem-Gal-Oz in Big Data, Blog, Featured Posts. No Comments

I gave a short overview of Hadoop YARN to our big data development team. The presentation covers the motivation for YARN, how it works and its major weaknesses.

You can watch or download it on SlideShare.


and with YARN the game changes

Posted on October 17th, by Arnon Rotem-Gal-Oz in Big Data, Blog, Featured Posts. 2 comments

I’ve been working with Hadoop for a few years now, and the platform and ecosystem have been advancing at an amazing pace, with new features and additional capabilities appearing almost on a daily basis. Some changes are small, like better scheduling in Oozie; some are still progressing, like support for NFS; some are cool, like full support for CPython in Pig; but, in my opinion, the most important change is the introduction of YARN in Hadoop 2.0.

Hadoop was created with HDFS, a distributed file system, and the Map/Reduce framework – a distributed processing platform. With YARN, Hadoop moves from being a distributed processing framework into a distributed operating system.
“Operating system” sounded a little exaggerated when I wrote it, so just for fun, I picked up a copy of Tanenbaum’s “Modern Operating Systems”*, which I have lying around from my days as … Read More »



Fallacies of massively distributed computing

Posted on April 29th, by Arnon Rotem-Gal-Oz in Big Data, Blog, Featured Posts. 5 comments

In the last few years, we have seen the advent of highly distributed systems. Systems that have clusters with lots of servers are no longer the sole realm of the Googles and Facebooks of the world, and we are beginning to see multi-node and big data systems in enterprises. For example, I don’t think a company such as Nice (the company I work for) would have released a Hadoop-based analytics platform and solutions, something we did just last week, 5-6 years ago.

So now that large(r) clusters are more prevalent, I thought it would be a good time to reflect on the fallacies of distributed computing and on whether they are still relevant or should be changed.
If you don’t know about the fallacies, you can see the list and read the article I wrote about them at the link mentioned above. In a few words … Read More »


Killing the HBase zombie table

Posted on January 15th, by Arnon Rotem-Gal-Oz in Big Data, Blog. 3 comments

One of our team leaders approached me in the hall today and asked if I could lend a hand in troubleshooting something. He and our QA lead were configuring one of our test Hadoop clusters after an upgrade, and they had a problem with one table they were trying to set up:

When they tried to create the table in the HBase shell, they got an error that the table exists.
When they tried to delete the table, they got an error that the table does not exist.
HBase ships with a health-check and fix utility called hbck (use: hbase hbck to run; see here for details) – they ran it, and HBase reported that everything was fine and dandy.

Hmm. The first thing I tried to do was look at the .META. table. This is where HBase keeps the tables and the regions they use. I … Read More »



The NoSQL landscape in diagrams

Posted on November 3rd, by Arnon Rotem-Gal-Oz in Big Data, Blog. 1 Comment

Here’s the NoSQL landscape in 3 slides (and hey, at least mine looks different :) )

451 Research published their view of the NoSQL/NewSQL world in a unified diagram.

Infochimps published a similar diagram.

And here’s mine from SOA Patterns chapter 10 (discussing “SOA & big data”)

 

 


SOA & Big Data

Posted on October 11th, by Arnon Rotem-Gal-Oz in Big Data, Blog, SOA Patterns. 2 comments

I gave a presentation on SOA and big data at the IGTCloud forum.

SOA & Big Data from Arnon Rotem-Gal-Oz

Updated to embed slideshare


5 lessons big-data projects should learn from the iOS6 map debacle

Posted on September 21st, by Arnon Rotem-Gal-Oz in Big Data, Blog. No Comments

By now you’ve probably heard something about Apple’s new iOS6 maps app. In case you’ve been living under a rock, it turns out the new and shiny application that replaces Google Maps in the new iOS release produces a lot of inaccuracies, mangled graphics, navigation errors and whatnot (just like the image you see on the left – for more examples you can see this site). Kidding (or gloating) aside, this debacle carries with it a few important lessons that anyone who is building a big data project should keep in mind.

Apple took data from various sources like Waze, TomTom, Yelp and others to build their database, thinking that it is all just geographical data using the same coordinate system, so everything should be just fine. Well, it doesn’t work like that – our first and probably most important … Read More »



Distributed computing reading list

Posted on August 21st, by Arnon Rotem-Gal-Oz in Blog. No Comments

My Twitter feed spewed a very good list of distributed computing related papers (compiled by Dan Creswell). There are links to a lot of papers there. A few of my favorites include “The fallacies of distributed computing” by Peter Deutsch – you may also want to check out the paper I wrote explaining them; “Life beyond distributed transactions: an apostate’s opinion” by Pat Helland; “The Byzantine generals problem” by Leslie Lamport, Robert Shostak and Marshall Pease; “A note on distributed computing” by Samuel C. Kendall, Jim Waldo, Ann Wollrath and Geoff Wyant; and “Harvest, yield, and scalable tolerant systems” by Armando Fox and Eric A. Brewer, which I mentioned before in “10 papers every architect should read”.

There are also a few additional papers that are not in that list and that I found illuminating:

“Architectural Styles and the Design … Read More »
