One of the projects we’re currently running in my group (Amdocs’ Technology Research) is an evaluation of the current state of the different options for reporting on top of and near Hadoop (I hope I’ll be able to publish the results when we have them). Anyway, part of the preparations for the benchmark includes ingesting a lot of events (CDRs) into the system and creating different aggregations on top of them. For instance, for voice call billing events we create yearly, monthly, weekly, daily and hourly aggregations on the subscriber level, which include measures like count of calls, average duration, sum of pricing, median balance, hourly distribution of calls, popular destinations, etc.
We are using Spark to do the ingestion, and I thought that there are two interesting aspects I can share, which I haven’t seen too many examples on the … Read More »
I presented big data to Amdocs’ product group last week. One of the sessions I did was recorded, so I might be able to add it here later. Meanwhile, you can check out the slides.
Note that, to keep the slides visual, I put some of the information in the slide notes and not on the slides themselves.
Big data Overview from Arnon Rotem-Gal-Oz
I’ve been working with Hadoop for a few years now, and the platform and its ecosystem have been advancing at an amazing pace, with new features and additional capabilities appearing almost on a daily basis. Some changes are small, like better scheduling in Oozie; some are still progressing, like support for NFS; some are cool, like full support for CPython in Pig; but, in my opinion, the most important change is the introduction of YARN in Hadoop 2.0.
Hadoop was created with HDFS, a distributed file system, and the Map/Reduce framework, a distributed processing platform. With YARN, Hadoop moves from being a distributed processing framework to being a distributed operating system.
“Operating system” sounded a little exaggerated when I wrote it, so just for fun, I picked up a copy of Tanenbaum’s “Modern Operating Systems”*, which I have lying around from my days as … Read More »
In the last few years, we have seen the advent of highly distributed systems. Systems that have clusters with lots of servers are no longer the sole realm of the Googles and Facebooks of the world, and we are beginning to see multi-node and big data systems in enterprises. For example, I don’t think a company such as Nice (the company I work for) would have released a Hadoop-based analytics platform and solutions, something we did just last week, 5-6 years ago.
So now that large(r) clusters are more prevalent, I thought it would be a good time to reflect on the fallacies of distributed computing and on whether they are still relevant or should be changed.
If you don’t know about the fallacies you can see the list and read the article I wrote about them at the link mentioned above. In a few words … Read More »
One of our team leaders approached me in the hall today and asked if I could lend a hand in troubleshooting something. He and our QA lead were configuring one of our test Hadoop clusters after an upgrade, and they had a problem with one table they were trying to set up:
When they tried to create the table in HBase shell they got an error that the table exists
When they tried to delete the table they got an error that the table does not exist
HBase ships with a health-check and fix utility called hbck (run it with: hbase hbck; see here for details) – they had run it, and it reported that everything was fine and dandy
Hmm, the first thing I tried to do was to look at the .META. table. This is where HBase keeps the tables and the regions they use. I … Read More »
By now you’ve probably heard something about Apple’s new iOS 6 maps app. In case you’ve been living under a rock, it turns out that the new and shiny application that replaces Google Maps in the new iOS release produces a lot of inaccuracies, mangled graphics, navigation errors and whatnot (just like the image you see on the left – for more examples you can see this site). Kidding (or gloating) aside, this debacle carries with it a few important lessons that anyone who is building a big data project should keep in mind.
Apple took data from various sources like Waze, TomTom, Yelp and others to build their database, thinking that it is all just geographical data using the same coordinate system, so everything should be just fine. Well, it doesn’t work like that – our first and probably most important … Read More »
My Twitter feed spewed a very good list of distributed computing related papers (compiled by Dan Creswell). There are links to a lot of papers there. A few of my favorites include “The fallacies of distributed computing” by Peter Deutsch (you may also want to check out the paper I wrote explaining them); “Life beyond distributed transactions: an apostate’s opinion” by Pat Helland; “The Byzantine generals problem” by Leslie Lamport, Robert Shostak and Marshall Pease; “A note on distributed computing” by Samuel C. Kendall, Jim Waldo, Ann Wollrath and Geoff Wyant; and “Harvest, yield, and scalable tolerant systems” by Armando Fox and Eric A. Brewer, which I mentioned before in “10 papers every architect should read”.
There are also a few additional papers that are not in that list and that I found illuminating:
“Architectural Styles and the Design … Read More »