
Category: Big Data

pandas on spark apply_batch/transform_batch broken? (tl;dr; No – but it isn’t well documented)

Published by Arnon Rotem-Gal-Oz on October 16, 2022

Using PySpark’s pandas integration via apply_batch and transform_batch is very powerful, but the lack of documentation can cause hard-to-trace bugs – hopefully my experience (below)…

Intro to Apache Spark (slides)

Published by Arnon Rotem-Gal-Oz on December 16, 2020

I gave a general overview of Apache Spark to our R&D teams. You can find the slides below.

Where is Apache Spark heading?

Published by Arnon Rotem-Gal-Oz on December 4, 2020

I watched (the COVID-19-era version of “attended”) the latest Spark Summit, and in one of the keynotes Reynold Xin from Databricks presented the following two images…

Big data isn’t – well, almost

Published by Arnon Rotem-Gal-Oz on March 23, 2019

Back in ancient history (2004) Google’s Jeff Dean & Sanjay Ghemawat presented their innovative idea for dealing with huge data sets – a novel idea…

Big data in the cloud – welcome to cost oriented design

Published by Arnon Rotem-Gal-Oz on April 1, 2016

A couple of weeks ago I presented at BDX2016. The slides are available on SlideShare: Big data in the cloud – welcome to cost oriented…