2019 is around the corner, and when I look at how many blog posts I wrote this year, the number is a resounding zero. On the other hand, looking in my drafts folder, I see quite a few posts in various stages of completion:
- Extract, Anonymize, Transform, Load is the new ELT – One of the major design decisions in the platform we’re currently developing is to handle and isolate all PII (and PHI) as early as possible in the ingestion pipeline and build the rest of the system on de-identified data. This post would have explained what we’re doing, how we’re doing it, and why it is important.
- The woes of pySpark – I’ve been using Spark since mid-2014, i.e., v1, but only via the JVM (mostly Scala), and it sure wasn’t a fun ride all the time (e.g., see this 2014 post on how fun it was to use Parquet files back then). This year I began using it with Python, and that felt like going back in time. First, there were the problems of using it with Pandas (Spark 2.3 made that a lot easier, but it wasn’t until the end of June that it made it to GCP’s Cloud Dataproc), and that was the easy part. The main problem with pySpark is tuning jobs so that they still complete when going from a toy sample to big data. I wanted to round up some insights I gathered while fighting Spark on this.
- How we’ve built our data ingestion and model creation pipeline on Kubernetes – Apropos the previous point, we managed to cut down our execution time and compute resources significantly (from several hours on 100 servers to under 30 minutes on about 20) by splitting our pipeline so that Spark does only minimal preparation and the bulk of the work runs as queued jobs on Kubernetes. I thought it would be interesting to explain what we did there.
- Creating services. It isn’t just carving the monolith – Whenever I read a write-up on micro-services, it never fails to irk me how it is always monoliths and micro-services, as if there’s nothing in between. I began writing this note to point out that you can also evolve existing services into new services, and that there are various other architectures.
- Docker for testing – Another pet peeve I have is with the “test pyramid”, the thinking that the right way is to have lots and lots of unit tests, some integration tests and few end-to-end tests. I think it should be a “test rhombus”, esp. in a world of micro-services where the interactions are what makes the system and the testing surface of each service is relatively small. Anyway, a lot of what’s bad in unit tests is the whole mocking and faking thing, esp. of infrastructure, which makes the software more complex and the tests ickier. Docker can save us from that: you can run your dependencies as Docker images and test against them. Both the JVM and Python (the two eco-systems I mostly use) have testing libs that integrate with Docker (run images before the tests start and clean up afterward), so tests can also run in build environments and not just on the devs’ laptops.
- Using Keycloak for authentication and authorization – When delivering a SaaS solution there are many online services that manage authorization for you (like Okta, Auth0, etc.). When it comes to on-prem, the options are more limited. Having implemented security solutions in the past, I know I don’t want to do that again. Then I found Red Hat’s Keycloak: currently at v4.8.1, open source, themeable, working out of the box with minimal configuration, and easy to integrate (with Angular and Python in my case). I was going to write about how we’re using its JWT tokens for both authentication and authorization.
- Kubernetes, Git and integration environment per feature – We are building software using Kanban and monthly release trains. To support that we’ve developed a CI/CD pipeline that integrates with our project management software (TargetProcess) and creates git branches and integration environments automatically (thanks to Nader Ganayem and Dotan Spector, who did all the work). I think both the technical and the dev management aspects are interesting.
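
To give a taste of the first item on the list, here is a minimal sketch of what "de-identify early" can look like at the record level. It is not our actual pipeline: the field names in `PII_FIELDS`, the helper names `pseudonymize` and `deidentify`, and the demo key are all made up for illustration. It uses a keyed hash (HMAC-SHA256 from the Python standard library), which is strictly speaking pseudonymization rather than full anonymization, since the same input always maps to the same token.

```python
import hashlib
import hmac

# Hypothetical field names; a real schema would define its own PII/PHI columns.
PII_FIELDS = {"name", "email", "ssn"}


def pseudonymize(value: str, key: bytes) -> str:
    """Return a keyed, deterministic token for a PII value (HMAC-SHA256)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(record: dict, key: bytes) -> dict:
    """Replace PII fields with tokens; pass all other fields through unchanged."""
    return {
        field: pseudonymize(str(value), key) if field in PII_FIELDS else value
        for field, value in record.items()
    }


if __name__ == "__main__":
    # Assumption: in practice the key would come from a secret store, not the code.
    key = b"demo-key-do-not-use"
    raw = {"name": "Jane Doe", "email": "jane@example.com", "glucose": 5.4}
    print(deidentify(raw, key))
```

Because the tokens are deterministic for a given key, downstream stages can still join and group on them without ever seeing the raw values, which is the point of doing this step before everything else.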
I don’t know if there are any readers left for this blog as it has been dormant for so long, but instead of looking at this as missed blogging opportunities, I’ll treat it as a new year’s resolution to turn at least some of this list into posts.