Overview of "Data Science: The Hard Parts"
As a reader, I have often felt disappointed with a book, only to later realize that it wasn't the book itself, but rather my expectations about it that were the issue. Put differently, I was not the ideal reader for that particular book. In this sense, a negative review on my part reflects more on a mismatch between reader and book than on the actual quality of the book.
On the other hand, as a data-driven author, I consistently examine any feedback provided by readers in the form of reviews. However, as mentioned earlier, reviews are noisy signals, since they depend on whether the reader belonged to the right target audience. Disentangling this is an impossible task for an individual author.1
All this is to say that I’m writing this post for you, the potential reader, and for myself as an author. I’ll be successful if I can help readers understand what this book is about and whether it offers enough value to justify the purchase.
How I came up with the idea for the book
In my previous role, I led a relatively large team of data engineers, BI analysts, data analysts and data scientists. As I’ve done in previous similar roles, I set up a weekly internal seminar so that knowledge could be shared, but also to ensure that best practices were widely available.2
Back in June 2022, I realized that I myself had already given several seminars on different topics, and that many times I couldn’t provide further references when asked. I also felt that they were generally well received but that, due to time constraints, I often couldn’t delve as deeply as I wanted. I started toying with the idea of writing a book, but I was unsure if I could devote the required time. When I wrote Analytical Skills for AI and Data Science (AS), I also had a full-time job, so writing took a significant toll on other aspects of my personal life. Finally, I had covered both technical and non-technical topics in the seminars, so it wasn’t clear what the unifying theme was, if there was one.
Later that year, during a personal trip in Eastern Europe, I was finally able to organize the topics and write down a proposal. The original 6 or 7 topics were expanded to 21 (these were later reorganized, so I ended up with 16 independent topics presented across different chapters). Since I had already had an incredible time working with O’Reilly Media, it was a no-brainer for me to partner again with them. I sent them the proposal, we had some back-and-forths clearly delineating the content and scope, and the project was finally kicked off in November 2022. The plan was to finish the book one year later, so I’m happy that our predictions were accurate enough.
Who is this book for?
I’ve learned the hard way that the best books are written with a specific group in mind, and the more clearly delineated that group is, the better the fit.3 So I’ll now describe the ideal reader for the book:
Data scientists and data analysts
With enough knowledge of machine learning (ML) techniques: I assume that the reader knows everything from linear and logistic regression to random forests and gradient boosting regression and classification. Moreover, several of the topics require an intermediate understanding of statistical concepts.
The accompanying code is all written in Python, so it’s better if you already use it in your daily professional work.
In terms of seniority level, I'm comfortable saying that the book is intended for junior and intermediate practitioners. But I’ve also seen more senior people benefit from several of the topics.4
Does this mean no one else would benefit from reading the book? Not really, but I think this profile maximizes the probability of a true positive (I predict it’s intended for you, and you bought it). If you don’t know ML, you won’t benefit as much from the second part of the book, but it would still be great side material for your training if you’re starting or planning to start that path.
What is the book about?
Simply put, this book presents skills and techniques that will make you a more productive data scientist. You will also notice that the overarching theme is an almost insatiable obsession with helping you, or others in your organization, make better decisions.
To simplify the organization, I divided the book into two parts: Part I deals with topics in data analytics, and Part II is all about ML.
Data Analytics
The figure below shows the 8 topics covered in the first part of the book. I’ll provide a brief summary, along with the skills that you should develop in each of these chapters.
What is value? I’ve found that many practitioners still struggle with this question, so it’s no wonder that many organizations do so too. Being able to quantify the value of a team is not only important for the survival of data science organizations, but it also acts as an important individual motivator. In this chapter I show how to measure whether you or your team is incremental for the organization.
Metrics design: I argue that data scientists ought to be great at metrics design. Here I show you how to decompose metrics into submetrics that have some desirable properties for data-driven decision-making. I provide several examples that should be generalizable to your use case.
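To give a taste of what I mean by decomposing a metric, here is a minimal sketch with made-up numbers (not the book’s example): revenue can be broken into submetrics that are each separately actionable.

```python
# Hypothetical decomposition: revenue = users x conversion x average ticket.
users = 10_000        # visitors in the period
buyers = 400          # visitors who ended up purchasing
revenue = 30_000.0    # total revenue in the period

conversion = buyers / users      # 4% of visitors buy
avg_ticket = revenue / buyers    # $75 per purchase

# The decomposition reproduces the original metric exactly:
assert abs(users * conversion * avg_ticket - revenue) < 1e-9
print(f"conversion={conversion:.2%}, avg_ticket=${avg_ticket:.2f}")
```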
Growth decompositions: a rather common question that most data scientists get is “Why did we grow (or not grow)?” This is an incredibly hard question to answer. In this chapter I present three decompositions (additive, multiplicative, and mix-rate) that can help you shed some light on it. They are also easy to automate, so this should give a productivity boost to your monthly workload.
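To make the additive case concrete, here is a minimal sketch with invented segment-level numbers (the book covers all three decompositions in detail):

```python
import pandas as pd

# Hypothetical revenue by segment for two consecutive months.
df = pd.DataFrame({
    "segment": ["new users", "returning users"],
    "jan": [100.0, 400.0],
    "feb": [150.0, 380.0],
})

# Additive decomposition: total growth is the sum of segment-level changes,
# so growth (or decline) can be attributed segment by segment.
df["contribution"] = df["feb"] - df["jan"]
total_growth = df["feb"].sum() - df["jan"].sum()
assert abs(df["contribution"].sum() - total_growth) < 1e-9
print(df[["segment", "contribution"]])
print(f"total growth: {total_growth}")
```

In this toy example, growth was driven entirely by new users while returning users actually shrank; this is exactly the kind of answer the “why did we grow?” question calls for.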
2x2 designs: learning how to simplify a complex world can take you very far in data science. 2x2 designs are well known to practitioners with a statistical background, but, in a quite orthogonal way, they are also commonly used by business consultants. In this chapter I describe with examples how to use them, and show that starting simple is always the way to go.
Business cases: data scientists come from highly diverse backgrounds, so it’s not uncommon to see practitioners who have never done a business case before. Business cases are not only great for thinking about the fundamentals of your business, but they can also take you very far when deciding whether to undertake or how to prioritize a data science project. Needless to say, these two skills are critical for creating value, so this chapter goes hand in hand with the first one.
Lifts: while ML is fun, cool and trendy, it also takes time to develop and deploy into production. Many times you need techniques that are “quick and dirty” and provide enough information to act upon. Lifts are one such technique. In this chapter I show examples of the many use cases that can be tackled with such a simple technique.
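As a quick illustration of just how simple a lift is (toy data with illustrative column names, not the book’s code):

```python
import pandas as pd

# Made-up customer data: did they churn, and are they on a premium plan?
df = pd.DataFrame({
    "churned": [1, 0, 0, 1, 1, 0, 0, 1, 1, 0],
    "premium": [0, 1, 1, 0, 0, 1, 0, 1, 0, 1],
})

# Lift: how much more (or less) frequent is the outcome in a segment vs. overall?
overall_churn = df["churned"].mean()                          # 50%
premium_churn = df.loc[df["premium"] == 1, "churned"].mean()  # 20%
lift = premium_churn / overall_churn
print(f"overall={overall_churn:.0%}, premium={premium_churn:.0%}, lift={lift:.2f}")
# A lift of 0.4 says premium customers churn at 40% of the base rate.
```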
Narratives: data scientists come in many flavors. Some lie closer to data engineering, some are closer to the business and analytics, and some need very strong statistical or machine learning backgrounds. However, regardless of your background, you need to become a great storyteller. There are many great books and resources on storytelling with data, but this chapter really focuses on the aspects of storytelling that help improve your organization’s decision-making capabilities. This focus on making better decisions gives the chapter a somewhat unique flavor, and I’m convinced that it nicely complements the existing literature.
Data visualization: visualizations are key for delivering great narratives with data. This chapter is a continuation of the one that precedes it, now focusing only on datavis techniques.
Machine Learning
The figure below shows the 8 topics covered in Part II: Machine Learning.
Simulation and bootstrapping: learning simulation can help you on many fronts as an ML practitioner. I most often use it to ensure I have a profound understanding of the algorithm I intend to use, along with its robustness and behavior under different assumptions about the data generating process (DGP). I’ve found that thinking about data in terms of DGPs can take you very far when creating a model. But simulation can also be used for data augmentation purposes and, importantly, connects to the closely related topic of (statistical) bootstrapping. This chapter is full of examples that you can use and adapt.
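To give a flavor of the kind of example the chapter contains, here is my own minimal sketch combining a simulated DGP with a bootstrap confidence interval (not the book’s code):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate a DGP we fully control: skewed, log-normal "spend" data.
spend = rng.lognormal(mean=3.0, sigma=1.0, size=500)

# Bootstrap: resample with replacement, recompute the statistic many times,
# and use the empirical distribution to build a confidence interval.
boot_means = np.array([
    rng.choice(spend, size=spend.size, replace=True).mean()
    for _ in range(2_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean={spend.mean():.2f}, 95% bootstrap CI=({lo:.2f}, {hi:.2f})")
```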
Linear regression: these days only (mostly?) people in the causal inference realm use linear regression, so why should you still care about it? In this chapter I claim that learning the basics can help you build some very useful intuition that extends to the algorithms you commonly use. Here I discuss the Frisch-Waugh-Lovell theorem and the problems of confounders, omitting relevant variables, and including redundant ones. I also discuss the very useful (and general) concepts of signal-to-noise ratio and the role that variance plays in ML.
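The Frisch-Waugh-Lovell theorem is easy to verify numerically. Here is a minimal sketch (simulated data, plain NumPy least squares; not the book’s code) showing that partialling a control out of both sides gives the same coefficient as the full regression:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)            # correlated regressors
y = 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

# Route 1: full regression of y on an intercept, x1, and x2.
X = np.column_stack([np.ones(n), x1, x2])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]

# Route 2 (FWL): partial x2 out of both y and x1, then regress residual on residual.
Z = np.column_stack([np.ones(n), x2])
res_y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
res_x1 = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
beta_fwl = (res_x1 @ res_y) / (res_x1 @ res_x1)

# Both routes recover the same coefficient on x1.
assert np.isclose(beta_full[1], beta_fwl)
print(beta_full[1], beta_fwl)
```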
Data leakage: data leakage is one of the most pervasive problems in ML. Here I discuss its causes and how to fix it. I also present a time windowing methodology that can help you protect your organization against it.
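The details of the windowing methodology are in the chapter, but the core idea is that training data must strictly precede evaluation data. A minimal sketch of a time-based split (illustrative column names, not the book’s code):

```python
import pandas as pd

# Hypothetical event-level data with timestamps.
df = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01", periods=100, freq="D"),
    "feature": range(100),
    "target": [i % 2 for i in range(100)],
})

# A random train/test split can leak future information into training.
# Splitting on time keeps the evaluation honest: train strictly precedes test.
cutoff = pd.Timestamp("2023-03-15")
train = df[df["timestamp"] < cutoff]
test = df[df["timestamp"] >= cutoff]
print(len(train), len(test))
```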
Productionizing models: there are some great recent book-length treatments of this problem (which I reference at the end of the chapter). In this chapter I discuss data and model drift, and propose a simple-to-implement minimal architecture that can take you very far if your organization lacks the necessary data engineering skills.
Storytelling in ML: this chapter is very different from its counterpart in Part I. Here I argue that storytelling is an integral part of the data science flow, and not only of your salesperson persona. I also discuss and present techniques on the problem of interpretability, as it relates to storytelling exclusively. This is one of my favorite chapters of the book, and is also an area of constant research and advancement.
From predictions to decisions: I’m obsessed with simple methods and techniques that can take us from the predictive to the prescriptive arena. Here I discuss some examples of such methods: smart thresholding and confusion matrix optimization.
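As a minimal sketch of confusion matrix optimization (simulated scores and made-up unit economics, not the book’s example), you can sweep thresholds and pick the one that maximizes expected profit instead of defaulting to 0.5:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated labels and scores from a hypothetical churn model.
y_true = rng.binomial(1, 0.3, size=5_000)
scores = np.clip(0.3 * y_true + rng.normal(0.35, 0.2, size=5_000), 0, 1)

# Made-up economics: every retention offer costs 5; a saved churner is worth 40.
value_tp, cost_offer = 40.0, 5.0

def profit(threshold: float) -> float:
    targeted = scores >= threshold
    tp = np.sum(targeted & (y_true == 1))  # churners we actually reach
    return value_tp * tp - cost_offer * targeted.sum()

# Sweep thresholds and keep the profit-maximizing one.
grid = np.linspace(0.01, 0.99, 99)
best = max(grid, key=profit)
print(f"best threshold={best:.2f}, profit={profit(best):,.0f}")
```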
Incrementality: this chapter is a practical introduction to causal inference. I briefly discuss the directed acyclic graph (DAG) approach to causality, confounder and collider bias, and then move on to the potential outcomes approach (PO). Using PO I show you how to use the highly intuitive matching estimator, and the more flexible propensity score matching method. I also discuss some more recent, ML-based alternatives like the double machine learning method.
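To illustrate why this machinery matters, here is a self-contained sketch with a simulated DGP (scikit-learn stands in here; this is my own toy example, not the book’s):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(7)
n = 2_000

# Simulated DGP: a confounder drives both treatment take-up and the outcome.
confounder = rng.normal(size=n)
treated = rng.binomial(1, 1 / (1 + np.exp(-confounder)))
outcome = 1.0 * treated + 2.0 * confounder + rng.normal(size=n)  # true effect = 1

# The naive difference in means is biased upward by the confounder.
naive = outcome[treated == 1].mean() - outcome[treated == 0].mean()

# Propensity score matching: model P(treated | X), match each treated unit to
# the control with the closest score, and compare outcomes within pairs.
X = confounder.reshape(-1, 1)
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]
nn = NearestNeighbors(n_neighbors=1).fit(ps[treated == 0].reshape(-1, 1))
_, idx = nn.kneighbors(ps[treated == 1].reshape(-1, 1))
matched_controls = outcome[treated == 0][idx.ravel()]
att = (outcome[treated == 1] - matched_controls).mean()
print(f"naive={naive:.2f}, matched ATT={att:.2f} (true effect = 1.0)")
```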
A/B tests: I spend almost all of this chapter discussing minimal detectable effects. Although the math is very simple, I find that there are many common misunderstandings. I also provide a general governance framework for experiments.
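The math really is simple. Here is a minimal sketch of the standard textbook approximation for a two-sample proportion test (my notation, not the book’s code):

```python
from scipy.stats import norm

def mde(p: float, n: int, alpha: float = 0.05, power: float = 0.8) -> float:
    """Absolute minimum detectable effect with n units per arm."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    se = (2 * p * (1 - p) / n) ** 0.5
    return (z_alpha + z_beta) * se

# With a 10% baseline conversion and 5,000 users per arm, the smallest
# effect the test can reliably detect is about 1.7 percentage points.
print(f"MDE = {mde(p=0.10, n=5_000):.4f}")
```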
How will AI change our job description?
ChatGPT arrived in November 2022, when the plan for the book had already been defined and when I was already working on the first chapters. In March-April 2023, with the launch of GPT-4, I had an Aha! moment and started to feel that something had fundamentally changed. At some point, Corbin Collins, my editor, correctly challenged one claim I made later in the book and used ChatGPT as an example. I then felt I had to think hard about whether the book’s contents would remain useful in a post-ChatGPT world. I decided to include one last chapter on this topic.
It’s the only chapter where I don’t present any techniques or methods, and it’s the only one with a clearly speculative tone. I’ll leave the full argument for the book, but in a nutshell, I make the prediction that the data science job description will radically change in the upcoming years, and recommend that data scientists start preparing themselves for such a shift. Just in case you’re wondering, I think that some of the chapters of the book will still be valuable in that world, but other techniques will most likely be automated by AIs. Nonetheless, and perhaps not surprisingly, I’m convinced that the general tone and specific strategies and techniques presented in this book, and in AS, will remain valuable in the years to come.
1. Just as an aside, it’s quite interesting to think about this problem analytically, and how it can be dealt with if you’re Amazon or Airbnb (or any other company). I’ll try to write about it in another post.
2. Internal seminars have many additional benefits: they help practitioners work on and improve their communication skills, ensure that they fully understand the techniques presented, and create a sense of unity within a team that works remotely. To me, as a somewhat distant manager (individual contributors had managers who reported directly to me), it had the additional benefit of providing a more direct assessment of the skills we had and those we lacked.
3. When I wrote AS, I naively thought that it could well suit both practitioners (data scientists) and non-technical business people. In the Preface of the book, I explicitly asked the latter group to skip any technical material and try to stick to the underlying intuitions. In retrospect, while it’s true the book can be read that way, it’s also true that I should’ve stuck to writing a book for practitioners, and only for them.
4. The problem with seniority is that it’s multidimensional, and many times we just give a high weight to technical skills. But even with technical topics, “the devil is in the details,” and I’m convinced that many senior data scientists will benefit.