Doing Data Right by Shifting "Left"

Focus first on the source systems to correct data quality issues cheaply and permanently

No, I’m not talking about becoming liberal or left-leaning to get good at data ;-) There is a growing movement within the data industry to shift focus away from the backend money pit of data (data warehousing, streaming, BI, analytics, lakes, lakehouses, AI/ML) and instead spend time, energy, and money on the producers of data: the source systems.

This shift from the tailpipe of data to the engine of data generation is part of the broader “shift left” movement, which also includes finding and preventing issues earlier in the lifecycle of software development, ITSM, observability, DevOps, testing, and so on.

Look to the Source

In other words, instead of dumping all the company’s data budget into patching up the dying patient (everything the heroic data engineers do to put “lipstick on a pig” and produce semi-usable, semi-reliable data for analysis and decisions), what if we prevented the patient from getting ill in the first place?

An ounce of prevention is worth a pound of cure. What a novel concept!

If the source data is well-designed and clean, EVERYTHING downstream gets easier, cheaper, and more reliable as the friction of poor source data is eliminated.

I’ve always referred to this simply as “Doing Data Right”, and have built my career on this concept.

This blog and newsletter will be dedicated to doing data right the first time (or doing it right the second time, after everything is falling apart data-wise). I will share what I have learned over three decades that has made my time in data relatively pain-free, saved my employers millions, and helped projects finish on time and often under budget.

Within most companies, once you open the hood and critically examine the source data, you will be appalled at what you find. Most have postponed this kind of honest introspection for far, far too long. The 2024 dbt Labs State of Analytics Engineering report found that 57% of data professionals see data quality as the primary obstacle hindering them from organizing and preparing data for analysis and use in AI/ML; that is up from 41%, so things are getting worse, not better.

I am deeply grateful for the recent AI craze because business leaders are finally seeing the light, speaking of data quality and governance with urgency and frequency. Their headfirst FOMO dive into AI/ML investment has thrown a harsh spotlight on the woeful condition of source system data.

Gollum (from Lord of the Rings), representing terrible, poorly-designed source system data, hiding among mossy boulders, is exposed by a ray of sunlight. He is not happy.

Image generated by the author using Imagen 3

As Dr. Peter Aiken and other data luminaries have long preached: garbage in + (any awesome technology) still equals garbage out. Reports, dashboards, prescriptive analytics, agentic AI, LLMs, RAG, and all the other wonderful, modern things you’ve been reading about can’t produce the expected outcomes if they’re fed bad data.

The dust-up at DOGE over supposed “vampires” receiving Social Security is a recent example: empty death-date values led inexperienced data engineers to report to Elon Musk that people over 150 years old were receiving payments. A simple matter of a nullable field that should have been required caused this nonsense.
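To make the failure concrete, here is a minimal Python sketch of the kind of plausibility check that would have caught it. The record layout and field names are my own assumptions for illustration, not the actual Social Security schema.

```python
from datetime import date

# Hypothetical field names, chosen only to illustrate the point.
MAX_PLAUSIBLE_AGE = 120

def flag_implausible_beneficiaries(records):
    """Flag records where a missing death date implies an absurd age."""
    today = date.today()
    flagged = []
    for rec in records:
        birth = rec.get("birth_date")
        death = rec.get("death_date")
        if birth is None:
            flagged.append((rec, "missing birth_date"))
            continue
        age = (today - birth).days // 365  # rough age is fine for a sanity check
        if death is None and age > MAX_PLAUSIBLE_AGE:
            # A NULL death date on a 150-year-old record almost certainly
            # means an unrecorded death, not a living beneficiary.
            flagged.append((rec, f"no death_date at computed age {age}"))
    return flagged

# The 1870 record is flagged instead of being reported as a 150-year-old payee.
suspects = flag_implausible_beneficiaries([
    {"birth_date": date(1870, 1, 1), "death_date": None},
    {"birth_date": date(1960, 5, 9), "death_date": None},
])
```

Better still, of course, is enforcing the rule in the source schema itself, so the bad record can never be written in the first place.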

A report from The Data Warehousing Institute (TDWI) estimated that poor data costs businesses around $600 billion annually.

Gartner found that bad data costs each US business, on average, $12.5 million per year (the amount is greater at larger companies).

A Harvard Business Review study revealed that only 3% of businesses rated their data quality as acceptable, an epidemic of unhealthy tolerance for poor data.

It takes very little bad data to wreak havoc. Data that is just slightly inconsistent, incomplete, inaccurate, misleading, or misunderstood amplifies errors as it rolls downhill, gathering mass and inertia, increasing the consequences and side effects the longer it goes unremedied.

Once bad data is injected into the system, it spreads and infects downstream processes, systems, and decision-making. Although AI can help find and fix bad data, in several ways the advent of AI in the middle of the pipeline is making matters worse. For example, LLMs are running out of training data, so AI itself is now generating test and training data from sources that weren’t good to begin with, and that synthetic output is being fed recursively back into LLMs, compounding hallucinations, bias, and mistakes, and ultimately leading to model collapse.

Tolerating poor data, instead of fixing the data model and code that are creating it, is madness. Yet, pumping money into the backend data stack, migration to the cloud and microservices, streaming, and AI projects is so de rigueur and entrenched that most leaders’ gut response will be to reject the following recommendation.

I realize this may seem naive or impossible to some, but for others it will resonate and solve many of their woes: Set your source data in order before throwing another penny at your other data and AI initiatives.

Chad Sanderson, the CEO of Gable.ai and one of the leading figures in data, calls this Design-time Data Quality. He speaks often of shifting left in data: getting the software engineers and their code involved in data design, data integrity, and data management.
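Gable.ai’s actual tooling goes far beyond this, but as a rough sketch of the design-time idea, here is a tiny Python data contract (all names hypothetical) where the producing application simply cannot construct an invalid record, so bad data is rejected where it is born instead of being discovered downstream.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class CustomerRecord:
    """A minimal, hypothetical contract enforced at the point of data creation."""
    customer_id: str
    email: str
    signup_date: date
    churn_date: Optional[date] = None  # genuinely optional, unlike the fields above

    def __post_init__(self):
        # Validation runs on every construction, so the producer fails fast
        # instead of a pipeline failing at 2 a.m. three systems downstream.
        if not self.customer_id.strip():
            raise ValueError("customer_id is required")
        if "@" not in self.email:
            raise ValueError(f"invalid email: {self.email!r}")
        if self.churn_date is not None and self.churn_date < self.signup_date:
            raise ValueError("churn_date cannot precede signup_date")

# CustomerRecord("", "not-an-email", date(2024, 1, 1))  -> raises ValueError
```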

It will take some time, money, and sharp people to dig in and understand the data well enough to get it done. It will be hard, but it will be worth it! Getting your source system data in top shape will give you a huge competitive edge because almost no company is doing it. They don’t dare, and they’re too used to funneling money at everything except the source of the problem.

How Did It Get This Bad?

The short version: ignorance, fads, and higher education

For three decades my focus, passion, and career have centered on designing enterprise-grade databases for custom web and mobile applications that were cash cows for my clients and employers.

Sometimes that was for startups where nothing existed beforehand and the field was green. Being a fan of Beck, Highsmith, Cockburn, Fowler, and other founders of the Agile movement, I adapted and introduced agile data design and data management practices as early as the late nineties, and trained my colleagues in the same. By getting in front of things like this, we avoided 99% of the typical data headaches afflicting most businesses. It is truly a beautiful thing to behold. I’ll introduce some of these hard-won guiding principles and practices in future articles.

However, with other engagements, it was a data “re-modeling” effort of legacy systems. At these employers, the backstory was always the same:

  • The company had grown organically and rapidly with little to no rigor in data design or data management.

  • The software architects ran the show and figured they could do it all themselves. DBAs, database developers, and modelers steadily lost the voices they once had. This worked out well in the beginning, but apps were thrown together with no consideration of data purpose, data sharing, master data, common reference data, data integrity, data access, queryability, data privacy, data security, data retention, and so forth. This myopia was forgivable when the company was small and nimble, but became an anchor around the company’s collective neck as it grew.

  • When the engineers designed tables or collections, it was to solve for a single screen, or set of SPA requirements, never with the whole enterprise and its long-term needs in mind. This was exacerbated when the fads of microservices and NoSQL document databases barreled into town, encouraging the isolation of data silos, “eventual infrequent consistency”, and denormalization/duplication everywhere.

  • Then management, sales, finance, and marketing started needing insight out of the data held by the apps, so a data warehouse was cobbled together by data engineers, with pipelines held together by duct tape and baling wire. Simple changes to source systems broke stuff, often causing downtime or unexpected job delays. Over time, the scheduled jobs to move and fiddle with all this data started to exceed the nightly window, even after employing expensive consultants to optimize it, move it to the cloud, and rewrite it all (sometimes two or three times).

  • All of the above was accomplished with little to zero documentation of the original data structures. Nuances, caveats, full definitions, original requirements, and other critical tribal knowledge about tables and columns were stuck within the heads of, or forgotten by, the original engineers who built the system and had since left the company or moved on to other projects.

  • About 10 to 20 years into this journey, the whole data contraption starts collapsing under the weight and inertia of the bad data boulders caught in its buckets and pipes. Management is upset because it takes months from a request for data to actually getting it. Data engineers are blamed for everything data-centric, even though they don’t own or understand the data they were tasked with moving to the warehouse. The flagship app is failing no matter how the host database is scaled vertically. Even the best engineers struggle to write aggregations across linked collections, trying to use their document database like an ad-hoc query engine. And ORM-generated SQL is dragging the whole company down with timeouts, deadlocks, and unacceptable response times, giving the customer a miserable experience (see the sketch after this list).
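To illustrate that last point: below is a minimal, self-contained sketch (SQLite, with invented tables) of the N+1 query shape that naive ORM usage tends to generate, next to the single set-based query that should replace it. At a handful of rows the difference is invisible; at millions of customers it is the difference between milliseconds and timeouts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (1, 1, 9.50), (2, 1, 12.00), (3, 2, 3.25);
""")

# The N+1 shape many ORMs quietly generate: one query per customer.
for cust_id, name in conn.execute("SELECT id, name FROM customer").fetchall():
    total = conn.execute(
        "SELECT COALESCE(SUM(total), 0) FROM orders WHERE customer_id = ?",
        (cust_id,),
    ).fetchone()[0]
    print(name, total)

# The set-based alternative: one round trip, and the database does the work.
for name, total in conn.execute("""
    SELECT c.name, COALESCE(SUM(o.total), 0)
    FROM customer AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
"""):
    print(name, total)
```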

As it becomes clear that costs are skyrocketing, and revenue is significantly impacted by ongoing data neglect, the company finally realizes it needs a data architect who can design and optimize data models, databases, queries, data contracts, and data flows enterprise-wide.

Before that point, they had either never heard of data architecture or scoffed at the notion. If engineering managers were aware of the practice of data architecture and modeling, they had willfully scorned or ignored it since around 2007, when NoSQL and ORMs burst onto the scene, promising to free developers from the “tyranny” of SQL, data modelers, and DBAs.

And this pattern was happening everywhere.

To make matters worse, academia stopped teaching basic data modeling principles because it wasn’t cool anymore. Over the last 10 to 18 years, most college graduates have been trained only on MongoDB. The lucky ones might get one or two classes using a small MySQL database. They leave school thinking they know data and how to work well with databases.

In reality, high-volume, high-velocity databases and enterprise data management demand far more than the minimal exposure recent grads are getting. Working with massive datasets is an entirely different world, requiring skills and a mindset that only come with experience.

And yet companies around the world trust that their “full-stack” engineers know data well because they can store, retrieve, and update a JSON doc. Next thing you know, the company is mired in a mess of duplication, redundant collections, spreadsheets, and third-party databases, where precious little is consistent, valid-value lists don’t agree, customer data is spread across twenty-five places, reports mislead, and so forth.

That’s when companies in this situation start throwing millions every year at the latest “silver bullet”, hoping their data woes will finally end: data warehousing, data lakes, cloud and streaming solutions, Hadoop and Big Data, data scientists, data engineering, data lakehousing, data fabric, and now data mesh. Sound familiar?

This common pattern of growth without data discipline, ignorance of the basics, technology fads, reduced education, and mistaken priorities — from startups to large enterprises across the world — is how we got to this point, where tens of billions are being sunk into everything to the right of the source systems, everything that isn’t actually the source of the problem.

Fixing the problem at the source, re-modeling the data structures of your internal applications, and protecting the integrity of the data where it is created, can seem daunting, like lifting a home and replacing the foundation. I understand why business leaders are afraid of beginning something that seems this difficult.

However, fixing the source data model and the quality of its data is the only way to bring sanity back and reduce data costs many-fold. To do anything else is willfully ignoring the sickly patient.

How Do We Start?

The good news is that remodeling and remediating poor source data can often be done by existing in-house engineers and data people. These folks share an eye for detail and a passion for getting the data right; they’ve just never been given the time, resources, and enthusiastic executive support to dive in with both feet and get it done.

And now is probably the best time in the foreseeable future to get the funding to finally fix the source data, riding on the coattails of the billions being pumped into AI initiatives.

Going forward, the newsletters of datasherpa.blog will be dedicated to sharing the tools, tips, and tasks required for “Doing Data Right”: modeling or re-modeling source systems and correcting data quality and governance issues. If you haven’t already, please subscribe to be notified whenever a new newsletter is published.

We will be covering things like:

  • Model-first. Model cleanly. Model thoroughly: Data modeling principles, practices, and recommended tools

  • Really Know Your Data: Get to know the data, rules, and requirements intimately

  • Duplication is a Plague: And what to do about it

  • Naming Things Well

  • Keeping It Simple

  • Protect Your Data

  • Choose the Right Database

  • Unleash Your Database

  • Design to End [Data] Goals from Day One

I look forward to sharing what I’ve learned, and what I’ve used to help companies modernize, avoid big problems, save millions, restore sanity, comply with regulations, improve performance by huge leaps, and make working with data fun again.

Feel free to comment on this post if you agree, disagree, have questions, or feel strongly about anything. I’d love your feedback to improve my perspective and writing.

If you would like to talk, ask in-depth questions, or need help at your company, please email [email protected]

Until next time, my friend. Enjoy the views while we climb this mountain of data together!
