• Data Sherpa
  • Posts
  • The Lost Art and Science of Data Modeling

The Lost Art and Science of Data Modeling

Good data models are enormously beneficial for the enterprise in the short and long-term

In the first article of this series, “Doing Data Right by Shifting Left,” we discussed the woeful state of source data and the need to fix it today, not someday. I promised to address how to do that, and the first topic on the list was data modeling. This will be the first of many articles to come on this subject.

A Data Model? We Don’t Need No Stinking Data Model!

“A picture is worth a thousand words”. This is simply true. We learn and retain knowledge better when accompanied by visual stimuli, like pictures, graphs, animation, and symbols.

In software engineering and innovation, the data model is critical for designing data well, communicating, and comprehending complex business domains, from executives and product managers to quality assurance and software creation.

DVD Rental Store - Physical Data Model diagram as visualized in DbSchema (image by the author)

In the middle of my career, new engineering managers would ask me for the ERD (entity-relationship diagram) of the monolith, service, or application they were inheriting. This helped them quickly comprehend the included objects, relationships, business rules, attribute definitions, required vs optional fields, etc. It wasn’t everything about the system, but it was enough to get them up to speed rapidly. They all found data model diagrams and data dictionaries to be highly advantageous.

Similarly, when product leadership researched enhancements to existing systems, they were grateful for and came to expect that I would bring a diagram of the data model to the discussion. Instead of explaining the system with a stream of spoken words that were quickly forgotten, the visual representation of the domain aided learning, was easy to recall, and reduced time to comprehension. This fostered far shorter and more productive meetings. It allowed us to quickly proceed with re-arranging things within the data model, with confidence we weren’t forgetting something important.

Then all that goodness mysteriously faded away without a whimper as the hype of NoSQL, Cloud, and Microservices occupied the whole stage. During this time the engineering team quietly co-opted the ownership of clean data design, and frankly, most of them didn’t know what they were doing.

Data modeling and relational databases were no longer cool, and engineers who steered their careers by buzzwords didn’t want to be associated with “grandpa’s technology.” Recently, traditional data modeling became conflated with data warehouse schema, the medallion architecture, GraphQL, and even dbt DAGs. An entire generation was forgetting source-system data modeling.

The baby had been thrown out with the bathwater.

What Replaced Data Models? (Nothing Great)

I surveyed several staff engineers, asking what replaced UML and data model diagrams to communicate the content and scope of their applications. Their answers were disheartening. They included slogging through reams of code, version control logs, auditory osmosis, change requests, emails, Slack chat history, knowledge transfer meetings, Swagger docs (now OpenAPI), README files, and rare documentation. In the best of circumstances, there might be a Lucidchart C4 diagram or Visio diagram some architect threw together once.

Since about 2011, everywhere I’ve been engaged to introduce good data architecture, they have little to no documentation and haven’t seen data modeling done in at least a decade. If they have a licensed data modeling tool, it is often nine versions behind and the existing data models are woefully out-of-date. If anyone even knows where they are stored, they no longer match production, so they are ignored and grow more irrelevant every year.

This is a sad place to be, especially because most businesses still keep 60% to 80% of their data on relational databases. They should have a current data model to represent each database’s design and serve as the starting line for future changes.

Despite that fact, as data was migrated to the cloud, and NoSQL gained mindshare (I’d argue mind control) over IT leadership, the veteran DBAs and modelers who had built the flagship systems were “put out to pasture” (laid off, retired, or ignored).

2007 to 2021 was a bleak time for passionate, relational data modelers and architects. Although there were still remnants here and there, the art and science of data modeling had been scattered to the wind.

Data Modeling is Reborn

Luckily, with the advent of AI and its tendency to expose poorly designed and curated data, data modeling is coming back into vogue. Now, it seems that 50 to 60% of AI articles are either entirely or partially about the need for better data quality from the source systems feeding the hungry AI/ML models.

Chad Sanderson and his business partner Mark Freeman speak of this movement as Shifting Left on data, aka Design-time Data Quality. Although their data interests and expertise are broad, their focus is on automating data contracts between data producers and consumers, which is a form of modeling data.

Joe Reis, a local friend, data renaissance man, and “recovering data scientist,” is currently writing a book on Data Modeling as there is such a renewed interest in this topic.

“Doing Data Right” at the far left, where the source systems capture or create the data, is what I have been teaching engineers for decades and what has kept me employed.

Some of what I teach include:

  • Understand your data and business requirements well.

  • Model and document the data structures according to that understanding. Keep modeling until the model is clean and simple.

  • Avoid duplication, unless absolutely necessary.

  • Use data model diagrams to communicate and collaborate.

  • If there are changes, start with the model first, then push the changes to the code.

  • Avoid ORM-generated SQL. Write and tune SQL by hand. It’s not hard.

Notice that data modeling is behind the majority of these teaching points.

Another factor in the resurgence of data modeling is the growing discontent with NoSQL databases. Document and key-value databases have their place, but many who tried to move their systems to NoSQL discovered, to the chagrin of their budget and decision-making reputation, that they caused far more problems than they solved.

It seems the majority of businesses I know are migrating legacy, expensive, monolithic RDMS engines and NoSQL databases, to cloud-hosted flavors of PostgreSQL. The 2024 Stack Overflow survey agrees, showing Postgres is more popular by 2X to 4X the various once-popular NoSQL offerings. MySQL and MariaDB are two other competing open-source RDBMS engines. But I’ve never seen a job posting where the company is modernizing towards MySQL or MariaDB; it’s always Postgres.

What once was old, is new again. Relational databases and data modeling are becoming trendy.

The baby, and much of the bathwater, is back in the tub.

The resurgence of enterprise-grade data modeling will result in far better application design, better data quality and understanding, and far less friction and problems as data flows downstream for analysis, reporting, AI use cases, and impactful business decisions.

I couldn’t be happier.

How Can I Learn Good Data Modeling?

If you haven’t already, subscribe to datasherpa.blog. In future articles, I’ll be going over the best guiding principles and practices I have had to learn the hard way in 30 years of practicing and preaching this art and science. The next one will be about the best data modeling tools, so you can have something to work with as you begin your data modeling adventures.

You can learn this. You can do this.

I have mentored and trained over seventy DBAs, five architects, and nearly 600 software and DevOps engineers at eight large enterprises and five startups. IT professionals are quick learners. Within one to three months, they produce data models as cleanly and completely as I would. Democratizing good data design is fantastic and has worked very well for me. It’s pretty easy to learn how to do modeling right! (Re-modeling existing systems is the hard part of this job.)

If you can’t wait for my articles to be released at a slow pace, there are several great books and authors on data modeling. My favorites are primarily written by Steve Hoberman. He offers a free newsletter, the Data Modeling Zone annual in-person conference, and master classes. We brought him in-house once to train a roomful of DBAs for a week. He is engaging, full of energy, and makes it fun.

Then there is Joe Reis. He and Matt Housley wrote the definitive book on data engineering, Fundamentals of Data Engineering. And now Joe is tackling a new book on Data Modeling. While waiting for him to finish, you can subscribe to Joe’s substack ($5/mo), read preliminary chapters, and collaborate with the best modelers around the globe.

There are a few others I have learned from through articles, books, and in-person conferences, like Scott Taylor, David C. Hay, and John Giles (what a hoot; his personality and data modeling stories are a blast).

There are online courses as well, of course. But I cannot speak to those from personal experience, so YMMV.

Feel free to comment on this post if you agree, disagree, have questions, or feel strongly about anything. I’d love your feedback to improve my perspective and writing.

If you would like to talk, ask in-depth questions, or need help with data modeling or database performance tuning at your company, please email info@datasherpa.blog

Until next time, my friend. Enjoy the views while we climb this mountain of data together!

Reply

or to participate.