Case Study: Enterprise Scenario for Modern Cloud-based Data Warehouse
The U.S. State Government school, having more than 1,000 schools, is a massive data consumption organization, from studying how disconnected students feel after two years of COVID isolation and its effect on individual grade-level performance in comparison to statewide trends, to the effect of a change in curriculum on student achievement in certain tasks as related to overall employee performance, to school establishment/abolishment of Community isolation at the time of COVID. It has undertaken exciting multi-year trend analysis to view “wholeness” trends seen in student and teacher performance, as well as comparative analysis of our numbers in respect to other statewide local education authorities and large urban cities across the country.
The problem statement that was coming from leadership was: There were multiple data marts available in organizations for students, teachers, academics, finance, human resources, etc., but from the department’s view/perspective, the data warehouse/BI (Business Intelligence) essentially consists of SharePoint folders holding old spreadsheets and reports from which they often re-key/modify data. Analytics tools like power BI and Tableau were used, but in limited ways and selectively. However, the choice by default for analytical reporting tools is Excel. An organization relies on putting our static reports that have presumptions about the Business understand or definition of data, and these presumptions may have changed with time; e.g., a student’s home situation could be different this year than it was last year.
After further analysis, it was clear that the problem was multidimensional and deeper rooted than initially assessed.
The IT systems and departments were working in silo environments, not collaborating or coordinating with each other frequently. One of the issues was related to governance. There was not a stewardship or ownership framework nor a data-governance framework that could work with the governance council and governance steering committee. Another issue was related to team efficiency and skills that needed to be upgraded to match the solution proposed. Last but not least, another issue was related to systems’ not talking to each other, which was related to the interoperability of systems. This led to an increase in the cost of rework resulting from misunderstanding of requirements, which led to delays and an indirect increase in costs.
After discussions with users and stakeholders, the following storage and processing requirements were listed and finalized:
• Ability of the data warehouse platform to enable machine learning (ML) that can support all formats and frequencies.
• Ability to store data in multiple cloud environments, such as S3 from Amazon or Blob storage from Azure, to reduce storage costs.
• Ability to support data versioning so that one can find out what, how, and when data has changed over time and which are the current and historical versions of data.
• Enables direct data access so that users can query the data directly from raw data.
• Enable users to have insights and build transformation logics and improve/clean the quality of the data.
• Ability for self-service analytics for data analysts across all departments.
• Ability to connect directly to other cloud or non-cloud platforms and connect directly to BI tools, like Tableau, Power BI, and so on.
• Platform should be able to reduce data redundancy’s happening in the presence of the multiple tools and platforms used for cleaning data warehouses for processing.
• Enable a single enterprise-level combined cloud data platform for the whole organization within the existing architectural framework.
The new solution should be enabled by a new system design that implements similar data management features to those in a data warehouse, directly on the kind of low-cost storage used for data lakes. Merging them into a single system means that analysis, data science, and machine learning teams can move faster because they are able to use data without needing to go to and fro for access to multiple systems.
Given these requirements, it is an easy and common but wrong approach to use multiple specialized systems or data warehouses for different departments for analysis. Having a multitude of systems introduces more complexity for governance and, importantly, introduces redundancy, data quality issues, and delays as analysts/users need to move or copy data between different systems.
After analyzing multiple analytics platforms, the solution that ticked all the boxes for a requirement traceability matrix along with a long list of analytical business use cases was the Databricks Unified Analytics Platform for Lakehouse.
Note Requirement traceability Matrix (RTM) is Document which helps to map initial requirements to final artifacts and output of development. RTM is required. It’s used to prove that requirements have been fulfilled. This document is created at the time of requirement gathering and used in testing and scoping.
This enterprise-level modern data warehouse is expected to allow the organization to optimize machine learning algorithm requirements at scale, streamline workflows of inter- or intra-departmental teams, and reduce complexity. Simplified infrastructure management reduces OPEX costs through the automation of cluster management, auto-scaling, auto-backup, and on-the-spot instance management. It increases collaboration
within the team through collaborative workspaces, integration with CI/CD, DevOps, and interactive notebooks. This effectively reduces silos within departments and prevents duplicated work efforts.