Data Variety
Variety is defined as the ability to ingest from and integrate with a variety of formats of data, both structured and unstructured.
Operational systems like Enterprise Resource Planning (ERP), and Customer Relationship Management (CRM) systems generate numeric and text data in a .csv or Excel-like format, in the form of rows and columns; these are structured data and are categorized as conventional sources. However, images, video, audio, and a variety of other formats of data are unstructured data and are categorized as unconventional data. Unstructured data can be found in a variety of places, including radio frequency identification (RFID), data from industrial or home devices, image and video feeds from social networking sites, payment transaction data, global positioning system (GPS) data, call center voice feeds, email, SMS, and so on.
Data variety happens in many ways: Having a variety of files, with variance in file formats both structured and unstructured; variety in different formats of data files; or the file format is the same but column numbers or sequences vary. We must have the ability to handle the unpredictability of file structures.
The basic premise of the traditional data warehouse was to narrow variety and structured content. Enabled by new AI and analytics solutions, the modern data warehouse (MDW) has significantly expanded our horizons by enabling a wide variety of data formats and sources; for example, images and videos from social media, emails, blogs, SMS data, GPS data, and so forth.
There are 500+ varieties of format with varying sizes; for example, BMP, CALS, DDZ, GIF, JPEG, HTML, SKB, VOL (video), and more. Each format is compatible with other formats, or not, but can be analyzed in different ways; for example, by tagging audio or video files, or forensic voice comparison with reference data. Actual data can be computed using metadata. Network distances can be calculated using graphic data. Emotions can be analyzed in text, emoticons, tweets, social media messages, text, etc. These are a few examples, but not all can be accessed and processed in same way.
Complexity or variety is changing in the rapidly shifting landscape of DW technologies. The data is processed, transformed, compiled, and ingested from a variety of sources using Extract Transform Load (ETL)/Extract Load Transform (ELT).
Figure 2-1 demonstrates the variety of data that flow through the ingestion pipeline, which include data from multimedia, xml, email systems, traditional RDBMS (Relation Data Base Management Systems), data streaming, legacy data, cloud platform, internet, and mobile applications.
Figure 2-1. Variety of data sources