This article defines data quality along the following dimensions:
The main approach in this article is applying "unit tests" to data. The tool was later open-sourced by Amazon as Deequ.
This approach lets the user define a set of constraints for data quality checks. Computable metrics are then defined, providing a way to measure the current data quality with respect to the three dimensions mentioned above. Once these definitions and the data to verify are set up, the system inspects the checks and their constraints and collects the metrics required to evaluate them. The output of the system is a report showing how the dataset fares against the defined metrics.
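The pattern can be illustrated with a minimal sketch. Note this is not Deequ's actual API (Deequ is a Scala/Spark library); all names below are hypothetical, purely to show the idea of declarative constraints evaluated via computed metrics.

```python
from collections import Counter

# Hypothetical sketch of "unit tests for data" (not Deequ's real API).
# Metrics are computed over the dataset; checks are predicates on them.

def completeness(rows, column):
    """Fraction of rows where `column` is not None."""
    return sum(r.get(column) is not None for r in rows) / len(rows)

def uniqueness(rows, column):
    """Fraction of rows whose value in `column` occurs exactly once."""
    counts = Counter(r.get(column) for r in rows)
    return sum(1 for c in counts.values() if c == 1) / len(rows)

# A check is a named constraint: a predicate over a computed metric.
checks = [
    ("id is complete",   lambda rows: completeness(rows, "id") == 1.0),
    ("id is unique",     lambda rows: uniqueness(rows, "id") == 1.0),
    ("email mostly set", lambda rows: completeness(rows, "email") >= 0.9),
]

def verify(rows, checks):
    """Evaluate every check and return a pass/fail report."""
    return {name: predicate(rows) for name, predicate in checks}

data = [
    {"id": 1, "email": "a@x.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "c@x.com"},   # duplicate id -> uniqueness check fails
]
report = verify(data, checks)
```

Running `verify` here flags the duplicate `id` and the missing `email`, which is exactly the kind of report the paper's system produces from declared constraints.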
The later part of the article also describes how to extend this approach to incrementally growing datasets (e.g., ingestion pipelines in a data warehouse).
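The incremental idea, as I understand it, is to keep a small running state per metric and merge each new batch into it instead of rescanning all historical data. A hedged sketch (class and method names are my own, not from the paper):

```python
# Hypothetical sketch of an incrementally updatable metric: the running
# state (non-null count, total count) is merged with each new batch, so
# the metric stays current without recomputing over the full dataset.

class IncrementalCompleteness:
    """Completeness of a column, updatable one ingested batch at a time."""

    def __init__(self, column):
        self.column = column
        self.non_null = 0
        self.total = 0

    def update(self, batch):
        """Fold a newly ingested batch of rows into the running state."""
        self.non_null += sum(r.get(self.column) is not None for r in batch)
        self.total += len(batch)

    def value(self):
        return self.non_null / self.total if self.total else 0.0

metric = IncrementalCompleteness("id")
metric.update([{"id": 1}, {"id": None}])   # first ingest
metric.update([{"id": 3}, {"id": 4}])      # later batch, no rescan needed
```

The same state-merging trick works for any metric whose state is small and mergeable (counts, sums, sketches), which is what makes it fit warehouse ingestion pipelines.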
This article focuses on a data quality management solution for detecting errors in sensor-node measurements in large-scale CPS systems.
Cyber-physical systems (CPSs) are integrated systems engineered to combine computational control algorithms and physical components such as sensors and actuators, effectively using an embedded communication core. Large-scale CPSs are vulnerable to enormous technical and operational challenges that may compromise the quality of data of their applications and accordingly reduce the quality of their services.
This article defines the following data quality dimensions:
This research uses a methodology called SLR (systematic literature review) to identify the main data quality challenges, categorize them into the dimensions mentioned above, and propose solutions based on the SLR results.
This article provides a system design for an industrial big data pipeline for large-scale smart manufacturing facilities (e.g., IoT/CPS). Although it is not directly related to data quality, it still offers a thorough study of an information system model: a scalable, fault-tolerant big data pipeline for integrating, processing, and analyzing industrial equipment data.
This article covers all aspects of the entire lifecycle of manufacturing data. It could serve as a good starting point and reference for understanding our current situation.
The lifecycle includes:
The main difference between Snowflake and traditional data warehouse technologies is that Snowflake is designed for the cloud from the start. Its processing engine and most other components were developed from scratch rather than reusing existing big data technology such as Hadoop.
Heterogeneous Workload
In shared-nothing architectures, all nodes usually have the same hardware configuration. But workloads contain two different kinds of tasks (I/O-intensive and CPU-intensive), which forces the hardware configuration into a trade-off with low average utilization. Platforms like Amazon EC2 offer many different instance types, which a homogeneous shared-nothing architecture cannot take advantage of.
Membership changes
In the cloud, node failures are more frequent, and performance can vary even among nodes of the same type (e.g., EC2 instances), possibly due to non-virtualized resources such as network bandwidth. When membership changes, data needs to be reshuffled between nodes, which degrades system performance.
Snowflake's solution: separate storage and compute. Storage lives in Amazon S3 (a blob store); compute runs on Snowflake's own shared-nothing engine.
How Snowflake stores data in S3
❌ S3: Not possible to append data to the end of a file; files can only be fully overwritten.
💡 Snowflake: Tables are horizontally partitioned into immutable files.
✅ S3: Supports GET requests for parts of a file.
💡 Snowflake: Within each file, the values of the same column (attribute) are grouped together and heavily compressed. Each table file has a header containing the offsets of each column within the file.
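A toy sketch of this layout (my own simplification, not Snowflake's real file format; JSON stands in for real compression): column values are grouped into contiguous blobs, and a header maps each column name to its (offset, length), so a reader can fetch one column via a ranged GET instead of the whole file.

```python
import json
import struct

# Toy columnar file layout (illustrative only): header with per-column
# offsets, followed by each column's values stored contiguously.

def write_table_file(columns):
    """Serialize {column_name: values} into one byte blob."""
    body = b""
    offsets = {}
    for name, values in columns.items():
        blob = json.dumps(values).encode()     # stand-in for compression
        offsets[name] = (len(body), len(blob))
        body += blob
    header = json.dumps(offsets).encode()
    # Layout: 4-byte big-endian header length, header, column blobs.
    return struct.pack(">I", len(header)) + header + body

def read_column(file_bytes, name):
    """Read a single column using only the header and a ranged slice,
    mimicking an S3 partial GET."""
    hlen = struct.unpack(">I", file_bytes[:4])[0]
    offsets = json.loads(file_bytes[4:4 + hlen])
    off, length = offsets[name]
    start = 4 + hlen + off
    return json.loads(file_bytes[start:start + length])

f = write_table_file({"id": [1, 2, 3], "city": ["NY", "SF", "LA"]})
cities = read_column(f, "city")
```

The point of the design: a scan touching one column downloads only that column's byte range, which is why S3's partial GET support matters even though appends are impossible.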
Compute (queries): Virtual Warehouses
Isolation
Each query runs on one VW. VWs of different sizes have different numbers of worker nodes (each an EC2 instance). Worker nodes are not shared across VWs, resulting in strong performance isolation between queries.
Consistent hashing: improving caching via the query optimizer
LRU is used to manage cache replacement on each worker node. The optimizer assigns input file sets to worker nodes using consistent hashing over table file names, so queries accessing the same table file are routed to the same worker node.
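A minimal consistent-hashing sketch of that assignment (illustrative, not Snowflake's implementation; worker names and replica count are made up): table file names map onto a hash ring of workers, so the same file repeatedly lands on the same worker's cache, and adding or removing a worker remaps only a fraction of the files.

```python
import bisect
import hashlib

# Minimal consistent-hash ring (illustrative): file names are hashed
# onto a ring of virtual worker positions; each file goes to the first
# worker position clockwise from its hash.

def _h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, workers, replicas=64):
        # Each worker gets `replicas` virtual positions for balance.
        self.ring = sorted(
            (_h(f"{w}#{i}"), w) for w in workers for i in range(replicas)
        )
        self.keys = [k for k, _ in self.ring]

    def worker_for(self, file_name: str) -> str:
        idx = bisect.bisect(self.keys, _h(file_name)) % len(self.keys)
        return self.ring[idx][1]

ring = ConsistentHashRing(["worker-1", "worker-2", "worker-3"])
owner = ring.worker_for("table_part_0001")
```

Because the mapping is deterministic, repeated queries over `table_part_0001` keep hitting the same worker's LRU cache, which is the caching benefit the paper describes.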
Execution engine
TODO..
The following parts might be relevant.
P.S. Below is my personal understanding based on the paper; it could be inaccurate and may be adjusted in the future.
Publication types
Peer-reviewed publications
Non-reviewed publications
Reading the contributions, asking questions
Something I did not know before about citing:
We cannot copy verbatim from a paper and cite it; paraphrase instead.
If we really do need to copy verbatim, we should use quotation marks ("the original content").
Always cite the original source rather than a work that merely references it.