
Master thesis preparation

Topic: data quality

This article defines data quality along the following dimensions:

The main approach in this article is applying "unit tests" to data. The tool was later open-sourced by an Amazon team under the name deequ.

This approach allows users to define a set of constraints for data quality checks. Computable metrics are also defined, providing a way to measure the current data quality with respect to the three dimensions mentioned above. Once the constraints and the data to be verified are set up, the system inspects the checks and their constraints, and collects the metrics required to evaluate them. The output of the system is a report showing how well the dataset satisfies the defined metrics.
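The workflow above (constraints → computable metrics → report) can be sketched in a few lines of plain Python. This is only a toy illustration of the idea; the real deequ library is a Scala/Spark package with a different API, and the metric and check names here are made up for the example.

```python
def completeness(rows, column):
    """A computable metric: fraction of rows with a non-null value in `column`."""
    return sum(r.get(column) is not None for r in rows) / len(rows)

def run_checks(rows, checks):
    """Evaluate each (name, metric_fn, predicate) constraint and
    return a report with the metric value and pass/fail status."""
    report = {}
    for name, metric_fn, predicate in checks:
        value = metric_fn(rows)
        report[name] = {"value": value, "passed": predicate(value)}
    return report

# Example dataset and "unit tests" on it.
rows = [
    {"id": 1, "name": "pump-a"},
    {"id": 2, "name": None},
]
checks = [
    ("id is complete", lambda r: completeness(r, "id"), lambda v: v == 1.0),
    ("name is mostly complete", lambda r: completeness(r, "name"), lambda v: v >= 0.9),
]
report = run_checks(rows, checks)
```

The report plays the role of deequ's verification result: each declared constraint is listed with the measured metric and whether the constraint held.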

The later part of the article also describes how to extend this approach to incrementally growing datasets (e.g., ingestion pipelines in a data warehouse).
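The incremental idea is that each metric keeps a small mergeable state per batch, so new data only requires merging states rather than rescanning the whole dataset. The sketch below shows this for the completeness metric; the class and method names are my own, not deequ's.

```python
class CompletenessState:
    """Sufficient statistics for completeness: (non_null count, total count)."""
    def __init__(self, non_null=0, total=0):
        self.non_null = non_null
        self.total = total

    @classmethod
    def from_batch(cls, rows, column):
        """Scan one ingested batch once and keep only the counts."""
        return cls(sum(r.get(column) is not None for r in rows), len(rows))

    def merge(self, other):
        """Combining two states is O(1), regardless of batch sizes."""
        return CompletenessState(self.non_null + other.non_null,
                                 self.total + other.total)

    def metric(self):
        return self.non_null / self.total if self.total else 0.0

# Two batches arriving over time; the merged state gives the dataset-wide metric.
s1 = CompletenessState.from_batch([{"x": 1}, {"x": None}], "x")
s2 = CompletenessState.from_batch([{"x": 2}], "x")
overall = s1.merge(s2)
```

The same pattern works for any metric whose state is additive (counts, sums, sketches), which is what makes it suitable for continuously growing warehouse tables.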

This article focuses on a data quality management solution for detecting errors in sensor node measurements in large-scale CPS systems.

Cyber-physical systems (CPSs) are integrated systems engineered to combine computational control algorithms and physical components such as sensors and actuators, effectively using an embedded communication core. Large-scale CPSs are vulnerable to enormous technical and operational challenges that may compromise the quality of data of their applications and accordingly reduce the quality of their services.

This article defines the following data quality dimensions:

This research uses a methodology called SLR (systematic literature review) to identify the main data quality challenges, categorize them into the dimensions mentioned above, and propose solutions based on the SLR results.

Topic: Big data, Smart manufacturing, data lifecycle

3. An industrial big data pipeline for data‑driven analytics maintenance applications in large‑scale smart manufacturing facilities

This article provides a system design for an industrial big data pipeline for large-scale smart manufacturing facilities (e.g., IoT/CPS). Although it is not directly related to data quality, it still offers a thorough study of an information system model that provides a scalable and fault-tolerant big data pipeline for integrating, processing, and analyzing industrial equipment data.

This article covers all aspects of the entire lifecycle of manufacturing data. It could serve as a good starting point and reference for understanding our current situation.

The lifecycle includes:

Topic: data warehouse

5. The original Snowflake paper

The main difference between Snowflake and traditional data warehouse technologies is that it is designed from the ground up for the cloud. Its processing engine and most other components were developed from scratch instead of reusing existing big data technologies such as Hadoop.

Snowflake vs. shared-nothing architectures

Snowflake's solution: separate storage and compute. Storage lives in Amazon S3 (blob store); compute runs in Snowflake's own shared-nothing engine.

How Snowflake stores data in S3

❌ S3: Not possible to append data to the end of a file; a file can only be fully overwritten.

💡 Snowflake: Tables are horizontally partitioned into immutable files.

✅ S3: Supports GET requests for parts of a file (ranged reads).

💡 Snowflake: Within each file, the values of the same column (attribute) are grouped together and heavily compressed. Each table file has a header containing the offsets of each column within the file.
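The header-plus-column-offsets layout is what lets a reader fetch a single column with one ranged GET instead of downloading the whole file. The toy sketch below illustrates the mechanics; it is not Snowflake's actual (proprietary) file format, and the encoding choices (JSON header, comma-joined values, no compression) are simplifications for the example.

```python
import json
import struct

def write_columnar(columns):
    """Serialize columns one after another, preceded by a header that
    records each column's (byte offset, byte length) within the body."""
    bodies = {name: ",".join(values).encode() for name, values in columns.items()}
    offsets, pos = {}, 0
    for name, body in bodies.items():
        offsets[name] = (pos, len(body))
        pos += len(body)
    header = json.dumps(offsets).encode()
    # 4-byte big-endian header length, then header, then the column data.
    return struct.pack(">I", len(header)) + header + b"".join(bodies.values())

def read_column(blob, name):
    """Read a single column without touching the others,
    mimicking a ranged GET against one region of the file."""
    (header_len,) = struct.unpack(">I", blob[:4])
    offsets = json.loads(blob[4:4 + header_len])
    start, length = offsets[name]
    base = 4 + header_len
    return blob[base + start:base + start + length].decode().split(",")

file_blob = write_columnar({"id": ["1", "2", "3"], "city": ["aa", "bb"]})
ids = read_column(file_blob, "id")
```

In practice a reader first fetches the header (one small ranged GET), then issues further ranged GETs only for the columns the query needs, which is why this layout works well on top of an immutable blob store.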

Query processing (compute): Virtual Warehouses

The following parts might be relevant.

Topic: data stream processing

P.S. The notes below are my personal understanding based on the paper; they could be inaccurate and will be adjusted in the future.

Problems to solve
Main contribution

Thesis Writing

Literature search, evaluation, and proper citing

Publication types

Reading the contributions, asking questions

Something I did not know before about citing: