Abstract: In today's big data ecosystems, data arrives continuously and at high velocity from many sources, which makes ensuring adequate data quality both crucial and challenging. Poor data quality, which can appear as missing values, inconsistencies, anomalies, duplicates, and delayed updates, can lead to inaccurate analytics and can degrade the performance, fairness, and credibility of downstream AI models. Conventional methods of evaluating data quality (static rules, manual profiling, and preset thresholds) do not work well with big data pipelines that are constantly changing.

This paper presents an approach to AI-assisted data quality assessment that can be applied to both large-scale batch and streaming pipelines. The proposed method uses machine learning and deep learning to automatically profile data, detect quality problems, and produce adaptive quality scores across dimensions such as completeness, consistency, correctness, timeliness, and validity. Unsupervised models, including autoencoders and isolation forests, are used for anomaly detection, while supervised learning techniques estimate quality labels and scores based on expert feedback and historical data. The architecture also supports continuous learning, allowing the system to adapt when pipeline configurations and data distributions change.
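To make the idea of per-dimension quality scores concrete, the following is a minimal sketch in Python using only the standard library. It is not the paper's implementation: the abstract does not give formulas, so the function names, the record layout, the one-hour freshness window, and the 0 to 100 valid range are all assumptions chosen for illustration. The median/MAD outlier check is a deliberately simple stand-in for the isolation forest and autoencoder models named above, not an equivalent of them.

```python
from datetime import datetime, timedelta, timezone
from statistics import median


def completeness(records, fields):
    """Fraction of required field slots that are present and non-null."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    filled = sum(1 for r in records for f in fields if r.get(f) is not None)
    return filled / total


def validity(records, field, predicate):
    """Fraction of non-null values of `field` that satisfy `predicate`."""
    values = [r[field] for r in records if r.get(field) is not None]
    if not values:
        return 1.0
    return sum(1 for v in values if predicate(v)) / len(values)


def timeliness(records, ts_field, now, max_age):
    """Fraction of timestamped records no older than `max_age`."""
    stamps = [r[ts_field] for r in records if r.get(ts_field) is not None]
    if not stamps:
        return 0.0
    return sum(1 for t in stamps if now - t <= max_age) / len(stamps)


def mad_outliers(values, threshold=3.5):
    """Indices whose modified z-score (median/MAD based) exceeds `threshold`.

    Simple stand-in for a learned anomaly detector (assumption).
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical micro-batch of records (illustrative data only).
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
records = [
    {"id": 1, "amount": 10.0, "ts": now - timedelta(minutes=5)},
    {"id": 2, "amount": 12.0, "ts": now - timedelta(minutes=10)},
    {"id": 3, "amount": None, "ts": now - timedelta(hours=3)},
    {"id": 4, "amount": 11.0, "ts": now - timedelta(minutes=1)},
    {"id": 5, "amount": 950.0, "ts": None},
]

scores = {
    "completeness": completeness(records, ["id", "amount", "ts"]),
    "validity": validity(records, "amount", lambda v: 0 <= v <= 100),
    "timeliness": timeliness(records, "ts", now, timedelta(hours=1)),
}
amounts = [r["amount"] for r in records if r["amount"] is not None]
anomalies = mad_outliers(amounts)  # indices into `amounts`

print(scores)
print(anomalies)
```

In a streaming setting, functions like these could be evaluated per window, with the fixed thresholds replaced by values learned from historical data and expert feedback, which is the adaptive behaviour the abstract describes.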

According to an experimental evaluation using real-world big data workloads, the AI-assisted approach outperforms conventional rule-based approaches at identifying complex and previously unseen data quality problems. The findings also show that downstream analytics and machine learning models perform better when trained on data validated by the proposed approach. Overall, the study suggests that AI-driven data quality assessment can be a scalable, flexible, and intelligent way to help ensure that future big data pipelines carry accurate data.

Keywords: AI-driven data quality, big data pipelines, data quality assessment, machine learning, anomaly detection, data profiling, streaming, data governance, intelligent data engineering.


DOI: 10.17148/IJIREEICE.2025.131223

Cite This:

[1] Mohammed Imran Ahmed, Syed Saifuddin Ahmed Muzaffar, "AI-Assisted Data Quality Assessment for Big Data Pipelines: Framework, Techniques, and Empirical Evaluation," International Journal of Innovative Research in Electrical, Electronics, Instrumentation and Control Engineering (IJIREEICE), DOI 10.17148/IJIREEICE.2025.131223
