Abstract: Unstructured data, ranging from text documents and emails to scanned images and log files, is the dominant type of data in many industries, particularly in enterprise systems and digital communications. Although this data holds considerable potential for analytics and decision-making, its usefulness is limited by quality issues such as duplication, inconsistency, incompleteness, and ambiguity. The presence of personal data compounds the problem by raising privacy and compliance concerns. Existing data quality frameworks were designed for structured data and do not adequately handle the challenges of unstructured data. This project addresses that limitation by developing a comprehensive data quality assessment framework that evaluates data quality automatically while protecting privacy. The framework applies anomaly detection to log data, cleaning and normalization to text data, and OCR to image data, and it uses transformer-based techniques to automatically identify and mask personally identifiable information (PII). Data quality is assessed along several dimensions: completeness, consistency, duplication, semantic correctness, and privacy. Beyond reporting, the system produces a cleaned, privacy-preserving dataset that is ready for safe use in analytics and machine learning pipelines. By combining AI-driven quality assessment with automated privacy safeguards, this project bridges a critical gap between data reliability and regulatory compliance, offering organizations a scalable way to manage unstructured data with confidence.
Keywords: Unstructured Data, Data Quality Framework, Automation, PII Detection, Text Analytics, Image Quality, Log Analysis, AI-driven Data Cleaning
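As an illustration of how two of the quality dimensions named in the abstract (completeness and duplication) might be scored alongside PII masking for text records, the following is a minimal sketch and not the authors' implementation. It assumes simple regular expressions as a stand-in for the transformer-based PII detector described above; the names `assess`, `mask_pii`, `QualityReport`, and both patterns are hypothetical.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Assumption: regex patterns stand in for the transformer-based PII
# detector described in the abstract. They are illustrative, not exhaustive.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical records compare equal."""
    return " ".join(text.lower().split())


def mask_pii(text: str) -> tuple[str, int]:
    """Replace matched PII with placeholder tags; return masked text and hit count."""
    masked, n_email = EMAIL_RE.subn("[EMAIL]", text)
    masked, n_phone = PHONE_RE.subn("[PHONE]", masked)
    return masked, n_email + n_phone


@dataclass
class QualityReport:
    completeness: float   # share of records that are non-empty
    duplication: float    # share of non-empty records that repeat an earlier one
    pii_hits: int         # number of PII spans masked in the cleaned output
    cleaned: list[str]    # deduplicated, masked records


def assess(records: list[str | None]) -> QualityReport:
    """Score a batch of text records and return a cleaned, masked copy."""
    non_empty = [r for r in records if r and r.strip()]
    seen: set[str] = set()
    cleaned: list[str] = []
    duplicates = 0
    pii_hits = 0
    for record in non_empty:
        key = normalize(record)
        if key in seen:
            duplicates += 1
            continue
        seen.add(key)
        masked, hits = mask_pii(record.strip())
        pii_hits += hits
        cleaned.append(masked)
    completeness = len(non_empty) / len(records) if records else 0.0
    duplication = duplicates / len(non_empty) if non_empty else 0.0
    return QualityReport(completeness, duplication, pii_hits, cleaned)


if __name__ == "__main__":
    sample = [
        "Contact alice@example.com now",
        "contact  ALICE@example.com   now",  # duplicate after normalization
        "",
        None,
        "Call 555-123-4567 today",
    ]
    print(assess(sample))
```

A real pipeline would presumably replace `mask_pii` with a named-entity model and add the consistency and semantic-correctness checks, which this sketch omits.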
DOI: 10.17148/IJIREEICE.2026.14417
[1] Vanitha A, Jumana J, "AN AI FRAMEWORK FOR UNSTRUCTURED DATA QUALITY ASSESSMENT WITH INTEGRATED PII DETECTION," International Journal of Innovative Research in Electrical, Electronics, Instrumentation and Control Engineering (IJIREEICE), DOI 10.17148/IJIREEICE.2026.14417