Abstract: The rapid spread of false information in the digital age poses a serious threat to society, influencing how people think and make decisions. As a result, identifying fake news has become essential to ensuring the reliability of information found online. Traditional fact-checking methods often rely on slow, labor-intensive manual processes that are increasingly ineffective given the volume and speed of misinformation. This has led to growing interest in machine learning-based solutions for automating fake news detection. In this study, we propose a fake news classification model that uses Logistic Regression for classification and TF-IDF (Term Frequency-Inverse Document Frequency) vectorization for feature extraction, helping to distinguish between real and fake news articles more efficiently.
However, many existing fake news detection systems face significant challenges. Traditional models often struggle to adapt to evolving misinformation patterns, leading to outdated or inaccurate results. Additionally, class imbalances in datasets, where one type of news (real or fake) heavily outweighs the other, can create biased predictions. Feature extraction techniques commonly used in older models also fail to capture the deeper, semantic meaning of text, resulting in subpar classification performance. Moreover, many models lack the ability to generalize across diverse datasets, which limits their effectiveness in real-world applications. These challenges highlight the need for a more reliable and adaptable system for fake news detection.
To address these issues, we propose an enhanced machine learning-based detection system. Our approach incorporates Logistic Regression alongside TF-IDF vectorization for effective feature extraction. We also introduce stratified train-test splitting to preserve the class distribution in both the training and test sets, and we use RandomOverSampler to counter class imbalance by randomly resampling (duplicating) examples from underrepresented classes. To thoroughly evaluate performance, we measure accuracy, precision, and recall and visualize the results with a confusion matrix, providing a clearer picture of how well the model performs.
In addition to these core techniques, our system introduces several novel improvements. We implement automatic dataset validation to identify and handle missing or imbalanced labels, ensuring data is ready for training without manual intervention. If one class is significantly underrepresented or missing altogether, our system performs class augmentation, generating synthetic data to restore balance. We also introduce an interactive user prediction feature, allowing users to input custom news articles for real-time classification. This interactive component enhances the model’s practicality, making it a valuable tool for everyday use. These improvements collectively enhance model reliability, resulting in a more robust, accurate, and adaptable fake news detection system capable of keeping up with the ever-changing landscape of misinformation.
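The core pipeline summarized above (stratified split, TF-IDF features, oversampling of the minority class, Logistic Regression, and the reported metrics) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes scikit-learn and NumPy are available, uses a tiny made-up corpus, and all variable and function names (`random_oversample`, `classify`, the label coding 1 = fake) are hypothetical. To keep the example dependent on scikit-learn only, the oversampling step is written as a small helper that duplicates randomly chosen minority-class rows; imbalanced-learn's `RandomOverSampler(...).fit_resample(X, y)` could be used in its place.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Made-up toy corpus (illustrative only): 12 real and 4 fake headlines.
real = [
    "city council approves budget for road repairs",
    "university publishes annual enrollment figures",
    "central bank holds interest rates steady",
    "local hospital opens new pediatric wing",
    "transit agency announces weekend schedule changes",
    "researchers report results of vaccine trial",
    "court schedules hearing in zoning dispute",
    "weather service issues coastal flood advisory",
    "school board votes on new curriculum",
    "company reports quarterly earnings growth",
    "parliament debates data protection bill",
    "museum extends opening hours for summer",
]
fake = [
    "miracle pill cures every disease overnight doctors stunned",
    "secret moon base confirmed by anonymous insider",
    "celebrity clone replaces president shocking proof",
    "drinking bleach tea reverses aging scientists stunned",
]
texts = np.array(real + fake)
labels = np.array([0] * len(real) + [1] * len(fake))  # assumed coding: 1 = fake

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority rows until every class matches the largest."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    keep = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        extra = rng.choice(cls_idx, size=counts.max() - count, replace=True)
        keep.extend(cls_idx)
        keep.extend(extra)
    keep = np.array(keep)
    return X[keep], y[keep]

# Stratified split keeps the real/fake ratio the same in train and test.
txt_train, txt_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42)

# Fit TF-IDF on the training texts only, then reuse it for the test texts.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(txt_train)
X_test = vectorizer.transform(txt_test)

# Balance the training set only; the test set stays untouched.
X_bal, y_bal = random_oversample(X_train, y_train)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_bal, y_bal)
y_pred = clf.predict(X_test)

cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, zero_division=0))
print("recall   :", recall_score(y_test, y_pred, zero_division=0))
print("confusion matrix (rows = true, cols = predicted):\n", cm)

def classify(article: str) -> str:
    """Classify one user-supplied article with the fitted vectorizer and model."""
    return "fake" if clf.predict(vectorizer.transform([article]))[0] == 1 else "real"

print(classify("insider reveals shocking miracle cure"))
```

Two choices in this sketch are worth noting: the vectorizer is fitted on the training texts only, and oversampling is applied after the split, so duplicated examples cannot leak into the test set. With a corpus this small the printed metrics carry no meaning; the example only shows how the steps fit together.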