Social Impact Tech: How to ensure data quality in AI models
According to Refinitiv’s 2019 report “Smarter Humans. Smarter Machines”, the biggest barrier to the adoption and deployment of Machine Learning is poor data quality.
Data quality is crucial in artificial intelligence because it directly impacts AI models’ performance, accuracy, and reliability. Poor data quality can undermine AI systems, leading to inaccurate predictions, flawed decision-making, and diminished trust in AI.
For this reason, ensuring data quality is essential, not just as a technical requirement, but also as a strategic one. To achieve this, ongoing efforts to clean, validate, and update data regularly are required.
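To make the idea of ongoing cleaning and validation concrete, here is a minimal rule-based sketch. The field names, required fields, and checks are illustrative assumptions, not rules used by any organization mentioned in this article.

```python
# Hypothetical rule-based data-quality checks. Field names and
# thresholds are illustrative assumptions only.

def validate_record(record):
    """Return a list of data-quality issues found in one record."""
    issues = []
    # Completeness: required fields must be present and non-empty.
    for field in ("id", "location", "timestamp"):
        if not record.get(field):
            issues.append(f"missing required field: {field}")
    # Validity: affected_people, if present, must be a non-negative int.
    affected = record.get("affected_people")
    if affected is not None and (not isinstance(affected, int) or affected < 0):
        issues.append("affected_people must be a non-negative integer")
    return issues

def clean_dataset(records):
    """Drop duplicates (by id) and invalid records; keep a rejection log."""
    seen, valid, rejected = set(), [], []
    for record in records:
        issues = validate_record(record)
        if record.get("id") in seen:
            issues.append("duplicate id")
        if issues:
            rejected.append((record, issues))
        else:
            seen.add(record["id"])
            valid.append(record)
    return valid, rejected

records = [
    {"id": 1, "location": "Area A", "timestamp": "2024-05-01", "affected_people": 120},
    {"id": 1, "location": "Area A", "timestamp": "2024-05-01", "affected_people": 120},
    {"id": 2, "location": "", "timestamp": "2024-05-02", "affected_people": -5},
]
valid, rejected = clean_dataset(records)
print(len(valid), len(rejected))  # 1 valid record, 2 rejected
```

Keeping a log of rejected records, rather than silently dropping them, is what makes validation an ongoing effort: the log shows which rules fire most often and where upstream collection needs fixing.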
In the fourth contribution of our Social Impact Tech series, we interviewed Ahmed Magdy, NLP & Machine Learning Engineer at Data Friendly Space (DFS). Ahmed has honed his craft for over five years in Machine Learning and AI. His journey began before the advent of the Large Language Model (LLM) era, at a time when he worked with cornerstone Machine Learning models while also training and deploying impactful Deep Learning models in real-world environments.
We discussed with him the role of public opinion in the development and deployment of technologies, and DFS’ approach to automatically gathering updated, verified, and comprehensive information to support humanitarian organizations responding to disasters, whether natural or man-made.
What inspired you to start working in technology?
I have always been fascinated by computers and the technology behind them. Early on, I spent time reading about these topics, and during my time at the Faculty of Engineering, I was introduced to programming. From that point, I decided to pursue a career in programming. As I continued learning, my passion for mathematics and statistics led me to develop a strong interest in AI. This field perfectly blends all three areas — technology, mathematics, and statistics. As an AI engineer, I am committed to progressing further in this path, hoping to make significant contributions and establish a name for myself in the field.
What are some of the biggest challenges you face when integrating technology into sectors that are still catching up?
One of the biggest challenges is managing the expectations around the limitations of technology, particularly in AI. Many people expect AI to be perfect, but the reality is that it heavily relies on data. While data collection has become easier, there are still scenarios that require complex and nuanced labeling. For instance, gathering data about areas needing humanitarian assistance, identifying conflict resources in news and social media, or detecting defects in products can be quite challenging. These are examples I’ve personally encountered. Building a product that meets the client’s expectations can be difficult when they imagine AI can achieve flawless results. While we manage to deliver solid solutions, they are frequently compared to more data-rich fields where AI performs with higher accuracy.
What should everyone have in mind when handling data collection and data quality?
Ethics, bias, and privacy — three simple words that carry immense responsibility. With the rise of Large Language Models (LLMs), like the technology behind ChatGPT, the landscape of data collection has shifted. These models require billions of data points, making it nearly impossible to carefully investigate and clean every piece of data. This raises concerns about whether the data being used adheres to ethical standards. While automated tools can help by applying predefined rules, they can’t fully guarantee data quality.
It is essential to maintain human oversight in the data collection process, especially regarding the sources fed into AI applications. For example, at DFS, we take the time to carefully analyze each resource, even if it delays us from having real-time data. This process enables us to deliver ethical products. While it’s impossible to eliminate bias, we strive to use diverse sources and provide multiple perspectives. Additionally, we equip our LLM applications with instructions that promote transparency in the information they present. Though this approach may introduce conflicts in the data, it’s far easier to manage visible discrepancies than to deal with hidden biases that remain undetected.
Can you discuss any innovative techniques or approaches your team has developed?
At DFS, our primary mission is to support humanitarian organizations in responding to disasters, whether natural or man-made. To achieve this, we developed an advanced data pipeline that goes beyond the typical event-tracking systems. This pipeline is designed to gather comprehensive information, including detailed event timelines, the individuals and areas affected, the supplies that have been distributed, and the ongoing status of the situation.
What sets our approach apart is the automated, continuous updating of event data, ensuring that we always have the most current information. Additionally, we put significant effort into verifying the quality of the sources we use, prioritizing accuracy and reliability. This allows us to offer a clear, actionable view of disaster events and provide critical support where it is most needed.
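The continuous-update idea described above can be sketched roughly as follows. The event schema, the source-trust list, and the merge rules are assumptions made for illustration; they are not DFS’ actual pipeline.

```python
# Illustrative sketch of continuously merging incoming reports into an
# event store. Schema, sources, and merge rules are hypothetical.

TRUSTED_SOURCES = {"un_ocha", "reliefweb", "verified_field_report"}

def update_events(events, reports):
    """Merge reports into the event store, keeping the most recent
    verified status per event and accumulating timeline and areas."""
    for report in reports:
        # Source verification: skip reports from unvetted sources.
        if report["source"] not in TRUSTED_SOURCES:
            continue
        event = events.setdefault(report["event_id"], {
            "timeline": [], "affected_areas": set(), "status": "unknown",
        })
        # Append to the event timeline and track affected areas.
        event["timeline"].append((report["timestamp"], report["summary"]))
        event["affected_areas"].update(report.get("areas", [])
        )
        # Adopt the status from the most recent report.
        latest_timestamp = max(event["timeline"])[0]
        if report["timestamp"] == latest_timestamp:
            event["status"] = report.get("status", event["status"])
    return events

events = update_events({}, [
    {"event_id": "flood-01", "source": "reliefweb",
     "timestamp": "2024-05-01T08:00", "summary": "Flooding reported",
     "areas": ["District 1"], "status": "ongoing"},
    {"event_id": "flood-01", "source": "unknown_blog",
     "timestamp": "2024-05-01T09:00", "summary": "Unverified rumor",
     "areas": ["District 9"]},
    {"event_id": "flood-01", "source": "un_ocha",
     "timestamp": "2024-05-02T10:00", "summary": "Supplies distributed",
     "areas": ["District 2"], "status": "response underway"},
])
print(events["flood-01"]["status"])  # "response underway"
```

Note how the unverified report is dropped before it can touch the event record: filtering at ingestion time is what lets the store stay both current and reliable.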
What role should public opinion play in guiding the development and deployment of technologies?
Public opinion should play a significant role in guiding the development and deployment of technologies, especially when those technologies have a direct impact on people’s lives, privacy, and well-being. While technologists can create innovative solutions, the public often provides essential insights into the societal implications, ethical concerns, and real-world consequences that might not be fully visible during the development process. For instance, at DFS, we ensure that our technologies align with the needs and concerns of the people we serve. Public feedback helps us refine our systems to be more transparent, fair, and accountable. This is particularly important in AI, where biases and privacy issues can arise. Listening to public concerns allows us to address those issues more effectively.
Ultimately, while public opinion should not be the sole driver of technological progress, it serves as a critical check on innovation, ensuring that the technologies we develop are not only technically sound but also socially responsible and beneficial to all.