Harnessing the Power of Data to Identify Fraudulent Water Usage

Jan 08, 2020

Luciano Kalatalo, Founder and Chief Data Scientist at ScientifiCloud

Introduction – The challenge

For a country that holds 12 percent of the planet’s water supply, Brazil faces significant water management issues. In addition to its commonly known sanitation problems, the country’s infrastructure lends itself to distribution issues, including fraudulent use.

Fraudulent water use can be particularly hard to track and identify, and often goes unaddressed for significant periods of time – especially in highly populated areas where physically checking people’s homes and water meters isn’t an option. Instead, companies need to find ways to swiftly identify and eliminate fraudulent water activity, which strains an already scarce supply and costs communities money.

To address this challenge, a utilities company from Mato Grosso, Brazil recently worked with a group of data engineers at ScientifiCloud. The goal was to develop a solution that could better locate fraudulent water usage by tracking data patterns based on home location and property attributes. As a São Paulo-based data science company that develops and deploys machine learning (ML) and artificial intelligence (AI)-powered applications, ScientifiCloud understood these problems firsthand.

Identifying the Issue

Organizational silos are a major hindrance to developing and deploying new solutions in the workplace. All too often the IT department is told one thing, only to find out that the division it is building a tool for has something completely different in mind. When it came time to devise a plan for the Mato Grosso project, we knew we had to speak with every business unit, from the teams we would work closely with, such as IT and innovation, to those that would be most affected by the tools we were building, such as billing and operations.

Following these discussions, we quickly discovered that data was being prioritized and analyzed differently across the organization, making it critical for us to find and unify the data to ensure we developed a universal resource for collecting, storing, and analyzing information.

Addressing the Problem

Data collection and management can be a challenge for utilities companies, especially given the large quantity of information they process every day. In this project, the information was there – property descriptions, addresses, etc. – but it was not housed in a central database. Analyzing the data and identifying which information was most relevant to extract was nearly impossible.

ScientifiCloud needed to build a foundation prior to implementing any automation practices. First, we created a single database on InterSystems IRIS Data Platform™, which would allow us to handle large volumes of data and integrate with multiple sources such as batch data, API data and log data. With the help of Google APIs we were able to filter and identify the general location of each home. This was a major step forward as we prepared to better understand each home and analyze the characteristics that might mark it as a fraudulent water user.
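As a minimal sketch of what unifying records into one store involves, the snippet below merges per-property records from several feeds under a normalized address key. The feed names, field names, and the `normalize_address` helper are hypothetical illustrations, not details from the project or from the InterSystems IRIS API; a plain dictionary stands in for the database.

```python
import re
from collections import defaultdict


def normalize_address(raw: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace (hypothetical helper)."""
    cleaned = re.sub(r"[^\w\s]", " ", raw.lower())
    return " ".join(cleaned.split())


def unify(*feeds):
    """Merge records from several (feed_name, records) pairs, keyed by normalized address.

    Later feeds overwrite earlier ones on conflicting fields, and each
    unified record lists the feeds that contributed to it.
    """
    unified = defaultdict(lambda: {"sources": []})
    for feed_name, records in feeds:
        for record in records:
            entry = unified[normalize_address(record["address"])]
            entry.update({k: v for k, v in record.items() if k != "address"})
            entry["sources"].append(feed_name)
    return dict(unified)


# Made-up sample data for the three source types named in the article.
batch = [{"address": "Rua A, 10", "lot_type": "corner"}]
api = [{"address": "rua a 10", "monthly_m3": 4.2}]
logs = [{"address": "Rua B, 22", "meter_events": 3}]

properties = unify(("batch", batch), ("api", api), ("log", logs))
```

Here the two differently written forms of the same address collapse into one record that carries fields from both the batch and API feeds. Real address matching is considerably messier than this, which is one reason a geocoding service is useful.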

The newly uncovered data enabled us to build ML models in Python using TensorFlow. The pipeline pulled satellite images from Google Maps, which were used to cross-reference the data with property imagery. Ultimately, this allowed surveyors to pare down potential fraudulent cases to 200 households, a much more reasonable dataset than previously available. This finite dataset also allowed the team to uncover new insights, such as property attributes common among fraudulent users, including corner lots and swimming pool ownership.
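The paring-down step can be thought of as a ranking problem: given a fraud score per household, keep only the highest-scoring cases for surveyors to visit. The sketch below assumes such scores already exist; the numbers are made up, and the `shortlist` function and its parameters are hypothetical rather than the project's actual code.

```python
import heapq


def shortlist(scores: dict, limit: int = 200, min_score: float = 0.0) -> list:
    """Return up to `limit` (household_id, score) pairs, highest score first."""
    candidates = ((hid, s) for hid, s in scores.items() if s >= min_score)
    return heapq.nlargest(limit, candidates, key=lambda pair: pair[1])


# Made-up scores standing in for model output.
scores = {"H001": 0.91, "H002": 0.12, "H003": 0.77, "H004": 0.45}
top = shortlist(scores, limit=2, min_score=0.5)
```

Capping the list at a fixed size matches how a field team works: the number of visits is set by staffing, so the model's job is to fill that quota with the most likely cases.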

Constant Iterations

As with any ML application, it’s critical to continuously test and reconfigure algorithms to ensure accuracy – especially as data evolves over time. This ultimately comes down to the strength of the data science team, which must add new insights as updates are deployed and as project priorities shift. One such priority was improving team efficiency without new hires: with just 10% of the investigations previously required, we could reach the same KPI.
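One way to make this kind of continuous testing concrete is to track how many field investigations actually confirm fraud, and to flag the model for review when that rate drops. The sketch below is an illustration only: the tolerance, the counts, and the function names are assumptions, not figures or code from the project.

```python
def hit_rate(confirmed: int, investigated: int) -> float:
    """Share of investigations that confirmed fraud."""
    if investigated <= 0:
        raise ValueError("investigated must be positive")
    return confirmed / investigated


def needs_review(baseline: float, current: float, tolerance: float = 0.10) -> bool:
    """True when the current hit rate is more than `tolerance` (relative) below baseline."""
    return current < baseline * (1 - tolerance)


# Hypothetical counts: the same number of confirmed cases from a tenth of the visits.
before = hit_rate(confirmed=50, investigated=1000)
after = hit_rate(confirmed=50, investigated=100)

# A later batch of inspections that confirms fewer cases triggers a review.
flagged = needs_review(baseline=after, current=hit_rate(30, 100))
```

With these made-up counts, reaching the same number of confirmed cases with 10% of the investigations corresponds to a tenfold increase in hit rate, which is the kind of figure a team would want to keep monitoring after each model update.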

Conclusion

In the case of the Mato Grosso project, clean data was critical to the success of the project. Without it, teams were sinking hours into analyzing the wrong data, increasing the risk of inaccurate decision making. Once the right data was identified, the team was able to swiftly deploy ML algorithms that created accurate outcomes, and ultimately achieved some of the main goals of any ML-powered application – improved insights, accelerated timelines, and accurate predictions.

When preparing to develop a data-driven solution within your organization, your team must first step back and identify the real business problem. Ask different divisions within the organization about what keeps them up at night or what processes could be improved on their end. From there, put the right people in place to help analyze information and identify what data will be most useful for your desired outcome.
