At CoreLogic, as we continue our pursuit to provide deeper analytics and insights for our customers, we challenge ourselves every day to continuously evolve. Artificial intelligence (AI) and machine learning (ML) are areas where we’re constantly seeking improvement in both innovation and efficiency. We have implemented a variety of ML models across our platforms, solutions and processes, but with great innovation come setbacks as well.
A lot of data and a lot of compute
Typically, bringing an ML model together requires a lot of data and computing power. And with coverage of 99.9% of U.S. properties, more than 5.5 billion records and over 1 billion records updated annually, CoreLogic has no shortage of data. On the technology side, given our collaboration with Google Cloud Platform, we also don’t have much of a limit on computing power.
But this is where a high degree of vigilance and optimization is needed. With that much data and processing power, the costs to build, train and implement these models can really rack up without any corresponding improvement in the outputs they produce.
There are several methodologies for developing and implementing ML and analytical models. Traditionally, this works by selecting the data required for the model, processing it (usually in memory) for feature engineering, and then feeding the engineered features into the model for training and prediction.
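To make that concrete, here is a minimal sketch of the traditional, in-memory flow, assuming a hypothetical property-records extract, column names and model choice (none of which reflect an actual CoreLogic pipeline):

```python
# A minimal sketch of the traditional, in-memory methodology.
# The extract file, column names and model choice are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# 1. Select and pull the data required for the model out of the warehouse.
records = pd.read_parquet("property_records_extract.parquet")

# 2. Feature engineering in memory.
records["age"] = pd.Timestamp.now().year - records["year_built"]
records["price_per_sqft"] = records["last_sale_price"] / records["living_area_sqft"]
features = records[["age", "price_per_sqft", "bedrooms", "bathrooms"]]
target = records["assessed_value"]

# 3. Feed the engineered features into the model for training and prediction.
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2)
model = GradientBoostingRegressor().fit(X_train, y_train)
predictions = model.predict(X_test)
```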
However, in many use cases this methodology requires moving huge amounts of data, which demands significant processing power, complex integrations into data pipelines and, above all, longer run times.
Bringing efficiency through innovation
In a quest to improve the efficiency of models, one new methodology we have established is to bring the power of compute to the data. In our implementation, the data and compute are co-located within the same cloud environment.
For example, when we implement ML models on property data to enrich it through imputations and predictions, several of those models are built using Google AutoML Tables and Google BigQuery ML. Because the majority of the source data is already persisted and available in Google BigQuery and Cloud Storage, instead of performing feature engineering in tools like Spark, we implemented the models using a combination of Data Build Tool (DBT) and the Google BigQuery engine, bringing the computing power to the data.
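As a rough sketch of what this looks like, the snippet below trains and applies a BigQuery ML model directly where the data lives; the dataset, table and column names and the model type are hypothetical placeholders, not our actual schema or model definition:

```python
# Sketch: train and apply a BigQuery ML model without moving data out of
# the warehouse. All names below are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

# Feature engineering and training both run inside BigQuery, so only the SQL
# travels to the data -- the records themselves never leave the warehouse.
create_model_sql = """
CREATE OR REPLACE MODEL `property_analytics.value_imputation_model`
OPTIONS (
  model_type = 'boosted_tree_regressor',
  input_label_cols = ['assessed_value']
) AS
SELECT
  EXTRACT(YEAR FROM CURRENT_DATE()) - year_built AS age,
  living_area_sqft,
  bedrooms,
  bathrooms,
  assessed_value
FROM `property_analytics.property_records`
WHERE assessed_value IS NOT NULL
"""
client.query(create_model_sql).result()  # waits for training to finish

# Batch prediction (imputation of missing values), again inside BigQuery.
predict_sql = """
SELECT * FROM ML.PREDICT(
  MODEL `property_analytics.value_imputation_model`,
  (SELECT
     EXTRACT(YEAR FROM CURRENT_DATE()) - year_built AS age,
     living_area_sqft, bedrooms, bathrooms
   FROM `property_analytics.property_records`
   WHERE assessed_value IS NULL))
"""
predictions = client.query(predict_sql).to_dataframe()
```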
We used DBT to deconstruct large processes into reusable components, organized into a directed acyclic graph (DAG) that allows independent steps to run concurrently, while Google BigQuery brought its massively parallel processing architecture to the table.
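DBT derives that DAG automatically from the dependencies declared between its SQL models; the sketch below is a simplified Python analogue of the same idea, using the standard library’s topological sorter to launch independent BigQuery jobs concurrently. The step names and SQL are hypothetical placeholders:

```python
# Simplified analogue of a DBT-style DAG: each reusable component is one
# BigQuery job, and steps whose dependencies are satisfied run concurrently.
from graphlib import TopologicalSorter
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical steps; each materializes one reusable component in BigQuery.
steps = {
    "stg_property_records":
        "CREATE OR REPLACE TABLE `demo.stg_property_records` AS "
        "SELECT parcel_id, year_built, living_area_sqft FROM `demo.raw_property_records`",
    "stg_sales_history":
        "CREATE OR REPLACE TABLE `demo.stg_sales_history` AS "
        "SELECT parcel_id, last_sale_price FROM `demo.raw_sales_history`",
    "features_joined":
        "CREATE OR REPLACE TABLE `demo.features_joined` AS "
        "SELECT r.*, s.last_sale_price FROM `demo.stg_property_records` r "
        "JOIN `demo.stg_sales_history` s USING (parcel_id)",
    "train_model":
        "CREATE OR REPLACE MODEL `demo.value_model` "
        "OPTIONS (model_type = 'linear_reg', input_label_cols = ['last_sale_price']) AS "
        "SELECT year_built, living_area_sqft, last_sale_price FROM `demo.features_joined`",
}

# DAG: each step maps to the set of steps it depends on.
dag = {
    "stg_property_records": set(),
    "stg_sales_history": set(),
    "features_joined": {"stg_property_records", "stg_sales_history"},
    "train_model": {"features_joined"},
}

sorter = TopologicalSorter(dag)
sorter.prepare()
while sorter.is_active():
    ready = sorter.get_ready()
    # Kick off every ready step as a concurrent BigQuery job (e.g. the two
    # staging tables run in parallel), then wait before releasing dependents.
    jobs = {name: client.query(steps[name]) for name in ready}
    for name, job in jobs.items():
        job.result()      # block until this BigQuery job completes
        sorter.done(name)
```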
In the end, not only did implementing the model, integrating it into the pipeline and monitoring it become much simpler, but compute and run time were also cut by a whopping 85%.
Sustaining machine learning
Overall, with advancements in the AI/ML space, building and implementing these models is becoming easier. However, successfully implementing them is not just about applying the best tools and technologies; the key is a broader approach that integrates and sustains these models efficiently. This is what helps organizations reap benefits and succeed in the long run.
Authors
Anand Singh
Sr Leader, Data Technology
Sunny Chun-Jou Hsiang
Sr Professional, Data Scientist