Data Requirements
We used a randomized dataset of 300,000 Southern California properties with nine features: id, value, ltv_ratio, zip, city, county_id, state, latitude, and longitude. We also used this dataset from GitHub to add zip code centroid latitudes and longitudes, and used Google Maps to pick a retailer in Southern California and capture its latitude and longitude. Your main property dataset should contain at least value, ltv_ratio, zip, latitude, and longitude.
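If you are assembling your own input, a minimal property table covering those required columns might look something like the sketch below. The column types are our assumptions for illustration, not a requirement of the Playbook:

```sql
-- Minimal property table for this Playbook (column types are assumptions).
CREATE TABLE properties (
    id         BIGINT PRIMARY KEY,   -- unique property identifier
    value      NUMERIC(12, 2),       -- property value in dollars
    ltv_ratio  NUMERIC(6, 4),        -- loan-to-value ratio, e.g. 0.8000
    zip        VARCHAR(5),           -- 5-digit ZIP code
    city       TEXT,
    county_id  INTEGER,
    state      CHAR(2),
    latitude   DOUBLE PRECISION,
    longitude  DOUBLE PRECISION
);
```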
Data Exploration
In this Playbook, we assume that high value customers are those in the top quintile for both Property Value (value) and Loan-to-Value Ratio (ltv_ratio). To decide where to set our cutoffs, we first explore the data so that we understand our population. We will perform the following tasks separately for Property Value and Loan-to-Value Ratio, referring to each as the "target feature":
- Row Filter to remove null values and zeros from the target feature
- Variable operator to create 5 equal bins of the target feature by using Quantile Variables and "Average Ascend" for Quantile Type
- Summary Statistics on our target feature using the bins we created in the previous step for the Group by parameter
We can then analyze the output of our Summary Statistics operators to see the ranges for the top quintiles of Property Value and Loan-to-Value Ratio - these values are used in a Row Filter to remove any property that does not meet both thresholds. An equivalent SQL sketch of these exploration steps is shown below.
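If you prefer to prototype the same logic directly in SQL, the query below shows one way to bin a target feature into quintiles and summarize each bin. The table and column names come from the feature list above, but the query itself is our own sketch, not the Playbook's generated code:

```sql
-- Bin non-null, non-zero property values into 5 equal-count bins,
-- then summarize each bin to find the top-quintile cutoff.
WITH binned AS (
    SELECT
        id,
        value,
        NTILE(5) OVER (ORDER BY value) AS value_bin   -- 1 = lowest, 5 = highest quintile
    FROM properties
    WHERE value IS NOT NULL
      AND value <> 0
)
SELECT
    value_bin,
    COUNT(*)   AS n_properties,
    MIN(value) AS min_value,
    AVG(value) AS avg_value,
    MAX(value) AS max_value
FROM binned
GROUP BY value_bin
ORDER BY value_bin;
-- Repeat with ltv_ratio as the target feature; the MIN of bin 5 for each
-- feature is the threshold to use in the downstream Row Filter.
```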
At this point, your flow should look something like this:
Data Enrichment and Feature Engineering
In this section we will want to perform the following tasks:
- Join in our zip code centroids on zip
- Perform a side join of our retailer location by joining on 'dummy' variables
  - For this step, we have created a new column called 'dummy' in each dataset with a value of 1 for every record
- Calculate the great circle distance of each zip centroid and each property from the retailer location
  - In this example we are using a Postgres database, so we can leverage the earthdistance functions - these return distance in meters, so we also divide the output by 1609.34 to convert to miles. If you are using another database or Hadoop, refer to its documentation or calculate the distance with the Haversine Formula (see the sketches after this list)
- Row Filter to select zip codes whose centroids are within 10 miles of the retailer - we are considering this radius to be the trade area
- Variable operator using CASE statements (or IF statements in other languages) to create booleans for subsets of radii
  - We used 2.5 miles and 7.5 miles for ours - this is a nice indication of how close these high value customers actually are
- Aggregate by zip to sum the counts of each radius and the total count in the trade area
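The same enrichment pipeline can be sketched as a single Postgres query. The table names (high_value_properties, zip_centroids, retailer_location) and centroid column names below are hypothetical stand-ins for the outputs of the earlier steps, so treat this as a sketch of the logic rather than the Playbook's generated SQL:

```sql
-- Assumes the cube and earthdistance extensions are installed:
--   CREATE EXTENSION IF NOT EXISTS cube;
--   CREATE EXTENSION IF NOT EXISTS earthdistance;
WITH enriched AS (
    SELECT
        p.zip,
        -- earth_distance() returns meters; divide by 1609.34 for miles
        earth_distance(ll_to_earth(z.centroid_lat, z.centroid_lon),
                       ll_to_earth(r.latitude, r.longitude)) / 1609.34 AS zip_miles,
        earth_distance(ll_to_earth(p.latitude, p.longitude),
                       ll_to_earth(r.latitude, r.longitude)) / 1609.34 AS property_miles
    FROM high_value_properties p
    JOIN zip_centroids     z ON z.zip   = p.zip
    JOIN retailer_location r ON r.dummy = p.dummy   -- side join on the 'dummy' columns
),
trade_area AS (
    SELECT
        zip,
        CASE WHEN property_miles <= 2.5 THEN 1 ELSE 0 END AS within_2_5_miles,
        CASE WHEN property_miles <= 7.5 THEN 1 ELSE 0 END AS within_7_5_miles
    FROM enriched
    WHERE zip_miles <= 10   -- zip centroid within the 10-mile trade area
)
SELECT
    zip,
    SUM(within_2_5_miles) AS properties_within_2_5_miles,
    SUM(within_7_5_miles) AS properties_within_7_5_miles,
    COUNT(*)              AS properties_in_trade_area
FROM trade_area
GROUP BY zip
ORDER BY properties_in_trade_area DESC;
```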
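If your database or Hadoop engine does not provide the earthdistance functions, the same distance in miles can be computed with the Haversine Formula using standard math functions. A minimal sketch, again with hypothetical table names and an assumed mean Earth radius of 3,958.8 miles:

```sql
-- Haversine distance in miles between each property and the retailer.
-- 3958.8 is the mean Earth radius in miles (an assumption of this sketch).
SELECT
    p.id,
    2 * 3958.8 * ASIN(SQRT(
        POWER(SIN(RADIANS(p.latitude - r.latitude) / 2), 2)
        + COS(RADIANS(r.latitude)) * COS(RADIANS(p.latitude))
          * POWER(SIN(RADIANS(p.longitude - r.longitude) / 2), 2)
    )) AS miles_from_retailer
FROM properties p
JOIN retailer_location r ON r.dummy = p.dummy;   -- same dummy-column side join
```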
Your final flow should look something like this:
Other Playbook Assets
Homeowner Segmentation Touchpoint
In this Touchpoint, users enter threshold values for both Property Value and Loan-to-Value Ratio and are returned a ranked list of all cities that have properties satisfying the given constraints. This Touchpoint allows non-analytical users to understand the demographics of the cities in their dataset in a code-free and workflow-free environment. The list of cities can be easily exported as a .csv file to be added to a sales presentation or to populate an email.
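Under the hood, a Touchpoint like this essentially parameterizes a query over the property data. A hedged sketch of what that query might look like is below; the $-prefixed placeholders stand for the user-entered thresholds, and the placeholder syntax and output columns are our assumptions rather than the Touchpoint's actual definition:

```sql
-- Rank cities by how many properties clear both user-supplied thresholds.
SELECT
    city,
    COUNT(*)   AS qualifying_properties,
    AVG(value) AS avg_property_value
FROM properties
WHERE value     >= $value_threshold    -- user-entered Property Value threshold
  AND ltv_ratio >= $ltv_threshold      -- user-entered Loan-to-Value threshold
GROUP BY city
ORDER BY qualifying_properties DESC;
```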