Klos Energy Consulting, LLC
            
Probabilistic Systems to Fill in Missing Data

Missing data is a problem in many population datasets. If information about the characteristics of individual customers is not known, it is not possible to know the distribution of the characteristics. For example, if we don't have NAICS information for every business customer, we don't know how many customers are in each business type group (manufacturing, retail, office, school, etc.).

The problem of missing data is particularly acute for NAICS information in utility databases. Utilities do not need to know each customer's NAICS code to carry out their normal business transactions (energy delivery and billing). Combine this lack of necessity with the rapid turnover of small businesses, and the result is that many NAICS codes are missing from most utility customer databases. This is unfortunate because business type is a very important characteristic for understanding customer energy use and the potential for energy efficiency and demand response.

It is possible to buy NAICS information from third-party sources like Dun & Bradstreet or InfoUSA. However, this data is expensive, and there is no guarantee that every record can be matched or that every code is correct. For utilities that do not have the funds to buy this type of data, Daniel has developed an inexpensive method for filling in missing NAICS codes that is usually, but not always, right.

This method is called “Pizza Logic,” and it works like this. He takes all customer records that already have an assigned NAICS code and assumes these codes are right. He then parses out the words in each business name and keeps track of the NAICS code associated with each word. For example, if the word 'Pizza' shows up in a business name, the business likely has a NAICS code indicating that it is a restaurant. Each word then gets a probability score for belonging to a particular NAICS group. These probability scores are then applied to the business names that have no NAICS information. So a business called 'Pizza Place' will, with high probability, be assigned a restaurant NAICS code. The result is a NAICS code for every customer that is probably right.
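The steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the actual Pizza Logic implementation: the function names (`tokenize`, `train`, `predict`), the sample records, and the group labels are all hypothetical, and the rule for combining word scores (summing each word's probabilities and picking the largest total) is an assumption, since the description does not say how scores from several words are combined.

```python
import re
from collections import Counter, defaultdict


def tokenize(name):
    """Split a business name into lowercase words (letters only)."""
    return re.findall(r"[a-z]+", name.lower())


def train(labeled):
    """Build per-word probability scores from (name, naics_group) pairs."""
    word_counts = defaultdict(Counter)
    for name, group in labeled:
        for word in set(tokenize(name)):
            word_counts[word][group] += 1
    # Convert raw counts into the share of each group for each word.
    return {
        word: {g: n / sum(counts.values()) for g, n in counts.items()}
        for word, counts in word_counts.items()
    }


def predict(name, word_probs):
    """Return (best_group, confidence) for an unlabeled business name.

    Assumed combining rule: add up the word probabilities per group,
    then normalize the winning total by the sum over all groups.
    """
    scores = Counter()
    for word in set(tokenize(name)):
        for group, prob in word_probs.get(word, {}).items():
            scores[group] += prob
    if not scores:
        return None, 0.0  # no known word in the name
    group, total = scores.most_common(1)[0]
    return group, total / sum(scores.values())


# Hypothetical records that already have a NAICS group assigned.
labeled = [
    ("Tony's Pizza", "restaurant"),
    ("Pizza Palace", "restaurant"),
    ("Corner Cafe", "restaurant"),
    ("Main Street Hardware", "retail"),
    ("Palace Shoes", "retail"),
]
word_probs = train(labeled)
result = predict("Pizza Place", word_probs)
print(result)  # ('restaurant', 1.0)
```

In this toy data the word 'pizza' appears only in restaurant names, so 'Pizza Place' is assigned to the restaurant group. A name with no known words gets no assignment at all, which a real system would need to handle separately.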

The downside of this method is that it is right only most of the time, not for every customer. For example, a business called “Farm Supply” is likely to be assigned to the insurance category because so many insurance companies (State Farm, Farmers Insurance) have the word 'Farm' in their names. With the Pizza Logic method there is no way to avoid this kind of mistake, so the data will not be 100% clean. But when there is no budget available to buy good NAICS data, Pizza Logic can be a quick and significant improvement over leaving the NAICS data missing.
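The 'Farm' mistake is easy to reproduce with a handful of made-up records. The sketch below is a hypothetical illustration only: the business names and group labels are invented, and it assumes whole-word matching on lowercase names. It counts how often the word 'farm' appears under each group and shows that the word alone pulls a name like 'Farm Supply' toward insurance.

```python
from collections import Counter

# Hypothetical labeled records: (business name, NAICS group).
labeled = [
    ("State Farm Insurance", "insurance"),
    ("State Farm Agency", "insurance"),
    ("Farm Bureau Insurance", "insurance"),
    ("Valley Farm Equipment", "retail"),
    ("Downtown Office Depot", "retail"),
]

# Count how often the whole word "farm" appears under each group.
farm_counts = Counter(
    group for name, group in labeled if "farm" in name.lower().split()
)
total = sum(farm_counts.values())
farm_shares = {group: n / total for group, n in farm_counts.items()}

# "supply" never appears in the labeled names above, so for a business
# called "Farm Supply" the word "farm" alone decides the assignment.
best_group = max(farm_shares, key=farm_shares.get)
print(farm_shares)  # {'insurance': 0.75, 'retail': 0.25}
print(best_group)   # insurance
```

Three of the four labeled names containing 'farm' are insurance companies, so the word carries a 75% insurance score and the farm supply store is misclassified.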

PECO Energy DSM Baseline Study (2010)

AEP-Ohio DSM Baseline and Potential Study (2010)

MidAmerican Energy Small Commercial and Industrial Energy Efficiency Market Penetration Study (2007)