
Recording discrepancies in Nielsen Homescan data: Are they present and do they matter?


Abstract

We report results from a validation study of the Nielsen Homescan consumer panel data. We use data from a large grocery retailer to match transactions that were recorded by the retailer (at the store) and by the Homescan panelist (at home). The matched data allow us to identify and document discrepancies between the two data sets in reported shopping trips, products, prices, and quantities. We find that the discrepancies are largest for the price variable, and show that they are due to two effects: the first seems like standard recording errors (by Nielsen or the panelists), while the second is likely due to the way Nielsen imputes prices. We present two simple applications to illustrate the impact of recording differences, and we use one of the applications to illustrate how the validation study can be used to adjust estimates obtained from Nielsen Homescan data. The results suggest that while recording discrepancies are clearly present and potentially impact results, corrections, like the one we employ, can be adopted by users of Homescan to investigate the robustness of their results to such potential recording differences.



Notes

  1. See also http://www.nielsen.com/clients/index.html for additional information about the Homescan data.

  2. Earlier we mentioned a simple matching algorithm we used for the data construction. This was only used to speed up the data requesting process from the retailer, and we do not use its results further. In this section we describe a more systematic matching strategy that is used for the remainder of the paper.

  3. A natural speculation is that some of these mis-recorded trips simply mis-record the date by a day (e.g., because the household did not get around to actually scanning the purchased products at home until the next day). Using the retailer’s data from the second step we found that while such cases occur, they do not account for a large fraction of the 20% mis-recorded trips reported here.

  4. For each of the 291 Homescan households for which we obtained data in the second step, we compute the fraction of their retailer trips that produced a match, where a match is defined as a trip, of any size, with r1 greater than 0.7. A higher fraction implies that the household made fewer errors in recording the store and date. The distribution of this fraction is bimodal. We define a poor-match household as one for which the fraction is less than 0.3. This procedure eliminated 18 households and left us with 273 households, who used the same loyalty cards (or matched cards, as linked by the retailer) consistently. We then applied a similar procedure to specific cards of these households, which led us to drop a small number of cards.

  5. While we do not have direct store-level data on card usage, we can get a rough idea of this. Specifically, for each observation in the transaction-level data that is associated with a loyalty card discount that reduced the price from p to p − d, we ask what is the corresponding store-level (average) price \(\overline{p}\) at that store and week. Our estimate of loyalty card use (for a given item at a given store and week) is then given by \(u=(p-\overline{p}) /d\). Of course, this may vary due to sampling variation, but across items, stores, and weeks, the distribution of u is centered around 75–80%.

  6. The reported results do not account for coupons. Results that use prices net of coupons are qualitatively similar, and are available from the authors upon request.

  7. Our analysis so far used data from two metropolitan areas (see Appendix). Here we only use data from the larger metropolitan area, as a way to minimize confounding the results due to pricing differences between the two areas. Coincidentally, this is also the metropolitan area covered by the Homescan data used in Dube (2004) and Aguiar and Hurst (2007).

  8. We note that our exercise is somewhat similar in spirit to the exercise reported by Gupta et al. (1996) who compare demand elasticities estimated from consumer-level data to those estimated from store-level data. Unlike them, however, we use the same set of transactions, so we can focus on the measurement error; their results are likely driven by selection issues: consumers in the panel might not represent the population of shoppers in the store.

  9. The average semi-elasticity in this selected sample is lower than that reported earlier, but the difference between the data sets is similar. In this selected sample, we estimate semi-elasticities of − 0.128 (0.022) and − 0.378 (0.039) using the Homescan and the retailer data, respectively.
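The back-of-the-envelope calculation in note 5 can be sketched in Python. This is a minimal illustration of the formula \(u=(p-\overline{p})/d\); the function name and arguments are ours, not part of the authors' code.

```python
def loyalty_card_use(shelf_price, discount, avg_price):
    """Estimate the share of transactions using a loyalty card for a given
    item, store, and week, as in note 5: u = (p - pbar) / d, where p is the
    shelf price, d the card discount, and pbar the store-level average price
    actually paid. If nearly all shoppers get the discount, pbar is close to
    p - d and u is close to 1."""
    return (shelf_price - avg_price) / discount
```

For example, a shelf price of $2.00, a card discount of $0.50, and an average paid price of $1.60 give u = 0.8, within the 75-80% range reported in the note.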

References

  • Aguiar, M., & Hurst, E. (2007). Life-cycle prices and production. American Economic Review, 97(5), 1533–1559.


  • Ashenfelter, O., & Krueger, A. B. (1994). Estimates of the economic returns to schooling from a new sample of twins. American Economic Review, 84(5), 1157–1173.


  • Bound, J., Brown, C. C., & Mathiowetz, N. (2001). Measurement error in survey data. In E. E. Learner & J. J. Heckman (Eds.), Handbook of econometrics (pp. 3705–3843). New York: North Holland.


  • Bound, J., & Krueger, A. B. (1991). The extent of measurement error in longitudinal earnings data: Do two wrongs make a right? Journal of Labor Economics, 9(1), 1–24.


  • Broda, C., & Weinstein, D. E. (2009). Product creation and destruction: Evidence and price implications. American Economic Review, in press.

  • Broda, C., & Weinstein, D. E. (2008). Understanding international price differences using barcode data. NBER Working Paper No. 14017.

  • Chen, X., Hong, H., & Tamer, E. (2005). Measurement error models with auxiliary data. Review of Economic Studies, 72(2), 343–366.


  • Dube, J-P. (2004). Multiple discreteness and product differentiation: Demand for carbonated soft drinks. Marketing Science, 23(1), 66–81.


  • Einav, L., Leibtag, E., & Nevo, A. (2008). On the accuracy of Nielsen Homescan data. USDA Economics Research Report Number 69.

  • Gupta, S., Chintagunta, P., Kaul, A., & Wittink, D. R. (1996). Do household scanner data provide representative inferences from brand choices: A comparison with store data. Journal of Marketing Research, 33(4), 383–398.


  • Hausman, J., & Leibtag, E. (2007). Consumer benefits from increased competition in shopping outlets: Measuring the effect of Wal-Mart. Journal of Applied Econometrics, 22(7), 1157–1177.


  • Katz, M. (2007). Estimating supermarket choice using moment inequalities. Ph.D. Dissertation, Harvard University.

Author information

Corresponding author

Correspondence to Aviv Nevo.

Additional information

We are grateful to two anonymous referees, to Peter Rossi (the Editor), and to participants at the Chicago-Northwestern IO-Marketing conference, the Hoover Economics Bag Lunch, the NBER Price Dynamics Conference, the NBER Productivity Potpourri, the Stanford Economics Junior Lunch, and the World Congress on National Accounts for many helpful comments. We thank Andrea Pozzi and Chris Taylor for outstanding research assistance. This research was funded by a cooperative agreement between the USDA/ERS and Northwestern University, but the views expressed herein are those of the authors and do not necessarily reflect the views of the U.S. Department of Agriculture.

Appendix: Detailed description of the data construction

As mentioned in the text and sketched in Fig. 1, our data construction process involved two distinct steps. Below we describe each step in turn.

First step

In principle, we could have asked the retailer to supply information on any of its stores visited at any point by a Homescan panelist. However, since generating the data involved some effort for the retailer, we had to limit our data request in the first step to roughly fifteen hundred store-day transaction-level records.

We therefore proceeded as follows. First, we restricted the data set to two metropolitan areas in which the retailer has a high market share. This left us with 265 retailer stores (147 in one area and 118 in the other). Since we identify a store by the zip code of its location, we restricted attention to retailer stores that are the only retailer store in their zip code. This eliminated 76 stores (29%), leaving 189 stores (101 in one area, 88 in the other). We then searched the Homescan data for shopping trips to these stores, with the additional conditions that: (i) the trip includes purchases of at least 5 distinct UPCs (to make a match easier); (ii) the trip occurred after February 15, 2004 (to guarantee that the retailer, who deletes transaction-level data older than two years, still had these data at the time we put in the data request); and (iii) the household shops at the retailer's stores (according to Homescan) on more than 20% and less than 80% of its trips. Our initial goal in generating the data was to study store choice; hence, we wanted consumers who visited the retailer's stores frequently, but not always. These trips were made by 342 distinct households in the Homescan data. For 240 of these households we randomly selected a single trip each. For the remaining 102 households, those with at least 10 and not more than 20 reported trips in the Homescan data, we selected all their trips. We then requested from the retailer the full transaction records for the store-days that matched these 1,779 trips. Since 74 of these trips were to the same store on the same date, we expected to get 1,705 store-day transaction-level records.
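The three screening conditions above amount to a simple per-trip filter. The sketch below is ours, with illustrative field names; it is not the authors' code.

```python
from datetime import date

def eligible_trip(trip, retailer_trip_share):
    """Screen a Homescan trip for the first-step data request:
    (i)   the trip includes at least 5 distinct UPCs,
    (ii)  the trip occurred after February 15, 2004, and
    (iii) the household makes between 20% and 80% of its trips
          at the retailer (retailer_trip_share).
    `trip` is a dict with hypothetical keys "upcs" and "date"."""
    return (len(set(trip["upcs"])) >= 5
            and trip["date"] > date(2004, 2, 15)
            and 0.2 < retailer_trip_share < 0.8)
```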

We eventually received 1,603 of the 1,705 requested store-days (1,247 in the first area, 356 in the other), covering 122 distinct stores (74 in the first area and 48 in the other) and accounting for 4,080,770 shopping trips. The missing store-days were mostly due to random coding errors when the data were generated; the retailer had little idea how we were going to match the data and had no way to systematically affect our results by dropping data. These 1,603 store-days are associated with 1,675 of the 1,779 shopping trips described above. Moreover, because the retailer enjoys a high market share in both areas, it is not surprising that the store-day records we obtained are associated with an additional 904 trips in Homescan; these additional trips arise whenever another Homescan household visited the same store on the same day. Given the way we constructed the sample, however, many of these additional trips include a small number of items, or households that rarely shop at the retailer's stores.

Second step

After obtaining the data from the first step, we developed a simple algorithm to find likely matches between trips in the Homescan data and trips in the retailer's data. These likely matches were only used to speed up the data construction process (as described in the text, the data analysis in the paper uses a more systematic matching procedure). The algorithm used the first five UPCs in the Homescan trip and declared a match if at least three of these five were found in a given trip in the retailer's data. Applying this algorithm to the data obtained in the first step, we found 1,372 likely matches that, according to Homescan, are associated with 293 distinct households. Of these households, 166 were associated with more than one likely match, and 105 with four or more.
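A minimal sketch of this three-of-five matching rule (our illustration, assuming UPCs are comparable identifiers in both data sets):

```python
def likely_match(homescan_upcs, retailer_upcs, probe_size=5, hits_needed=3):
    """Declare a likely match if at least `hits_needed` of the first
    `probe_size` UPCs recorded in the Homescan trip also appear in the
    candidate retailer trip."""
    retailer_set = set(retailer_upcs)
    hits = sum(1 for upc in homescan_upcs[:probe_size] if upc in retailer_set)
    return hits >= hits_needed
```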

We then asked the retailer to use the loyalty card used in these 1,372 shopping trips and to provide us with all the transactions available for the households associated with these cards (in the retailer’s data during the year 2004). Only two of the requested trips were not associated with loyalty cards. For the rest, we obtained all the transactions associated with the same loyalty card, and additional transactions that are associated with loyalty cards used by the same household, as classified by the retailer. Since associating multiple cards with the same household may not be perfect, in the analysis we experimented with both the card-level and the household-level matching.

In this step we obtained a total of 40,036 shopping trips from the retailer. These 40,036 trips are associated with 384 distinct stores (139 in the first area, 109 in the second, and 136 in other areas), with 682 distinct loyalty cards (472 in the first area, 203 in the second, and 7 in other areas), and with 529 distinct households according to the retailer's definition (380 in the first area, 140 in the other). Finally, the 40,036 trips are associated with 34,316 unique store-date-loyalty card combinations, 33,744 unique store-date-household combinations (using the retailer's definition of a household), and 27,746 unique store-date-household combinations (using the Homescan definition). Of these trips, 3,884 (9.7%) occurred on a store-day already appearing in the data we obtained earlier, and therefore are among the 4,080,770 trips obtained in the first step. Recall that the algorithm we used to request these data was geared to find likely matches, and therefore may also have found wrong matches. This is one reason that the number of households we intended to match (291, the original 293 minus the two with no associated loyalty cards) is less than the number of households associated with these trips. A second reason may be multiple cards used by the same household that are not linked to each other by the retailer.
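The unique-combination counts above can be computed by collecting distinct key tuples; a generic sketch (field names are illustrative):

```python
def count_unique(trips, keys):
    """Count distinct combinations of the given keys across trips, e.g.
    keys=("store", "date", "card") for store-date-loyalty card combinations,
    or keys=("store", "date", "household") for store-date-household ones."""
    return len({tuple(trip[k] for k in keys) for trip in trips})
```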

Summary

To summarize, we have two different types of data from the retailer. The first data set includes the full transaction records of 1,603 distinct store-days; in this data set, trips are not associated with loyalty cards. The second data set includes 40,036 trips, which are associated with particular loyalty cards and households; 3,884 of these trips overlap and appear in both data sets. The first data set is designed to match multiple transactions of 102 households in the Homescan data, and isolated transactions of other households. The second data set is designed to match all transactions of almost 300 households.

Cite this article

Einav, L., Leibtag, E., & Nevo, A. Recording discrepancies in Nielsen Homescan data: Are they present and do they matter? Quant Mark Econ 8, 207–239 (2010). https://doi.org/10.1007/s11129-009-9073-0
