There are probably a variety of ways to say that one sample is similar to another. Similar could mean that brand sales have to match, or nearly match, by month. Depending on how many brands and months of data you have, this could be overwhelming. Another way could be that brand sales have to match over a certain time period, e.g. over 12 months. Yet a third way could be that total store sales, regardless of brand, have to match over a certain time period. These examples have the matching criteria getting progressively less detailed, and I'm sure there are a variety of other criteria that could make two samples similar.

If the number of brands is small enough, then the algorithm from the paper or something similar could work. For example, say you have your universe broken into 2 data files, one consisting of all the records pulled in the first sample and the second consisting of all records not pulled in the first sample (assuming you don't want to pull the same record in the matched data file). For simplicity, let's say there are 2 brands and we have the annual sales for each brand, called AnnualSales_Brand1 and AnnualSales_Brand2. Then one way to match could be:

    proc sql;
      create table matchdata as
      select one.StoreID as One_StoreID,
             two.StoreID as Two_StoreID,
             one.AnnualSales_Brand1 as One_SalesB1,
             two.AnnualSales_Brand1 as Two_SalesB1,
             one.AnnualSales_Brand2 as One_SalesB2,
             two.AnnualSales_Brand2 as Two_SalesB2
      from Sample1 one, UniverseLessSample1 two
      where (-xxx <= one.AnnualSales_Brand1 - two.AnnualSales_Brand1 <= xxx
             and -xxx <= one.AnnualSales_Brand2 - two.AnnualSales_Brand2 <= xxx);
    quit;

This matching is based on finding values that fall within a certain dollar amount (xxx) of each other, but you could use ratios to get within a certain % or use other matching criteria. It will also produce a one-to-many match: each One_StoreID can be paired with several Two_StoreID values. If you want a 1:1 match, then you could randomly select one Two_StoreID for each One_StoreID value.
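A minimal sketch of that random 1:1 selection, assuming the matchdata table from the PROC SQL step already exists. The data set names matchrand and match1to1, the variable RandKey, and the seed 12345 are made up for illustration, so change them as needed.

```sas
/* Attach a random number to every candidate pair.            */
/* The seed 12345 is arbitrary; it only makes the draw        */
/* reproducible from run to run.                              */
data matchrand;
    set matchdata;
    if _n_ = 1 then call streaminit(12345);
    RandKey = rand('uniform');
run;

/* Put the candidates for each One_StoreID in random order.   */
proc sort data=matchrand;
    by One_StoreID RandKey;
run;

/* Keep only the first candidate within each One_StoreID.     */
data match1to1;
    set matchrand;
    by One_StoreID;
    if first.One_StoreID;
    drop RandKey;
run;
```

Note that this only guarantees one Two_StoreID per One_StoreID. The same Two_StoreID can still be picked for more than one One_StoreID, so if you need matching without replacement you would have to add a step that removes a store from the candidate pool once it has been used.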
Of course, if your first sample was randomly selected, then taking another random selection for your 2nd sample may give you a "similar" data set too. Hope this helps!