turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- Analytics
- /
- Stat Procs
- /
- Match-Control Samples

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

09-19-2013 09:41 AM

Hi,

I have a sample from a Universe which was taken a couple of weeks ago.

Now I need to take a second sample from the same universe with similar characteristics to the previous sample (brands sales for example). The Idea is to have a second sample which can be used on a significance test against the first sample.

Can i use PROC SURVEYSELECT for this? How can I achive that?

Should I use another proc? Or do I need to code it myself?

Thanks!!!

Accepted Solutions

Solution

10-02-2013
11:23 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

10-02-2013 11:23 AM

Hi Mostater,

Again, thank you for your contribution.

At the end, I did something else. For your reference, here is what I did:

First I summed up sales for the category of interest (beers). Thus, I ended up with a table having Beer Sales by Store By Month.

Next, I calculated the euclidean distance between each store and all the rest. So I took the difference in Beer Sales for each month store by store against the rest, and squared that. After, I summed up the squared differences by month resulting in one number per each combination of stores. Finally I calculated the squared root of that sum, and that's the euclidean distance.

Then, I just sorted the table by that distance and took the store pair (test-control) where that distance was the minimum for the test store.

With this method, I ended up with 2 samples, one test and one control, with similar sales trend month by month over 20 months. Nice, isn't it?

All Replies

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

09-19-2013 11:49 AM

Hi MarcosAramburu,

There may be more elegant solutions for your situation, but one way I can think of would be to use proc SQL. Here is a SAS user group paper that may be of help http://analytics.ncsu.edu/sesug/2001/P-403.pdf. Another possibility is to use propensity scores and some sort of Greedy or optimal matching algorithm...this may be overly complex for what you need but is definitely something to consider if you have many variables that you need to match on.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

09-19-2013 01:41 PM

Hi Mostater,

Thank you very much for your quick response. I've gone over that paper and sadly,It doesn't help me.

I need to extract a second sample from the universe. All I have is the sales of a couple of brands by store by month. I need to extract that second sample which behaves "similar" to the sample I've already have, according to those brands sales.

Do you think that algorithm will help me?

Thanks!

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

09-19-2013 05:03 PM

There are probably a variety of ways to say that one sample is similar to another. Similar could be that brand sales have to match or nearly match by month. Depending on how many brands and months of data you have, this could be overwhelming. Another way could be that brand sales have to match over a certain time period, e.g. over 12 months. Yet a third way could be that total store sales, regardless of brand, have to match over a certain time period. These examples have the matching criteria getting progressively less detailed and I'm sure there are a variety of other matches that could make two samples similar.

If the number of brands are few enough then the algorithm from the paper or something similar could work. For example, say you have your universe broken into 2 data files, one consisting of all the records pulled in the first sample and the second consisting of all records not pulled in the first sample (assuming you don't want to pull the same record in the matched data file). For simplicity, let's say there are 2 brands and we have the annual sales for each brand called AnnualSales_Brand1 and AnnualSales_Brand2. Then one way to match could be:

proc sql;

create table matchdata

as select

one.StoreID as One_StoreID,

two.StoreID as Two_StoreID,

one.AnnualSales_Brand1 as One_SalesB1,

two.AnnualSales_Brand1 as Two_SalesB1,

one.AnnualSales_Brand2 as One_SalesB2,

two.AnnualSales_Brand2 as Two_SalesB2

from Sample1 one, UniverseLessSample1 two

where (-xxx <= one.AnnualSales_Brand1 - two.AnnualSales_Brand1 <= xxx and

-xxx <= one.AnnualSales_Brand2 - two.AnnualSales_Brand2 <= xxx);

This matching is based on finding values that fall within a certain xxx dollar amount, but you could use ratios to get within certain % or other matching criteria. It will also provide a one to many match. If you want to have a 1:1 match then you could randomly select one Two_StoreID for each One_StoreID value.

Of course if your first sample was randomly selected, then taking another random selection for you 2nd sample may provide you with a "similar" data set too.

Hope this helps!

Solution

10-02-2013
11:23 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

10-02-2013 11:23 AM

Hi Mostater,

Again, thank you for your contribution.

At the end, I did something else. For your reference, here is what I did:

First I summed up sales for the category of interest (beers). Thus, I ended up with a table having Beer Sales by Store By Month.

Next, I calculated the euclidean distance between each store and all the rest. So I took the difference in Beer Sales for each month store by store against the rest, and squared that. After, I summed up the squared differences by month resulting in one number per each combination of stores. Finally I calculated the squared root of that sum, and that's the euclidean distance.

Then, I just sorted the table by that distance and took the store pair (test-control) where that distance was the minimum for the test store.

With this method, I ended up with 2 samples, one test and one control, with similar sales trend month by month over 20 months. Nice, isn't it?