BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
MarcosAramburu
Calcite | Level 5

Hi,

I have a sample from a Universe which was taken a couple of weeks ago.

Now I need to take a second sample from the same universe with similar characteristics to the previous sample (brands sales for example). The Idea is to have a second sample which can be used on a significance test against the first sample.

Can i use PROC SURVEYSELECT for this? How can I achive that?

Should I use another proc? Or do I need to code it myself?

Thanks!!!

1 ACCEPTED SOLUTION

Accepted Solutions
MarcosAramburu
Calcite | Level 5

Hi Mostater,

Again, thank you for your contribution.

At the end, I did something else. For your reference, here is what I did:

First I summed up sales for the category of interest (beers). Thus, I ended up with a table having Beer Sales by Store By Month.

Next, I calculated the euclidean distance between each store and all the rest. So I took the difference in Beer Sales for each month store by store against the rest, and squared that. After, I summed up the squared differences by month resulting in one number per each combination of stores. Finally I calculated the squared root of that sum, and that's the euclidean distance.

Then, I just sorted the table by that distance and took the store pair (test-control) where that distance was the minimum for the test store.

With this method, I ended up with 2 samples, one test and one control, with similar sales trend month by month over 20 months. Nice, isn't it?

View solution in original post

4 REPLIES 4
mostater
Obsidian | Level 7

Hi MarcosAramburu,

There may be more elegant solutions for your situation, but one way I can think of would be to use proc SQL.  Here is a SAS user group paper that may be of help http://analytics.ncsu.edu/sesug/2001/P-403.pdf.  Another possibility is to use propensity scores and some sort of Greedy or optimal matching algorithm...this may be overly complex for what you need but is definitely something to consider if you have many variables that you need to match on.

MarcosAramburu
Calcite | Level 5

Hi Mostater,

Thank you very much for your quick response. I've gone over that paper and sadly,It doesn't help me.

I need to extract a second sample from the universe. All I have is the sales of a couple of brands by store by month. I need to extract that second sample which behaves "similar" to the sample I've already have, according to those brands sales.

Do you think that algorithm will help me?

Thanks!

mostater
Obsidian | Level 7

There are probably a variety of ways to say that one sample is similar to another. Similar could be that brand sales have to match or nearly match by month.  Depending on how many brands and months of data you have, this could be overwhelming.  Another way could be that brand sales have to match over a certain time period, e.g. over 12 months.  Yet a third way could be that total store sales, regardless of brand, have to match over a certain time period.  These examples have the matching criteria getting progressively less detailed and I'm sure there are a variety of other matches that could make two samples similar.

If the number of brands are few enough then the algorithm from the paper or something similar could work.  For example, say you have your universe broken into 2 data files, one consisting of all the records pulled in the first sample and the second consisting of all records not pulled in the first sample (assuming you don't want to pull the same record in the matched data file).  For simplicity, let's say there are 2 brands and we have the annual sales for each brand called AnnualSales_Brand1 and AnnualSales_Brand2.  Then one way to match could be:

proc sql;

create table matchdata      

as select                

  one.StoreID as One_StoreID,     

  two.StoreID as Two_StoreID,   

  one.AnnualSales_Brand1 as One_SalesB1,   

  two.AnnualSales_Brand1 as Two_SalesB1, 

  one.AnnualSales_Brand2 as One_SalesB2,   

  two.AnnualSales_Brand2 as Two_SalesB2 

from Sample1 one, UniverseLessSample1 two 

where (-xxx <= one.AnnualSales_Brand1 - two.AnnualSales_Brand1 <= xxx and

          -xxx <= one.AnnualSales_Brand2 - two.AnnualSales_Brand2 <= xxx);


This matching is based on finding values that fall within a certain xxx dollar amount, but you could use ratios to get within certain % or other matching criteria.  It will also provide a one to many match.  If you want to have a 1:1 match then you could randomly select one Two_StoreID for each One_StoreID value.


Of course if your first sample was randomly selected, then taking another random selection for you 2nd sample may provide you with a "similar" data set too.

Hope this helps!

MarcosAramburu
Calcite | Level 5

Hi Mostater,

Again, thank you for your contribution.

At the end, I did something else. For your reference, here is what I did:

First I summed up sales for the category of interest (beers). Thus, I ended up with a table having Beer Sales by Store By Month.

Next, I calculated the euclidean distance between each store and all the rest. So I took the difference in Beer Sales for each month store by store against the rest, and squared that. After, I summed up the squared differences by month resulting in one number per each combination of stores. Finally I calculated the squared root of that sum, and that's the euclidean distance.

Then, I just sorted the table by that distance and took the store pair (test-control) where that distance was the minimum for the test store.

With this method, I ended up with 2 samples, one test and one control, with similar sales trend month by month over 20 months. Nice, isn't it?

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 4 replies
  • 2328 views
  • 3 likes
  • 2 in conversation