Solved: Re: data set with all possible combinations of variable values

bnorth66 · Posted 05-15-2019 06:42 AM

is there a way to take a data set that contains 2 categorical variables and create an output data set with a single line for all possible combinations of those two variables (maybe I'd have to create data sets with duplicates first ? not sure)

if it makes it clearer in R this would be done as follows using the mtcars data (all combinatios returned) - many thanks

expand.grid(unique(mtcars$cyl),unique(mtcars$gear))
Var1 Var2
1 6 4
2 4 4
3 8 4
4 6 3
5 4 3
6 8 3
7 6 5
8 4 5
9 8 5

PaigeMiller · Posted 05-15-2019 08:02 AM

I think you need to use a Cartesian join in PROC SQL.

proc sql;
    create table want as select a.name,b.amount
         from have(keep=name) a,have(keep=amount) b;
quit;

--
Paige Miller

View solution in original post

Kurt_Bremser · Posted 05-15-2019 06:58 AM

Do you mean to do a frequency count for the distinct combinations? Most easily done in SQL:

proc sql;
create table want as
select var1, var2, count(*) as count
from have
group by var1, var2;
quit;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

bnorth66 · Posted 05-15-2019 07:56 AM

thanks, KurtBremser,

I must admit though that doesn't seem to be working for me - I suspect my question wasn't clear.

So I adapted some code/data online and run the code - is this what was intended ? I do just get the counts rather than all combinations - there is no entry in want for name=Raj and amount=20 for example

data have;
  input NAME $ AMOUNT ;
  datalines;
RAJ     90  
RAJ     40
RAVI    20  
JOHN    30  
JOSEPH  40  
run;

proc sql;
create table want as
select name, amount, count(*) as count
from have
group by name, amount;
quit;

many thanks again for your help

bnorth66 · Posted 05-15-2019 08:00 AM

thanks, Paige, good point the first column in the "R" example is jus the row number that R inserts - it is confusing apologies

for the data set I sent earlier the output I'd want would be

RAJ 20  
RAJ 30  
RAJ 40
RAJ 90 
RAVI 20  
RAVI 30  
RAVI 40
RAVi 90  
JOHN 20  
JOHN 30  
JOHN 40
JOHN 90 
JOSEPH 20  
JOSEPH 30  
JOSEPH 40
JOSEPH 90  
run;

thanks again for all your advice

Kurt_Bremser · Posted 05-15-2019 09:02 AM

You need to do a cartesian join with distinct, then:

proc sql;
create table want as
select distinct
  a.name as name,
  b.amount as amount
from have a, have b
;
quit;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

PaigeMiller · Posted 05-15-2019 08:02 AM

I think you need to use a Cartesian join in PROC SQL.

proc sql;
    create table want as select a.name,b.amount
         from have(keep=name) a,have(keep=amount) b;
quit;

--
Paige Miller

PaigeMiller · Posted 05-15-2019 07:56 AM

Some of us don't understand R code. However, you talk about 2 categorical variables, but you show an example with three variables, which is confusing. Could you show us the input SAS data set, and the desired output?

--
Paige Miller

Tom · Posted 05-15-2019 09:07 AM

So you only want the combinations based on the values that actually appear in the data?

Not ALL of the possible combinations based on some external definition of the population of possible values for those two variables?

For the latter you would need some other source of the list of possible values.

bnorth66 · Posted 05-15-2019 09:36 AM

many thanks, Paige and Kurt, both methods work brilliantly so I'm spoilt for choice 🙂

yes, thats right Tom - just the values in the actual data set would be OK at the moment though I can see maybe your more general problem might be useful sometimes too

many thanks again for all your help - sorry I'm a bit of a novice SAS user

Kurt_Bremser · Posted 05-15-2019 09:53 AM

If you need to treat the two RAJ's in your example data as separate persons, @PaigeMiller's code is the way to go. If it's one person, then mine with the distinct prevents the doubling of the RAJ entries.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

bnorth66 · Posted 05-15-2019 10:05 AM

Hi KurtBremser

ah yes ... you're right of course - I think its your code I need - many many thanks for all your help and patience !

Registration is open