I have two very large datasets with the exact same variables name and row counts, and I'm doing comparisons between the variables. I'm hoping to generate frequency counts in output format such as below:
Variable  0_data1  1_data1  0_data2  1_data2  etc 
ang1  45  3  45  2 

ang2  43  23  42  23 

ang3  44  43  44  43 

ang4  23  343  23  321 

etc 





All variables are in character format with level '0' or '1', and I need to compare counts among levels. There are over 100 variables on each dataset , each dataset has over 10million rows.
To arrive at the output, I thought about outputting categorical variable frequency then merge outputs from both datasets by variable name.
Such that:
Original , there is no ID/index variable
ang1  ang2  ang3  ang4  Etc. 
0  1  0  0  0 
1  1  1  1  1 
0  1  0  1  0 
0  0  1  1  1 
etc 




To
Variable  0_data1  1_data1 
ang1  45  3 
ang2  43  23 
ang3  44  43 
ang4  23  343 
etc 


How to arrive at the above output for data1?
Appreciate for any inputs.
If it's 0/1 data, and you're only comparing counts that could be equivalent of comparing the mean and N of those variables, or sum.
So you can run each data set through proc means summarizing the Mean and N and then compare them. If you want the 0/1 values then it's bit more work but doable.
ods select none;
proc means data=have1 N SUM MEAN;
var _numeric_;
ods output want = _summary1;
run;
ods select all;
Run something like that for each data set and then combine the results as follows:
data want;
length ds dsname $50.;
set _summary1 _summary2 indsname=ds;
dsname = ds;
run;
Then you can use proc transpose if you really want the long format. Otherwise you transpose and then merge so really it's the same operations, different order.
If your data is character this will not work. Instead you'll need to use PROC FREQ to replace the PROC MEANS.
Rough idea is here https://gist.github.com/statgeek/e0903d269d4a71316a4e
Refer to all variables using _all_, _numeric_ or _character_.
Here is a reference that illustrates how to refer to variables and datasets in a short cut list:
https://blogs.sas.com/content/iml/2018/05/29/6easywaystospecifyalistofvariablesinsas.html
@lydiawawa wrote:
I have two very large datasets with the exact same variables name and row counts, and I'm doing comparisons between the variables. I'm hoping to generate frequency counts in output format such as below:
Variable
0_data1
1_data1
0_data2
1_data2
etc
ang1
45
3
45
2
ang2
43
23
42
23
ang3
44
43
44
43
ang4
23
343
23
321
etc
All variables are in character format with level '0' or '1', and I need to compare counts among levels. There are over 100 variables on each dataset , each dataset has over 10million rows.
To arrive at the output, I thought about outputting categorical variable frequency then merge outputs from both datasets by variable name.
Such that:
Original , there is no ID/index variable
ang1
ang2
ang3
ang4
Etc.
0
1
0
0
0
1
1
1
1
1
0
1
0
1
0
0
0
1
1
1
etc
To
Variable
0_data1
1_data1
ang1
45
3
ang2
43
23
ang3
44
43
ang4
23
343
etc
How to arrive at the above output for data1?
Appreciate for any inputs.
If it's 0/1 data, and you're only comparing counts that could be equivalent of comparing the mean and N of those variables, or sum.
So you can run each data set through proc means summarizing the Mean and N and then compare them. If you want the 0/1 values then it's bit more work but doable.
ods select none;
proc means data=have1 N SUM MEAN;
var _numeric_;
ods output want = _summary1;
run;
ods select all;
Run something like that for each data set and then combine the results as follows:
data want;
length ds dsname $50.;
set _summary1 _summary2 indsname=ds;
dsname = ds;
run;
Then you can use proc transpose if you really want the long format. Otherwise you transpose and then merge so really it's the same operations, different order.
If your data is character this will not work. Instead you'll need to use PROC FREQ to replace the PROC MEANS.
Rough idea is here https://gist.github.com/statgeek/e0903d269d4a71316a4e
Refer to all variables using _all_, _numeric_ or _character_.
Here is a reference that illustrates how to refer to variables and datasets in a short cut list:
https://blogs.sas.com/content/iml/2018/05/29/6easywaystospecifyalistofvariablesinsas.html
@lydiawawa wrote:
I have two very large datasets with the exact same variables name and row counts, and I'm doing comparisons between the variables. I'm hoping to generate frequency counts in output format such as below:
Variable
0_data1
1_data1
0_data2
1_data2
etc
ang1
45
3
45
2
ang2
43
23
42
23
ang3
44
43
44
43
ang4
23
343
23
321
etc
All variables are in character format with level '0' or '1', and I need to compare counts among levels. There are over 100 variables on each dataset , each dataset has over 10million rows.
To arrive at the output, I thought about outputting categorical variable frequency then merge outputs from both datasets by variable name.
Such that:
Original , there is no ID/index variable
ang1
ang2
ang3
ang4
Etc.
0
1
0
0
0
1
1
1
1
1
0
1
0
1
0
0
0
1
1
1
etc
To
Variable
0_data1
1_data1
ang1
45
3
ang2
43
23
ang3
44
43
ang4
23
343
etc
How to arrive at the above output for data1?
Appreciate for any inputs.
The github example worked very well. Thanks!
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.