i have a dataset with about 80 variables and half million records. i want to do a frequency check on each variable and compare the frequencies of each variable with the previous year's frequencies to see if there are any abnormal numbers with the previous year. i used the proc freq statement to get all the variables' frequencies and then i did the same thing to get the previous years frequencies. then i copy/paste both results to an excel file and compare them one by one. it is time consuming. so i am wondering if anybody know an easy way to do this task in sas. Example of the sas code that i used as following:
proc freq data=data2019;
table age diagage birthyear race raceeth hisp gender housing povertylevel
povlevpercent insurance risk................./missing;
run;
proc freq data=data2020;
table age diagage birthyear race raceeth hisp gender housing povertylevel
povlevpercent insurance risk................./missing;
run;
1. Combine your data sets into one, with a Year to identify which records come from which data set.
data combined;
set data2019 (in=in2019) data2020 (in=in2020);
if in2019 then year=2019;
else year=2020;
run;
2. Run a PROC FREQ by year for each variable and save it to a data set
proc freq data=combined noprint;
table year*(age diagage birthyear race raceeth hisp gender housing ....) / missing out=want;
ods output onewayFreqs=want2
run;
3. Transpose (PROC TRANSPOSE) so you have variable counts and each year, ie: (could also do this via PROC TABULATE but only for display purposes)
Variable 2019 2020
Age
<5 5 7
5 to 10 8 10
10 to 15 10 12
....
...
...
Fully worked example:
data combined;
set sashelp.heart (in=in2019) sashelp.heart (in=in2020 obs=140);
if in2019 then year=2019;
else year=2020;
run;
ods select none;
proc freq data=combined ;
table year*(ageAtStart BP_STATUS ) / missprint ;
ods output crossTabFreqs=want;
run;
ods select all;
data long;
length variable $32. variable_value $50.;
set want;
where not missing(year);
Variable=scan(table, 2, '*');
Variable_Value=strip(trim(vvaluex(variable)));
presentation=catt(frequency, " (", trim(put(percent/100, percent7.1)), ")");
if missing(variable_value) or variable_value = "." then variable_value = "MISSING";
keep year variable variable_value frequency percent presentation;
label variable='Variable' variable_value='Variable Value';
run;
proc sort data=long;
by variable variable_value;
run;
proc transpose data=long out=wide (drop = _:);
by variable variable_value;
id year;
var frequency;
run;
@juliajulia wrote:
i have a dataset with about 80 variables and half million records. i want to do a frequency check on each variable and compare the frequencies of each variable with the previous year's frequencies to see if there are any abnormal numbers with the previous year. i used the proc freq statement to get all the variables' frequencies and then i did the same thing to get the previous years frequencies. then i copy/paste both results to an excel file and compare them one by one. it is time consuming. so i am wondering if anybody know an easy way to do this task in sas. Example of the sas code that i used as following:
proc freq data=data2019;
table age diagage birthyear race raceeth hisp gender housing povertylevel
povlevpercent insurance risk................./missing;
run;
proc freq data=data2020;
table age diagage birthyear race raceeth hisp gender housing povertylevel
povlevpercent insurance risk................./missing;
run;
You may need think some of this through a bit more.
For instance, one of your variables is Birthyear. I would be very surprised if a data set named Data2019 had any 2020 values while there could be some in the Data2020.
What sort of value is PovertyLevel? If it is one of the cut off limits for a specific categorical range of the Federal Poverty Guidelines then the boundary values change every year.
Personally I would probably start by combining the two data sets making sure there is a variable that captures which set a particular record is from. Then use that variable such as
proc freq data=Combined; table ( age diagage birthyear race raceeth hisp gender housing povertylevel povlevpercent insurance risk.................) * datayear /missing norowpct nopct; run;
This will create data with the year as a column heading and you can see things next to each other.
It would likely be helpful if you have any variables that are continuous such as Height, Weight, Income, (and likely ages as well) to create groups, such as 5-year or 10-year age ranges, 10 or 20 pound/ kg weight bands, or BMI categories.
Thank you very much. I only compare the distribution of the variables but not the values of the variables because the values are different from year to year. So i think i just combine the two years datasets and use the year as a crosstab to check the frequency. It seems served my purpose.
|
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.