Hi,
I have a results dataset with study participant IDs and string result values. The data were recorded over different time period, hence there are multiple rows for each ID, and some IDs have the same result value at different time periods (e.g. for ID #1 resultB is red-apples at both time periods),
data have;
input ID resultA :$200. resultB :$200. ;
datalines;
1 green-apples red-apples
1 red-apples red-apples
2 green-apples yellow-pears
2 red-apples yellow-pears
3 green-apples red-apples
3 green-apples green-apples
4 red-apples green-apples
4 red-apples red-apples
;
run;
I want to generate a unique list of all string result values, and then determine how many IDs had this result (so collapsing the result string values by ID/reporting the result string by total number of IDs with this result...). From my searching on the forum so far, I think I need to first append the resultA and result B columns..
data Appended;
set have;
array NAMES resultA--resultB;
do i = 1 to dim(NAMES);
if not missing (NAMES{i}) then do;
All_Names = NAMES{i};
output;
end;
end;
drop resultA--resultB i;
run;
And then perhaps use proc report to generate a summary, similar to a pivot table in Excel
proc report data=Appended;
column All_Names id ;
define All_Names/group "Names";
define ID/group "id";
run;
But I'm not sure if this is the most efficient way to achieve what I want (dataset below)
data want;
input Names :$200. ID_count ;
datalines;
green-apples 4
red-apples 4
yellow-pears 1
;
run;
Any help much appreciated. Happy to clarify if above unclear.
Thank you,
I find counting is usually easily done in PROC FREQ.
proc freq data=appended;
table id*all_names/noprint out=_counts1_;
run;
proc freq data=_counts1_;
table all_names/noprint out=want;
run;
Your idea for reshaping is correct though I wouldn't think of it as appending.
See if this is what you need:
proc sql; create table want as select all_names, count(*) from (select distinct all_names,id from Appended) group by all_names ; quit;
The sub-query in the From, the (select distinct all_names,id from Appended) gets unique combinations of the all_names and id. the outer select with the group by gets the count of those unique by the values of all_names.
If the data set is real large sometimes the distinct bit can take a bit of time as there is sorting and comparisons going on in the background.
I find counting is usually easily done in PROC FREQ.
proc freq data=appended;
table id*all_names/noprint out=_counts1_;
run;
proc freq data=_counts1_;
table all_names/noprint out=want;
run;
Thank you very much, @ballardw and @PaigeMiller - both of your responses are practical and readily solve my question (I wish I could accept them both as the solution!).
Thank you very much again for your help.
One related query I have is, how I can reconcile reverse-order results in the dataset (e.g. ID #2 apples-red instead of red-apples). For example, based on the above data and code,
data have;
input ID resultA :$200. resultB :$200. ;
datalines;
1 green-apples red-apples
1 red-apples red-apples
2 green-apples yellow-pears
2 apples-red yellow-pears
3 green-apples red-apples
3 green-apples green-apples
4 red-apples green-apples
4 red-apples red-apples
;
run;
/*APPEND resultA--resultB COLUMNS */
data Appended;
set have;
array NAMES resultA--resultB;
do i = 1 to dim(NAMES);
if not missing (NAMES{i}) then do;
All_Names = NAMES{i};
output;
end;
end;
drop resultA--resultB i;
run;
proc sql;
create table want as
select all_names, count(*)
from (select distinct all_names,id from Appended)
group by all_names
;
quit;
/*ALTERNATIVELY, USING PROC FREQ*/
proc freq data=appended;
table id*all_names/noprint out=_counts1_;
run;
proc freq data=_counts1_;
table all_names/noprint out=want;
run;
You can see that the frequency count reports for 3 "red-apples" and 1 for "apples-red", but I really just want the results to show for "apples-red" format - perhaps I need to clean the results data so the order of the 2 words either side of the hyphen is consistent throughout the results... can you recommend an efficient way to do this?
Thank you again,
In a SAS data step, after your create data set APPENDED (or even while you are creating APPENDED), split the text string into two words at the hyphen, so you would get (depending on the original text string)
word1=apples
word2=red
OR
word1=red
word2=apples
Then sort the values of WORD1 and WORD2 into alphabetical order, using CALL SORTC() In this case, apples will always be first and red will always be second.
Then combine WORD1 and WORD2 back into one text string and run the double PROC FREQ as before.
My first reaction to seeing your example was "why are there two values in one variable". That almost never bodes well for good analysis or reporting.
This question then also brings up the possible case of what if you only have "apples" or "red" but not both for some value?
Hi @ballardw , the output data are numerical and linked to a reference file containing "two values in one variable", as you note. The examples are just that, as I can't share my data.
You can sort them, but if you only have two values then a simple IF/THEN achieve the ordering.
Just stick it into the step that is generating the list of all values.
...
All_Names = NAMES{i};
if scan(all_names,1,'-') > scan(all_names,2,'-') then
all_names = catx('-',scan(all_names,2,'-'),scan(all_names,1,'-'))
;
...
Thanks Tom, this is pretty much how I ended up doing it based on @PaigeMiller 's helpful guidance
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.