Solved: Generate Unique List of String Values from dataset and then Generate C...

Epi_Stats · Posted 01-27-2024 01:12 PM

Hi,

I have a results dataset with study participant IDs and string result values. The data were recorded over different time period, hence there are multiple rows for each ID, and some IDs have the same result value at different time periods (e.g. for ID #1 resultB is red-apples at both time periods),

data have;
input ID resultA :$200. resultB :$200. ;
datalines;
1 green-apples red-apples
1 red-apples red-apples
2 green-apples yellow-pears
2 red-apples yellow-pears
3 green-apples red-apples
3 green-apples green-apples
4 red-apples green-apples
4 red-apples red-apples
;
run;

I want to generate a unique list of all string result values, and then determine how many IDs had this result (so collapsing the result string values by ID/reporting the result string by total number of IDs with this result...). From my searching on the forum so far, I think I need to first append the resultA and result B columns..

data Appended;
set have;
array NAMES resultA--resultB;
do i = 1 to dim(NAMES);
	if not missing (NAMES{i}) then do;
	All_Names = NAMES{i};
	output;
	end;
end;
drop resultA--resultB i;
run;

And then perhaps use proc report to generate a summary, similar to a pivot table in Excel

proc report data=Appended;
column All_Names id ;
define All_Names/group "Names";
define ID/group "id";
run;

But I'm not sure if this is the most efficient way to achieve what I want (dataset below)

data want;
input Names :$200. ID_count ;
datalines;
green-apples 4 
red-apples 4
yellow-pears 1
;
run;

Any help much appreciated. Happy to clarify if above unclear.

Thank you,

PaigeMiller · Posted 01-27-2024 01:52 PM

I find counting is usually easily done in PROC FREQ.

proc freq data=appended;
	table id*all_names/noprint out=_counts1_;
run;
proc freq data=_counts1_;
	table all_names/noprint out=want;
run;

--
Paige Miller

View solution in original post

ballardw · Posted 01-27-2024 01:51 PM

Your idea for reshaping is correct though I wouldn't think of it as appending.

See if this is what you need:

proc sql;
   create table want as
   select all_names, count(*)
   from (select distinct all_names,id from Appended)
   group by all_names
   ;
quit;

The sub-query in the From, the (select distinct all_names,id from Appended) gets unique combinations of the all_names and id. the outer select with the group by gets the count of those unique by the values of all_names.

If the data set is real large sometimes the distinct bit can take a bit of time as there is sorting and comparisons going on in the background.

PaigeMiller · Posted 01-27-2024 01:52 PM

I find counting is usually easily done in PROC FREQ.

proc freq data=appended;
	table id*all_names/noprint out=_counts1_;
run;
proc freq data=_counts1_;
	table all_names/noprint out=want;
run;

--
Paige Miller

Epi_Stats · Posted 01-27-2024 03:11 PM

Thank you very much, @ballardw and @PaigeMiller - both of your responses are practical and readily solve my question (I wish I could accept them both as the solution!).

Thank you very much again for your help.

Epi_Stats · Posted 01-27-2024 03:27 PM

One related query I have is, how I can reconcile reverse-order results in the dataset (e.g. ID #2 apples-red instead of red-apples). For example, based on the above data and code,

data have;
input ID resultA :$200. resultB :$200. ;
datalines;
1 green-apples red-apples
1 red-apples red-apples
2 green-apples yellow-pears
2 apples-red yellow-pears
3 green-apples red-apples
3 green-apples green-apples
4 red-apples green-apples
4 red-apples red-apples
;
run;

/*APPEND resultA--resultB COLUMNS */
data Appended;
set have;
array NAMES resultA--resultB;
do i = 1 to dim(NAMES);
	if not missing (NAMES{i}) then do;
	All_Names = NAMES{i};
	output;
	end;
end;
drop resultA--resultB i;
run;

proc sql;
   create table want as
   select all_names, count(*)
   from (select distinct all_names,id from Appended)
   group by all_names
   ;
quit;

/*ALTERNATIVELY, USING PROC FREQ*/
proc freq data=appended;
	table id*all_names/noprint out=_counts1_;
run;
proc freq data=_counts1_;
	table all_names/noprint out=want;
run;

You can see that the frequency count reports for 3 "red-apples" and 1 for "apples-red", but I really just want the results to show for "apples-red" format - perhaps I need to clean the results data so the order of the 2 words either side of the hyphen is consistent throughout the results... can you recommend an efficient way to do this?

Thank you again,

PaigeMiller · Posted 01-27-2024 04:26 PM

In a SAS data step, after your create data set APPENDED (or even while you are creating APPENDED), split the text string into two words at the hyphen, so you would get (depending on the original text string)

word1=apples

word2=red

OR

word1=red

word2=apples

Then sort the values of WORD1 and WORD2 into alphabetical order, using CALL SORTC() In this case, apples will always be first and red will always be second.

Then combine WORD1 and WORD2 back into one text string and run the double PROC FREQ as before.

--
Paige Miller

ballardw · Posted 01-28-2024 11:08 AM

My first reaction to seeing your example was "why are there two values in one variable". That almost never bodes well for good analysis or reporting.

This question then also brings up the possible case of what if you only have "apples" or "red" but not both for some value?

Epi_Stats · Posted 02-08-2024 06:22 PM

Hi @ballardw , the output data are numerical and linked to a reference file containing "two values in one variable", as you note. The examples are just that, as I can't share my data.

Tom · Posted 01-28-2024 12:02 PM

You can sort them, but if you only have two values then a simple IF/THEN achieve the ordering.

Just stick it into the step that is generating the list of all values.

...
All_Names = NAMES{i};
if scan(all_names,1,'-') > scan(all_names,2,'-') then 
  all_names = catx('-',scan(all_names,2,'-'),scan(all_names,1,'-'))
;
...

Epi_Stats · Posted 02-08-2024 06:16 PM

Thanks Tom, this is pretty much how I ended up doing it based on @PaigeMiller 's helpful guidance

Generate Unique List of String Values from dataset and then Generate Count by participant ID

Re: Generate Unique List of String Values from dataset and then Generate Count by participant ID

Re: Generate Unique List of String Values from dataset and then Generate Count by participant ID

Re: Generate Unique List of String Values from dataset and then Generate Count by participant ID

Re: Generate Unique List of String Values from dataset and then Generate Count by participant ID

Re: Generate Unique List of String Values from dataset and then Generate Count by participant ID

Re: Generate Unique List of String Values from dataset and then Generate Count by participant ID

Re: Generate Unique List of String Values from dataset and then Generate Count by participant ID

Re: Generate Unique List of String Values from dataset and then Generate Count by participant ID

Re: Generate Unique List of String Values from dataset and then Generate Count by participant ID

Re: Generate Unique List of String Values from dataset and then Generate Count by participant ID

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!