Hello
I have a data set with 2 columns called ID and Vector.
Column "Vector" including concatenation of codes .
I want to create a new data set called wanted with a new column called "Ind" that classify each ID into one of two groups: Red or Black.
If field "Vector" contain at least one of the codes in macro variable black_list then Ind='Black'
else Ind='Green'
I also want to create new columns with binary values 1/0 of splitting column Vector.
The new columns names will be :Code_23, Code_48,Code_65 and so on.
The order of the columns will be by the numbers order (3,6,12,13 and so on)
What is the way to do it please?
Data have;
Input ID Vector $20.;
Cards;
1 23,48,65,19,82,17,15
2 16
3 98,76,13,19,25
4 87,12,52,76,24
5 3,28
6 6,16
7 19,82,25
8 28,98,26
9 76,43,25
10 43,65,87,98,12
;
Run;
%let black_list=25,19,18;
For the first, do this:
Data have;
Input ID Vector $20.;
Cards;
1 23,48,65,19,82,17,15
2 16
3 98,76,13,19,25
4 87,12,52,76,24
5 3,28
6 6,16
7 19,82,25
8 28,98,26
9 76,43,25
10 43,65,87,98,12
;
%let black_list=25,19,18;
data green_black;
set have;
ind = "green";
do i = 1 to countw("&black_list.",",");
if index(vector,scan("&black_list.",i,",")) then ind = "black";
end;
drop i;
run;
For the second, transpose to long, create a sorted template, and transpose back to wide:
data long;
set have;
length code $7;
value = 1;
do i = 1 to countw(vector,",");
code = "Code_" !! scan(vector,i,",");
output;
end;
drop i vector;
run;
proc sort data=long (keep=code) out=template nodup;
by code;
run;
data long2;
set
template
long
;
run;
proc transpose data=long out=wide (drop=_: where=(id ne .));
by id;
var value;
id code;
run;
data want;
set wide;
array nums _numeric_;
do over nums;
nums = coalesce(nums,0);
end;
run;
Consider the advantages of creating variables for all 99 codes. (You can adjust this if there are more than 99 possible codes.)
The programming will be easier. If you have multiple batches of data, they will all contain the same variables. And if you really want to eliminate codes which do not appear in the data, you can always do that later.
Here is the idea:
data want;
set have;
array codes {99} code_1 - code_99;
do _n_=1 to 99;
codes{_n_} = 0;
end;
do _n_=1 to countw(vector, ',');
codes{input(scan(vector, _n_, ','), 2.)} = 1;
end;
ind='green';
do _n_=&blacklist;
if codes{_n_}=1 then ind='black';
end;
run;
Now you're done ... both objectives ... unless you still want to remove codes that are always 0. If so, continue with:
proc summary data=want;
var code_1 - code_99;
output out=stats sum=;
run;
data _null_;
set stats;
length droplist $ 1000;
array codes {99} code_1-code_99;
do _n_=1 to 99;
if codes{_n_}=0 then droplist = catt (droplist, vname(codes{_n_}));
end;
call symputx('droplist', droplist);
end;
data really_want;
set want (drop=(&droplist));
run;
The code is all untested, so write back if you encounter any difficulty with testing and debugging.
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.