Hi All, I am stuck with a SAS file and not sure how to progress.
The layout is -
Cust No | Prod | Acc flag | Bal |
---|---|---|---|
111 | a1 | qaz | 10 |
111 | b1 | wsx | 11 |
222 | c1 | wsx | 233 |
333 | d1 | grw | 251 |
444 | e1 | dc | 25 |
444 | f1 | vefw | 1231 |
444 | g1 | wefr | 36 |
555 | h1 | bdfgh | 615 |
666 | i1 | qe | 767 |
777 | j1 | erg | 31 |
777 | k1 | eth | 472 |
777 | l1 | ret | 251 |
777 | m1 | dfwef | 3761 |
888 | p1 | ffef | 36 |
I want to introduce two columns in this sas file. One lets say prod_con and other can be Acc_con.
It will look for the cust_no values in this above file. If it finds distinct values then it will simply copy the values of the column Prod and Acc Flag to the new columns prod_con and Acc_con respectively. But whenever it will find duplicates, it should concatenate the values in column Prod and Acc Flag and then copy them into the new columns prod_con and Acc_con respectively.
E.g. for Cust_no = 777, it should put j1k1l1m1 to prod_con and ergethretdfwef to Acc_con and this value will be repeated off course for all 777.
How about this:
Create an index on Cust_no on the file
Then create an extra file of unique customers numbers
Read the file with the unique customers numbers and for each one go to the original file using the KEY option in the SET statement to retrieve and concatenate the Prod and Acc_flag variables. The _iorc_=0 and _error_=0 at the end of the step are to prevent error messages in the log.
data original(index=(Cust_no));
input Cust_no Prod $ ACC_flag $ Bal $ ;
datalines;
111 a1 qaz 10
111 b1 wsx 11
222 c1 wsx 233
333 d1 grw 251
444 e1 dc 25
444 f1 vefw 1231
444 g1 wefr 36
555 h1 bdfgh 615
666 i1 qe 767
777 j1 erg 31
777 k1 eth 472
777 l1 ret 251
777 m1 dfwef 3761
888 p1 ffef 36
run;
proc sort data=original(keep=cust_no) out=customers nodupkey;
by Cust_no;
run;
data new;
set customers;
length Prod_con Acc_con $50;
do until (_iorc_ ne 0);
set original key = Cust_no;
if _iorc_ = 0 then do;
Prod_con = CATX('|',Prod_Con,Prod);
Acc_con = CATX('|',Acc_con,Acc_flag);
end;
end;
output;
_iorc_ = 0; _error_=0;
run;
proc print;
run;
Hi Tapas. I recommend doing two passes if you do know how many duplicates you can encounter. First to create your string and the second pass to add it back to the original records
data inds;
input custno prod $ acc_flag $ bal ;
datalines; 111 a1 qaz 10
111 b1 wsx 11
222 c1 wsx 233
333 d1 grw 251
444 e1 dc 25
444 f1 vefw 1231
444 g1 wefr 36
555 h1 bdfgh 615
666 i1 qe 767
777 j1 erg 31
777 k1 eth 472
777 l1 ret 251
777 m1 dfwef
3761
888 p1 ffef 36 ;
run;
proc sort data=inds; by custno prod; run;
data tmp_ds (keep=custno prod_con acc_con);
set inds;
by custno;
length prod_con acc_con $50.;
retain prod_con acc_con;
if first.custno then do;
prod_con =""; Acc_con = "";
end;
prod_con = trim(prod_con) || trim(prod);
acc_con = trim(acc_con) || trim(acc_flag);
if last.custno then output;
run;
data final_ds;
merge inds
tmp_ds;
by custno;
run;
@kevin_123,
That's pretty much how I would do it. One simplification you could make is to use the CALL CATS function in place of
Varname=TRIM(....) || TRIM(...).
So the 2 lines would be :
call cats(prod_con,prod);
call cats(acc_con,acc_flag);
Here's how I would approach this problem
Hi, From what I can see it is a typical 2XDOW scenario, if the group variable 'custno' is clustered, then no need to sort:
data have;
input (custno prod acc_flag) (:$) bal ;
datalines;
111 a1 qaz 10
111 b1 wsx 11
222 c1 wsx 233
333 d1 grw 251
444 e1 dc 25
444 f1 vefw 1231
444 g1 wefr 36
555 h1 bdfgh 615
666 i1 qe 767
777 j1 erg 31
777 k1 eth 472
777 l1 ret 251
777 m1 dfwef 3761
888 p1 ffef 36
;
data want;
do until (last.custno);
set have;
by custno notsorted;
length prod_con acc_con $50.;
prod_con=cats(prod_con,prod);
acc_con=cats(acc_con,acc_flag);
end;
do until (last.custno);
set have;
by custno notsorted;
output;
end;
run;
proc print;run;
Haikuo
I am also wondering if there is way to use cats() +group by in proc sql in a summary function kind of way, seems neat to me.
For this question, there is another data step way, originally from Mike Zdeb (I think):
data have;
input (custno prod acc_flag) (:$) bal ;
datalines;
111 a1 qaz 10
111 b1 wsx 11
222 c1 wsx 233
333 d1 grw 251
444 e1 dc 25
444 f1 vefw 1231
444 g1 wefr 36
555 h1 bdfgh 615
666 i1 qe 767
777 j1 erg 31
777 k1 eth 472
777 l1 ret 251
777 m1 dfwef 3761
888 p1 ffef 36
;
data want;
set have (in=up) have;
by custno;
length prod_con acc_con $50.;
retain prod_con acc_con;
if first.custno then call missing (prod_con,acc_con);
if up then do;
prod_con=cats(prod_con,prod);
acc_con=cats(acc_con,acc_flag);
end;
else output;
run;
proc print;run;
Haikuo
Hi Haikuo Bian, combining your two proposals might be optimal :
data want;
do until (last.custno);
set have (in=up) have;
by custno; /* NOTSORTED cannot be used here */
length prod_con acc_con $50.;
if up then do;
prod_con = cats(prod_con, prod);
acc_con = cats(acc_con, acc_flag);
end;
else output;
end;
run;
PG
notsorted removed by PG.
Yes, PG. That is a good combination, Thanks!
Haikuo
Questions, questions, questions ...
For a single customer, could you ever have two observations with the same Product? With the same Account flag?
The answers so far might be assuming uniqueness. That might be correct, but the question should at least be asked.
If duplicates are possible, should the same Product appear twice in the new prod_con variable?
The programming won't change drastically ... it's just a matter of tweaking it properly.
Possibly related: Could a product or an account flag ever contain an embedded blank?
How about this:
Create an index on Cust_no on the file
Then create an extra file of unique customers numbers
Read the file with the unique customers numbers and for each one go to the original file using the KEY option in the SET statement to retrieve and concatenate the Prod and Acc_flag variables. The _iorc_=0 and _error_=0 at the end of the step are to prevent error messages in the log.
data original(index=(Cust_no));
input Cust_no Prod $ ACC_flag $ Bal $ ;
datalines;
111 a1 qaz 10
111 b1 wsx 11
222 c1 wsx 233
333 d1 grw 251
444 e1 dc 25
444 f1 vefw 1231
444 g1 wefr 36
555 h1 bdfgh 615
666 i1 qe 767
777 j1 erg 31
777 k1 eth 472
777 l1 ret 251
777 m1 dfwef 3761
888 p1 ffef 36
run;
proc sort data=original(keep=cust_no) out=customers nodupkey;
by Cust_no;
run;
data new;
set customers;
length Prod_con Acc_con $50;
do until (_iorc_ ne 0);
set original key = Cust_no;
if _iorc_ = 0 then do;
Prod_con = CATX('|',Prod_Con,Prod);
Acc_con = CATX('|',Acc_con,Acc_flag);
end;
end;
output;
_iorc_ = 0; _error_=0;
run;
proc print;
run;
Is there any way to do this in proc sql?
I need to calcualte 40+ column for sum and mean. SQL is easier to calculate sum and mean, but how can I concatenate row values across each ID: E.g. for Cust_no = 777, it should put j1k1l1m1 to prod_con and get the total balance for each SSN in proc sql
Yes. There is one way. But you need to list all of these unique value as columns , You wouldn't want to that . That is horrible .
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.