Hi All,
I need some help in optimizing my code.
I have two datasets, sample and history. Where the sample contains randomly selected rows from the history file.
Later I want to inner join the sample with history based on some condition.
Sample dataset:
loan_id month var1 var2 var3
1111 3 a1 b1 c1
2222 8 a2 b2 c2
3333 5 a3 b3 c3
4444 2 a4 b4 c4
History dataset:
loan_id month var1 var2 var3
1111 3 a1 b1 c1
1111 4
1111 5
2222 8 a2 b2 c2
2222 9
2222 10
2222 11
3333 5 a3 b3 c3
3333 6
3333 7
3333 8
3333 9
4444 2 a4 b4 c4
4444 3
4444 4
4444 5
4444 6
output dataset:
1111 3 a1 b1 c1
1111 4 a1 b1 c1
1111 5 a1 b1 c1
2222 8 a2 b2 c2
2222 9 a2 b2 c2
2222 10 a2 b2 c2
2222 11 a2 b2 c2
3333 5 a3 b3 c3
3333 6 a3 b3 c3
3333 7 a3 b3 c3
3333 8 a3 b3 c3
3333 9 a3 b3 c3
4444 2 a4 b4 c4
4444 3 a4 b4 c4
4444 4 a4 b4 c4
4444 5 a4 b4 c4
4444 6 a4 b4 c4
The main aim is to get all the month records from history dataset, but retain the values of var1, var2, var3 from the sample dataset itself. I have achieved this through proc sql. my code is below:
proc sql;
create table outds as
select a.loan_id, a.var1, a.var2, a.var3, b.month
from sample a, history b
where a.loan_id = b.loan_id
and a.month <= b.month
quit;
I want to optimize this, if there is any alternative way to do this in sas datastep or any other procedure in sas, as I am planning to do further analysis on the sample.
Please help me with this.
Thanks
data Sample;
input loan_id$ month var1$ var2$ var3$;
datalines;
1111 3 a1 b1 c1
2222 8 a2 b2 c2
3333 5 a3 b3 c3
4444 2 a4 b4 c4
;
data History;
input loan_id$ month var1$ var2$ var3$;
infile datalines truncover;
datalines;
1111 3 a1 b1 c1
1111 4
1111 5
2222 8 a2 b2 c2
2222 9
2222 10
2222 11
3333 5 a3 b3 c3
3333 6
3333 7
3333 8
3333 9
4444 2 a4 b4 c4
4444 3
4444 4
4444 5
4444 6
;
proc sort data=sample; by loan_id; run;
proc sort data=History; by loan_id; run;
data outds;
merge History(keep=loan_id month) Sample;
by loan_id;
run;
Hi,
I forgot to add scenario where the starting month may not be same in history and sample. That was the reason I was joining on month (a.month <= b.month)
Also the history may have many more variables... but I want to retain only var1, var2, var3
Eg:
History dataset:
loan_id month var1 var2 var3 var4 var5 var6
1111 1 l r y u d a f
1111 2 f y e t s g
1111 3 a1 b1 c1 x y z
1111 4 p q r s t u
1111 5 e f g h i j
2222 7 k l m n o p
2222 8 a2 b2 c2 x y z
2222 9
2222 10
2222 11
3333 5 a3 b3 c3 x y z
3333 6
3333 7
3333 8
3333 9
4444 1
4444 2 a4 b4 c4 x y z
4444 3
4444 4
4444 5
4444 6
The output should be:
loan_id month var1 var2 var3 var4 var5 var6
1111 3 a1 b1 c1 x y z
1111 4 a1 b1 c1 s t u
1111 5 a1 b1 c1 h i j
2222 7 a1 b1 c1 n o p
2222 8 a2 b2 c2 x y z
2222 9 a2 b2 c2
2222 10 a2 b2 c2
2222 11 a2 b2 c2
so on....
In your merge statement contains only loan_id. This may not give same results...
@PeterClemmensen's data step needs a slight modification:
data outds;
merge
History (in=a keep=loan_id month)
Sample (in=b rename=(month=smonth))
;
by loan_id;
if a and b;
if smonth <= month;
drop smonth;
run;
Hi,
Since we are not using month in the by statement, few records are being dropped.
The counts are not matching with proc sql results.
Post your exact data in a data step (see @PeterClemmensen's example) and the sql you run against it. Use code posting windows, as described here, so that the code can easily be copy/pasted.
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.