For each application_id, I expect to 1) identify the applicant which applied the application_id and then 2) identify application_id which applied by applicants over the past 5 years.
By using the following data as an example,
data have;
infile cards dsd dlm=",";
input
Obs :8.
Applicant_name :$8.
application_date: YYMMDD10.
application_id :8.
;
format
application_date :YYMMDD10.
;
cards;
1,A,2021-07-28,101
2,B,2021-07-28,101
3,A,2017-07-28,102
4,A,2016-07-27,103
5,B,2018-07-28,104
;;;;
run;
I expect to have the below result.
Obs,Applicant_name,application_date,application_id,applied_id
1,A, 2021-07-28, 101, 102
2,B, 2021-07-28, 101, 104
3,A, 2017-07-28, 102, 103
4,A, 2016-07-27, 103
It is because,
For application_id 101, it is applied by two applicants, which are A and B. and then I expect to include application_id 102, because it is applied by applicant A on 2017-07-28, which is within 5 years before 2021-07-28. I include application_id 104 for the same reason (because 1) B applied for application 101 in 2021-07-28, 2) B also applied application 104 in 2018-07-28, 3) 2018-07-28 within the 5-years before 2021-07-28). I do not include application_id 103, because although it is applied by applicant A, it is not applied within 5 years before 2021-07-28.
I am using the code,
%MACRO Explo(year=);
PROC SQL;
CREATE TABLE want_p5y_&year. AS
SELECT DISTINCT
citing1.Applicant_name,
citing1.application_date,
citing1.application_id,
citing2.application_id as applied_id
FROM
have as citing1
left JOIN have AS citing2 ON citing1.Applicant_name = citing2.Applicant_name
where citing1.application_id ^= citing2.application_id
group by citing1.application_date
having citing2.application_date between intnx('year', min(citing1.application_date), -5, 'S') and citing1.application_date
;
QUIT;
%MEND Explo;
%Explo(year=2021)
and I got the result as
Applicant_name | application_date | application_id | applied_id |
A | 2017/7/28 | 102 | 103 |
A | 2021/7/28 | 101 | 102 |
B | 2021/7/28 | 101 | 104 |
The problem is
1) I only get 3 rows rather than 4 rows.
2) My dataset has 17883455 rows and 3 columns (from 1985 to 2020). It is too big for me. So I split the dataset by year. Even so, when I run the dataset with the above code, I still get the following results
ERROR: Insufficient space in file WORK.'SASTMP-000000010'n.UTILITY. ERROR: File WORK.'SASTMP-000000010'n.UTILITY is damaged. I/O processing did not complete. NOTE: Error was encountered during utility-file processing. You may be able to execute the SQL statement successfully if you allocate more space to the WORK library.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SQL used (Total process time):
real time 56:41.45
cpu time 2:14.93
Could you please give me some advice to get the expected result? I guess there should be some smart methods to get the result easily and quickly. but I just do not know.
Many thanks in advance for any advice.
Seems to be the same problem as in https://communities.sas.com/t5/SAS-Programming/for-each-applicaiont-collect-applications-appllied-by.... Right? I tend to merge both posts ...
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.