For each application_id, I expect to 1) identify the applicant which applied the application_id and then 2) identify application_id which applied by applicants over the past 5 years.
By using the following data as an example,
data have;
infile cards dsd dlm=",";
input
Obs :8.
Applicant_name :$8.
application_date: YYMMDD10.
application_id :8.
;
format
application_date :YYMMDD10.
;
cards;
1,A,2021-07-28,101
2,B,2021-07-28,101
3,A,2017-07-28,102
4,A,2016-07-27,103
5,B,2018-07-28,104
;;;;
run;
I expect to have the below result.
Obs,Applicant_name,application_date,application_id,applied_id
1,A, 2021-07-28, 101, 102
2,B, 2021-07-28, 101, 104
3,A, 2017-07-28, 102, 103
4,A, 2016-07-27, 103
It is because,
For application_id 101, it is applied by two applicants, which are A and B. and then I expect to include application_id 102, because it is applied by applicant A on 2017-07-28, which is within 5 years before 2021-07-28. I include application_id 104 for the same reason (because 1) B applied for application 101 in 2021-07-28, 2) B also applied application 104 in 2018-07-28, 3) 2018-07-28 within the 5-years before 2021-07-28). I do not include application_id 103, because although it is applied by applicant A, it is not applied within 5 years before 2021-07-28.
I am using the code,
%MACRO Explo(year=);
PROC SQL;
CREATE TABLE want_p5y_&year. AS
SELECT DISTINCT
citing1.Applicant_name,
citing1.application_date,
citing1.application_id,
citing2.application_id as applied_id
FROM
have as citing1
left JOIN have AS citing2 ON citing1.Applicant_name = citing2.Applicant_name
where citing1.application_id ^= citing2.application_id
group by citing1.application_date
having citing2.application_date between intnx('year', min(citing1.application_date), -5, 'S') and citing1.application_date
;
QUIT;
%MEND Explo;
%Explo(year=2021)
and I got the result as
Applicant_name | application_date | application_id | applied_id |
A | 2017/7/28 | 102 | 103 |
A | 2021/7/28 | 101 | 102 |
B | 2021/7/28 | 101 | 104 |
The problem is
1) I only get 3 rows rather than 4 rows.
2) My dataset has 17883455 rows and 3 columns (from 1985 to 2020). It is too big for me. So I split the dataset by year. Even so, when I run the dataset with the above code, I still get the following results
ERROR: Insufficient space in file WORK.'SASTMP-000000010'n.UTILITY. ERROR: File WORK.'SASTMP-000000010'n.UTILITY is damaged. I/O processing did not complete. NOTE: Error was encountered during utility-file processing. You may be able to execute the SQL statement successfully if you allocate more space to the WORK library.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SQL used (Total process time):
real time 56:41.45
cpu time 2:14.93
Could you please give me some advice to get the expected result? I guess there should be some smart methods to get the result easily and quickly. but I just do not know.
Many thanks in advance for any advice.
Seems to be the same problem as in https://communities.sas.com/t5/SAS-Programming/for-each-applicaiont-collect-applications-appllied-by.... Right? I tend to merge both posts ...
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.