Conditional flagging id by few variables

Miracle · Posted 03-04-2021 10:20 PM

Dear All,

Can I please seek your help on this by using sql?

I have a dataset containing id, date_1, date_2 and indication.

I would like to derive get, keep, date_3 and group.

The rationale is to flag keep=Y for the all the data rows with the earliest dates of both date_1 and date_2 for all the ids
If the data rows of keep=Y having both the same month and year i.e. id=1, then date_3=date_2 and group=Gen
If the data rows of keep=Y having different month and year i.e. id=2, then group=Indication.

Thanking you in advance.

id	date_1	        date_2	       indication	keep	date_3	       group
1	01-02-2020	20-02-2020	Y			
1	01-02-2018	03-03-2018	Y			
1	01-02-1990	20-02-1990	N	        Y	       20-02-1990	Gen
1	01-02-1990	20-02-1990	Y	        Y	       20-02-1990	Gen
1	01-02-1990	20-02-1990	Y	        Y	       20-02-1990	Gen
1	01-02-1990	20-02-1990	N	        Y	       20-02-1990	Gen
2	01-06-2009	25-08-2010	Y	        Y		                        Y
2	01-06-2009	25-08-2010	N	        Y		                        N
2	01-06-2009	25-08-2010	N	        Y		                        N
2	01-06-2009	25-08-2010	Y	        Y		                        Y
2	01-05-2018	08-06-2019	Y			
2	01-05-2018	04-12-2019	Y

andreas_lds · Posted 03-05-2021 01:01 AM

Please post the data in usable form and explain why you have to use proc sql. Is this homework?

Miracle · Posted 03-05-2021 01:23 AM

Thanks @andreas_lds for your response.

It is not homework.

It is a claim data with >1billion data rows.

I was advised to use sql because it will run faster but I am not good at sql.
Thanks.

andreas_lds · Posted 03-05-2021 02:28 AM

If the data is stored in a database, i agree: using proc sql with passthrough is a good idea, but if the data is stored in a dataset, i would avoid using proc sql, especially if it is large. But maybe @ChrisNZ can share his thoughts about performance.

Miracle · Posted 03-05-2021 02:41 AM

Yes, you are correct. The data is stored in a database.
Thanks @andreas_lds for pointing out.

ChrisNZ · Posted 03-05-2021 02:57 AM

It is always a good idea to pass the SQL to the remote database.

The reason is not that the database is faster than SAS (It could be. Or not), but that the data transfer is always very slow, especially for the volumes mentioned.

The code I provided can be adapted:

- Write the SQL as pass-through

- Change the date-comparison test to a syntax appropriate to the database rather than using function INTNX()

- Change the IFC() function to a CASE clause

- Combine both statements into a single statement using a subclause

Merging the original data to the flagged data will also require some other logic as presumably the row number cannot be used.

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 03-05-2021 03:21 AM

This query gives the same result as the previous query, and is easier to translate :

proc sql;
  select *
       , month(DATE_1)=month(DATE_2) and year(DATE_1)=year(DATE_2) as EQ
       , KEEP='Y' as K
       , case when calculated K and calculated EQ=1 then 'Gen'
              when calculated K and calculated EQ=0 then INDICATION
              else                                       ' '         end as GROUP
  from HAVE        
  group by ID, KEEP
  order by ROW;

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 03-05-2021 02:42 AM

Sorry your need is unclear.

Which date do we consider to assess same month and year?

What happens is same month but not same year, or same year but not same month?

This should get you started:

data HAVE;   
  ROW=_N_;
  infile cards missover;
  input ID DATE_1 ddmmyy10. DATE_2 : ddmmyy10.  INDICATION $ KEEP $;
  format   DATE_1 ddmmyy10. DATE_2 : ddmmyy10.;
cards;
1 01-02-2020 20-02-2020 Y   
1 01-02-2018 03-03-2018 Y   
1 01-02-1990 20-02-1990 N         Y        20-02-1990 
1 01-02-1990 20-02-1990 Y         Y        20-02-1990 
1 01-02-1990 20-02-1990 Y         Y        20-02-1990 
1 01-02-1990 20-02-1990 N         Y        20-02-1990 
2 01-06-2009 25-08-2010 Y         Y                   
2 01-06-2009 25-08-2010 N         Y                   
2 01-06-2009 25-08-2010 N         Y                   
2 01-06-2009 25-08-2010 Y         Y                   
2 01-05-2018 08-06-2019 Y   
2 01-05-2018 04-12-2019 Y   
run;
 
proc sql;    
  create table KEEP_Y as
  select *, intnx('month',min(DATE_1),0) =  intnx('month',min(DATE_2),0) as EQ  
  from HAVE  
  where KEEP='Y'  
  group by ID;
proc sql;
  create table WANT as 
  select h.*, ifc(EQ=1 & h.KEEP='Y','Gen',ifc(EQ=0 & h.KEEP='Y',h.INDICATION,' ')) as GROUP
  from HAVE        h
         left join
       KEEP_Y      k
         on  h.ROW=k.ROW
  order by h.ROW;

ROW	ID	DATE_1	DATE_2	INDICATION	KEEP	GROUP
1	1	01/02/2020	20/02/2020	Y
2	1	01/02/2018	03/03/2018	Y
3	1	01/02/1990	20/02/1990	N	Y	Gen
4	1	01/02/1990	20/02/1990	Y	Y	Gen
5	1	01/02/1990	20/02/1990	Y	Y	Gen
6	1	01/02/1990	20/02/1990	N	Y	Gen
7	2	01/06/2009	25/08/2010	Y	Y	Y
8	2	01/06/2009	25/08/2010	N	Y	N
9	2	01/06/2009	25/08/2010	N	Y	N
10	2	01/06/2009	25/08/2010	Y	Y	Y
11	2	01/05/2018	08/06/2019	Y
12	2	01/05/2018	04/12/2019	Y

High-Performance SAS Coding - Third Edition

Conditional flagging id by few variables

Re: Conditional flagging id by few variables

Re: Conditional flagging id by few variables

Re: Conditional flagging id by few variables

Re: Conditional flagging id by few variables

Re: Conditional flagging id by few variables

Re: Conditional flagging id by few variables

Re: Conditional flagging id by few variables

SAS Innovate 2025: Call for Content

Classroom Training Available!