Hi all,
I have a bit of a tricky situation. In the dataset below, I have multiple ID's, dates and codes. I'm trying to figure out the best way to manipulate it such that:
for each ID and DATE that are the same, keep rows where CODE in (0441T, 0442T).
OR
if ID and DATE are the same, keep rows that have a combination of both CODES 64640 AND C2618.
Data in bold is what should be kept if the logic works correctly.
CODE | ID | DATE |
64640 | 12345 | 06/30/2020 |
64640 | 12345 | 06/30/2020 |
64640 | 23456 | 02/10/2020 |
C2618 | 23456 | 02/10/2020 |
C2618 | 34567 | 06/24/2020 |
C2618 | 45678 | 01/06/2020 |
0442T | 56789 | 02/11/2020 |
C2618 | 67890 | 02/11/2020 |
64640 | 78901 | 02/14/2020 |
0441T | 89012 | 02/10/2020 |
C2618 | 90123 | 02/17/2020 |
C2618 | 11234 | 01/22/2020 |
64640 | 22345 | 02/10/2020 |
64640 | 22345 | 02/10/2020 |
C2618 | 22345 | 02/10/2020 |
64640 | 22345 | 02/10/2020 |
64640 | 22345 | 02/10/2020 |
64640 | 22345 | 02/10/2020 |
0441T | 33456 | 03/16/2020 |
C2618 | 33456 | 03/16/2020 |
64640 | 44555 | 03/22/2020 |
C2618 | 44555 | 04/18/2020 |
Any ideas on how to approach this?
Ah, well, you didn't say that in your original post, but it can be accommodated. Take a look at the below code and results.
I'm basically scanning through all the rows and setting Boolean flags for each of the four codes you specified. When LAST. is true, I evaluate the flags to see what combination of codes has been encountered. If the combination meets the criteria, a row is added to the Keys dataset which is later inner joined with the Have dataset to obtain our final results. I've found the Boolean flag technique for situations like this. It's worth knowing about.
Jim
DATA Have;
FORMAT ID $5. Date MMDDYY10. Code $5.;
INFILE DATALINES DSD DLM='09'X;
INPUT CODE $ ID $ DATE : MMDDYY10.;
DATALINES;
64640 12345 06/30/2020
64640 12345 06/30/2020
64640 23456 02/10/2020
C2618 23456 02/10/2020
C2618 34567 06/24/2020
C2618 45678 01/06/2020
0442T 56789 02/11/2020
C2618 67890 02/11/2020
64640 78901 02/14/2020
0441T 89012 02/10/2020
C2618 90123 02/17/2020
C2618 11234 01/22/2020
64640 22345 02/10/2020
64640 22345 02/10/2020
C2618 22345 02/10/2020
64640 22345 02/10/2020
64640 22345 02/10/2020
64640 22345 02/10/2020
0441T 33456 03/16/2020
C2618 33456 03/16/2020
64640 44555 03/22/2020
C2618 44555 04/18/2020
;
RUN;
PROC SORT DATA=Have;
BY ID DATE;
RUN;
DATA Keys;
DROP _: Code;
SET Have;
BY ID DATE;
RETAIN _CD0441T 0
_CD0442T 0
_CD64640 0
_CDC2618 0
;
SELECT (UPCASE(CODE));
WHEN ('0441T')
_CD0441T = 1;
WHEN ('0442T')
_CD0442T = 1;
WHEN ('64640')
_CD64640 = 1;
WHEN ('C2618')
_CDC2618 = 1;
OTHERWISE
DO;
END;
END;
IF Last.DATE THEN
DO;
IF (_CD0441T AND NOT (_CD64640 OR _CDC2618)) OR
(_CD0442T AND NOT (_CD64640 OR _CDC2618)) OR
(_CD64640 AND _CDC2618) THEN
DO;
CALL MISSING(_CD0441T,_CD0442T,_CD64640,_CDC2618);
OUTPUT;
END;
ELSE
DO;
CALL MISSING(_CD0441T,_CD0442T,_CD64640,_CDC2618);
DELETE;
END;
END;
ELSE
DO;
DELETE;
END;
RUN;
PROC SQL NOPRINT;
CREATE TABLE Want AS
SELECT Have.*
FROM Have
INNER JOIN Keys
ON Have.ID = Keys.ID
AND Have.Date = Keys.Date
;
QUIT;
Results:
OK, interesting problem.
But why are the following exluded?
0441T | 33456 | 03/16/2020 |
C2618 | 33456 | 03/16/2020 |
I see a 0441T. Shouldn't those two be included?
Here's a sample program, but it's going to pick up the above two rows because of the 0441T. Results are below the code.
Jim
DATA Have;
FORMAT ID $5. Date MMDDYY10. Code $5.;
INFILE DATALINES DSD DLM='09'X;
INPUT CODE $ ID $ DATE : MMDDYY10.;
DATALINES;
64640 12345 06/30/2020
64640 12345 06/30/2020
64640 23456 02/10/2020
C2618 23456 02/10/2020
C2618 34567 06/24/2020
C2618 45678 01/06/2020
0442T 56789 02/11/2020
C2618 67890 02/11/2020
64640 78901 02/14/2020
0441T 89012 02/10/2020
C2618 90123 02/17/2020
C2618 11234 01/22/2020
64640 22345 02/10/2020
64640 22345 02/10/2020
C2618 22345 02/10/2020
64640 22345 02/10/2020
64640 22345 02/10/2020
64640 22345 02/10/2020
0441T 33456 03/16/2020
C2618 33456 03/16/2020
64640 44555 03/22/2020
C2618 44555 04/18/2020
;
RUN;
PROC SORT DATA=Have;
BY ID DATE;
RUN;
DATA Keys;
DROP _: Code;
SET Have;
BY ID DATE;
RETAIN _CD0441T 0
_CD0442T 0
_CD64640 0
_CDC2618 0
;
SELECT (UPCASE(CODE));
WHEN ('0441T')
_CD0441T = 1;
WHEN ('0442T')
_CD0442T = 1;
WHEN ('64640')
_CD64640 = 1;
WHEN ('C2618')
_CDC2618 = 1;
OTHERWISE
DO;
END;
END;
IF Last.DATE THEN
DO;
IF _CD0441T OR
_CD0442T OR
(_CD64640 AND _CDC2618) THEN
DO;
CALL MISSING(_CD0441T,_CD0442T,_CD64640,_CDC2618);
OUTPUT;
END;
ELSE
DO;
CALL MISSING(_CD0441T,_CD0442T,_CD64640,_CDC2618);
DELETE;
END;
END;
ELSE
DO;
DELETE;
END;
RUN;
PROC SQL NOPRINT;
CREATE TABLE Want AS
SELECT Have.*
FROM Have
INNER JOIN Keys
ON Have.ID = Keys.ID
AND Have.Date = Keys.Date
;
QUIT;
Results:
Ah, well, you didn't say that in your original post, but it can be accommodated. Take a look at the below code and results.
I'm basically scanning through all the rows and setting Boolean flags for each of the four codes you specified. When LAST. is true, I evaluate the flags to see what combination of codes has been encountered. If the combination meets the criteria, a row is added to the Keys dataset which is later inner joined with the Have dataset to obtain our final results. I've found the Boolean flag technique for situations like this. It's worth knowing about.
Jim
DATA Have;
FORMAT ID $5. Date MMDDYY10. Code $5.;
INFILE DATALINES DSD DLM='09'X;
INPUT CODE $ ID $ DATE : MMDDYY10.;
DATALINES;
64640 12345 06/30/2020
64640 12345 06/30/2020
64640 23456 02/10/2020
C2618 23456 02/10/2020
C2618 34567 06/24/2020
C2618 45678 01/06/2020
0442T 56789 02/11/2020
C2618 67890 02/11/2020
64640 78901 02/14/2020
0441T 89012 02/10/2020
C2618 90123 02/17/2020
C2618 11234 01/22/2020
64640 22345 02/10/2020
64640 22345 02/10/2020
C2618 22345 02/10/2020
64640 22345 02/10/2020
64640 22345 02/10/2020
64640 22345 02/10/2020
0441T 33456 03/16/2020
C2618 33456 03/16/2020
64640 44555 03/22/2020
C2618 44555 04/18/2020
;
RUN;
PROC SORT DATA=Have;
BY ID DATE;
RUN;
DATA Keys;
DROP _: Code;
SET Have;
BY ID DATE;
RETAIN _CD0441T 0
_CD0442T 0
_CD64640 0
_CDC2618 0
;
SELECT (UPCASE(CODE));
WHEN ('0441T')
_CD0441T = 1;
WHEN ('0442T')
_CD0442T = 1;
WHEN ('64640')
_CD64640 = 1;
WHEN ('C2618')
_CDC2618 = 1;
OTHERWISE
DO;
END;
END;
IF Last.DATE THEN
DO;
IF (_CD0441T AND NOT (_CD64640 OR _CDC2618)) OR
(_CD0442T AND NOT (_CD64640 OR _CDC2618)) OR
(_CD64640 AND _CDC2618) THEN
DO;
CALL MISSING(_CD0441T,_CD0442T,_CD64640,_CDC2618);
OUTPUT;
END;
ELSE
DO;
CALL MISSING(_CD0441T,_CD0442T,_CD64640,_CDC2618);
DELETE;
END;
END;
ELSE
DO;
DELETE;
END;
RUN;
PROC SQL NOPRINT;
CREATE TABLE Want AS
SELECT Have.*
FROM Have
INNER JOIN Keys
ON Have.ID = Keys.ID
AND Have.Date = Keys.Date
;
QUIT;
Results:
DATA Have; FORMAT ID $5. Date MMDDYY10. Code $5.; INFILE DATALINES expandtabs truncover; INPUT CODE $ ID $ DATE : MMDDYY10.; DATALINES; 64640 12345 06/30/2020 64640 12345 06/30/2020 64640 23456 02/10/2020 C2618 23456 02/10/2020 C2618 34567 06/24/2020 C2618 45678 01/06/2020 0442T 56789 02/11/2020 C2618 67890 02/11/2020 64640 78901 02/14/2020 0441T 89012 02/10/2020 C2618 90123 02/17/2020 C2618 11234 01/22/2020 64640 22345 02/10/2020 64640 22345 02/10/2020 C2618 22345 02/10/2020 64640 22345 02/10/2020 64640 22345 02/10/2020 64640 22345 02/10/2020 0441T 33456 03/16/2020 C2618 33456 03/16/2020 64640 44555 03/22/2020 C2618 44555 04/18/2020 ; RUN; proc sql; create table want as select * from have group by id,date having sum(CODE in ('0441T' '0442T'))=count(*) or (sum(code='64640') ne 0 and sum(code='C2618') ne 0 and count(distinct code)=2 ); quit;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.