I would like to create two separate datasets based on whether the time was updated. As shown in the sample dataset below, the caseids with the same state were unchanged, but the sample IDs will change if the date is updated. I want to identify which casids have been changed based on the date. The result datasets are shown below.
data test;
Length State $2 CaseID $9 SampleID 7;
input State $ CaseID $ SampleID date: mmddyy10. ;
format date mmddyy10.;
infile datalines delimiter='#';
datalines;
GA#EC1R00002#7913247#01/04/2024#
GA#EC1R00002#7913289#01/06/2024#
GA#EC1R00007#8627568#01/04/2024#
GA#EC1R00007#8627568#01/04/2024#
IN#EC1N00002#7913259#03/04/2024#
IN#EC1N00002#7913389#03/09/2024#
IN#EC1N00009#8827568#01/04/2023#
IN#EC1N00009#8827588#01/24/2023#
KS#EC1K00010#7945647#03/04/2022#
KS#EC1K00010#7945647#03/04/2022#
KS#EC1K00027#8827568#01/04/2023#
KS#EC1K00027#8827588#01/24/2023#
;
run;
data Updated;
Length State $2 CaseID $9 SampleID 7 change $1;
input State $ CaseID $ SampleID date: mmddyy10. Change $;
format date mmddyy10.;
infile datalines delimiter='#';
datalines;
GA#EC1R00002#7913247#01/04/2024#Y#
GA#EC1R00002#7913289#01/06/2024#Y#
IN#EC1N00002#7913259#03/04/2024#Y#
IN#EC1N00002#7913389#03/09/2024#Y#
IN#EC1N00009#8827568#01/04/2023#Y#
IN#EC1N00009#8827588#01/24/2023#Y#
KS#EC1K00027#8827568#01/04/2023#Y#
KS#EC1K00027#8827588#01/24/2023#Y#
;
run;
data No_Updates;
Length State $2 CaseID $9 SampleID 7 change $1;
input State $ CaseID $ SampleID date: mmddyy10. change $1;
format date mmddyy10.;
infile datalines delimiter='#';
datalines;
GA#EC1R00007#8627568#01/04/2024#N#
GA#EC1R00007#8627568#01/04/2024#N#
KS#EC1K00010#7945647#03/04/2022#N#
KS#EC1K00010#7945647#03/04/2022#N#
;
run;
Not sure why you would want two datasets instead of one, but that is easy to do once you have derived the CHANGE variable.
First let's just add your expected value for the new CHANGE variable to your example input so it will be easier to see if we have derived CHANGE properly.
data test;
length State $2 CaseID $9 SampleID Date 8 expect $1;
format date yymmdd10.;
informat date mmddyy.;
input State--expect ;
datalines;
GA EC1R00002 7913247 01/04/2024 Y
GA EC1R00002 7913289 01/06/2024 Y
GA EC1R00007 8627568 01/04/2024 N
GA EC1R00007 8627568 01/04/2024 N
IN EC1N00002 7913259 03/04/2024 Y
IN EC1N00002 7913389 03/09/2024 Y
IN EC1N00009 8827568 01/04/2023 Y
IN EC1N00009 8827588 01/24/2023 Y
KS EC1K00010 7945647 03/04/2022 N
KS EC1K00010 7945647 03/04/2022 N
KS EC1K00027 8827568 01/04/2023 Y
KS EC1K00027 8827588 01/24/2023 Y
;
One easy way to tell if a group of observations has different values for a variable is the COUNT(DISTINCT ...) aggregate function in SQL.
proc sql;
create table want as
select *
, case when (count(distinct sampleid)>1) then 'Y' else 'N' end as Change
from test
group by state,caseid
;
quit;
Results
Sample Obs State CaseID ID Date expect Change 1 GA EC1R00002 7913247 2024-01-04 Y Y 2 GA EC1R00002 7913289 2024-01-06 Y Y 3 GA EC1R00007 8627568 2024-01-04 N N 4 GA EC1R00007 8627568 2024-01-04 N N 5 IN EC1N00002 7913389 2024-03-09 Y Y 6 IN EC1N00002 7913259 2024-03-04 Y Y 7 IN EC1N00009 8827568 2023-01-04 Y Y 8 IN EC1N00009 8827588 2023-01-24 Y Y 9 KS EC1K00010 7945647 2022-03-04 N N 10 KS EC1K00010 7945647 2022-03-04 N N 11 KS EC1K00027 8827588 2023-01-24 Y Y 12 KS EC1K00027 8827568 2023-01-04 Y Y
Looks like that logic worked for your example.
If you want to create two separate datasets that is now simple:
data Updated No_Updates;
set want;
if change='Y' then output Updated;
else output No_Updates;
run;
I am guessing that something in data set TEST is supposed to tell us that something was updated. What is it?
If I have to guess at a rule I am pretty sure I am going to miss something. Plus you may have cases in the actual data that have other rules that might not be guessed from a small example.
Not sure why you would want two datasets instead of one, but that is easy to do once you have derived the CHANGE variable.
First let's just add your expected value for the new CHANGE variable to your example input so it will be easier to see if we have derived CHANGE properly.
data test;
length State $2 CaseID $9 SampleID Date 8 expect $1;
format date yymmdd10.;
informat date mmddyy.;
input State--expect ;
datalines;
GA EC1R00002 7913247 01/04/2024 Y
GA EC1R00002 7913289 01/06/2024 Y
GA EC1R00007 8627568 01/04/2024 N
GA EC1R00007 8627568 01/04/2024 N
IN EC1N00002 7913259 03/04/2024 Y
IN EC1N00002 7913389 03/09/2024 Y
IN EC1N00009 8827568 01/04/2023 Y
IN EC1N00009 8827588 01/24/2023 Y
KS EC1K00010 7945647 03/04/2022 N
KS EC1K00010 7945647 03/04/2022 N
KS EC1K00027 8827568 01/04/2023 Y
KS EC1K00027 8827588 01/24/2023 Y
;
One easy way to tell if a group of observations has different values for a variable is the COUNT(DISTINCT ...) aggregate function in SQL.
proc sql;
create table want as
select *
, case when (count(distinct sampleid)>1) then 'Y' else 'N' end as Change
from test
group by state,caseid
;
quit;
Results
Sample Obs State CaseID ID Date expect Change 1 GA EC1R00002 7913247 2024-01-04 Y Y 2 GA EC1R00002 7913289 2024-01-06 Y Y 3 GA EC1R00007 8627568 2024-01-04 N N 4 GA EC1R00007 8627568 2024-01-04 N N 5 IN EC1N00002 7913389 2024-03-09 Y Y 6 IN EC1N00002 7913259 2024-03-04 Y Y 7 IN EC1N00009 8827568 2023-01-04 Y Y 8 IN EC1N00009 8827588 2023-01-24 Y Y 9 KS EC1K00010 7945647 2022-03-04 N N 10 KS EC1K00010 7945647 2022-03-04 N N 11 KS EC1K00027 8827588 2023-01-24 Y Y 12 KS EC1K00027 8827568 2023-01-04 Y Y
Looks like that logic worked for your example.
If you want to create two separate datasets that is now simple:
data Updated No_Updates;
set want;
if change='Y' then output Updated;
else output No_Updates;
run;
For your example you could also check for changes in DATE. But perhaps your general case is to look for changes in either.
proc sql;
create table want as
select *
, case when (count(distinct date)>1) then 'Y'
when (count(distinct sampleid)>1) then 'Y'
else 'N'
end as Change
from test
group by state,caseid
;
quit;
Read all obs for each id twice. First to determine max and min dates. Second to output to one of two datasets, based on whether min and max dates are the same.
data updated (drop=_:) notupdated (drop=_:) ;
do until (last.id);
set test ;
by id notsorted;
_mindate=min(date,_mindate);
_maxdate=max(date,_maxdate);
end;
do until (last.id);
set test ;
by id notsorted;
if _mindate^=_maxdate then output updated;
else output notupdated;
end;
run;
Thanks for all the expert suggestions. I will try to modify the code to run my actual dataset.
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.