Hello. I'm having trouble with writing code to get a subset of data.
Here's what I have:
Data looks like this (I have thousands of records, but an example).
RECORD | Date_Start |
123456 | 6/1/2023 |
123456 | 6/1/2023 |
123456 | 6/1/2023 |
I need to find unique values for the record and the date_start. So, for this example, I would want only 1 unique record in a subset to have the record and the date_start, like this:
RECORD | Date_Start |
123456 | 6/1/2023 |
I also have this:
RECORD | Date_Start |
1234567 | 6/1/2023 |
1234567 | 6/2/2023 |
1234567 | 6/3/2023 |
For this, I'd expect to have 3 unique records in the new subset, since all are unique with respects to the record and the date_start.
I'm relatively new to SAS, so it's a bit of a challenge for me. I've tried sorting the set by record and date_start. Then creating a new variable that's retained, using first. etc, but I'm getting every record of the main data set exported to my subset data set.
Any ideas are appreciate.
SQL is easy enough.
data have;
input RECORD $ Date_Start mmddyy10.;
format date_start mmddyys10.;
cards;
1234560 6/1/2023
1234561 6/1/2023
1234561 6/1/2023
1234567 6/1/2023
1234567 6/2/2023
1234568 6/1/2023
1234568 6/2/2023
1234568 6/3/2023
1234569 6/1/2023
1234569 6/2/2023
1234569 6/2/2023
1234569 6/3/2023
1234569 6/3/2023
1234569 6/3/2023
;
proc sql;
create table tableA as
select *
from have
where record in (select record
from have
group by record
having max(Date_Start)=min(Date_Start)
);
create table tableA as
select *
from have
where record not in (select record
from have
group by record
having max(Date_Start)=min(Date_Start)
);
quit;
proc sort data=have;
by record date_start;
run;
data want;
set have;
by record date_start;
if first.date_start;
run;
Thanks for this. I have done this and I think I have not made my initial question clear enough. I have to get two separate subsets. One where the subset consists of instances like this:
RECORD | Date_Start |
123456 | 6/1/2023 |
123456 | 6/1/2023 |
123456 | 6/1/2023 |
So, this scenario would go to a new dataset called A.
Then instances where it is like this (so one record and multiple start dates for that record):
RECORD | Date_Start |
1234567 | 6/1/2023 |
1234567 | 6/2/2023 |
1234567 | 6/3/2023 |
This would go to a new dataset called B.
So, I need to new sets, one that contains instances where one record and consistent dates. Another that contains instances where one record and various dates. I apologize for not being more detailed.
The input dataset has 3 total variables. Only these 2 are relevant though, record and date_start.
@JH74 wrote:
I have done this and I think I have not made my initial question clear enough. I have to get two separate subsets. One where the subset consists of instances like this:
RECORD Date_Start 123456 6/1/2023 123456 6/1/2023 123456 6/1/2023 So, this scenario would go to a new dataset called A.
Then instances where it is like this (so one record and multiple start dates for that record):
RECORD Date_Start 1234567 6/1/2023 1234567 6/2/2023 1234567 6/3/2023 This would go to a new dataset called B.
So, I need to new sets, one that contains instances where one record and consistent dates. Another that contains instances where one record and various dates. I apologize for not being more detailed.
And its still not clear to me. You are phrasing the rule in terms of consistent dates, and another case where there are various dates. What happens if a RECORD has both, say three records for RECORD 45678 on 6/1/2023 and then one record for 45678 on 6/2/2023 and two records for 45678 on 6/3/2023.
Please give an explanation that covers all possibilities, or if a situation is impossible and never happens, state those cases as well.
Thanks for your question. The data can have any number of combinations. The record can be one line or 50 lines. However, the record value is always consistent whether 1 line or 50. Each record can have either the same date_start value, or multiple different date_start values.
RECORD | Date_Start |
1234560 | 6/1/2023 |
1234561 | 6/1/2023 |
1234561 | 6/1/2023 |
1234567 | 6/1/2023 |
1234567 | 6/2/2023 |
1234568 | 6/1/2023 |
1234568 | 6/2/2023 |
1234568 | 6/3/2023 |
1234569 | 6/1/2023 |
1234569 | 6/2/2023 |
1234569 | 6/2/2023 |
1234569 | 6/3/2023 |
1234569 | 6/3/2023 |
1234569 | 6/3/2023 |
So, for records where the date_start is the same, those need to go to data set A. For records where the date_start varies within the same record, those records need to go to data set B. So I will have 2 data sets. One contains records where the date_start never changes for any unique record. The other contains records where the date_start varies for the unique record.
Example of what is needed:
Data Set A | |
RECORD | Date_Start |
1234560 | 6/1/2023 |
1234561 | 6/1/2023 |
1234561 | 6/1/2023 |
Data Set B | |
RECORD | Date_Start |
1234567 | 6/1/2023 |
1234567 | 6/2/2023 |
1234568 | 6/1/2023 |
1234568 | 6/2/2023 |
1234568 | 6/3/2023 |
1234569 | 6/1/2023 |
1234569 | 6/2/2023 |
1234569 | 6/2/2023 |
1234569 | 6/3/2023 |
1234569 | 6/3/2023 |
1234569 | 6/3/2023 |
SQL is easy enough.
data have;
input RECORD $ Date_Start mmddyy10.;
format date_start mmddyys10.;
cards;
1234560 6/1/2023
1234561 6/1/2023
1234561 6/1/2023
1234567 6/1/2023
1234567 6/2/2023
1234568 6/1/2023
1234568 6/2/2023
1234568 6/3/2023
1234569 6/1/2023
1234569 6/2/2023
1234569 6/2/2023
1234569 6/3/2023
1234569 6/3/2023
1234569 6/3/2023
;
proc sql;
create table tableA as
select *
from have
where record in (select record
from have
group by record
having max(Date_Start)=min(Date_Start)
);
create table tableA as
select *
from have
where record not in (select record
from have
group by record
having max(Date_Start)=min(Date_Start)
);
quit;
Thanks for your reply. I have a question about the cards part. I'm not familiar with that. But, I will not know all of the values in the record or date_start field. There are thousands of records in the dataset.
Does that impact the cards code where you listed out the values?
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.