Solved: Re: Unique records within a group of records

JH74 · Posted 06-13-2023 02:22 PM

Hello. I'm having trouble with writing code to get a subset of data.

Here's what I have:

Data looks like this (I have thousands of records, but an example).

RECORD	Date_Start
123456	6/1/2023
123456	6/1/2023
123456	6/1/2023

I need to find unique values for the record and the date_start. So, for this example, I would want only 1 unique record in a subset to have the record and the date_start, like this:

RECORD	Date_Start
123456	6/1/2023

I also have this:

RECORD	Date_Start
1234567	6/1/2023
1234567	6/2/2023
1234567	6/3/2023

For this, I'd expect to have 3 unique records in the new subset, since all are unique with respects to the record and the date_start.

I'm relatively new to SAS, so it's a bit of a challenge for me. I've tried sorting the set by record and date_start. Then creating a new variable that's retained, using first. etc, but I'm getting every record of the main data set exported to my subset data set.

Any ideas are appreciate.

Reeza · Posted 06-14-2023 11:10 AM

SQL is easy enough.

data have;
    input RECORD $ Date_Start mmddyy10.;
    format date_start mmddyys10.;
    cards;
1234560 6/1/2023
1234561 6/1/2023
1234561 6/1/2023
1234567 6/1/2023
1234567 6/2/2023
1234568 6/1/2023
1234568 6/2/2023
1234568 6/3/2023
1234569 6/1/2023
1234569 6/2/2023
1234569 6/2/2023
1234569 6/3/2023
1234569 6/3/2023
1234569 6/3/2023
;

proc sql;
    create table tableA as 
    select * 
    from have 
    where record in (select record 
                    from have 
                    group by record 
                    having max(Date_Start)=min(Date_Start)
                    );
                    
                    
    create table tableA as 
    select * 
    from have 
    where record not in (select record 
                    from have 
                    group by record 
                    having max(Date_Start)=min(Date_Start)
                    );
quit;

View solution in original post

PaigeMiller · Posted 06-13-2023 02:26 PM

proc sort data=have;
    by record date_start;
run;

data want;
    set have;
    by record date_start;
    if first.date_start;
run;

--
Paige Miller

JH74 · Posted 06-13-2023 04:23 PM

Thanks for this. I have done this and I think I have not made my initial question clear enough. I have to get two separate subsets. One where the subset consists of instances like this:

RECORD	Date_Start
123456	6/1/2023
123456	6/1/2023
123456	6/1/2023

So, this scenario would go to a new dataset called A.

Then instances where it is like this (so one record and multiple start dates for that record):

RECORD	Date_Start
1234567	6/1/2023
1234567	6/2/2023
1234567	6/3/2023

This would go to a new dataset called B.

So, I need to new sets, one that contains instances where one record and consistent dates. Another that contains instances where one record and various dates. I apologize for not being more detailed.

Reeza · Posted 06-13-2023 04:42 PM

What would the input dataset look like?

JH74 · Posted 06-13-2023 07:35 PM

The input dataset has 3 total variables. Only these 2 are relevant though, record and date_start.

PaigeMiller · Posted 06-13-2023 04:52 PM

@JH74 wrote:

I have done this and I think I have not made my initial question clear enough. I have to get two separate subsets. One where the subset consists of instances like this:

RECORD Date_Start

123456 6/1/2023

123456 6/1/2023

123456 6/1/2023

So, this scenario would go to a new dataset called A.

Then instances where it is like this (so one record and multiple start dates for that record):

RECORD Date_Start

1234567 6/1/2023

1234567 6/2/2023

1234567 6/3/2023

This would go to a new dataset called B.

So, I need to new sets, one that contains instances where one record and consistent dates. Another that contains instances where one record and various dates. I apologize for not being more detailed.

And its still not clear to me. You are phrasing the rule in terms of consistent dates, and another case where there are various dates. What happens if a RECORD has both, say three records for RECORD 45678 on 6/1/2023 and then one record for 45678 on 6/2/2023 and two records for 45678 on 6/3/2023.

Please give an explanation that covers all possibilities, or if a situation is impossible and never happens, state those cases as well.

--
Paige Miller

JH74 · Posted 06-13-2023 05:54 PM

Thanks for your question. The data can have any number of combinations. The record can be one line or 50 lines. However, the record value is always consistent whether 1 line or 50. Each record can have either the same date_start value, or multiple different date_start values.

RECORD	Date_Start
1234560	6/1/2023
1234561	6/1/2023
1234561	6/1/2023
1234567	6/1/2023
1234567	6/2/2023
1234568	6/1/2023
1234568	6/2/2023
1234568	6/3/2023
1234569	6/1/2023
1234569	6/2/2023
1234569	6/2/2023
1234569	6/3/2023
1234569	6/3/2023
1234569	6/3/2023

So, for records where the date_start is the same, those need to go to data set A. For records where the date_start varies within the same record, those records need to go to data set B. So I will have 2 data sets. One contains records where the date_start never changes for any unique record. The other contains records where the date_start varies for the unique record.

Example of what is needed:

Data Set A
RECORD	Date_Start
1234560	6/1/2023
1234561	6/1/2023
1234561	6/1/2023

Data Set B
RECORD	Date_Start
1234567	6/1/2023
1234567	6/2/2023
1234568	6/1/2023
1234568	6/2/2023
1234568	6/3/2023
1234569	6/1/2023
1234569	6/2/2023
1234569	6/2/2023
1234569	6/3/2023
1234569	6/3/2023
1234569	6/3/2023

Reeza · Posted 06-14-2023 11:10 AM

SQL is easy enough.

data have;
    input RECORD $ Date_Start mmddyy10.;
    format date_start mmddyys10.;
    cards;
1234560 6/1/2023
1234561 6/1/2023
1234561 6/1/2023
1234567 6/1/2023
1234567 6/2/2023
1234568 6/1/2023
1234568 6/2/2023
1234568 6/3/2023
1234569 6/1/2023
1234569 6/2/2023
1234569 6/2/2023
1234569 6/3/2023
1234569 6/3/2023
1234569 6/3/2023
;

proc sql;
    create table tableA as 
    select * 
    from have 
    where record in (select record 
                    from have 
                    group by record 
                    having max(Date_Start)=min(Date_Start)
                    );
                    
                    
    create table tableA as 
    select * 
    from have 
    where record not in (select record 
                    from have 
                    group by record 
                    having max(Date_Start)=min(Date_Start)
                    );
quit;

JH74 · Posted 06-14-2023 11:53 AM

Thanks for your reply. I have a question about the cards part. I'm not familiar with that. But, I will not know all of the values in the record or date_start field. There are thousands of records in the dataset.

Does that impact the cards code where you listed out the values?

Reeza · Posted 06-14-2023 11:56 AM

That part is just to make fake data to test. Point it to your actual data set instead.

Reeza · Posted 06-13-2023 03:03 PM

Do you have other variables? If not SQL is an option.

If you do, stick with a data step as demonstrated by PaigeMiller.

Registration is open

SAS Training: Just a Click Away