Solved: Re: Random exclusion of duplicates

_MVB_ · Posted 06-10-2020 05:17 PM

I have a huge dataset with several million records, where IDs (customer identifier) are repeating many times, as I have historical data for same customers (lets say 10 years of historical data).

I need to remove duplicates and remain with only one record for each ID.

This must be a random exclusion, as in the end I need to have a representation of historical data.

Also, random selection of IDs must be repeatable - meaning, if someone else will run this logic on the same dataset, it will come back with exactly same Point-In-Time random selection every time (such as, if customer identifier 99900123 comes with 120 records (120 historical months) and was chosen to keep Aug 2012 in random selection - that historical month must come back every time for 99900123 when executing logic.

Hopefully, someone can come up with a quick resolution, as I need it relatively urgent 🙂

I looked at surveyselect, however my understanding any of the methods require sample size, which does not apply in my case

FreelanceReinh · Posted 06-10-2020 05:55 PM

Hello @_MVB_ and welcome to the SAS Support Communities!

Use PROC SURVEYSELECT. The sample size is 1 -- per ID:

/* Create test data for demonstration */

data have;
input ID $ month;
cards;
1 4
1 7
1 2
2 3
2 1
3 9
3 5
;

/* Select randomly one observation per ID */

proc surveyselect data=have
method=srs n=1
seed=1732 out=want(drop=selectionprob samplingweight);
strata id;
run;

The STRATA statement requires that the input dataset (HAVE) is sorted by ID. The selection is repeatable thanks to the random seed (arbitrary integer).

View solution in original post

FreelanceReinh · Posted 06-10-2020 05:55 PM

Hello @_MVB_ and welcome to the SAS Support Communities!

Use PROC SURVEYSELECT. The sample size is 1 -- per ID:

/* Create test data for demonstration */

data have;
input ID $ month;
cards;
1 4
1 7
1 2
2 3
2 1
3 9
3 5
;

/* Select randomly one observation per ID */

proc surveyselect data=have
method=srs n=1
seed=1732 out=want(drop=selectionprob samplingweight);
strata id;
run;

The STRATA statement requires that the input dataset (HAVE) is sorted by ID. The selection is repeatable thanks to the random seed (arbitrary integer).

_MVB_ · Posted 06-10-2020 09:15 PM

@FreelanceReinh - Thank you!

Question, can you elaborate on this:

drop=selectionprob samplingweight

What is it doing exactly?

FreelanceReinh · Posted 06-11-2020 04:33 AM

@_MVB_ wrote:

Question, can you elaborate on this:
drop=selectionprob samplingweight
What is it doing exactly?

This DROP= dataset option excludes two variables (SelectionProb and SamplingWeight) from being written to the output dataset (WANT) which are created automatically by PROC SURVEYSELECT due to the STRATA statement. I assumed that you wouldn't want to add these "unexpected," technical new variables to the existing variables in the random sample. Feel free to omit the "(drop=...)" part of my example code and then look at the additional columns in dataset WANT to decide if they might be useful for your purposes.

_MVB_ · Posted 06-11-2020 08:54 AM

@FreelanceReinh That is a good point - I do not want to have any extra variable created, as it would add to the total size of the dataset.

ballardw · Posted 06-10-2020 06:02 PM

Surveyselect is the tool. Your sample size is 1 per strata, where ID is the strata.

proc sort data=have;
  by id date; /* if there a multiple records for the it with the
              same date then you need to provide enough variables
              to duplicate the sort again later
              */
run;

Proc surveyselect data=have
     out=want sampsize=1 noprint
     seed=23456   /* this value initializing the sequence
                     of random numbers used and is critical
                     for duplicating the same selection
                     HOWEVER: upgrades to SAS software, different
                     OS make it impossible to guarantee same
                     results afterwards
                  */
;
   strata id; /*data must be sorted by the strata variable
                which is why your "run again later" must
                be able to duplicate the sort order
              */
run;

Any change that affects the number of variables or the input order will render "duplicating" the selection later useless. Though with that restriction make sure that you don't lose the first created data set. Not that hard to place into a permanent library and not delete it.

_MVB_ · Posted 06-10-2020 09:29 PM

@ballardw - Thank you!

My data is already sorted (repeating records with the same date already excluded), so pretty much I got customer ID, account/s (might be multiple) and date9.

So 'strata' ID will ensure none of the IDs will be completely removed in the 'want' dataset, while only one record kept for an ID, and this is because n=1? Is it because of the combination of 'strata' and 'n' used in the same logic? What would it be, if i keep n=1, but not including 'strata' at all? Also, what would it be if n=2 and 'strata' ID?

ballardw · Posted 06-11-2020 10:46 AM

@_MVB_ wrote:

@ballardw - Thank you!

My data is already sorted (repeating records with the same date already excluded), so pretty much I got customer ID, account/s (might be multiple) and date9.

So 'strata' ID will ensure none of the IDs will be completely removed in the 'want' dataset, while only one record kept for an ID, and this is because n=1? Is it because of the combination of 'strata' and 'n' used in the same logic? What would it be, if i keep n=1, but not including 'strata' at all? Also, what would it be if n=2 and 'strata' ID?

I did not use N= anything, so I am not sure what you refer to here. If you mean "number of records selected per strata" please use either the proper term "sampsize" or state that is what you mean. The SAMPSIZE is the number of records to select per strata. If there is only one number that is applied to all the strata.

The sampsize can be set to specific number of records per strata but it would require either a list in the order the strata appear in the data, such as Sampsize = (25 15 30) which would select 25 records from the first stratum, 15 from the second, and 30 from the third. I suspect such is not at all desirable from your data description, SAMPSIZE could also reference a data set of specific structure that has the strata values and the desired sample size per strata, which would likely be the way to go if you want to control 100's of strata.

If you indicate that you want a SAMPSIZE of n for each strata then you need to consider what to do for strata that have fewer than n records. If you specify more than 1 for the SAMPSIZE and you want all the records if n is larger than the count for that strata you use the SELECTALL option (limits based on selection method), otherwise there will be an error because you requested more than exist in the data.

If you do not include a STRATA statement then your entire data set is considered. If you set SAMPSIZE=1 then your output data set will consist of 1 record, SAMPSIZE=10 would be 10 records.

Since you say that a value of Id variable may have multiple accounts then you should sort by Id and account. And if the Account is duplicated within Id yet another variable to force a repeatable unique order of the data. As I said, anything that changes the order of the values in the data will affect the output. So unless you can 100 percent state "this data set will never ever be resorted" you need to have a repeatable order from your sort.

Also you can have more than one variable on the STRATA statement. If you specific Strata Id Account, then each combination of Id and Account is a strata and the Sampsize would be applied. So Sampsize=1 would get one record for each Id and Account.

If your Strata variable is not missing you will get one, with the example code, per record Id. You want to be careful just using the word ID as that is also a statement in Proc surveyselect (and quite a few other procedures).

_MVB_ · Posted 06-11-2020 01:28 PM

@ballardw -this is a detailed explanations, thanks!

I was referring to Sampsize (instead of N).

I called customer identifier ID here, however in my data this field has a different name, so there will not be an issue using it in a proc surveyselect statement.

My data already sorted by customer identifier, account, and historical month in ascending order, and this is my reference dataset that I am not going to change, hence if in the future this needs to be replicated, I can take this dataset and run surveyselect again with same seed number.

So if STRATA specified in the proc surveyselect and has reference to one variable only (in my case, lets call it Customer Identifier), while SAMPSIZE=1 - will this come back with unique records for each of the Customer Identifier, while a record will be chosen randomly?

To make sure we are on the same page, lets assume my dataset in 100M records, which are historical data for 1M customers and 3M accounts. Each customer has at least 1 account. In my output dataset (post surveyselect), I would want to have exactly 1M records. I do not need to capture a record for each of the accounts (which would make it 3M records in the output), but only one record per customer identifier.

Here is how I am about to run this:

PROC SURVEYSELECT DATA=HAVE
METHOD=SRS
SAMPSIZE=1
SEED=12345
OUT=WANT;
STRATA CUSTOMER_IDENTIFIER;
RUN;

ballardw · Posted 06-11-2020 01:42 PM

Since SAS will not disturb you starting data set why not run the code and check the output?

With as many strata as you are using I suggest adding NOPRINT to the Proc statement so the details of the selection do not get sent to the results window.

_MVB_ · Posted 06-11-2020 01:45 PM

That is what I am doing - wanted to double check here as well 🙂

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!