DATA Step, Macro, Functions and more

Filtering out entries that do not have a certain scheme

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 77
Accepted Solution

Filtering out entries that do not have a certain scheme

I have a dataset with IDs that are both valid and invalid. All valid IDs are numeric only and contain exactly 11 digits, which also MAY include 0’s in the front but not an absolute requirement. How do I write a sas code to delete all entries that do not fall into the 11 numeric only scheme? I have included a sample dataset on excel. Thank you so much.


Accepted Solutions
Solution
‎03-14-2017 10:01 AM
Super User
Posts: 7,796

Re: Filtering out entries that do not have a certain scheme


byeh2017 wrote:

Okay, I have a modified version of my original file.

 

data MYDATA.ORIGINALSAMPLE;
  infile datalines dsd truncover;
  input DCDeathDate:DATETIME16. NationalID:$20.;
datalines4;
01NOV16:00:00:00,
17JAN17:00:00:00,0000000000
28NOV16:00:00:00,00000000000
16NOV16:00:00:00,004131103
21DEC16:00:00:00,00438162
12DEC16:00:00:00,01001000012
30NOV16:00:00:00,01001000029
23DEC16:00:00:00,01001000142
06NOV16:00:00:00,01001000234
18DEC16:00:00:00,01001000275
23NOV16:00:00:00,01001000320
22NOV16:00:00:00,01001000392
26DEC16:00:00:00,01001000399
18NOV16:00:00:00,01001000554
09DEC16:00:00:00,01001000608
30NOV16:00:00:00,01001000657
17NOV16:00:00:00,01001000881
27DEC16:00:00:00,01001000951
28DEC16:00:00:00,01001001048
04NOV16:00:00:00,01001001804
10DEC16:00:00:00,01001001853
04NOV16:00:00:00,01001001978
20DEC16:00:00:00,01001002032
27DEC16:00:00:00,01001002219
05DEC16:00:00:00,01001002348
04NOV16:00:00:00,01001002451
16DEC16:00:00:00,01001002597
19NOV16:00:00:00,01001002603
10NOV16:00:00:00,01001002666
26NOV16:00:00:00,01001002761
04NOV16:00:00:00,01001002854
25DEC16:00:00:00,01001002870
18NOV16:00:00:00,01001002882
25NOV16:00:00:00,01001002950
31DEC16:00:00:00,01001003003
01NOV16:00:00:00,01001003057
21NOV16:00:00:00,01001003198
02NOV16:00:00:00,01001003231
02DEC16:00:00:00,01001003234
05DEC16:00:00:00,01001003253
26DEC16:00:00:00,01001003284
05NOV16:00:00:00,01001003312
15DEC16:00:00:00,01001003370
02DEC16:00:00:00,01001003414
14DEC16:00:00:00,01001003476
04DEC16:00:00:00,01001003583
10DEC16:00:00:00,01001003609
28DEC16:00:00:00,01001003788
30NOV16:00:00:00,01001003803
10DEC16:00:00:00,01001003913
;;;;

Use this:

data mydata.clean;
set MYDATA.ORIGINALSAMPLE;
if verify(trim(NationalID),"1234567890") = 0 and length(trim(NationalID)) = 11;
run;
---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers

View solution in original post


All Replies
Super User
Posts: 7,796

Re: Filtering out entries that do not have a certain scheme

Please post your example data in a data step.

An easy to use macro for converting your SAS dataset into a data step is found here:

https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...

Why?

- Excel files can carry malware

- therefore many institutional firewalls block their download

- and they cannot preserve SAS data as it is (attributes - like variable types and formats - are lost)

Posting the data step enables your possible helpers to exactly recreate your data with a simple copy/paste and run, without any kind of danger.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Frequent Contributor
Posts: 77

Re: Filtering out entries that do not have a certain scheme

Posted in reply to KurtBremser

Sorry. Attached is the sas dataset.

Attachment
Super User
Posts: 7,796

Re: Filtering out entries that do not have a certain scheme

Please reread my post. Use the macro provided in the link to create the datastep code in one easy step, and then post the resulting code here, using the "little running man" or {i} button.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Frequent Contributor
Posts: 77

Re: Filtering out entries that do not have a certain scheme

Posted in reply to KurtBremser

Sorry, here it is:

 

data MYDATA.SAMPLEINVALIDID;
  infile datalines dsd truncover;
  input Date:MMDDYY10. ID:$11. Screened:$3.;
datalines4;
11/01/2017,14522778554,Yes
12/01/2017,A,No
01/01/2018,14522778556,Yes
02/01/2018,14522778557,No
03/01/2018,14522778558,Yes
04/01/2018,14522778559,No
05/01/2018,14522778560,Yes
06/01/2018,,No
07/01/2018,14522778562,Yes
08/01/2018,14522778563,No
09/01/2018,14522778564,Yes
10/01/2018,14522778565,No
11/01/2018,14522778566,Yes
;;;;
Frequent Contributor
Posts: 77

Re: Filtering out entries that do not have a certain scheme

I seem to have found a way to do this on this sample dataset, but when I apply it to a bigger dataset, it somehow erases all entries.

 

I am using this particular code:

Data mydata.sampleidsCLEAN;
Set mydata.sampleids;
If verify(ID,"0123456789") NE 0 then delete;
Run;

 

This seems to work on my sample dataset:

data MYDATA.SAMPLEIDs;
  infile datalines dsd truncover;
  input Date:MMDDYY10. ID:$11.;
datalines4;
11/01/2017,14522778554
12/01/2017,A
01/01/2018,14522778556
02/01/2018,14522778557
03/01/2018,14522778558
04/01/2018,14522778559
05/01/2018,14522778560
06/01/2018,
07/01/2018,14522778562
08/01/2018,14522778563
09/01/2018,14522778564
10/01/2018,14522778565
11/01/2018,14522778566
12/01/2018,
01/01/2019,
02/01/2019,sdf
03/01/2019,12a
04/01/2019,32r234a
05/01/2019,443bbb
06/01/2019,bbbbbbb
;;;;

Data mydata.sampleidsCLEAN;
Set mydata.sampleids;
If verify(ID,"0123456789") NE 0 then delete;
Run;

 

But when I apply it to a much bigger dataset, it is just erasing all entries:

 

1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 61         
 62         Data mydata.originaldatasetDUPIDCLEAN;
 63         Set mydata.originaldatasetDUPCLEAN;
 64         If verify(nationalID,"0123456789") NE 0 then delete;
 65         Run;
 
 NOTE: There were 74691 observations read from the data set MYDATA.ORIGINALDATASETDUPCLEAN.
 NOTE: The data set MYDATA.ORIGINALDATASETDUPIDCLEAN has 0 observations and 45 variables.
 NOTE: DATA statement used (Total process time):
       real time           1.92 seconds
       cpu time            1.14 seconds
       
 
 66         
 67         OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 80         
Super User
Posts: 7,796

Re: Filtering out entries that do not have a certain scheme

Are you sure that nationalID contains the same type of data as ID in your example dataset?

 

You can use the fourth parameter in the macro to restrict the number of observations that gets into the example dataset datastep, so you could post just the first 10 obs from your real dataset MYDATA.ORIGINALDATASETDUPCLEAN.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Frequent Contributor
Posts: 77

Re: Filtering out entries that do not have a certain scheme

Posted in reply to KurtBremser

Thanks. My main dataset MYDATA.ORIGINALDATASETDUPCLEAN has the following parameters

 

Label: NationalID

Name: NationalID

Length: 20

Type: Char

Format: $20

Informat: $20

 

This appears to be different than my sample dataset which has the following properties:

Label: ID

Name: ID

Length: 11

Type: Char

Format: 

Informat: 

 

There is some sensitive patient information from my original dataset, let me spend a few minutes of making a new dataset out of this to delete some of these sensitive variables. Thank you.

Frequent Contributor
Posts: 77

Re: Filtering out entries that do not have a certain scheme

Okay, I have a modified version of my original file.

 

data MYDATA.ORIGINALSAMPLE;
  infile datalines dsd truncover;
  input DCDeathDate:DATETIME16. NationalID:$20.;
datalines4;
01NOV16:00:00:00,
17JAN17:00:00:00,0000000000
28NOV16:00:00:00,00000000000
16NOV16:00:00:00,004131103
21DEC16:00:00:00,00438162
12DEC16:00:00:00,01001000012
30NOV16:00:00:00,01001000029
23DEC16:00:00:00,01001000142
06NOV16:00:00:00,01001000234
18DEC16:00:00:00,01001000275
23NOV16:00:00:00,01001000320
22NOV16:00:00:00,01001000392
26DEC16:00:00:00,01001000399
18NOV16:00:00:00,01001000554
09DEC16:00:00:00,01001000608
30NOV16:00:00:00,01001000657
17NOV16:00:00:00,01001000881
27DEC16:00:00:00,01001000951
28DEC16:00:00:00,01001001048
04NOV16:00:00:00,01001001804
10DEC16:00:00:00,01001001853
04NOV16:00:00:00,01001001978
20DEC16:00:00:00,01001002032
27DEC16:00:00:00,01001002219
05DEC16:00:00:00,01001002348
04NOV16:00:00:00,01001002451
16DEC16:00:00:00,01001002597
19NOV16:00:00:00,01001002603
10NOV16:00:00:00,01001002666
26NOV16:00:00:00,01001002761
04NOV16:00:00:00,01001002854
25DEC16:00:00:00,01001002870
18NOV16:00:00:00,01001002882
25NOV16:00:00:00,01001002950
31DEC16:00:00:00,01001003003
01NOV16:00:00:00,01001003057
21NOV16:00:00:00,01001003198
02NOV16:00:00:00,01001003231
02DEC16:00:00:00,01001003234
05DEC16:00:00:00,01001003253
26DEC16:00:00:00,01001003284
05NOV16:00:00:00,01001003312
15DEC16:00:00:00,01001003370
02DEC16:00:00:00,01001003414
14DEC16:00:00:00,01001003476
04DEC16:00:00:00,01001003583
10DEC16:00:00:00,01001003609
28DEC16:00:00:00,01001003788
30NOV16:00:00:00,01001003803
10DEC16:00:00:00,01001003913
;;;;
Solution
‎03-14-2017 10:01 AM
Super User
Posts: 7,796

Re: Filtering out entries that do not have a certain scheme


byeh2017 wrote:

Okay, I have a modified version of my original file.

 

data MYDATA.ORIGINALSAMPLE;
  infile datalines dsd truncover;
  input DCDeathDate:DATETIME16. NationalID:$20.;
datalines4;
01NOV16:00:00:00,
17JAN17:00:00:00,0000000000
28NOV16:00:00:00,00000000000
16NOV16:00:00:00,004131103
21DEC16:00:00:00,00438162
12DEC16:00:00:00,01001000012
30NOV16:00:00:00,01001000029
23DEC16:00:00:00,01001000142
06NOV16:00:00:00,01001000234
18DEC16:00:00:00,01001000275
23NOV16:00:00:00,01001000320
22NOV16:00:00:00,01001000392
26DEC16:00:00:00,01001000399
18NOV16:00:00:00,01001000554
09DEC16:00:00:00,01001000608
30NOV16:00:00:00,01001000657
17NOV16:00:00:00,01001000881
27DEC16:00:00:00,01001000951
28DEC16:00:00:00,01001001048
04NOV16:00:00:00,01001001804
10DEC16:00:00:00,01001001853
04NOV16:00:00:00,01001001978
20DEC16:00:00:00,01001002032
27DEC16:00:00:00,01001002219
05DEC16:00:00:00,01001002348
04NOV16:00:00:00,01001002451
16DEC16:00:00:00,01001002597
19NOV16:00:00:00,01001002603
10NOV16:00:00:00,01001002666
26NOV16:00:00:00,01001002761
04NOV16:00:00:00,01001002854
25DEC16:00:00:00,01001002870
18NOV16:00:00:00,01001002882
25NOV16:00:00:00,01001002950
31DEC16:00:00:00,01001003003
01NOV16:00:00:00,01001003057
21NOV16:00:00:00,01001003198
02NOV16:00:00:00,01001003231
02DEC16:00:00:00,01001003234
05DEC16:00:00:00,01001003253
26DEC16:00:00:00,01001003284
05NOV16:00:00:00,01001003312
15DEC16:00:00:00,01001003370
02DEC16:00:00:00,01001003414
14DEC16:00:00:00,01001003476
04DEC16:00:00:00,01001003583
10DEC16:00:00:00,01001003609
28DEC16:00:00:00,01001003788
30NOV16:00:00:00,01001003803
10DEC16:00:00:00,01001003913
;;;;

Use this:

data mydata.clean;
set MYDATA.ORIGINALSAMPLE;
if verify(trim(NationalID),"1234567890") = 0 and length(trim(NationalID)) = 11;
run;
---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Frequent Contributor
Posts: 77

Re: Filtering out entries that do not have a certain scheme

Posted in reply to KurtBremser

This works! Thank you very much for your help.

 

Just trying to understand how this works. The original code was 0123456789, how come moving the code to 1234567890 made all the difference? And what exactly does the trim function do? Thank you again.

Super User
Posts: 7,796

Re: Filtering out entries that do not have a certain scheme

The difference between 0123456789 and 1234567890 is just the order of digits on the keyboard, it makes no difference with regard to the verify() function.

 

The trim() function removes all trailing blanks from strings, strip() removes leading and trailing blanks.

 

So, after using trim(), one can easily check for length and digits-only.

 

You could also use the notdigit() function instead of verify().

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Regular Contributor
Posts: 233

Re: Filtering out entries that do not have a certain scheme

Another solution is to use a regular expression

 

if prxmatch("/^\d{11}$/",trim(ID));
Super User
Posts: 7,796

Re: Filtering out entries that do not have a certain scheme

If your NationalID only contains the same 11-digit numbers like your sample ID, the verify() function will cough up on the blanks that pad the longer variable.

Then your condition should look like

if verify(trim(NationalID),"1234567890") ne 0 or length(trim(NationalID)) ne 11 then delete;

To make it prettier, I'd use a subsetting if:

if verify(trim(NationalID),"1234567890") = 0 and length(trim(NationalID)) = 11;

If the ID's are not left-aligned, use strip() instead of trim().

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 13 replies
  • 190 views
  • 3 likes
  • 3 in conversation