BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
byeh2017
Quartz | Level 8

I have a dataset with IDs that are both valid and invalid. All valid IDs are numeric only and contain exactly 11 digits, which also MAY include 0’s in the front but not an absolute requirement. How do I write a sas code to delete all entries that do not fall into the 11 numeric only scheme? I have included a sample dataset on excel. Thank you so much.

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

@byeh2017 wrote:

Okay, I have a modified version of my original file.

 

data MYDATA.ORIGINALSAMPLE;
  infile datalines dsd truncover;
  input DCDeathDate:DATETIME16. NationalID:$20.;
datalines4;
01NOV16:00:00:00,
17JAN17:00:00:00,0000000000
28NOV16:00:00:00,00000000000
16NOV16:00:00:00,004131103
21DEC16:00:00:00,00438162
12DEC16:00:00:00,01001000012
30NOV16:00:00:00,01001000029
23DEC16:00:00:00,01001000142
06NOV16:00:00:00,01001000234
18DEC16:00:00:00,01001000275
23NOV16:00:00:00,01001000320
22NOV16:00:00:00,01001000392
26DEC16:00:00:00,01001000399
18NOV16:00:00:00,01001000554
09DEC16:00:00:00,01001000608
30NOV16:00:00:00,01001000657
17NOV16:00:00:00,01001000881
27DEC16:00:00:00,01001000951
28DEC16:00:00:00,01001001048
04NOV16:00:00:00,01001001804
10DEC16:00:00:00,01001001853
04NOV16:00:00:00,01001001978
20DEC16:00:00:00,01001002032
27DEC16:00:00:00,01001002219
05DEC16:00:00:00,01001002348
04NOV16:00:00:00,01001002451
16DEC16:00:00:00,01001002597
19NOV16:00:00:00,01001002603
10NOV16:00:00:00,01001002666
26NOV16:00:00:00,01001002761
04NOV16:00:00:00,01001002854
25DEC16:00:00:00,01001002870
18NOV16:00:00:00,01001002882
25NOV16:00:00:00,01001002950
31DEC16:00:00:00,01001003003
01NOV16:00:00:00,01001003057
21NOV16:00:00:00,01001003198
02NOV16:00:00:00,01001003231
02DEC16:00:00:00,01001003234
05DEC16:00:00:00,01001003253
26DEC16:00:00:00,01001003284
05NOV16:00:00:00,01001003312
15DEC16:00:00:00,01001003370
02DEC16:00:00:00,01001003414
14DEC16:00:00:00,01001003476
04DEC16:00:00:00,01001003583
10DEC16:00:00:00,01001003609
28DEC16:00:00:00,01001003788
30NOV16:00:00:00,01001003803
10DEC16:00:00:00,01001003913
;;;;

Use this:

data mydata.clean;
set MYDATA.ORIGINALSAMPLE;
if verify(trim(NationalID),"1234567890") = 0 and length(trim(NationalID)) = 11;
run;

View solution in original post

13 REPLIES 13
Kurt_Bremser
Super User

Please post your example data in a data step.

An easy to use macro for converting your SAS dataset into a data step is found here:

https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...

Why?

- Excel files can carry malware

- therefore many institutional firewalls block their download

- and they cannot preserve SAS data as it is (attributes - like variable types and formats - are lost)

Posting the data step enables your possible helpers to exactly recreate your data with a simple copy/paste and run, without any kind of danger.

byeh2017
Quartz | Level 8

Sorry. Attached is the sas dataset.

Kurt_Bremser
Super User

Please reread my post. Use the macro provided in the link to create the datastep code in one easy step, and then post the resulting code here, using the "little running man" or {i} button.

byeh2017
Quartz | Level 8

Sorry, here it is:

 

data MYDATA.SAMPLEINVALIDID;
  infile datalines dsd truncover;
  input Date:MMDDYY10. ID:$11. Screened:$3.;
datalines4;
11/01/2017,14522778554,Yes
12/01/2017,A,No
01/01/2018,14522778556,Yes
02/01/2018,14522778557,No
03/01/2018,14522778558,Yes
04/01/2018,14522778559,No
05/01/2018,14522778560,Yes
06/01/2018,,No
07/01/2018,14522778562,Yes
08/01/2018,14522778563,No
09/01/2018,14522778564,Yes
10/01/2018,14522778565,No
11/01/2018,14522778566,Yes
;;;;
byeh2017
Quartz | Level 8

I seem to have found a way to do this on this sample dataset, but when I apply it to a bigger dataset, it somehow erases all entries.

 

I am using this particular code:

Data mydata.sampleidsCLEAN;
Set mydata.sampleids;
If verify(ID,"0123456789") NE 0 then delete;
Run;

 

This seems to work on my sample dataset:

data MYDATA.SAMPLEIDs;
  infile datalines dsd truncover;
  input Date:MMDDYY10. ID:$11.;
datalines4;
11/01/2017,14522778554
12/01/2017,A
01/01/2018,14522778556
02/01/2018,14522778557
03/01/2018,14522778558
04/01/2018,14522778559
05/01/2018,14522778560
06/01/2018,
07/01/2018,14522778562
08/01/2018,14522778563
09/01/2018,14522778564
10/01/2018,14522778565
11/01/2018,14522778566
12/01/2018,
01/01/2019,
02/01/2019,sdf
03/01/2019,12a
04/01/2019,32r234a
05/01/2019,443bbb
06/01/2019,bbbbbbb
;;;;

Data mydata.sampleidsCLEAN;
Set mydata.sampleids;
If verify(ID,"0123456789") NE 0 then delete;
Run;

 

But when I apply it to a much bigger dataset, it is just erasing all entries:

 

1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 61         
 62         Data mydata.originaldatasetDUPIDCLEAN;
 63         Set mydata.originaldatasetDUPCLEAN;
 64         If verify(nationalID,"0123456789") NE 0 then delete;
 65         Run;
 
 NOTE: There were 74691 observations read from the data set MYDATA.ORIGINALDATASETDUPCLEAN.
 NOTE: The data set MYDATA.ORIGINALDATASETDUPIDCLEAN has 0 observations and 45 variables.
 NOTE: DATA statement used (Total process time):
       real time           1.92 seconds
       cpu time            1.14 seconds
       
 
 66         
 67         OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 80         
Kurt_Bremser
Super User

Are you sure that nationalID contains the same type of data as ID in your example dataset?

 

You can use the fourth parameter in the macro to restrict the number of observations that gets into the example dataset datastep, so you could post just the first 10 obs from your real dataset MYDATA.ORIGINALDATASETDUPCLEAN.

byeh2017
Quartz | Level 8

Thanks. My main dataset MYDATA.ORIGINALDATASETDUPCLEAN has the following parameters

 

Label: NationalID

Name: NationalID

Length: 20

Type: Char

Format: $20

Informat: $20

 

This appears to be different than my sample dataset which has the following properties:

Label: ID

Name: ID

Length: 11

Type: Char

Format: 

Informat: 

 

There is some sensitive patient information from my original dataset, let me spend a few minutes of making a new dataset out of this to delete some of these sensitive variables. Thank you.

byeh2017
Quartz | Level 8

Okay, I have a modified version of my original file.

 

data MYDATA.ORIGINALSAMPLE;
  infile datalines dsd truncover;
  input DCDeathDate:DATETIME16. NationalID:$20.;
datalines4;
01NOV16:00:00:00,
17JAN17:00:00:00,0000000000
28NOV16:00:00:00,00000000000
16NOV16:00:00:00,004131103
21DEC16:00:00:00,00438162
12DEC16:00:00:00,01001000012
30NOV16:00:00:00,01001000029
23DEC16:00:00:00,01001000142
06NOV16:00:00:00,01001000234
18DEC16:00:00:00,01001000275
23NOV16:00:00:00,01001000320
22NOV16:00:00:00,01001000392
26DEC16:00:00:00,01001000399
18NOV16:00:00:00,01001000554
09DEC16:00:00:00,01001000608
30NOV16:00:00:00,01001000657
17NOV16:00:00:00,01001000881
27DEC16:00:00:00,01001000951
28DEC16:00:00:00,01001001048
04NOV16:00:00:00,01001001804
10DEC16:00:00:00,01001001853
04NOV16:00:00:00,01001001978
20DEC16:00:00:00,01001002032
27DEC16:00:00:00,01001002219
05DEC16:00:00:00,01001002348
04NOV16:00:00:00,01001002451
16DEC16:00:00:00,01001002597
19NOV16:00:00:00,01001002603
10NOV16:00:00:00,01001002666
26NOV16:00:00:00,01001002761
04NOV16:00:00:00,01001002854
25DEC16:00:00:00,01001002870
18NOV16:00:00:00,01001002882
25NOV16:00:00:00,01001002950
31DEC16:00:00:00,01001003003
01NOV16:00:00:00,01001003057
21NOV16:00:00:00,01001003198
02NOV16:00:00:00,01001003231
02DEC16:00:00:00,01001003234
05DEC16:00:00:00,01001003253
26DEC16:00:00:00,01001003284
05NOV16:00:00:00,01001003312
15DEC16:00:00:00,01001003370
02DEC16:00:00:00,01001003414
14DEC16:00:00:00,01001003476
04DEC16:00:00:00,01001003583
10DEC16:00:00:00,01001003609
28DEC16:00:00:00,01001003788
30NOV16:00:00:00,01001003803
10DEC16:00:00:00,01001003913
;;;;
Kurt_Bremser
Super User

@byeh2017 wrote:

Okay, I have a modified version of my original file.

 

data MYDATA.ORIGINALSAMPLE;
  infile datalines dsd truncover;
  input DCDeathDate:DATETIME16. NationalID:$20.;
datalines4;
01NOV16:00:00:00,
17JAN17:00:00:00,0000000000
28NOV16:00:00:00,00000000000
16NOV16:00:00:00,004131103
21DEC16:00:00:00,00438162
12DEC16:00:00:00,01001000012
30NOV16:00:00:00,01001000029
23DEC16:00:00:00,01001000142
06NOV16:00:00:00,01001000234
18DEC16:00:00:00,01001000275
23NOV16:00:00:00,01001000320
22NOV16:00:00:00,01001000392
26DEC16:00:00:00,01001000399
18NOV16:00:00:00,01001000554
09DEC16:00:00:00,01001000608
30NOV16:00:00:00,01001000657
17NOV16:00:00:00,01001000881
27DEC16:00:00:00,01001000951
28DEC16:00:00:00,01001001048
04NOV16:00:00:00,01001001804
10DEC16:00:00:00,01001001853
04NOV16:00:00:00,01001001978
20DEC16:00:00:00,01001002032
27DEC16:00:00:00,01001002219
05DEC16:00:00:00,01001002348
04NOV16:00:00:00,01001002451
16DEC16:00:00:00,01001002597
19NOV16:00:00:00,01001002603
10NOV16:00:00:00,01001002666
26NOV16:00:00:00,01001002761
04NOV16:00:00:00,01001002854
25DEC16:00:00:00,01001002870
18NOV16:00:00:00,01001002882
25NOV16:00:00:00,01001002950
31DEC16:00:00:00,01001003003
01NOV16:00:00:00,01001003057
21NOV16:00:00:00,01001003198
02NOV16:00:00:00,01001003231
02DEC16:00:00:00,01001003234
05DEC16:00:00:00,01001003253
26DEC16:00:00:00,01001003284
05NOV16:00:00:00,01001003312
15DEC16:00:00:00,01001003370
02DEC16:00:00:00,01001003414
14DEC16:00:00:00,01001003476
04DEC16:00:00:00,01001003583
10DEC16:00:00:00,01001003609
28DEC16:00:00:00,01001003788
30NOV16:00:00:00,01001003803
10DEC16:00:00:00,01001003913
;;;;

Use this:

data mydata.clean;
set MYDATA.ORIGINALSAMPLE;
if verify(trim(NationalID),"1234567890") = 0 and length(trim(NationalID)) = 11;
run;
byeh2017
Quartz | Level 8

This works! Thank you very much for your help.

 

Just trying to understand how this works. The original code was 0123456789, how come moving the code to 1234567890 made all the difference? And what exactly does the trim function do? Thank you again.

Kurt_Bremser
Super User

The difference between 0123456789 and 1234567890 is just the order of digits on the keyboard, it makes no difference with regard to the verify() function.

 

The trim() function removes all trailing blanks from strings, strip() removes leading and trailing blanks.

 

So, after using trim(), one can easily check for length and digits-only.

 

You could also use the notdigit() function instead of verify().

gamotte
Rhodochrosite | Level 12

Another solution is to use a regular expression

 

if prxmatch("/^\d{11}$/",trim(ID));
Kurt_Bremser
Super User

If your NationalID only contains the same 11-digit numbers like your sample ID, the verify() function will cough up on the blanks that pad the longer variable.

Then your condition should look like

if verify(trim(NationalID),"1234567890") ne 0 or length(trim(NationalID)) ne 11 then delete;

To make it prettier, I'd use a subsetting if:

if verify(trim(NationalID),"1234567890") = 0 and length(trim(NationalID)) = 11;

If the ID's are not left-aligned, use strip() instead of trim().

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 1326 views
  • 3 likes
  • 3 in conversation