BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
A_Swoosh
Quartz | Level 8

Hi SAS Community,

 

I am importing pipe delimited .txt and .csv files into SAS and want to run a series of checks:

1. Count the number of records

2. Count the number of distinct records for ID

3. Whether the variables match a list of set variables

4. Are the values expected or in the correct format

 

Is there a way to output 1 dataset with all this information? As it currently stands,I produce these results into a text file that are the results of multiple proc frequencies and proc sql.

These are my expected variables:

data VARS;
input vars $;
datalines;
ID
BDAY
GENDER
ADDRESS1
ADDRESS2
CITY
STATE
ZIP
ZIP4
COUNTY_CODE
;
run;

This is my sample data:

data sample;
infile datalines dlm="|" missover dsd; 
input ID $2 BDAY $10 GENDER $ ADDRESS1 $ ADDRESS2 $10. CITY $ STATE $ ZIP $ ZIP4 $ COUNTY_CODE $;
datalines;
A1|20200420|M|123 Main St.|Suite 201|Juneau|AK|99802||02112
B2|4/20/2020|Male|124 Main St.||Juneau|AK|99802|Juneau
C3|4/20/2020|M|125 Main St.||Juneau|AK|99802||02112
4-1|20200420|Male|126 Main St.|Suite 101|Juneau|AK|99802||Juneau
;
run;

 Conditions:

ID:should be alpha numeric and 2 characters

BDAY: should be YYYYMMDD format

Gender: should be 1 characters

Address1 and Address 2 should be character

City should character

State should be 2 characters

zip should be 5 characters

zip4 should be no more than 4 characters 

county should be the fips code which is 5 characters

 

Results:

Line 1: accurate and meets all criteria

Line 2: dob is not in correct format, gender is not 2 characters, does not have zip4 variable, county does not have 5 character

Line 3: dob is not in correct format

Line 4: id is not alpha numeric with no special characters, and gender is not 2 characters

 

 

So far I have:

/*Variables match list? */
		proc sql;
			create table plan_vars as
			select strip(upcase(name)) as vars 
			from sashelp.vcolumn
			where libname='WORK' and memname =SAMPLE;
		quit;


		proc sql;
			create table comparevar_&file. as
			select a.*, b.*,
			case when a.vars = b.varsm then 'Match'
			else 'No' end as var_match
			from VARS as a
			full join plan_vars(rename=(vars=varsm)) as b
			on a.vars=b.varsm;
		quit;

/*output records */
		ods rtf file="&output.\FileReview.rtf";  
		ods noptitle; 
		options nodate nonumber;

		proc sql;
			create table ctrltot_&file. as
			select count(*) as Total_number_of_records,
				count(distinct ID) as Total_number_of_unique_ID,
				count (ID) as tot_ID
			from sample;
		quit;

		proc freq data = sample;
		   	tables _all_;
		   	format _numeric_ _character_ $miss.;
		run;

		proc freq data = ctrltot_&file.;
		tables _all_;
		run;

		proc freq data=comparevar_&file.;
			tables _all_;
			where var_match = 'No';
		run;

 

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

Something like this?

proc sql;
   select count(*)                                    as NB_RECORDS
         ,count(distinct ID)                          as NB_UNQ_ID
         ,sum(^prxmatch('/^\w\w$/',ID))               as NB_BAD_ID
         ,sum(^input(BDAY,yymmdd8.))                  as NB_BAD_BDAY
         ,sum(^prxmatch('/^[MF]$/i',GENDER))          as NB_BAD_GENDER
         ,sum(^prxmatch('/^[A-Z ]*$/i',CITY))         as NB_BAD_CITY
         ,sum(^prxmatch('/^\d{5} *$/i',ZIP))          as NB_BAD_ZIP
         ,sum(^prxmatch('/^\d{1,4} *$/i',ZIP4))       as NB_BAD_ZIP4
         ,sum(^prxmatch('/^\d{5} *$/i',COUNTY_CODE )) as NB_BAD_CC
   from TABLE;

 

View solution in original post

7 REPLIES 7
ChrisNZ
Tourmaline | Level 20

> Line 1: accurate and meets all criteria

How?

ID=A1 is not 7 characters, GENDER=M is not 2 characters etc

A_Swoosh
Quartz | Level 8

Sorry, my apologies. It was a sample and I forgot to change the length. It should be 2 and 1, respectively. 

ChrisNZ
Tourmaline | Level 20

Something like this?

proc sql;
   select count(*)                                    as NB_RECORDS
         ,count(distinct ID)                          as NB_UNQ_ID
         ,sum(^prxmatch('/^\w\w$/',ID))               as NB_BAD_ID
         ,sum(^input(BDAY,yymmdd8.))                  as NB_BAD_BDAY
         ,sum(^prxmatch('/^[MF]$/i',GENDER))          as NB_BAD_GENDER
         ,sum(^prxmatch('/^[A-Z ]*$/i',CITY))         as NB_BAD_CITY
         ,sum(^prxmatch('/^\d{5} *$/i',ZIP))          as NB_BAD_ZIP
         ,sum(^prxmatch('/^\d{1,4} *$/i',ZIP4))       as NB_BAD_ZIP4
         ,sum(^prxmatch('/^\d{5} *$/i',COUNTY_CODE )) as NB_BAD_CC
   from TABLE;

 

A_Swoosh
Quartz | Level 8

Yes, this is exactly what I'm trying to accomplish. Thank you for the help.
I don't have much experience with Perl Regular Expressions but is there a way to find a series of values that is the length of the variable and if it matches then identify as bad? I want to match repeating but instead of n it's the length of the variable? I also want to identify an ID that is not alphanumeric. For example the ID is a description instead of the A235324.

 

Case 1:

ID: 0000000.

Case 2:

ID: Physician

Case 3:

ID: A235324

ChrisNZ
Tourmaline | Level 20
How can ID be 00000000 when you define it as $2?
A_Swoosh
Quartz | Level 8

I was trying to present another example so I'm clear about the syntax involved with perl expressions since I'm new to those expressions.

 

If I have another dataset where I'm trying to identify

data sample;
infile datalines dlm="|" missover dsd; 
input CATEGORY $ ID $;
datalines;
Physician|A123242
|0000000
PS220|A123456
run;

Case 3 has the proper format for each variable while case 2 has both wrong, and case 1 has CATEGORY wrong.

 

ChrisNZ
Tourmaline | Level 20

I want to match repeating but instead of n it's the length of the variable

prxmatch('/(\d)\1{4}/',STR);

matches a digit repeated 4 times (ie 5 identical digits)

Another way could be testing:

repeat(first(STR),4)=STR

 

For the rest you can look at the ANYxxxx and NOTxxxx functions

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 2316 views
  • 2 likes
  • 2 in conversation