BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
A_Swoosh
Quartz | Level 8

Hi SAS Community,

 

I am importing pipe delimited .txt and .csv files into SAS and want to run a series of checks:

1. Count the number of records

2. Count the number of distinct records for ID

3. Whether the variables match a list of set variables

4. Are the values expected or in the correct format

 

Is there a way to output 1 dataset with all this information? As it currently stands,I produce these results into a text file that are the results of multiple proc frequencies and proc sql.

These are my expected variables:

data VARS;
input vars $;
datalines;
ID
BDAY
GENDER
ADDRESS1
ADDRESS2
CITY
STATE
ZIP
ZIP4
COUNTY_CODE
;
run;

This is my sample data:

data sample;
infile datalines dlm="|" missover dsd; 
input ID $2 BDAY $10 GENDER $ ADDRESS1 $ ADDRESS2 $10. CITY $ STATE $ ZIP $ ZIP4 $ COUNTY_CODE $;
datalines;
A1|20200420|M|123 Main St.|Suite 201|Juneau|AK|99802||02112
B2|4/20/2020|Male|124 Main St.||Juneau|AK|99802|Juneau
C3|4/20/2020|M|125 Main St.||Juneau|AK|99802||02112
4-1|20200420|Male|126 Main St.|Suite 101|Juneau|AK|99802||Juneau
;
run;

 Conditions:

ID:should be alpha numeric and 2 characters

BDAY: should be YYYYMMDD format

Gender: should be 1 characters

Address1 and Address 2 should be character

City should character

State should be 2 characters

zip should be 5 characters

zip4 should be no more than 4 characters 

county should be the fips code which is 5 characters

 

Results:

Line 1: accurate and meets all criteria

Line 2: dob is not in correct format, gender is not 2 characters, does not have zip4 variable, county does not have 5 character

Line 3: dob is not in correct format

Line 4: id is not alpha numeric with no special characters, and gender is not 2 characters

 

 

So far I have:

/*Variables match list? */
		proc sql;
			create table plan_vars as
			select strip(upcase(name)) as vars 
			from sashelp.vcolumn
			where libname='WORK' and memname =SAMPLE;
		quit;


		proc sql;
			create table comparevar_&file. as
			select a.*, b.*,
			case when a.vars = b.varsm then 'Match'
			else 'No' end as var_match
			from VARS as a
			full join plan_vars(rename=(vars=varsm)) as b
			on a.vars=b.varsm;
		quit;

/*output records */
		ods rtf file="&output.\FileReview.rtf";  
		ods noptitle; 
		options nodate nonumber;

		proc sql;
			create table ctrltot_&file. as
			select count(*) as Total_number_of_records,
				count(distinct ID) as Total_number_of_unique_ID,
				count (ID) as tot_ID
			from sample;
		quit;

		proc freq data = sample;
		   	tables _all_;
		   	format _numeric_ _character_ $miss.;
		run;

		proc freq data = ctrltot_&file.;
		tables _all_;
		run;

		proc freq data=comparevar_&file.;
			tables _all_;
			where var_match = 'No';
		run;

 

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

Something like this?

proc sql;
   select count(*)                                    as NB_RECORDS
         ,count(distinct ID)                          as NB_UNQ_ID
         ,sum(^prxmatch('/^\w\w$/',ID))               as NB_BAD_ID
         ,sum(^input(BDAY,yymmdd8.))                  as NB_BAD_BDAY
         ,sum(^prxmatch('/^[MF]$/i',GENDER))          as NB_BAD_GENDER
         ,sum(^prxmatch('/^[A-Z ]*$/i',CITY))         as NB_BAD_CITY
         ,sum(^prxmatch('/^\d{5} *$/i',ZIP))          as NB_BAD_ZIP
         ,sum(^prxmatch('/^\d{1,4} *$/i',ZIP4))       as NB_BAD_ZIP4
         ,sum(^prxmatch('/^\d{5} *$/i',COUNTY_CODE )) as NB_BAD_CC
   from TABLE;

 

View solution in original post

7 REPLIES 7
ChrisNZ
Tourmaline | Level 20

> Line 1: accurate and meets all criteria

How?

ID=A1 is not 7 characters, GENDER=M is not 2 characters etc

A_Swoosh
Quartz | Level 8

Sorry, my apologies. It was a sample and I forgot to change the length. It should be 2 and 1, respectively. 

ChrisNZ
Tourmaline | Level 20

Something like this?

proc sql;
   select count(*)                                    as NB_RECORDS
         ,count(distinct ID)                          as NB_UNQ_ID
         ,sum(^prxmatch('/^\w\w$/',ID))               as NB_BAD_ID
         ,sum(^input(BDAY,yymmdd8.))                  as NB_BAD_BDAY
         ,sum(^prxmatch('/^[MF]$/i',GENDER))          as NB_BAD_GENDER
         ,sum(^prxmatch('/^[A-Z ]*$/i',CITY))         as NB_BAD_CITY
         ,sum(^prxmatch('/^\d{5} *$/i',ZIP))          as NB_BAD_ZIP
         ,sum(^prxmatch('/^\d{1,4} *$/i',ZIP4))       as NB_BAD_ZIP4
         ,sum(^prxmatch('/^\d{5} *$/i',COUNTY_CODE )) as NB_BAD_CC
   from TABLE;

 

A_Swoosh
Quartz | Level 8

Yes, this is exactly what I'm trying to accomplish. Thank you for the help.
I don't have much experience with Perl Regular Expressions but is there a way to find a series of values that is the length of the variable and if it matches then identify as bad? I want to match repeating but instead of n it's the length of the variable? I also want to identify an ID that is not alphanumeric. For example the ID is a description instead of the A235324.

 

Case 1:

ID: 0000000.

Case 2:

ID: Physician

Case 3:

ID: A235324

ChrisNZ
Tourmaline | Level 20
How can ID be 00000000 when you define it as $2?
A_Swoosh
Quartz | Level 8

I was trying to present another example so I'm clear about the syntax involved with perl expressions since I'm new to those expressions.

 

If I have another dataset where I'm trying to identify

data sample;
infile datalines dlm="|" missover dsd; 
input CATEGORY $ ID $;
datalines;
Physician|A123242
|0000000
PS220|A123456
run;

Case 3 has the proper format for each variable while case 2 has both wrong, and case 1 has CATEGORY wrong.

 

ChrisNZ
Tourmaline | Level 20

I want to match repeating but instead of n it's the length of the variable

prxmatch('/(\d)\1{4}/',STR);

matches a digit repeated 4 times (ie 5 identical digits)

Another way could be testing:

repeat(first(STR),4)=STR

 

For the rest you can look at the ANYxxxx and NOTxxxx functions

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 1457 views
  • 2 likes
  • 2 in conversation