The SAS Output Delivery System and reporting techniques

Subsetting Issue

Reply
Occasional Contributor
Posts: 10

Subsetting Issue

Hello guys, i ran this little piece of code below in an attempt to subset some data from a csv file.

LIBNAME Catalog 'E:\Source Dataset';
LIBNAME Autopay 'E:\Breakdown\Autopay';
PROC IMPORT OUT= Catalog.June08
DATAFILE= "E:\Source data\B06 - Switch Summary_csv_2010_0608.csv"
DBMS=CSV REPLACE;
GETNAMES=NO;
DATAROW=2;
RUN;

DATA Autopay.GTB;
SET Catalog.june08;
WHERE var14 = 'Autopay Transactions';
RUN;

PROC PRINT DATA = Autopay.GTB;
TITLE 'GTB AUTOPAY TRXNS';
RUN;


My results below show that 28929 records were read from the Source data (B06 - Switch Summary_csv_2010_0608.csv) but no data was subset into the data set Autopay.GTB.
I have checked all possible errors
All data sets were initialized
Field names are correct
File paths are correct

Can anyone help me out here, the log code is shown below:

443 data CATALOG.JUNE08 ;
444 %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
445 infile 'E:\Source data\B06 - Switch Summary_csv_2010_0608.csv' delimiter = ',' MISSOVER DSD
445! lrecl=32767 firstobs=2 ;
446 informat VAR1 $24. ;
447 informat VAR2 $10. ;
448 informat VAR3 $6. ;
449 informat VAR4 $16. ;
450 informat VAR5 $11. ;
451 informat VAR6 $4. ;
452 informat VAR7 $18. ;
453 informat VAR8 $19. ;
454 informat VAR9 $29. ;
455 informat VAR10 $22. ;
456 informat VAR11 $16. ;
457 informat VAR12 best32. ;
458 informat VAR13 $12. ;
459 informat VAR14 $50. ;
460 format VAR1 $24. ;
461 format VAR2 $10. ;
462 format VAR3 $6. ;
463 format VAR4 $16. ;
464 format VAR5 $11. ;
465 format VAR6 $4. ;
466 format VAR7 $18. ;
467 format VAR8 $19. ;
468 format VAR9 $29. ;
469 format VAR10 $22. ;
470 format VAR11 $16. ;
471 format VAR12 best12. ;
472 format VAR13 $12. ;
473 format VAR14 $50. ;
474 input
475 VAR1 $
476 VAR2 $
477 VAR3 $
478 VAR4 $
479 VAR5 $
480 VAR6 $
481 VAR7 $
482 VAR8 $
483 VAR9 $
484 VAR10 $
485 VAR11 $
486 VAR12
487 VAR13 $
488 VAR14 $
489 ;
490 if _ERROR_ then call symput('_EFIERR_',1); /* set ERROR detection macro variable */
491 run;

NOTE: Numeric values have been converted to character values at the places given by: (Line)Smiley SadColumn).
490:44
NOTE: The infile 'E:\Source data\B06 - Switch Summary_csv_2010_0608.csv' is:
File Name=E:\Source data\B06 - Switch Summary_csv_2010_0608.csv,
RECFM=V,LRECL=32767

NOTE: 28929 records were read from the infile 'E:\Source data\B06 - Switch Summary_csv_2010_0608.csv'.
The minimum record length was 129.
The maximum record length was 203.
NOTE: The data set CATALOG.JUNE08 has 28929 observations and 14 variables.
NOTE: DATA statement used (Total process time):
real time 0.34 seconds
cpu time 0.32 seconds


28929 rows created in CATALOG.JUNE08 from E:\Source data\B06 - Switch
Summary_csv_2010_0608.csv.



NOTE: CATALOG.JUNE08 was successfully created.
NOTE: PROCEDURE IMPORT used (Total process time):
real time 0.68 seconds
cpu time 0.62 seconds


492
493 DATA Autopay.GTB;
494 SET Catalog.june08;
495 WHERE var14 = 'Autopay Transactions';
496 RUN;

NOTE: There were 0 observations read from the data set CATALOG.JUNE08.
WHERE var14='Autopay Transactions';
NOTE: The data set AUTOPAY.GTB has 0 observations and 14 variables.
NOTE: DATA statement used (Total process time):
real time 0.06 seconds
cpu time 0.01 seconds


497
498 PROC PRINT DATA = Autopay.GTB;
499 TITLE 'GTB AUTOPAY TRXNS';
500 RUN;

NOTE: No observations in data set AUTOPAY.GTB.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
SAS Super FREQ
Posts: 8,743

Re: Subsetting Issue

Hi:
I'd suggest, for a start, that you run a PROC FREQ on the variable of interest to make sure that the way you are specifying your string for the WHERE clause is correct (that everything is mixed case, instead of all uppercase or all lowercase, etc). For example, for purposes of the WHERE statement, there would be a difference between 'AUTOPAY TRANSACTIONS' and "autopay transactions" and "Autopay Transactions".

[pre]
PROC FREQ data=Catalog.June08;
tables VAR14;
title 'What is in Var14';
run;
[/pre]

This PROC FREQ would be run AFTER your PROC IMPORT, but before your DATA step program. Examining the results of the PROC FREQ should show you what values are in the VAR14 variable for selection purposes.

cynthia
Occasional Contributor
Posts: 10

Re: Subsetting Issue

Okay Cynthia i will do just that but the worrying thing for me is that even when i used an "IF" statement in the same context i observed the same result.

Plus i am of the understanding that SAS isn't case sensitive, one of the reasons we decided to use SAS was because of its ability to filter similar data into smaller data sets where we could apply whatever technique we wanted to without causing any major impact on our infrastructure.

What difference will the PROC FREQ make as the data will still remain the same.
SAS Super FREQ
Posts: 8,743

Re: Subsetting Issue

Hi:
SAS variable names are NOT case sensitive. For example, you could refer to your variable as var14, VaR14, VAR14, vAR14, or any combination thereof and the variable reference would be recognized.

However, for comparison purposes, such as IF and WHERE statements, case is an issue. There are however, programming techniques you can use to make case irrelevant. So, for example, this:

WHERE upcase(var14) = 'AUTOPAY TRANSACTIONS';

would cause the UPCASE function to be used for the purpose of the comparison, so that no matter what case was -inside- the variable value, your comparion would be performed on an uppercase value. Also, what if for the first 100 rows or so in your worksheet, folks were really good and typed in "Autopay Transactions" but then for about another 500 rows, they typed only "Autopay Transaction" (without the S) and then for another 1000 rows, they typed just "Autopay" and then for the rest of the rows, they entered "AT"??? In order to select all of these possible values, you will have to code the WHERE or IF statement accordingly. PROC FREQ will show you exactly what is in the VAR14 variable.

Do you know, unequivocally, that "Autopay Transactions" occurs -exactly- like that on every row that you want to select??? Proc Freq will show you -exactly- what is inside VAR14. What if you miscounted the columns and VAR14 does not contain 'Autopay Transactions' -- maybe it's VAR13 or VAR15. It is possible that the columns in Excel did not get read in as you expected or that the variable numbers are not as you expect.

What if you had this in the Excel sheet:
[pre]
A B C D E<--Excel col nums
1 VAR1 VAR2 VAR3 VAR4 VAR5
2 Regular Transactions
3 alan 111 111 111
4 bob 222 222 222
5 carl 333 333 333
6 Autopay Transactions
7 dave 444 444 444
8 edwin 555 555 555
^
|Excel row nums
[/pre]

In the above "sheet simulation" VAR1 shows a category of either Regular Transactions or Autopay Transactions. Rows 3, 4 and 5 are the Regular Transactions and rows 7 and 8 are the Autopay Transactions. However, when SAS reads in data that looks like this, SAS thinks that VAR1 is missing or blank for rows 3, 4, 5, 7 and 8.

Without knowing what was originally in your worksheet, it is impossible to tell why your WHERE statement is NOT working. However, using PROC FREQ will allow you to explore the imported data to make sure that the data got imported into the variables correctly and WHAT is in each field.

Consider the following PROC PRINT:
[pre]
proc print data=sashelp.class;
title 'Eighteen year old students';
where age = 18;
run;
[/pre]

When I run that PROC PRINT, the WHERE statement will result in this message in the log:
[pre]
207 proc print data=sashelp.class;
208 title 'Eighteen year old students';
209 where age = 18;
210 run;

NOTE: No observations were selected from data set SASHELP.CLASS.
NOTE: There were 0 observations read from the data set SASHELP.CLASS.
WHERE age=18;

[/pre]

But, if I had run a PROC FREQ before hand, I would have already known that nobody was age 18 in the data:
[pre]
LOG:
212 proc freq data=sashelp.class;
213 tables age;
214 run;

NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: PROCEDURE FREQ used (Total process time):
real time 0.59 seconds
cpu time 0.00 seconds

OUTPUT:
Cumulative Cumulative
Age Frequency Percent Frequency Percent
--------------------------------------------------------
11 2 10.53 2 10.53
12 5 26.32 7 36.84
13 3 15.79 10 52.63
14 4 21.05 14 73.68
15 4 21.05 18 94.74
16 1 5.26 19 100.00
[/pre]

cynthia
Occasional Contributor
Posts: 10

Re: Subsetting Issue

Many thanks Cynthia, i will apply these techniques and let you know how it goes.
Ask a Question
Discussion stats
  • 4 replies
  • 249 views
  • 0 likes
  • 2 in conversation