VS
Calcite | Level 5

Hi,

I compare two data sets.

PROC COMPARE BASE=data1 COMPARE=data2 MAXPRINT=(30,50);
  ID master_key_int;
RUN;

I get results for several character variables as shown below: in the BASE data the value is R, while in the COMPARE data it is reported as either missing or a double quote. However, when I check the actual data, the value is R.

Could someone advise what could have gone wrong and how to fix it?

Other affected variables show the same pattern: values are reported as missing in the COMPARE data but have a valid value in the actual data, or double quotes appear before/after the actual value, so BASE and COMPARE do not match.

Obs       || Base  Compare
346871    || R     (missing)
381583    || R     "
1118626   || "     (missing)

Thanks a lot

Victoria


4 REPLIES
Peter_C
Rhodochrosite | Level 12

@VS 

We have a similar problem: not enough information to help.

If your input data sets are not sorted by the ID variable, or contain duplicate ID values, messages will be reported in the SAS log. Have you checked for those messages?

It is difficult to see the whole of your problem from the clip of results you offer.

If you extend your syntax (not by a lot), an output data set of the differences will reveal more. Before MAXPRINT, add:

OUT=DIFFS OUTNOEQUAL OUTALL

That creates the WORK.DIFFS data set, which indicates at which points within matched cells the strings differ. It contains a _TYPE_ column that labels each of four lines per observation as BASE, COMPARE, DIF, or PERCENT. With these details, including the _OBS_ column identifying the row where the differences were found, I expect you will be able to uncover where these differences are and what they mean.
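For example, a minimal sketch of the extended step, reusing the data set and ID names from your post:

PROC COMPARE BASE=data1 COMPARE=data2
             OUT=diffs         /* write the differences to WORK.DIFFS        */
             OUTNOEQUAL        /* keep only observations that differ         */
             OUTALL            /* write BASE, COMPARE, DIF, and PERCENT rows */
             MAXPRINT=(30,50);
  ID master_key_int;
RUN;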

Good luck

peterC

VS
Calcite | Level 5

Hi Peter,

Thank you for your reply. Today I tried re-loading my data (.csv files with double quotes around strings), this time removing the double quotes around character variables. After that, all the "false positives" disappeared. So it was due to improperly reading in the data.

However, the question still remains why SAS sometimes retains occasional double quotes when importing a .csv file.

I suppose now I have to start a new discussion about this problem.

Thanks 

Peter_C
Rhodochrosite | Level 12

@VS

Loading CSV files should be very easy: there are options to help you define everything, as little or as much as you need.

The trouble with PROC IMPORT is that it is difficult to progress from a PROC IMPORT load to a more specific load where you define data types, column widths, and conversions.

Fortunately, there is a route that supports you in being as specific as you wish, and it lets you guarantee that each of a series of files is loaded with the same rules.

The place to start is this short data step (the biggest part is the INFILE statement, which defines options for handling your data file):

data test1 (compress=yes);
  infile 'your file'
    dsd          /* delimiter-separated data: removes matched quotes around values */
    dlm=','      /* comma delimiter, or whichever you need; TAB is '09'x           */
    lrecl=10000  /* the default record width of 256 is often too narrow            */
    truncover    /* if a line ends before all requested columns are filled, the
                    remaining columns are left empty; without this option, INPUT
                    would FLOWOVER onto the next line, which is not what this
                    simple load routine wants                                      */
    firstobs=1 obs=1000  /* remove OBS= to read all lines, including the header    */
  ;
  length col1-col100 $100;  /* simple definition of string columns */
  input col1-col100;        /* read all those columns              */
run;

When that data step has run, examine the table WORK.TEST1 with your favorite viewer (open the table in the Enterprise Guide explorer, use PROC PRINT, or use VIEWTABLE in base SAS Display Manager).
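For instance, a minimal check with PROC PRINT, assuming the data step above has run:

proc print data=work.test1 (obs=20);  /* inspect the first 20 parsed rows */
run;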

That INFILE statement defines the source data file you want to load.

The INPUT statement presumes to load up to 100 columns (named col1 to col100). If there are fewer than 100 columns, the unused columns will simply be blank. The COMPRESS=YES option ensures the table TEST1 won't waste space on columns that are empty or shorter than the 100-character width defined by that LENGTH statement.

The idea is to create a quick view of the data once it is parsed into its columns, and to provide a resource for checking data types.

By the usual standards for a CSV file, there will be a column-heading line at the top of the file, which this program will read as data. The FIRSTOBS= option of the INFILE statement could be changed from 1 to 2 to skip that line, but when reviewing (just the first 1000 lines, thanks to that OBS= option) it is nice to see what the column headings actually are.

This review helps you decide which columns you want to load as values (not all numeric-looking columns should be loaded as numbers: like Excel, SAS cannot reliably store the 16th digit of a credit card number if you load it as a number). The review also helps you recognize dates, times, datetime values, and percentages.
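A minimal sketch illustrating that credit-card point (a SAS numeric is an 8-byte floating-point value, so not every integer beyond 2**53 can be stored exactly):

data _null_;
  x = 9999999999999999;  /* 16 digits, beyond the 2**53 exact-integer limit     */
  put x= best32.;        /* the stored value was rounded to 10000000000000000   */
run;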

SAS provides informats that enable your program to convert the raw source text into the various data types.

For example:

12345678901234567,1234rtyui45678x,1234.56,M,01/23/2014,ytr,hgfd,,"12,345,678.12",vcxz,"surname,first",,6543,,,,,,tre,fd...........,

Should you find that the third column is suitable to load as a value (say, an account balance), that the next column contains mostly M, S, or D, and that the one after is a date in mm/dd/yyyy style, then the LENGTH statement would adapt like:

LENGTH col1-col2 $100  col3 8  col4 $1  col5 8  col6-col100 $100;

(A length beginning with $ is a string length; without the $, the length implies a numeric variable of one type or another.)

To define those conversions I would use both INFORMAT and FORMAT statements, like:

INFORMAT col3 comma18.   col5 mmddyy10.;
FORMAT   col3 comma18.2  col5 date11.;

Unless you want the presentation to look different from the default (the default output looks much like the stored data), you don't need to define a format.

There are many ways of presenting values with formats, and (nearly) always there is an INFORMAT to convert the values from the raw input. (There is even a NEGPAREN. informat for converting values where parentheses indicate a negative value.)
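A quick sketch of that informat in action (the value here is made up for illustration):

data _null_;
  x = input('(1234.56)', negparen12.);  /* parentheses mark the value as negative */
  put x=;                               /* x=-1234.56 */
run;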

After deciding the data types, you can make the column names more meaningful, like ac_bal and marital_status instead of col3 and col4. That makes the list of columns on the INPUT statement a little more complex. However, as long as you define the columns in the order they appear in the infile data, you can use an INPUT statement like:

     INPUT col1 -- last_col ;

That is a double hyphen (two dashes) between the names of the first and last columns you INPUT.
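Pulling those pieces together, a sketch of the refined load (ac_bal, marital_status, and open_date are hypothetical names; adjust them to your own columns):

data test2 (compress=yes);
  infile 'your file' dsd dlm=',' lrecl=10000 truncover
         firstobs=2;  /* firstobs=2 skips the heading line this time */
  length col1-col2 $100
         ac_bal 8  marital_status $1  open_date 8
         col6-col100 $100;
  informat ac_bal comma18.   open_date mmddyy10.;
  format   ac_bal comma18.2  open_date date11.;
  input col1 -- col100;  /* double hyphen: every column, in the order defined above */
run;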

 

What I hope this shows is how this progressive load-review-adapt-repeat cycle lets you investigate the data while you progressively refine the maximum widths for your strings and suitable informats for the columns that need conversion. Once these are all defined, each of your tables can be loaded by applying a variation of the same data step (just change the output table name on each DATA statement).

peterC

A link to the online doc, INFILE statement example "Reading Delimited Data": http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000146932.htm#a000...

VS
Calcite | Level 5

Thank you Peter, this is a fantastic piece of information!
