viola
Obsidian | Level 7

I'm looking for help with general strategy and not necessarily each step of my code, which is why I'm not including any specific code here. If anyone is able to talk me through how they might accomplish the following, I'd be so appreciative. Using Base SAS 9.4. 

 

Data description:

~60 CSV files that each contain a unique set of data

-Each file is a part of the whole. All of the variables, labels, and formats are contained in one master sheet, in columns, listed according to their dataset grouping. However, the order in which the variables are listed in the master sheet is not the same order in which they appear in the datasets.

 

I need to:

-Import each CSV file, retaining the original file name for the SAS datasets

-Save each as a separate SAS dataset (they are not being combined)

-Label the variables

 

I've been wrestling with various strategies for this task but can't figure out which one makes the most sense. At the end of the day, I need a solution that is dynamic and doesn't require me to read in each dataset individually and hard-code the variables (I have to do this so often at my job that I don't have time for a manual process).

 

The piece I'm struggling with most is whether or not I need to chop up the master sheet in order to apply the labels. I can't figure out a way of referring to the master sheet that pulls out only the labels I need for any particular dataset.

 

Should I:

1. Use PROC IMPORT to create temporary datasets, then apply the labels in another step

2. Use a DATA step to read in the data and apply the labels in one step... I'm guessing this would require sorting the variables first

3. Something else?

1 ACCEPTED SOLUTION

Tom
Super User

Use the sheet with the metadata to build a dataset with information about the variables.

Then for each of your 60 files follow this process.

1) Read the header row to see what variables are included and what order they appear.

2) Combine with the metadata to get the information on the variable definitions.

3) Use that to write a data step that can read the file.

4) Run the generated data step to convert the CSV file into a dataset.

 

A couple of questions about how complex your setup is.

1) Do the names used in the header rows of the CSV files always match the names in the metadata sheet? Or are these CSV files created ad hoc, with typos in the header rows?

2) Does the same variable name always mean the exact same variable definition (same type/length/format/label) even when it appears in different CSV files?  For example is the variable AGE always a number? Or is AGE used in some sheets for a character age category variable?

3) Can you tell from the name of the CSV file what variables should be in it, even if you don't know the order they will appear in? Or do the 60 CSV files have ad hoc or random sets of variables?

 

Assuming that the variables are defined consistently across all of the files, you could just create a single template dataset that has all of the possible variables. Make sure to attach informats to the variables that will need them (like DATE, TIME, and DATETIME fields). There is no need to attach informats to most character and numeric variables, since SAS already knows how to read numbers and strings from a CSV file without you needing to instruct it to use a specific informat.

 

So assuming this template dataset is named TEMPLATE, then your program for processing one file is as simple as:

%let filename=physical csv filename;
%let dsname=want;

data _null_;
  infile "&filename" obs=1;
  input;
  call symputx('varlist',tranwrd(_infile_,',',' '));
run;

data &dsname ;
  if 0 then set template (keep=&varlist);
  infile "&filename" dsd firstobs=2 truncover ;
  input (&varlist) (+0);
run;


9 REPLIES
ybolduc
Quartz | Level 8

Personally, I would go for option #1. Before applying the labels and formats, I'd probably include a step to validate that all expected columns are present and have the right type. This would allow you to discard a bad file right away and send a notification, saving you the trouble of regularly scrubbing the logs in search of "ERROR:".

I try as much as possible to assume that the data I receive could be wrong, and to put the right validations in place.
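A rough sketch of that validation step, assuming the metadata sheet has already been read into a dataset (here called META with a VARIABLE column; the dataset name, column name, and file path are all hypothetical), might look like:

```sas
/* Hedged sketch: check a CSV header against the expected columns   */
/* before importing. META, VARIABLE, and the path are assumed names. */
data _null_;
  infile "c:\path\file1.csv" obs=1;
  input;
  /* strip any quotes so names compare cleanly */
  call symputx('header', compress(_infile_, '"'));
run;

proc sql noprint;
  /* count expected variables that do not appear in the header */
  select count(*) into :nmissing trimmed
  from meta
  where findw("&header", strip(variable), ',', 'ir') = 0;
quit;

data _null_;
  if &nmissing > 0 then
    put "WARNING: &nmissing expected column(s) missing - investigate before import.";
  else
    put "NOTE: Header check passed.";
run;
```

FINDW treats the comma-separated header as a word list; a fuller version would also compare types against the metadata, as suggested above.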

viola
Obsidian | Level 7

@Tom, thank you. To answer your questions: 

 

1) The names in the header rows will always match the names in the metadata sheet. The CSV files are generated by a standard export procedure that will always produce the same set of headers. 

2) The variable names will always meet the definitions set by the metadata sheet - if one variable appears in more than one of the datasets it will always be defined in the same way. 

3) The CSV files always have the same sets of variables. They are exports of data collection forms, where the variables are in the order that they appear on the form. However, in the metadata sheet, they are not listed in the order they appear on the collection form. No ad hoc or random variables.



Tom
Super User Tom
Super User

See my example code in my updated original reply.

If you want the variable order in the dataset to match the CSV file's order, add a RETAIN &VARLIST statement before the IF 0 THEN SET statement. Otherwise the order will match the order in the TEMPLATE dataset.

If the metadata sheet also changes over time, then post an example of it so we can look at automating the generation of the TEMPLATE dataset from the metadata. Or, instead of the IF 0 THEN SET statement, you could generate a series of ATTRIB statements derived from the metadata and eliminate the need for the TEMPLATE dataset.
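The ATTRIB alternative could be sketched roughly as follows. It assumes the metadata is in a dataset META with VARIABLE, LENGTH, LABEL, and FORMAT columns (META and its columns are assumed names, and the LENGTH column is an extra assumption, needed so ATTRIB can define each variable's type), and that &FILENAME, &DSNAME, and &VARLIST come from the earlier header-reading step:

```sas
/* Hedged sketch: generate ATTRIB statements from metadata so no    */
/* TEMPLATE dataset is needed. META and its columns are assumptions. */
proc sql noprint;
  select catx(' ', 'attrib', variable,
              cats('length=', length),
              cats('label="', strip(label), '"'),
              cats('format=', format))
    into :attribs separated by '; '
  from meta
  /* keep only the variables present in this file's header */
  where findw("&varlist", strip(variable), ' ', 'ir');
quit;

data &dsname;
  &attribs;                                    /* defines the variables    */
  infile "&filename" dsd firstobs=2 truncover;
  input (&varlist) (+0);                       /* read in the file's order */
run;
```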

viola
Obsidian | Level 7

Thanks for the code example @Tom. I think this could work. The metadata sheet will look something like this: 

 

Variable    Group    Label     Format
var1        1        label1    format1
var2        1        label2    format2
var3        2        label3    format3
var4        2        label4    format4

 

Each of the 60 datasets will only need the variables in its respective group, if that makes sense. So somehow I need to subset the metadata sheet with the relevant variables for each dataset. 
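One way to subset the metadata by group, without chopping up the sheet, could be sketched like this (assuming the sheet has been read into a dataset META with the columns shown above, and that dataset ds1 corresponds to group 1; all names here are hypothetical):

```sas
/* Hedged sketch: pull only one group's labels out of the metadata  */
/* and apply them after import. META, GROUP, DS1 are assumed names. */
%let group=1;
%let dsname=ds1;

proc sql noprint;
  /* build var="label" pairs for this group only */
  select catx('=', variable, quote(strip(label)))
    into :labels separated by ' '
  from meta
  where group = &group;
quit;

proc datasets lib=work nolist;
  modify &dsname;
    label &labels;
quit;
```

The same idea extends to formats by building var=format. pairs for a FORMAT statement.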

 

@ballardw, I would be more willing to copy and paste the proc import code if I didn't have to do that 60 times over. Each of the CSV files has a different variable list. 

ballardw
Super User


Your responses sound as if the CSV files all have the same layout. It should be practical to create a basic data step that will read the data in the same way by changing only 1) the input file name and 2) the output dataset name.

 

I would start with something like

 

proc import datafile="c:\path\inputfilename.csv"
     out=mylib.mydatasetname
     dbms=csv
     replace;
     guessingrows=max;
     getnames=yes;
run;

 

The log will contain a data step program created to read that file.

Copy the data step to the editor.

Remove line numbers if they appeared in the log.

Check with your metadata sheet to see if the variables have the appropriate length. Example: the metadata says that variable Sitename should be 15 characters, but the program shows an informat (which would set the length) of $12. because of the actual data read. Change the informat from $12. to $15. (If paranoid, make it longer in case they sneak a data change in on you.)

Verify that character-type variables have a $ informat. Values like identification numbers might be guessed to be numeric by the import procedure, so instead of "003456" you get a number value of 3456. Most likely that involves changing a BEST12. or BEST32. informat to $10. or similar.

Check that values that should be numeric are not read with a $ informat. This may happen if an infrequently used column is missing for all of the records in your data. When that happens, the informat will often be $1.

If a value represents a date, time, or datetime, it may help to check the informat and format assigned to make sure it is read as such. If your data looks like an obvious date, such as 12/24/2016 or 10JAN2017, then SAS will likely guess correctly. But a value like 121288 (12 Dec 1988) may not be guessed correctly. There is a largish number of possible date, time, and datetime informats.

 

If the variable names you get don't quite match what you expected, you can do text search-and-replace to get them "right". At this point I also usually add LABEL statements so the data carries a better description than "Val1". Labels give you up to 256 characters to describe a variable, such as:

Label val1="Measured groundwater temperature (C)";

Test your modified program by rereading the data and checking that the result looks good. Fix things if not.

When you are done, you can read the other files by replacing the infile name and the output dataset name.

There are even code steps that will let you read a whole bunch of CSV files at one time.
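That bulk-read step could be sketched like this, assuming the CSV file names happen to be valid SAS dataset names (the folder path and macro name are made up):

```sas
/* Hedged sketch: import every CSV in one folder, naming each output */
/* dataset after its file. Path and macro name are hypothetical.     */
%macro import_all(dir=c:\path);
  filename flist "&dir";
  data _null_;
    did = dopen('flist');                      /* open the directory */
    do i = 1 to dnum(did);
      fname = dread(did, i);                   /* i-th file name     */
      if lowcase(scan(fname, -1, '.')) = 'csv' then
        /* queue one PROC IMPORT per CSV file */
        call execute(catt(
          'proc import datafile="', "&dir\", fname,
          '" out=', scan(fname, 1, '.'),
          ' dbms=csv replace; guessingrows=max; getnames=yes; run;'));
    end;
    rc = dclose(did);
  run;
%mend import_all;
%import_all()
```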

viola
Obsidian | Level 7



Hi @Tom - I attempted to use your code below but am running into errors with the _infile_ piece. The varlist comes out as "VAR1" "VAR2" - why am I seeing quotes around the variable names, and how can I get rid of them?

 

147
148  data &dsname ;
149    retain &varlist;
NOTE: Line generated by the macro variable "VARLIST".
1     "STUDYID" "SITEID" "SITENAME" "SUBJID" "EPOCH" "VISITID" "VISIT" "VISITNUM"
      ---------
      22
      76
1  !  "VISITDY" "Visit Repeat Key" "DOMAIN" "Form Repeat Key" "Item Group Repeat Key" "Scheduled
1  !  Date" "Completed Date" "DMDAT" "BRTHDAT" "SEX" "CHILDBR" "RACE" "RACEOTH"
ERROR 22-322: Syntax error, expecting one of the following: a name, ;, _ALL_, _CHARACTER_, _CHAR_, _NUMERIC_.
ERROR 76-322: Syntax error, statement will be ignored.

 

 

 

%let filename=physical csv filename;
%let dsname=want;

data _null_;
  infile "&filename" obs=1;
  input;
  call symputx('varlist',tranwrd(_infile_,',',' '));
run;

data &dsname ;
  if 0 then set template (keep=&varlist);
  infile "&filename" dsd firstobs=2 truncover ;
  input (&varlist) (+0);
run;

 

Tom
Super User Tom
Super User

In a CSV file, quotes are only required around values that include the delimiter or quote characters, but extra quotes are allowed and will be stripped. It looks like your source file had a header line like

"STUDYID","SITEID","SITENAME",...

Instead of what a normal CSV file would have of:

STUDYID,SITEID,SITENAME,...

You could try just removing them.  

call symputx('varlist',tranwrd(compress(_infile_,'"'),',',' '));

Or make that step more complicated.

ChrisNZ
Tourmaline | Level 20

The error is due to the RETAIN statement not accepting the quotes.

You can remove them on the fly, when they're unneeded, like this:

 

data &dsname. ;
  retain %sysfunc(compress(&varlist.,%str(%")));

 

 


Discussion stats
  • 9 replies
  • 3789 views
  • 2 likes
  • 5 in conversation