Read Rows as loading address to generate data sets

YKU · Posted 12-24-2019 11:03 AM

Hi

I am new to SAS and will need some bits of help.

I got a table with a data location address and want to read it as an input to generate contents tables from these addresses. Could anyone give me a hint?

Structure:

1. Each file has similar headers, but not identical. For example, some file contains var1-100 but others might only have var35- 95.

2. No file to define formate of each variable

File_Name sample

C:\this\is\where\data\located\abcd.csv

Ultimate Goal:

The goal is not to tack all data but build a data dictionary for all files, including var name, types, var range, filename, address, etc. Might be able to do through Proc Contents?

Many Thanks

ballardw · Posted 12-24-2019 11:59 AM

This sounds like you are requesting how to read the CSV files into one or more SAS data sets.

Are all of these CSV files of the same structure? Which means that the columns have the same order and content type (text, numeric, date, or what have you value.

Does each file have a single header row of column labels, no header row, or multiple rows?

Do you have document that describes what should be in each column? Such as maximum length of text or the format of date or time fields?

And are these data sets supposed to be combined in any way after creation?

These questions go into how much detail will be needed to be successful. A single file may not be much of an issue but if these are supposed to be related you would want a process that allows combining data and what may work for a single file might have issues when working with multiple files.

When you have the names of the files it is moderately easy to use them for this purpose but the details can matter as to how they are used.

YKU · Posted 12-24-2019 12:11 PM

Thank you so much, I updated my questions as above.

ballardw · Posted 12-24-2019 01:00 PM

It would be best if your data set with the file_name also had the dataset name you would like to use for it. We can attempt to build a data set name based on the input string but if you have more than one folder represented you may not have unique names in the "abcd.csv" part and if the length of the last part is too long then there may be similar issues with truncating to the maximum length that SAS allows for data set names. Also if the data is to go into different libraries then having that in the name would be best.

Here is an example of generating Proc Import code for each set since you say that you have differing file structures.

data _null_;
   set yourdataset;
   length fname $200. outname $ 41;
   fname= quote(strip(file_name));
   outname= catx('.','work',scan(file_name,2,'.\','b'));
   call execute ("Proc import data=");
   call execute (fname);
   call execute ('out=');
   call execute (outname);
   call execute ('Replace;
     delimiter=",";
     getnames=Yes;
     guessingrows=max;
     run;');                 
run;

Replace 'yourdataset' with the name of the data set that you have the file names in.

This uses file_name as the name of the varaible in that set. Change that if not actually the case.

The FNAME variable is created because there isn't any guarantee that your existing variable will allow adding two quote characters and still fit in the existing length. If you paths are long enough that 200 characters isn't long enough increase the 200.

The OUTNAME variable has the name of the output SAS data set. This attempts to make one from the lat bit of the file_name. If you actually provide a data set name in the input data then that isn't needed. If you want a different library then replace 'work' with the name you want.

Call execute places SAS code into a stack to execute after the data set completes. The ; in side the quotes are needed to delimit the statments submitted and it is really a good idea to provide the run statment.

Notice the last bit of "boiler plate" code that does not depend on the filename is all in one quoted string.

If your CSV file actually uses a character other than a comma for the separtor replace the comma in the delimiter="," to the one actually used.

The "guessingrows = max" of the code examines the entire data set before setting properties for variables such as variable type, length or informat and is a good idea to use when available such as for delimited files to avoid truncation.

Note that if your "numeric" variables have some values like "NULL" 'NA' or 'Missing' as text in the file then the variable will be created as character and not numeric.

The Proc import will generate data steps to read each of the files. You may want to save the log in case you encounter cases such as the "Null" mentioned above. Then you could copy the code from the log file, paste into the editor and clean it up and change the INFORMAT to an appropriate to get the correct reading of variables.

Good luck.

If the files all had the same structure and especially if you had documentation then we could use a similar approach to create the Data step code to read the file directly just changing the input name and the output data set. Of if the objective was to have all the files combined into a single one then it is possible to read multiple CSV files at one time into a single data set. Which is why I asked for some details.

YKU · Posted 12-24-2019 03:23 PM

Hi Ballardw

Thanks for your help, I modified a few things to get it run:

1. call execute ("Proc import datafile=") --> somehow I have to use datafile instead data to run
2. outname= catx('.','work',scan(file_name,2,'./-','b')); ---> This is where I get error I guess

The file_name is very long and its names are identical in the last parts. So I keep getting error as "The value 31 is not a valid SAS name."

The sample of the original filename:

PPP_Subscribers_APPLICATE-ACE_SOMETable_2016-01-01_2016-12-31

PDC_Subscribers_CHOOACE-ACE_OTHERTable_2016-01-01_2016-12-31

I tried other ways to change the output name but cannot find a valid way to do it.

Can you help me? Thanks

ballardw · Posted 12-24-2019 05:44 PM

@YKU wrote:

Hi Ballardw

Thanks for your help, I modified a few things to get it run:

1. call execute ("Proc import datafile=") --> somehow I have to use datafile instead data to run
2. outname= catx('.','work',scan(file_name,2,'./-','b')); ---> This is where I get error I guess

The file_name is very long and its names are identical in the last parts. So I keep getting error as "The value 31 is not a valid SAS name."

The sample of the original filename:

PPP_Subscribers_APPLICATE-ACE_SOMETable_2016-01-01_2016-12-31

PDC_Subscribers_CHOOACE-ACE_OTHERTable_2016-01-01_2016-12-31

I tried other ways to change the output name but cannot find a valid way to do it.

Can you help me? Thanks

My mistake on the datafile , typos happen.

Remember that I suggested adding the desired SAS data set names to your input data file incase of long names. Since you did not provide an example of an actual input file I provided a solution that might work. With that sort of actual name you have I strongly suggest that you add a desired name to the data file.

SAS data sets have a hard limit of 32 characters at the set name level. The name of a data set must start with either a letter or the underscore , _ , character. The SCAN code I provided was to use the slashes to delimit the path elements and the period to delimit the CSV extension so we could get the last "word" between \ and . The 'b' modifier is to parse backwards, or from the end. But YOU added a - which gets the LAST - resulting with a word, 31 in the example, that is not a valid SAS data set name.

I suggest that you manually create a name for each of your data sets.

I am not sure what the best way would be. How did you get the data set with the files to begin with?

Easiest may be to make a new dataset with the desired data set names and merge them with the original file names. Here is an example:

data work.filenames;
   input file_name :$40.;
datalines;
c:\folder\subfolder\name.csv
c:\folder\subfolder\othername.csv
;

data work.datasetname;
   input dsname :$32.;
datalines;
Firstdataset
Secondsdataset
;


data work.file_datasets;
   merge work.filenames
         work.datasetname
   ;
run;

You could add the library name to the dsname values using values like libname.setname. Make the dsname varaiable 41 using the input statement and the $41. informat.

The : in the input format helps to avoid, in this case, continuing to read from the next line in case your actual value is shorter.

The MERGE done this way matches the first line of the first data set with the second. So you would want to make sure the order matches.

Then use the work.file_datasets or whatever you create as input to the data _null_. Use the created the DSNAME value instead of the Scan code reading using the file_name variable.

Alternatively be unselective about taking the first 32 characters but if those are not unique then you will overwrite one or more data sets.

This demonstrates a way to get the first 32 characters and prefix with a library.

data work.junk;
  file_name="c:\folder\PPP_Subscribers_APPLICATE-ACE_SOMETable_2016-01-01_2016-12-31.csv";
  length outname $ 41;
  outname= catx('.','work',substr (scan(file_name,2,'.\','b'),1,32));
run;

In this case if the name after the last \ and before the period is shorter than 32 this is not quite correct because the string is too short for the SUBSTR function. You would get an outname value that may be workable but there will be notes in the log about "Invalid third argument to function SUBSTR".

Again if these won't be unique then you have to provide the name or a much cleaner example of the CSV file names and rules for creating your desired dataset name.

YKU · Posted 12-26-2019 04:18 PM

Thanks a lot. The problem is solved. I really appreciate your help!!

Read Rows as loading address to generate data sets

Re: Read Rows as loading address to generate data sets

Re: Read Rows as loading address to generate data sets

Re: Read Rows as loading address to generate data sets

Re: Read Rows as loading address to generate data sets

Re: Read Rows as loading address to generate data sets

Re: Read Rows as loading address to generate data sets

Read Rows as loading address to generate data sets

Re: Read Rows as loading address to generate data sets

Re: Read Rows as loading address to generate data sets

Re: Read Rows as loading address to generate data sets

Re: Read Rows as loading address to generate data sets

Re: Read Rows as loading address to generate data sets

Re: Read Rows as loading address to generate data sets

Registration is open