SamMad
Obsidian | Level 7

Being a new user, I apologize if my questions are basic in nature. I tried to search the KB but couldn't find a solution that matches my requirements, so here it is:

I would like to import a text file, XLS, or another format. Here is how I would like to set up my import routines in SAS, if possible; please let me know whether SAS has a way to do it:

Step 1. I have a CSV file with 100 columns, and each column's format is defined (character, numeric, etc.). Let's call this file CSV100.

Step 2. Before I import the file into SAS format, I would like to create an empty SAS table (let's call it SAS105) with predefined columns based on my 100-column CSV input file, with appropriate field lengths and informats, plus five more columns. I would like to populate these five extra columns during the import process, or after importing the CSV100 file into the SAS105 data set.

Step 3. Import the CSV file and map the input fields from CSV100 to the SAS105 table template created in Step 2. The last five columns will be empty at first; they will be calculated either during the import or by update statements run after the CSV100 import.

Step 4. I would like to add data steps somewhere in the import process where I can calculate/update the five extra columns added in SAS105 (columns 101-105). These data steps will populate the five columns based on links to other SAS tables, or on the 100 columns imported from the CSV file.

Step 5. I would like the SAS105 data set to be updated continuously on a monthly basis, adding more and more data and recalculating those extra columns every time a new CSV100 file is appended/imported into the SAS105 data set.

Step 6. Every time new data is appended to the SAS105 data set, I would like a query to run and refresh the reporting based on that data set.

 

Can all this be accomplished in SAS? I may not have explained it correctly in SAS lingo, and I will be happy to discuss further and provide more information/clarification on the data import routines described above.

 

Thank you all in advance for reading this long post and helping me out. It is urgently needed!!!

SamMad

 

 

6 REPLIES
ballardw
Super User

Data step plus proc append.

 

There is no reason to make a separate empty template data set in SAS.

Read your first CSV as desired with a data step. If you know the order of your variables and their types, then it is pretty simple to write the data step. Suppose I have 5 variables to read: the first should be character with a maximum length of 15 characters, the second is a DATE value that will be in mm/dd/yyyy form in the CSV (m=month, d=day of month, y=year), and then 3 numeric values. An example data step to read that could be:

Data want;
   infile "<path to csv>\filename.csv" dlm=',' lrecl=32000 dsd ;
   informat 
      firsttext $15.
      date      mmddyy10.
      num1 num2 num3 best12.
    ;
   format date mmddyy10. ;
   input 
      firsttext
      date
      num1 
      num2
      num3
   ;
   /* ANY calculation involving single rows of values can go here */
   sum1_2 = Sum(num1, num2);
   /* suggest adding LABEL statement to describe variables, provide
      nice headings in output */
   label
      firsttext = "My first column value"
      date      = "Date of record"
      num1      = "Count of something"
      num2      = "Count of something else"
      sum1_2    = "Sum of two variables"
   ;
run;

A properly written data step sets the properties for the values. If you want to import multiple files, then set the properties, such as the informat, to handle the longest value you expect to encounter.

If columns are not going to be consistent, such as "dates" in different layouts, or some including times and others not, then you run into the realm of having to parse the data to read it properly.

 

 

Proc append base=SAS105 data=want; /* or call it CSV100 or whatever for the append */
run;

 

The first time PROC APPEND runs, the output data set will be created. Afterwards the data will be added to it.

 

 

Note: Any attempt to do this by IMPORTING XLSX or similar files is almost certainly doomed to failure in the long run, because the IMPORT procedure cannot control the properties of anything: variables will tend to change lengths and types, and, with some typical Excel users, even the variable names.

 

Suggest creating a permanent library and placing SAS105, or whatever you call it, in that library. You have a choice of reading the CSV into a temporary data set in the WORK library and then appending, or of reading it directly into the permanent library.
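For example, the permanent library plus the append might be set up like this (a sketch only; the path, the library name MYLIB, and the data set names are illustrative):

```sas
/* Sketch: the path, MYLIB, and the data set names are examples */
libname mylib "C:\myproject\sasdata";      /* permanent library, survives the session */

data work.csv100;                          /* temporary; rebuilt for each new file */
   infile "C:\myproject\incoming\csv100.csv" dlm=',' dsd truncover;
   /* informat / format / input statements as shown above */
run;

proc append base=mylib.sas105 data=work.csv100 force;
run;
```

The FORCE option lets the append proceed even if a variable's attributes differ slightly between the new file and the base data set, though it is better to keep the reading data step identical from month to month.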

 

If your added columns involve something where the value on a row comes from multiple rows then you will need to provide a lot more detail about the specific calculations you expect to do.

SamMad
Obsidian | Level 7
Thanks ballardw.

So based on your response above, I would like to share more details with you on the file I am importing.
But where will this data be imported to? Can I specify, in the sample code you provided above, the destination of the SAS data set? I don't want it to go to the WORK location, as that is a temp folder and it gets deleted by the admins.
Also, my CSV file in this case doesn't have a header, but I do have the names and lengths/formats of all the columns in a separate document and will use that to import it. Is there a command I need to write for SAS to import the entire data set without skipping the row at the top?
My CSV file has 1247 columns, and I will only be importing 141 of those. Do I need to define all 1247 columns in the SAS code example you gave above and then write more code to pick and choose the 141 to import?

Thanks again for your help!
Tom
Super User

When you read from a delimited file (like a CSV file) you have to read each field to figure out where it is located on the line. So while you don't have to read all 1247 fields, you do need to read up to the last one you want to keep. If there are some in the middle you want to skip, you can drop those variables so they are not written to the data set you are creating. You can even read all of the fields you are skipping into the same variable, so there is just one variable you need to drop. For example, if your file has 20 fields but you only want the first, second, and eleventh, your code might look like this:

data want;
  infile 'myfile.csv' dsd truncover ;
  length dummy $1 ;
  /* fields 3 through 10 are each read into DUMMY, which is then dropped */
  input var1 var2 dummy dummy dummy dummy dummy dummy dummy dummy var11 ;
  drop dummy;
run;

The FIRSTOBS= option on the INFILE statement does exactly what it sounds like: it tells the INFILE statement which line to start reading from. If your file does not have a header record, you can code FIRSTOBS=1, or just leave the option off, since the default is to start reading from the first line.
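For a headerless file, that means the INFILE statement needs nothing special (a sketch; the file and variable names are illustrative):

```sas
data want;
  /* FIRSTOBS=1 is the default, so it could be omitted entirely */
  infile 'myfile.csv' dsd truncover firstobs=1;
  input var1 :$10. var2 var3;
run;
```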

Tom
Super User

Sounds like you will get a series of incremental text files that you want to read and append to a main data set. So write a data step that reads one new file and performs any calculations, then add a step to append the new file to the main data set.

data new ;
  infile 'newfile.csv' .... ;
  ....
run;
proc append data=new base=mylib.myds force;
run;

There is no need to create the data set in advance. Just write a normal data step and the data set will be created with the same structure every time. If the derived variables are simple calculations based on the values in the current observation, then just calculate them in the data step that reads the text file. If the calculations are more complex, then you can use multiple steps to go from the source to the format needed to match the main data set.

For example your data step might be as simple as this:

data new;
  infile 'newfile.csv' dsd firstobs=2 truncover;
  input id :$20. var1-var100;
  calc1 = var2/var5;
  calc2 = mean(of var6-var20);
run;
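To repeat this for each monthly extract, the read and append steps can be wrapped in a small macro (a sketch; the macro name, parameter, and library name are illustrative, and FIRSTOBS=2 assumes a header row as in the step above):

```sas
%macro load_month(csvfile);
  /* read one incremental file */
  data new;
    infile "&csvfile" dsd firstobs=2 truncover;
    input id :$20. var1-var100;
    calc1 = var2/var5;
    calc2 = mean(of var6-var20);
  run;
  /* add it to the main data set */
  proc append base=mylib.myds data=new force;
  run;
%mend load_month;

%load_month(newfile_2020_01.csv)
```

Any reporting query that runs against mylib.myds after the macro call will then see the newly appended rows.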

 

 

SamMad
Obsidian | Level 7
OK so here is the scenario of the data I am importing.
2020,33860,01,001,020111,0,T,N,DM,59162,46523,122.93,65700,1948,508,697,1008,940
2020,33860,01,001,020111,0,T,N,DM,59162,46523,122.93,65700,1948,508,697,1008,940
2020,33860,01,001,020111,0,T,N,DM,59162,46523,122.93,65700,1948,508,697,1008,940
2020,33860,01,001,020111,0,T,N,DM,59162,46523,122.93,65700,1948,508,697,1008,940
2020,33860,01,001,020111,0,T,N,DM,59162,46523,122.93,65700,1948,508,697,1008,940
The sample is a headerless CSV file.
Metadata for the file is as follows:
FieldName Length DataType Import_in_SAS
YEAR 4 AN Y
MSA 5 AN Y
STATE 2 AN Y
COUNTY 3 AN Y
TRACT 6 AN Y
CITY 1 AN Y
SMALL_COUNTY 1 AN N
TRACT_FLAG 1 AN Y
COUNTY_FLAG 1 AN N
URBAN_RURAL 1 AN Y
INCOME_MSA 8 NUM Y
INCOME_FAMILY 8 NUM Y
INCOME_TRACT 8 (DECIMAL 6,2) Y
INCOME_FFIEC 8 NUM Y
TOTAL_PERSONS 8 NUM Y
TOTAL_FAMILIES 8 NUM Y
TOTAL_HHLDS 8 NUM Y
TOTAL_FEMALES 8 NUM Y
TOTAL_MALE 8 NUM N

In the above data set and its metadata, I do not want to import the columns marked N when I import the CSV file into a SAS data set. There are a total of 19 fields in the CSV file; I want to name the columns as given in the metadata above (names, lengths, data types) and skip importing the columns below:
FieldName Length DataType Import_in_SAS
SMALL_COUNTY 1 AN N
COUNTY_FLAG 1 AN N
TOTAL_MALE 8 NUM N

Believe me, I tried my best but I couldn't import the CSV file. The CSV file itself has no headers, so I provided the field names, lengths, and data types above.
The above sample of data is a subset of a very large file, 1247 columns by 76,000 rows. If this can be imported by your code on your machine, then I can take a look and compare with what I am doing wrong on the larger file. I have 4 files like that that I want to import and keep appending to the same SAS data set.
Thanks,
SamMad
Tom
Super User

Changing the language you use might help a little in understanding how to approach this problem. You are not "importing" the file; you just need to read the file. Save the word "import" for moving data from a structured file/system into a SAS data set. The PROC IMPORT procedure has to guess how to read data from a CSV file: the only metadata a CSV file has is the optional header row, and it sounds like your file does not even have that.

 

Since you have the information needed to define the variables you can simply use that to define the variables. So take your table of metadata and put it into a SAS dataset so you have something to code with.

data metadata;
  infile cards dsd dlm='|';
  input FieldName :$32. Length DataType :$20. Import_in_SAS :$1.;
cards;
YEAR|4|AN|Y
MSA|5|AN|Y
STATE|2|AN|Y
COUNTY|3|AN|Y
TRACT|6|AN|Y
CITY|1|AN|Y
SMALL_COUNTY|1|AN|N
TRACT_FLAG|1|AN|Y
COUNTY_FLAG|1|AN|N
URBAN_RURAL|1|AN|Y
INCOME_MSA|8|NUM|Y
INCOME_FAMILY|8|NUM|Y
INCOME_TRACT|8|(DECIMAL 6,2)|Y
INCOME_FFIEC|8|NUM|Y
TOTAL_PERSONS|8|NUM|Y
TOTAL_FAMILIES|8|NUM|Y
TOTAL_HHLDS|8|NUM|Y
TOTAL_FEMALES|8|NUM|Y
TOTAL_MALE|8|NUM|N
;

So the first thing you need to do is convert that metadata into something useful for representing a SAS dataset. SAS has just two data types: floating point numbers and fixed length character strings. So your 'NUM' and '(DECIMAL 6,2)' datatypes will probably be mapped to numbers. Let's assume your 'AN' datatype should be character.

In SAS the LENGTH of a variable defines how many bytes it uses when stored in the dataset. But normally, for metadata about a text file, you want to know the number of bytes the value will take in the text file, which is more like the width you might need for a format to display the data as text, or for an informat to read that text into data. It is not clear what LENGTH in your metadata represents, as the values are compatible with either meaning. In SAS you want numeric variables to use LENGTH=8, since it takes 8 bytes to store the floating point values SAS uses for numbers.

SAS variable names contain only letters, digits and underscores, are between 1 and 32 bytes long, and cannot start with a digit. It looks like your example FIELDNAME values will work.

You don't have any information on what informat or format you might need attached to the variables. Normally you only need those for date, time and datetime values, but you might want to use the F6.2 format for your '(DECIMAL 6,2)' data type so that it prints with only two decimal places.
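Under those assumptions, a hand-written data step for this 19-column file would look like the following sketch (the file name is illustrative; the N-flagged columns are read but dropped):

```sas
data want;
  infile 'csv100.csv' dlm=',' dsd truncover;  /* headerless, so FIRSTOBS defaults to 1 */
  attrib
    YEAR           length=$4
    MSA            length=$5
    STATE          length=$2
    COUNTY         length=$3
    TRACT          length=$6
    CITY           length=$1
    SMALL_COUNTY   length=$1
    TRACT_FLAG     length=$1
    COUNTY_FLAG    length=$1
    URBAN_RURAL    length=$1
    INCOME_MSA     length=8
    INCOME_FAMILY  length=8
    INCOME_TRACT   length=8 format=F6.2
    INCOME_FFIEC   length=8
    TOTAL_PERSONS  length=8
    TOTAL_FAMILIES length=8
    TOTAL_HHLDS    length=8
    TOTAL_FEMALES  length=8
    TOTAL_MALE     length=8
  ;
  input YEAR -- TOTAL_MALE;               /* read every field in file order */
  drop SMALL_COUNTY COUNTY_FLAG TOTAL_MALE;
run;
```

Writing that out by hand for 19 columns is fine, but for 1247 columns it is worth generating the statements from the metadata, as below.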

 

Your metadata does not have anything to indicate the order of the variables, but we can assume they are in the same order as they were displayed in your listing. In more complete metadata you might need two order variables: one for the order you want the variables defined in your SAS dataset, and one for the order they appear in the text file you want to read. Let's assume you want to use the same order for both.

 

So take the metadata and use it to generate code to define the variables. For example, this code generates a series of ATTRIB statements to define the variables, then an INPUT statement to read all of the variables, then a KEEP statement to keep only the variables you want. You can then use %INCLUDE to add those to a data step where you specify the data set you want to create and the file you want to read.

filename code temp;
data _null_;
  file code ;
* Define the variables using ATTRIB statement ;
  do varnum=1 to nobs;
    set metadata point=varnum nobs=nobs;
    put '  attrib ' fieldname 'length=' @;
    if datatype='AN' then put '$' length @;
    else put '8 ' @;
    if datatype=:'(DECIMAL' then do;
      format=cats('F',tranwrd(scan(datatype,2,' )'),',','.'));
      put format= @;
    end;
    put ';';
  end;
* write the input statement ;
  put '  input ' @;
  varnum=1;
  set metadata point=varnum;
  put fieldname '-- ' @;
  varnum=nobs;
  set metadata point=varnum;
  put fieldname ';' ;
* Write KEEP statement ;
  put '  keep ' @;
  do varnum=1 to nobs;
    set metadata point=varnum nobs=nobs;
    if upcase(import_in_sas)='Y' then put fieldname @ ;
  end;
  put ';' ;
* Need an explicit stop since step will not read past input ;
  stop;
run;

data want;
  infile 'myfile.csv' dsd truncover ;
%include code / source2;
run;

So the code might look like this:

691   data want;
692     infile 'myfile.csv' dsd truncover ;
693   %include code / source2;
NOTE: %INCLUDE (level 1) file CODE is file
      ...\#LN00072.
694  +  attrib YEAR length=$4 ;
695  +  attrib MSA length=$5 ;
696  +  attrib STATE length=$2 ;
697  +  attrib COUNTY length=$3 ;
698  +  attrib TRACT length=$6 ;
699  +  attrib CITY length=$1 ;
700  +  attrib SMALL_COUNTY length=$1 ;
701  +  attrib TRACT_FLAG length=$1 ;
702  +  attrib COUNTY_FLAG length=$1 ;
703  +  attrib URBAN_RURAL length=$1 ;
704  +  attrib INCOME_MSA length=8 ;
705  +  attrib INCOME_FAMILY length=8 ;
706  +  attrib INCOME_TRACT length=8 format=F6.2 ;
707  +  attrib INCOME_FFIEC length=8 ;
708  +  attrib TOTAL_PERSONS length=8 ;
709  +  attrib TOTAL_FAMILIES length=8 ;
710  +  attrib TOTAL_HHLDS length=8 ;
711  +  attrib TOTAL_FEMALES length=8 ;
712  +  attrib TOTAL_MALE length=8 ;
713  +  input YEAR -- TOTAL_MALE ;
714  +  keep YEAR MSA STATE COUNTY TRACT CITY TRACT_FLAG URBAN_RURAL
714 !+INCOME_MSA INCOME_FAMILY INCOME_TRACT INCOME_FFIEC TOTAL_PERSONS
714 !+TOTAL_FAMILIES TOTAL_HHLDS TOTAL_FEMALES ;
NOTE: %INCLUDE (level 1) ending.
715   run;

 

