Replace name of header line with string from number

Reply
New Contributor
Posts: 4

Replace name of header line with string from number

Hello. I am new to SAS (and programming in general) and I have been asked to write a script to extract and analyze data for one or more variables (columns) from hundreds of individual CSV files containing thousands of observations in each. The idea is to compile data from all of the .csv files into one so that can can alanyze it appropriately. The data is in the format where the first row contains a number that is accociated with a specific test. Because each test is listed with a result in one column and a unit in a second column, there are a large number of blank cells between each number associated with the test in the first row (see below for example):

Row 1:                                                                 11                        22

Row 2: Name          Date                    Comment     Result     Unit     Result     Unit

Row 3: Last, First    June 1, 2014        Words          1            mol/L   2             g/L

I have managed to get a workable script to import all of the .csv files using proc import, however I can't figure out how to replace the "11" and "22" with their appropriate test name. I think this is due to the fact that they are numerical values and therefore cannot be read in properly. The problem is that because some of the .csv files have the tests listed in different orders (i.e. 11 then 22 in some, 22 then 11 in others), I can't simply extract the data from the "var" that is associated with the column/variable upon import. Additionally, when I use the proc import function to get the data, and simply print that, the first digit in the numerical value is replaced by an "_" - listing the above example as:

Row 1: var1             var2                    var3              _1           var4     _2           var5

Row 2: Name          Date                    Comment     Result     Unit     Result     Unit

Row 3: Last, First    June 1, 2014        Words          1            mol/L   2             g/L

The script I have so far for importing the data is (in addition to this, I have been using the KEEP function to extract only the variables from the columns I know are the right ones (i.e. getnames=no; then in the data step KEEP var30; or var11 or whatever the var is associated with the data I want):

%let Path1 = "H:\Projects\QI\Raw Data\Earth\*.csv";

%macro Import;

    proc import datafile = &Path1

        dbms = csv

        out = testfile

        replace;

        getnames = no;

    run;

%mend Import;

%Import;

So my question is: is it possible to rename the values in the header row and then only extract those values that I want to analyze from each .csv file, and then deposit that information into a new file that I can then use for analysis (I've just been doing proc export to get the new file)?

Any help is greatly appreciated! Thank you.

Super User
Super User
Posts: 6,343

Re: Replace name of header line with string from number

Read it directly with a DATA step instead of PROC IMPORT.

How many rows are in the CSV files? Or are they limited to just three rows?

How many columns (or pairs of columns) are in the CSV files? Or does it vary?

If we assume that you do not know the number of columns and that the data in rows 4 to the end of the file is the same as row 3 then you could do something like this.

%let maxtest=10;

data want ;

  length name $50 date 8 labtest result 8 unit $20 comment $100 ;

  informat date anydtdte.;

  format date yymmdd10.;

  array test (&maxtest) _temporary_;

  infile 'c:\downloads\myfile.csv' dsd truncover ;

  if _n_=1 then do ;

    * skip first three columns;

    input name name name  @;

    do i=1 to &maxtest until(labtest=.);

       input labtest unit @ ;

       test(i)=labtest ;

    end;

    * read rest of line1 and line 2 ;

    input / ;

  end;

  input name date comment @;

  do i=1 to &maxtest;

    input result unit @;

    labtest=test(i);

    if result ne . or unit ne ' ' then output;

  end;

  input ;

run;

proc print;

run;

New Contributor
Posts: 4

Re: Replace name of header line with string from number

Thanks for the rapid response. I actually can't install SAS on my home computer (I have windows 8 and it appears incompatible with v. 9.3) but I will try this out tomorrow. Some notes on my dataset: they can be between 5 and 10,000 rows (depending on the number of tests done in that time frame), and the number of tests (columns) varys as well. If I'm reading what you have written correctly, I don't think this is what I need to do.

What I want is to extract the column data that is associated with the test number (i.e. the data in the column with the first row "11"), I don't really care about any of the other data just yet. I think the script you have written is taking the row data. Once I get the column names sorted out, I could use a similar script to extract that data as well. I am thinking more simplistic than what you have written. For example, if I could simply replace the number 11 in the first row with the test name I want (i.e. replace 11 with "Serum"), then I could just use the "keep" function to retain the info from that column and move on with analysis.

Thoughts?

Thanks

Respected Advisor
Posts: 3,834

Re: Replace name of header line with string from number

I believe the SAS University Edition is compatible with Windows 8 - and it's now available to all of us :-)

Super User
Super User
Posts: 6,343

Re: Replace name of header line with string from number

If you import the data with PROC IMPORT and tell it to start with row 3 and not try to find names then the variable names will be named F1,F2, etc.

Let's assume that you have done that successfully and the dataset is named HAVE.

Then you can read the first line of the CSV file to find the column with the test number in the first row.

You can then use that to rename the columns and tell SAS which columns to keep.

%let testnum=11;

%let testname=Serum;

data _null_ ;

  infile 'c:\downloads\myfile.csv' dsd truncover ;

  length dummy $200 labtest 8;

  input dummy dummy dummy  @;

  do i=4 by 2 until(labtest=&testnum);

    input labtest dummy @ ;

  end;

  call symputx('varname',cats('F',i));

  stop;

run;

data want ;

  set have ;

  rename F1=NAME F2=DATE F3=COMMENT &varname=&testname ;

  keep F1-F3 &varname ;

run;

New Contributor
Posts: 4

Re: Replace name of header line with string from number

Hello again,

I have tried your script to no avail. It seems the computer just hangs (which I can only assume is due to the DO loop). If I alter the DO loop to fit my dataset more appropriately (i.e. just put i=1 by 1; i=100 by 100; etc...) or the input variables, it changes nothing. I think the trouble is the "labtest" variable that is put in the line above following the "length" command. Where does the "dummy $200 labtest 8" come from? I'm not sure what the "dummy" should be in actuality, and I feel like the "input" line above is also not required. I stress that I don't need to keep the first three variables in the above format, I just need to extract one column of data. That is really my only question: how do I only extract the lab test results in the column associated with the 11? Keeping in mind sometimes the values associated with "11" appear in a different column.

A side note I make is that when I import the csv file and do not look for names (i.e. getnames = no), then it imports the variable names as "var1, var2", not F1, F2. I can only assume the part of the script you are writing above that includes the "symputx" function is supposed to do this, but that seems unnessecary, as I can extract something labelled var1 as easily as F1.

I can't help but feel like this is getting over-complicated. Is it not possible to simply replace a value in the first line with a name? When I imput: replace 11 = "Serum"  I get a syntax error, but there has to be some other way than by introducing other variable names. Is the problem because the header line is numerical?

I appologize for the delay, I have been asking around some to try to figure this out and nobody seems to have the answers.

Any input you have would be greatly appreciated, but if this doesn't work I'll have to look into a different way to extract this data.

Thanks for your time.

Super User
Super User
Posts: 6,343

Re: Replace name of header line with string from number

So the not stopping is because it did not find the value you were looking for.  If you set an upper bound on the number of columns that it looks for would stop it from getting into an infinite loop. 

do i=4 to 100 by 2 until(labtest=&testnum);

The rest of your questions make it seem that perhaps you need to post more examples so that we can see what the actual problem is. The code posted solves the problem of reading a file that has three columns of row label information (NAME, DATE, and COMMENT) and then has pairs of columns that represent labtest results and lab units.  The column headers for the lab results consisted of a number (like 11) and there was no column header for the lab units.  The problem as presented was that the required lab test (11 in the example) was not always the first test.  So the program reads the headers looking for the one that has the desired value (the NUMBER 11) and the using that information to allow you read the actual lab tests and use a valid variable name (like SERUM) for the name of the variable.  If instead the column header will be something like SERUM then read the value as a string instead of number.

If you can assume that the column header that you want to search for could never be the same as one of the first three column headers then your program could be much simpler to search for which column you want to keep.

data _null_ ;

  infile 'c:\downloads\myfile.csv' dsd truncover ;

  length word $200 ;

  do i=1 to 1000 until(word="11") ;

    input word @ ;

    if word="11" then call symputx('varname',cats('var',i));

  end;

  stop;

run;

Esteemed Advisor
Esteemed Advisor
Posts: 7,217

Re: Replace name of header line with string from number

Well, assuming your data is all the same, you could do something like this (gives a results table with source, numbers and decodes - best to view the code in a bigger window than this editor as it all gets a bit squashed):

filename dir_list pipe 'dir "s:\temp\rob\tmp" /b';
data dlist;  /* Read in list of files in given directory */
  length buffer $2000.;
  infile dir_list dsd missover lrecl=32767;
  input buffer $;
run;

proc sql;  /* Create empty results table */
  create table RESULTS
  (
    FROM char(200),
    NUM1 char(200),
    NUM2 char(200),
    DECODE char(200)
  );
quit;
data decodes;  /* Updated this table for as many decodes as you like */
  length code decode $200.;
  code="11"; decode="Something"; output;
  code="22"; decode="Another"; output;
run;

options noquotelenmax;
data _null_;  /* This step generates the code for each file in the dirlist */
  set dlist;
  call execute('filename myread pipe '||"'"||'type "s:\temp\rob\tmp\'||strip(buffer)||'"'||"';");
  call execute('data temp; length buffer $2000.; infile myread dsd missover lrecl=32767; input buffer $;run;');
  call execute('proc sql;  
                  insert into RESULTS
                  set FROM="'||strip(buffer)||'",
                      NUM1=(select scan(strip(substr(buffer,10,200)),1," ") from TEMP where substr(BUFFER,1,6)="Row 1:"),
                      NUM2=(select scan(strip(substr(buffer,10,200)),2," ") from TEMP where substr(BUFFER,1,6)="Row 1:"),
                      DECODE=catx(",",
                                    (select DECODE from DECODES where CODE=(select scan(strip(substr(buffer,10,200)),1," ") from TEMP where substr(BUFFER,1,6)="Row 1:")),
                                    (select DECODE from DECODES where CODE=(select scan(strip(substr(buffer,10,200)),2," ") from TEMP where substr(BUFFER,1,6)="Row 1:")));
                quit;');

run;
options quotelenmax;

Super User
Super User
Posts: 6,343

Re: Replace name of header line with string from number

You seem to want to just read in a single column from the CSV file. You have two problems, how to find which column to read and how to read just that column.  You would need to provide two pieces of information, the value to find in the first row that indicates the column you want and the name to assign to the variable that you are creating to store the value.  Then you could do both things in one data step.

So in this code there are two macro variables for you to provide the information needed.  You could easily convert this to a macro if you want.

HEADERSTR is the value that should appear in the first row to indicate which column you want to read.

VARNAME is the SAS variable name to use for the values in that column. 

FILENAME is the quoted physical file name of the file to read.

It assumes the HEADERSTR value will be located in the first row and that the real data starts on the third row.  I have hard coded a limit of 100 columns and maximum line size of 30,000 characters. 

%let headerstr=11 ;

%let varname=serum ;

%let filename='c:\downloads\myfile.csv' ;

data want ;

  infile &filename dsd truncover lrecl=30000 ;

  array col (100) $32 ;

  if _n_=1 then do ;

   * Read Header rows and locate column of interest ;

    input col1-col100 / ;

  n = whichc("&headerstr",of col(*) );

    retain n;

    if n=. then do ;

      put 'ERROR: Could not find header string' ;

      stop;

    end;

  end;

  input col1-col100 ;

  &varname = input(col(n),??32.);

  keep &varname ;

run;

New Contributor
Posts: 4

Re: Replace name of header line with string from number

Hi again,

I figured out how to get the data I wanted, so I figured I would pass along the solution. As I said before, this was a very simple problem made complicated.

proc import datafile = "C:\datafile.csv"

    dbms = csv

    out = GO

    replace;

    getnames = yes;

run;

data Extract_Serum (keep = Serum);

    set GO (firstobs = 2);

    rename 11 = Serum;

run;

Thanks for all your input!

Ask a Question
Discussion stats
  • 9 replies
  • 1135 views
  • 0 likes
  • 4 in conversation