Solved: Import only selected variables from csv files and reference these vari...

USCSS_Nostromo · Posted 08-25-2018 07:57 AM

Hello,

In brief, I wish to do the following:

1.) import data into SAS from ca. 20 csv files with varying number of columns

2.) identify data sets for individual participants in these files using variable names which are their dates of participation and are known to me

3.) create a column called "ID" containing the participant ID, e.g., HAL101

4.) create a column called "Device" containing the csv file name, e.g., RJ Macready

5.) the individual files should have the name of the participant whose data they contain, e.g., HAL101, in the SAS library

Here is a more elaborate description of what I want to do.

These individual files do not have the same structure, with the exception of the first column of each file, which is called "Time" and contains clock time data (0:00 - 23:59; each minute of the 24-hour day. There are thus always 1440 rows per csv file). All columns besides this clock time column contain minute-to-minute numeric data and vary in number from file to file. These columns have dates as variable names. These dates follow this example format: m/dd/yyyy or mm/d/yyyy

I have attached a sample data set for your reference (RJ Macready.csv).

Altogether, these ca. 20 csv files contain data from ca. 100 individual participants, from who we have continuously collected data over the course of a one month period. Each csv file could contain data for several participants, however, since the data collection devices were reused throughout the course of the project.

The dates allow me to identify a data set for a given participant. For instance, participant HAL101 participated from 15/7/2017 to 16/8/2017.

I would like SAS to somehow reference the dates of a given participant and then create a file consisting only of the data collected on these dates.

The names of the individual csv files I have carry the name of the device used to collect the heart rate data, e.g., RJ Macready.csv. Because this information is relevent, I wish to include this file name as an observation in the data set of the participant.

In the end, I need for each participant, a file in the SAS library which has the following structure.

Time ID Device 7/15/2017 . . . 8/16/2017

0:00 HAL101 RJ Macready 25 . . . 53

0:01 HAL101 RJ Macready 41 . . . 62

0:02 HAL101 RJ Macready 31 . . . 43

etc.

The file name in the SAS library should have the name of the participant, e.g., HAL101

I have had success importing these data in a dumb and very labor-intensive way. Here is what I have done so far:

1.) In the excel spreadsheet, I have manually removed empty rows, added a column containing an abbreviation of the device (e.g., Mac), renamed all variables, such that they can be imported with the INFILE statement (e.g., 7/15/2017 becomes d_7_15_2017), and replaced any empty cells with a period ( . )

2.) I then copy all variables names into my INFILE statement and KEEP only those I need for the individual participant. It looks like this per participant:

Data HAL101;

infile

"C:\data\RJ Macready.csv"

dlm=',' firstobs=2 ;

input

Time$ device$ d_7_18_2017 d_7_19_2017 d_7_20_2017 d_7_21_2017 d_7_22_2017 d_7_23_2017 d_7_24_2017 d_7_25_2017 d_7_26_2017 d_7_27_2017 d_7_28_2017 d_7_29_2017 d_7_30_2017 d_7_31_2017 d_8_1_2017 d_8_2_2017 d_8_3_2017 d_8_4_2017 d_8_5_2017 d_8_6_2017 d_8_7_2017 d_8_8_2017 d_8_9_2017 d_8_10_2017 d_8_11_2017 d_8_12_2017 d_8_13_2017 d_8_14_2017 d_8_15_2017 d_8_16_2017 d_8_17_2017 d_8_18_2017 d_8_19_2017 d_8_20_2017 d_8_21_2017 d_8_22_2017 d_8_23_2017 d_8_24_2017 d_8_25_2017 d_8_26_2017 d_8_27_2017 d_8_28_2017 d_8_29_2017 d_8_30_2017 d_8_31_2017 d_9_1_2017 d_9_2_2017 d_9_3_2017 d_9_4_2017 d_9_5_2017 d_9_6_2017 d_9_7_2017 d_9_8_2017 d_9_9_2017 d_9_10_2017 d_9_11_2017 d_9_12_2017 d_9_13_2017 d_9_14_2017 d_9_15_2017 d_9_16_2017 d_9_17_2017 d_9_18_2017 d_9_19_2017 d_9_20_2017 d_9_21_2017 d_9_22_2017 d_9_23_2017 d_9_24_2017 d_9_25_2017 d_9_26_2017 d_9_27_2017 d_9_28_2017 d_9_29_2017 d_9_30_2017 d_10_1_2017 d_10_2_2017 d_10_3_2017 d_10_4_2017 d_10_5_2017 d_10_6_2017 d_10_7_2017 d_10_8_2017 d_10_9_2017 d_10_10_2017 d_10_11_2017 d_10_12_2017 d_10_13_2017 d_10_14_2017 d_10_15_2017 d_10_16_2017 d_10_17_2017 d_10_18_2017 d_10_19_2017 d_10_20_2017 d_10_21_2017 d_10_22_2017 d_10_23_2017 d_10_24_2017 d_10_25_2017 d_10_26_2017 d_10_27_2017 d_10_28_2017 d_10_29_2017 d_10_30_2017 d_10_31_2017 d_11_1_2017 d_11_2_2017 d_11_3_2017 d_11_4_2017 d_11_5_2017 d_11_6_2017 d_11_7_2017 d_11_8_2017 d_11_9_2017 d_11_10_2017 d_11_11_2017 d_11_12_2017 d_11_13_2017 d_11_14_2017 d_11_15_2017 d_11_16_2017 d_11_17_2017 d_11_18_2017 d_11_19_2017 d_11_20_2017 d_11_21_2017 d_11_22_2017 d_11_23_2017 d_11_24_2017 d_11_25_2017 d_11_26_2017 d_11_27_2017 d_11_28_2017 d_11_29_2017 d_11_30_2017 d_12_1_2017 d_12_2_2017 d_12_3_2017 d_12_4_2017 d_12_5_2017 d_12_6_2017 d_12_7_2017 d_12_8_2017 d_12_9_2017 d_12_10_2017 d_12_11_2017 d_12_12_2017 d_12_13_2017 d_12_14_2017 d_12_15_2017 d_12_16_2017 d_12_17_2017 d_12_18_2017 d_12_19_2017 d_12_20_2017 d_12_21_2017 d_12_22_2017 d_12_23_2017 d_12_24_2017 d_12_25_2017 d_12_26_2017 d_12_27_2017 d_12_28_2017 d_12_29_2017 d_12_30_2017 d_12_31_2017 d_1_1_2018 d_1_2_2018 d_1_3_2018 d_1_4_2018 d_1_5_2018 d_1_6_2018 d_1_7_2018 d_1_8_2018 d_1_9_2018 d_1_10_2018 d_1_11_2018 d_1_12_2018 d_1_13_2018 d_1_14_2018 d_1_15_2018 d_1_16_2018 d_1_17_2018 d_1_18_2018 d_1_19_2018 d_1_20_2018 d_1_21_2018 d_1_22_2018 d_1_23_2018 d_1_24_2018 d_1_25_2018 d_1_26_2018 d_1_27_2018 d_1_28_2018 d_1_29_2018 d_1_30_2018 d_1_31_2018 d_2_1_2018 d_2_2_2018 d_2_3_2018 d_2_4_2018 d_2_5_2018 d_2_6_2018 d_2_7_2018 d_2_8_2018 d_2_9_2018 d_2_10_2018 d_2_11_2018 d_2_12_2018 d_2_13_2018 d_2_14_2018 d_2_15_2018 d_2_16_2018 d_2_17_2018 d_2_18_2018 d_2_19_2018 d_2_20_2018 d_2_21_2018 d_2_22_2018 d_2_23_2018 d_2_24_2018 d_2_25_2018 d_2_26_2018 d_2_27_2018 d_2_28_2018 d_3_1_2018 d_3_2_2018 d_3_3_2018 d_3_4_2018 d_3_5_2018 d_3_6_2018 d_3_7_2018 d_3_8_2018 d_3_9_2018 d_3_10_2018 d_3_11_2018 d_3_12_2018 d_3_13_2018 d_3_14_2018 d_3_15_2018 d_3_16_2018 d_3_17_2018 d_3_18_2018 d_3_19_2018 d_3_20_2018 d_3_21_2018 d_3_22_2018 d_3_23_2018 d_3_24_2018 d_3_25_2018 d_3_26_2018 d_3_27_2018 d_3_28_2018 d_3_29_2018 d_3_30_2018 d_3_31_2018 d_4_1_2018 d_4_2_2018 d_4_3_2018 d_4_4_2018 d_4_5_2018 d_4_6_2018 d_4_7_2018 d_4_8_2018 d_4_9_2018 d_4_10_2018 d_4_11_2018 d_4_12_2018 d_4_13_2018 d_4_14_2018 d_4_15_2018 d_4_16_2018 d_4_17_2018 d_4_18_2018 d_4_19_2018 d_4_20_2018 d_4_21_2018 d_4_22_2018 d_4_23_2018 d_4_24_2018 d_4_25_2018 d_4_26_2018 d_4_27_2018 d_4_28_2018 d_4_29_2018 d_4_30_2018 d_5_1_2018 d_5_2_2018 d_5_3_2018 d_5_4_2018 d_5_5_2018 d_5_6_2018 d_5_7_2018 d_5_8_2018 d_5_9_2018 d_5_10_2018 d_5_11_2018 d_5_12_2018 d_5_13_2018 d_5_14_2018 d_5_15_2018 d_5_16_2018 d_5_17_2018 d_5_18_2018 d_5_19_2018 d_5_20_2018 d_5_21_2018 d_5_22_2018 d_5_23_2018 d_5_24_2018 d_5_25_2018 d_5_26_2018 d_5_27_2018 d_5_28_2018 d_5_29_2018

;If device = "Mac" then ID = "HAL101";

keep time device ID d_7_27_2017 d_7_28_2017 d_7_29_2017 d_7_30_2017 d_7_31_2017 d_8_1_2017 d_8_2_2017 d_8_3_2017 d_8_4_2017 d_8_5_2017 d_8_6_2017 d_8_7_2017 d_8_8_2017 d_8_9_2017 d_8_10_2017 d_8_11_2017 d_8_12_2017 d_8_13_2017 d_8_14_2017 d_8_15_2017 d_8_16_2017 d_8_17_2017 d_8_18_2017 d_8_19_2017 d_8_20_2017 d_8_21_2017 d_8_22_2017 d_8_23_2017 d_8_24_2017 d_8_25_2017 d_8_26_2017 d_8_27_2017 d_8_28_2017

;

run;

The above example works (note: it will not work on the file I have attached to this post. This must first be modified per step 1 above), but I would like to find a more elegant and less labor-intensive way to do what I need to do. I know SAS has the power to do it. I just don't know how to harness that power at the moment.

I would appreciate any help. I am prepared to answer any questions you may have.

I am running SAS 9.4

Tom · Posted 09-06-2018 09:46 AM

Change the index passed to SCAN() function. -1 one is the last word. Sounds like you want -2 instead.

View solution in original post

AMSAS · Posted 08-29-2018 09:01 AM

Hi,

Please take a look at the attached SAS code, I believe it will do what you need, or get you very close.

I used your supplied "RJ Macready.csv" file and created a similar copy of it "J Bloggs.csv" (it's a copy with fewer columns) for testing and it appears to be working as you require.

I used SAS/Marco to solve the problem and I have lightly documented the code. Hopefully, there's enough information that you can use this to figure out what is going on.

I was uncertain about how to generate the ID field, so the code is generating a simple ID<n> where <n> is a count of the input files.

You will need to change just a single line (see below) replace the directory path, with a directory path that contains all your source files and nothing else.

%loadData(directoryPath=C:\Users\MYuser\Downloads\temp) ;

Thanks

Andy

USCSS_Nostromo · Posted 09-03-2018 10:00 AM

Dear Andy,

I apologize for my late reply to your message. I would like to say thank you so much and that I really appreciate all your help.

Your code works.

I needed to make some small changes to my files so they can be imported, though, such as deleting the final empty column in each csv file to be imported. Apparently, if there is a column with a variable name but no content at the end of the data set, the file will not be imported. I must also manually spread the column widths in excel, such that they reveal the entire variable names and not the hashtags "#####" which excel displays when the width of a cell is not wide enough to reveal its entire contents. I found that to be a pretty unusual problem but one that is easy to solve.

Now that my data is all in SAS, I would like to somehow create a program that can cross reference a table containing each participant ID and the dates of their participation, such that SAS automatically creates individual data sets by participant by extracting only those dates (columns) of data for the one-month period of their participation. For instance, only dates 7/18/2018 to 8/20/2018 would be removed for participant HAL101 and a data set created for that participant which I could then combine with all other data sets for the other participants, whose data sets have been created in the same manner. It would be even better if the device name could be referenced and a column created for that as well. Each csv file to be imported is named after the device used to collect the data. To my knowledge, there is a way to retain the name of the original csv file name and use it to name the SAS file in the work library, but I am not sure how to implement it.

The following code shows how I change the structure of the data once imported by your code in SAS, i.e., I add a column with the device name and then a column with the participant number. It works on the file that your code creates after importing. I then merge and transpose each individual file and if necessary replace all the "999" data entries with ".", which indicate missing data. It gets me where I need to be, but it is not elegant and is still somewhat labor intensive, as I must do this for ca. 100 participants and ca. 15 devices. The example below would be code for one device / participant file.

data Macready;

set Id0015;

length

Device $3;

Device = "MAC"; output;

run;

/*The above code is copied ca. 15 times for the different devices used to collect data. */

data HAL101;

set Macready;

length

ID $6;

ID = "HAL101"; output;

keep Time device ID d_07272017 - d_08282017;

run;

/*There would be ca. 100 copies of the above code, varying only by the SET (as it would be possibly a different device), ID (as it would be a different participant), and the KEEP function (as each participant has a different range of dates that they participated and provided data to us). The ID is what ends up creating the need to write so many copies of the code, because there are ca. 100 participants. */

data AllDataTransposed(keep= data Time ID device orig_varname orig_obsnum);

set HAL:;

array myvars _numeric_;

do i = 1 to dim(myvars);

orig_varname = vname(myvars(i));

orig_obsnum = _n_;

data = myvars(i);

array myvar$ _;

output;

end;

run;

/*The above code transposes and merges all the data sets of the individual participants by the prefix "HAL" in the work library. */

data AllDataTransposed_trimmed;

set AllDataTransposed;

array Nums[*] _numeric_;

array Chars[*] _character_;

do i = 1 to dim(Nums);

if Nums[i] = 999 then Nums[i] = .;

end;

do i = 1 to dim(Chars);

Chars[i] = upcase(Chars[i]);

end;

drop i;

run;

data FinalDataSet;

set AllDataTransposed_trimmed;

if orig_varname = "TIME" then delete;

run;

/*These sections of code above remove the "999" and replace with "." and then remove the observation "TIME" which appears after transposing and is not needed. (There is already a variable/column called "Time" which is populated with time data.)

Instead of doing the above to select data sets by participant by their one-month participation period, I have tried to use a macro as follows to select only certain series of variables from the data instead of cross referencing a table, but it doesn't work for me yet. I am not sure it is the right way to go about doing this.

%macro selectVars(libname=work,datanm=Macready,range=d_07272017 - d_08282017,pattern=%,separateby=%str( ));

data tmp;

set &datanm.;

run;

%if "&range." ^=%str() %then %do;

data tmp;

set tmp;

keep &range.;

run;

%end;

%global lstVars;

%let lstVars=%str();*reset the macro variable lstVars;

proc sql noprint;

select name into :lstVars separated by "&separateby."

from dictionary.columns

where libname=upcase("&libname") and memname = upcase("tmp")

and name like "&pattern." escape '#';

drop table tmp;

quit;

%mend;

proc sql;

select &lstVars.

from Macready;

quit;

Attached is a pdf of where I found the above code.

If you have any ideas on how I could get the same outcome but with more efficiency, then please let me know. I am already grateful for all your help, however, so if not that is okay.

Sincerely,

Ian [USCSS_Nostromo]

Tom · Posted 09-03-2018 11:22 AM

Contrary to your claim in our post it looks to me like all of the file have the SAME structure. You are just not reading them in a way that makes that apparent.

It will be much easier to read the data into a tall skinny structure instead. Your variable names will not change and you will not have data (dates) stored in metadata (variable names).

Time Date HR
0:00 7/15/2017 25
...
0:00 8/16/2017 53
0:01 7/15/2017 41

You can use the FILENAME option on the INFILE statement to retrieve the ID from the filename.

You can read the dates from the fist line and store them into a temporary array so that you can later attach the right date to the right HR reading.

%let pathname=C:\data;
data tall ;
  length filename fname $256 ;
  infile "&pathname\*.csv" dsd truncover filename=fname ;
  array d (1000) _temporary_ ;
  length Time Date HR 8 ;
  informat time time. date mmddyy. ;
  format time time5. date yymmdd10. ;
  length dummy $8 ;
  input @;
  filename=scan(fname,-1,'./\') ;
  if filename ne lag(filename) then do ;
    input dummy @ ;
    do n=1 by 1 until (date=.);
      input date @;
      d(n)=date ;
    end;
    n=n-1;
    input;
  end;
  input time @;
  do i=1 to n ;
     input hr @;
     date=d(i);
     if hr not in (.,999) then output;
  end;
  retain n ;
  drop dummy i n ;
run;

Now you just need to create another file that has these FILENAME values and the date ranges that you want map to particular IDS.

data id_dates ;
  length id $10 filename $256 start stop 8;
  format start stop yymmdd10.;
  informat start stop yymmdd10.;
  infile cards dsd ;
  input id filename start stop;
cards;
HAL101,RJ Macready,2017-07-15,2017-08-16
;

Then just join the two lists.

proc sql noprint;
create table want as
select a.id,a.filename,b.date,b.time,b.hr
from id_dates a
inner join tall b
on a.filename = b.filename
and b.date between a.start and a.stop
order by a.id,a.filename,b.date,b.time
;
quit;

USCSS_Nostromo · Posted 09-06-2018 09:03 AM

Dear Tom,

Thank you so much for your reply to my post and for all your work.

I can't get the first part of the code to work.

I tested this code with the RJ Macready.csv file that I uploaded in my initial post on this thread. When I run the code, I receive an error message which I am posting below. I have narrowed the error message down to the following part of the code by systematically omitting sections of the do loop until I receive the error message:

input time @;

do i=1 to n ;

input hr @;

date=d(i);

if hr not in (.,999) then output;

end;

Here is the full error message including all relevant code:

483

484 %let pathname=P:\PTSD\18. Analysis\Test/* C:\data */;

485 data tall ;

486 length filename fname $256 ;

487 infile "&pathname\*.csv" dsd truncover filename=fname ;

488 array d (1000) _temporary_ ;

489 length Time Date HR 8 ;

490 informat time time. date mmddyy. ;

491 format time time5. date yymmdd10. ;

492 length dummy $8 ;

493 input @;

494 filename=scan(fname,-1,'./\') ;

495 if filename ne lag(filename) then do ;

496 input dummy @ ;

497 do n=1 by 1 until (date=.);

498 input date @;

499 d(n)=date ;

500 end;

501 n=n-1;

502 input;

503 end;

504 input time @;

505 do i=1 to n ;

506 input hr @;

507 date=d(i);

508 if hr not in (.,999) then output;

509 end;

510 run;

NOTE: The infile "P:\PTSD\18. Analysis\Test\*.csv" is:

Filename=P:\PTSD\18. Analysis\Test\RJ Macready.csv,

RECFM=V,LRECL=32767

ERROR: Invalid DO loop control information, either the INITIAL or TO expression is missing or the

BY expression is missing, zero, or invalid.

RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+---

3 0:00,999,999,999,999,999,999,999,999,999,999,83,100,62,66,69,74,73,70,82,75,67,78,61,77,

89 68,70,86,86,79,63,92,67,83,73,68,60,64,59,58,83,110,999,999,999,999,999,999,999,999,999,

177 999,999,999,999,999,999,999,999,999,999,999,999,999,999,999,999,999,999,999,999,999,999,

265 999,999,999,999,999,999,999,999,999,999,999,999,999,999,82,999,52,53,83,63,54,52,59,47,6

353 2,55,68,69,55,65,53,69,49,59,77,76,59,59,70,54,59,60,69,63,53,66,45,51,56,999,999,999,99

441 9,999,999,999,999,999,999,999,999,999,999,999,999,999,999,999,999,999,999,999,999,999,99

529 9,999,999,999,999,999,84,999,59,57,64,59,78,58,61,64,60,65,67,72,133,105,58,78,61,67,65,

617 63,62,61,58,999,999,999,999,999,999,999,999,999,999,999,999,999,999,999,999,999,999,999,

705 999,999,999,999,999,999,999,999,999,999,999,999,57,999,999,999,999,999,999,999,999,999,9

793 99,999,70,56,67,51,60,67,98,72,56,61,55,51,56,64,83,70,62,64,55,58,56,60,60,84,65,66,66,

881 70,70,70,70,70,70,70,999,999,999,999,999,999,999,114,103,63,70,59,115,78,76,76,65,63,64,

969 72,100,70,73,63,77,59,68,63,69,64,96,66,82,76,64,71,61,75,86,70,58,65,71,69,63,68,72,92,

RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+---

1057 67,85,69,67,73,74,77,75,63,69,61,64, 1092

filename=csv fname=P:\PTSD\18. Analysis\Test\RJ Macready.csv Time=0:00 Date=. HR=. dummy= n=. i=1

_ERROR_=1 _N_=2

NOTE: 3 records were read from the infile "P:\PTSD\18. Analysis\Test\*.csv".

The minimum record length was 315.

The maximum record length was 3156.

NOTE: The SAS System stopped processing this step because of errors.

WARNING: The data set WORK.TALL may be incomplete. When this step was stopped there were 0

observations and 7 variables.

WARNING: Data set WORK.TALL was not replaced because this step was stopped.

NOTE: DATA statement used (Total process time):

real time 0.03 seconds

cpu time 0.00 seconds

I would really appreciate any help you may have. I am trying to solve the problem myself. Thanks again for your help.

Sincerely,

Ian

Tom · Posted 09-06-2018 09:09 AM

You didn't include the RETAIN statement for the variable N. So after the first observation SAS will set it back to missing. That is what is causing the error you posted.

USCSS_Nostromo · Posted 09-06-2018 09:14 AM

Hello Tom,

This worked! Thank you!

Best,

Ian

USCSS_Nostromo · Posted 09-06-2018 09:22 AM

Hi Tom,

Sorry to be a pest, but when I import the data, the file names for all of the files imported are "csv". I need the file names. Do you have a solution for this? I suspect your original intent was to import the file names, e.g., "RJ Macready", and not the file type.

Thanks for any help!

Best,

Ian

Tom · Posted 09-06-2018 09:46 AM

Change the index passed to SCAN() function. -1 one is the last word. Sounds like you want -2 instead.

USCSS_Nostromo · Posted 09-06-2018 10:02 AM

Hello Tom,

This works.

Thank you so much for all your help and patience!

Best,

Ian

Import only selected variables from csv files and reference these variable to define participants

Re: Import only selected variables from csv files and reference these variable to define participant

Re: Import only selected variables from csv files and reference these variable to define participant

Re: Import only selected variables from csv files and reference these variable to define participant

Re: Import only selected variables from csv files and reference these variable to define participant

Re: Import only selected variables from csv files and reference these variable to define participant

Re: Import only selected variables from csv files and reference these variable to define participant

Re: Import only selected variables from csv files and reference these variable to define participant

Re: Import only selected variables from csv files and reference these variable to define participant

Re: Import only selected variables from csv files and reference these variable to define participant

Re: Import only selected variables from csv files and reference these variable to define participant

Register Today!