BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Job04
Quartz | Level 8

 

A blood  test was done on 120 patients. The output is in csv format and individually (120 spreadsheets). The output starts with few lines describing the procedure then gives  table with values.  They are 20 columns. I need only column1 (X) and column4 (Normal).  Column1 has numeric  values but actually these are the variables and they the same for all patients and in the same order. The real data has more than 200 variables (let’s say  200). When keeping only column1 and column4 the individual data is 200x2.

Column1 describes the peaks in ppm I want to change it  to var001 to var200.

 

The few lines describing procedure are not needed. The final data I need to get should have  dimension of  120 (patients) x 200 (variables).

 

I have these spreadsheets on a file and I usually use this code to import data in sas:

proc import datafile="C:\Users\hp\Desktop\data\T1.002.csv"
dbms=csv out=out;
run;

 

But this is a situation where all data needs to be imported and rearranged it together.

I generated same structure data for 2 patients.

 

Thank you so much

 

patient1:

 

Peak Pick / Integral Database,,,,,,,,,,,,,,,,,,,
Extracted from ANNOTATION on 15-FEB-2022 8:43:16,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,
Peak Pick / Integral Database,,,,,,,,,,,,,,,,,,,
Extracted from ANNOTATION on 15-FEB-2022 8:43:16,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,
X,X Fold,Class,Normal,Integral,Intensity,Slope,Offset,G/L Mix,Protons,Molarity,X Diam/J,Color,Spin X,Group X,Residue X,Seq1 X,Seq2 X,Seq3 X,Type X
17.3[ppm],0,MI,2.74775,0.0001,0,0.000[abn/ppm],0,0,1,2.25,0.017[ppm],2,0,,,0,0,0,
15.3[ppm],0,MI,6.5584,0.0002,0,0.000[abn/ppm],0,0,1,3.54,0.0015[ppm],2,0,,,0,0,0,
12[ppm],0,MI,44.64217,0.00003,0,0.000[abn/ppm],0,0,1,17.3,0.225[ppm],2,0,,,0,0,0,
10.49[ppm],0,MI,8.38014,0.00005,0,0.000[abn/ppm],0,0,1,2.6,0.040[ppm],2,0,,,0,0,0,
8.16[ppm],0,MI,8.48936,0.0006,0,0.000[abn/ppm],0,0,1,3.2,0.030[ppm],2,0,,,0,0,0,
5.34[ppm],0,MI,11.64487,0.00015,0,0.000[abn/ppm],0,0,1,7.7,0.030[ppm],2,0,,,0,0,0,

 

patient2:

 

Peak Pick / Integral Database,,,,,,,,,,,,,,,,,,,
Extracted from ANNOTATION on 15-FEB-2022 15:27:43,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,
Peak Pick / Integral Database,,,,,,,,,,,,,,,,,,,
Extracted from ANNOTATION on 15-FEB-2022 15:27:43,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,
X,X Fold,Class,Normal,Integral,Intensity,Slope,Offset,G/L Mix,Protons,Molarity,X Diam/J,Color,Spin X,Group X,Residue X,Seq1 X,Seq2 X,Seq3 X,Type X
17.3[ppm],0,MI,2.98273,0.0001,0,0.000[abn/ppm],0,0,1,2.25,0.017[ppm],2,0,,,0,0,0,
15.3[ppm],0,MI,6.73352,0.0002,0,0.000[abn/ppm],0,0,1,3.54,0.0015[ppm],2,0,,,0,0,0,
12[ppm],0,MI,48.3615,0.00003,0,0.000[abn/ppm],0,0,1,17.3,0.225[ppm],2,0,,,0,0,0,
10.49[ppm],0,MI,8.39814,0.00005,0,0.000[abn/ppm],0,0,1,2.6,0.040[ppm],2,0,,,0,0,0,
8.16[ppm],0,MI,8.19936,0.0006,0,0.000[abn/ppm],0,0,1,3.2,0.030[ppm],2,0,,,0,0,0,
5.34[ppm],0,MI,12.04487,0.00015,0,0.000[abn/ppm],0,0,1,7.7,0.030[ppm],2,0,,,0,0,0,

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

This reads all data into a single dataset:

data want;
length fname $200 pat_id $3;
retain filname;
infile "C:\Users\hp\Desktop\data\7-PROKIDN-*-T1*.csv" filename=fname dlm=",";
input @; /* this is there so that a change in fname can be detected */
if filname ne fname
then do i = 1 to 7; /* skips header when a new file is read */
  input;
end;
filname = fname;
input x :$10. x_fold class $ normal;
pat_id = substr(scan(fname,-1,"\"),11,3); /* extracts patient from filename */
drop x_fold class i filname;
run;

View solution in original post

8 REPLIES 8
Kurt_Bremser
Super User

First of all, this is by far not a task for PROC IMPORT. Instead you will need to write a DATA step that handles everything in one step.

  • Do the filenames have a certain structure, so we can extract a patient id from themß
  • Does the data always start on the same line, or must we determine that from context?
  • Is the order of variables consistent, and do we always take columns 1 and 4?
Job04
Quartz | Level 8

The name of the spreadsheet is long and consists of:

location-study's name-patient's number- time of measurement-date of measurement-buffer:

7-PROKIDN-001-T1-150222_buf12102020 

 

7-PROKIDN-001-T1: This part is the same for all patients with exception of patient's number (

for patient 17: 7-PROKIDN-017-T1). Everything else can change because of date of measurement and type of buffer. 

 

Patients are 120 but not in sequence numbering. 

Tables has identical structure, all variables are in the same order and only column 1 and 4 are needed . 

 

 

 

Kurt_Bremser
Super User

This reads all data into a single dataset:

data want;
length fname $200 pat_id $3;
retain filname;
infile "C:\Users\hp\Desktop\data\7-PROKIDN-*-T1*.csv" filename=fname dlm=",";
input @; /* this is there so that a change in fname can be detected */
if filname ne fname
then do i = 1 to 7; /* skips header when a new file is read */
  input;
end;
filname = fname;
input x :$10. x_fold class $ normal;
pat_id = substr(scan(fname,-1,"\"),11,3); /* extracts patient from filename */
drop x_fold class i filname;
run;
Job04
Quartz | Level 8

This imports  all variables and  patient's number  and not the column "normal". 

 

Many messages I get of this type:

 

NOTE: Invalid data for x_fold in line 8 14-16.

NOTE: Invalid data for normal in line 8 23-31.

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9--

8         "8.472[ppm]","0","MI","3.58542","0.00113","0","0.000[abn/ppm]","0","0","1","3.58542","0.024[

      93  ppm]","1","0","","","0","0","0","" 126

fname=C:\Users\hp\Desktop\data\7-PROKIDN-002-T1_141020_buf121020.csv pat_id=002

filname=C:\Users\hp\Desktop\data\7-PROKIDN -002-T1_141020_buf121020.csv i=8 x="8.472[ppm x_fold=.

class="MI" normal=. _ERROR_=1 _N_=1

NOTE: Invalid data for x_fold in line 9 14-16.

NOTE: Invalid data for normal in line 9 23-31.

9         "8.215[ppm]","0","MI","6.38263","0.00202","0","0.000[abn/ppm]","0","0","1","6.38263","0.050[

      93  ppm]","1","0","","","0","0","0","" 126

fname=C:\Users\hp\Desktop\data\7-PROKIDN -002-T1_141020_buf121020.csv pat_id=002

filname=C:\Users\hp\Desktop\data\7-PROKIDN -002-T1_141020_buf121020.csv i=. x="8.215[ppm x_fold=.

class="MI" normal=. _ERROR_=1 _N_=2

 

Is it possible to make the variables horizontally?

 

 

 

 

 

 

Job04
Quartz | Level 8

One more thing please. 

the variables in x column appears  in quotation ( "6.231[ppm]"). Is it possible to not include the quotation ?

ballardw
Super User

@Job04 wrote:

The name of the spreadsheet is long and consists of:

location-study's name-patient's number- time of measurement-date of measurement-buffer:

7-PROKIDN-001-T1-150222_buf12102020 

 


An aside to the main topic: thinking of CSV files as "spreadsheets" can lead to all sorts of problems. CSV are simple text flat-files, spreadsheets are not. To make things potentially even worse, opening CSV in spreadsheet programs will cause the spreadsheet software to "interpret" values from the CSV and can display things that are quite different than the text. And if you SAVE the file will looking at it in a spreadsheet can permanently change values.

 

You mention one issue related to quotes around values, which do not appear in your example in the first post. I have to assume the pasted text is from looking at it in the spreadsheet. Dates and times a particularly in danger and some values, such as 5-15 (possibly a code entry or part number) can become a date value.

 

Also, using Proc Import with multiple files of the same structure can result in variables with different names, types and lengths making it difficult to combine data later. Each call to Proc Import can make different choices on examining the file. So different contents can lead to different results even though your desire is to have the variable names the same, lengths of character variables the same and variables as either numeric or character for the same column.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 908 views
  • 1 like
  • 3 in conversation