Solved: Irregular format of csv files and combing large number of data

Job04 · Posted 02-27-2022 03:00 AM

A blood test was done on 120 patients. The output is in csv format and individually (120 spreadsheets). The output starts with few lines describing the procedure then gives table with values. They are 20 columns. I need only column1 (X) and column4 (Normal). Column1 has numeric values but actually these are the variables and they the same for all patients and in the same order. The real data has more than 200 variables (let’s say 200). When keeping only column1 and column4 the individual data is 200x2.

Column1 describes the peaks in ppm I want to change it to var001 to var200.

The few lines describing procedure are not needed. The final data I need to get should have dimension of 120 (patients) x 200 (variables).

I have these spreadsheets on a file and I usually use this code to import data in sas:

proc import datafile="C:\Users\hp\Desktop\data\T1.002.csv"
dbms=csv out=out;
run;

But this is a situation where all data needs to be imported and rearranged it together.

I generated same structure data for 2 patients.

Thank you so much

patient1:

Peak Pick / Integral Database,,,,,,,,,,,,,,,,,,,
Extracted from ANNOTATION on 15-FEB-2022 8:43:16,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,
Peak Pick / Integral Database,,,,,,,,,,,,,,,,,,,
Extracted from ANNOTATION on 15-FEB-2022 8:43:16,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,
X,X Fold,Class,Normal,Integral,Intensity,Slope,Offset,G/L Mix,Protons,Molarity,X Diam/J,Color,Spin X,Group X,Residue X,Seq1 X,Seq2 X,Seq3 X,Type X
17.3[ppm],0,MI,2.74775,0.0001,0,0.000[abn/ppm],0,0,1,2.25,0.017[ppm],2,0,,,0,0,0,
15.3[ppm],0,MI,6.5584,0.0002,0,0.000[abn/ppm],0,0,1,3.54,0.0015[ppm],2,0,,,0,0,0,
12[ppm],0,MI,44.64217,0.00003,0,0.000[abn/ppm],0,0,1,17.3,0.225[ppm],2,0,,,0,0,0,
10.49[ppm],0,MI,8.38014,0.00005,0,0.000[abn/ppm],0,0,1,2.6,0.040[ppm],2,0,,,0,0,0,
8.16[ppm],0,MI,8.48936,0.0006,0,0.000[abn/ppm],0,0,1,3.2,0.030[ppm],2,0,,,0,0,0,
5.34[ppm],0,MI,11.64487,0.00015,0,0.000[abn/ppm],0,0,1,7.7,0.030[ppm],2,0,,,0,0,0,

patient2:

Peak Pick / Integral Database,,,,,,,,,,,,,,,,,,,
Extracted from ANNOTATION on 15-FEB-2022 15:27:43,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,
Peak Pick / Integral Database,,,,,,,,,,,,,,,,,,,
Extracted from ANNOTATION on 15-FEB-2022 15:27:43,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,
X,X Fold,Class,Normal,Integral,Intensity,Slope,Offset,G/L Mix,Protons,Molarity,X Diam/J,Color,Spin X,Group X,Residue X,Seq1 X,Seq2 X,Seq3 X,Type X
17.3[ppm],0,MI,2.98273,0.0001,0,0.000[abn/ppm],0,0,1,2.25,0.017[ppm],2,0,,,0,0,0,
15.3[ppm],0,MI,6.73352,0.0002,0,0.000[abn/ppm],0,0,1,3.54,0.0015[ppm],2,0,,,0,0,0,
12[ppm],0,MI,48.3615,0.00003,0,0.000[abn/ppm],0,0,1,17.3,0.225[ppm],2,0,,,0,0,0,
10.49[ppm],0,MI,8.39814,0.00005,0,0.000[abn/ppm],0,0,1,2.6,0.040[ppm],2,0,,,0,0,0,
8.16[ppm],0,MI,8.19936,0.0006,0,0.000[abn/ppm],0,0,1,3.2,0.030[ppm],2,0,,,0,0,0,
5.34[ppm],0,MI,12.04487,0.00015,0,0.000[abn/ppm],0,0,1,7.7,0.030[ppm],2,0,,,0,0,0,

Kurt_Bremser · Posted 02-27-2022 06:39 AM

This reads all data into a single dataset:

data want;
length fname $200 pat_id $3;
retain filname;
infile "C:\Users\hp\Desktop\data\7-PROKIDN-*-T1*.csv" filename=fname dlm=",";
input @; /* this is there so that a change in fname can be detected */
if filname ne fname
then do i = 1 to 7; /* skips header when a new file is read */
  input;
end;
filname = fname;
input x :$10. x_fold class $ normal;
pat_id = substr(scan(fname,-1,"\"),11,3); /* extracts patient from filename */
drop x_fold class i filname;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

Kurt_Bremser · Posted 02-27-2022 04:02 AM

First of all, this is by far not a task for PROC IMPORT. Instead you will need to write a DATA step that handles everything in one step.

Do the filenames have a certain structure, so we can extract a patient id from themß
Does the data always start on the same line, or must we determine that from context?
Is the order of variables consistent, and do we always take columns 1 and 4?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Job04 · Posted 02-27-2022 04:26 AM

The name of the spreadsheet is long and consists of:

location-study's name-patient's number- time of measurement-date of measurement-buffer:

7-PROKIDN-001-T1-150222_buf12102020

7-PROKIDN-001-T1: This part is the same for all patients with exception of patient's number (

for patient 17: 7-PROKIDN-017-T1). Everything else can change because of date of measurement and type of buffer.

Patients are 120 but not in sequence numbering.

Tables has identical structure, all variables are in the same order and only column 1 and 4 are needed .

Kurt_Bremser · Posted 02-27-2022 06:39 AM

This reads all data into a single dataset:

data want;
length fname $200 pat_id $3;
retain filname;
infile "C:\Users\hp\Desktop\data\7-PROKIDN-*-T1*.csv" filename=fname dlm=",";
input @; /* this is there so that a change in fname can be detected */
if filname ne fname
then do i = 1 to 7; /* skips header when a new file is read */
  input;
end;
filname = fname;
input x :$10. x_fold class $ normal;
pat_id = substr(scan(fname,-1,"\"),11,3); /* extracts patient from filename */
drop x_fold class i filname;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Job04 · Posted 02-27-2022 08:31 AM

This imports all variables and patient's number and not the column "normal".

Many messages I get of this type:

NOTE: Invalid data for x_fold in line 8 14-16.

NOTE: Invalid data for normal in line 8 23-31.

RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9--

8 "8.472[ppm]","0","MI","3.58542","0.00113","0","0.000[abn/ppm]","0","0","1","3.58542","0.024[

93 ppm]","1","0","","","0","0","0","" 126

fname=C:\Users\hp\Desktop\data\7-PROKIDN-002-T1_141020_buf121020.csv pat_id=002

filname=C:\Users\hp\Desktop\data\7-PROKIDN -002-T1_141020_buf121020.csv i=8 x="8.472[ppm x_fold=.

class="MI" normal=. _ERROR_=1 _N_=1

NOTE: Invalid data for x_fold in line 9 14-16.

NOTE: Invalid data for normal in line 9 23-31.

9 "8.215[ppm]","0","MI","6.38263","0.00202","0","0.000[abn/ppm]","0","0","1","6.38263","0.050[

93 ppm]","1","0","","","0","0","0","" 126

fname=C:\Users\hp\Desktop\data\7-PROKIDN -002-T1_141020_buf121020.csv pat_id=002

filname=C:\Users\hp\Desktop\data\7-PROKIDN -002-T1_141020_buf121020.csv i=. x="8.215[ppm x_fold=.

class="MI" normal=. _ERROR_=1 _N_=2

Is it possible to make the variables horizontally?

Job04 · Posted 02-27-2022 08:57 AM

One more thing please.

the variables in x column appears in quotation ( "6.231[ppm]"). Is it possible to not include the quotation ?

Kurt_Bremser · Posted 02-27-2022 09:19 AM

You did not show us the real contents of the file. Add the DSD option to the INFILE statement.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Job04 · Posted 02-27-2022 09:25 AM

Thank you so much

ballardw · Posted 02-27-2022 02:55 PM

@Job04 wrote:

The name of the spreadsheet is long and consists of:

location-study's name-patient's number- time of measurement-date of measurement-buffer:

7-PROKIDN-001-T1-150222_buf12102020

An aside to the main topic: thinking of CSV files as "spreadsheets" can lead to all sorts of problems. CSV are simple text flat-files, spreadsheets are not. To make things potentially even worse, opening CSV in spreadsheet programs will cause the spreadsheet software to "interpret" values from the CSV and can display things that are quite different than the text. And if you SAVE the file will looking at it in a spreadsheet can permanently change values.

You mention one issue related to quotes around values, which do not appear in your example in the first post. I have to assume the pasted text is from looking at it in the spreadsheet. Dates and times a particularly in danger and some values, such as 5-15 (possibly a code entry or part number) can become a date value.

Also, using Proc Import with multiple files of the same structure can result in variables with different names, types and lengths making it difficult to combine data later. Each call to Proc Import can make different choices on examining the file. So different contents can lead to different results even though your desire is to have the variable names the same, lengths of character variables the same and variables as either numeric or character for the same column.

Irregular format of csv files and combing large number of data

Re: Irregular format of csv files and combing large number of data

Re: Irregular format of csv files and combing large number of data

Re: Irregular format of csv files and combing large number of data

Re: Irregular format of csv files and combing large number of data

Re: Irregular format of csv files and combing large number of data

Re: Irregular format of csv files and combing large number of data

Re: Irregular format of csv files and combing large number of data

Re: Irregular format of csv files and combing large number of data

Re: Irregular format of csv files and combing large number of data

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away