I have 80 patients measured on 200 variables. The measurements were done using analytical chemistry equipment. Each patient has their own spreadsheet with the variable names and the corresponding values. Variables are in rows. The spreadsheets are separate CSV files (80 x 2).
I need to transform all the spreadsheets into a single dataset with the 80 patients in rows and the variables in columns (80 x 200). This starts with importing the 80 sheets.
I would appreciate it if you could help me with the coding of this.
Thanks
Assuming all files are in a single folder, something like this should work, untested:
%let path2files = C:\Users\hp\Desktop\data;
data import_all;
*make sure variables to store file name are long enough;
length filename txt_file_name $256;
*keep file name from record to record;
retain txt_file_name;
*Use wildcard in the infile statement;
infile "&path2files\*.csv" dsd eov=eov filename=filename truncover;
*Input first record and hold line;
input@;
*Check if this is the first record or the first record in a new file;
*If it is, replace the filename with the new file name and move to next line;
if _n_ eq 1 or eov then do;
txt_file_name = scan(filename, -1, "\");
eov=0;delete;
end;
*Otherwise go to the import step and read the files;
else input col1 variable value;
format col1 z3.;
run;
proc sort data=import_all;
by txt_file_name col1;
run;
proc transpose data=import_all out=want prefix=VAR_;
by txt_file_name;
id col1;
var value;
run;
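If the wildcard read works, a couple of quick checks on the transposed table (the WANT dataset from the step above) should show one VAR_ column per col1 value and one row per file; a minimal, untested sketch:
proc contents data=want varnum;
*expect one VAR_ column per value of col1;
run;
proc print data=want (obs=3);
*expect one row per input file;
run;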
@Job04 wrote:
proc import datafile="C:\Users\hp\Desktop\data\T1.002.csv"
dbms=csv out=out;
run;
Please provide a couple of sample CSV files.
And ... do the 80 csv files have identical structure? Do they all have the same 200 rows/variables in the same order?
The same number of measurements and same variables for all patients and in the same order. The variables are numeric because they are the peaks in the spectrum. I want to change these variables to Var001,..., Var200.
The analytical equipment works with the JEOL format, so when converting to CSV the column col1 showed up, but it has no relevance. The data for 2 patients looks like this:
Patient1:
col1,variable,value
1,277 .5,28.3
2,276 .5,28
3,275 .5,27.7
4,274 .5,27.4
5,273 .5,27.1
6,272 .5,26.8
7,271 .5,26.5
8,270 .5,26.2
9,269 .5,25.9
10,268 .5,25.6
11,267 .5,25.3
12,266 .5,25
13,265 .5,24.7
14,264 .5,24.4
15,263 .5,24.1
16,262 .5,23.8
17,261 .5,23.5
18,260 .5,23.2
19,259 .5,22.9
20,258 .5,22.6
21,257 .5,22.3
23,256 .5,22
24,255 .5,21.7
25,254 .5,21.4
26,253 .5,21.1
27,252 .5,20.8
28,251 .5,20.5
29,250 .5,20.2
30,249 .5,19.9
31,248 .5,19.6
32,247 .5,19.3
33,246 .5,19
34,245 .5,18.7
35,244 .5,18.4
36,243 .5,18.1
37,242 .5,17.8
38,241 .5,17.5
39,240 .5,17.2
40,239 .5,16.9
41,238 .5,16.6
42,237 .5,16.3
43,236 .5,16
44,235 .5,15.7
45,234 .5,15.4
46,233 .5,15.1
47,232 .5,14.8
48,231 .5,14.5
49,230 .5,14.2
50,229 .5,13.9
51,228 .5,13.6
52,227 .5,13.3
53,226 .5,13
54,225 .5,12.7
55,224 .5,12.4
56,223 .5,12.1
57,222 .5,11.8
58,221 .5,11.5
59,220 .5,11.2
60,219 .5,10.9
61,218 .5,10.6
62,217 .5,10.3
63,216 .5,10
64,215 .5,9.7
65,214 .5,9.4
66,213 .5,9.1
67,212 .5,8.8
68,211 .5,8.5
69,210 .5,8.2
70,209 .5,7.9
71,208 .5,7.6
72,207 .5,7.3
73,206 .5,7
74,205 .5,6.7
75,204 .5,6.4
76,203 .5,6.1
77,202 .5,5.8
78,201 .5,5.5
79,200 .5,5.2
80,199 .5,4.9
81,198 .5,4.6
82,197 .5,4.3
83,196 .5,4
84,195 .5,3.7
85,194 .5,3.4
86,193 .5,3.1
87,192 .5,2.8
88,191 .5,2.5
89,190 .5,2.2
90,189 .5,1.9
91,188 .5,1.6
92,187 .5,1.3
93,186 .5,1
94,185 .5,0.7
95,184 .5,0.4
96,183 .5,0.1
97,182 .5,-0.2
98,181 .5,-0.5
99,180 .5,-0.8
100,179 .5,-1.1
101,178 .5,-1.4
102,177 .5,-1.7
103,176 .5,-2
104,175 .5,-2.3
105,174 .5,-2.6
106,173 .5,-2.9
107,172 .5,-3.2
108,171 .5,-3.5
109,170 .5,-3.8
110,169 .5,-4.1
111,168 .5,-4.4
112,167 .5,-4.7
113,166 .5,-5
114,165 .5,-5.3
115,164 .5,-5.6
116,163 .5,-5.9
117,162 .5,-6.2
118,161 .5,-6.5
119,160 .5,-6.8
120,159 .5,-7.1
121,158 .5,-7.4
122,157 .5,-7.7
123,156 .5,-8
124,155 .5,-8.3
125,154 .5,-8.6
126,153 .5,-8.9
127,152 .5,-9.2
128,151 .5,-9.5
129,150 .5,-9.8
130,149 .5,-10.1
131,148 .5,-10.4
132,147 .5,-10.7
133,146 .5,-11
134,145 .5,-11.3
135,144 .5,-11.6
136,143 .5,-11.9
137,142 .5,-12.2
138,141 .5,-12.5
139,140 .5,-12.8
140,139 .5,-13.1
141,138 .5,-13.4
142,137 .5,-13.7
143,136 .5,-14
144,135 .5,-14.3
145,134 .5,-14.6
146,133 .5,-14.9
147,132 .5,-15.2
148,131 .5,-15.5
149,130 .5,-15.8
150,129 .5,-16.1
151,128 .5,-16.4
152,127 .5,-16.7
153,126 .5,-17
154,125 .5,-17.3
155,124 .5,-17.6
156,123 .5,-17.9
157,122 .5,-18.2
158,121 .5,-18.5
159,120 .5,-18.8
160,119 .5,-19.1
161,118 .5,-19.4
162,117 .5,-19.7
163,116 .5,-20
164,115 .5,-20.3
165,114 .5,-20.6
166,113 .5,-20.9
167,112 .5,-21.2
168,111 .5,-21.5
169,110 .5,-21.8
170,109 .5,-22.1
171,108 .5,-22.4
172,107 .5,-22.7
173,106 .5,-23
174,105 .5,-23.3
175,104 .5,-23.6
176,103 .5,-23.9
177,102 .5,-24.2
178,101 .5,-24.5
179,100 .5,-24.8
180,99 .5,-25.1
181,98 .5,-25.4
182,97 .5,-25.7
183,96 .5,-26
184,95 .5,-26.3
185,94 .5,-26.6
186,93 .5,-26.9
187,92 .5,-27.2
188,91 .5,-27.5
189,90 .5,-27.8
190,89 .5,-28.1
191,88 .5,-28.4
192,87 .5,-28.7
193,86 .5,-29
194,85 .5,-29.3
195,84 .5,-29.6
196,83 .5,-29.9
197,82 .5,-30.2
198,81 .5,-30.5
199,80 .5,-30.8
200,79 .5,-31.1
201,78 .5,-31.4
Patient2:
col1,variable,value
1,277 .5,31.3
2,276 .5,31
3,275 .5,30.7
4,274 .5,30.4
5,273 .5,30.1
6,272 .5,29.8
7,271 .5,29.5
8,270 .5,29.2
9,269 .5,28.9
10,268 .5,28.6
11,267 .5,28.3
12,266 .5,28
13,265 .5,27.7
14,264 .5,27.4
15,263 .5,27.1
16,262 .5,26.8
17,261 .5,26.5
18,260 .5,26.2
19,259 .5,25.9
20,258 .5,25.6
21,257 .5,25.3
23,256 .5,25
24,255 .5,24.7
25,254 .5,24.4
26,253 .5,24.1
27,252 .5,23.8
28,251 .5,23.5
29,250 .5,23.2
30,249 .5,22.9
31,248 .5,22.6
32,247 .5,22.3
33,246 .5,22
34,245 .5,21.7
35,244 .5,21.4
36,243 .5,21.1
37,242 .5,20.8
38,241 .5,20.5
39,240 .5,20.2
40,239 .5,19.9
41,238 .5,19.6
42,237 .5,19.3
43,236 .5,19
44,235 .5,18.7
45,234 .5,18.4
46,233 .5,18.1
47,232 .5,17.8
48,231 .5,17.5
49,230 .5,17.2
50,229 .5,16.9
51,228 .5,16.6
52,227 .5,16.3
53,226 .5,16
54,225 .5,15.7
55,224 .5,15.4
56,223 .5,15.1
57,222 .5,14.8
58,221 .5,14.5
59,220 .5,14.2
60,219 .5,13.9
61,218 .5,13.6
62,217 .5,13.3
63,216 .5,13
64,215 .5,12.7
65,214 .5,12.4
66,213 .5,12.1
67,212 .5,11.8
68,211 .5,11.5
69,210 .5,11.2
70,209 .5,10.9
71,208 .5,10.6
72,207 .5,10.3
73,206 .5,10
74,205 .5,9.7
75,204 .5,9.4
76,203 .5,9.1
77,202 .5,8.8
78,201 .5,8.5
79,200 .5,8.2
80,199 .5,7.9
81,198 .5,7.6
82,197 .5,7.3
83,196 .5,7
84,195 .5,6.7
85,194 .5,6.4
86,193 .5,6.1
87,192 .5,5.8
88,191 .5,5.5
89,190 .5,5.2
90,189 .5,4.9
91,188 .5,4.6
92,187 .5,4.3
93,186 .5,4
94,185 .5,3.7
95,184 .5,3.4
96,183 .5,3.1
97,182 .5,2.8
98,181 .5,2.5
99,180 .5,2.2
100,179 .5,1.9
101,178 .5,1.6
102,177 .5,1.3
103,176 .5,1
104,175 .5,0.7
105,174 .5,0.4
106,173 .5,0.1
107,172 .5,-0.2
108,171 .5,-0.5
109,170 .5,-0.8
110,169 .5,-1.1
111,168 .5,-1.4
112,167 .5,-1.7
113,166 .5,-2
114,165 .5,-2.3
115,164 .5,-2.6
116,163 .5,-2.9
117,162 .5,-3.2
118,161 .5,-3.5
119,160 .5,-3.8
120,159 .5,-4.1
121,158 .5,-4.4
122,157 .5,-4.7
123,156 .5,-5
124,155 .5,-5.3
125,154 .5,-5.6
126,153 .5,-5.9
127,152 .5,-6.2
128,151 .5,-6.5
129,150 .5,-6.8
130,149 .5,-7.1
131,148 .5,-7.4
132,147 .5,-7.7
133,146 .5,-8
134,145 .5,-8.3
135,144 .5,-8.6
136,143 .5,-8.9
137,142 .5,-9.2
138,141 .5,-9.5
139,140 .5,-9.8
140,139 .5,-10.1
141,138 .5,-10.4
142,137 .5,-10.7
143,136 .5,-11
144,135 .5,-11.3
145,134 .5,-11.6
146,133 .5,-11.9
147,132 .5,-12.2
148,131 .5,-12.5
149,130 .5,-12.8
150,129 .5,-13.1
151,128 .5,-13.4
152,127 .5,-13.7
153,126 .5,-14
154,125 .5,-14.3
155,124 .5,-14.6
156,123 .5,-14.9
157,122 .5,-15.2
158,121 .5,-15.5
159,120 .5,-15.8
160,119 .5,-16.1
161,118 .5,-16.4
162,117 .5,-16.7
163,116 .5,-17
164,115 .5,-17.3
165,114 .5,-17.6
166,113 .5,-17.9
167,112 .5,-18.2
168,111 .5,-18.5
169,110 .5,-18.8
170,109 .5,-19.1
171,108 .5,-19.4
172,107 .5,-19.7
173,106 .5,-20
174,105 .5,-20.3
175,104 .5,-20.6
176,103 .5,-20.9
177,102 .5,-21.2
178,101 .5,-21.5
179,100 .5,-21.8
180,99 .5,-22.1
181,98 .5,-22.4
182,97 .5,-22.7
183,96 .5,-23
184,95 .5,-23.3
185,94 .5,-23.6
186,93 .5,-23.9
187,92 .5,-24.2
188,91 .5,-24.5
189,90 .5,-24.8
190,89 .5,-25.1
191,88 .5,-25.4
192,87 .5,-25.7
193,86 .5,-26
194,85 .5,-26.3
195,84 .5,-26.6
196,83 .5,-26.9
197,82 .5,-27.2
198,81 .5,-27.5
199,80 .5,-27.8
200,79 .5,-28.1
201,78 .5,-28.4
I want to replace these variables by:
Var001
Var002
Var003
Var004
Var005
Var006
Var007
Var008
Var009
Var010
Var011
Var012
Var013
Var014
Var015
Var016
Var017
Var018
Var019
Var020
Var021
Var022
Var023
Var024
Var025
Var026
Var027
Var028
Var029
Var030
Var031
Var032
Var033
Var034
Var035
Var036
Var037
Var038
Var039
Var040
Var041
Var042
Var043
Var044
Var045
Var046
Var047
Var048
Var049
Var050
Var051
Var052
Var053
Var054
Var055
Var056
Var057
Var058
Var059
Var060
Var061
Var062
Var063
Var064
Var065
Var066
Var067
Var068
Var069
Var070
Var071
Var072
Var073
Var074
Var075
Var076
Var077
Var078
Var079
Var080
Var081
Var082
Var083
Var084
Var085
Var086
Var087
Var088
Var089
Var090
Var091
Var092
Var093
Var094
Var095
Var096
Var097
Var098
Var099
Var100
Var101
Var102
Var103
Var104
Var105
Var106
Var107
Var108
Var109
Var110
Var111
Var112
Var113
Var114
Var115
Var116
Var117
Var118
Var119
Var120
Var121
Var122
Var123
Var124
Var125
Var126
Var127
Var128
Var129
Var130
Var131
Var132
Var133
Var134
Var135
Var136
Var137
Var138
Var139
Var140
Var141
Var142
Var143
Var144
Var145
Var146
Var147
Var148
Var149
Var150
Var151
Var152
Var153
Var154
Var155
Var156
Var157
Var158
Var159
Var160
Var161
Var162
Var163
Var164
Var165
Var166
Var167
Var168
Var169
Var170
Var171
Var172
Var173
Var174
Var175
Var176
Var177
Var178
Var179
Var180
Var181
Var182
Var183
Var184
Var185
Var186
Var187
Var188
Var189
Var190
Var191
Var192
Var193
Var194
Var195
Var196
Var197
Var198
Var199
Var200
Suggestion: take a small amount of that data and show what the expected result would be.
When I look at this description I see 4 values, and you have maybe addressed one.
col1,variable,value
1,277 .5,28.3
2,276 .5,28
3,275 .5,27.7
4,274 .5,27.4
5,273 .5,27.1
6,272 .5,26.8
On the first row I see what appears to be col1 with a value of 1, which, if I understand correctly, you don't actually want.
Then, for a CSV file, there is a value with a space in the middle that would be "277 .5". So how does this relate to your creation of variables? Why does every single one of those have a ' .5' component? Last, I would say that the value of interest is 28.3.
I am not sure that you actually gain anything with your proposed Var001 to Var200 structure. It might make sense to use Var078 to Var277, since you have values related to 78, 79, ..., 277. Then the variable names would have an immediate link back to the source data.
This is a short example of one way that may be useful if you really think you need a whole slew of variables, but it uses my proposed naming:
data example;
   infile datalines dlm=' ,';
   input col1 index junk $ value;
   array vars(272:277) var272-var277;
   retain var272-var277;
   vars(index)=value;
   drop col1 index junk;
datalines;
1,277 .5,28.3
2,276 .5,28
3,275 .5,27.7
4,274 .5,27.4
5,273 .5,27.1
6,272 .5,26.8
;
The infile DLM= option says to also use a space as well as a comma, so it is easy to read that index. The ARRAY statement uses the (272:277) to set the expected index values to match the variables in the list.
The RETAIN keeps all of the values from each line read. You would actually only want the last record, but that isn't quite as easy to do with inline data using INFILE DATALINES.
Reading an external data file, not the inline values I show here, you could add an INFILE option of END= that would only write the data out when the last record is read. Something like:
data example;
   infile yourfilename dlm=' ,' end=lastone;
   input col1 index junk $ value;
   array vars(272:277) var272-var277;
   retain var272-var277;
   vars(index)=value;
   drop col1 index junk;
   if lastone then output;
run;
Obviously, for your full data you should change the 272 above to 78, so the array covers var078-var277.
If all of the files are in the same folder, your infile statement may be able to use a wildcard to read all of the files in one pass; add a FILENAME= option to the infile to capture which file the records are from. Maybe something like:
data example;
   length patientfile source $ 200;
   infile "c:\folder\subfolder\patientfile*.csv" dlm=' ,' end=lastone filename=source;
   input col1 index junk $ value;
   array vars(272:277) var272-var277;
   retain var272-var277;
   vars(index)=value;
   drop col1 index junk;
   patientfile=source;
   if lastone then output;
run;
You should be able to read ALL of the files in one data step.
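One caution, untested: with a wildcard INFILE, the END= flag is only set on the very last record of the very last file, so the step above would write a single row for everything. To get one row per patient you also need the EOV= flag to notice each new file. Here is a sketch of that variation; the folder path, file pattern, dataset name, and the 78:277 index range are all assumptions carried over from the earlier posts and the sample data:
data patients_wide;
   length patientfile source $ 200;
   infile "c:\folder\subfolder\patientfile*.csv" dlm=' ,' truncover
          eov=newfile end=lastone filename=source;
   array vars(78:277) var078-var277;
   retain var078-var277 patientfile;
   input @;
   *first record of a new file: output the finished previous patient and reset;
   if newfile then do;
      output;
      call missing(of var078-var277);
      newfile = 0;
   end;
   patientfile = scan(source, -2, '.\/');
   *skip the header line in each file;
   if _infile_ =: 'col1' then delete;
   input col1 index junk $ value;
   vars(index) = value;
   if lastone then output;
   drop col1 index junk;
run;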
From your posted example you appear to have four variables, not three in your files. The middle one looks like it has two values separated by a space.
First let's make some example files we can use to test our program. Let's call them pat1.csv and pat2.csv.
%let path=%sysfunc(pathname(work));
data _null_;
file "&path/pat1.csv";
input;
put _infile_;
cards;
col1,variable,value
1,277 .5,31.3
2,276 .5,31
3,275 .5,30.7
4,274 .5,30.4
5,273 .5,30.1
;
data _null_;
file "&path/pat2.csv";
input;
put _infile_;
cards;
col1,variable,value
1,277 .5,28.3
2,276 .5,28
3,275 .5,27.7
4,274 .5,27.4
5,273 .5,27.1
;
Now let's read all of the files in that directory that start with pat and end with .csv.
We can test the first column's value to figure out when we have read a header row and discard it.
I am not sure what you want to call those .5's; let's call it measure just to have a name.
Let's tell SAS to use both comma and space as the possible delimiters.
You can use the filename as the patient id.
data raw ;
length fname $256;
infile "&path/pat*.csv" dsd dlm=', ' truncover filename=fname ;
length patient $30 col1 $5 varname $32 measure value 8;
input col1 @ ;
if col1='col1' then delete;
input varname measure value;
patient = scan(fname,-2,'./\');
drop col1;
run;
So now you have data like this:
Obs    patient    varname    measure    value
  1    pat1         277        0.5       31.3
  2    pat1         276        0.5       31.0
  3    pat1         275        0.5       30.7
  4    pat1         274        0.5       30.4
  5    pat1         273        0.5       30.1
  6    pat2         277        0.5       28.3
  7    pat2         276        0.5       28.0
  8    pat2         275        0.5       27.7
  9    pat2         274        0.5       27.4
 10    pat2         273        0.5       27.1
That might actually be the format you want. It really depends on what you are going to do.
But if you want to turn it into one observation per patient you could use PROC TRANSPOSE.
proc transpose data=raw out=want (drop=_name_) prefix=var;
by patient measure ;
id varname;
var value;
run;
Now you have this data:
Obs    patient    measure    var277    var276    var275    var274    var273
  1    pat1         0.5       31.3      31        30.7      30.4      30.1
  2    pat2         0.5       28.3      28        27.7      27.4      27.1
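Since the original request was for sequential names Var001-Var200 rather than names tied to the peak positions, here is a hedged variation on the step above: number the rows within each patient and use that counter for the ID. The names raw_sorted, long_named, want2, seq, and newname are made up for this sketch.
proc sort data=raw out=raw_sorted equals;
   *EQUALS (the default) keeps the original within-patient row order;
   by patient;
run;
data long_named;
   set raw_sorted;
   by patient;
   length newname $8;
   if first.patient then seq = 0;
   seq + 1;
   *row counter within each patient: 1 -> Var001, 2 -> Var002, ...;
   newname = cats('Var', put(seq, z3.));
run;
proc transpose data=long_named out=want2 (drop=_name_);
   by patient measure;
   id newname;
   var value;
run;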
Show the code you use to import one file, and then we can illustrate how to import all 80 files and combine them.
@Job04 wrote:
I have 80 patients measured on 200 variables. The measurements were done using analytical chemistry equipment. Each patient has their own spreadsheet with the variable names and the corresponding values. Variables are in rows. The spreadsheets are separate CSV files (80 x 2).
I need to transform all the spreadsheets into a single dataset with the 80 patients in rows and the variables in columns (80 x 200). This starts with importing the 80 sheets.
I would appreciate it if you could help me with the coding of this.
Thanks
proc import datafile="C:\Users\hp\Desktop\data\T1.002.csv"
dbms=csv out=out;
run;
The data actually has more than 200 variables. I was just making something that looks like the real data.
@Job04 wrote:
The data actually has more than 200 variables. I was just making something that looks like the real data.
5 to 10 lines are sufficient then.
I made this mistake just last week, and claimed great work done by someone else as my own.
When your answer worked I rushed to say thank you and replied to myself. So embarrassing. I'm so sorry.