I have 80 patients measured on 200 variables. The measurements were done using analytical chemistry equipment. Each patient has their own spreadsheet with the variable names and the corresponding values. Variables are in rows. The spreadsheets are separate CSV files (80 x 2).
I need to transform all the spreadsheets into a single dataset with the 80 patients in rows and the variables in columns (80 x 200). This starts with importing the 80 sheets.
I would appreciate it if you could help me with the coding of this.
Thanks
Assuming all files are in a single folder, something like this should work, untested:
%let path2files = C:\Users\hp\Desktop\data;
data import_all;
*make sure variables to store file name are long enough;
length filename txt_file_name $256;
*keep file name from record to record;
retain txt_file_name;
*Use wildcard in the infile statement;
infile "&path2files\*.csv" dsd eov=eov filename=filename truncover;
*Input first record and hold line;
input@;
*Check if this is the first record or the first record in a new file;
*If it is, replace the filename with the new file name and move to next line;
if _n_ eq 1 or eov then do;
txt_file_name = scan(filename, -1, "\");
eov=0;delete;
end;
*Otherwise go to the import step and read the files;
else input col1 variable value;
format col1 z3.;
run;
proc sort data=import_all;
by txt_file_name col1;
run;
proc transpose data=import_all out=want prefix=VAR_;
by txt_file_name;
id col1;
var value;
run;
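If the wildcard read works, a couple of quick checks on the transposed table (the WANT dataset from the step above) should show one VAR_ column per col1 value and one row per file; a minimal, untested sketch:
proc contents data=want varnum;
*expect one VAR_ column per value of col1;
run;
proc print data=want (obs=3);
*expect one row per input file;
run;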
@Job04 wrote:
proc import datafile="C:\Users\hp\Desktop\data\T1.002.csv"
dbms=csv out=out;
run;
Please provide a couple of sample CSV files.
And ... do the 80 csv files have identical structure? Do they all have the same 200 rows/variables in the same order?
The same number of measurements and same variables for all patients and in the same order. The variables are numeric because they are the peaks in the spectrum. I want to change these variables to Var001,..., Var200.
The analytical equipment works with the JEOL format, so when converting to CSV the column col1 showed up, but it has no relevance. The data for 2 patients looks like this:
Patient1:
col1,variable,value
1,277 .5,28.3
2,276 .5,28
3,275 .5,27.7
4,274 .5,27.4
5,273 .5,27.1
6,272 .5,26.8
7,271 .5,26.5
8,270 .5,26.2
9,269 .5,25.9
10,268 .5,25.6
11,267 .5,25.3
12,266 .5,25
13,265 .5,24.7
14,264 .5,24.4
15,263 .5,24.1
16,262 .5,23.8
17,261 .5,23.5
18,260 .5,23.2
19,259 .5,22.9
20,258 .5,22.6
21,257 .5,22.3
23,256 .5,22
24,255 .5,21.7
25,254 .5,21.4
26,253 .5,21.1
27,252 .5,20.8
28,251 .5,20.5
29,250 .5,20.2
30,249 .5,19.9
31,248 .5,19.6
32,247 .5,19.3
33,246 .5,19
34,245 .5,18.7
35,244 .5,18.4
36,243 .5,18.1
37,242 .5,17.8
38,241 .5,17.5
39,240 .5,17.2
40,239 .5,16.9
41,238 .5,16.6
42,237 .5,16.3
43,236 .5,16
44,235 .5,15.7
45,234 .5,15.4
46,233 .5,15.1
47,232 .5,14.8
48,231 .5,14.5
49,230 .5,14.2
50,229 .5,13.9
51,228 .5,13.6
52,227 .5,13.3
53,226 .5,13
54,225 .5,12.7
55,224 .5,12.4
56,223 .5,12.1
57,222 .5,11.8
58,221 .5,11.5
59,220 .5,11.2
60,219 .5,10.9
61,218 .5,10.6
62,217 .5,10.3
63,216 .5,10
64,215 .5,9.7
65,214 .5,9.4
66,213 .5,9.1
67,212 .5,8.8
68,211 .5,8.5
69,210 .5,8.2
70,209 .5,7.9
71,208 .5,7.6
72,207 .5,7.3
73,206 .5,7
74,205 .5,6.7
75,204 .5,6.4
76,203 .5,6.1
77,202 .5,5.8
78,201 .5,5.5
79,200 .5,5.2
80,199 .5,4.9
81,198 .5,4.6
82,197 .5,4.3
83,196 .5,4
84,195 .5,3.7
85,194 .5,3.4
86,193 .5,3.1
87,192 .5,2.8
88,191 .5,2.5
89,190 .5,2.2
90,189 .5,1.9
91,188 .5,1.6
92,187 .5,1.3
93,186 .5,1
94,185 .5,0.7
95,184 .5,0.4
96,183 .5,0.1
97,182 .5,-0.2
98,181 .5,-0.5
99,180 .5,-0.8
100,179 .5,-1.1
101,178 .5,-1.4
102,177 .5,-1.7
103,176 .5,-2
104,175 .5,-2.3
105,174 .5,-2.6
106,173 .5,-2.9
107,172 .5,-3.2
108,171 .5,-3.5
109,170 .5,-3.8
110,169 .5,-4.1
111,168 .5,-4.4
112,167 .5,-4.7
113,166 .5,-5
114,165 .5,-5.3
115,164 .5,-5.6
116,163 .5,-5.9
117,162 .5,-6.2
118,161 .5,-6.5
119,160 .5,-6.8
120,159 .5,-7.1
121,158 .5,-7.4
122,157 .5,-7.7
123,156 .5,-8
124,155 .5,-8.3
125,154 .5,-8.6
126,153 .5,-8.9
127,152 .5,-9.2
128,151 .5,-9.5
129,150 .5,-9.8
130,149 .5,-10.1
131,148 .5,-10.4
132,147 .5,-10.7
133,146 .5,-11
134,145 .5,-11.3
135,144 .5,-11.6
136,143 .5,-11.9
137,142 .5,-12.2
138,141 .5,-12.5
139,140 .5,-12.8
140,139 .5,-13.1
141,138 .5,-13.4
142,137 .5,-13.7
143,136 .5,-14
144,135 .5,-14.3
145,134 .5,-14.6
146,133 .5,-14.9
147,132 .5,-15.2
148,131 .5,-15.5
149,130 .5,-15.8
150,129 .5,-16.1
151,128 .5,-16.4
152,127 .5,-16.7
153,126 .5,-17
154,125 .5,-17.3
155,124 .5,-17.6
156,123 .5,-17.9
157,122 .5,-18.2
158,121 .5,-18.5
159,120 .5,-18.8
160,119 .5,-19.1
161,118 .5,-19.4
162,117 .5,-19.7
163,116 .5,-20
164,115 .5,-20.3
165,114 .5,-20.6
166,113 .5,-20.9
167,112 .5,-21.2
168,111 .5,-21.5
169,110 .5,-21.8
170,109 .5,-22.1
171,108 .5,-22.4
172,107 .5,-22.7
173,106 .5,-23
174,105 .5,-23.3
175,104 .5,-23.6
176,103 .5,-23.9
177,102 .5,-24.2
178,101 .5,-24.5
179,100 .5,-24.8
180,99 .5,-25.1
181,98 .5,-25.4
182,97 .5,-25.7
183,96 .5,-26
184,95 .5,-26.3
185,94 .5,-26.6
186,93 .5,-26.9
187,92 .5,-27.2
188,91 .5,-27.5
189,90 .5,-27.8
190,89 .5,-28.1
191,88 .5,-28.4
192,87 .5,-28.7
193,86 .5,-29
194,85 .5,-29.3
195,84 .5,-29.6
196,83 .5,-29.9
197,82 .5,-30.2
198,81 .5,-30.5
199,80 .5,-30.8
200,79 .5,-31.1
201,78 .5,-31.4
Patient2:
col1,variable,value
1,277 .5,31.3
2,276 .5,31
3,275 .5,30.7
4,274 .5,30.4
5,273 .5,30.1
6,272 .5,29.8
7,271 .5,29.5
8,270 .5,29.2
9,269 .5,28.9
10,268 .5,28.6
11,267 .5,28.3
12,266 .5,28
13,265 .5,27.7
14,264 .5,27.4
15,263 .5,27.1
16,262 .5,26.8
17,261 .5,26.5
18,260 .5,26.2
19,259 .5,25.9
20,258 .5,25.6
21,257 .5,25.3
23,256 .5,25
24,255 .5,24.7
25,254 .5,24.4
26,253 .5,24.1
27,252 .5,23.8
28,251 .5,23.5
29,250 .5,23.2
30,249 .5,22.9
31,248 .5,22.6
32,247 .5,22.3
33,246 .5,22
34,245 .5,21.7
35,244 .5,21.4
36,243 .5,21.1
37,242 .5,20.8
38,241 .5,20.5
39,240 .5,20.2
40,239 .5,19.9
41,238 .5,19.6
42,237 .5,19.3
43,236 .5,19
44,235 .5,18.7
45,234 .5,18.4
46,233 .5,18.1
47,232 .5,17.8
48,231 .5,17.5
49,230 .5,17.2
50,229 .5,16.9
51,228 .5,16.6
52,227 .5,16.3
53,226 .5,16
54,225 .5,15.7
55,224 .5,15.4
56,223 .5,15.1
57,222 .5,14.8
58,221 .5,14.5
59,220 .5,14.2
60,219 .5,13.9
61,218 .5,13.6
62,217 .5,13.3
63,216 .5,13
64,215 .5,12.7
65,214 .5,12.4
66,213 .5,12.1
67,212 .5,11.8
68,211 .5,11.5
69,210 .5,11.2
70,209 .5,10.9
71,208 .5,10.6
72,207 .5,10.3
73,206 .5,10
74,205 .5,9.7
75,204 .5,9.4
76,203 .5,9.1
77,202 .5,8.8
78,201 .5,8.5
79,200 .5,8.2
80,199 .5,7.9
81,198 .5,7.6
82,197 .5,7.3
83,196 .5,7
84,195 .5,6.7
85,194 .5,6.4
86,193 .5,6.1
87,192 .5,5.8
88,191 .5,5.5
89,190 .5,5.2
90,189 .5,4.9
91,188 .5,4.6
92,187 .5,4.3
93,186 .5,4
94,185 .5,3.7
95,184 .5,3.4
96,183 .5,3.1
97,182 .5,2.8
98,181 .5,2.5
99,180 .5,2.2
100,179 .5,1.9
101,178 .5,1.6
102,177 .5,1.3
103,176 .5,1
104,175 .5,0.7
105,174 .5,0.4
106,173 .5,0.1
107,172 .5,-0.2
108,171 .5,-0.5
109,170 .5,-0.8
110,169 .5,-1.1
111,168 .5,-1.4
112,167 .5,-1.7
113,166 .5,-2
114,165 .5,-2.3
115,164 .5,-2.6
116,163 .5,-2.9
117,162 .5,-3.2
118,161 .5,-3.5
119,160 .5,-3.8
120,159 .5,-4.1
121,158 .5,-4.4
122,157 .5,-4.7
123,156 .5,-5
124,155 .5,-5.3
125,154 .5,-5.6
126,153 .5,-5.9
127,152 .5,-6.2
128,151 .5,-6.5
129,150 .5,-6.8
130,149 .5,-7.1
131,148 .5,-7.4
132,147 .5,-7.7
133,146 .5,-8
134,145 .5,-8.3
135,144 .5,-8.6
136,143 .5,-8.9
137,142 .5,-9.2
138,141 .5,-9.5
139,140 .5,-9.8
140,139 .5,-10.1
141,138 .5,-10.4
142,137 .5,-10.7
143,136 .5,-11
144,135 .5,-11.3
145,134 .5,-11.6
146,133 .5,-11.9
147,132 .5,-12.2
148,131 .5,-12.5
149,130 .5,-12.8
150,129 .5,-13.1
151,128 .5,-13.4
152,127 .5,-13.7
153,126 .5,-14
154,125 .5,-14.3
155,124 .5,-14.6
156,123 .5,-14.9
157,122 .5,-15.2
158,121 .5,-15.5
159,120 .5,-15.8
160,119 .5,-16.1
161,118 .5,-16.4
162,117 .5,-16.7
163,116 .5,-17
164,115 .5,-17.3
165,114 .5,-17.6
166,113 .5,-17.9
167,112 .5,-18.2
168,111 .5,-18.5
169,110 .5,-18.8
170,109 .5,-19.1
171,108 .5,-19.4
172,107 .5,-19.7
173,106 .5,-20
174,105 .5,-20.3
175,104 .5,-20.6
176,103 .5,-20.9
177,102 .5,-21.2
178,101 .5,-21.5
179,100 .5,-21.8
180,99 .5,-22.1
181,98 .5,-22.4
182,97 .5,-22.7
183,96 .5,-23
184,95 .5,-23.3
185,94 .5,-23.6
186,93 .5,-23.9
187,92 .5,-24.2
188,91 .5,-24.5
189,90 .5,-24.8
190,89 .5,-25.1
191,88 .5,-25.4
192,87 .5,-25.7
193,86 .5,-26
194,85 .5,-26.3
195,84 .5,-26.6
196,83 .5,-26.9
197,82 .5,-27.2
198,81 .5,-27.5
199,80 .5,-27.8
200,79 .5,-28.1
201,78 .5,-28.4
I want to replace these variables by:
Var001
Var002
Var003
Var004
Var005
Var006
Var007
Var008
Var009
Var010
Var011
Var012
Var013
Var014
Var015
Var016
Var017
Var018
Var019
Var020
Var021
Var022
Var023
Var024
Var025
Var026
Var027
Var028
Var029
Var030
Var031
Var032
Var033
Var034
Var035
Var036
Var037
Var038
Var039
Var040
Var041
Var042
Var043
Var044
Var045
Var046
Var047
Var048
Var049
Var050
Var051
Var052
Var053
Var054
Var055
Var056
Var057
Var058
Var059
Var060
Var061
Var062
Var063
Var064
Var065
Var066
Var067
Var068
Var069
Var070
Var071
Var072
Var073
Var074
Var075
Var076
Var077
Var078
Var079
Var080
Var081
Var082
Var083
Var084
Var085
Var086
Var087
Var088
Var089
Var090
Var091
Var092
Var093
Var094
Var095
Var096
Var097
Var098
Var099
Var100
Var101
Var102
Var103
Var104
Var105
Var106
Var107
Var108
Var109
Var110
Var111
Var112
Var113
Var114
Var115
Var116
Var117
Var118
Var119
Var120
Var121
Var122
Var123
Var124
Var125
Var126
Var127
Var128
Var129
Var130
Var131
Var132
Var133
Var134
Var135
Var136
Var137
Var138
Var139
Var140
Var141
Var142
Var143
Var144
Var145
Var146
Var147
Var148
Var149
Var150
Var151
Var152
Var153
Var154
Var155
Var156
Var157
Var158
Var159
Var160
Var161
Var162
Var163
Var164
Var165
Var166
Var167
Var168
Var169
Var170
Var171
Var172
Var173
Var174
Var175
Var176
Var177
Var178
Var179
Var180
Var181
Var182
Var183
Var184
Var185
Var186
Var187
Var188
Var189
Var190
Var191
Var192
Var193
Var194
Var195
Var196
Var197
Var198
Var199
Var200
Suggestion: take a small amount of that data and show what the expected result would be.
When I look at this description I see 4 values, and you have maybe addressed one.
col1,variable,value
1,277 .5,28.3
2,276 .5,28
3,275 .5,27.7
4,274 .5,27.4
5,273 .5,27.1
6,272 .5,26.8
On the first row I see what appears to be col1 with a value of 1, which, if I understand correctly, you don't actually want.
Then, for a CSV file, there is a value with a space in the middle that would be "277 .5". So how does this relate to your creation of variables? Why does every single one of those have a ' .5' component? Last, I would say that the value of interest is 28.3.
I am not sure that you actually gain anything with your proposed Var001 to Var200 structure. It might make sense to use Var078 to Var277, since you have values related to 78, 79, ..., 277. Then the variable names would have an immediate link back to the source data.
This is a short example of one way that may be useful if you really think you need a whole slew of variables, but it uses my proposed naming:
data example;
   infile datalines dlm=' ,';
   input col1 index junk $ value;
   array vars(272:277) var272-var277;
   retain var272-var277;
   vars(index)=value;
   drop col1 index junk;
datalines;
1,277 .5,28.3
2,276 .5,28
3,275 .5,27.7
4,274 .5,27.4
5,273 .5,27.1
6,272 .5,26.8
;
The infile DLM= option says to also use a space as well as a comma, so it is easy to read that index. The ARRAY statement uses the (272:277) to set the expected index values to match the variables in the list.
The RETAIN keeps all of the values from each line read. You would actually only want the last record, but that isn't quite as easy to do with inline data using INFILE DATALINES.
Reading an external data file, not the inline values I show here, you could add an INFILE option of END= that would only write the data out when the last record is read. Something like:
data example;
   infile yourfilename dlm=' ,' end=lastone;
   input col1 index junk $ value;
   array vars(272:277) var272-var277;
   retain var272-var277;
   vars(index)=value;
   drop col1 index junk;
   if lastone then output;
run;
Obviously, for your full data you should change the 272 above to 78, so the array covers var078-var277.
If all of the files are in the same folder, your infile statement may be able to use a wildcard to read all of the files in one pass; add a FILENAME= option to the infile to capture which file the records are from. Maybe something like:
data example;
   length patientfile source $ 200;
   infile "c:\folder\subfolder\patientfile*.csv" dlm=' ,' end=lastone filename=source;
   input col1 index junk $ value;
   array vars(272:277) var272-var277;
   retain var272-var277;
   vars(index)=value;
   drop col1 index junk;
   patientfile=source;
   if lastone then output;
run;
You should be able to read ALL of the files in one data step.
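One caution, untested: with a wildcard INFILE, the END= flag is only set on the very last record of the very last file, so the step above would write a single row for everything. To get one row per patient you also need the EOV= flag to notice each new file. Here is a sketch of that variation; the folder path, file pattern, dataset name, and the 78:277 index range are all assumptions carried over from the earlier posts and the sample data:
data patients_wide;
   length patientfile source $ 200;
   infile "c:\folder\subfolder\patientfile*.csv" dlm=' ,' truncover
          eov=newfile end=lastone filename=source;
   array vars(78:277) var078-var277;
   retain var078-var277 patientfile;
   input @;
   *first record of a new file: output the finished previous patient and reset;
   if newfile then do;
      output;
      call missing(of var078-var277);
      newfile = 0;
   end;
   patientfile = scan(source, -2, '.\/');
   *skip the header line in each file;
   if _infile_ =: 'col1' then delete;
   input col1 index junk $ value;
   vars(index) = value;
   if lastone then output;
   drop col1 index junk;
run;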
From your posted example you appear to have four variables, not three in your files. The middle one looks like it has two values separated by a space.
First let's make some example files we can use to test our program. Let's call them pat1.csv and pat2.csv.
%let path=%sysfunc(pathname(work));
data _null_;
file "&path/pat1.csv";
input;
put _infile_;
cards;
col1,variable,value
1,277 .5,31.3
2,276 .5,31
3,275 .5,30.7
4,274 .5,30.4
5,273 .5,30.1
;
data _null_;
file "&path/pat2.csv";
input;
put _infile_;
cards;
col1,variable,value
1,277 .5,28.3
2,276 .5,28
3,275 .5,27.7
4,274 .5,27.4
5,273 .5,27.1
;
Now let's read all of the files in that directory that start with pat and end with .csv.
We can test the first column's value to figure out when we have read a header row and discard it.
I am not sure what you want to call those .5's; let's call it measure just to have a name.
Let's tell SAS to use both comma and space as the possible delimiters.
You can use the filename as the patient id.
data raw ;
length fname $256;
infile "&path/pat*.csv" dsd dlm=', ' truncover filename=fname ;
length patient $30 col1 $5 varname $32 measure value 8;
input col1 @ ;
if col1='col1' then delete;
input varname measure value;
patient = scan(fname,-2,'./\');
drop col1;
run;
So now you have data like this:
Obs    patient    varname    measure    value
  1    pat1         277        0.5       31.3
  2    pat1         276        0.5       31.0
  3    pat1         275        0.5       30.7
  4    pat1         274        0.5       30.4
  5    pat1         273        0.5       30.1
  6    pat2         277        0.5       28.3
  7    pat2         276        0.5       28.0
  8    pat2         275        0.5       27.7
  9    pat2         274        0.5       27.4
 10    pat2         273        0.5       27.1
That might actually be the format you want. It really depends on what you are going to do.
But if you want to turn it into one observation per patient you could use PROC TRANSPOSE.
proc transpose data=raw out=want (drop=_name_) prefix=var;
by patient measure ;
id varname;
var value;
run;
Now you have this data:
Obs    patient    measure    var277    var276    var275    var274    var273
  1    pat1         0.5       31.3      31        30.7      30.4      30.1
  2    pat2         0.5       28.3      28        27.7      27.4      27.1
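Since the original request was for sequential names Var001-Var200 rather than names tied to the peak positions, here is a hedged variation on the step above: number the rows within each patient and use that counter for the ID. The names raw_sorted, long_named, want2, seq, and newname are made up for this sketch.
proc sort data=raw out=raw_sorted equals;
   *EQUALS (the default) keeps the original within-patient row order;
   by patient;
run;
data long_named;
   set raw_sorted;
   by patient;
   length newname $8;
   if first.patient then seq = 0;
   seq + 1;
   *row counter within each patient: 1 -> Var001, 2 -> Var002, ...;
   newname = cats('Var', put(seq, z3.));
run;
proc transpose data=long_named out=want2 (drop=_name_);
   by patient measure;
   id newname;
   var value;
run;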
Show the code you use to import one file, and then we can illustrate how to import all 80 files and combine them.
@Job04 wrote:
I have 80 patients measured on 200 variables. The measurements were done using analytical chemistry equipment. Each patient has their own spreadsheet with the variable names and the corresponding values. Variables are in rows. The spreadsheets are separate CSV files (80 x 2).
I need to transform all the spreadsheets into a single dataset with the 80 patients in rows and the variables in columns (80 x 200). This starts with importing the 80 sheets.
I would appreciate it if you could help me with the coding of this.
Thanks
proc import datafile="C:\Users\hp\Desktop\data\T1.002.csv"
dbms=csv out=out;
run;
The data actually has more than 200 variables. I was just making something that looks like the real data.
@Job04 wrote:
The data actually has more than 200 variables. I was just making something that looks like the real data.
5 to 10 lines are sufficient then.
I made this mistake just last week, and claimed great work done by someone else as my own.
When your answer worked I rushed to say thank you and replied to myself. So embarrassing. I'm so sorry.