I received a raw data from other hospital and they combine individual their all lab dates, lab names, lab values, and lab units into one column and using "|" as delimiter
I am trying combine array and scan to separate them into columns but it's not working for me,
here is my coding,
ARRAY LabName(*) Lab_Name1-Lab_Name79;
DO I= 1 TO dim(LabName);
LabName(I)=scan(Lab_Name,I,"|");
END;
DROP I;
RUN;
here is the raw data looks like
ID | Lab Date | Lab Name | Lab Value | Lab Unit |
1 | 09/19/2008|09/19/2008|09/19/2008|09/19/2008|09/19/2008|09/19/2008|09/19/2008|09/19/2008|09/19/2008|09/19/2008|09/19/2008 | Hematocrit|Hemoglobin|MCH Blood|MCHC Blood|MCV Blood|MPV|Platelet Count|RBC|RDW|WBC|Instr WBC | 39.1|13.1|25.3|33.5|75.3|8.1|200|5.19|18.0|9.6|9.6 | %|g/dL|pg|g/dL|fL|fL|thou/mcL|x10(6)/mcL|%|thou/mcL| |
2 | 10/12/2011|10/12/2011|10/12/2011|10/12/2011|10/12/2011|10/12/2011|10/12/2011|10/12/2011|10/12/2011|10/12/2011|10/12/2011 | Hematocrit|Hemoglobin|MCH Blood|MCHC Blood|MCV Blood|MPV|Platelet Count|RBC|RDW|WBC|Instr WBC | 36.2|12.4|25.8|34.3|75.0|8.9|208|4.83|18.4|5.1|5.1 | %|g/dL|pg|g/dL|fL|fL|thou/mcL|x10(6)/mcL|%|thou/mcL| |
3 | 10/05/2011|10/05/2011|10/05/2011|10/05/2011|10/05/2011|10/05/2011|10/05/2011|10/05/2011|10/05/2011|10/05/2011|10/05/2011|10/05/2011 | Cell Morphology|Hematocrit|Hemoglobin|MCH Blood|MCHC Blood|MCV Blood|MPV|Platelet Count|RBC|RDW|WBC|Instr WBC | Yes|40.0|13.7|25.4|34.1|74.6|8.7|214|5.37|19.0|5.6|5.6 | |%|g/dL|pg|g/dL|fL|fL|thou/mcL|x10(6)/mcL|%|thou/mcL| |
4 | 09/28/2012|09/28/2012|09/28/2012|09/28/2012|09/28/2012|09/28/2012|09/28/2012|09/28/2012|09/28/2012|09/28/2012|09/28/2012|09/28/2012 | Cell Morphology|Hematocrit|Hemoglobin|MCH Blood|MCHC Blood|MCV Blood|MPV|Platelet Count|RBC|RDW|WBC|Instr WBC | Yes|40.2|13.2|24.6|32.8|74.8|9.2|194|5.38|18.7|8.7|8.7 | |%|g/dL|pg|g/dL|fL|fL|thou/mcL|x10(6)/mcL|%|thou/mcL| |
5 | 11/19/2012|11/19/2012|11/19/2012|11/19/2012|11/19/2012|11/19/2012|11/19/2012|11/19/2012|11/19/2012|11/19/2012|11/19/2012|11/19/2012|11/19/2012|11/19/2012 | Cell Morphology|Hematocrit|Hemoglobin|MCH Blood|MCHC Blood|MCV Blood|MPV|Platelet Count|RBC|RDW|WBC|Differential Type|Instr WBC|LD | Yes|39.1|12.8|23.4|32.7|71.5|8.3|243|5.47|21.1|4.8|Yes|4.8|190 | |%|g/dL|pg|g/dL|fL|fL|thou/mcL|x10(6)/mcL|%|thou/mcL|||unit/L |
6 | 04/15/2013|04/15/2013|04/15/2013|04/15/2013|04/15/2013|04/15/2013|04/15/2013|04/15/2013|04/15/2013|04/15/2013|04/15/2013 | Hematocrit|Hemoglobin|MCH Blood|MCHC Blood|MCV Blood|MPV|Platelet Count|RBC|RDW|WBC|Instr WBC | 38.0|12.5|24.1|33.0|73.0|8.6|173|5.21|19.6|4.9|4.9 | %|g/dL|pg|g/dL|fL|fL|thou/mcL|x10(6)/mcL|%|thou/mcL| |
7 | 08/05/2013|08/05/2013|08/05/2013|08/05/2013|08/05/2013|08/05/2013|08/05/2013|08/05/2013|08/05/2013|08/05/2013|08/05/2013 | Hematocrit|Hemoglobin|MCH Blood|MCHC Blood|MCV Blood|MPV|Platelet Count|RBC|RDW|WBC|Instr WBC | 37.9|12.6|24.4|33.1|73.7|9.3|179|5.15|19.4|3.9|3.9 | %|g/dL|pg|g/dL|fL|fL|thou/mcL|x10(6)/mcL|%|thou/mcL| |
8 | 09/30/2013|09/30/2013|09/30/2013|09/30/2013|09/30/2013|09/30/2013|09/30/2013|09/30/2013|09/30/2013|09/30/2013|09/30/2013|09/30/2013 | Hematocrit|Hemoglobin|MCH Blood|MCHC Blood|MCV Blood|MPV|Platelet Count|RBC|RDW|WBC|Instr WBC|LD | 38.6|12.8|24.6|33.2|74.1|8.6|207|5.21|19.0|4.2|4.2|223 | %|g/dL|pg|g/dL|fL|fL|thou/mcL|x10(6)/mcL|%|thou/mcL||unit/L |
9 | 01/21/2014|01/21/2014|01/21/2014|01/21/2014|01/21/2014|01/21/2014|01/21/2014|01/21/2014|01/21/2014|01/21/2014|01/21/2014|01/21/2014 | Hematocrit|Hemoglobin|MCH Blood|MCHC Blood|MCV Blood|MPV|Platelet Count|RBC|RDW|WBC|Instr WBC|Nucleated Red Blood Cells | 37.9|12.8|24.0|33.9|70.9|8.5|181|5.34|18.7|5.0|5.0|0.2 | %|g/dL|pg|g/dL|fL|fL|thou/mcL|x10(6)/mcL|%|thou/mcL||/100 WBC |
10 | 06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014|06/02/2014 | Albumin|Alkaline Phosphatase|ALT|Anion Gap|AST|Calcium|Carbon Dioxide|Chloride|Glucose Serum|Potassium|Sodium|Total Bilirubin|Urea Nitrogen|CREATININE|Protein Total|Cell Morphology|Hematocrit|Hemoglobin|MCH Blood|MCHC Blood|MCV Blood|MPV|Platelet Count|RBC|RDW|WBC|Differential Type|Instr WBC|Nucleated Red Blood Cells|Absolute Retic|Percent Retic|Retic RBC | 3.8|83|23|6|16|8.8|27|104|117|3.9|137|0.5|12|0.80|7.9|Yes|38.6|13.0|23.8|33.7|70.7|8.2|201|5.47|19.4|4.1|Yes|4.1|0.1|109.4|2.0|5.47 | g/dL|unit/L|unit/L||unit/L|mg/dL|mmol/L|mmol/L|mg/dL|mmol/L|mmol/L|mg/dL|mg/dL|mg/dL|g/dL||%|g/dL|pg|g/dL|fL|fL|thou/mcL|x10(6)/mcL|%|thou/mcL|||/100 WBC|thou/mcL|%|x10(6)/mcL |
Do you need to split LabDate and LabValue, too?
If yes, then creating one observation for each value in the lists is recommended. This leads to cleaner data-structure, which is often easier to use, than a datasets with 100+ variables.
data want;
set have(
rename=(
LabDate = DateList
LabName = NameList
LabValue = ValueList
)
);
length LabDate 8 LabName $ 50 LabValue 8;
format LabDate mmddyyd10.;
do i = 1 to countw(NameList, '|');
LabDate = input(scan(DateList, i, '|'), mmddyy10.);
LabName = scan(NameList, i, '|');
LabValue = input(scan(ValueList, i, '|'), best32.);
output;
end;
keep Id Lab:;
run;
You cannot use the same name to mean two different things.
Also make sure to define the new variables type (and length).
data want;
set have;
array _name $30 Lab_Name1-Lab_Name79;
array _date 8 Lab_Date1-Lab_Date79;
format Lab_Date1-Lab_Date79 yymmdd10.;
array _value $30 Lab_Value1-Lab_Value79;
do I= 1 do dim(_name);
_name[i]=scan(Lab_Name,I,"|");
_date[i]=input(scan(LabDate,i,'|'),mmddyy10.);
_value[i]= scan(LabValue,i,'|');
end;
drop i LabName labDate labValue ;
run;
The original raw data is .csv file.
I use proc import to bring data to SAS.
Our affiliate IT team save those data files as csv format and using "|" as delimiter upload to our data warehouse.
If I directly import those .csv files to SAS, SAS will not read them correctly.
SAS will read those data files correctly only if I convert them to .xlsx format.
Actually the way the text posted in the original message looks it appears to be a TAB delimited file with three long character fields that have individual values delimited by pipe character. But that might have been caused by copying a pasting from an Excel file instead copying lines from the original file.
If the long fields always have the same number of pipe characters (instead of only including enough for the number of actual values) then you might be able to read the original file using two characters as the delimiters.
If you want help reading the original file then post the first few lines of the actual file (not the file as converted by being opened by Excel). Open the file in a text editor (the SAS program editor should work) and copy a few lines. Then use the Insert Code icon in for forum editor to get a pop-up text box where you can paste the lines. That way the forum editor won't try to convert the lines into paragraphs.
I use notepad to copy couple raw data and attached here.
Thank you for your help!!
I am looking forward a good solution to import these raw data files. I will receive them monthly and I am tired of keeping covert them to excel file and import them.
That's a weird file you have here.
You can read the whole original file in one go:
data _null_;
file "%sysfunc(pathname(WORK))\t.txt" lrecl=500;
put '12312231';
put '68879';
put '09/19/2008|09/19/2008|09/19/2008|09/19/2008|09/19/2008|09/19/2008|09/19/2008|09/19/2008|09/19/2008|09/19/2008|09/18/2008'
'09'x
'Hematocrit|Hemoglobin|MCH Blood|MCHC Blood|MCV Blood|MPV|Platelet Count|RBC|RDW|WBC|Instr WBC'
'09'x
'39.1|13.1|25.3|33.5|75.3|8.1|200|5.19|18.0|9.6|9.6'
'09'x
'%|g/dL|pg|g/dL|fL|fL|thou/mcL|x10(6)/mcL|%|thou/mcL|';
run;
data WANT;
infile "%sysfunc(pathname(WORK))\t.txt" lrecl=500 pad dlm='|' _infile_=BUFF;
input @ ;
_infile_ =translate(BUFF,'|','09'x);
if char(BUFF,3)='/' then
input @1 (DATE1-DATE11) (:mmddyy10.) (M1-M11) (:$25.) (R1-R11) (:best.) (V1-V11) (:$20.);
else input @1 NUM best.;
format DATE1-DATE11 date9.;
run;
Not too sure the structure is useful though:
NUM | DATE1 | DATE2 | M1 | M2 | R1 | R2 | V1 | V2 |
---|---|---|---|---|---|---|---|---|
12312231 | . | . | . | . | ||||
68879 | . | . | . | . | ||||
. | 19SEP2008 | 19SEP2008 | Hematocrit | Hemoglobin | 39.1 | 13.1 | % | g/dL |
The attached sample data really helps to understand what you're dealing with.
Below sample code reads your source data into a long format which is often an easier structure to deal with.
%let source_file=c:\temp\test file.txt;
data long(drop=_:);
attrib
id length=$15 informat=$15.
fin length=$15 informat=$15.
_lab_dates length=$400 informat=$400.
_lab_names length=$400 informat=$400.
_lab_values length=$400 informat=$400.
_lab_units length=$400 informat=$400.
lab_date length=8 informat=mmddyy10. format=date9.
lab_name length=$40
/* lab_value length=8*/
lab_value length=$40
lab_unit length=$40
;
infile "&source_file" dlm='09'x dsd truncover firstobs=2 lrecl=1650;
input id fin _lab_dates _lab_names _lab_values _lab_units;
/* start: if rows without lab data not required then remove below code */
if cmiss(_lab_dates, _lab_names, _lab_values, _lab_units)=4 then output;
else
/* end: of remove section */
do;
_loop_cnt=1+max(countc(_lab_dates,'|'), countc(_lab_names,'|'), countc(_lab_values,'|'), countc(_lab_units,'|'));
do _i=1 to _loop_cnt;
lab_date=input(scan(_lab_dates,_i,'|','M'),mmddyy10.);
lab_name=scan(_lab_names,_i,'|','M');
/* lab_value=input(scan(_lab_values,_i,'|'),best32.);*/
lab_value=scan(_lab_values,_i,'|','M');
lab_unit=scan(_lab_units,_i,'|','M');
output;
end;
end;
run;
I wasn't sure if you also need the rows without Lab data. If not then remove the section as per comment in the code.
Changes - new code version
1. change type of lab_value to character because there are "yes" strings in the source data (is this a DQ issue?)
2. add 'M' modifier to scan() function to read data correctly in case of missing data elements
3. add '1' to loop counter to read all the pipe delimited data elements
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.