- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have the following data in a txt format and I need to import it as a SAS dataset. I tried importing as tab delimited file but data is looking completely different in SAS.
The file is very big and even when I am trying to open it with any txt editor it is not showing me the correct format. Below is the link to download the file and detail of the variables.
https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/natality/Nat2023us.zip
https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/DVS/natality/UserGuide2023.pdf
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
First thing you attached two files. Import which? The PDF describes the text file and you would use that to write a data step to read the text file. Import will almost certainly get stuff wrong.
Starting on Page 8 of the PDF and you see the first column with "Position" as the header shows that this is fixed column file. That means specific variables occupy specify column(s). Import will never get that correct as it expects something in the file to tell where values start/end. Typically tab or some other character if the file is text. In this case there is no such delimiter between variables.
I would strongly suggest searching the source website as CDC quite often has a section with SAS code to read the files and often instructions for the various versions. The documentation starting at page 8 runs more than 30 pages of file description. I have used such to write code to read files. It is part of my paid job. I am not going to that much effort on this forum where I volunteer.
Here is how to apply the information starting at page 8 to write a data step to read the file after you unzip the file. Unzip the file before running following code and use the path to the unzipped file and what ever the name of the file may be on the Infile statement. The $ means you are reading character values.
data datasetname; infile "<path>\filename.txt" truncover lrecl=32000; input filler $ 1-8 dob_yy 9-12 dob_mm 13-14 filler $ 15-18 dob_tt 19-22 ; drop filler; run;
The "filler" actually are not needed in the code but I included for continuity to follow the text in the document. You would only have one variable by that name in the output and since it is "blank" dropping it makes sense.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If you can't find ready-made SAS data step code as suggested by @ballardw then you need to create this code by yourself based on the record description in the pdf.
Below some input which is hopefully helpful to you.
1. Using WSL (a Linux prompt for Windows) I've run below command to create the attached txt file which contains the first 2000 lines of your big file. This should allow you to inspect the source data in an editor.
head -n 2000 ./Nat2023PublicUS.c20240509.r20240724.txt > ./Nat2023PublicUS.c20240509.r20240724_part01.txt
I also created a part of the data step input statement with AI assistance (using the pdf record description as input). That's likely not fully correct but it should give you a start in case you can't find any ready made code.
/* Define the file path */
%let filepath = 'C:\temp\Nat2023us\Nat2023PublicUS.c20240509.r20240724_part01.txt';
/* Read the data */
data natality2023;
infile "&filepath" truncover;
input
FILLER1 $ 1-8
DOB_YY $ 9-12
DOB_MM $ 13-14
FILLER2 $ 15-18
DOB_TT $ 19-22
DOB_WK $ 23
FILLER3 $ 24-31
BFACIL $ 32
F_BFACIL $ 33
FILLER4 $ 34-49
BFACIL3 $ 50
FILLER5 $ 51-72
MAGE_IMPFLG $ 73
MAGE_REPFLG $ 74
MAGER 75-76
MAGER14 $ 77-78
MAGER9 $ 79
FILLER6 $ 80-83
MBSTATE_REC $ 84
FILLER7 $ 85-103
RESTATUS $ 104
MRACE31 $ 105-106
MRACE6 $ 107
MRACE15 $ 108-109
FILLER8 $ 110
MRACEIMP $ 111
MHISPX $ 112
FILLER9 $ 113-114
MHISP_R $ 115
F_MHISP $ 116
MRACEHISP $ 117
FILLER10 $ 118
MAR_P $ 119
DMAR $ 120
MAR_IMP $ 121
FILLER11 $ 122
F_MAR_P $ 123
MEDUC $ 124
FILLER12 $ 125
F_MEDUC $ 126
FILLER13 $ 127-141
**bleep**ERPT_FLG $ 142
FILLER14 $ 143-146
**bleep**ECOMB $ 147-148
**bleep**EREC11 $ 149-150
FRACE31 $ 151-152
FRACE6 $ 153
FRACE15 $ 154-155
FILLER15 $ 156-158
FHISPX $ 159
FHISP_R $ 160
F_FHISP $ 161
FRACEHISP $ 162
FEDUC $ 163
FILLER16 $ 164
f_FEDUC $ 165
FILLER17 $ 166-170
PRIORLIVE $ 171-172
PRIORDEAD $ 173-174
PRIORTERM $ 175-176
FILLER18 $ 177-178
LBO_REC $ 179
FILLER19 $ 180-181
TBO_REC $ 182
FILLER20 $ 183-197
ILLB_R $ 198-200
ILLB_R11 $ 201-202
FILLER21 $ 203-205
ILOP_R $ 206-208
ILOP_R11 $ 209-210
FILLER22 $ 211-213
ILP_R $ 214-216
ILP_R11 $ 217-218
FILLER23 $ 219-223
PRECARE $ 224-225
F_MPCB $ 226
PRECARE5 $ 227
FILLER24 $ 228-237
PREVIS $ 238-239
FILLER25 $ 240-241
PREVIS_REC $ 242-243
F_TPCV $ 244
FILLER26 $ 245-250
WIC $ 251
F_WIC $ 252
CIG_0 $ 253-254
CIG_1 $ 255-256
CIG_2 $ 257-258
CIG_3 $ 259-260
CIG0_R $ 261
CIG1_R $ 262
CIG2_R $ 263
CIG3_R $ 264
F_CIGS_0 $ 265
F_CIGS_1 $ 266
F_CIGS_2 $ 267
F_CIGS_3 $ 268
CIG_REC $ 269
F_TOBACO $ 270
FILLER27 $ 271-279
M_Ht_In $ 280-281
F_M_HT $ 282
BMI $ 283-286
BMI_R $ 287
FILLER28 $ 288-291
PWgt_R $ 292-294
F_PWGT $ 295
FILLER29 $ 296-298
DWgt_R $ 299-301
FILLER30 $ 302
F_DWGT $ 303
WTGAIN $ 304-305
WTGAIN_REC $ 306
F_WTGAIN $ 307
FILLER31 $ 308-312
RF_PDIAB $ 313
RF_GDIAB $ 314
RF_PHYPE $ 315
RF_GHYPE $ 316
RF_EHYPE $ 317
RF_PPTERM $ 318
F_RF_PDIAB $ 319
F_RF_GDIAB $ 320
F_RF_PHYPER $ 321
F_RF_GHYPER $ 322
F_RF_ECLAMP $ 323
F_RF_PPB $ 324
RF_INFTR $ 325
RF_FEDRG $ 326
RF_ARTEC $ 327
f_RF_INFT $ 328
F_RF_INF_DRG $ 329
F_RF_INF_ART $ 330
RF_CESAR $ 331
RF_CESARN $ 332-333
FILLER32 $ 334
F_RF_CESAR $ 335
F_RF_NCESAR $ 336
NO_RISKS $ 337
FILLER33 $ 338-342
IP_GON $ 343
IP_SYPH $ 344
IP_CHLAM $ 345
IP_HEPB $ 346
IP_HEPC $ 347
F_IP_GONOR $ 348
F_IP_SYPH $ 349
F_IP_CHLAM $ 350
F_IP_HEPATB $ 351
F_IP_HEPATC $ 352
NO_INFEC $ 353
FILLER34 $ 354-358
FILLER35 $ 359
OB_ECVS $ 360
OB_ECVF $ 361
FILLER36 $ 362
F_OB_SUCC $ 363
F_OB_FAIL $ 364
FILLER37 $ 365-382
LD_INDL $ 383
LD_AUGM $ 384
LD_STER $ 385
LD_ANTB $ 386
LD_CHOR $ 387
LD_ANES $ 388
F_LD_INDL $ 389
F_LD_AUGM $ 390
F_LD_STER $ 391
F_LD_ANTB $ 392
F_LD_CHOR $ 393
F_LD_ANES $ 394
NO_LBRDLV $ 395
FILLER38 $ 396-400
ME_PRES $ 401
ME_ROUT $ 402
ME_TRIAL $ 403
F_ME_PRES $ 404
F_ME_ROUT $ 405
F_ME_TRIAL $ 406
RDMETH_REC $ 407
DMETH_REC $ 408
F_DMETH_REC $ 409
FILLER39 $ 410-414
MM_MTR $ 415
MM_PLAC $ 416
MM_RUPT $ 417
MM_UHYST $ 418
MM_AICU $ 419
FILLER40 $ 420
F_MM_MTR $ 421
F_MM_PLAC $ 422
F_MM_RUPT $ 423
F_MM_UHYST $ 424
F_MM_AICU $ 425
FILLER41 $ 426
NO_MMORB $ 427
FILLER42 $ 428-432
ATTEND $ 433
MTRAN $ 434
PAY $ 435
PAY_REC $ 436
F_PAY $ 437
F_PAY_REC $ 438
FILLER43 $ 439-443
APGAR5 $ 444-445
APGAR5R $ 446
F_APGAR5 $ 447
APGAR10 $ 448-449
;
run;
And of course because the source data is positional it's easy to only read the columns you actually need into SAS which hopefully will significantly reduce the effort.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
First thing is you cannot "import" that type of file. The data is arranged in fixed positions on each line. You have to actually write a data step to actual READ in the data from the file instead.
Please show the data step you wrote to read it and explain in what ways it did not work.
Second thing is that is a ZIP file and the ZIP engine in SAS 9.4m7 seems to have trouble with expanding it.
1 data _null_; 2 infile "C:\Downloads\Nat2023us.zip" zip member='*' 3 termstr=crlf obs=2; 4 input; 5 list; 6 run; ERROR: Open failure for C:\Downloads\Nat2023us.zip during attempt to create a local file handle. NOTE: The SAS System stopped processing this step because of errors. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds
So you will probably need to first unzip the file before SAS can read it.
55 data _null_; 56 infile "%sysfunc(pathname(work))/Nat2023PublicUS.c20240509.r20240724.txt" 57 termstr=crlf obs=2; 58 input; 59 list; 60 run; NOTE: The infile "/Volumes/scratch/SAS_workxxx/Nat2023PublicUS.c20240509.r20240724.txt" is: Filename=/Volumes/scratch/SAS_workxxx/Nat2023PublicUS.c20240509.r20240724.txt, Owner Name=xxx,Group Name=xxx, Access Permission=-rw-rw-r--, Last Modified=24Jul2024:12:18:02, File Size (bytes)=4856044107 RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0 1 202301 01171 11 1 37116 1 101 103303 0 013 Y2 17 1 330501101 00115 1 010001 2 3 028 201 04 20208 02804 0311 12 071 N10000000000001111N1 64124.02 1401 15 301 6 11621 NNNNNN111111NXX111N00 111 NNNNN111111 NN 11 NNNNNN1111111 401 11X111111 NNNNN 11111 1 1N1111 0941885 1 9 F 04 2022 39072 39 501 0723625 083 NNNNNN 111111 1 NNNNNN111111NNNNNN1111111 NYY1 601 701 801 901 1001 1101 1201 1301 1345 2 202301 08594 11 1 29094 2 101 105514 0 015 X1 15 1 310505514 00155 1 010001 2 3 077 201 08 01302 01302 0311 07 051 Y10000000000001111N1 57132.74 1511 17 301 6 12531 NNNNNN111111NXX111N00 111 NNNNN111111 NN 11 NNNNNY1111110 401 11X111111 NNNNN 11111 1 1N5411 0941885 1 9 M 04 2022 39072 39 501 0723284 073 NNNNNN 111111 1 NNNNNN111111NNNNNN1111111 NYY1 RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0 601 701 801 901 1001 1101 1201 1301 1345 NOTE: 2 records were read from the infile "/Volumes/scratch/SAS_work6D6F00009B54_amraelp00008222.pfizer.com/Nat2023PublicUS.c20240509.r20240724.txt". The minimum record length was 1345. The maximum record length was 1345. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you everyone for your help. I was looking for another way other than data step to import this file since there are around 150 variables but I don't think there is any other way. So I will go forward and use the data step.
Thanks again