Solved: Re: Reading in 5GB birth dataset from https://www.cdc.gov/nchs/data_ac...

Flexluthorella · Posted 02-17-2020 03:26 PM

I've downloaded the 2018 Birth data files (US data files only), which are supposedly 223 MB. When the download completed on my PC, it was over 5 GB. Notepad can't read it, so I can't view the dataset/variables. I attempted to PROC IMPORT it into SAS, but that is not working.

SuzanneDorinski · Posted 02-18-2020 12:27 AM

The page where you get the birth data files points users to a page on the National Bureau of Economic Research (NBER) website. If you go to that page on the NBER website and scroll down, you'll see a table for the United States birth data and documentation. Jean Roth at NBER has posted SAS, Stata, and SPSS code to read the ASCII file. She has also posted the file as a Stata file, a SAS data set, and CSV.

The bad news is that she has not done that for the 2018 file. However, I think the record layout for the 2017 file is the same as the record layout for the 2018 file. So, http://data.nber.org/natality/2017/natl2017.sas should help you get started reading the ASCII file into SAS.

I was able to modify the 2017 program to read the zipped version of the 2018 file. One odd note: while the PDF for the 2018 file shows the record length as 1330, the record length is really 1345. The NBER program for 2017 shows 15 variables in columns 1330 to 1345, but those columns are all missing in the 2018 file.
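
A minimal sketch of that kind of modification, not the actual modified program. The zip path is hypothetical, and the two variables and their column positions are placeholders; the real layout has to come from the NBER program or the User's Guide. The only details taken from this thread are the ZIP/MEMBER= form of the INFILE statement and the record length of 1345.

```sas
data work.natl2018;
  /* Hypothetical path: point this at wherever the downloaded zip was saved. */
  /* LRECL=1345 follows the record length reported above, so long records   */
  /* are not cut off at the default length.                                 */
  infile "C:\temp\Nat2018us.zip" zip member='*' lrecl=1345 truncover;
  /* Placeholder variables and columns: replace with the documented layout. */
  input placeholder_num 1-10
        placeholder_chr $ 11-12;
run;
```

If the log reports fewer columns per record than expected, the LRECL= value is the first thing to check.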

mkeintz · Posted 02-17-2020 03:49 PM

@Flexluthorella wrote:

I've downloaded the 2018 Birth data files (US data files only), which are supposedly 223 MB. When the download completed on my PC, it was over 5 GB. Notepad can't read it, so I can't view the dataset/variables. I attempted to PROC IMPORT it into SAS, but that is not working.

When you say "not working" we have virtually no information to provide advice. Now NOTEPAD finds the downloaded file too big. How about WORDPAD (use it to view, but not save), or you could download many other editors, like Notepad++. These both likely have larger size limitations.

And if you downloaded something sized 223 MB and got a 5 GB file, it was more than a simple download. Try using a more capable editor to view the download. BTW, what is the URL of the downloaded file? Maybe someone on this forum can take a quick look.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Flexluthorella · Posted 02-17-2020 03:56 PM

"Not working" meaning I can't open the file in Notepad, Notepad++, or WordPad, as the error message states the file is "too big" or "failed to open". I am not sure what you mean by a more capable editor. The URL is https://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm#Tools and it's the 2018 Birth data file.

mkeintz · Posted 02-17-2020 04:13 PM

By more capable editor, I meant more capable than notepad, thinking either wordpad or notepad++ would do the job. But I see you have tried that.

BUT I also see you have unzipped the downloaded file, so you can do this to make a sample file to visually inspect:

filename filein "C:\Users\…..\Downloads\Nat2018us\Nat2018PublicUS.c20190509.r20190717.txt";

data _null_;
  infile filein;
  file 'c:\temp\sampledata.txt';
  input;
  put _infile_;
  if _n_>=10 then stop;
run;

Then take a look at sampledata.txt.
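
As a complement to opening the sample in an editor, the records can also be written to the SAS log under a column ruler, which makes it easier to match positions against a record layout. This is a small sketch that assumes the sample file created above exists at the same path.

```sas
data _null_;
  /* Same sample file as above; OBS=5 stops after five records. */
  infile 'c:\temp\sampledata.txt' obs=5;
  input;   /* load the record into the input buffer without creating variables */
  list;    /* write the record to the log under a column ruler */
run;
```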

Flexluthorella · Posted 02-17-2020 04:23 PM

This gives me the first 10 lines of data. The sampledata.txt did not give variable names. I can't tell what I need to do from here. I can see a small fraction of the data.

mkeintz · Posted 02-17-2020 04:36 PM

If you go back to the URL you provided, you will see a column to the left of your downloaded data. The column is titled "User's Guide (.pdf files)". Clicking on the "2018 (1.7MB)" link in this self-descriptive column will provide a guide to the layout of the data in a PDF file.

This is a common practice with lots of demographic data files - one file with just data, and another file/codebook/user guide with the data layout description. Welcome to the demographic data world.

Flexluthorella · Posted 02-17-2020 04:57 PM

Right. Going back to my original issue, how do I get to read in ALL the data from the OG (large) file?

Tom · Posted 02-17-2020 08:45 PM

@Flexluthorella wrote:
Right. Going back to my original issue, how do I get to read in ALL the data from the OG (large) file?

First, leave the file zipped. There is no need to unzip it, as SAS can unzip it on the fly.

Second, look at the description of the file and use that to write the code to read it.

So you will have something like this using column oriented reads.

data want;
  infile 'where I put the file.zip' zip truncover member='*' ;
  input var 1-10  var2 $11-12 .... ;
run;

Or perhaps you will want to use formatted mode instead.

data want;
  infile 'where I put the file.zip' zip truncover member='*' ;
  /* formatted input: each informat reads the next N columns from the current position */
  input var 10.  var2 $2. .... ;
run;

Or some mixture of the two.

Remember to look at the data description to understand which data is in which columns, and whether each field holds numbers or strings. Some variables that are coded only as digits you might want to read as strings, since they are really categorical values and not numbers you could use in operations like MEAN().
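
To illustrate the numeric-versus-string point, here is a self-contained sketch. The two records and the layout are invented for the example and have nothing to do with the real natality file.

```sas
data work.demo;
  /* Inline records stand in for the raw file; the layout is invented. */
  input birth_weight 1-4      /* a true quantity: read as numeric */
        state_code $ 5-6;     /* digits used as a category: read as character */
  datalines;
340006
295048
;
run;

proc means data=work.demo mean;
  var birth_weight;   /* only the numeric variable is eligible here */
run;

proc freq data=work.demo;
  tables state_code;  /* categorical codes are tabulated, not averaged */
run;
```

Reading the code as character also keeps leading zeros, so a value such as 06 does not become 6.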

mkeintz · Posted 02-17-2020 09:00 PM

@Flexluthorella wrote:
Right. Going back to my original issue, how do I get to read in ALL the data from the OG (large) file?

Use the layout in the pdf file to set up the necessary INPUT statement to read the data into a SAS data set. You don't need to see the entire raw data set in any editor to do that. And you could first do a test of your program using the 10-record (or some other small) subset of the original raw data.
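
A rough sketch of such a test run, assuming the 10-record sample written earlier in the thread still exists at c:\temp\sampledata.txt. The variable names and columns are placeholders to be replaced with the ones from the User's Guide.

```sas
data work.test10;
  /* Path matches the sample file written earlier in the thread. */
  infile 'c:\temp\sampledata.txt' truncover;
  /* Placeholder layout: substitute names and columns from the User's Guide. */
  input var1 1-10
        var2 $ 11-12;
run;

/* Inspect the result before pointing the same INPUT statement at the full file. */
proc print data=work.test10;
run;
```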

The full reference for the INPUT statement, including examples, is at Input Statement. There's another possibly useful SAS link at Reading Raw Data with the SAS Input Statement.

If you haven't done the INPUT statement before, this will be a (worthwhile) experience.

Good luck, and bring back your questions once you start trying to use it.

PGStats · Posted 02-17-2020 03:50 PM

Have you tried the tools provided on this NCHS site?

PG

Flexluthorella · Posted 02-17-2020 03:57 PM

I do not know how to use the tools they provide. I did not think I could just start downloading tools to use with no idea how to use them.

Patrick · Posted 02-17-2020 08:31 PM

The .pdf User Guide provides the data dictionary/data layout. Why isn't that sufficient for you to write the SAS data step to read the data in the .txt file into a SAS data set?

There are text editors available that can open .txt files of multiple GB. Just Google for them.

I've used UltraEdit (which doesn't come for free) to open the text file.

To get you started I've copied the first 30 lines into the attached sample_2018.txt file.

Flexluthorella · Posted 02-18-2020 02:06 PM

thank you so much!

Flexluthorella · Posted 02-18-2020 05:34 PM

I am still having issues; I get 0 records read in. Can you share with me?
