Solved: Re: Read in html data from Census website

Cruise · Posted 01-02-2018 02:44 PM

How to read in html data like below from Census website? I appreciate if any demo code is available.

https://www2.census.gov/programs-surveys/demo/tables/geographic-mobility/2015/county-to-county-migra...

Reeza · Posted 01-03-2018 03:18 PM

Documentation appears to be here, page 24/25 seem to be what you need.

https://www.census.gov/topics/population/migration/guidance/county-to-county-migration-flows.html

If you scroll down the page, there's a section for documentation. It doesn't look like the file you're referencing is documented but the previous years layout seems correct.

You should have everything you need to now get this working.

And a key portion noted in the document:

As of 2016, all ACS migration flows data for counties, MCDs, and MSAs were made available in the Census application programming interface (API) through the Census Developers site.

I would probably look into that, which is likely to give you back a semi-structured file of some sort.

View solution in original post

Reeza · Posted 01-02-2018 02:57 PM

That's a text file so you can import it the same way you would any text file.

You can point to the text file directly using filename URL.

filename myfile URL 'link to file';

http://documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.3&docsetId=lestmtsglobal&docsetTarget...

Cruise · Posted 01-02-2018 04:18 PM

Thank you Reeza.

What am I doing wrong here? Resulting myfile data was crushed into a column as shown in the image below. Do you also see variable names in the url? I don't know what to do without variables' names specified. Any census data experience here?

proc printto log= 'H:\log1.txt'; run;
filename myfile URL 'https://www2.census.gov/programs-surveys/demo/tables/geographic-mobility/2015/county-to-county-migration-2011-2015/county-to-county-migration-flows/MCDxMCD_ager_US.txt';
data myfile;
   infile myfile length=len dlm=' ' dsd obs=5;
   input record $varying500. len;
   if _n_>=0 then do; 
      put record $varying500. len;
   end;
run;

Reeza · Posted 01-02-2018 05:52 PM

This is a text file so it's coming in as a long string. You need to determine how to read the file. Likely you need to look at the file structure and determine the layout, delimiter (I'm guessing tab?) and how to read it in. I would probably recommend looking at this as a local file first. Download, save it locally, then determine how you would import the file. Then once you've figured that out, simply replace the file path with the URL one and it will work as expected.

Off topic:

If you have R there's a Census package directly, https://cran.r-project.org/web/packages/tidycensus/tidycensus.pdf

mkeintz · Posted 01-02-2018 06:05 PM

It's "crushed into a column" because you told SAS to read the entire record into a single field named RECORD. If you want multiple variables, they have to be specified on the input statement.

If you have a layout of the url page, identifying data fields corresponding columns, then you can do an INPUT statement listing all the incoming variables with corresponding formats. By examining the first 183 lines of data, it appears that this file, like most census data files I'm familiar with, uses a fixed-format layout of all the fields on each record, which means you don't need a "LEN=" or "DLM=" parameter.

If your task is simply to transcribe the url file to a local raw file, you can use something like this, which takes advantage of the automatic variable _INFILE_:

filename myfile URL 'https://www2.census.gov/programs-surveys/demo/tables/geographic-mobility/2015/county-to-county-migra...';

filename mytxt 'c:\temp\t.txt';

data myfile;

infile myfile ;

input;

file mytxt;

put _infile_;

run;

Then you can develop an input statement to read in all the fields from a local raw file.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Cruise · Posted 01-03-2018 02:28 PM

@mkeintz

Thanks much for demo code. I have following error in the log where code creating table with no variables.

91   filename myfile URL
91 ! 'https://www2.census.gov/programs-surveys/demo/tables/geographic-mobility/2015/county-to-count
91 ! y-migration-2011-2015/county-to-county-migration-flows/MCDxMCD_ager_US.txt';
92   filename mytxt 'D:\...\temp.txt';
93   data myfile;
94     infile myfile ;
95     input;
96     file mytxt;
97     put _infile_;
98   run;

NOTE: The infile MYFILE is:

      Filename=https://www2.census.gov/programs-surveys/demo/tables/geographic-mobility/2015/county
      -to-county-migration-2011-2015/county-to-county-migration-flows/MCDxMCD_ager_US.txt,
      Local Host Name=2ua7181m5x,
      Local Host IP addr=172.18.120.54,
      Service Hostname Name=a104-80-30-196.deploy.static.akamaitechnologies.com,
      Service IP addr=104.80.30.196,Service Name=N/A,
      Service Portno=443,Lrecl=32767,Recfm=Variable

NOTE: The file MYTXT is:
      Filename=D:\...\temp.txt,
      RECFM=V,LRECL=32767,File Size (bytes)=0,
      Last Modified=03Jan2018:14:16:18,
      Create Time=03Jan2018:14:15:10

NOTE: 407867 records were read from the infile MYFILE.
      The minimum record length was 490.
      The maximum record length was 490.
NOTE: 407867 records were written to the file MYTXT.
      The minimum record length was 490.
      The maximum record length was 490.
NOTE: The data set WORK.MYFILE has 407867 observations and 0 variables.
NOTE: DATA statement used (Total process time):
      real time           8:57.08
      cpu time            13.71 seconds

Reeza · Posted 01-03-2018 02:38 PM

@Cruise The code from @mkeintz will download the data and create file locally. You still have to go through the steps I mentioned, basically, find the record layout file and write a data step to read the data.

Whether the file is stored on the web or locally will only affect the speed, not the code.

ballardw · Posted 01-02-2018 06:40 PM

That file looks to be fixed columns. Count the columns for the variables of interest and specify on the input statement:

Most of the census files also have a description document floating around that would tell you the contents of the file and which columns the values are in.

That document usually makes it very simple to write input statements as needed.

Reeza · Posted 01-02-2018 07:05 PM

You need the record layout because that first 'variable, 001001 is actually two variables.

Not sure where you get that, but worst case scenario email Census Bureau.

Cruise · Posted 01-03-2018 02:37 PM

Reeza, how about open data in notepad++ and read from there? Would that work? I'm asking because I go through administrative rights granting procedure to install external program such as notepad++.

Reeza · Posted 01-03-2018 02:41 PM

@Cruise wrote:
Reeza, how about open data in notepad++ and read from there? Would that work? I'm asking because I go through administrative rights granting procedure to install external program such as notepad++.

Nope, and you don't need admin privileges to install NotePad++, point it towards a location that you have write access to such as C:\_localdata\....

You need a record layout unless you understand exactly where each variable starts and stops. For example my initial assumption was that 001001 was a single variable when it's clearly two variables.

You could probably attempt to parse it out, but how do you have any idea what the numerical columns represent? The best way to guarantee the data is to get a record layout.

Reeza · Posted 01-03-2018 03:18 PM

Documentation appears to be here, page 24/25 seem to be what you need.

https://www.census.gov/topics/population/migration/guidance/county-to-county-migration-flows.html

If you scroll down the page, there's a section for documentation. It doesn't look like the file you're referencing is documented but the previous years layout seems correct.

You should have everything you need to now get this working.

And a key portion noted in the document:

As of 2016, all ACS migration flows data for counties, MCDs, and MSAs were made available in the Census application programming interface (API) through the Census Developers site.

I would probably look into that, which is likely to give you back a semi-structured file of some sort.

Cruise · Posted 01-05-2018 05:18 PM

@Reeza

Hi Reeza,

Thanks for pointing me out to the Census documentation where I found correct data layout in 2011-2015 Documentation, page 19. Now I have dimensions of fixed columns such as below. However, I have a log saying that "variables named ... contains more than 32 characters" yet I'd like to keep variable names as long as they're informative especially when data is shared among multiple people.

Any suggestions to solve this problem please?

proc printto log= 'D:\log1.txt'; run;
filename myfile URL 'https://www2.census.gov/programs-surveys/demo/tables/geographic-mobility/2015/county-to-county-migration-2011-2015/county-to-county-migration-flows/MCDxMCD_ager_US.txt';
data myfile; 
   infile myfile length=len dlm=' ' dsd obs=5 truncover;
   input record $varying500. len;
   if _n_>=0 then do; 
      put record $varying500. len;
   end;
run;

proc printto; run;

data temp;
infile myfile lrecl=29000 obs=100 truncover;
input var_cur_res_State_Code		1-3
var_cur_res_cty_Code		4-6
var_cur_res_MCD_Code		7-11
var_res_1_Year_Ago_State_US_Island_Foreign_Region_Code		12-14
res_1_Year_Ago_cty_Code		15-17
res_1_Year_Ago_MCD_Code		18-22
Age		24-25
cur_res_State_Name	$	26-55
cur_res_cty_Name	$	56-90
cur_res_MCD_Name	$	91-135
Population_cur_cty_MCD_Est		137-144
Population_cur_cty_MCD_MOE		146-153
Nonmovers_cur_cty_MCD_Est		155-161
Nonmovers_cur_cty_MCD_MOE		163-169
Movers_within_United_States_cur_Municipio_Est_		171-177
Movers_within_United_States_cur_Municipio_MOE		179-185
Movers_within_Same_cur_cty_MCD_Est		187-193
Movers_within_Same_cur_cty_MCD_MOE		195-201
"Movers_from_a_diff_cty_MCD_in_the_Same_State_for_cur
cty_MCD__Est"		203-209
"Movers_from_a_diff_cty_MCD_in_the_Same_State_for_cur
cty_MCD_MOE"		211-217
Movers_from_a_diff_State_for_cur_cty_MCD_Est		219-225
Movers_from_a_diff_State_for_cur_cty_MCD_MOE		227-233
Movers_from_Abroad_Est		235-241
Movers_from_Abroad_MOE		243-249
res_1_Year_Ago_State_Name_US_Island_Foreign_Region	$	251-280
res_1_Year_Ago_cty_Name	$	281-315
res_1_Year_Ago_MCD_Name	$	316-360
Population_Lived_in_cty_MCD_1_Year_Ago__Est		362-369
Population_Lived_in_cty_MCD_1_Year_Ago_MOE		371-378
Nonmovers_cty_MCD_of_res_1_Year_Ago_Est		380-386
Nonmovers_cty_MCD_of_res_1_Year_Ago_MOE		388-394
Movers_within_the_US_for_cty_MCD_of_res_1_Year_Est		396-402
Movers_within_the_US_for_cty_MCD_of_res_1_Year_MOE		404-410
"Movers_within_the_Same_cty_MCD_for_cty_MCD_of_res_1_Year
Ago_Est"		412-418
"Movers_within_the_Same_cty_MCD_for_cty_MCD_of_res_1_Year
Ago_MOE"		420-426
Movers_to_a_diff_cty_MCD_in_the_Same_State_for_cty_MCD_of res_1_Year_Ago__Est		428-434
"Movers_to_a_diff_cty_MCD_in_the_Same_State_for_cty_MCD_of
res_1_Year_Ago_MOE"		436-442
Movers_to_a_diff_State_for_cty_MCD_of_res_1_Year_Ago_Est		444-450
Movers_to_a_diff_State_for_cty_MCD_of_res_1_Year_Ago_MOE		452-458
Movers_to_PR_Est		460-466
Movers_to_PR_MOE		468-474
Movers_in_cty_MCDtocty_MCD_Flow_Est		476-482
Movers_in_cty_MCDtocty_MCD_Flow_MOE		484-490
;
run;

Reeza · Posted 01-05-2018 05:32 PM

You can't. SAS has a limit of variable name that's 32 characters. I would apply the correct label, but shorten the names. Then you still have the 'useful' information in the labels.

Tom · Posted 01-05-2018 05:54 PM

The current SAS limit is 32 characters, but that is really way longer than you need. We lived with 8 character names for a long time.

You can use the variable's label to store a longer descriptions of the variable's content.

Short names will be much easier to work with. You can fit more variable names on a line of code or on the screen in a tabular data display, etc. For example if you had names that were longer than 32 characters you could not have a statement using more than two variables without generating a line of code that needs to be wrapped into multiple lines because it is longer than the 65-75 characters that humans can easily read. Ever think about why newspapers and magazines separate the page into multiple columns?

Registration is open

SAS Training: Just a Click Away