On page 5 of the instructions they tell you: "All files with an .sf1 extension must be changed to .txt files. Right-click on the first file with a .sf1 extension, choose 'Rename', change the .sf1 portion of the name to .txt, and hit Enter. Repeat for each file with a .sf1 extension."
This is incredibly slow. How about opening a CMD window instead and typing:
ren *.sf1 *.txt
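The ren trick is CMD-only; if you happen to be working on Linux or macOS instead, a short shell loop does the same batch rename. A sketch (run it from the directory holding the files):

```shell
# Batch-rename every .sf1 file to .txt by stripping the
# old extension and appending the new one.
for f in *.sf1; do
    mv -- "$f" "${f%.sf1}.txt"
done
```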
Next, on page 7, they tell you to use WordPad to find and replace text in several huge text files. It turns out this method is incredibly slow too. How about doing it with the Linux sed command instead:
cat tx000062010.txt | \
sed -e 's/SF1ST,TX,000,06,//' > tx000062010mod.txt
This command finds the pattern between the first two forward slashes and replaces it with the pattern between the second and third slashes (in this case, nothing). It took about 4 seconds to process a 565 MB file on a quad-core AMD machine with 8 GB of memory; it was going to take hours using WordPad's find-and-replace tool. It also turns out you don't need to cat the file and pipe it to sed, since sed can read the file directly:

sed -e 's/SF1ST,TX,000,06,//' tx000062010.txt > tx000062010mod.txt

A good friend with far more experience using Unix tools and programming than I have also says that awk is easier to use, and I have to agree. Count the number of fields in a comma-delimited text file with awk:
gawk -F"," '{ print NF ":" $0}' textfile.csv
sample output: 260:SF1ST,TX,000,45,0000438,0,0,0,0,0,0,0,0,1,0,1,0,1.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1.00,1.00,0.00,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1

Or, to print just the field count:
awk -F"," '{print NF}' tx000452010.txt
sample output: 260 260 260 ....
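As a sanity check, you can collapse those per-line counts down to their distinct values; if every record is well formed, only one row comes back. A sketch against the same file:

```shell
# Print each line's field count, then tally the distinct
# values; a uniform file yields a single output row.
awk -F',' '{ print NF }' tx000452010.txt | sort -n | uniq -c
```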
This text file has 260 fields per line, and I want to extract the first 239 of them.
cut -d ',' -f1-239 tx000452010.txt > tx000452010part1.txt
Excluding a field range and writing to a new file:
gawk -F"," -v f=6 -v t=239 '{ for (i = 1; i <= 260; i++) if (i >= f && i <= t) continue; else printf("%s%s", $i, (i != 260) ? OFS : ORS) }' tx000452010.txt > tx000452010part2.txt
Replace the default output separator (a space) with a comma:
awk '{gsub(/ /,",");print}' tx000452010part2.txt > tx000452010part_2.txt
This could be done in a single command if I knew how.
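It turns out it can: the printf above already joins fields with OFS, so setting the output field separator to a comma up front skips the gsub pass entirely. A sketch combining the two steps, using the same field range and filenames as above:

```shell
# Keep fields 1-5 and 240-260, emitting commas directly
# by overriding the default output field separator.
awk -F',' -v OFS=',' -v f=6 -v t=239 '{
    for (i = 1; i <= NF; i++)
        if (i < f || i > t)
            printf("%s%s", $i, (i != NF) ? OFS : ORS)
}' tx000452010.txt > tx000452010part_2.txt
```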
Import the text files into a PostgreSQL database: MS Access has a 2 GB file limit, and with all of this data in one database you're easily looking at 10+ GB.
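One way to do the import is psql's client-side \copy, which streams a local file into an existing table. A sketch, assuming a database named census2010 (made up here) and a table whose columns already match the file's field order:

```shell
# Bulk-load the CSV into an existing table; \copy reads the
# file on the client side, so no server file access is needed.
# "census2010" is a placeholder database name.
psql -d census2010 -c "\copy \"SF1_00002\" FROM 'tx000452010part1.txt' WITH (FORMAT csv)"
```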
Add the new field for building the geoid:
ALTER TABLE "SF1_Access2003_mdb"."SF1_00001" ADD COLUMN geoid text;
Concatenate the fields to build the geoid for the block summary level. Hint: if you take the leftmost 12 characters of this result, you get the geoid for the block-group level, and so on.
UPDATE "SF1_00002" SET geoid = "Txgeo2010"."STATE" || "Txgeo2010"."COUNTY" || "Txgeo2010"."TRACT" || "Txgeo2010"."BLOCK" FROM "Txgeo2010" WHERE ("SF1_00002"."LOGRECNO" = "Txgeo2010"."LOGRECNO");
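To make the hint concrete: the 15-character block geoid is state (2) + county (3) + tract (6) + block (4), and its first 12 characters are exactly the block-group geoid. A quick shell illustration with a made-up Texas block geoid:

```shell
# Hypothetical block geoid: state 48, county 113,
# tract 000100, block 1234.
geoid="481130001001234"

# The first 12 characters give the block-group geoid.
echo "$geoid" | cut -c1-12
```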
...to be continued