I'm having problems parsing SEC EDGAR files.
The end result I want is the stuff between <xml> and </xml>, in a format I can access.
Here is my code so far, which doesn't work:
    scud = open("http://sec.gov/archives/edgar/data/1475481/0001475481-09-000001.txt")
    full = scud.read
    full.match(/<xml>(.*)<\/xml>/)
OK, there are a couple of things wrong:
- sec.gov/archives/edgar/data/1475481/0001475481-09-000001.txt is not XML, so Nokogiri will be of no use to you unless you strip off all the garbage from the top of the file, down to where the true XML starts, then trim off the trailing tags to keep the XML correct. So, you need to attack that problem first.
- You don't say what you want from the file. Without that information we can't recommend a real solution. You need to take more time to define the question better.
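A side note on the regex in the question, as a small sketch rather than a full diagnosis: in Ruby, "." does not match a newline unless the pattern carries the /m flag, so a pattern like that one finds nothing when the opening and closing tags sit on different lines. The three-line string below is made up for illustration and is not taken from the EDGAR file.

```ruby
# Made-up stand-in for the downloaded text (not real EDGAR content).
text = "<xml>\n<a>1</a>\n</xml>"

# Without /m, "." stops at newlines, so the tags are never bridged.
without_m = text.match(/<xml>(.*)<\/xml>/)

# With /m, "." also matches newlines and the block is captured.
with_m = text.match(/<xml>(.*)<\/xml>/m)

p without_m   # => nil
p with_m[1]   # => "\n<a>1</a>\n"
```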
Here's a quick piece of code to retrieve the page, strip the garbage, and parse the resulting content as XML:
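    require 'nokogiri'
    require 'open-uri'

    # Fetch the file, drop everything up to and including the <xml> line,
    # then drop the closing </xml> tag and everything after it.
    # URI.open comes from open-uri; a bare open() no longer accepts URLs in Ruby 3.
    doc = Nokogiri::XML(
      URI.open('http://sec.gov/archives/edgar/data/1475481/0001475481-09-000001.txt').read.gsub(/\A.+<xml>\n/im, '').gsub(/<\/xml>.+/mi, '')
    )

    puts doc.at('//schemaVersion').text
    # >> X0603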
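If you want to try the stripping step without hitting the network, something like the following might do as a rough sketch. It swaps the two gsub calls for a single capture, which returns nil when no <xml> block is present instead of handing the whole untouched file to the parser. The extract_xml helper and the SAMPLE text are both hypothetical, made up for this example; the real file has a much longer header.

```ruby
# Hypothetical sample standing in for the downloaded file; the header and
# trailer lines are placeholders, not real EDGAR content.
SAMPLE = <<~TEXT
  header line one
  header line two
  <XML>
  <?xml version="1.0"?>
  <root>
    <schemaVersion>X0603</schemaVersion>
  </root>
  </XML>
  trailer line
TEXT

# Hypothetical helper: return the text between the first <xml> tag and the
# next </xml> tag, or nil if there is no such block.
# /m lets "." match newlines, /i ignores tag case, ".*?" is non-greedy.
def extract_xml(text)
  match = text.match(%r{<xml>\s*(.*?)\s*</xml>}mi)
  match && match[1]
end

xml = extract_xml(SAMPLE)
puts xml
```

The returned string can then be passed to Nokogiri::XML in place of the gsub chain, and a nil result tells you the markers were missing before any parsing is attempted.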