Parsing SEC Edgar XML file using Ruby into Nokogiri


I'm having problems parsing SEC EDGAR files.

Here is an example of such a file.

The end result I want is the stuff between <xml> and </xml>, in a format I can access.

Here is my code so far, which doesn't work:

    scud = open("http://sec.gov/archives/edgar/data/1475481/0001475481-09-000001.txt")
    full = scud.read
    full.match(/<xml>(.*)<\/xml>/)

OK, there are a couple of things wrong:

  1. sec.gov/archives/edgar/data/1475481/0001475481-09-000001.txt is not XML, so Nokogiri will be of no use unless you strip off the garbage at the top of the file, down to where the true XML starts, and then trim off the trailing tags to keep the XML correct. So you need to attack that problem first.
  2. You don't say what you want from the file. Without that information I can't recommend a real solution. You need to take more time to define the question better.
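As a side note on why the match in the question returns nothing: the sketch below assumes the wrapper tags and the XML between them sit on separate lines, and that the file writes the tags in upper case. The `sample` string is a made-up stand-in, not real EDGAR content. In Ruby, `.` does not match a newline unless the regex has the `/m` flag, so a payload that spans lines never matches.

```ruby
# Made-up stand-in for the downloaded text; not real EDGAR data.
sample = <<~FILING
  <SEC-HEADER>
  made-up header text
  </SEC-HEADER>
  <XML>
  <root><value>42</value></root>
  </XML>
  </TEXT>
FILING

# Without /m, "." stops at each newline, so the multi-line payload is never matched.
without_m = sample.match(/<xml>(.*)<\/xml>/i)

# With /m, "." also matches newlines; /i ignores the case of the wrapper tags.
with_m = sample.match(/<xml>(.*)<\/xml>/im)

puts without_m.inspect
puts with_m[1].strip
```

If the real file differs from this sample (for example, several wrapped blocks in one file), the greedy `.*` would need to be revisited.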

Here's a quick piece of code to retrieve the page, strip the garbage, and parse the resulting content as XML:

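    require 'nokogiri'
    require 'open-uri'

    # URI.open is the open-uri entry point; a bare open() no longer fetches URLs on Ruby 3+.
    doc = Nokogiri::XML(
      URI.open('http://sec.gov/archives/edgar/data/1475481/0001475481-09-000001.txt').read
        .gsub(/\A.+<xml>\n/im, '')   # drop everything up to and including the opening tag line
        .gsub(/<\/xml>.+/mi, '')     # drop the closing tag and everything after it
    )
    puts doc.at('//schemaVersion').text
    # >> X0603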
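The same stripping step can also be written as a single capture instead of two `gsub` calls, which makes it easy to try out without hitting the network. This is only a sketch: `extract_xml` is a hypothetical helper name, the `sample` text is invented, and it assumes you only care about the first wrapped block in the file.

```ruby
# Hypothetical helper: returns the text between the first <XML> tag and the
# next </XML> tag (case-insensitive), or nil if the wrapper is not present.
def extract_xml(text)
  # /m lets "." cross newlines; ".*?" is non-greedy so it stops at the first closing tag.
  text[/<xml>\s*(.*?)\s*<\/xml>/im, 1]
end

# Invented sample input; not real EDGAR data.
sample = <<~FILING
  <SEC-HEADER>
  made-up header text
  </SEC-HEADER>
  <XML>
  <root><value>42</value></root>
  </XML>
  </TEXT>
FILING

puts extract_xml(sample)
puts extract_xml("no wrapper here").inspect
```

The returned string could then be handed to `Nokogiri::XML` as in the answer above. Checking for `nil` first avoids parsing an empty document when the wrapper tags are missing.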
