GovTrack Wiki
Data Directory
GovTrack's data directory (described generally here) has the following layout:
General Notes
When I say "Session" I mean the two-year periods usually called "a Congress", like the 109th Congress for 2005-2006. For me, that's session 109.
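As a rough illustration of the numbering, a session number can be converted to its two calendar years and back. The sketch below assumes the conventional rule that the 1st Congress began in 1789 and that each Congress spans two years. The function names are hypothetical, and the few days in early January before a new Congress convenes are ignored.

```python
FIRST_CONGRESS_YEAR = 1789  # assumption: conventional start year of the 1st Congress


def session_years(session: int) -> tuple[int, int]:
    """Return the two calendar years usually associated with a session number."""
    if session < 1:
        raise ValueError("session numbers start at 1")
    first = FIRST_CONGRESS_YEAR + 2 * (session - 1)
    return first, first + 1


def session_for_year(year: int) -> int:
    """Return the session number whose two-year period contains the given year."""
    if year < FIRST_CONGRESS_YEAR:
        raise ValueError("year precedes the 1st Congress")
    return (year - FIRST_CONGRESS_YEAR) // 2 + 1


if __name__ == "__main__":
    print(session_years(109))      # the 2005-2006 example from the text
    print(session_for_year(2008))
```

Under that assumption, session 109 maps to 2005 and 2006, which matches the example above.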
Directory Layout and Links to (Informal) Schemas
- us/: This is the main root for all Congressional data.
- bioguide[1,2,3].csv: A dump of the BioGuide database. Not regularly updated.
- bills.technorati.xml: A list of bills by popularity, according to their mentions in the blogosphere. Bill "id"s are the bill type, the Congress, a dash, and the number (see Bill XML for details).
- last_update: The YYYY-MM-DD date of the most recent Daily Digest update from THOMAS that GovTrack has gotten.
- liv.xml: An XML-ized version of the Legislative Indexing Vocabulary.
- bills.text/: A directory containing PDF, TXT, XML, and HTML texts of bills.
- 106...110/ (session number)
- h, hc, hj, hr, s, sc, sj, sr/ (bill type; see Bill XML)
- The names of these files are the bill type (same as the containing directory) and the bill number, optionally followed by a "status" code, then a dot and the extension (pdf, txt, html, and, where available, xml).
- When a status code is omitted, the file is a symbolic link to the latest version of the bill. Status codes are explained on the GPO website.
- The PDFs are downloaded from the GPO.
- Text files are generated by me from the PDFs using "pdftotext -layout -nopgbrk -enc utf8". (Before June 11, 2008 they were in Latin-1 encoding, but all have now been converted to UTF-8.)
- XML files are downloaded from the House.
- HTML versions are grabbed from THOMAS.
- There are also .gen.html files, which are the HTML files plus special markup that I auto-insert. The markup marks insertions, deletions, and changes from the previous version, as well as references to the U.S. Code. Paragraphs also carry nid attributes that identify them uniquely, with identifiers persisted across versions to the extent possible.
- 106...110 (session number): Primary area for legislative data for a session of Congress
- committees.xml: All current committees and committee membership.
- committeeschedule.xml: Upcoming committee meetings, taken from a Senate web page for the Senate and from the Daily Digest on THOMAS for the House.
- votes.all.index.xml: A summary of all votes this session.
- bills/: Full status information for every bill.
- bills.amdt/: Full status information for every amendment.
- bills.summary/: XML-ized CRS summaries for bills.
- bills.cbo/: Congressional Budget Office bill reports, with extracted summaries.
- bills.omb/: Office of Management and Budget bill reports, with extracted summaries.
- cr/: The Congressional Record.
- gen.rolls-[cart,geo,pca]/: Generated info for votes: regular projection maps (geo), cartograms (cart), and analysis (pca, though no PCA statistics are currently computed). The analysis txt files contain a simple summary of how the parties voted.
- rolls/: Roll call votes.
- photos/
- This directory contains JPEG images of Members of Congress, past and present. Not all MoCs have photos. The name of each photo is the GovTrack numeric identifier for the person, followed by nothing for the largest original image available, or by 200px, 100px, or 50px for one of three resized versions (by width), and then .jpeg.
- rdf/
- This directory contains an RDF dump of the other data. The RDF dump is not regularly updated right now.
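
The naming rules above for bill "id"s and for files under bills.text/ can be sketched as a pair of parsers. This is only an illustration: the function and class names are hypothetical, and it assumes that the bill type, number, and status code are joined with no separator and that status codes are lowercase letters and digits, which the description above does not spell out.

```python
import re
from typing import NamedTuple, Optional

# Bill types as listed in the directory layout.
BILL_TYPES = ("h", "hc", "hj", "hr", "s", "sc", "sj", "sr")
_TYPE = "|".join(BILL_TYPES)

# Bill id: bill type, session (Congress) number, a dash, bill number.
BILL_ID_RE = re.compile(rf"^({_TYPE})(\d+)-(\d+)$")

# Bill text file: bill type, bill number, optional status code, dot, extension.
# Assumption: no separator between the parts; ".gen.html" is treated as one extension.
TEXT_FILE_RE = re.compile(
    rf"^({_TYPE})(\d+)([a-z][a-z0-9]*)?\.(gen\.html|pdf|txt|html|xml)$"
)


class BillId(NamedTuple):
    bill_type: str
    session: int
    number: int


class BillTextFile(NamedTuple):
    bill_type: str
    number: int
    status: Optional[str]  # None means the "latest version" symbolic link
    extension: str


def parse_bill_id(bill_id: str) -> BillId:
    """Split a bill id into type, session, and number."""
    m = BILL_ID_RE.match(bill_id)
    if m is None:
        raise ValueError(f"not a recognized bill id: {bill_id!r}")
    return BillId(m.group(1), int(m.group(2)), int(m.group(3)))


def parse_text_filename(name: str) -> BillTextFile:
    """Split a bills.text/ file name into type, number, status code, and extension."""
    m = TEXT_FILE_RE.match(name)
    if m is None:
        raise ValueError(f"not a recognized bill text file name: {name!r}")
    return BillTextFile(m.group(1), int(m.group(2)), m.group(3), m.group(4))


if __name__ == "__main__":
    print(parse_bill_id("hr110-1234"))
    print(parse_text_filename("hr1234ih.txt"))
    print(parse_text_filename("s5.gen.html"))
```

A name with no status code parses with `status` set to `None`, which corresponds to the symbolic link to the latest version of the bill.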
