
Data

This repository contains the code for ingesting the data used by Code for Democracy's platform.

Cloud Services

We use a variety of cloud services to manage our data ingestion pipeline (the trigger pattern is sketched after the list):

  • Scrapy Cloud: runs Scrapy spiders
  • Google Cloud Scheduler: schedules functions
  • Google Cloud Functions: executes the data loading tasks
  • Google Cloud Pub/Sub: coordinates the Cloud Functions
  • Google Cloud Firestore: stores queues used in ingestion
  • Google Cloud Storage: stores downloaded raw files
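
The general flow is the standard Pub/Sub trigger pattern: Cloud Scheduler publishes a message to a topic, the topic triggers a Cloud Function, and the function hands its output to the next stage by publishing again. A minimal sketch (project, topic, and function names here are illustrative, not the repo's):

```python
import base64
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()

def ingest_step(event, context):
    """Background Cloud Function triggered by a Pub/Sub message."""
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    # ... perform one unit of work (download, transform, or load) ...
    # Hand the record off to the next stage of the pipeline.
    topic = publisher.topic_path("my-project", "next-stage")  # hypothetical names
    publisher.publish(topic, json.dumps(payload).encode("utf-8"))
```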

Databases

The data is stored in a few different databases:

  • Google Cloud BigQuery: stores the tabular data
  • Elasticsearch: stores the scraped documents
  • Neo4j: stores the knowledge graph

Functions

The work of downloading, transforming, and loading datasets is handled by these serverless functions:

News

Sources

  • news_sources_ingest_run_spiders: kicks off the allsides and mediabiasfactcheck spiders on Scrapy Cloud (the run/poll flow is sketched after this list)
  • news_sources_ingest_check_spiders: polls the spiders every 60 seconds to check for completion
  • news_sources_ingest_get_crawls: downloads the crawl data from Scrapy Cloud and queues records for processing
  • news_sources_ingest_verify_domain: checks whether each domain's DNS resolves and adds it to Elasticsearch
  • news_sources_compute_load_graph: loads sources into Neo4j
  • news_sources_compute_merge_domains: connects domains with sources
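
A hedged sketch of the run/poll flow using the python-scrapinghub client (the API key, project id, and handoff details are placeholders; the actual functions may differ):

```python
import time

from scrapinghub import ScrapinghubClient

client = ScrapinghubClient("APIKEY")  # placeholder API key
project = client.get_project(123456)  # placeholder project id

# Kick off both spiders on Scrapy Cloud.
jobs = [project.jobs.run(name) for name in ("allsides", "mediabiasfactcheck")]

# Poll every 60 seconds until every job reports a finished state.
while any(job.metadata.get("state") != "finished" for job in jobs):
    time.sleep(60)

# Download the crawl data and queue each record for processing.
for job in jobs:
    for item in job.items.iter():
        pass  # e.g. publish the record to a Pub/Sub processing topic
```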

Articles

  • news_articles_ingest_queue_domains: queues the list of active domains from Elasticsearch
  • news_articles_ingest_get_paper: queues article URLs for each domain and logs the scraper used in Elasticsearch (see the sketch after this list)
  • news_articles_ingest_get_articles: scrapes articles, sends new ones to Elasticsearch, and updates the Cloud Firestore list of articles
  • news_articles_ingest_process_stragglers: queues the remaining article URLs for domains that errored out
  • news_articles_ingest_queue_duplicates: iterates through Elasticsearch and queues batches of duplicate articles
  • news_articles_ingest_delete_duplicate: deletes each duplicate document from Elasticsearch and Firestore
  • news_articles_ingest_get_url: scrapes a single URL; called when URLs are found in other pipelines
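
The README does not name the scraping library, but the "paper" abstraction in news_articles_ingest_get_paper resembles newspaper3k; a sketch under that assumption:

```python
import newspaper

# Build a "paper" for a domain and collect candidate article URLs to queue.
paper = newspaper.build("https://example.com", memoize_articles=False)
urls = [article.url for article in paper.articles]

# Scrape a single article (roughly what news_articles_ingest_get_url would do).
article = newspaper.Article(urls[0])
article.download()
article.parse()
print(article.title, article.publish_date)
```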

Facebook

  • facebook_ingest_get_ads: indexes Facebook ads into Elasticsearch (sketched after this list)
  • facebook_compute_load_graph: loads Facebook ads from Elasticsearch into Neo4j
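
Indexing ads is a straightforward bulk write; a minimal sketch with the elasticsearch-py helpers (the index name and document shape are hypothetical):

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch()

def index_ads(ads):
    actions = (
        {
            "_index": "facebook_ads",  # hypothetical index name
            "_id": ad["id"],           # reusing the ad id avoids duplicate documents
            "_source": ad,
        }
        for ad in ads
    )
    bulk(es, actions)
```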

Reddit

  • reddit_ingest_get_ads: indexes Reddit ads into Elasticsearch

Twitter

  • twitter_ingest_queue_get: triggers the get functions
  • twitter_ingest_get_timeline: gets timeline tweets for a user id and updates Firestore and Elasticsearch
  • twitter_ingest_get_tweets: gets a list of tweets and updates Firestore and Elasticsearch
  • twitter_compute_queue_graph: queues tweets for graph loading, both from primary users and from the loading queue
  • twitter_compute_load_graph: loads tweets from Elasticsearch into Neo4j
  • twitter_compute_extract_domains: extracts domains from links
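
Domain extraction from tweet links can be done with the tldextract package (an assumption; the repo may use something else):

```python
import tldextract

def extract_domain(url):
    parts = tldextract.extract(url)  # splits into subdomain, domain, suffix
    return parts.registered_domain   # e.g. "nytimes.com"

print(extract_domain("https://www.nytimes.com/2020/01/01/example.html"))
```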

Federal

FEC

  • federal_fec_ingest_queue_download: sends FEC file URLs to be downloaded
  • federal_fec_ingest_download_zip: downloads FEC zip file into Google Cloud Storage
  • federal_fec_ingest_queue_import: sends FEC file paths from Google Cloud Storage to be imported
  • federal_fec_ingest_import_bigquery: imports the data from Cloud Storage into BigQuery (see the sketch after this list)
  • federal_fec_ingest_create_master_tables: creates the master candidates, committees, and contributions tables
  • federal_fec_ingest_get_financials: gets financials from the FEC API and loads them into Elasticsearch
  • federal_fec_ingest_get_receipts: gets Schedule A data from the FEC API and loads it into Elasticsearch
  • federal_fec_compute_load_elastic_candidates: loads candidates from BigQuery into Elasticsearch
  • federal_fec_compute_load_elastic_committees: loads committees from BigQuery into Elasticsearch
  • federal_fec_compute_load_elastic_linkages: loads linkages from BigQuery into Elasticsearch
  • federal_fec_compute_load_elastic_contributions: loads contributions from BigQuery into Elasticsearch
  • federal_fec_compute_load_elastic_expenditures: loads expenditures from BigQuery into Elasticsearch
  • federal_fec_compute_load_candidates: loads the FEC candidates into Neo4j
  • federal_fec_compute_load_committees: loads the FEC committees into Neo4j
  • federal_fec_compute_load_contributions: loads the FEC contributions into Neo4j
  • federal_fec_compute_load_expenditures: loads the FEC expenditures into Neo4j
  • federal_fec_ingest_unzip_gcs: automatically unzips .zip files in Google Cloud Storage
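
A hedged sketch of the Cloud Storage to BigQuery import step (the bucket path, dataset, and table are placeholders, and the real job configuration will differ):

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="|",  # FEC bulk files are pipe-delimited
    autodetect=True,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/fec/indiv20.txt",  # placeholder Cloud Storage path
    "my-project.fec.contributions",    # placeholder destination table
    job_config=job_config,
)
load_job.result()  # block until the import finishes
```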

IRS

  • federal_irs_ingest_download_990s_index: downloads the index file for 990s from the IRS AWS server
  • federal_irs_ingest_get_990s: indexes IRS 990s into Elasticsearch
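
A sketch of fetching a yearly index from the public IRS 990 bucket (this reflects the older s3 irs-form-990 layout, which has since changed, and the field names vary by index version):

```python
import requests

year = 2020
url = f"https://s3.amazonaws.com/irs-form-990/index_{year}.json"
index = requests.get(url, timeout=60).json()

for filing in index.get(f"Filings{year}", []):
    pass  # queue each filing record for indexing into Elasticsearch
```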

House

  • federal_house_lobbying_ingest_get_disclosures: indexes House lobbying disclosures into Elasticsearch
  • federal_house_lobbying_ingest_get_contributions: indexes House lobbying contributions into Elasticsearch

Senate

  • federal_senate_lobbying_ingest_get_disclosures: indexes Senate lobbying disclosures into Elasticsearch
  • federal_senate_lobbying_ingest_get_contributions: indexes Senate lobbying contributions into Elasticsearch

Principles

Except for nodes that represent extracted information (e.g. domains), each dataset is loaded into Neo4j using a set of labels unique to that dataset, so it can be restructured or ripped out without damaging the rest of the knowledge graph.
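
Concretely, with the neo4j Python driver (the labels and properties here are illustrative, not the repo's actual schema):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # A dataset-specific label (FecCandidate) instead of a generic one (Person)
    # keeps the subgraph self-contained.
    session.run(
        "MERGE (c:FecCandidate {cand_id: $cand_id}) SET c.name = $name",
        cand_id="P00000001", name="EXAMPLE CANDIDATE",
    )

# Ripping out the dataset is then a single label-scoped delete:
#   MATCH (c:FecCandidate) DETACH DELETE c
```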
