Mongo DB Atlas and R

10 Sep 2018

At One Earth Future Research our mission is to utilize rigorous research methodologies in an effort to provide empirical assessment of the root-causes of political violence and the efficacy of governance and development programs. I will be starting a semi-weekly blog that highlights some of our work in both the conflict trends and forecasting tracks. Our collective goal is to utilize data to explore the mechanisms of conflict and the long-term shifts in conflict related phenomena (conflict trends) and to understand and predict short term outcomes in conflict and political violence (forecasting). Our organization utilizes field work and in-depth interviews alongside state of the art technical approaches found in inferential statistics and machine learning/artificial intelligence.

Setup and Data

As we went about the process of creating a centralized data repository for all organizational data, we stumbled upon MongoDB’s Atlas platform. Atlas utilizes the MongoDB document system and is accessible from the cloud (AWS, Google Cloud, or Azure). This blog provides some demonstration code for utilizing Atlas in the R programming environment with our soon-to-be released election violence data.

Our data

We have a data set of all major national elections from 1975 to 2018 that is based on original data gathering and the NELDA dataset put together by Hyde and Marinov¹ and our own REIGN dataset². For this demonstration, we will simply use the name ev_df for our new election violence data for the time being.

Let’s quickly inspect our data before moving it into Atlas.

summary(ev_df)

##       year                          country            dates     
##  Min.   :1975   Ecuador                 :  35   10/25/2015:   8  
##  1st Qu.:1990   United States of America:  33   10/24/1999:   6  
##  Median :2000   Iran                    :  32   10/23/2011:   5  
##  Mean   :1999   France                  :  31   10/31/1999:   5  
##  3rd Qu.:2009   Egypt                   :  29   11/28/2010:   5  
##  Max.   :2018   Guatemala               :  29   11/28/2011:   5  
##                 (Other)                 :2074   (Other)   :2229  
##  elecViolence l.elecViolence  anticipation     regimetenure   
##  vio  : 617   0:1645         Min.   :0.0000   Min.   :  1.00  
##  peace:1623   1: 618         1st Qu.:0.0000   1st Qu.: 27.00  
##  NA's :  23                  Median :1.0000   Median : 56.00  
##                              Mean   :0.6879   Mean   : 83.39  
##                              3rd Qu.:1.0000   3rd Qu.:109.00  
##                              Max.   :1.0000   Max.   :589.00  
##                                                               
##   lastelection       pcgdp              growth            logIMR      
##  Min.   :0.000   Min.   :     0.0   Min.   :0.09887   Min.   :0.6419  
##  1st Qu.:0.000   1st Qu.:   709.6   1st Qu.:0.99141   1st Qu.:2.3842  
##  Median :0.000   Median :  2143.7   Median :1.05674   Median :3.3464  
##  Mean   :1.068   Mean   :  7269.8   Mean   :1.06457   Mean   :3.2260  
##  3rd Qu.:2.441   3rd Qu.:  7933.4   3rd Qu.:1.12739   3rd Qu.:4.1558  
##  Max.   :8.098   Max.   :102573.7   Max.   :4.09959   Max.   :5.1728  
##                                                                       
##      lnpop2         logpredict             SPI               lexconst    
##  Min.   : 31.48   Min.   :0.0000221   Min.   :-2.561111   Min.   :1.000  
##  1st Qu.: 69.81   1st Qu.:0.0006450   1st Qu.:-0.428677   1st Qu.:3.000  
##  Median : 84.67   Median :0.0015977   Median : 0.023829   Median :5.000  
##  Mean   : 88.17   Mean   :0.0038759   Mean   :-0.001438   Mean   :4.803  
##  3rd Qu.:105.59   3rd Qu.:0.0037408   3rd Qu.: 0.428446   3rd Qu.:7.000  
##  Max.   :197.67   Max.   :0.1414839   Max.   : 2.682971   Max.   :7.000  
##                                                                          
##     lpolity2          lpolcomp     
##  Min.   :-10.000   Min.   : 1.000  
##  1st Qu.: -4.000   1st Qu.: 3.000  
##  Median :  6.000   Median : 7.000  
##  Mean   :  3.186   Mean   : 6.603  
##  3rd Qu.:  9.000   3rd Qu.: 9.000  
##  Max.   : 10.000   Max.   :10.000  
##

The data contains 17 variables and captures 2263 unique elections across the 1975 - 2018 time period.³

Setting up connection and uploading data to cluster

As of right now, we have no data housed on our Atlas cluster called OEFdata. To send data to our Atlas cluster, we can use the mongolite package.

Empty atlas database

library(mongolite)

#find cluster url with username and password replaced with SCRAM credentials
url_path = 'mongodb+srv://<username here>:<password here>!@<cluster url>/admin'

#make connection object that specifies new database and collection (dataset)
mongo1 <- mongo(collection = "EV_Forecast Data (1975 - 2018)", db = "OEFR", 
              url = url_path, 
              verbose = TRUE)

#show commands for connection
mongo1

## <Mongo collection> 'EV_Forecast Data (1975 - 2018)' 
##  $aggregate(pipeline = "{}", options = "{\"allowDiskUse\":true}", handler = NULL, pagesize = 1000, iterate = FALSE) 
##  $count(query = "{}") 
##  $disconnect(gc = TRUE) 
##  $distinct(key, query = "{}") 
##  $drop() 
##  $export(con = stdout(), bson = FALSE, query = "{}", fields = "{}", sort = "{\"_id\":1}") 
##  $find(query = "{}", fields = "{\"_id\":0}", sort = "{}", skip = 0, limit = 0, handler = NULL, pagesize = 1000) 
##  $import(con, bson = FALSE) 
##  $index(add = NULL, remove = NULL) 
##  $info() 
##  $insert(data, pagesize = 1000, stop_on_error = TRUE, ...) 
##  $iterate(query = "{}", fields = "{\"_id\":0}", sort = "{}", skip = 0, limit = 0) 
##  $mapreduce(map, reduce, query = "{}", sort = "{}", limit = 0, out = NULL, scope = NULL) 
##  $remove(query, just_one = FALSE) 
##  $rename(name, db = NULL) 
##  $replace(query, update = "{}", upsert = FALSE) 
##  $run(command = "{\"ping\": 1}", simplify = TRUE) 
##  $update(query, update = "{\"$set\":{}}", filters = NULL, upsert = FALSE, multiple = FALSE)

Use the $insert command to store the data in your Atlas collection.

mongo1$drop()
mongo1$insert(ev_df)

## 
Processed 1000 rows...
Processed 2000 rows...
Complete! Processed total of 2263 rows.

## List of 5
##  $ nInserted  : num 2263
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()

Our election violence data (collection) is now housed in our OEFR database

Analysis commands and pulling data into R.

Now our original R dataframe is hosted in our Atlas cluster in the MongoDB document format and we can start to analyze and manipulate data that is already stored in our cluster.

#count number of documents (rows)

mongo1$count()

## [1] 2263

#import data frame back into R

ev_df2 <- mongo1$find()

## 
 Found 1000 records...
 Found 2000 records...
 Found 2263 records...
 Imported 2263 records. Simplifying into dataframe...

str(ev_df2)

## 'data.frame':    2263 obs. of  17 variables:
##  $ year          : int  1978 1982 1987 1991 1991 1992 1992 1996 1996 1997 ...
##  $ country       : chr  "Albania" "Albania" "Albania" "Albania" ...
##  $ dates         : chr  "11/12/1978" "11/14/1982" "2/1/1987" "3/31/1991" ...
##  $ elecViolence  : chr  "peace" "peace" "peace" "vio" ...
##  $ l_elecViolence: chr  "0" "0" "0" "0" ...
##  $ anticipation  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ regimetenure  : num  409 457 23 72 73 84 84 50 51 63 ...
##  $ lastelection  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ pcgdp         : num  628 798 807 410 410 ...
##  $ growth        : num  1.061 1.008 0.973 0.611 0.611 ...
##  $ logIMR        : num  4.29 4.03 3.71 3.52 3.52 ...
##  $ lnpop2        : num  61.6 62.9 64.7 65.6 65.6 ...
##  $ logpredict    : num  0.00273 0.00222 0.00236 0.00302 0.00329 ...
##  $ SPI           : num  0.00931 0.25154 -0.39167 -1.55961 -1.2355 ...
##  $ lexconst      : int  1 1 1 3 3 3 3 5 5 5 ...
##  $ lpolity2      : int  -9 -9 -9 1 1 1 1 5 5 0 ...
##  $ lpolcomp      : int  1 1 1 6 6 6 6 6 6 6 ...

The find function can also be used for making conditional queries from our Atlas cluster, and is especially useful for big data that may not play well with your local R instance. Lets demonstrate some ways to query/subset from our election violence data.

#use the query argument to subset by rows
#lets get all observations that experienced election violence

ev_df2 <- mongo1$find(query = '{"elecViolence":"vio"}')

## 
 Found 617 records...
 Imported 617 records. Simplifying into dataframe...

#inspect our subsetted data frame

nrow(ev_df2)

## [1] 617

#use the fields argument to subset by columns
#lets query the above data, but only getting country and date information

ev_df2 <- mongo1$find(query = '{"elecViolence":"vio"}',
                      fields = '{"country": true, "dates": true}')

## 
 Found 617 records...
 Imported 617 records. Simplifying into dataframe...

#inspect data frame variables

summary(ev_df2)

##      _id              country             dates          
##  Length:617         Length:617         Length:617        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character

Now we can create a new cluster collection to host our subsetted data

#now lets upload our new subset into it's own Atlas collection
#make new connection within the OEFR database

mongo2 <- mongo(collection = "Election Violence Events (Country, Dates)", db = "OEFR", 
              url = url_path, 
              verbose = TRUE)

mongo2$insert(ev_df2)

## 
Complete! Processed total of 617 rows.

## List of 5
##  $ nInserted  : num 617
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()

We now have a new collection of just election violence events!

If you need to delete a a collection (data frame) from your Atlas cluster, you can use the following.

#drop and delete collection tab in Atlas 

mongo2$drop()

#drop data frame, but keep collection tab in Atlas

mongo2$drop{'{}'}

Finally, we can query aggregate statistics from our Atlas cluster using MongoDB’s pipeline syntax. Aggregating at the group level is useful for getting more managable data sets when dealing with big data. Lets demonstrate this capability using our election violence data!

#Lets get the count of country elections and the average infant mortality rate and SPI (precipitation) by country

agg_stats <- mongo1$aggregate('[{"$group":{"_id":"$country", "election events": {"$sum":1}, 
                                "average IMF":{"$avg":"$logIMR"}, "average SPI":{"$avg":"$SPI"}}}]'
                              )

## 
 Found 152 records...
 Imported 152 records. Simplifying into dataframe...

#show first 25 rows of country aggregated data
head(agg_stats, n = 25)

##                         _id election events average IMF average SPI
## 1                  Zimbabwe              19    4.043437  0.33245826
## 2                 Venezuela              21    3.006033  0.20093259
## 3                Uzbekistan              12    3.817488 -0.02127261
## 4  United States of America              33    2.115447 -0.03168326
## 5            United Kingdom              10    1.810915 -0.07704831
## 6                   Ukraine              19    2.579029  0.21402413
## 7                  Thailand              22    3.090824  0.24227424
## 8                    Sweden              13    1.465030 -0.17207021
## 9                 Swaziland               6    4.205235 -0.12403276
## 10                  Surinam               2    3.020008 -0.92425866
## 11                    Sudan              12    4.218346 -0.11023233
## 12                    Spain              12    1.880806 -0.24571050
## 13             South Africa               9    4.058460  0.04487938
## 14                  Somalia               4    4.662905 -0.62259620
## 15          Solomon Islands               2    3.233632  0.18578571
## 16                    Syria              17    3.159167 -0.26130308
## 17                 Slovakia              13    2.104528  0.06276020
## 18                Singapore              13    1.447405  0.43001852
## 19      Serbia (Yugoslavia)               3    3.039749 -0.64691414
## 20                  Senegal              18    4.168331 -0.18662808
## 21                   Rwanda              13    4.243818  0.25100911
## 22                   Russia              14    2.686567 -0.01138183
## 23                  Romania              23    2.997278 -0.22977333
## 24                 Portugal              22    2.052871 -0.24904753
## 25                   Poland              22    2.230286  0.08542938

Susan D. Hyde and Nikolay Marinov, 2012, Which Elections Can Be Lost?, Political Analysis, 20(2), 191-201.↩
Bell, Curtis. 2016. The Rulers, Elections, and Irregular Governance Dataset (REIGN). Broomfield, CO: OEF Research. Available at oefresearch.org↩
Information about our election violence data and collection/analysis methodology will be explored in future blogs.↩

Through the data, darkly

Mongo DB Atlas and R

Setup and Data

Our data

Setting up connection and uploading data to cluster

Analysis commands and pulling data into R.

Related Posts

Kaggle Competition Part 1 18 Nov 2016

Test Python Code 20 Sep 2016

Intro to Bayesian methods 04 Jan 2016