---
title: "Fedspeak Documentation"
author: "Darren Chang"
date: "`r Sys.Date()`"
output:
  pdf_document:
    latex_engine: xelatex
    toc: true
    toc_depth: 3
    number_sections: true
    keep_md: true
urlcolor: blue
---
```{r setup, include = F}
library(reticulate)
use_python('C:/Users/darre/Python/Python37/python.exe')
```
```{python include = F}
import pandas as pd
import numpy as np
import os
os.chdir('C:/Users/darre/Documents/_econ/fedspeak/text mining')
```
\newpage
## Introduction
Here is the link to the [Beige Book](https://www.federalreserve.gov/monetarypolicy/beige-book-default.htm), a qualitative report that the Federal Reserve releases eight times a year. In past decades, the Beige Book was released as often as once a month. Each Federal Reserve district writes its own report, and a National Summary is prepared by Research Associates at a district bank on a rotating basis.
### Working
More information on the Beige Book and previous forecasting methods to come.
## Text Mining
Text mining was completed in `Python 3.7` using `BeautifulSoup`. Set up by importing `pandas`, `BeautifulSoup`, and `requests`.
```{python eval = F}
## ---- setup
import pandas as pd
import numpy as np
import os #setwd
import time #timing
# scraping
from bs4 import BeautifulSoup
from contextlib import closing
import requests # website requests
from requests.exceptions import RequestException
from requests import get
```
I scrape the [Minneapolis Fed's Beige Book Archive](https://www.minneapolisfed.org/region-and-community/regional-economic-indicators/beige-book-archive), which hosts `html` files for each Beige Book back to 1970. The links to each page to be scraped are in the `links.csv` file in this repo. Import it as a dataframe.
```{python eval = T}
links = pd.read_csv('~/_econ/fedspeak/text mining/links.csv')
links.head(20)
links.tail(20)
```
Next, I define the scraping functions, which iterate through the links, open each URL, and return all of the text in `<p>` tags. Much of the code for this portion was taken from Real Python's [guide to web scraping](https://realpython.com/python-web-scraping-practical-introduction/).
```{python eval = F}
## ---- scraping
# functions for getting URLs, with error logging
def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content type of the response is some kind of HTML/XML, return
    the `<p>` elements; otherwise return None.
    """
    try:
        with closing(get(url, stream = True)) as resp:
            if is_good_response(resp):
                soup = BeautifulSoup(resp.content, 'html.parser')
                results = soup.find_all('p')  # keep only paragraph tags
                return results
            else:
                return None
    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    content_type = resp.headers['Content-Type'].lower()
    return (resp.status_code == 200
            and content_type is not None
            and content_type.find('html') > -1)

def log_error(e):
    """
    Log errors (here, just print them).
    """
    print(e)
```
We define another function to do the actual work of scraping. This returns a dataframe with the metadata about the website we're scraping (date, year, bank, url) merged with lists of lists of the actual text in each page.
```{python eval = F}
# scraping a set of links
def scrape(links  # dataframe of urls and other info
           ):
    """
    Scrapes a set of links into a dataframe.
    Returns a dataframe of raw text stored as lists of lists.
    """
    links_use = links['url'].values.tolist()  # extract urls as a list
    fed_text_raw = pd.DataFrame()  # empty df
    for url in links_use:
        text = simple_get(url)
        df = pd.DataFrame({'url': url, 'text': [text]})
        # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
        fed_text_raw = fed_text_raw.append(df, ignore_index = True)
    fed_text_raw.columns = fed_text_raw.columns.str.strip()  # strip column names
    return fed_text_raw
```
Finally, I scrape the links. The returned data was over 1 GB when saved as a csv file, and the process usually takes about 2.5 hours. I would recommend exporting the result (to csv or whatever file storage format you prefer) to avoid scraping multiple times.
```{python eval = F}
fed_text_raw = scrape(links)
```
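As a minimal sketch of the export step recommended above (the file names here are placeholders, not part of the original code), the result can be written out once and reloaded in later sessions:
```{python eval = F}
# hypothetical export/reload step so the ~2.5-hour scrape only runs once;
# a binary format such as pickle sidesteps the csv issues with lists of lists noted later
fed_text_raw.to_csv('fed_text_raw.csv', index = False)  # plain csv export
fed_text_raw.to_pickle('fed_text_raw.pkl')              # or a pandas pickle
fed_text_raw = pd.read_pickle('fed_text_raw.pkl')       # reload without re-scraping
```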
## Cleaning Data
We come to the unenviable task of cleaning the data, which represents a few million words from 50 years of Beige Book reports. First, we strip the dataset of html tags and numbers. When using `BeautifulSoup`, this is often done with the `get_text()` command; however, because we scraped with `find_all` on the html paragraph tag, `get_text` does not work in this case.
### Preprocessing
The following function uses regular expressions to strip out stray characters, html tags, and other wacky spacing issues.
```{python, eval = F}
## ---- cleaning
import re  # used below; not imported in the earlier setup chunk

def preprocess(df,
               text_field,           # field that has text
               new_text_field_name   # name of field that has normalized text
               ):
    """
    Normalizes a text column by removing characters that do not
    contribute to natural-language meaning (handles, stray symbols,
    urls, leftover tags, and digits).
    """
    df[new_text_field_name] = (df[text_field]
        .apply(lambda elem: re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?",
                                   "", str(elem)))
        .apply(lambda elem: re.sub("<.*?>", "", str(elem)))  # strips out tags
        .apply(lambda elem: re.sub(r"\d+", "", str(elem)))   # strips out digits
    )
    return df
```
### Tokenization
We want to convert our bag of words for each report into the tidy text format, which Julia Silge and David Robinson define in Chapter 1 of [*Text Mining in R*](https://www.tidytextmining.com/tidytext.html) as:
>We thus define the tidy text format as being **a table with one-token-per-row**. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens. This one-token-per-row structure is in contrast to the ways text is often stored in current analyses, perhaps as strings or in a document-term matrix. For tidy text mining, the **token** that is stored in each row is most often a single word, but can also be an n-gram, sentence, or paragraph. In the tidytext package, we provide functionality to tokenize by commonly used units of text like these and convert to a one-term-per-row format.
You may ask: why use a `tidy` format if we're mining with Python? This is a good question, but here are some of my reasons:
1. `NLTK` is set up for ML-based sentiment analysis, which is not the approach used here.
+ The basic sentiment analysis technique we use involves matching words in our data against a predefined dictionary.
+ The `tidytext` package preloads the Loughran-McDonald dictionary, which is useful for analyzing finance (and economics-related) texts. More on this, and on the use of different lexicons, later.
2. CSV files struggle with lengthy lists of lists.
+ Sometimes there are display errors.
+ Sometimes there are errors where the list is too many characters.
+ The processing speed can be quite slow.
3. I'm personally more familiar with visualization in `ggplot2`.
4. I do attempt to use `python` to repeat some of the data processing that was first completed with `R`. This is clearly indicated (and the reader can also ascertain which language is used by the programming syntax).
Here, we define a function for word tokenization, which borrows heavily from [Michelle Fullwood's](https://github.com/michelleful) [project](https://github.com/michelleful/TidyTextMining-Python) converting *Text Mining in R* to Python. I have modified the function to merge the tokens with the original list of links on the report URL.
```{python, eval = F}
import nltk  # provides nltk.word_tokenize, passed in as the tokenizer below

# word tokenization
def unnest(df,                     # line-based dataframe
           column_to_tokenize,     # name of the column with the text
           new_token_column_name,  # what you want the column of words to be called
           tokenizer_function,     # which tokenizer to use (nltk.word_tokenize)
           original_list):         # original list of data
    """
    Unnests words from html and returns a dataframe in long format,
    merged with the original list of links. Word tokenization in tidy text format.
    """
    return (df[column_to_tokenize]
            .apply(str)
            .apply(tokenizer_function)
            .apply(pd.Series)
            .stack()
            .reset_index(level=0)
            .set_index('level_0')
            .rename(columns={0: new_token_column_name})
            .join(df.drop(column_to_tokenize, axis=1), how='left')
            .reset_index(drop=True)
            .merge(original_list, how = 'outer', on = 'url')
            )
```
To complete the preprocessing and tokenization, I apply the functions to the raw data, convert the tokens to lowercase (which simplifies later steps), and save the result as a `csv` file for future use.
```{python, eval = F}
fed_text_raw = preprocess(fed_text_raw, 'text', 'text')
fed_text_all = unnest(fed_text_raw, 'text', 'word', nltk.word_tokenize, links)
fed_text_all['word'] = fed_text_all['word'].str.lower() # convert to lowercase
fed_text_all.to_csv('fed_text_all.csv', index = False) # save as csv
```
### Dask
If you find that `pandas` is too slow for these applications or for your data, you may want to look into `dask`, which uses multi-core processing in addition to other goodies to speed things up for large datasets. Shikhar Chauhan has a blog post about using dask for text pre-processing at [MindOrks](https://medium.com/mindorks/speeding-up-text-pre-processing-using-dask-45cc3ede1366).
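If you go that route, here is a rough sketch of how `dask.dataframe` could slot into this pipeline, assuming the `preprocess()` function defined above and an arbitrary partition count; this was not part of the original workflow:
```{python eval = F}
# hypothetical dask version of the preprocessing step; not used in this project
import dask.dataframe as dd

fed_text_dask = dd.from_pandas(fed_text_raw, npartitions = 8)  # split the dataframe across cores
fed_text_dask = fed_text_dask.map_partitions(
    lambda part: preprocess(part, 'text', 'text'))             # reuse preprocess() on each partition
fed_text_raw = fed_text_dask.compute()                         # collect back into a pandas dataframe
```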
### Stemming
Here's a stemming function that would work on a `tidy` formatted dataframe of words. I didn't end up using stemming (and instead used lemmatization) because I found stemming removed too much --- what should have been words simply became nonsensical stems. [DataCamp](http://datacamp.co/) has a tutorial for both stemming and lemmatization and compares the two methods [here](https://www.datacamp.com/community/tutorials/stemming-lemmatization-python).
```{python, eval = F}
from nltk.stem.snowball import SnowballStemmer  # not imported in the earlier setup chunk

snowball = SnowballStemmer(language = 'english')

def stemming(df,
             column_to_stem,
             stem_function  # which stemming function to use (snowball.stem)
             ):
    """
    Returns the stems of the words in a dataframe column.
    """
    return (df[column_to_stem]
            .apply(stem_function)
            )

stemming(fed_text_all, 'word', snowball.stem)
```
### Lemmatization
Instead of stemming, I lemmatized the text by comparing the verbs in the data to the WordNet corpus in `NLTK`. This reduces a word like "swimming" to "swim," which dictionaries pick up better for sentiment analysis.
```{python, eval = F}
from nltk.stem import WordNetLemmatizer  # not imported in the earlier setup chunk

wordnet_lemmatizer = WordNetLemmatizer()

def lemmatize(df,
              column_to_lemmatize,
              lemmatize_func  # which lemmatizer to use (wordnet_lemmatizer)
              ):
    """
    Returns root words by matching each word against the lemmatizer.
    """
    return (df[column_to_lemmatize]
            .apply(lambda word: lemmatize_func.lemmatize(word, pos = "v"))  # part of speech set to verb
            )

fed_text_all['word'] = lemmatize(fed_text_all, 'word', wordnet_lemmatizer)  # lemmatize
```
### Stopwords
To finish the cleaning process, I remove stop words from the data, which keeps roughly 60 percent of the raw data (the csv file becomes 600 MB). Stop words are words that do not contribute to the meaning of a selection. The list of `NLTK` stop words used in this code is documented in [this resource](https://www.nltk.org/book/ch02.html) for the package.
I add to the stop words list:
1. Left over html tags, such as "pp."
2. A custom list of stop words for the economic context.
- Words such as "maturity" or "work" may be coded as positive in natural language but mean something different in economics.
- Similarly, words such as "gross" (domestic product) or "crude" (oil) may be coded as negative but do not carry such a negative connotation here.
- Len Kiefer's [blog post](http://lenkiefer.com/2018/07/29/beige-ian-statistics/) about this was helpful, and the list is taken from his post.
```{python eval = F}
from nltk.corpus import stopwords  # not imported in the earlier setup chunk

# make custom stop words list
stop_words = stopwords.words("english")
stop_words_html = ['ppwe', 'uspp', 'pwe', 'usp', 'pp', 'usbrp']  # leftover tokens from converting
                                                                 # html with 'p' tags to text
stop_words_econ = ['debt', 'gross', 'crude', 'well', 'maturity', 'work',
                   'marginally', 'leverage']  # stop words for economic usage
stop_words.extend(stop_words_html)
stop_words.extend(stop_words_econ)

# delete stopwords
def remove_stops(df,
                 column_to_process,
                 stop_words_list,     # list of stop words (stop_words)
                 index_reset = True   # whether to reset the index
                 ):
    """
    Removes stop words from a pandas dataframe in tidy (long) format,
    as opposed to the typical wide format with lists of words.
    """
    if index_reset:
        return (df[~df[column_to_process].isin(stop_words_list)]
                .reset_index(drop = True))  # reset the index
    else:
        return df[~df[column_to_process].isin(stop_words_list)]

# remove_stops(fed_text_test, 'word', stop_words)  # test code
fed_text_all = remove_stops(fed_text_all, 'word', stop_words)
```
I saved the cleaned data as a csv to read it later for sentiment analysis.
```{python eval = F}
fed_text_all.to_csv('fed_text_all.csv', index = False) # save as csv
```
But before that, some word frequency summary statistics:
```{python eval = T}
# summary statistics
fed_text_all = pd.read_csv('~/_econ/fedspeak/text mining/fed_text_all.csv') # read csv
fed_text_all['word'].value_counts().head(20)
```
It's unsurprising to me that "report" and "district" are mentioned often; after all, the Beige Book is a report grouped by district. It also appears that real economic activity matters more than financial activity, based on words like "demand" and "sales." This isn't surprising either: the Fed's concern with financial activity is more recent, as markets have grown more important in the wake of the Great Recession.
## Sentiment Analysis
### in R
The `tidytext` package has a convenient `get_sentiments()` function, which essentially uses a merge method to match a pre-defined dictionary to the data in tidy format. We use python to emulate this method in a following section.
Let's first load some packages and set up our working environment. I use the `vroom` package that is part of the `tidyverse` to load in the file with all of the words. It is significantly faster and easier to use than any `read_csv` function that I've used before.
```{r echo = T, results = 'hide', warning = F, message = F}
library(tidyverse)
library(tidytext)
library(vroom)
library(ggthemes)
library(zoo)
library(tidyquant)
library(lubridate)
library(ggfortify)
setwd("C:\\Users\\darre\\Documents\\_econ\\fedspeak\\sentiment analysis")
fed_text_all <- vroom('../text mining/fed_text_all.csv') # read csv
recessions.df <- read.table(textConnection(
"Peak, Trough
1948-11-01, 1949-10-01
1953-07-01, 1954-05-01
1957-08-01, 1958-04-01
1960-04-01, 1961-02-01
1969-12-01, 1970-11-01
1973-11-01, 1975-03-01
1980-01-01, 1980-07-01
1981-07-01, 1982-11-01
1990-07-01, 1991-03-01
2001-03-01, 2001-11-01
2007-12-01, 2009-06-01"),
sep=',',
colClasses = c('Date', 'Date'),
header = TRUE)
```
### Polarity
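Polarity here is the net share of sentiment-matched words in each report, the same quantity quoted in the plot subtitles below:
$$ \text{polarity} = \frac{\text{positive} - \text{negative}}{\text{positive} + \text{negative}} $$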
We find the polarity from our data by report with the `inner_join` method in *Text Mining in R*:
```{r warning = F, message = F}
fed_sentiment <-
fed_text_all %>%
inner_join(get_sentiments("loughran")) %>% # or bing
# inner_join(get_sentiments("bing")) %>%
count(date, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = (positive - negative)/(positive + negative)) %>%
mutate(date = as.Date(date, format = "%m/%d/%Y")) %>%
filter(sentiment != 1) %>%
filter(date != "2015-07-01") %>%
mutate(sent_ma = rollmean(sentiment, k = 3, fill = NA)) %>%
select(date, sentiment, sent_ma) %>%
pivot_longer(-date,
names_to = 'transformation',
values_to = 'value') %>%
mutate(transformation = as_factor(transformation))
```
Much of this is based on Len Kiefer's code, which is linked to in an earlier part of this post. A few key differences:
* I use the [Loughran-McDonald](https://sraf.nd.edu/textual-analysis/resources/) dictionary instead of the Bing dictionary. I compare the results later --- they are similar, although the finance-specific lexicon seems to fit the context better.
* I substituted `pivot_wider` for `spread`, as `spread` has been superseded.
* The Minneapolis Fed doesn't have data for June 2015 (not sure the reason), so I filter that out.
* I use the `zoo` package to generate rolling means based on quarters, since there is quite a bit of noise.
Here is the plot:
```{r warning = F, message = F}
ggplot(fed_sentiment,
aes(x = date,
y = value,
color = transformation)) +
geom_line(aes()) +
scale_color_fivethirtyeight() +
theme_fivethirtyeight() +
scale_x_date(
#breaks = "5 years",
limits = as.Date(c("1970-01-01","2020-06-01")),
date_labels = "%Y") +
labs(x = "Beige Book Report (~8/year)",
y = "polarity",
title = "Sentiment in Federal Reserve Beige Book",
subtitle = "customized Loughran lexicon\npolarity = (positive-negative)/(positive+negative)",
caption = "Source: Federal Reserve Bank of Minneapolis\nShaded areas NBER recessions") +
geom_rect(data=recessions.df,
inherit.aes = F,
aes(xmin = Peak,
xmax = Trough,
ymin = -Inf,
ymax = +Inf),
fill='darkgray',
alpha=0.5)
```
Even this raw data, without any adjustments, appears to track recessions well. It's not clear if this is a leading indicator, lagging indicator, or neither. Note that the moving averages may lose some of the power in terms of matching recessions.
We can also plot the sentiment by bank using `facet_wrap` in `ggplot2`. This yields interesting results: the authors at the different Federal Reserve banks write on different scales, which changes the composition of the data when each bank (and the national summary) is given equal weight.
```{r warning = F, echo = F, message = F}
fed_bank_plot <-
fed_text_all %>%
inner_join(get_sentiments("loughran")) %>% # or bing
# inner_join(get_sentiments("bing")) %>%
count(date, bank, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = (positive - negative)/(positive + negative)) %>%
mutate(date = as.Date(date, format = "%m/%d/%Y")) %>%
filter(sentiment != 1) %>%
filter(date != "2015-07-01")
ggplot(fed_bank_plot,
aes(x = date, y = sentiment)) +
geom_line() +
scale_color_fivethirtyeight() +
theme_fivethirtyeight() +
# scale_x_date(limits = as.Date(c("1970-01-01","2020-01-01")),
# date_labels = "%Y") +
labs(x = "Beige Book Report (~8/year)",
y = "polarity",
title = "Sentiment in Beige Book by Federal Reserve Bank",
subtitle = "customized Loughran lexicon\npolarity = (positive-negative)/(positive+negative)",
caption = "Source: Federal Reserve Bank of Minneapolis\nShaded areas NBER recessions") +
facet_wrap(~bank, scales = 'free_x', ncol = 5,
labeller = as_labeller(c(
'at' = 'Atlanta', 'bo' = 'Boston',
'ch' = 'Chicago', 'cl' = 'Cleveland',
'da' = 'Dallas', 'kc' = 'Kansas City',
'mi' = 'Minneapolis', 'ny' = 'New York',
'ph' = 'Philadelphia', 'ri' = 'Richmond',
'sf' = 'San Francisco', 'sl' = 'St. Louis',
'su' = 'National Summary')))
```
### Scaling
To adjust for the differences in each bank's Beige Book reports, I applied a simple scaling function. No, seriously, very simple. Each bank's polarity series is centered and scaled by its own mean and standard deviation, so the range of values is larger but comparable across banks.
Here is the code and plot:
```{r warning = F, message = F}
fed_sentiment_bank <-
fed_text_all %>%
inner_join(get_sentiments("loughran")) %>% # or bing
# inner_join(get_sentiments("bing")) %>%
count(report, year, date, bank, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = (positive - negative)/(positive+negative)) %>%
group_by(bank) %>%
mutate(sent_norm = scale(sentiment)) %>%
ungroup() %>%
mutate(date = as.Date(date, format = "%m/%d/%Y")) %>%
filter(sentiment != 1) %>%
filter(date != "2015-07-01") %>%
select(date, bank, sent_norm, sentiment) %>%
pivot_longer(-c(date, bank),
names_to = "transformation",
values_to = "value") %>%
mutate(transformation = as_factor(transformation))
ggplot(fed_sentiment_bank,
aes(x = date, y = value, color = transformation)) +
geom_line() +
theme_fivethirtyeight() +
scale_x_date(
limits = as.Date(c("1970-01-01","2020-06-01")),
date_labels = "%Y") +
scale_color_fivethirtyeight(
name = "Transformation",
labels = c('Scaled Polarity', 'Raw Polarity')) +
labs(x = "Beige Book Report (~8/year)",
y = "polarity",
title = "Sentiment in Federal Reserve Beige Book",
subtitle = "customized Loughran lexicon\npolarity = (positive-negative)/(positive+negative)",
caption = "Source: Federal Reserve Bank of Minneapolis\nShaded areas NBER recessions") +
facet_wrap(~bank, scales = 'free_x', ncol = 5,
labeller = as_labeller(c('at' = 'Atlanta', 'bo' = 'Boston',
'ch' = 'Chicago', 'cl' = 'Cleveland',
'da' = 'Dallas', 'kc' = 'Kansas City',
'mi' = 'Minneapolis', 'ny' = 'New York',
'ph' = 'Philadelphia', 'ri' = 'Richmond',
'sf' = 'San Francisco', 'sl' = 'St. Louis',
'su' = 'National Summary'))) +
geom_rect(data = recessions.df,
inherit.aes = F,
aes(xmin = Peak,
xmax = Trough,
ymin = -Inf,
ymax = +Inf),
fill='darkgray',
alpha=0.5)
```
We can now put the scaling and moving averages together:
```{r warning = F, message = F}
fed_sentiment_scale <-
fed_text_all %>%
inner_join(get_sentiments("loughran")) %>% # or bing
# inner_join(get_sentiments("bing")) %>%
count(report, year, date, bank, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = (positive - negative)/(positive+negative)) %>%
group_by(bank) %>%
mutate(sent_norm = scale(sentiment)) %>%
ungroup() %>%
mutate(date = as.Date(date, format = "%m/%d/%Y")) %>%
filter(sentiment != 1) %>%
filter(date != "2015-07-01") %>%
select(date, sent_norm, bank, sentiment) %>%
group_by(date) %>%
summarize(norm_mean = mean(sent_norm),
sent_mean = mean(sentiment)) %>%
mutate(sent_norm_mean_ma = rollmean(norm_mean,
k = 3,
fill = NA)) %>%
mutate(sent_mean_ma = rollmean(sent_mean,
k = 3,
fill = NA)) %>%
pivot_longer(-date,
names_to = "transformation",
values_to = "value") %>%
mutate(transformation = as_factor(transformation))
ggplot(filter(fed_sentiment_scale,
transformation == "sent_norm_mean_ma" | transformation == "norm_mean" | transformation == 'sent_mean'),
aes(x = date, y = value, color = transformation)) +
geom_line() +
theme_fivethirtyeight() +
scale_x_date(limits = as.Date(c("1970-01-01","2020-06-01")),
date_labels = "%Y") +
scale_color_fivethirtyeight(
name = "Transformation",
labels = c('Scaled Polarity',
'Raw Polarity',
'Scaled Polarity (3 mo. mvg avg)')) +
labs(x = "Beige Book Report (~8/year)",
y = "polarity",
title = "Sentiment in Federal Reserve Beige Book",
subtitle = "customized Loughran lexicon\npolarity = (positive-negative)/(positive+negative)",
caption = "Source: Federal Reserve Bank of Minneapolis\nShaded areas NBER recessions") +
geom_rect(data = recessions.df,
inherit.aes = F,
aes(xmin = Peak,
xmax = Trough,
ymin = -Inf,
ymax = +Inf),
fill='darkgray',
alpha=0.5) +
theme(axis.title = element_text())
```
Are the transformations we applied actually any better at tracking recessions? Comparing against GDP growth rates (which we do in the next section) will give more clues about how this quantification of the Beige Book can be used. The scaled polarity tracks the 2007-09 Great Recession as a more protracted and more serious recession than those in the past, which matches conventional wisdom and the data. The moving averages (unsurprisingly) remove some of the noise without losing the general pattern. COVID-19 also has clear negative impacts on Beige Book sentiment. From eyeballing, it seems that the transformations smooth out some issues with the data.
### in Python
I downloaded the Loughran-McDonald Sentiment Word List as a `csv`, which is [available here](https://sraf.nd.edu/textual-analysis/resources/). Tim Loughran and Bill McDonald include other resources, such as linked papers and a parser for assigning sentiments to each file in a folder; however, our data is not cleanly separated into files (although it could be). I resort to constructing the dictionary as a tidy dataframe.
### Dictionary Construction
This chunk separates different columns of the word list and then appends them into our preferred dictionary format that we can use for a join.
```{python eval = T}
lm_dict = (pd.read_csv('C:\\Users\\darre\\Documents\\_econ\\fedspeak\\sentiment analysis\\loughran_mcdonald.csv', header = 0)
           .fillna('')
           .apply(lambda x: x.astype(str))
           .apply(lambda x: x.str.lower())
           )

# build one tidy word/sentiment dataframe from the six sentiment columns
sentiment_columns = ['positive', 'negative', 'uncertainty',
                     'weak_modal', 'strong_modal', 'constraining']
lm_dict = pd.concat(
    [pd.DataFrame(list(filter(None, lm_dict[s].tolist())), columns = ['word'])
       .assign(sentiment = s)
     for s in sentiment_columns],
    ignore_index = True)
lm_dict.head(20)
```
### Get Sentiments
I use `merge` to join our data and our dictionary to create a column with our assigned sentiments.
```{python}
def get_sentiment(df,              # dataframe of words
                  sentiment_dict   # dataframe of the dictionary you want to use
                  ):
    """
    Joins the tidy word dataframe with a sentiment dictionary
    and sorts the result by report.
    """
    return (df
            .merge(sentiment_dict)
            .sort_values(by = ['report'], ignore_index = False))
```
Here, I calculate the polarity score for each report.
```{python}
fed_sentiment = get_sentiment(fed_text_all, lm_dict)
fed_sentiment = (fed_sentiment
                 .groupby(['report', 'date', 'url', 'year', 'month', 'bank', 'sentiment'])
                 .count()
                 .unstack(-1, fill_value = 0)
                 )
# get rid of some wacky indexing
fed_sentiment.reset_index(inplace = True)
# rename columns
fed_sentiment.columns = ['report', 'date', 'url', 'year', 'month',
                         'bank', 'constraining', 'negative', 'positive', 'strong_modal',
                         'uncertainty', 'weak_modal']
fed_sentiment = (fed_sentiment
                 .drop(['url'], axis = 1)
                 .assign(polarity = lambda x: (x['positive'] - x['negative']) /
                                              (x['positive'] + x['negative']))
                 )
fed_sentiment.head(20)
```
## The BB Index vs Economic Indicators
Next, I compare the BB index that I have constructed with the steps above against economic indicators. My hypothesis is that the BB index can be used as a helpful quantitative indicator with some qualitative dimensions, e.g. understanding inflation or liquidity. Furthermore, the BB index may be useful for understanding changes across Federal Reserve districts, for which not a lot of data exists. Looking up data segmented by FRB district on FRED returns limited results (versus data segmented by state, for example).
### GDP Growth
First, I collect the data from FRED using the `tidyquant` package, which is incredibly versatile and integrated with the `tidyverse` ecosystem. I download two different measures for GDP that are similar. [`A191RL1Q225SBEA`](https://fred.stlouisfed.org/series/A191RL1Q225SBEA) is Real GDP with units of Percent Change from Preceding Period, Seasonally Adjusted Annual Rate. It is released quarterly. [`A191RO1Q156NBEA`](https://fred.stlouisfed.org/series/A191RO1Q156NBEA) is Real GDP with units of Percent Change from Quarter One Year Ago, Seasonally Adjusted, also released quarterly. I name the two series `pch` and `pca`, respectively.
```{r warning = F}
## ---- INDICATOR COMPARISON
# US real GDP
gdp_tickers <- c("A191RL1Q225SBEA", "A191RO1Q156NBEA")
gdp <- tq_get(gdp_tickers, get = "economic.data", from = "1970-01-01") %>%
mutate(series = case_when(symbol == "A191RL1Q225SBEA" ~ "gdp_pch",
symbol == "A191RO1Q156NBEA" ~ "gdp_pca")) %>%
select(-symbol) %>%
rename(value = price)
```
Second, I scale the GDP estimates, which is important because the BB index is also scaled. This makes comparison on the same graph easier without using multiple y-axes, although it becomes less meaningful in terms of real percentage estimates. Of course, it is possible to "de-scale" these data later to obtain the GDP estimate in percent. The data that is used in the regression is not scaled, but the following chunk of code is included because it is used in visualizations. The data and date transformations are discussed in more detail in the next step.
```{r warning = F}
gdp_scale <- gdp %>%
group_by(series) %>%
mutate(value = scale(value)) %>%
ungroup()
sent_gdp_scale <-
fed_sentiment_scale %>%
filter(transformation == "sent_norm_mean_ma" | transformation == "norm_mean") %>%
mutate(series = case_when(transformation == "norm_mean" ~ "polarity",
transformation == "sent_norm_mean_ma" ~ "polarity_ma")) %>%
select(-transformation) %>%
mutate(quarter = quarter(date,
with_year = T,
fiscal_start = 1)) %>%
mutate(q_date = as.Date(as.yearqtr(as.character(quarter),
format = "%Y.%q"))) %>%
group_by(quarter, series) %>%
mutate(q_value = mean(value)) %>%
distinct(q_value, .keep_all = T) %>%
ungroup() %>%
select(-value, -date, -quarter) %>%
rename(date = q_date) %>%
rename(value = q_value) %>%
bind_rows(gdp_scale)
```
Third, I perform some transformations on the GDP data and merge it with the BB index data. Most importantly, the quarterly dates on the BB index have to be matched with the quarterly dates on the GDP data. This is done with the `lubridate::quarter()` function, and the conversion back to R date objects is done with `zoo::as.yearqtr()`. GDP data is released quarterly, so I match the BB index to GDP quarters by taking the mean of the BB index values within each quarter.
```{r warning = F}
sent_gdp <-
fed_sentiment_scale %>%
filter(transformation == "sent_norm_mean_ma" | transformation == "norm_mean") %>%
mutate(series = case_when(transformation == "norm_mean" ~ "polarity",
transformation == "sent_norm_mean_ma" ~ "polarity_ma")) %>%
select(-transformation) %>%
mutate(quarter = quarter(date,
with_year = T,
fiscal_start = 1)) %>%
mutate(q_date = as.Date(as.yearqtr(as.character(quarter),
format = "%Y.%q"))) %>%
group_by(quarter, series) %>%
mutate(q_value = mean(value)) %>%
distinct(q_value, .keep_all = T) %>%
ungroup() %>%
select(-value, -date, -quarter) %>%
rename(date = q_date) %>%
rename(value = q_value) %>%
bind_rows(gdp)
```
Fourth, I convert the tibble with the GDP data and the BB index into the wide format so I can perform a regression.
```{r warning = F}
sent_gdp_wide <-
sent_gdp %>%
pivot_wider(
names_from = series,
values_from = value)
```
Finally, I run the regression. Obviously, this is an extreme simplification of understanding how GDP and the BB index co-move. However, I'm not looking for detailed information on how that happens, simply that there is some correlation between the two datasets. I choose the `polarity_ma` (moving average index) and the `gdp_pca` (percent change year over year) measures because these generate the highest $R^2$ values.
```{r warning = F}
lm_ma_gdp <- lm(polarity_ma ~ gdp_pca,
data = sent_gdp_wide)
summary(lm_ma_gdp)
autoplot(lm_ma_gdp)
ggplot(select(sent_gdp_wide, date, polarity_ma, gdp_pca),
aes(x = polarity_ma, y = gdp_pca)) +
geom_point() +
theme_fivethirtyeight() +
geom_smooth(method = 'lm') +
labs(x = 'Polarity (3 mo. mvg avg)',
y = 'GDP (% change yoy)',
title = 'Scatter Plot of Beige Book Sentiment Regression') +
theme(axis.title = element_text())
```
There is some evidence that the residuals may not be normally distributed, but with such a simple model, it's unclear if a different model would make much of an improvement. It appears that GDP is a significant predictor (with $p = 1.26*10^{-15}$) of the BB index measures; the BB index is also a significant predictor of GDP. With just the two variables, there is little to be said about the direction of effect. The models explain nearly $28\%$ of the variation of the data, which is a decent chunk, considering that the BB index is a constructed measure. I've also included the scatter plot for convenience. Originally, I thought it would be helpful to include lagged regressions, but it appears that the GDP data and BB index match closely on date.
```{r warning = F, message = F}
ggplot(filter(sent_gdp_scale, series == 'polarity' | series == 'gdp_pca'),
aes(x = date, y = value, color = series)) +
geom_line() +
theme_fivethirtyeight() +
scale_color_fivethirtyeight(
name = '',
labels = c('GDP Growth yoy', ' Beige Book Index')
) +
scale_x_date(limits = as.Date(c("1970-01-01","2020-06-01")),
date_labels = "%Y") +
labs(x = "Beige Book Report (~8/year)",
y = "polarity",
title = "Beige Book Sentiment and GDP Growth (Scaled)",
subtitle = "customized Loughran lexicon\npolarity = (positive-negative)/(positive+negative)",
caption = "Source: Federal Reserve Bank of Minneapolis\nShaded areas NBER recessions") +
geom_rect(data = recessions.df,
inherit.aes = F,
aes(xmin = Peak,
xmax = Trough,
ymin = -Inf,
ymax = +Inf),
fill='darkgray',
alpha=0.5)+
theme(axis.title = element_text())
```
## Forecasting
I first attempted to use the `fbprophet` package for some (very light and very automated) forecasting, but it didn't work well for this purpose, as business cycles are of variable length. Instead, I used the [`nowcasting`](https://cran.r-project.org/web/packages/nowcasting/nowcasting.pdf) package, which is available on CRAN and uses a dynamic factor method for estimation based on [Giannone, Reichlin, and Small (2008)](http://dept.ku.edu/~empirics/Courses/Econ844/papers/Nowcasting%20GDP.pdf).
Our exercise is similar to what the New York Fed does in their [weekly nowcasting report](https://www.newyorkfed.org/research/policy/nowcast.html), although limited access to data constrains our forecast's timeliness. In addition, the specific method chosen for nowcasting from the package does not use Kalman filtering, although other methods in the package do utilize Kalman filtering.
### Dynamic Factor Model
The dynamic factor model specification is given by:
$$ x_t = \mu + \Lambda f_t + \epsilon_t$$ (1)
$$ f_t = \sum_{i=1}^{p} A_i f_{t-i} + B u_t$$ (2)
where $u_t \sim i.i.d. N(0, I_q)$. $x_t$ is a vector of $N$ monthly time series and $f_t$ are common factors.
We use the Expectation-Maximization (EM) method, which the `nowcasting` package authors describe in [this paper](https://journal.r-project.org/archive/2019/RJ-2019-020/RJ-2019-020.pdf) in *The R Journal*. Importantly, the EM method can handle arbitrary patterns of missing values, which is a key issue in using the Beige Book as an indexing measure: there is no fixed pattern to which eight months of the year the Beige Book is released. In addition, the factors do not need to be global; our data has four blocks of factors: global, soft, real, and labor. Note that in the EM model the error term $\epsilon_t$ is defined as an AR(1) process, and we have to set the number of shocks to the factors `q` equal to the number of factors `r`.
We can rewrite equation (1) above by first rewriting $\Lambda$ and $f_t$ as a matrix of our four factors:
$$ \Lambda = \begin{pmatrix}
\Lambda_{S, G} & \Lambda_{S,S} & 0 & 0 \\
\Lambda_{R, G} & 0 & \Lambda_{R,R} & 0 \\
\Lambda_{L, G} & 0 & 0 & \Lambda_{L, L}
\end{pmatrix}$$ (3)
$$ f_t = \begin{pmatrix}
f_t^G\\
f_t^S\\
f_t^R\\
f_t^L \end{pmatrix}$$ (4)
Plugging this into (1), we have
$$x_t = \mu + \begin{pmatrix}
\Lambda_{S, G} & \Lambda_{S,S} & 0 & 0 \\
\Lambda_{R, G} & 0 & \Lambda_{R,R} & 0 \\
\Lambda_{L, G} & 0 & 0 & \Lambda_{L, L}
\end{pmatrix} \begin{pmatrix}
f_t^G \\
f_t^S\\
f_t^R\\
f_t^L \end{pmatrix} + \epsilon_t $$
The package uses the Expectation-Maximization algorithm to estimate the model recursively by maximum likelihood, iterating until the change in the log-likelihood across iterations falls below $10^{-4}$.
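The stopping rule is usually stated as a relative change in the log-likelihood; a common form (this form is my assumption; check the package documentation for the exact criterion used) is
$$ \frac{\lvert \ell(\theta_j) - \ell(\theta_{j-1}) \rvert}{\left( \lvert \ell(\theta_j) \rvert + \lvert \ell(\theta_{j-1}) \rvert \right)/2} < 10^{-4}. $$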
### The Mechanics of Nowcasting
I used the list of variables in the New York Fed's nowcasting model as the baseline for my specification, adding the Beige Book index as `BBPOLAR`. To begin, I load the packages and the data generated earlier.
```{r warning = F, results = 'hide', message = F}
library(tidyverse)
library(vroom)
library(nowcasting)
library(tidyquant)
library(ggthemes)
```
Import the Beige Book data:
```{r warning = F, results = 'hide', message = F}
## -- data import
# import sentiments
sent_gdp_month <- vroom("C:\\Users\\darre\\Documents\\_econ\\fedspeak\\sentiment analysis\\sent_gdp_month.csv")
sent_gdp_month <-
sent_gdp_month %>%
rename(symbol = series,
price = value)
```
Next, I import the data for the nowcasting model from FRED using the `tidyquant` package, which I've recently discovered with `fredr`'s demise after `R 4.0.0` was released.
```{r warning = F, message = F}
# use tidyquant to get all the indicators
tickers <- c('PAYEMS', # payroll employment
'JTSJOL', # job openings
'CPIAUCSL', # CPI inflation urban
'DGORDER', # durable goods orders
'RSAFS', # advanced retail sales, retail and food
'UNRATE', # unemployment rate
'HOUST', # new housing starts
'INDPRO', # industrial production index
'DSPIC96', # real disposable personal income
'BOPTEXP', # BOP exports
'BOPTIMP', # BOP imports
'TTLCONS', # total construction spending
'IR', # import price index
'CPILFESL', # CPI less food energy
'PCEPILFE', # PCE less food energy
'PCEPI', # PCE price index
'PERMIT', # building permits
'TCU', # capacity utilization
'BUSINV', # business inventories
'ULCNFB', # unit labor cost
'IQ', # export price index
'GACDISA066MSFRBNY', # empire state mfg index
'GACDFSA066MSFRBPHI', # philly fed mfg index
'PCEC96', # real consumption spending
'GDPC1' # real GDP
)
factors <- tq_get(tickers, get = 'economic.data', from = '1970-01-01')
factors
```
I bind the Beige Book data to the factors imported from FRED, keeping only the moving-average series. The moving average is helpful because it provides a value for every month, instead of following the irregular schedule on which the Fed releases the Beige Book.
```{r warning = F, results = 'hide', message = F}
# keep only moving averages
factors_pol <- factors %>%
bind_rows(filter(sent_gdp_month, symbol != 'polarity'))
```
There is quite a bit of setup to complete before the `nowcasting::nowcast()` function can be used. First, I lag GDP so that the observation for `GDPC1` appears in the third month of every quarter; `tidyquant` imports the data with GDP in the first month of the quarter, which `nowcasting::Bpanel()` doesn't recognize. I also remove the date column, because creating the time series object will place the dates on the index of the data.
```{r}
# lag GDP properly and make wider
base_bb <-
factors_pol %>%
pivot_wider(
names_from = symbol,
values_from = price) %>%
mutate(GDPC1 = lag(GDPC1, 2)) %>%
rename(BBPOLAR = polarity_ma) %>%
select(-date)
base_bb
```
Next, I create the time series object with monthly frequency from January 1970 to June 2020.
```{r}
# make times series object
base_ts_bb <- ts(base_bb,
start = c(1970, 1),
end = c(2020,6),
frequency = 12)
```
I set up the parameters for the panel-balancing function and for the nowcast function used later. The `trans_bb` list gives the transformation applied to each series; there is more information about these transformations in the `nowcasting` package documentation. Here, 0 means the original series is preserved, 1 is the monthly rate of change, 2 is the monthly difference, 6 is the yearly rate of change, and 7 is the quarterly rate of change. The frequency list is self-explanatory: each series is either quarterly or monthly. Blocks are primarily taken from the `nowcasting::NYFED` data, but the `BBPOLAR` series is assigned to the global and soft blocks, in line with how the Empire State and Philadelphia Fed manufacturing indexes are coded.
```{r warning = F, message = F, results = 'hide'}
# set up parameters
trans_bb <- c(2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 2, 2, 1, 6, 1, 0, 0, 1, 7, 0)
frequency_bb <- c(12, 12, 12, 12, 12, 12, 12, 12,
12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
4, 12, 12, 12, 12, 4, 12)
blocks_bb <- tibble::tribble(~Global, ~Soft, ~Real, ~Labor,
1, 0, 0, 1,
1, 0, 0, 1,
1, 0, 0, 0,
1, 0, 1, 0,
1, 0, 1, 0,
1, 0, 0, 1,
1, 0, 1, 0,
1, 0, 1, 0,
1, 0, 1, 0,
1, 0, 1, 0,
1, 0, 1, 0,
1, 0, 1, 0,
1, 0, 0, 0,
1, 0, 0, 0,
1, 0, 0, 0,
1, 0, 0, 0,
1, 0, 1, 0,
1, 0, 1, 0,
1, 0, 1, 0,
1, 0, 0, 1,
1, 0, 0, 0,
1, 1, 0, 0,
1, 1, 0, 0,
1, 0, 1, 0,
1, 0, 1, 0,
1, 1, 0, 0
)
```
I use `nowcasting::Bpanel()` to balance the panels, which is useful for involving the transformation in the panel balancing itself. I set `na.prop = 1`, since there is some `NA` data at the beginning of the time period, which I don't want to be excluded but also don't want to be replaced by moving averages (which is what happens when `NA.replace = T`). The function returns `NULL` if the process is completed correctly.
```{r}
## -- balance panels using nowcasting::Bpanel()
base_ts_balance_bb <- Bpanel(base = base_ts_bb,
trans = trans_bb,
NA.replace = F,
na.prop = 1)
```
To complete the nowcasting process, I use the `nowcasting::nowcast()` function with the Expectation-Maximization algorithm described above. The formula relates `GDPC1` to all of the other variables. For the EM method, $r = q = 1$ by construction, i.e. the model is restricted such that the number of common factors $r$ equals the number of common shocks $q$. The function treats this as an AR(1) process with $p = 1$. While computing the estimates, the function prints the iterative log-likelihood, as shown below.
```{r}
## -- nowcast
gdp_bb_nowcastEM <- nowcasting::nowcast(formula = GDPC1 ~ .,