---
title: "Fedspeak Documentation"
author: "Darren Chang"
date: "`r Sys.Date()`"
output:
  pdf_document:
    latex_engine: xelatex
    toc: true
    toc_depth: 3
    number_sections: true
    keep_md: true
urlcolor: blue
---
```{r setup, include = F}
library(reticulate)
use_python('C:/Users/darre/Python/Python37/python.exe')
```
```{python include = F}
import pandas as pd
import numpy as np
import os
os.chdir('C:/Users/darre/Documents/_econ/fedspeak/text mining')
```
\newpage
## Introduction
Here is the link to the [Beige Book](https://www.federalreserve.gov/monetarypolicy/beige-book-default.htm), a qualitative report that the Federal Reserve releases eight times a year. In past decades, the Beige Book was released as often as once a month. Each Federal Reserve district writes its own report, and a National Summary is prepared by Research Associates at a district bank on a rotating basis.
### Working
More information on the Beige Book and previous forecasting methods to come.
## Text Mining
Text mining was completed in `Python 3.7` using `BeautifulSoup`. Set up by importing `pandas`, `BeautifulSoup`, and `requests`.
```{python eval = F}
## ---- setup
import pandas as pd
import numpy as np
import os #setwd
import time #timing
# scraping
from bs4 import BeautifulSoup
from contextlib import closing
import requests # website requests
from requests.exceptions import RequestException
from requests import get
```
I scrape the [Minneapolis Fed's Beige Book Archive](https://www.minneapolisfed.org/region-and-community/regional-economic-indicators/beige-book-archive), which hosts `html` files for each Beige Book back to 1970. The links to each page to be scraped are in the `links.csv` file in this repo. Import it as a dataframe.
```{python eval = T}
links = pd.read_csv('~/_econ/fedspeak/text mining/links.csv')
links.head(20)
links.tail(20)
```
Next, I define the scraping functions, which iterate through the links, open each URL, and return all of the text in `<p>` tags. Much of the code for this portion was taken from Real Python's [guide to web scraping](https://realpython.com/python-web-scraping-practical-introduction/).
```{python eval = F}
## ---- scraping
# functions for getting URLs, with error logging
def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content type of the response is some kind of HTML/XML, return
    the `<p>` elements; otherwise return None.
    """
    try:
        with closing(get(url, stream = True)) as resp:
            if is_good_response(resp):
                soup = BeautifulSoup(resp.content, 'html.parser')
                results = soup.find_all('p')  # keep only paragraph tags
                return results
            else:
                return None
    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    content_type = resp.headers['Content-Type'].lower()
    return (resp.status_code == 200
            and content_type is not None
            and content_type.find('html') > -1)

def log_error(e):
    """
    Log errors (here, just print them).
    """
    print(e)
```
We define another function to do the actual work of scraping. This returns a dataframe with the metadata about the website we're scraping (date, year, bank, url) merged with lists of lists of the actual text in each page.
```{python eval = F}
# scraping a set of links
def scrape(links  # dataframe of urls and other info
           ):
    """
    Scrapes a set of links into a dataframe.
    Returns a dataframe of raw text stored as lists of lists.
    """
    links_use = links['url'].values.tolist()  # extract urls as a list
    fed_text_raw = pd.DataFrame()  # empty df
    for url in links_use:
        text = simple_get(url)
        df = pd.DataFrame({'url': url, 'text': [text]})
        # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
        fed_text_raw = fed_text_raw.append(df, ignore_index = True)
    fed_text_raw.columns = fed_text_raw.columns.str.strip()  # strip column names
    return fed_text_raw
```
Finally, I scrape the links. The returned data was over 1 GB when saved as a csv file, and the process usually takes about 2.5 hours. I would recommend exporting the result (to csv or whatever file storage format you prefer) to avoid scraping multiple times.
```{python eval = F}
fed_text_raw = scrape(links)
```
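As a minimal sketch of the export step recommended above (the file names here are placeholders, not part of the original code), the result can be written out once and reloaded in later sessions:
```{python eval = F}
# hypothetical export/reload step so the ~2.5-hour scrape only runs once;
# a binary format such as pickle sidesteps the csv issues with lists of lists noted later
fed_text_raw.to_csv('fed_text_raw.csv', index = False)  # plain csv export
fed_text_raw.to_pickle('fed_text_raw.pkl')              # or a pandas pickle
fed_text_raw = pd.read_pickle('fed_text_raw.pkl')       # reload without re-scraping
```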
## Cleaning Data
We come to the unenviable task of cleaning the data, which represents a few million words from 50 years of Beige Book reports. First, we strip the dataset of html tags and numbers. When using `BeautifulSoup`, this is often done with the `get_text()` command; however, because we scraped with `find_all` on the html paragraph tag, `get_text` does not work in this case.
### Preprocessing
The following function uses regular expressions to strip out stray characters, html tags, and other wacky spacing issues.
```{python, eval = F}
## ---- cleaning
import re  # used below; not imported in the earlier setup chunk

def preprocess(df,
               text_field,           # field that has text
               new_text_field_name   # name of field that has normalized text
               ):
    """
    Normalizes a text column by removing characters that do not
    contribute to natural-language meaning (handles, stray symbols,
    urls, leftover tags, and digits).
    """
    df[new_text_field_name] = (df[text_field]
        .apply(lambda elem: re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?",
                                   "", str(elem)))
        .apply(lambda elem: re.sub("<.*?>", "", str(elem)))  # strips out tags
        .apply(lambda elem: re.sub(r"\d+", "", str(elem)))   # strips out digits
    )
    return df
```
### Tokenization
We want to convert our bag of words for each report into the tidy text format, which Julia Silge and David Robinson define in Chapter 1 of [*Text Mining in R*](https://www.tidytextmining.com/tidytext.html) as:
>We thus define the tidy text format as being **a table with one-token-per-row**. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens. This one-token-per-row structure is in contrast to the ways text is often stored in current analyses, perhaps as strings or in a document-term matrix. For tidy text mining, the **token** that is stored in each row is most often a single word, but can also be an n-gram, sentence, or paragraph. In the tidytext package, we provide functionality to tokenize by commonly used units of text like these and convert to a one-term-per-row format.
You may ask: why use a `tidy` format if we're mining with Python? This is a good question, but here are some of my reasons:
1. `NLTK` is set up for ML-based sentiment analysis, which is not the approach used here.
+ The basic sentiment analysis technique we use involves matching words in our data against a predefined dictionary.
+ The `tidytext` package preloads the Loughran-McDonald dictionary, which is useful for analyzing finance (and economics-related) texts. More on this, and on the use of different lexicons, later.
2. CSV files struggle with lengthy lists of lists.
+ Sometimes there are display errors.
+ Sometimes there are errors where the list is too many characters.
+ The processing speed can be quite slow.
3. I'm personally more familiar with visualization in `ggplot2`.
4. I do attempt to use `python` to repeat some of the data processing that was first completed with `R`. This is clearly indicated (and the reader can also ascertain which language is used by the programming syntax).
Here, we define a function for word tokenization, which borrows heavily from [Michelle Fullwood's](https://github.com/michelleful) [project](https://github.com/michelleful/TidyTextMining-Python) converting *Text Mining in R* to Python. I have modified the function to merge the tokens with the original list of links on the report URL.
```{python, eval = F}
import nltk  # provides nltk.word_tokenize, passed in as the tokenizer below

# word tokenization
def unnest(df,                     # line-based dataframe
           column_to_tokenize,     # name of the column with the text
           new_token_column_name,  # what you want the column of words to be called
           tokenizer_function,     # which tokenizer to use (nltk.word_tokenize)
           original_list):         # original list of data
    """
    Unnests words from html and returns a dataframe in long format,
    merged with the original list of links. Word tokenization in tidy text format.
    """
    return (df[column_to_tokenize]
            .apply(str)
            .apply(tokenizer_function)
            .apply(pd.Series)
            .stack()
            .reset_index(level=0)
            .set_index('level_0')
            .rename(columns={0: new_token_column_name})
            .join(df.drop(column_to_tokenize, axis=1), how='left')
            .reset_index(drop=True)
            .merge(original_list, how = 'outer', on = 'url')
            )
```
To complete the preprocessing and tokenization, I apply the functions to the raw data, convert the tokens to lowercase (which simplifies later steps), and save the result as a `csv` file for future use.
```{python, eval = F}
fed_text_raw = preprocess(fed_text_raw, 'text', 'text')
fed_text_all = unnest(fed_text_raw, 'text', 'word', nltk.word_tokenize, links)
fed_text_all['word'] = fed_text_all['word'].str.lower() # convert to lowercase
fed_text_all.to_csv('fed_text_all.csv', index = False) # save as csv
```
### Dask
If you find that `pandas` is too slow for these applications or for your data, you may want to look into `dask`, which uses multi-core processing in addition to other goodies to speed things up for large datasets. Shikhar Chauhan has a blog post about using dask for text pre-processing at [MindOrks](https://medium.com/mindorks/speeding-up-text-pre-processing-using-dask-45cc3ede1366).
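If you go that route, here is a rough sketch of how `dask.dataframe` could slot into this pipeline, assuming the `preprocess()` function defined above and an arbitrary partition count; this was not part of the original workflow:
```{python eval = F}
# hypothetical dask version of the preprocessing step; not used in this project
import dask.dataframe as dd

fed_text_dask = dd.from_pandas(fed_text_raw, npartitions = 8)  # split the dataframe across cores
fed_text_dask = fed_text_dask.map_partitions(
    lambda part: preprocess(part, 'text', 'text'))             # reuse preprocess() on each partition
fed_text_raw = fed_text_dask.compute()                         # collect back into a pandas dataframe
```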
### Stemming
Here's a stemming function that would work on a `tidy` formatted dataframe of words. I didn't end up using stemming (and instead used lemmatization) because I found stemming removed too much --- what should have been words simply became nonsensical stems. [DataCamp](http://datacamp.co/) has a tutorial for both stemming and lemmatization and compares the two methods [here](https://www.datacamp.com/community/tutorials/stemming-lemmatization-python).
```{python, eval = F}
from nltk.stem.snowball import SnowballStemmer  # not imported in the earlier setup chunk

snowball = SnowballStemmer(language = 'english')

def stemming(df,
             column_to_stem,
             stem_function  # which stemming function to use (snowball.stem)
             ):
    """
    Returns the stems of the words in a dataframe column.
    """
    return (df[column_to_stem]
            .apply(stem_function)
            )

stemming(fed_text_all, 'word', snowball.stem)
```
### Lemmatization
Instead of stemming, I lemmatized the text by comparing the verbs in the data to the WordNet corpus in `NLTK`. This reduces a word like "swimming" to "swim," which dictionaries pick up better for sentiment analysis.
```{python, eval = F}
from nltk.stem import WordNetLemmatizer  # not imported in the earlier setup chunk

wordnet_lemmatizer = WordNetLemmatizer()

def lemmatize(df,
              column_to_lemmatize,
              lemmatize_func  # which lemmatizer to use (wordnet_lemmatizer)
              ):
    """
    Returns root words by matching each word against the lemmatizer.
    """
    return (df[column_to_lemmatize]
            .apply(lambda word: lemmatize_func.lemmatize(word, pos = "v"))  # part of speech set to verb
            )

fed_text_all['word'] = lemmatize(fed_text_all, 'word', wordnet_lemmatizer)  # lemmatize
```
### Stopwords
To finish the cleaning process, I remove stop words from the data, which keeps roughly 60 percent of the raw data (the csv file becomes 600 MB). Stop words are words that do not contribute to the meaning of a selection. The list of `NLTK` stop words used in this code is documented in [this resource](https://www.nltk.org/book/ch02.html) for the package.
I add to the stop words list:
1. Left over html tags, such as "pp."
2. A custom list of stop words for the economic context.
- Words such as "maturity" or "work" may be coded as positive in natural language but mean something different in economics.
- Similarly, words such as "gross" (domestic product) or "crude" (oil) may be coded as negative but do not carry such a negative connotation here.
- Len Kiefer's [blog post](http://lenkiefer.com/2018/07/29/beige-ian-statistics/) about this was helpful, and the list is taken from his post.
```{python eval = F}
from nltk.corpus import stopwords  # not imported in the earlier setup chunk

# make custom stop words list
stop_words = stopwords.words("english")
stop_words_html = ['ppwe', 'uspp', 'pwe', 'usp', 'pp', 'usbrp']  # leftover tokens from converting
                                                                 # html with 'p' tags to text
stop_words_econ = ['debt', 'gross', 'crude', 'well', 'maturity', 'work',
                   'marginally', 'leverage']  # stop words for economic usage
stop_words.extend(stop_words_html)
stop_words.extend(stop_words_econ)

# delete stopwords
def remove_stops(df,
                 column_to_process,
                 stop_words_list,     # list of stop words (stop_words)
                 index_reset = True   # whether to reset the index
                 ):
    """
    Removes stop words from a pandas dataframe in tidy (long) format,
    as opposed to the typical wide format with lists of words.
    """
    if index_reset:
        return (df[~df[column_to_process].isin(stop_words_list)]
                .reset_index(drop = True))  # reset the index
    else:
        return df[~df[column_to_process].isin(stop_words_list)]

# remove_stops(fed_text_test, 'word', stop_words)  # test code
fed_text_all = remove_stops(fed_text_all, 'word', stop_words)
```
I saved the cleaned data as a csv to read it later for sentiment analysis.
```{python eval = F}
fed_text_all.to_csv('fed_text_all.csv', index = False) # save as csv
```
But before that, some word frequency summary statistics:
```{python eval = T}
# summary statistics
fed_text_all = pd.read_csv('~/_econ/fedspeak/text mining/fed_text_all.csv') # read csv
fed_text_all['word'].value_counts().head(20)
```
It's unsurprising to me that "report" and "district" are mentioned often; after all, the Beige Book is a report grouped by district. It also appears that real economic activity matters more than financial activity, based on words like "demand" and "sales." This isn't surprising either: the Fed's concern with financial activity is more recent, as markets have grown more important in the wake of the Great Recession.
## Sentiment Analysis
### in R
The `tidytext` package has a convenient `get_sentiments()` function, which essentially uses a merge method to match a pre-defined dictionary to the data in tidy format. We use python to emulate this method in a following section.
Let's first load some packages and set up our working environment. I use the `vroom` package that is part of the `tidyverse` to load in the file with all of the words. It is significantly faster and easier to use than any `read_csv` function that I've used before.
```{r echo = T, results = 'hide', warning = F, message = F}
library(tidyverse)
library(tidytext)
library(vroom)
library(ggthemes)
library(zoo)
library(tidyquant)
library(lubridate)
library(ggfortify)
setwd("C:\\Users\\darre\\Documents\\_econ\\fedspeak\\sentiment analysis")
fed_text_all <- vroom('../text mining/fed_text_all.csv') # read csv
recessions.df <- read.table(textConnection(
"Peak, Trough
1948-11-01, 1949-10-01
1953-07-01, 1954-05-01
1957-08-01, 1958-04-01
1960-04-01, 1961-02-01
1969-12-01, 1970-11-01
1973-11-01, 1975-03-01
1980-01-01, 1980-07-01
1981-07-01, 1982-11-01
1990-07-01, 1991-03-01
2001-03-01, 2001-11-01
2007-12-01, 2009-06-01"),
sep=',',
colClasses = c('Date', 'Date'),
header = TRUE)
```
### Polarity
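Polarity here is the net share of sentiment-matched words in each report, the same quantity quoted in the plot subtitles below:
$$ \text{polarity} = \frac{\text{positive} - \text{negative}}{\text{positive} + \text{negative}} $$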
We find the polarity from our data by report with the `inner_join` method in *Text Mining in R*:
```{r warning = F, message = F}
fed_sentiment <-
fed_text_all %>%
inner_join(get_sentiments("loughran")) %>% # or bing
# inner_join(get_sentiments("bing")) %>%
count(date, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = (positive - negative)/(positive + negative)) %>%
mutate(date = as.Date(date, format = "%m/%d/%Y")) %>%
filter(sentiment != 1) %>%
filter(date != "2015-07-01") %>%
mutate(sent_ma = rollmean(sentiment, k = 3, fill = NA)) %>%
select(date, sentiment, sent_ma) %>%
pivot_longer(-date,
names_to = 'transformation',
values_to = 'value') %>%
mutate(transformation = as_factor(transformation))
```
Much of this is based on Len Kiefer's code, which is linked to in an earlier part of this post. A few key differences:
* I use the [Loughran-McDonald](https://sraf.nd.edu/textual-analysis/resources/) dictionary instead of the Bing dictionary. I compare the results later --- they are similar, although the finance-specific lexicon seems to fit the context better.
* I substituted `pivot_wider` for `spread`, as `spread` has been superseded.
* The Minneapolis Fed doesn't have data for June 2015 (not sure the reason), so I filter that out.
* I use the `zoo` package to generate rolling means based on quarters, since there is quite a bit of noise.
Here is the plot:
```{r warning = F, message = F}
ggplot(fed_sentiment,
aes(x = date,
y = value,
color = transformation)) +
geom_line(aes()) +
scale_color_fivethirtyeight() +
theme_fivethirtyeight() +
scale_x_date(
#breaks = "5 years",
limits = as.Date(c("1970-01-01","2020-06-01")),
date_labels = "%Y") +
labs(x = "Beige Book Report (~8/year)",
y = "polarity",
title = "Sentiment in Federal Reserve Beige Book",
subtitle = "customized Loughran lexicon\npolarity = (positive-negative)/(positive+negative)",
caption = "Source: Federal Reserve Bank of Minneapolis\nShaded areas NBER recessions") +
geom_rect(data=recessions.df,
inherit.aes = F,
aes(xmin = Peak,
xmax = Trough,
ymin = -Inf,
ymax = +Inf),
fill='darkgray',
alpha=0.5)
```
Even this raw data, without any adjustments, appears to track recessions well. It's not clear if this is a leading indicator, lagging indicator, or neither. Note that the moving averages may lose some of the power in terms of matching recessions.
We can also plot the sentiment by bank using `facet_wrap` in `ggplot2`. This yields interesting results: the authors at the different Federal Reserve banks write on different scales, which changes the composition of the data when each bank (and the national summary) is given equal weight.
```{r warning = F, echo = F, message = F}
fed_bank_plot <-
fed_text_all %>%
inner_join(get_sentiments("loughran")) %>% # or bing
# inner_join(get_sentiments("bing")) %>%
count(date, bank, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = (positive - negative)/(positive + negative)) %>%
mutate(date = as.Date(date, format = "%m/%d/%Y")) %>%
filter(sentiment != 1) %>%
filter(date != "2015-07-01")
ggplot(fed_bank_plot,
aes(x = date, y = sentiment)) +
geom_line() +
scale_color_fivethirtyeight() +
theme_fivethirtyeight() +
# scale_x_date(limits = as.Date(c("1970-01-01","2020-01-01")),
# date_labels = "%Y") +
labs(x = "Beige Book Report (~8/year)",
y = "polarity",
title = "Sentiment in Beige Book by Federal Reserve Bank",
subtitle = "customized Loughran lexicon\npolarity = (positive-negative)/(positive+negative)",
caption = "Source: Federal Reserve Bank of Minneapolis\nShaded areas NBER recessions") +
facet_wrap(~bank, scales = 'free_x', ncol = 5,
labeller = as_labeller(c(
'at' = 'Atlanta', 'bo' = 'Boston',
'ch' = 'Chicago', 'cl' = 'Cleveland',
'da' = 'Dallas', 'kc' = 'Kansas City',
'mi' = 'Minneapolis', 'ny' = 'New York',
'ph' = 'Philadelphia', 'ri' = 'Richmond',
'sf' = 'San Francisco', 'sl' = 'St. Louis',
'su' = 'National Summary')))
```
### Scaling
To adjust for the differences in each bank's Beige Book reports, I applied a simple scaling function. No, seriously, very simple. Each bank's polarity series is centered and scaled by its own mean and standard deviation, so the range of values is larger but comparable across banks.
Here is the code and plot:
```{r warning = F, message = F}
fed_sentiment_bank <-
fed_text_all %>%
inner_join(get_sentiments("loughran")) %>% # or bing
# inner_join(get_sentiments("bing")) %>%
count(report, year, date, bank, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = (positive - negative)/(positive+negative)) %>%
group_by(bank) %>%
mutate(sent_norm = scale(sentiment)) %>%
ungroup() %>%
mutate(date = as.Date(date, format = "%m/%d/%Y")) %>%
filter(sentiment != 1) %>%
filter(date != "2015-07-01") %>%
select(date, bank, sent_norm, sentiment) %>%
pivot_longer(-c(date, bank),
names_to = "transformation",
values_to = "value") %>%
mutate(transformation = as_factor(transformation))
ggplot(fed_sentiment_bank,
aes(x = date, y = value, color = transformation)) +
geom_line() +
theme_fivethirtyeight() +
scale_x_date(
limits = as.Date(c("1970-01-01","2020-06-01")),
date_labels = "%Y") +
scale_color_fivethirtyeight(
name = "Transformation",
labels = c('Scaled Polarity', 'Raw Polarity')) +
labs(x = "Beige Book Report (~8/year)",
y = "polarity",
title = "Sentiment in Federal Reserve Beige Book",
subtitle = "customized Loughran lexicon\npolarity = (positive-negative)/(positive+negative)",
caption = "Source: Federal Reserve Bank of Minneapolis\nShaded areas NBER recessions") +
facet_wrap(~bank, scales = 'free_x', ncol = 5,
labeller = as_labeller(c('at' = 'Atlanta', 'bo' = 'Boston',
'ch' = 'Chicago', 'cl' = 'Cleveland',
'da' = 'Dallas', 'kc' = 'Kansas City',
'mi' = 'Minneapolis', 'ny' = 'New York',
'ph' = 'Philadelphia', 'ri' = 'Richmond',
'sf' = 'San Francisco', 'sl' = 'St. Louis',
'su' = 'National Summary'))) +
geom_rect(data = recessions.df,
inherit.aes = F,
aes(xmin = Peak,
xmax = Trough,
ymin = -Inf,
ymax = +Inf),
fill='darkgray',
alpha=0.5)
```
We can now put the scaling and moving averages together:
```{r warning = F, message = F}
fed_sentiment_scale <-
fed_text_all %>%
inner_join(get_sentiments("loughran")) %>% # or bing
# inner_join(get_sentiments("bing")) %>%
count(report, year, date, bank, sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = (positive - negative)/(positive+negative)) %>%
group_by(bank) %>%
mutate(sent_norm = scale(sentiment)) %>%
ungroup() %>%
mutate(date = as.Date(date, format = "%m/%d/%Y")) %>%
filter(sentiment != 1) %>%
filter(date != "2015-07-01") %>%
select(date, sent_norm, bank, sentiment) %>%
group_by(date) %>%
summarize(norm_mean = mean(sent_norm),
sent_mean = mean(sentiment)) %>%
mutate(sent_norm_mean_ma = rollmean(norm_mean,
k = 3,
fill = NA)) %>%
mutate(sent_mean_ma = rollmean(sent_mean,
k = 3,
fill = NA)) %>%
pivot_longer(-date,
names_to = "transformation",
values_to = "value") %>%
mutate(transformation = as_factor(transformation))
ggplot(filter(fed_sentiment_scale,
transformation == "sent_norm_mean_ma" | transformation == "norm_mean" | transformation == 'sent_mean'),
aes(x = date, y = value, color = transformation)) +
geom_line() +
theme_fivethirtyeight() +
scale_x_date(limits = as.Date(c("1970-01-01","2020-06-01")),
date_labels = "%Y") +
scale_color_fivethirtyeight(
name = "Transformation",
labels = c('Scaled Polarity',
'Raw Polarity',
'Scaled Polarity (3 mo. mvg avg)')) +
labs(x = "Beige Book Report (~8/year)",
y = "polarity",
title = "Sentiment in Federal Reserve Beige Book",
subtitle = "customized Loughran lexicon\npolarity = (positive-negative)/(positive+negative)",
caption = "Source: Federal Reserve Bank of Minneapolis\nShaded areas NBER recessions") +
geom_rect(data = recessions.df,
inherit.aes = F,
aes(xmin = Peak,
xmax = Trough,
ymin = -Inf,
ymax = +Inf),
fill='darkgray',
alpha=0.5) +
theme(axis.title = element_text())
```
Are the transformations we applied actually any better at tracking recessions? Comparing against GDP growth rates (which we do in the next section) will give more clues about how this quantification of the Beige Book can be used. The scaled polarity tracks the 2007-09 Great Recession as a more protracted and more serious recession than those in the past, which matches conventional wisdom and the data. The moving averages (unsurprisingly) remove some of the noise without losing the general pattern. COVID-19 also has clear negative impacts on Beige Book sentiment. From eyeballing, it seems that the transformations smooth out some issues with the data.
### in Python
I downloaded the Loughran-McDonald Sentiment Word List as a `csv`, which is [available here](https://sraf.nd.edu/textual-analysis/resources/). Tim Loughran and Bill McDonald include other resources, such as linked papers and a parser for assigning sentiments to each file in a folder; however, our data is not cleanly separated into files (although it could be). I resort to constructing the dictionary as a tidy dataframe.
### Dictionary Construction
This chunk separates different columns of the word list and then appends them into our preferred dictionary format that we can use for a join.
```{python eval = T}
lm_dict = (pd.read_csv('C:\\Users\\darre\\Documents\\_econ\\fedspeak\\sentiment analysis\\loughran_mcdonald.csv', header = 0)
           .fillna('')
           .apply(lambda x: x.astype(str))
           .apply(lambda x: x.str.lower())
           )

# build one tidy word/sentiment dataframe from the six sentiment columns
sentiment_columns = ['positive', 'negative', 'uncertainty',
                     'weak_modal', 'strong_modal', 'constraining']
lm_dict = pd.concat(
    [pd.DataFrame(list(filter(None, lm_dict[s].tolist())), columns = ['word'])
       .assign(sentiment = s)
     for s in sentiment_columns],
    ignore_index = True)
lm_dict.head(20)
```
### Get Sentiments
I use `merge` to join our data and our dictionary to create a column with our assigned sentiments.
```{python}
def get_sentiment(df,              # dataframe of words
                  sentiment_dict   # dataframe of the dictionary you want to use
                  ):
    """
    Joins the tidy word dataframe with a sentiment dictionary
    and sorts the result by report.
    """
    return (df
            .merge(sentiment_dict)
            .sort_values(by = ['report'], ignore_index = False))
```
Here, I calculate the polarity score for each report.
```{python}
fed_sentiment = get_sentiment(fed_text_all, lm_dict)
fed_sentiment = (fed_sentiment
                 .groupby(['report', 'date', 'url', 'year', 'month', 'bank', 'sentiment'])
                 .count()
                 .unstack(-1, fill_value = 0)
                 )
# get rid of some wacky indexing
fed_sentiment.reset_index(inplace = True)
# rename columns
fed_sentiment.columns = ['report', 'date', 'url', 'year', 'month',
                         'bank', 'constraining', 'negative', 'positive', 'strong_modal',
                         'uncertainty', 'weak_modal']
fed_sentiment = (fed_sentiment
                 .drop(['url'], axis = 1)
                 .assign(polarity = lambda x: (x['positive'] - x['negative']) /
                                              (x['positive'] + x['negative']))
                 )
fed_sentiment.head(20)
```
## The BB Index vs Economic Indicators
Next, I compare the BB index that I have constructed with the steps above against economic indicators. My hypothesis is that the BB index can be used as a helpful quantitative indicator with some qualitative dimensions, e.g. understanding inflation or liquidity. Furthermore, the BB index may be useful for understanding changes across Federal Reserve districts, for which not a lot of data exists. Looking up data segmented by FRB district on FRED returns limited results (versus data segmented by state, for example).
### GDP Growth
First, I collect the data from FRED using the `tidyquant` package, which is incredibly versatile and integrated with the `tidyverse` ecosystem. I download two different measures for GDP that are similar. [`A191RL1Q225SBEA`](https://fred.stlouisfed.org/series/A191RL1Q225SBEA) is Real GDP with units of Percent Change from Preceding Period, Seasonally Adjusted Annual Rate. It is released quarterly. [`A191RO1Q156NBEA`](https://fred.stlouisfed.org/series/A191RO1Q156NBEA) is Real GDP with units of Percent Change from Quarter One Year Ago, Seasonally Adjusted, also released quarterly. I name the two series `pch` and `pca`, respectively.
```{r warning = F}
## ---- INDICATOR COMPARISON
# US real GDP
gdp_tickers <- c("A191RL1Q225SBEA", "A191RO1Q156NBEA")
gdp <- tq_get(gdp_tickers, get = "economic.data", from = "1970-01-01") %>%
mutate(series = case_when(symbol == "A191RL1Q225SBEA" ~ "gdp_pch",
symbol == "A191RO1Q156NBEA" ~ "gdp_pca")) %>%
select(-symbol) %>%
rename(value = price)
```
Second, I scale the GDP estimates, which is important because the BB index is also scaled. This makes comparison on the same graph easier without using multiple y-axes, although it becomes less meaningful in terms of real percentage estimates. Of course, it is possible to "de-scale" these data later to obtain the GDP estimate in percent. The data that is used in the regression is not scaled, but the following chunk of code is included because it is used in visualizations. The data and date transformations are discussed in more detail in the next step.
```{r warning = F}
gdp_scale <- gdp %>%
group_by(series) %>%
mutate(value = scale(value)) %>%
ungroup()
sent_gdp_scale <-
fed_sentiment_scale %>%
filter(transformation == "sent_norm_mean_ma" | transformation == "norm_mean") %>%
mutate(series = case_when(transformation == "norm_mean" ~ "polarity",
transformation == "sent_norm_mean_ma" ~ "polarity_ma")) %>%
select(-transformation) %>%
mutate(quarter = quarter(date,
with_year = T,
fiscal_start = 1)) %>%
mutate(q_date = as.Date(as.yearqtr(as.character(quarter),
format = "%Y.%q"))) %>%
group_by(quarter, series) %>%
mutate(q_value = mean(value)) %>%
distinct(q_value, .keep_all = T) %>%
ungroup() %>%
select(-value, -date, -quarter) %>%
rename(date = q_date) %>%
rename(value = q_value) %>%
bind_rows(gdp_scale)
```
Third, I perform some transformations on the GDP data and merge it with the BB index data. Most importantly, the quarterly dates on the BB index have to be matched with the quarterly dates on the GDP data. This is done with the `lubridate::quarter()` function, and the conversion back to R date objects is done with `zoo::as.yearqtr()`. GDP data is released quarterly, so I match the BB index to GDP quarters by taking the mean of the BB index values within each quarter.
```{r warning = F}
sent_gdp <-
fed_sentiment_scale %>%
filter(transformation == "sent_norm_mean_ma" | transformation == "norm_mean") %>%
mutate(series = case_when(transformation == "norm_mean" ~ "polarity",
transformation == "sent_norm_mean_ma" ~ "polarity_ma")) %>%
select(-transformation) %>%
mutate(quarter = quarter(date,
with_year = T,
fiscal_start = 1)) %>%
mutate(q_date = as.Date(as.yearqtr(as.character(quarter),
format = "%Y.%q"))) %>%
group_by(quarter, series) %>%
mutate(q_value = mean(value)) %>%
distinct(q_value, .keep_all = T) %>%
ungroup() %>%
select(-value, -date, -quarter) %>%
rename(date = q_date) %>%
rename(value = q_value) %>%
bind_rows(gdp)
```
Fourth, I convert the tibble with the GDP data and the BB index into the wide format so I can perform a regression.
```{r warning = F}
sent_gdp_wide <-
sent_gdp %>%
pivot_wider(
names_from = series,
values_from = value)
```
Finally, I run the regression. Obviously, this is an extreme simplification of understanding how GDP and the BB index co-move. However, I'm not looking for detailed information on how that happens, simply that there is some correlation between the two datasets. I choose the `polarity_ma` (moving average index) and the `gdp_pca` (percent change year over year) measures because these generate the highest $R^2$ values.
```{r warning = F}
lm_ma_gdp <- lm(polarity_ma ~ gdp_pca,
data = sent_gdp_wide)
summary(lm_ma_gdp)
autoplot(lm_ma_gdp)
ggplot(select(sent_gdp_wide, date, polarity_ma, gdp_pca),
aes(x = polarity_ma, y = gdp_pca)) +
geom_point() +
theme_fivethirtyeight() +
geom_smooth(method = 'lm') +
labs(x = 'Polarity (3 mo. mvg avg)',
y = 'GDP (% change yoy)',
title = 'Scatter Plot of Beige Book Sentiment Regression') +
theme(axis.title = element_text())
```
There is some evidence that the residuals may not be normally distributed, but with such a simple model, it's unclear if a different model would make much of an improvement. It appears that GDP is a significant predictor (with $p = 1.26*10^{-15}$) of the BB index measures; the BB index is also a significant predictor of GDP. With just the two variables, there is little to be said about the direction of effect. The models explain nearly $28\%$ of the variation of the data, which is a decent chunk, considering that the BB index is a constructed measure. I've also included the scatter plot for convenience. Originally, I thought it would be helpful to include lagged regressions, but it appears that the GDP data and BB index match closely on date.
```{r warning = F, message = F}
ggplot(filter(sent_gdp_scale, series == 'polarity' | series == 'gdp_pca'),
aes(x = date, y = value, color = series)) +
geom_line() +
theme_fivethirtyeight() +
scale_color_fivethirtyeight(
name = '',
labels = c('GDP Growth yoy', ' Beige Book Index')
) +
scale_x_date(limits = as.Date(c("1970-01-01","2020-06-01")),
date_labels = "%Y") +
labs(x = "Beige Book Report (~8/year)",
y = "polarity",
title = "Beige Book Sentiment and GDP Growth (Scaled)",
subtitle = "customized Loughran lexicon\npolarity = (positive-negative)/(positive+negative)",
caption = "Source: Federal Reserve Bank of Minneapolis\nShaded areas NBER recessions") +
geom_rect(data = recessions.df,
inherit.aes = F,
aes(xmin = Peak,
xmax = Trough,
ymin = -Inf,
ymax = +Inf),
fill='darkgray',
alpha=0.5)+
theme(axis.title = element_text())
```
## Forecasting
I first attempted to use the `fbprophet` package for some (very light and very automated) forecasting, but it didn't work well for this purpose, as business cycles are of variable length. Instead, I used the [`nowcasting`](https://cran.r-project.org/web/packages/nowcasting/nowcasting.pdf) package, which is available on CRAN and uses a dynamic factor method for estimation based on [Giannone, Reichlin, and Small (2008)](http://dept.ku.edu/~empirics/Courses/Econ844/papers/Nowcasting%20GDP.pdf).
Our exercise is similar to what the New York Fed does in their [weekly nowcasting report](https://www.newyorkfed.org/research/policy/nowcast.html), although limited access to data constrains our forecast's timeliness. In addition, the specific method chosen for nowcasting from the package does not use Kalman filtering, although other methods in the package do utilize Kalman filtering.
### Dynamic Factor Model
The dynamic factor model specification is given by:
$$ x_t = \mu + \Lambda f_t + \epsilon_t$$ (1)
$$ f_t = \sum_{i=1}^{p} A_i f_{t-i} + B u_t$$ (2)
where $u_t \sim i.i.d. N(0, I_q)$. $x_t$ is a vector of $N$ monthly time series and $f_t$ are common factors.
We use the Expectation-Maximization (EM) method, which the `nowcasting` package authors describe in [this paper](https://journal.r-project.org/archive/2019/RJ-2019-020/RJ-2019-020.pdf) in *The R Journal*. Importantly, the EM method can handle arbitrary patterns of missing values, which is a key issue in using the Beige Book as an indexing measure: there is no fixed pattern to which eight months of the year the Beige Book is released. In addition, the factors do not need to be global; our data has four blocks of factors: global, soft, real, and labor. Note that in the EM model the error term $\epsilon_t$ is defined as an AR(1) process, and we have to set the number of shocks to the factors `q` equal to the number of factors `r`.
We can rewrite equation (1) above by first rewriting $\Lambda$ and $f_t$ as a matrix of our four factors:
$$ \Lambda = \begin{pmatrix}
\Lambda_{S, G} & \Lambda_{S,S} & 0 & 0 \\
\Lambda_{R, G} & 0 & \Lambda_{R,R} & 0 \\
\Lambda_{L, G} & 0 & 0 & \Lambda_{L, L}
\end{pmatrix}$$ (3)
$$ f_t = \begin{pmatrix}
f_t^G\\
f_t^S\\
f_t^R\\
f_t^L \end{pmatrix}$$ (4)
Plugging this into (1), we have
$$x_t = \mu + \begin{pmatrix}
\Lambda_{S, G} & \Lambda_{S,S} & 0 & 0 \\
\Lambda_{R, G} & 0 & \Lambda_{R,R} & 0 \\
\Lambda_{L, G} & 0 & 0 & \Lambda_{L, L}
\end{pmatrix} \begin{pmatrix}
f_t^G \\
f_t^S\\
f_t^R\\
f_t^L \end{pmatrix} + \epsilon_t $$
The package uses the Expectation-Maximization algorithm to estimate the model recursively by maximum likelihood, iterating until the change in the log-likelihood across iterations falls below $10^{-4}$.
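The stopping rule is usually stated as a relative change in the log-likelihood; a common form (this form is my assumption; check the package documentation for the exact criterion used) is
$$ \frac{\lvert \ell(\theta_j) - \ell(\theta_{j-1}) \rvert}{\left( \lvert \ell(\theta_j) \rvert + \lvert \ell(\theta_{j-1}) \rvert \right)/2} < 10^{-4}. $$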
### The Mechanics of Nowcasting
I used the list of variables in the New York Fed's nowcasting model as the baseline for my specification, adding the Beige Book index as `BBPOLAR`. To begin, I load the packages and the data generated earlier.
```{r warning = F, results = 'hide', message = F}
library(tidyverse)
library(vroom)
library(nowcasting)
library(tidyquant)
library(ggthemes)
```
Import the Beige Book data:
```{r warning = F, results = 'hide', message = F}
## -- data import
# import sentiments
sent_gdp_month <- vroom("C:\\Users\\darre\\Documents\\_econ\\fedspeak\\sentiment analysis\\sent_gdp_month.csv")
sent_gdp_month <-
sent_gdp_month %>%
rename(symbol = series,
price = value)
```
Next, I import the data for the nowcasting model from FRED using the `tidyquant` package, which I've recently discovered with `fredr`'s demise after `R 4.0.0` was released.
```{r warning = F, message = F}
# use tidyquant to get all the indicators
tickers <- c('PAYEMS', # payroll employment
'JTSJOL', # job openings
'CPIAUCSL', # CPI inflation urban
'DGORDER', # durable goods orders
'RSAFS', # advanced retail sales, retail and food
'UNRATE', # unemployment rate
'HOUST', # new housing starts
'INDPRO', # industrial production index
'DSPIC96', # real disposable personal income
'BOPTEXP', # BOP exports
'BOPTIMP', # BOP imports
'TTLCONS', # total construction spending
'IR', # import price index
'CPILFESL', # CPI less food energy
'PCEPILFE', # PCE less food energy
'PCEPI', # PCE price index
'PERMIT', # building permits
'TCU', # capacity utilization
'BUSINV', # business inventories
'ULCNFB', # unit labor cost
'IQ', # export price index
'GACDISA066MSFRBNY', # empire state mfg index
'GACDFSA066MSFRBPHI', # philly fed mfg index
'PCEC96', # real consumption spending
'GDPC1' # real GDP
)
factors <- tq_get(tickers, get = 'economic.data', from = '1970-01-01')
factors
```
I bind the Beige Book data to the factors imported from FRED, keeping only the moving-average series. The moving average is helpful because it provides a value for every month, instead of following the irregular schedule on which the Fed releases the Beige Book.
```{r warning = F, results = 'hide', message = F}
# keep only moving averages
factors_pol <- factors %>%
bind_rows(filter(sent_gdp_month, symbol != 'polarity'))
```
There is quite a bit of setup to complete before the `nowcasting::nowcast()` function can be used. First, I lag GDP so that the observation for `GDPC1` appears in the third month of every quarter; `tidyquant` imports the data with GDP in the first month of the quarter, which `nowcasting::Bpanel()` doesn't recognize. I also remove the date column, because creating the time series object will place the dates on the index of the data.
```{r}
# lag GDP properly and make wider
base_bb <-
factors_pol %>%
pivot_wider(
names_from = symbol,
values_from = price) %>%
mutate(GDPC1 = lag(GDPC1, 2)) %>%
rename(BBPOLAR = polarity_ma) %>%
select(-date)
base_bb
```
Next, I create the time series object with monthly frequency from January 1970 to June 2020.
```{r}
# make times series object
base_ts_bb <- ts(base_bb,
start = c(1970, 1),
end = c(2020,6),
frequency = 12)
```
I set up the parameters for the panel-balancing function and for the nowcast function used later. The `trans_bb` list gives the transformation applied to each series; there is more information about these transformations in the `nowcasting` package documentation. Here, 0 means the original series is preserved, 1 is the monthly rate of change, 2 is the monthly difference, 6 is the yearly rate of change, and 7 is the quarterly rate of change. The frequency list is self-explanatory: each series is either quarterly or monthly. Blocks are primarily taken from the `nowcasting::NYFED` data, but the `BBPOLAR` series is assigned to the global and soft blocks, in line with how the Empire State and Philadelphia Fed manufacturing indexes are coded.
```{r warning = F, message = F, results = 'hide'}
# set up parameters
trans_bb <- c(2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 2, 2, 1, 6, 1, 0, 0, 1, 7, 0)
frequency_bb <- c(12, 12, 12, 12, 12, 12, 12, 12,
12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
4, 12, 12, 12, 12, 4, 12)
blocks_bb <- tibble::tribble(~Global, ~Soft, ~Real, ~Labor,
1, 0, 0, 1,
1, 0, 0, 1,
1, 0, 0, 0,
1, 0, 1, 0,
1, 0, 1, 0,
1, 0, 0, 1,
1, 0, 1, 0,
1, 0, 1, 0,
1, 0, 1, 0,
1, 0, 1, 0,
1, 0, 1, 0,
1, 0, 1, 0,
1, 0, 0, 0,
1, 0, 0, 0,
1, 0, 0, 0,
1, 0, 0, 0,
1, 0, 1, 0,
1, 0, 1, 0,
1, 0, 1, 0,
1, 0, 0, 1,
1, 0, 0, 0,
1, 1, 0, 0,
1, 1, 0, 0,
1, 0, 1, 0,
1, 0, 1, 0,
1, 1, 0, 0
)
```
I use `nowcasting::Bpanel()` to balance the panels, which is useful for involving the transformation in the panel balancing itself. I set `na.prop = 1`, since there is some `NA` data at the beginning of the time period, which I don't want to be excluded but also don't want to be replaced by moving averages (which is what happens when `NA.replace = T`). The function returns `NULL` if the process is completed correctly.
```{r}
## -- balance panels using nowcasting::Bpanel()
base_ts_balance_bb <- Bpanel(base = base_ts_bb,
trans = trans_bb,
NA.replace = F,
na.prop = 1)
```
To complete the nowcasting process, I use the `nowcasting::nowcast()` function with the Expectation-Maximization algorithm described above. The formula relates `GDPC1` to all of the other variables. For the EM method, $r = q = 1$ by construction, i.e. the model is restricted such that the number of common factors $r$ equals the number of common shocks $q$. The function treats this as an AR(1) process with $p = 1$. While computing the estimates, the function prints the iterative log-likelihood, as shown below.
```{r}
## -- nowcast
gdp_bb_nowcastEM <- nowcasting::nowcast(formula = GDPC1 ~ .,