update RNN
1 parent ba14b95, commit ce781f9
Showing 5 changed files with 385 additions and 4 deletions.
155 changes: 155 additions & 0 deletions
notebooks/chapter_recurrent_neural_networks/sequence.ipynb
Large diffs are not rendered by default.
218 changes: 218 additions & 0 deletions
notebooks/chapter_recurrent_neural_networks/text-sequence.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,218 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Converting Raw Text into Sequence Data\n", | ||
"\n", | ||
"## Reading the Dataset\n", | ||
"\n", | ||
"Here, we will work with H. G. Wells’ The Time Machine, a book containing just over 30,000 words. While real applications typically involve significantly larger datasets, this is sufficient to demonstrate the preprocessing pipeline. The following code downloads the book from Project Gutenberg and reads the raw text into a string.\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"\"/tmp/jl_eD65mPmPM9\"" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
} | ||
], | ||
"source": [ | ||
"using Downloads\n", | ||
"\n", | ||
"file_path = Downloads.download(\"https://www.gutenberg.org/cache/epub/35/pg35.txt\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 28, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"\"\\ufeffThe Project Gutenberg eBook of The Time Machine\\r\\n \\r\\nTh\"" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
} | ||
], | ||
"source": [ | ||
"raw_text = open(io->read(io, String),file_path)\n", | ||
"raw_text[begin:60]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"For simplicity, we ignore punctuation and capitalization when preprocessing the raw text." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 29, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"\" the project gutenberg ebook of the time machine this ebook \"" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
} | ||
], | ||
"source": [ | ||
"str = lowercase(replace(raw_text,r\"[^A-Za-z]+\"=>\" \"))\n", | ||
"str[begin:60]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Tokenization\n", | ||
"\n", | ||
"Tokens are the atomic (indivisible) units of text. Each time step corresponds to one token, but what precisely constitutes a token is a design choice. For example, we could represent the sentence “Baby needs a new pair of shoes” as a sequence of 7 words, where the set of all words comprises a large vocabulary (typically tens or hundreds of thousands of words). Or we could represent the same sentence as a much longer sequence of 30 characters, using a much smaller vocabulary (standard ASCII defines only 128 distinct characters). Below, we tokenize our preprocessed text into a sequence of characters, followed by a quick sketch of the word-level alternative." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 31, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"\" ,t,h,e, ,p,r,o,j,e,c,t, ,g,u,t,e,n,b,e,r,g, ,e,b,o,o,k, ,o\"" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
} | ||
], | ||
"source": [ | ||
"tokens = [str...]\n", | ||
"join(tokens[begin:30],\",\")" | ||
] | ||
}, | ||
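{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"As a quick sketch of the word-level alternative mentioned above (the name `word_tokens` is purely illustrative), we can split the preprocessed string on whitespace instead of iterating over its characters." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Word-level tokenization sketch: split the preprocessed string on runs of whitespace\n", | ||
"word_tokens = split(str)\n", | ||
"word_tokens[begin:10]" | ||
] | ||
}, | ||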
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Vocabulary\n", | ||
"\n", | ||
"We now construct a vocabulary for our dataset, converting the sequence of character tokens into a list of numerical indices. Note that we have not lost any information and can easily convert our dataset back to its original (string) representation, as the round-trip check below confirms." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 32, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"indices:[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n", | ||
"words:[' ', 't', 'h', 'e', 'p', 'r', 'o', 'j', 'c', 'g']\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"vocab = unique(tokens)\n", | ||
"vocab_dict = Dict(vocab .=> 1:length(vocab))\n", | ||
"indices_dict = Dict(i[2]=>i[1] for i in vocab_dict)\n", | ||
"\n", | ||
"to_indices(v::Vector{Char}) = [vocab_dict[i] for i in v]\n", | ||
"to_vocab(v::Vector{Int}) = [indices_dict[i] for i in v]\n", | ||
"\n", | ||
"indices = to_indices(vocab[begin:10])\n", | ||
"println(\"indices:$(indices)\")\n", | ||
"println(\"words:$(to_vocab(indices))\")" | ||
] | ||
}, | ||
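{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"As a quick sanity check, sketched with the helpers defined above, we can map the entire token sequence to indices and back, confirming that the round trip reproduces the preprocessed string." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Map the full character sequence to indices and back again;\n", | ||
"# the round trip should reproduce the preprocessed string exactly\n", | ||
"corpus = to_indices(tokens)\n", | ||
"String(to_vocab(corpus)) == str" | ||
] | ||
}, | ||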
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Exploratory Language Statistics\n", | ||
"\n", | ||
"Using the real corpus, we can inspect basic statistics concerning word use. Below, we build a word-level lexicon from The Time Machine with TextAnalysis.jl and print the ten most frequently occurring words." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 57, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"10-element view(::Vector{Pair{String, Int64}}, 1:10) with eltype Pair{String, Int64}:\n", | ||
" \"the\" => 2468\n", | ||
" \"and\" => 1296\n", | ||
" \"of\" => 1281\n", | ||
" \"i\" => 1242\n", | ||
" \"a\" => 864\n", | ||
" \"to\" => 760\n", | ||
" \"in\" => 605\n", | ||
" \"was\" => 550\n", | ||
" \"that\" => 451\n", | ||
" \"my\" => 439" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
} | ||
], | ||
"source": [ | ||
"using TextAnalysis\n", | ||
"\n", | ||
"sd = StringDocument(raw_text)\n", | ||
"remove_case!(sd)\n", | ||
"prepare!(sd, strip_non_letters)\n", | ||
"crps = Corpus([sd])\n", | ||
"update_lexicon!(crps)\n", | ||
"lex_dict = lexicon(crps)\n", | ||
"\n", | ||
"partialsort([lex_dict...], 1:10; by = x -> x[2], rev=true)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Word frequency tends to follow a power-law distribution (specifically, Zipf's law) as we go down the ranks: the frequency $n_i$ of the $i$-th most frequent word is roughly proportional to $1/i^\\alpha$ for some positive exponent $\\alpha$. To get a better idea, we plot word frequency against rank, as sketched below." | ||
] | ||
},
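{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal plotting sketch, assuming the Plots.jl package is installed: sort the word counts from `lex_dict`, then plot frequency against rank on log-log axes, where a Zipfian distribution appears as an approximately straight line."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"using Plots\n",
"\n",
"# Sort word frequencies in descending order (lex_dict comes from the cell above)\n",
"freqs = sort(collect(values(lex_dict)); rev=true)\n",
"\n",
"# On log-log axes a Zipfian distribution is roughly a straight line\n",
"plot(1:length(freqs), freqs;\n",
"     xscale=:log10, yscale=:log10,\n",
"     xlabel=\"rank\", ylabel=\"frequency\", legend=false)"
]
}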
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Julia 1.10.3", | ||
"language": "julia", | ||
"name": "julia-1.10" | ||
}, | ||
"language_info": { | ||
"file_extension": ".jl", | ||
"mimetype": "application/julia", | ||
"name": "julia", | ||
"version": "1.10.3" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |