
Commit

update RNN
NeroBlackstone committed May 13, 2024

Verified: this commit was created on GitHub.com and signed with GitHub’s verified signature.
1 parent ba14b95 commit ce781f9
Showing 5 changed files with 385 additions and 4 deletions.
3 changes: 3 additions & 0 deletions Project.toml
@@ -10,12 +10,15 @@ DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
Distributions = "31c24e10-a181-5473-b8eb-7969acd0382f"
FeatureTransforms = "8fd68953-04b8-4117-ac19-158bf6de9782"
Flux = "587475ba-b771-5e3f-ad9e-33799f191a9c"
HTTP = "cd3eb016-35fb-5094-929b-558a96fad6f3"
IJulia = "7073ff75-c697-5162-941a-fcdaad2a7d2a"
ImageShow = "4e3cecfd-b093-5904-9786-8bbb286a6a31"
IterTools = "c8e1da08-722c-5040-9ed9-7db0dc04731e"
JLD2 = "033835bb-8acc-5ee8-8aae-3f567f8a3819"
MLBase = "f0e99cf1-93fa-52ec-9ecc-5026115318e0"
MLDatasets = "eb30cadb-4394-5ae3-aed4-317e484a6458"
MLUtils = "f1d291b0-491e-4a28-83b9-f70985020b54"
Serialization = "9e88b42a-f829-5b0c-bbe9-9e923198166b"
Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
TextAnalysis = "a2db99b7-8b79-58f8-94bf-bbc811eef33d"
Zygote = "e88e6eb3-aa80-5325-afca-941959d7151f"
9 changes: 7 additions & 2 deletions _toc.yml
@@ -34,5 +34,10 @@ parts:
numbered: true
chapters:
- file: notebooks/chapter_builders_guide/parameters.ipynb
- file: /notebooks/chapter_builders_guide/read-write.ipynb

- file: notebooks/chapter_builders_guide/read-write.ipynb
- caption: Recurrent Neural Networks
numbered: true
chapters:
- file: notebooks/chapter_recurrent_neural_networks/sequence.ipynb
- file: notebooks/chapter_recurrent_neural_networks/text-sequence.ipynb

4 changes: 2 additions & 2 deletions notebooks/chapter_builders_guide/read-write.ipynb
@@ -233,13 +233,13 @@
},
{
"cell_type": "code",
"execution_count": 46,
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"false"
"true"
]
},
"metadata": {},
155 changes: 155 additions & 0 deletions notebooks/chapter_recurrent_neural_networks/sequence.ipynb

Large diffs are not rendered by default.

218 changes: 218 additions & 0 deletions notebooks/chapter_recurrent_neural_networks/text-sequence.ipynb
@@ -0,0 +1,218 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Converting Raw Text into Sequence Data\n",
"\n",
"## Reading the Dataset\n",
"\n",
"Here, we will work with H. G. Wells’ The Time Machine, a book containing just over 30,000 words. While real applications will typically involve significantly larger datasets, this is sufficient to demonstrate the preprocessing pipeline. The following _download method reads the raw text into a string.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"/tmp/jl_eD65mPmPM9\""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"using Downloads\n",
"\n",
"file_path = Downloads.download(\"https://www.gutenberg.org/cache/epub/35/pg35.txt\")"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"\\ufeffThe Project Gutenberg eBook of The Time Machine\\r\\n \\r\\nTh\""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"raw_text = open(io->read(io, String),file_path)\n",
"raw_text[begin:60]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For simplicity, we ignore punctuation and capitalization when preprocessing the raw text."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\" the project gutenberg ebook of the time machine this ebook \""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"str = lowercase(replace(raw_text,r\"[^A-Za-z]+\"=>\" \"))\n",
"str[begin:60]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tokenization\n",
"\n",
"Tokens are the atomic (indivisible) units of text. Each time step corresponds to 1 token, but what precisely constitutes a token is a design choice. For example, we could represent the sentence “Baby needs a new pair of shoes” as a sequence of 7 words, where the set of all words comprise a large vocabulary (typically tens or hundreds of thousands of words). Or we would represent the same sentence as a much longer sequence of 30 characters, using a much smaller vocabulary (there are only 256 distinct ASCII characters). Below, we tokenize our preprocessed text into a sequence of characters."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\" ,t,h,e, ,p,r,o,j,e,c,t, ,g,u,t,e,n,b,e,r,g, ,e,b,o,o,k, ,o\""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tokens = [str...]\n",
"join(tokens[begin:30],\",\")"
]
},
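{
"cell_type": "markdown",
"metadata": {},
"source": [
"For contrast with the character-level tokens above, the next cell shows a word-level tokenization. It is an illustrative sketch rather than part of the original pipeline, and `word_tokens` is a name introduced here purely for demonstration: we simply split the preprocessed string on whitespace."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative word-level tokenization: split the preprocessed\n",
"# string on whitespace to obtain word tokens.\n",
"word_tokens = split(str)\n",
"word_tokens[begin:7]"
]
},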
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Vocabulary\n",
"\n",
"We now construct a vocabulary for our dataset, converting the sequence of strings into a list of numerical indices. Note that we have not lost any information and can easily convert our dataset back to its original (string) representation."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"indices:[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n",
"words:[' ', 't', 'h', 'e', 'p', 'r', 'o', 'j', 'c', 'g']\n"
]
}
],
"source": [
"vocab = unique(tokens)\n",
"vocab_dict = Dict(vocab .=> 1:length(vocab))\n",
"indices_dict = Dict(i[2]=>i[1] for i in vocab_dict)\n",
"\n",
"to_indices(v::Vector{Char}) = [vocab_dict[i] for i in v]\n",
"to_vocab(v::Vector{Int}) = [indices_dict[i] for i in v]\n",
"\n",
"indices = to_indices(vocab[begin:10])\n",
"println(\"indices:$(indices)\")\n",
"println(\"words:$(to_vocab(indices))\")"
]
},
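{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (an illustrative addition, with `corpus` introduced here for demonstration), we encode the full token sequence and confirm that decoding a prefix recovers the original text."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Encode every token as an index, decode a prefix, and compare it\n",
"# with the corresponding slice of the source string.\n",
"corpus = to_indices(tokens)\n",
"String(to_vocab(corpus[begin:30])) == str[begin:30]"
]
},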
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exploratory Language Statistics\n",
"\n",
"Using the real corpus and the Vocab class defined over words, we can inspect basic statistics concerning word use in our corpus. Below, we construct a vocabulary from words used in The Time Machine and print the ten most frequently occurring of them."
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"10-element view(::Vector{Pair{String, Int64}}, 1:10) with eltype Pair{String, Int64}:\n",
" \"the\" => 2468\n",
" \"and\" => 1296\n",
" \"of\" => 1281\n",
" \"i\" => 1242\n",
" \"a\" => 864\n",
" \"to\" => 760\n",
" \"in\" => 605\n",
" \"was\" => 550\n",
" \"that\" => 451\n",
" \"my\" => 439"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"using TextAnalysis\n",
"\n",
"sd = StringDocument(raw_text)\n",
"remove_case!(sd)\n",
"prepare!(sd, strip_non_letters)\n",
"crps = Corpus([sd])\n",
"update_lexicon!(crps)\n",
"lex_dict = lexicon(crps)\n",
"\n",
"partialsort([lex_dict...], 1:10; by = x -> x[2], rev=true)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Word frequency tends to follow a power law distribution (specifically the Zipfian) as we go down the ranks. To get a better idea, we plot the figure of the word frequency."
]
}
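,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell below is a minimal sketch of such a plot, assuming the Plots.jl package is available (it does not appear among the dependencies shown above): sort the lexicon counts in descending order and draw them on log-log axes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch of the rank-frequency plot; assumes Plots.jl is installed.\n",
"using Plots\n",
"\n",
"freqs = sort(collect(values(lex_dict)); rev = true)\n",
"plot(freqs; xscale = :log10, yscale = :log10,\n",
"     xlabel = \"rank\", ylabel = \"frequency\", legend = false)"
]
}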
],
"metadata": {
"kernelspec": {
"display_name": "Julia 1.10.3",
"language": "julia",
"name": "julia-1.10"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "1.10.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
