update RNN
1 parent ba14b95, commit ce781f9
Showing 5 changed files with 385 additions and 4 deletions.
155 changes: 155 additions & 0 deletions
notebooks/chapter_recurrent_neural_networks/sequence.ipynb
Large diffs are not rendered by default.
218 changes: 218 additions & 0 deletions
notebooks/chapter_recurrent_neural_networks/text-sequence.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,218 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Converting Raw Text into Sequence Data\n", | ||
"\n", | ||
"## Reading the Dataset\n", | ||
"\n", | ||
"Here, we will work with H. G. Wells’ The Time Machine, a book containing just over 30,000 words. While real applications typically involve significantly larger datasets, this is sufficient to demonstrate the preprocessing pipeline. The following code downloads the book from Project Gutenberg and reads the raw text into a string.\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"\"/tmp/jl_eD65mPmPM9\"" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
} | ||
], | ||
"source": [ | ||
"using Downloads\n", | ||
"\n", | ||
"file_path = Downloads.download(\"https://www.gutenberg.org/cache/epub/35/pg35.txt\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 28, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"\"\\ufeffThe Project Gutenberg eBook of The Time Machine\\r\\n \\r\\nTh\"" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
} | ||
], | ||
"source": [ | ||
"raw_text = open(io->read(io, String),file_path)\n", | ||
"raw_text[begin:60]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"For simplicity, we ignore punctuation and capitalization when preprocessing the raw text." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 29, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"\" the project gutenberg ebook of the time machine this ebook \"" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
} | ||
], | ||
"source": [ | ||
"str = lowercase(replace(raw_text,r\"[^A-Za-z]+\"=>\" \"))\n", | ||
"str[begin:60]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Tokenization\n", | ||
"\n", | ||
"Tokens are the atomic (indivisible) units of text. Each time step corresponds to one token, but what precisely constitutes a token is a design choice. For example, we could represent the sentence “Baby needs a new pair of shoes” as a sequence of 7 words, where the set of all words comprises a large vocabulary (typically tens or hundreds of thousands of words). Or we could represent the same sentence as a much longer sequence of 30 characters, using a much smaller vocabulary (standard ASCII defines only 128 distinct characters). Below, we tokenize our preprocessed text into a sequence of characters, followed by a quick sketch of the word-level alternative." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 31, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"\" ,t,h,e, ,p,r,o,j,e,c,t, ,g,u,t,e,n,b,e,r,g, ,e,b,o,o,k, ,o\"" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
} | ||
], | ||
"source": [ | ||
"tokens = [str...]\n", | ||
"join(tokens[begin:30],\",\")" | ||
] | ||
}, | ||
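{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"As a quick sketch of the word-level alternative mentioned above (the name `word_tokens` is purely illustrative), we can split the preprocessed string on whitespace instead of iterating over its characters." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Word-level tokenization sketch: split the preprocessed string on runs of whitespace\n", | ||
"word_tokens = split(str)\n", | ||
"word_tokens[begin:10]" | ||
] | ||
}, | ||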
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Vocabulary\n", | ||
"\n", | ||
"We now construct a vocabulary for our dataset, converting the sequence of character tokens into a list of numerical indices. Note that we have not lost any information and can easily convert our dataset back to its original (string) representation, as the round-trip check below confirms." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 32, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"indices:[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n", | ||
"words:[' ', 't', 'h', 'e', 'p', 'r', 'o', 'j', 'c', 'g']\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"vocab = unique(tokens)\n", | ||
"vocab_dict = Dict(vocab .=> 1:length(vocab))\n", | ||
"indices_dict = Dict(i[2]=>i[1] for i in vocab_dict)\n", | ||
"\n", | ||
"to_indices(v::Vector{Char}) = [vocab_dict[i] for i in v]\n", | ||
"to_vocab(v::Vector{Int}) = [indices_dict[i] for i in v]\n", | ||
"\n", | ||
"indices = to_indices(vocab[begin:10])\n", | ||
"println(\"indices:$(indices)\")\n", | ||
"println(\"words:$(to_vocab(indices))\")" | ||
] | ||
}, | ||
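{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"As a quick sanity check, sketched with the helpers defined above, we can map the entire token sequence to indices and back, confirming that the round trip reproduces the preprocessed string." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Map the full character sequence to indices and back again;\n", | ||
"# the round trip should reproduce the preprocessed string exactly\n", | ||
"corpus = to_indices(tokens)\n", | ||
"String(to_vocab(corpus)) == str" | ||
] | ||
}, | ||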
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Exploratory Language Statistics\n", | ||
"\n", | ||
"Using the real corpus, we can inspect basic statistics concerning word use. Below, we build a word-level lexicon from The Time Machine with TextAnalysis.jl and print the ten most frequently occurring words." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 57, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"10-element view(::Vector{Pair{String, Int64}}, 1:10) with eltype Pair{String, Int64}:\n", | ||
" \"the\" => 2468\n", | ||
" \"and\" => 1296\n", | ||
" \"of\" => 1281\n", | ||
" \"i\" => 1242\n", | ||
" \"a\" => 864\n", | ||
" \"to\" => 760\n", | ||
" \"in\" => 605\n", | ||
" \"was\" => 550\n", | ||
" \"that\" => 451\n", | ||
" \"my\" => 439" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
} | ||
], | ||
"source": [ | ||
"using TextAnalysis\n", | ||
"\n", | ||
"sd = StringDocument(raw_text)\n", | ||
"remove_case!(sd)\n", | ||
"prepare!(sd, strip_non_letters)\n", | ||
"crps = Corpus([sd])\n", | ||
"update_lexicon!(crps)\n", | ||
"lex_dict = lexicon(crps)\n", | ||
"\n", | ||
"partialsort([lex_dict...], 1:10; by = x -> x[2], rev=true)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Word frequency tends to follow a power-law distribution (specifically, Zipf's law) as we go down the ranks: the frequency $n_i$ of the $i$-th most frequent word is roughly proportional to $1/i^\\alpha$ for some positive exponent $\\alpha$. To get a better idea, we plot word frequency against rank, as sketched below." | ||
] | ||
},
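{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal plotting sketch, assuming the Plots.jl package is installed: sort the word counts from `lex_dict`, then plot frequency against rank on log-log axes, where a Zipfian distribution appears as an approximately straight line."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"using Plots\n",
"\n",
"# Sort word frequencies in descending order (lex_dict comes from the cell above)\n",
"freqs = sort(collect(values(lex_dict)); rev=true)\n",
"\n",
"# On log-log axes a Zipfian distribution is roughly a straight line\n",
"plot(1:length(freqs), freqs;\n",
"     xscale=:log10, yscale=:log10,\n",
"     xlabel=\"rank\", ylabel=\"frequency\", legend=false)"
]
}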
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Julia 1.10.3", | ||
"language": "julia", | ||
"name": "julia-1.10" | ||
}, | ||
"language_info": { | ||
"file_extension": ".jl", | ||
"mimetype": "application/julia", | ||
"name": "julia", | ||
"version": "1.10.3" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |