Embedding_layer_for_python.ipynb (+123 −2)
@@ -6,7 +6,7 @@
       "name": "Embedding layer for python.ipynb",
       "provenance": [],
       "toc_visible": true,
-      "authorship_tag": "ABX9TyM3xmkE0c1edQeEECt+lVFq",
+      "authorship_tag": "ABX9TyMgLBY3fqnOq5XgoOKy7zY8",
       "include_colab_link": true
     },
     "kernelspec": {
@@ -37,7 +37,13 @@
       "\r\n",
       "The dataset is at http://www.phontron.com/download/conala-corpus-v1.1.zip\r\n",
       "\r\n",
-      "We will do language model on the conala-mined part of the dataset."
+      "We will build a language model on the conala-mined part of the dataset.\r\n",
+      "\r\n",
+      "One problem with tokenization is that the Python tokenizer emits every comment and every string inside print() as a single, separate token. If we keep a separate vocabulary entry for each of them, the vocabulary becomes huge. An alternative is a character-level dictionary, but that increases the model's output length and makes the embeddings take much longer to train.\r\n",
+      "\r\n",
+      "Another problem is that the CoNaLa dataset contains very few newline and \\t tokens.\r\n",
+      "\r\n",
+      "Use the second dataset at https://www.sri.inf.ethz.ch/py150 if more training data is needed."