Poor Results on Large Corpus #210
Update: I waited a while for shuffling to terminate, and it ended with the following error:
$ build/shuffle -memory 16.0 -verbose 2 < out/cooccur.bin > out/cooccurrence.shuf.bin
Using random seed 1680251209
SHUFFLING COOCCURRENCES
array size: 1020054732
Shuffling by chunks: processed 1020054732 lines.
./demo.sh: line 45: 355 Killed    $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
I've never tried GloVe without shuffling, so I can't say whether this kind of cost curve is normal for unshuffled text. You could always try a smaller version of your dataset and compare shuffled vs. non-shuffled if you're curious. However, the best place to start is probably with running shuffle, where you should be able to set the memory limit.
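For reference, a minimal sketch of rerunning the shuffle step with a smaller memory budget, assuming the same paths as in the command above (the 8.0 GB value is only an illustrative guess and should sit comfortably below the machine's free RAM):

# Rerun shuffling with a lower in-RAM budget (in GB); 8.0 is an illustrative value, not a recommendation
build/shuffle -memory 8.0 -verbose 2 < out/cooccur.bin > out/cooccurrence.shuf.bin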
I have solved the memory error by decreasing the value of the memory parameter in the script. Now I have trained my model with the following parameters:
VOCAB_MIN_COUNT=10
VECTOR_SIZE=300
MAX_ITER=100
WINDOW_SIZE=5
X_MAX=100
I expected my model to give better results on syntactic/semantic analysis tasks than Word2Vec (5 epochs, 300 dimensions). Unfortunately, the GloVe results are worse than the Word2Vec results. Is there something wrong with my parameters? My corpus is ~10.5 GB. Overall, I have 1,384,961,747 tokens and 1,573,013 unique words (excluding words occurring less than the minimum frequency). Some of the possible problems that come to my mind:
- Is there a problem with the corpus? I compared the resulting vocab.txt file from GloVe with the one I had from Word2Vec and they are almost identical, so vocabulary extraction seems fine. If there were a problem with the corpus, we would see it in vocab.txt, right?
- Hardware-related issues? I trained models both on my local machine (i7-11390H) and on a remote machine (Intel Xeon Gold 6342), and the results are similar.
- Overfitting? I trained GloVe with 20 iterations as well and still got awful results. (That's why I switched to 100 iterations, which is also the number suggested in the paper for 300 dimensions.)
I'm stuck at this point and can't really see why the GloVe word vectors are performing so poorly. I'm open to suggestions, new ideas, parameter tweaks, etc. @AngledLuffa. Note: Sorry for changing the title; my previous problem with shuffling is solved, thank you for that.
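For context, a training invocation consistent with those settings would look roughly like the following. This is only a sketch in the style of demo.sh, and the input/output file names are placeholders rather than the exact paths used here:

# GloVe training step with the hyperparameters listed above (file names are placeholders)
build/glove -save-file vectors -threads 8 -input-file cooccurrence.shuf.bin \
    -vocab-file vocab.txt -x-max 100 -iter 100 -vector-size 300 -binary 2 -verbose 2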
I don't know how to figure it out based on this information, but if you
send me a sample of the text you are using to train, I can take a look and
see if there is anything obvious
…On Tue, Apr 4, 2023 at 12:04 AM Karahan Sarıtaş ***@***.***> wrote:
I have solved the memory error by decreasing the value of the memory
parameter in the script. Now, I have trained my model with the following
parameters:
VOCAB_MIN_COUNT=10
VECTOR_SIZE=300
MAX_ITER=100
WINDOW_SIZE=5
X_MAX=100
I expect my model to give better results in syntactic/semantic analysis
tasks compared to Word2Vec with (5 epochs + 300 embeddings). But
unfortunately, results of GloVe are worse than Word2Vec results. Is there
something wrong with my parameters? My corpus is ~10.5 GB. Overall, I have
1,384,961,747 tokens and 1,573,013 unique
words (excluding words occurring less than the minimum frequency).
Some of the possible problems that come to my mind:
- Is there a problem with the corpus?: Well, I compare the resulting
vocab.txt file from GloVe with the one I had from Word2Vec. They are
almost identical. There doesn't seem to be any problem extracting the
vocabulary - therefore I guess there shouldn't be any technical problem
with the corpus. If there was a problem with corpus, we would understand it
from vocab.txt, right?
- Hardware related issues?: I trained models on both my local machine
(i7 11390H) and on a remote machine (Intel® Xeon® Gold 6342 Processor) -
results are similar.
- Overfitting?: Well, I trained GloVe with 20 iterations as well - yet
again I get awful results. (That's why I switched to 100 iterations. It is
also the suggested number in the paper for 300 dimensions.)
I'm stuck at this point and can't really see why GloVe word vectors are
performing extremely poorly - open to suggestions to iterate new ideas/play
with parameters etc.
Let me share some glimpses of the content. Here is the output of head -1 corpus.txt:
lovecraft'ın türkçe'deki ilk kitabı
Each example is separated by \n, and all tokens are separated by spaces. Examples do not have to be single sentences; some are a couple of sentences long, so technically one example can be composed of several sentences. We used the same corpus for Word2Vec as well, so such examples shouldn't be a problem (unless there is a more specific technical issue). If you can and want to spend more time and effort on it, here is the link to the corpus we are using: https://drive.google.com/file/d/1BhHG8-btnTcfndU5fvsvTG3mD9WGf6L0/view?usp=sharing
You would ideally have one sentence per line, I would say. It should still
work, though, as we have trained English vectors with multiple sentences in
a line and only noticed a small drop in performance.
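If splitting into one sentence per line is worth trying, a rough heuristic (only a sketch, assuming sentence-final periods are already separate space-delimited tokens, as in the longer sample quoted below; the file names are placeholders) could be:

# Very rough heuristic: start a new line after each standalone "." token
sed 's/ \. / .\n/g' corpus.txt > corpus.one_sentence_per_line.txt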
I'll download the corpus and take a look, although I can't promise I'll
find anything. Glove is not too familiar to me (there just isn't anyone
else to work on it at this point)
…On Tue, Apr 4, 2023 at 1:48 AM Karahan Sarıtaş ***@***.***> wrote:
Let me share some glimpses from the content:
Here is the output of head -1 corpus.txt:
lovecraft'ın türkçe'deki ilk kitabı
Here is the output of head -5 corpus.txt:
lovecraft'ın türkçe'deki ilk kitabı
yazarın ikinci kitabı
lovecraft türkçe'de
cthulhu'nun çağrısı ve ardından deliliğin dağlarında adlı eserleri türkçe'ye çevrilen howard phillips lovecraft korku ve gerilim ustası bir yazar
beş mayıs howard phillips lovecraft'ın yaşamı boyunca yazdığı elli bir öyküden sekizini bir araya getiren cthulhu'nun çağrısı gotik edebiyatın klasik örneklerinden biri sayılıyor
Each example is separated by \n. Examples do not have to be single
sentences, they can be a collection of couple of sentences as well. For
example, there is an example like this as well:
Beşiktaş Teknik Direktörü Bernd Schuster , kulübeye çektiği İbrahim Üzülmez dışında son haftalardaki tertibiyle sahadaydı . 4'lü defansın önünde Mehmet Aurelio ile Ernst , onlarında önünde Guti , üçlü hücumcu olarak da sağda Tabata , solda Holosko ve ortada Nobre görev yaptı . Oyun anlayışında bir değişiklik düşünülmediğinden alışılagelmiş şablon içerisinde bir futbol vardı . Defans bloku kalenin uzağında kademeleniyor , kazanılan toplar Ernst ve Guti tarafından forvet elemanlarına servis ediliyordu . Dün gece gene Guti'nin ne kadar önemli bir oyuncu olduğu izlendi . Ayağından çıkan topların çoğunluğu arkadaşlarını pozisyona sokuyordu . 79'da Nobre'nin kafasına adeta topu kondurması ustalığının getirisiydi . Sarı-Kırmızılı takım topa daha çok sahip olmasına rağmen ataklarda çoğalamamanın sıkıntısını yaşadı . 2-3 önemli pozisyondan da istifade etmesini bilemediler .
Technically, this is one example composed of several sentences. We used
the same corpus for Word2Vec as well - so such examples shouldn't be a
problem (unless there is a more special technical issue).
As you can see, all tokens are separated by spaces.
If you can and want to spend more time and effort on it, here is the link
to the corpus we are using:
https://drive.google.com/file/d/1BhHG8-btnTcfndU5fvsvTG3mD9WGf6L0/view?usp=sharing
I'd be very grateful for any assistance you could provide. If you have time to train the model as well, please test the vectors with:
from gensim.models import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format("path/to/glove/vectors.txt", no_header=True, binary=False)
print(word_vectors.most_similar_cosmul(positive=['kadın', 'kral'], negative=['adam']))
The output we get is far from the expected answer for this analogy, so the question is whether we can somehow get our model to return the right word.
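As a quick sanity check before digging deeper (hypothetical commands, assuming the plain-text vectors.txt produced by GloVe with one word plus its vector per line), the dimensionality and vocabulary size can be compared against the training settings:

# Vector dimensionality: fields per line minus the word itself (should be 300 for VECTOR_SIZE=300)
head -1 vectors.txt | awk '{print NF - 1}'
# Number of vectors vs. number of vocabulary entries (these should roughly match)
wc -l < vectors.txt
wc -l < vocab.txt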
I came across some sources suggesting that Word2Vec performs better than GloVe for Turkish.
So I'm inclined to think there is no technical issue with our results; it may simply be that GloVe doesn't perform as well as Word2Vec for agglutinative languages like Turkish. If that's really the case, what would you say the main reason is, @AngledLuffa? (A problem in our setup is still a possibility, but it doesn't seem likely.)
I'm not sure when or if I'll have time to do a deep dive into this, but I
will point out that fasttext should be better in general for agglutinative
languages on account of looking at word pieces.
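For what it's worth, here is a minimal sketch of training subword-aware vectors with the fastText command-line tool; the parameter values are illustrative rather than tuned, and the corpus/output names are placeholders:

# Skip-gram fastText with character n-grams of length 3-6, which is where the
# benefit for agglutinative morphology comes from; values are illustrative
./fasttext skipgram -input corpus.txt -output ft_vectors -dim 300 -minn 3 -maxn 6 -minCount 10 -epoch 5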
…On Thu, Apr 6, 2023 at 12:50 AM Karahan Sarıtaş ***@***.***> wrote:
I came across with some sources that suggest that Word2Vec performs better
than GloVe in Turkish.
- For example, here
<https://www.cmpe.boun.edu.tr/content/building-word-embeddings-repository-turkish>,
in the "About Glove" section, it is stated that "In the article published
by Stanford University, GloVe is showed to be better than Word2Vec. But in
our study for Turkish, Word2Vec gave better results".
- Here in this
<https://dergipark.org.tr/tr/download/article-file/790325> paper, it
is stated in the conclusion part that Word2Vec performs better than GloVe
in analogy tasks.
So, I happen to think like there is no technical issue with our results -
it's a fact that GloVe doesn't perform as good as Word2Vec for
agglutinative languages like Turkish. If it's really the case, what would
you say the main reason for that @AngledLuffa
<https://github.com/AngledLuffa> ? (Having some problems with our
implementation is still an option but doesn't seem likely)
Greetings,
I'm trying to train my own GloVe word embeddings for the Turkish language using a corpus of ~10 GB. I have enough disk capacity on my computer and 16 GB of memory. I created the vocab.txt successfully and can confirm that there is no problem with it. I believe I also generated the cooccurrence matrix successfully (it is ~35 GB), but the shuffling step afterwards took too long and then suddenly terminated. In contrast to the cooccurrence generation step, shuffling seems non-responsive; it doesn't really print anything to the console. So I decided to train my model directly on the unshuffled cooccurrence matrix, for 20 iterations. The cost per iteration went roughly like this: the numbers are not precise, but my point is that the cost increased for the first 3 iterations and then gradually decreased to ~0.11.
Then I loaded the word vectors using the load_word2vec_format function provided by gensim and tested them on several analogy tasks; unfortunately, the results are terrible. So, here are my questions. While shuffle was running, I tried printing out some local variables and saw that they keep increasing, so the program is actually working, but it feels like it will run forever (if it doesn't terminate with an error first). Is shuffling really supposed to take that long (even longer than cooccurrence matrix generation)? I suspect my memory is not enough. If that's the case, is there any solution other than simply switching to different hardware / a remote server? (Also, it would be really weird if my memory were enough for matrix generation but not for shuffling o.O')
Note: I'm training on Windows using Ubuntu WSL. FYI
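For reference, the pipeline described above corresponds roughly to the following sequence of commands in the style of the repository's demo.sh; the file names and the -memory budget (in GB) are placeholders rather than the exact values used here:

# Sketch of the GloVe pipeline: vocabulary, cooccurrence counts, shuffling, training
build/vocab_count -min-count 10 -verbose 2 < corpus.txt > vocab.txt
build/cooccur -memory 8.0 -vocab-file vocab.txt -verbose 2 -window-size 5 < corpus.txt > cooccurrence.bin
build/shuffle -memory 8.0 -verbose 2 < cooccurrence.bin > cooccurrence.shuf.bin
build/glove -save-file vectors -threads 8 -input-file cooccurrence.shuf.bin \
    -x-max 100 -iter 20 -vector-size 300 -binary 2 -vocab-file vocab.txt -verbose 2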