Create Interlinear Text from two published books #65
Replies: 8 comments
-
You might be interested in exploring or contributing to the site linked
below.
Another person interested in interlinear displays for language translations.
kevin
PS https://interlineardotworld.blogspot.com/
On Wed, Dec 25, 2024 at 11:09 PM mariuscmorar ***@***.***> wrote:
I wanted to run an idea by you guys. I'm not sure if you've thought about
this.
I was thinking to take two books, one English and one Spanish, then take
the first English sentence and right under it have the equivalent Spanish
sentence. This could be any pair of languages. Then take the next English
sentence and the equivalent Spanish one and so on. This would be similar to
what DoppelText is doing. It's not as good as the interlinear text but I
think much better than having two parallel texts. If I have two books side
by side, I lose a lot of time switching between them and finding my way in
the paragraph. If I have two sentences right under each other, I don't lose
any time. The order doesn't matter. In the more beginner phase, I find it
more useful to read the English first and then the Spanish. In the more
advanced stages, I can just read the Spanish and read the translation as
needed.
I have epub versions of two books. The tricky part is manipulating
the text. And it's not just taking the first sentence from each and
combining them. The meaning has to match. If there is one sentence in one
text and two equivalent ones in the other, they would have to match by
taking two sentences at a time or breaking the long one up.
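A minimal sketch of the weaving step described above, assuming the sentences have already been split and aligned. The `alignment` format (pairs of index lists) and the function name `interleave` are hypothetical, and the sample sentences are made up to show a two-to-one match.

```python
from typing import List, Tuple

# Hypothetical alignment format: each entry pairs a list of English
# sentence indices with the list of Spanish sentence indices that
# carry the same meaning. A 2-to-1 match is simply ([0, 1], [0]).
Alignment = List[Tuple[List[int], List[int]]]


def interleave(english: List[str], spanish: List[str],
               alignment: Alignment) -> str:
    """Place each Spanish group right under its English group."""
    blocks = []
    for en_idx, es_idx in alignment:
        en_line = " ".join(english[i] for i in en_idx)
        es_line = " ".join(spanish[j] for j in es_idx)
        blocks.append(en_line + "\n" + es_line)
    # One blank row between pairs.
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Made-up sentences: two English sentences match one Spanish one.
    en = ["The horsemen ride on.", "They pay the villagers no heed.",
          "Night falls."]
    es = ["Los jinetes cabalgan sin fijarse en los aldeanos.",
          "Cae la noche."]
    print(interleave(en, es, [([0, 1], [0]), ([2], [1])]))
```

The sentences themselves are only joined, never edited, so the hard part stays where the post says it is: producing the alignment.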
I tried using ChatGPT to do this. The problem I'm having is that it can't
handle large amounts of text. With the O1 model I was able to do about
5,000 words but then I ran out of credits. Then I tried with the 4o model.
I created three documents in the chat, one with English text, one with
Spanish and one with the interlinear. It only did about 7 pairs and then it
stopped. I kept asking it to keep going and it would go a bit further and
stop.
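One way around the length limit is to feed the model small batches instead of whole chapters. The sketch below is only an illustration: it assumes the two books have already been split into paragraphs that pair up one to one (which real books may not), and `chunk_pairs` and `max_words` are hypothetical names.

```python
from typing import List, Tuple

ParagraphPair = Tuple[str, str]  # (English paragraph, Spanish paragraph)


def chunk_pairs(pairs: List[ParagraphPair],
                max_words: int = 800) -> List[List[ParagraphPair]]:
    """Group paragraph pairs into batches of at most max_words words.

    Words are counted across both languages. A single pair that is
    larger than the budget still gets a batch of its own.
    """
    chunks: List[List[ParagraphPair]] = []
    current: List[ParagraphPair] = []
    count = 0
    for en, es in pairs:
        n = len(en.split()) + len(es.split())
        if current and count + n > max_words:
            chunks.append(current)
            current, count = [], 0
        current.append((en, es))
        count += n
    if current:
        chunks.append(current)
    return chunks
```

Each batch can then be sent with the same prompt, and the results concatenated in order.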
I think an LLM would be great at this because it understands meaning,
so it can match the exact content. We could also have the model check
its own work.
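Part of that checking does not need a model at all. As a rough sketch, assuming the output has been parsed back into a list of sentence pieces per language, a plain string comparison can confirm that nothing was reworded. The function names here are hypothetical, and whitespace differences are deliberately ignored.

```python
from typing import List, Optional


def normalize(text: str) -> str:
    """Collapse all runs of whitespace to single spaces."""
    return " ".join(text.split())


def first_altered_piece(source: str, pieces: List[str]) -> Optional[int]:
    """Return the index of the first piece that does not match the source.

    Returns None when the pieces, read in order, reproduce the source
    exactly (ignoring whitespace). Returns len(pieces) when every piece
    matches but source text is left over at the end.
    """
    src = normalize(source)
    pos = 0
    for i, piece in enumerate(pieces):
        p = normalize(piece)
        if not src.startswith(p, pos):
            return i
        pos += len(p)
        # Skip the single space separating consecutive pieces.
        if src.startswith(" ", pos):
            pos += 1
    return None if pos == len(src) else len(pieces)
```

Running this once over the English pieces and once over the Spanish pieces catches a paraphrased, dropped or reordered sentence; it says nothing about whether the pairs are matched correctly.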
I think it's possible to do it, but I don't have the skills. I'm not a
programmer.
This would open up the entire world of books out there. Anything that's
translated. In Spanish, there are interlinear texts from HypLern and
interlinearbooks, and then there's only DoppelText which does what I'm
proposing, but they only have three books in Spanish. There's just not
enough in Spanish in this format.
One way would be to acquire vocabulary by reading these texts 10 times.
Another would be to read 10 times more text once. Let's say I read the
interlinear texts 5 times and then move to this format. Do I want to read
the same 3 books 5 times? Why not read 15 books? Why read the Three
Musketeers 5 times and not read 5 of Dumas' books once?
One advantage is that I see the same vocabulary in a much broader context.
The biggest advantage is that it keeps my interest. People like different
things and want to read different things.
I read a decent amount in English. Many books I read are available in
other languages. I could just incorporate language learning into my current
reading process. It would just take 2-3 times longer. Then learning a
language and building vocabulary is not something I do on its own, a
separate thing. I can incorporate it into something I already do.
There are just so many advantages to doing something like this.
The main advantage of doing this is that the quality of the translation
can be better than a translation generated by LLMs. The disadvantage is
that there might not be a direct word for word translation.
Another possibility is to do this and then do the word for word
translation as well. Maybe have the user pick the order. I think the order
depends on the level. If I'm at a late intermediate or late advanced
stage, I would pick Spanish, then word for word, then thought for thought
translation, since I would want to try to read mostly in Spanish and rely on
translation as needed. At a beginner or early intermediate stage I would do
English thought for thought first, then Spanish, then English word for word.
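The two orderings described above could be expressed as a small lookup table. This is just a sketch: the level names, the dictionary keys and the `render` function are hypothetical, and the word-for-word gloss in the example is an illustrative guess, not output from any tool.

```python
from typing import Dict

# Line order per reader level, following the preference described above.
ORDERS = {
    "beginner": ("translation", "original", "literal"),
    "advanced": ("original", "literal", "translation"),
}


def render(row: Dict[str, str], level: str) -> str:
    """Return the three lines of one sentence group in the chosen order."""
    return "\n".join(row[key] for key in ORDERS[level])


if __name__ == "__main__":
    row = {
        "original": "No les gustan los forasteros armados.",
        # Illustrative gloss written by hand for this example.
        "literal": "Not to-them please the strangers armed.",
        "translation": "Country folk do not look kindly upon armed strangers.",
    }
    print(render(row, "beginner"))
    print()
    print(render(row, "advanced"))
```

Because the order lives in one table, letting the user pick it is a matter of passing a different key.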
Let me know if you have any thoughts on this or if you would be interested
in developing a solution for this. I think the LLMs open up a whole new
range of possibilities when it comes to language acquisition.
Here's the prompt I used with the O1-mini model. The output is not perfect
but maybe it can be improved with better prompting.
I will give you two texts. One in English and another one in another
language (Spanish, German). I want you to create an interlinear text by
taking breaking the english text up sentence by sentence and after each
sentence paste the corespending sentence from the the other language
without altering either text. For example, if I give you and English and a
Spanish text, take the first english sentence exactly word for word, then
take the coresponding sentence from the Spanish text exactly word for word
and paste it after the English sentence. Then add a blank row. Then take
the second english sentence exactly word for word, then take the
coresponding sentence from the Spanish text exactly word for word and paste
it after the second English sentence. Then add a blank row. Then take the
third english sentence exactly word for word, then take the coresponding
sentence from the Spanish text exactly word for word and paste it after the
English sentence. And so on until you get to the end of the paragraph. At
the end of each paragraph, add two blank rows. The output should be a
combination of the two texts I gave you broken out sentence by sentence.
Each pair of english and spanish sentence should be separated by a blank
row and each paragraph by two blank rows. It is absolutely critical that
both english and spanish texts are copied word for word. There should be
absolutely no modification to the text. The texts should be 100% exactly
the same. The only operation you are performing is spliting the text into
sentences and weaving the spanish text into the english text to create an
interlinear text. The words of the text are not to be modified under any
circumstances.
Regards,
Marius
Here's the example of the output text:
PROLOGUE
PRÓLOGO
Mysterious bands of men on horseback travel the roads of Greece.
Misteriosos grupos de hombres a caballo recorren los caminos de Grecia.
The country folk watch them with suspicion from their plots of land, or
the doors to their huts.
Los campesinos los observan con desconfianza desde sus tierras o desde las
puertas de sus cabañas.
They know from experience that only those who represent danger travel:
soldiers, mercenaries, and slave traders.
La experiencia les ha enseñado que solo viaja la gente peligrosa:
soldados, mercenarios y traficantes de esclavos.
They frown and grumble until the men disappear over the horizon.
Arrugan la frente y gruñen hasta que los ven hundirse otra vez en el
horizonte.
Country folk do not look kindly upon armed strangers.
No les gustan los forasteros armados.
The horsemen ride on, paying the villagers no heed.
Los jinetes cabalgan sin fijarse en los aldeanos.
For months, they have climbed mountains, traversed ravines, crossed
valleys, forded rivers, and sailed from island to island.
Durante meses han escalado montañas, han franqueado desfiladeros, han
cruzado valles, han vadeado ríos, han navegado de isla en isla.
Their muscles have hardened and their endurance increased since they were
sent on this peculiar mission.
Sus músculos y su resistencia se han endurecido desde que les encargaron
esta extraña misión.
To achieve their task, they must venture into violent realms in a world
that is almost continually at war.
Para cumplir su tarea deben aventurarse por los violentos territorios de
un mundo en guerra casi constante.
-
Thanks! I'll check it out.
Regards,
Marius
-
Marius,
Do I understand correctly that you want to have book-length text displayed
in interlinear form? and that you would read such long texts?
If so, I believe I have misjudged the universe of those who want to have
and use interlinear documents. Based on your answer, I will correct what I
wrote on that blog.
Thanks for a reply,
kevin
-
Hi Kevin,
Absolutely Yes!!!
But let me pursue this further: why wouldn't everyone want it?
Let's think through this. The number of people who want to learn languages is large. Most fail or quit. Why? Because it's hard and boring!!! They do drills, they do flashcards, Anki, etc. That's what I used to do. The Hamiltonian method, which is the method you and I discovered, tells us that it doesn't have to be hard or super boring! So, if you don't do flashcards, how do you do it? By reading!
Then there's the question of how well you want to learn a language. If you just talk to people or watch TV or listen to podcasts, you will only acquire a limited vocabulary range, 5,000-10,000 words. If you want to learn the language at an advanced level, you have to read. And the frequency of words beyond the most frequent 10,000 words is very, very low. So, unless you make flashcards and memorize them that way, which is not fun, you'll have to read, and read, and read and keep on reading. There's just no way around it.
I would say that most people want to learn languages to talk to people and get by. But a significant number want to be able to read. A large number would want to acquire 20,000 words, which is what an average native speaker has. Quite a few people would want to acquire 30,000 or more, which is what an educated person has. How many people get there? Very few. Why? Because it's very boring and hard work looking up words in the dictionary and making flashcards.
So what's the solution? The solution is interlinear texts. Why? Because it's the most efficient way to acquire vocabulary. I would say at the very beginning Assimil manuals and the shadowing technique as taught by Alexander Arguelles are the most efficient. Then at the intermediate level nothing comes close to interlinear texts. That's the level with a vocabulary size of 2,000-20,000 words. I would argue that interlinear texts are the best method until one knows at least 98%-99% of the words. If you know that many words, then you still have to look up 3-6 words per page on average, but you can figure many out from context. At the advanced level the best tool I found is ReadLang. I can look up words almost instantly and keep reading. I don't have to reread the sentence when I look up the word. It's the most efficient way. But if I have to look up a lot of them, then it becomes very inefficient. If I have to look up more than 2%, then interlinear texts are better. If less than 1-2%, then ReadLang is ideal. I can look up 3-6 words per page and still read at a decent pace.
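The 98% rule of thumb can be checked against a given text. The sketch below is a rough illustration only: `coverage` and `suggest_format` are hypothetical names, the tokenizer is a simple regular expression, and inflected forms are counted as separate words, so real coverage would likely be underestimated.

```python
import re
from typing import Set


def coverage(text: str, known_words: Set[str]) -> float:
    """Fraction of running words in text that appear in known_words."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)


def suggest_format(text: str, known_words: Set[str],
                   threshold: float = 0.98) -> str:
    """Apply the rule of thumb: below the threshold, use interlinear."""
    if coverage(text, known_words) < threshold:
        return "interlinear"
    return "pop-up dictionary"


if __name__ == "__main__":
    sentence = "Los jinetes cabalgan sin fijarse en los aldeanos."
    known = {"los", "sin", "en"}
    print(coverage(sentence, known))        # 4 of 8 words are known
    print(suggest_format(sentence, known))
```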
So the challenge is twofold. One, how do I acquire a large enough vocabulary so that I can pick up any text in any field and know 98% of the words? There are a lot of different kinds of texts out there that use different subsets of vocabulary. Even if I know 98% of the words in a newspaper, it doesn't mean I can pick up Proust and still know 98%, or a scientific textbook, or a paper on physics, or an academic philosophy book. So how do I do it in the most efficient way? Right now there is no way. There are very few interlinear texts. The ones I found are:
HypLern, which you can buy in print as well
Interlinearbooks
LeyerlePublications, print as well but extremely expensive.
DoppelText - not truly interlinear, but you can get by with ReadLang, and it has the widest selection by far.
These sources have more material in the big languages: German, French, Italian. Far fewer in Spanish. But if you want to learn a smaller language, like Finnish, you're out of luck. There's almost nothing for you. And to acquire a vocabulary at an advanced level, you need to read a lot of books. Not just a dozen, but a few dozen, maybe 100. And given how diverse people's tastes are, we would need a few thousand books for each pair of languages to give enough content in various styles.
Then there's the question of interest. People have different interests. Some like literature, others theology, others science. Some don't want to read fairytales but would devour a textbook on physics. Some don't care about science but would devour a Fantasy or Mystery or Romance novel. Because they don't have content they enjoy, they lose interest and give up, or make much slower progress than they would if they had the reading material that grabs them.
So what is the solution? One solution is to get two books and read the original and translation side by side. But that's a pain. It's extremely inefficient. You read a sentence, then have to lift your eyes and look for the sentence in the other book, then come back and find your way back, then find the words you don't know and go back and forth a few times. This is just extremely inefficient. It's better than flashcards or looking up words in the dictionary, but you will make progress slowly.
The other challenge with this is idiomatic translation. Often the translation is not word for word, and you can acquire the wrong definitions like this.
The ideal solution is to have any text one might want to read available in both word for word and thought for thought in interlinear format. We might get there in a few years with LLMs, but we're not there yet. However, with LLMs we do have a practical solution. We can have the LLMs create an interlinear text like I slowed, have the human translation, the original and the word for word. The LLMs can parse a text, extract a sentence and match its translation and then generate a word for word translation.
I am doing this right now. I wanted to ready Papyrus: The Invention of Books in the Ancient World by Irene Vallejo. I wanted to read it in English. Then I realized it's translated from Spanish, which is a language I'm learning now. So I that's how I got this idea. Why not read it in English as I would normally do, but after each sentence read the Spanish sentence as well? It would take me longer, but I would acquire vocabulary in the process. It's a proper translation, by a human. So I used ChatGPT to create a three-line interlinear text. And to deal with idiomatic translation or situations where the translator takes liberties, I have the ChatGPT word for word translation. I don't read that. I only look at it when I need help. It's not perfect, but good enough. And I'm reading it with ReadLang, so if I need even more help, I can get the meaning of a word like I would by reading texts where I already know 98% of vocabuary. I use ReadLang maybe 2-3 times per page.
This solution is not perfect. The perfect solution would be a human generated tri-line translation with both word for word and though for though. If you want to see what I think is the perfect solution, look at Leyerle Publications. They have the original, then word for word, then thought for thought, then a commentary at the end. But it's only for German, Italian and French and they only do Opera Libretti and Songs.
We will never have that done by humans at scale. But we don't need to. I think that what I'm proposing is good enough. I'm trying this out as we speak and it's working. I'm day 50 though my first Assimil manual and I'm reading a non-fiction text in Spanish. English first, then Spanish. And I can understand the Spanish text. And I building vocabulary before I even finish my first manual. When my Spanish will get good enough, I'll swap the order and have the Spanish line first then word for word and then though for thought human translation last. This is still useful to learn idioms.
I could read like this for the rest of my life. In fact, I will. Since reading like this is so effortless, I will probably always have a language that I'm acquiring this way. I will only give up this format when I know at least 99% of the vocabulary. There is absolutely no reason to give it up before then. When my Spanish is good enough, I will just flip the order and have Spanish first. Then I will just read the Spanish row and rely on the two translations when I need to, until I don't need them much anymore. Then I'll move to Readlang. But I could read the entire book by just reading the first Spanish row and having the translations there just in case.
I'm following Alexander Arguelles's techniques to learn. Shadowing, listening and reading aloud. And reading a lot. He advises against flashcards or even looking things up in the dictionary. He advises reading interlinear texts and then two books side by side.
We don't have enough interlinear books. I found 10 texts (HypLern and InterlinearBooks). I need 100. So this is the solution. This is what Arguelles is recommending, but much, much more efficient.
So, to answer your question, why wouldn't people want this? I can't imagine why anyone who is serious about learning a language at an advanced level would not want to read like this. There are two reasons I can think of: 1) they don't know about it and 2) they can't find texts that grab them. The only texts are literary texts and some people just don't want to read a fairy tale 5-10 times. I get it.
The thing is, this solution doesn't exist (except at Leyerle Publications), so how can people not want something that doesn't exist? And then for purely interlinear, how can we say people are not interested when the texts they are interested in don't exist? I can't see any reason why this will not be used widely once we can fine-tune the technology.
Another advantage is that we don't have to read the same text 5-10 times. The reason we have to do that now is that we only have a few texts. And since interlinear is the most efficient way to master vocabulary, we want to master all the vocabulary available in this form.
But if we had an almost unlimited number of texts at our disposal, we wouldn't have to read a text 10 times. Why would I read The Three Musketeers 10 times? The only reason is that it's the only Dumas text available in this format (published by DoppelText). But I could read all 10 books in the entire series in that time. And I could acquire an even broader vocabulary. I wouldn't lose anything by reading 10 books once instead of one book 10 times. Quite the opposite, it would keep my interest all the way through. It wouldn't feel like a chore. It wouldn't feel like studying. It wouldn't feel like learning languages. It would feel like reading. It would take 2-3 times longer in the beginning, but go much quicker towards the end.
I'm super excited as you can tell. I believe this is the future of language learning. I'm very excited about LLMs. I think they will transform many aspects of the way we do things now and this will be one of them. The Hamiltonian method is effortless but it's boring because you read texts that don't grab you over and over. I think this is the first time in human history when it's possible to acquire a language almost effortlessly by reading whatever you want.
I know it works because I'm doing it.
The challenge is that I'm not a programmer. I'm using the chat, and there is a limit to how much text you can input. So I have to break up a book into small bits to run the prompt, which is a waste of time. We just need someone to build a back end to do the whole thing: take two epubs, break them up, run the prompt on each piece, and put it all back together.
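The "break them up" step needs no LLM at all. Below is a minimal sketch in Python, assuming the text has already been extracted from the epub as plain text with blank lines between paragraphs (the extraction itself is not shown). The names `split_paragraphs` and `chunk_paragraphs` and the 10,000-word default are illustrative choices, not part of any existing tool.

```python
def split_paragraphs(text):
    """Split plain text on blank lines, dropping empty paragraphs."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def chunk_paragraphs(paragraphs, max_words=10_000):
    """Group consecutive paragraphs into chunks of at most max_words words.

    Paragraphs are never cut in the middle. A single paragraph longer
    than max_words becomes its own oversized chunk.
    """
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and count + n > max_words:
            chunks.append(current)
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append(current)
    return chunks
```

This cuts one book by word count only. Keeping the English and Spanish chunks in step is a separate problem; cutting both books at chapter boundaries first is probably the simplest way to make the pieces line up.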
This was not possible with the 4o model because the model wouldn't complete the task. The o1-mini model does complete the task. So I think this latest o1 model is a portal to another universe, a universe where we can acquire languages to a high level effortlessly. There are a lot of people who enjoy reading their favorite books but don't like learning languages (the old way). Now they can, because language learning can just become reading, and if reading is enjoyable, language learning will be enjoyable too.
I will keep breaking the book up like this until I find someone who's willing to implement the back end.
Would you be able and willing to do it? If not, do you know someone who would? I just don't have the skills, otherwise I would do it myself. If not, I'll keep asking.
Regards,
Marius
________________________________
From: jkmactavish ***@***.***>
Sent: Saturday, December 28, 2024 1:04 AM
To: parkchamchi/GlossySnake ***@***.***>
Cc: mariuscmorar ***@***.***>; Author ***@***.***>
Subject: Re: [parkchamchi/GlossySnake] Create Interlinear Text from two published books (Discussion #65)
Marius,
Do I understand correctly that you want to have book-length text displayed
in interlinear form? and that you would read such long texts?
If so, I believe I have misjudged the universe of those who want to have
and use interlinear documents. Based on your answer, I will correct what I
wrote on that blog.
Thanks for a reply,
kevin
-
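Hi, thank you for visiting this repo.
The current implementation (and even the design) is unstable, but I'm still interested in this project. I haven't touched the codebase for a while because I was doing a side project (collecting and structuring the existing interlinear corpora), but I will get back on track. (Should learn Italian...) Using the LangChain library should make the project more stable and maintainable, I hope...
As for sentence-level interlinear text generation from two given corpora: the simple prompt method would be unreliable, as you mentioned, due to the token limit and the unreliability of LLM output. A solution I'd propose is using augmented subroutines, which can be easily implemented with the LangChain library. For example:
Divide each corpus into paragraphs, and each paragraph into sentences. This can be done with traditional NLP tokenizers, e.g. NLTK.
The problem: the sentences do not line up one to one. For example, an English translator may divide one German sentence into several. Thankfully, LLMs are versatile at this task.
Connect the matching sentences.
Since LLM output is by nature not consistent, the model should emit only the positions of the sentences, not the actual text.
Validate.
I will keep working on the project. If you have any questions or need help, feel free to tell me.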
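The "emit only positions, then validate" idea can be sketched without any LLM call. This Python sketch assumes a hypothetical alignment format: a list of pairs, each holding a group of source sentence indices and a group of target sentence indices, which the model would be asked to return instead of text. It also assumes the translation keeps sentences in the same order as the original. The function names are illustrative, not part of GlossySnake or LangChain.

```python
def validate_alignment(alignment, n_src, n_tgt):
    """Check that every sentence index on each side is used exactly once, in order."""
    src_seen = [i for src, _ in alignment for i in src]
    tgt_seen = [j for _, tgt in alignment for j in tgt]
    if src_seen != list(range(n_src)):
        raise ValueError("source sentences skipped, repeated, or reordered")
    if tgt_seen != list(range(n_tgt)):
        raise ValueError("target sentences skipped, repeated, or reordered")


def render_interlinear(src_sents, tgt_sents, alignment):
    """Build the interlinear text from the original sentences only.

    The model contributes positions; the words always come from the
    two input texts, so they cannot be altered by the model.
    """
    validate_alignment(alignment, len(src_sents), len(tgt_sents))
    blocks = []
    for src, tgt in alignment:
        top = " ".join(src_sents[i] for i in src)
        bottom = " ".join(tgt_sents[j] for j in tgt)
        blocks.append(f"{top}\n{bottom}")
    return "\n\n".join(blocks)


english = [
    "Country folk do not look kindly upon armed strangers.",
    "The horsemen ride on, paying the villagers no heed.",
]
spanish = [
    "No les gustan los forasteros armados.",
    "Los jinetes cabalgan sin fijarse en los aldeanos.",
]
# One-to-one here; a pair like ([2, 3], [2]) would join two source sentences.
print(render_interlinear(english, spanish, [([0], [0]), ([1], [1])]))
```

If validation fails, the chunk can be sent back to the model or flagged for a human, and the book text itself is never at risk.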
-
Are you publishing the existing interlinear corpora somewhere? I would be interested as well.
When you say the current LLM is unreliable, which one are you referring to? Are you referring to the o1 model? In my tests I found it to be stable. The 4o model is unstable. The o1 model can match multiple sentences to one sentence in the other language. It can even reliably break long sentences into smaller components. Proust sentences can run pages long so we will want to break them up.
The limit on the back end is 128,000 tokens. My limit is smaller because I’m using the chat. We would need a script to break up the book into smaller components, some tool that doesn’t rely on an LLM to do this, or relies on one only partially, to check that the breakpoints match. The pieces could be chapter size, or 10,000 words, or 40,000 words, or whatever, and we would run the prompt on each one.
My problem is that I have no programming skills, so I can only work with the actual chat, which will always limit the text I can work with. That’s why we need a back-end solution. I don’t see why this couldn’t be done on the back end.
Again, it works. The output is good. The o1 model is reliable. I just need a better way to break up a book into chapters than copy and paste. The tech is there. When you were working on this the models weren’t ready. Now they are. Everything changed with the o1 model.
Regards,
Marius
On Dec 28, 2024, at 10:34 AM, Chanjin Park ***@***.***> wrote:
Hi, thank you for visiting this repo.
The current implementation (and even the design) is unstable, but I'm still interested in this project. I haven't touched the codebase for a while because I was doing a side project (collecting and structuring the existing interlinear corpora), but I will get back on track. (Should learn Italian...) Using the LangChain library should make the project more stable and maintainable, I hope...
As for sentence-level interlinear text generation from two given corpora: the simple prompt method would be unreliable, as you mentioned, due to the token limit and the unreliability of LLM output. A solution I'd propose is using augmented subroutines, which can be easily implemented with the LangChain library.
For example:
Divide each corpus into paragraphs, and each paragraph into sentences. This can be done with traditional NLP tokenizers, e.g. NLTK.
The problem: the sentences do not line up one to one. For example, an English translator may divide one German sentence into several. Thankfully, LLMs are versatile at this task.
Connect the matching sentences.
Since LLM output is by nature not consistent, the model should emit only the positions of the sentences, not the actual text.
Validate.
I would keep on the project. If you need any questions or help freely tell me.
-
No, Marius, I don't have the skills to do what you are asking for, but I am
enthusiastic about it and await technology (e.g., LLMs and their evolution,
as well as AI generally) and skilled visionaries to grab the opportunity.
You are on a track, learning other languages through deep and extensive
reading, and this branch of my interests in interlinear translations I
would like to highlight.
To wit, would you be interested in having your thoughtful message or parts
of it quoted or linked to via my modest effort at
https://interlineardotworld.blogspot.com/ ? Linking to it would require a
target, like a site or something you are working on, professional profile
perhaps. . . .
Thanks for your thorough and convincing reply to my short message.
kevin
-
Hi Kevin,
Understood.
Yes, you can quote me. But I don't have a website or any other online presence. I'm too busy with work and my two boys to carve out more than 1-2 hours per day, and I dedicate those to acquiring languages. I'm excited by the tools but much more excited about reading. Anything I do will come at the expense of reading, so unfortunately at this point I can't carve out the time for a side project.
I would encourage you to reach out to Kees from HypLern. He's very passionate about publishing interlinear books.
Just curious, do you know who built GlossySnake? I was thinking that implementing what I'm suggesting wouldn't take more skill than building that did.
Regards,
Marius
-
I wanted to run an idea by you guys. I'm not sure if you've thought about this.
I was thinking of taking two books, one English and one Spanish, then taking the first English sentence and putting the equivalent Spanish sentence right under it. This could be any pair of languages. Then take the next English sentence and the equivalent Spanish one, and so on. This would be similar to what DoppelText is doing. It's not as good as interlinear text, but I think it's much better than having two parallel texts. If I have two books side by side, I lose a lot of time switching between them and finding my place in the paragraph. If I have two sentences right under each other, I don't lose any time. The order doesn't matter. In the beginner phase, I find it more useful to read the English first and then the Spanish. In the more advanced stages, I can just read the Spanish and read the translation as needed.
I have epub versions of two books. The tricky part is manipulating the text. It's not just taking the first sentence from each and combining them. The meaning has to match. If there is one sentence in one text and two equivalent ones in the other, they would have to be matched by taking two sentences at a time or breaking the long one up.
I tried using ChatGPT to do this. The problem I'm having is that it can't handle large amounts of text. With the o1 model I was able to do about 5,000 words, but then I ran out of credits. Then I tried with the 4o model. I created three documents in the chat, one with the English text, one with the Spanish, and one with the interlinear. It only did about 7 pairs and then it stopped. I kept asking it to keep going and it would go a bit further and stop.
I think an LLM would be great at this because it understands meaning, so it can match the exact content. We could also have the model check its own work.
I think it's possible to do it, but I don't have the skills. I'm not a programmer.
This would open up the entire world of books out there. Anything that's translated. In Spanish, there are interlinear texts from HypLern and InterlinearBooks, and then there's only DoppelText, which does what I'm proposing, but they only have three books in Spanish. There's just not enough in Spanish in this format.
One way would be to acquire vocabulary by reading these texts 10 times. Another would be to read 10 times more text once. Let's say I read the interlinear texts 5 times and then move to this format. Do I want to read the same 3 books 5 times? Why not read 15 books? Why read The Three Musketeers 5 times and not read 5 of Dumas' books once?
One advantage is that I see the same vocabulary in a much broader context.
The biggest advantage is that it keeps my interest. People like different things and want to read different things.
I read a decent amount in English. Many books I read are available in other languages. I could just incorporate language learning into my current reading process. It would just take 2-3 times longer. Then learning a language and building vocabulary is not something I do on its own, as a separate thing. I can incorporate it into something I already do.
There's just so many advantages to doing something like this.
The main advantage of doing this is that the quality of the translation can be better than a translation generated by LLMs. The disadvantage is that there might not be a direct word-for-word translation.
Another possibility is to do this and then do the word-for-word translation as well. Maybe have the user pick the order. I think the order depends on the level. If I'm at a late intermediate or late advanced stage, I would pick Spanish, then word for word, then the thought-for-thought translation, since I would want to try to read mostly in Spanish and rely on the translation as needed. As a beginner or early intermediate, I would do English thought for thought first, then Spanish, then English word for word.
Let me know if you have any thoughts on this or if you would be interested in developing a solution for this. I think LLMs open up a whole new range of possibilities when it comes to language acquisition.
Here's the prompt I used with the o1-mini model. The output is not perfect, but maybe it can be improved with better prompting.
I will give you two texts. One in English and another one in another language (Spanish, German). I want you to create an interlinear text by taking breaking the english text up sentence by sentence and after each sentence paste the corespending sentence from the the other language without altering either text. For example, if I give you and English and a Spanish text, take the first english sentence exactly word for word, then take the coresponding sentence from the Spanish text exactly word for word and paste it after the English sentence. Then add a blank row. Then take the second english sentence exactly word for word, then take the coresponding sentence from the Spanish text exactly word for word and paste it after the second English sentence. Then add a blank row. Then take the third english sentence exactly word for word, then take the coresponding sentence from the Spanish text exactly word for word and paste it after the English sentence. And so on until you get to the end of the paragraph. At the end of each paragraph, add two blank rows. The output should be a combination of the two texts I gave you broken out sentence by sentence. Each pair of english and spanish sentence should be separated by a blank row and each paragraph by two blank rows. It is absolutely critical that both english and spanish texts are copied word for word. There should be absolutely no modification to the text. The texts should be 100% exactly the same. The only operation you are performing is spliting the text into sentences and weaving the spanish text into the english text to create an interlinear text. The words of the text are not to be modified under any circumstances.
Regards,
Marius
Here's the example of the output text:
PROLOGUE
PRÓLOGO
Mysterious bands of men on horseback travel the roads of Greece.
Misteriosos grupos de hombres a caballo recorren los caminos de Grecia.
The country folk watch them with suspicion from their plots of land, or the doors to their huts.
Los campesinos los observan con desconfianza desde sus tierras o desde las puertas de sus cabañas.
They know from experience that only those who represent danger travel: soldiers, mercenaries, and slave traders.
La experiencia les ha enseñado que solo viaja la gente peligrosa: soldados, mercenarios y traficantes de esclavos.
They frown and grumble until the men disappear over the horizon.
Arrugan la frente y gruñen hasta que los ven hundirse otra vez en el horizonte.
Country folk do not look kindly upon armed strangers.
No les gustan los forasteros armados.
The horsemen ride on, paying the villagers no heed.
Los jinetes cabalgan sin fijarse en los aldeanos.
For months, they have climbed mountains, traversed ravines, crossed valleys, forded rivers, and sailed from island to island.
Durante meses han escalado montañas, han franqueado desfiladeros, han cruzado valles, han vadeado ríos, han navegado de isla en isla.
Their muscles have hardened and their endurance increased since they were sent on this peculiar mission.
Sus músculos y su resistencia se han endurecido desde que les encargaron esta extraña misión.
To achieve their task, they must venture into violent realms in a world that is almost continually at war.
Para cumplir su tarea deben aventurarse por los violentos territorios de un mundo en guerra casi constante.
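Output in this layout can be checked mechanically against the prompt's requirement that both texts are copied word for word. Here is a minimal Python sketch, assuming each sentence sits on one line and non-blank lines strictly alternate English, Spanish. The function names are illustrative, and differences in whitespace and line breaks are deliberately ignored.

```python
def normalize(text):
    """Collapse all runs of whitespace so line breaks do not count as changes."""
    return " ".join(text.split())


def check_verbatim(english, spanish, interlinear):
    """Return True if the interlinear text contains both originals unchanged.

    Assumes non-blank lines alternate English, Spanish, English, Spanish...
    """
    lines = [ln for ln in interlinear.splitlines() if ln.strip()]
    if len(lines) % 2:
        return False  # an unpaired line means the weave is broken
    woven_en = normalize(" ".join(lines[0::2]))
    woven_es = normalize(" ".join(lines[1::2]))
    return woven_en == normalize(english) and woven_es == normalize(spanish)
```

A failed check says only that something was changed, dropped, or misplaced somewhere in the chunk, not where. Comparing the chunk paragraph by paragraph would narrow it down.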