Create Interlinear Text from two published books #65

mariuscmorar · 2024-12-25T22:09:29Z


I wanted to run an idea by you guys. I'm not sure if you've thought about this.

I was thinking of taking two books, one English and one Spanish, then taking the first English sentence and putting the equivalent Spanish sentence right under it. This could be any pair of languages. Then take the next English sentence and the equivalent Spanish one, and so on. This would be similar to what DoppelText is doing. It's not as good as interlinear text, but I think it's much better than having two parallel texts. If I have two books side by side, I lose a lot of time switching between them and finding my place in the paragraph. If I have two sentences right under each other, I don't lose any time. The order doesn't matter. In the more beginner phase, I find it more useful to read the English first and then the Spanish. In the more advanced stages, I can just read the Spanish and read the translation as needed.

I have epub versions of two books. The tricky part is manipulating the text. It's not just a matter of taking the first sentence from each and combining them. The meaning has to match. If there is one sentence in one text and two equivalent ones in the other, they would have to be matched by taking two sentences at a time or by breaking the long one up.
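To make the one-to-two matching concrete, here is a rough sketch of a length-based dynamic-programming alignment. It is a simplification in the spirit of classic length-based aligners, not the LLM approach discussed in this post: it assumes that a sentence and its translation have similar character counts, and it only allows 1:1, 1:2, and 2:1 groupings. The function name `align` and the `merge_penalty` value are made up for illustration.

```python
from typing import List, Tuple

# Allowed groupings: (number of English sentences, number of Spanish sentences)
MOVES = [(1, 1), (1, 2), (2, 1)]


def align(en: List[str], es: List[str], merge_penalty: float = 5.0) -> List[Tuple[str, str]]:
    """Pair sentences by minimising the total character-length mismatch."""
    inf = float("inf")
    n, m = len(en), len(es)
    # cost[i][j] = best cost found for aligning en[:i] with es[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in MOVES:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                a = " ".join(en[i:ni])
                b = " ".join(es[j:nj])
                # Length mismatch, plus a small penalty for merging sentences
                step = abs(len(a) - len(b))
                if (di, dj) != (1, 1):
                    step += merge_penalty
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    if cost[n][m] == inf:
        raise ValueError("no alignment possible with the allowed groupings")
    # Walk the back-pointers from the end to recover the chosen groupings
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((" ".join(en[pi:i]), " ".join(es[pj:j])))
        i, j = pi, pj
    pairs.reverse()
    return pairs
```

Length alone will mis-pair sentences in places where the translator reordered or dropped material, so output like this would still need a meaning-based check (by a person or a model) before it is trusted.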

I tried using ChatGPT to do this. The problem I'm having is that it can't handle large amounts of text. With the O1 model I was able to do about 5,000 words, but then I ran out of credits. Then I tried the 4o model. I created three documents in the chat: one with the English text, one with the Spanish, and one with the interlinear. It only did about 7 pairs and then stopped. I kept asking it to keep going, and it would go a bit further and stop again.
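One common way around the size limit is to feed the model small batches instead of the whole book. The sketch below only does the batching; it makes no model call. It assumes the two books have already been split into matching paragraph pairs (a big assumption), and the name `chunk_pairs` and the 800-word budget are arbitrary choices for illustration.

```python
from typing import List, Tuple

ParagraphPair = Tuple[str, str]  # (English paragraph, Spanish paragraph)


def chunk_pairs(pairs: List[ParagraphPair], max_words: int = 800) -> List[List[ParagraphPair]]:
    """Group paragraph pairs into batches that stay under a word budget."""
    chunks: List[List[ParagraphPair]] = []
    current: List[ParagraphPair] = []
    count = 0
    for en_par, es_par in pairs:
        words = len(en_par.split()) + len(es_par.split())
        # Start a new batch when adding this pair would exceed the budget
        if current and count + words > max_words:
            chunks.append(current)
            current, count = [], 0
        current.append((en_par, es_par))
        count += words
    if current:
        chunks.append(current)
    return chunks
```

Each batch could then be sent as its own request with the same prompt, and the results concatenated in order. A single pair longer than the budget still ends up in a batch of its own rather than being split.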

I think an LLM would be great at this because it understands meaning, so it can match the exact content. We could also have the model check its own work.

I think it's possible to do it, but I don't have the skills. I'm not a programmer.

This would open up the entire world of books out there. Anything that's translated. In Spanish, there are interlinear texts from HypLern and interlinearbooks, and then there's only DoppelText, which does what I'm proposing, but they only have three books in Spanish. There's just not enough in Spanish in this format.

One way would be to acquire vocabulary by reading these texts 10 times. Another would be to read 10 times more text once. Let's say I read the interlinear texts 5 times and then move to this format. Do I want to read the same 3 books 5 times? Why not read 15 books? Why read the Three Musketeers 5 times and not read 5 of Dumas' books once?

One advantage is that I see the same vocabulary in a much broader context.

The biggest advantage is that it keeps my interest. People like different things and want to read different things.

I read a decent amount in English. Many books I read are available in other languages. I could just incorporate language learning into my current reading process. It would just take 2-3 times longer. Then learning a language and building vocabulary is not a separate thing I do on its own. I can incorporate it into something I already do.

There are just so many advantages to doing something like this.

The main advantage of doing this is that the quality of the translation can be better than a translation generated by LLMs. The disadvantage is that there might not be a direct word-for-word translation.

Another possibility is to do this and then do the word-for-word translation as well. Maybe have the user pick the order. I think the order depends on the level. If I'm at a late intermediate or late advanced stage, I would pick Spanish, then word for word, then the thought-for-thought translation, since I would want to try to read mostly in Spanish and rely on the translation as needed. As a beginner or early intermediate, I would do English thought for thought first, then Spanish, then English word for word.
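A minimal sketch of the "user picks the order" idea: each sentence is stored once with its three versions, and the display order is looked up by level. The key names (`original`, `word_for_word`, `translation`), the level names, and the `render` function are all hypothetical, and the word-for-word gloss in the usage note is a made-up illustration rather than output from any tool.

```python
from typing import Dict, List

# Display order per level, following the two orderings described above
ORDERS: Dict[str, List[str]] = {
    "beginner": ["translation", "original", "word_for_word"],
    "advanced": ["original", "word_for_word", "translation"],
}


def render(segment: Dict[str, str], level: str) -> str:
    """Return the lines of one segment in the order chosen for this level."""
    # Skip any version that is missing, e.g. when no word-for-word gloss exists
    return "\n".join(segment[key] for key in ORDERS[level] if key in segment)
```

For example, with `segment = {"original": "No les gustan los forasteros armados.", "word_for_word": "Not to-them please the strangers armed.", "translation": "Country folk do not look kindly upon armed strangers."}`, calling `render(segment, "advanced")` puts the Spanish line first and `render(segment, "beginner")` puts the English translation first.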

Let me know if you have any thoughts on this or if you would be interested in developing a solution for this. I think the LLMs open up a whole new range of possibilities when it comes to language acquisition.

Here's the prompt I used with the O1-mini model. The output is not perfect but maybe it can be improved with better prompting.

I will give you two texts. One in English and another one in another language (Spanish, German). I want you to create an interlinear text by breaking the English text up sentence by sentence and after each sentence paste the corresponding sentence from the other language without altering either text. For example, if I give you an English and a Spanish text, take the first English sentence exactly word for word, then take the corresponding sentence from the Spanish text exactly word for word and paste it after the English sentence. Then add a blank row. Then take the second English sentence exactly word for word, then take the corresponding sentence from the Spanish text exactly word for word and paste it after the second English sentence. Then add a blank row. Then take the third English sentence exactly word for word, then take the corresponding sentence from the Spanish text exactly word for word and paste it after the English sentence. And so on until you get to the end of the paragraph. At the end of each paragraph, add two blank rows. The output should be a combination of the two texts I gave you broken out sentence by sentence. Each pair of English and Spanish sentences should be separated by a blank row and each paragraph by two blank rows. It is absolutely critical that both English and Spanish texts are copied word for word. There should be absolutely no modification to the text. The texts should be 100% exactly the same. The only operation you are performing is splitting the text into sentences and weaving the Spanish text into the English text to create an interlinear text. The words of the text are not to be modified under any circumstances.

Regards,
Marius

Here's an example of the output text:
PROLOGUE
PRÓLOGO
Mysterious bands of men on horseback travel the roads of Greece.
Misteriosos grupos de hombres a caballo recorren los caminos de Grecia.
The country folk watch them with suspicion from their plots of land, or the doors to their huts.
Los campesinos los observan con desconfianza desde sus tierras o desde las puertas de sus cabañas.
They know from experience that only those who represent danger travel: soldiers, mercenaries, and slave traders.
La experiencia les ha enseñado que solo viaja la gente peligrosa: soldados, mercenarios y traficantes de esclavos.
They frown and grumble until the men disappear over the horizon.
Arrugan la frente y gruñen hasta que los ven hundirse otra vez en el horizonte.
Country folk do not look kindly upon armed strangers.
No les gustan los forasteros armados.
The horsemen ride on, paying the villagers no heed.
Los jinetes cabalgan sin fijarse en los aldeanos.
For months, they have climbed mountains, traversed ravines, crossed valleys, forded rivers, and sailed from island to island.
Durante meses han escalado montañas, han franqueado desfiladeros, han cruzado valles, han vadeado ríos, han navegado de isla en isla.
Their muscles have hardened and their endurance increased since they were sent on this peculiar mission.
Sus músculos y su resistencia se han endurecido desde que les encargaron esta extraña misión.
To achieve their task, they must venture into violent realms in a world that is almost continually at war.
Para cumplir su tarea deben aventurarse por los violentos territorios de un mundo en guerra casi constante.
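The "copied word for word" requirement in the prompt can be checked mechanically instead of by eye. The sketch below assumes the woven output strictly alternates English and Spanish lines, as in the example above, and that only whitespace may differ from the source texts. The function names are hypothetical.

```python
from typing import List, Tuple


def normalize(text: str) -> str:
    """Collapse all whitespace so only words and punctuation are compared."""
    return " ".join(text.split())


def split_woven(woven: str) -> Tuple[List[str], List[str]]:
    """Split alternating output into (English lines, Spanish lines)."""
    lines = [ln.strip() for ln in woven.splitlines() if ln.strip()]
    return lines[0::2], lines[1::2]


def is_verbatim(english: str, spanish: str, woven: str) -> bool:
    """True if the woven text contains both source texts unchanged."""
    en_lines, es_lines = split_woven(woven)
    return (
        normalize(" ".join(en_lines)) == normalize(english)
        and normalize(" ".join(es_lines)) == normalize(spanish)
    )
```

A `False` result only says that something was changed, dropped, or put out of order. It does not say where, and it cannot tell whether the paired sentences actually match in meaning.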

jkmactavish · 2024-12-27T14:57:05Z


You might be interested in exploring or contributing to the site linked below. Another person interested in interlinear displays for language translations.

kevin

PS https://interlineardotworld.blogspot.com/


mariuscmorar · 2024-12-27T19:41:56Z

Author

Thanks! I’ll check it out.

Regards,
Marius


jkmactavish · 2024-12-28T09:04:03Z


Marius,

Do I understand correctly that you want to have book-length text displayed in interlinear form, and that you would read such long texts? If so, I believe I have misjudged the universe of those who want to have and use interlinear documents. Based on your answer, I will correct what I wrote on that blog.

Thanks for a reply,
kevin

…

On Wed, Dec 25, 2024 at 11:09 PM mariuscmorar ***@***.***> wrote: I wanted to run an idea by you guys. I'm not sure if you've tought about this. I was thinking to take two books, one English and one Spanish, then take the first English sentence and right under it have the equivalent Spanish sentence. This could be any pair of languages. Then take the next English sentence and the equivalent Spanish one and so on. This would be similar to what DoppelText is doing. It's not as good as the interlinear text but I think much better than having two parallel texts. If I have two books side by side, I lose a lot of time switching between them and finding my way in the paragraph. If I have two sentences right under each other, I don't lose any time. The order doesn't matter. In the more beginner phase, I find it more useful to read the English first and then the Spanish. In the more advanced stages, I can just read the Spanish and read the translation as needed. I have epub versions of books two books. The tricky part is manipulating the text. And it's not just taking the first sentence from each and combining them. The meaning has to match. If there is one sentence in one text and two equivalent ones in the other, they would have to match by taking two sentences at a time or breaking the long one up. I tried using ChatGPT to do this. The problem I'm having is that it can't handle large amount of texts. With the O1 model I was able to do about 5,000 words but then I run out of credits. Then I tried with the 4o model. I created three documents in the chat, one with English text one with Spanish and one with the interlinear. It only did about 7 pairs and then it stopped. I kept asking it to keep going and it would go a bit further and stop. I think an LLM model would be great at this because it understands meaning so it can match the exact content. We could also have the model checking its own work. I think it's possible to do it, but I don't have the skills. 
I'm not a programmer. This would open up the entire world of books out there. Anything that's translated. In Spanish, there's interlinear texts from HypLern and interlinearbooks, and then there's only DoppelText which does what I'm proposing, but they only have three books in Spanish. There's just not enough in Spanish in this format. One way would be to acquire vocabulary by reading these texts 10 times. Another would be to read 10 times more text once. Let's say I read the interlinear texts 5 times and then move to this format. Do I want to read the same 3 books 5 times? Why not read 15 books. Why read the Three Musketeers 5 times and not read 5 of Dumas' books once? One advantage is that I see the same vocabulary in a much broader context. The biggest advantage is that it keeps my interest. People like different things and want to read different things. I read a decent amount in English, Many books I read are available in other languages. I could just incorporate language learning into my current reading process. It would just take 2-3 times longer. Then learning a language and building vocabulary is not something I do on its own, a separate thing. I can incorporate it into something I already do. There's just so many advantages to doing something like this. The main advantage of doing this is that the quality of the translation can be better than a translation generated by LLMs. THe dissadvantage is the there might not be direct word for word translation. Another possibility is to do this and then do the word for word translation as well. Maybe have the user pick the order. I think the order depends on the level. If I'm at an late intermediate or late advanced stage, I would pick Spanish, then word for word, then thought for though tranlation since I would want to try to read mostly inspanish and rely on translation as needed. At an beginner or early intermediate I would do english thought for though first, then spanish, then english word for word. 
Let me know if you have any thoughts on this, or if you would be interested in developing a solution. I think LLMs open up a whole new range of possibilities when it comes to language acquisition.

Here's the prompt I used with the o1-mini model. The output is not perfect, but maybe it can be improved with better prompting.

I will give you two texts. One in English and another one in another language (Spanish, German). I want you to create an interlinear text by breaking the English text up sentence by sentence and after each sentence paste the corresponding sentence from the other language without altering either text. For example, if I give you an English and a Spanish text, take the first English sentence exactly word for word, then take the corresponding sentence from the Spanish text exactly word for word and paste it after the English sentence. Then add a blank row. Then take the second English sentence exactly word for word, then take the corresponding sentence from the Spanish text exactly word for word and paste it after the second English sentence. Then add a blank row. Then take the third English sentence exactly word for word, then take the corresponding sentence from the Spanish text exactly word for word and paste it after the English sentence. And so on until you get to the end of the paragraph. At the end of each paragraph, add two blank rows. The output should be a combination of the two texts I gave you broken out sentence by sentence. Each pair of English and Spanish sentences should be separated by a blank row and each paragraph by two blank rows. It is absolutely critical that both English and Spanish texts are copied word for word. There should be absolutely no modification to the text. The texts should be 100% exactly the same. The only operation you are performing is splitting the text into sentences and weaving the Spanish text into the English text to create an interlinear text. The words of the text are not to be modified under any circumstances.

Regards, Marius

Here's the example of the output text:

PROLOGUE
PRÓLOGO

Mysterious bands of men on horseback travel the roads of Greece.
Misteriosos grupos de hombres a caballo recorren los caminos de Grecia.

The country folk watch them with suspicion from their plots of land, or the doors to their huts.
Los campesinos los observan con desconfianza desde sus tierras o desde las puertas de sus cabañas.

They know from experience that only those who represent danger travel: soldiers, mercenaries, and slave traders.
La experiencia les ha enseñado que solo viaja la gente peligrosa: soldados, mercenarios y traficantes de esclavos.

They frown and grumble until the men disappear over the horizon.
Arrugan la frente y gruñen hasta que los ven hundirse otra vez en el horizonte.

Country folk do not look kindly upon armed strangers.
No les gustan los forasteros armados.

The horsemen ride on, paying the villagers no heed.
Los jinetes cabalgan sin fijarse en los aldeanos.

For months, they have climbed mountains, traversed ravines, crossed valleys, forded rivers, and sailed from island to island.
Durante meses han escalado montañas, han franqueado desfiladeros, han cruzado valles, han vadeado ríos, han navegado de isla en isla.

Their muscles have hardened and their endurance increased since they were sent on this peculiar mission.
Sus músculos y su resistencia se han endurecido desde que les encargaron esta extraña misión.

To achieve their task, they must venture into violent realms in a world that is almost continually at war.
Para cumplir su tarea deben aventurarse por los violentos territorios de un mundo en guerra casi constante.
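The "copied word for word" requirement in the prompt can be checked mechanically instead of trusting the model. Below is a minimal sketch of such a check, assuming the output follows the layout the prompt asks for (two lines per pair, pairs separated by blank rows). The function names `normalize`, `split_interlinear` and `is_verbatim` are made up for illustration and are not part of GlossySnake.

```python
import re


def normalize(text):
    """Collapse whitespace so only words and punctuation are compared."""
    return " ".join(text.split())


def split_interlinear(output):
    """Split interlinear output into first-language and second-language lines.

    Assumes each pair is two consecutive non-empty lines and that pairs are
    separated by one or more blank rows.
    """
    first, second = [], []
    for block in re.split(r"\n\s*\n", output.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) != 2:
            raise ValueError(f"expected a pair of lines, got {lines!r}")
        first.append(lines[0])
        second.append(lines[1])
    return first, second


def is_verbatim(output, source_first, source_second):
    """True if re-joining each language's lines reproduces its source text."""
    first, second = split_interlinear(output)
    return (normalize(" ".join(first)) == normalize(source_first)
            and normalize(" ".join(second)) == normalize(source_second))


english = "The horsemen ride on. They frown."
spanish = "Los jinetes cabalgan. Arrugan la frente."
woven = ("The horsemen ride on.\nLos jinetes cabalgan.\n\n"
         "They frown.\nArrugan la frente.\n")
print(is_verbatim(woven, english, spanish))  # True
```

A check like this only catches altered, dropped or reordered text. It says nothing about whether the paired sentences actually mean the same thing.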

0 replies

mariuscmorar · 2024-12-28T14:37:13Z

mariuscmorar
Dec 28, 2024
Author

Hi Kevin,

Absolutely yes!!! But let me pursue this further: why wouldn't everyone want it? Let's think through this.

The number of people who want to learn languages is large. Most fail or quit. Why? Because it's hard and boring!!! They do drills, they do flashcards, Anki, etc. That's what I used to do. The Hamiltonian method, which is the method you and I discovered, tells us that it doesn't have to be hard or super boring! So, if you don't do flashcards, how do you do it? By reading!

Then there's the question of how well you want to learn a language. If you just talk to people or watch TV or listen to podcasts, you will only acquire a limited vocabulary range, 5,000-10,000 words. If you want to learn the language at an advanced level, you have to read. And the frequency of words beyond the most frequent 10,000 is very, very low. So, unless you make flashcards and memorize them that way, which is not fun, you'll have to read, and read, and read, and keep on reading. There's just no way around it.

I would say that most people want to learn languages to talk to people and get by. But a significant number want to be able to read. A large number would want to acquire 20,000 words, which is what an average native speaker has. Quite a few people would want to acquire 30,000 or more, which is what an educated person has. How many people get there? Very few. Why? Because it's very boring and hard work looking up words in the dictionary and making flashcards.

So what's the solution? The solution is interlinear texts. Why? Because it's the most efficient way to acquire vocabulary. I would say at the very beginning Assimil manuals and the shadowing technique as taught by Alexander Arguelles are the most efficient. Then at the intermediate level nothing comes close to interlinear texts. That's the level with a vocabulary size of 2,000-20,000 words. I would argue that interlinear texts are the best method until one knows at least 98%-99% of the words.
If you know that many words, then you still have to look up 3-6 words per page on average, but you can figure many out from context. At the advanced level the best tool I found is ReadLang. I can look up words almost instantly and keep reading. I don't have to reread the sentence when I look up the word. It's the most efficient way. But if I have to look up a lot of them, then it becomes very inefficient. If I have to look up more than 2%, then interlinear texts are better. If less than 1-2%, then ReadLang is ideal. I can look up 3-6 words per page and still read at a decent pace.

So the challenge is twofold. One, how do I acquire a large enough vocabulary so that I can pick up any text in any field and know 98% of the words? There are a lot of different kinds of texts out there that use different subsets of vocabulary. Even if I know 98% of the words in a newspaper, it doesn't mean I can pick up Proust and still know 98%, or a scientific textbook, or a paper on physics, or an academic philosophy book. So how do I do it in the most efficient way? Right now there is no way. There are very few interlinear texts. The ones I found are:

HypLern, which you can buy in print as well
Interlinearbooks
LeyerlePublications, in print as well but extremely expensive
DoppelText, not truly interlinear, but you can get by with Readlang, and it has the widest selection by far

These sources have more material in the big languages: German, French, Italian. Much less in Spanish. But if you want to learn a smaller language, like Finnish, you're out of luck. There's almost nothing for you. And to acquire a vocabulary at an advanced level, you need to read a lot of books. Not just a dozen, but a few dozen, maybe 100. And given how diverse people's tastes are, we would need a few thousand books for each pair of languages to give enough content in various styles.

Then there's the question of interest. People have different interests. Some like literature, others theology, others science.
Some don't want to read fairytales but would devour a textbook on physics. Some don't care about science but would devour a fantasy or mystery or romance novel. Because they don't have content they enjoy, they lose interest and give up, or make much slower progress than they would if they had reading material that grabs them.

So what is the solution? One solution is to get two books and read the original and the translation side by side. But that's a pain. It's extremely inefficient. You read a sentence, then have to lift your eyes and look for the sentence in the other book, then come back and find your place again, then find the words you don't know and go back and forth a few times. It's better than flashcards or looking up words in the dictionary, but you will make progress slowly. The other challenge with this is idiomatic translation. Often the translation is not word for word, and you can acquire the wrong definitions like this.

The ideal solution is to have any text one might want to read available in both word for word and thought for thought in interlinear format. We might get there in a few years with LLMs, but we're not there yet. However, with LLMs we do have a practical solution. We can have the LLMs create an interlinear text like I showed, with the human translation, the original and the word for word. The LLMs can parse a text, extract a sentence and match its translation, and then generate a word for word translation.

I am doing this right now. I wanted to read Papyrus: The Invention of Books in the Ancient World by Irene Vallejo. I wanted to read it in English. Then I realized it's translated from Spanish, which is a language I'm learning now. So that's how I got this idea. Why not read it in English as I would normally do, but after each sentence read the Spanish sentence as well? It would take me longer, but I would acquire vocabulary in the process. It's a proper translation, by a human.
So I used ChatGPT to create a three-line interlinear text. And to deal with idiomatic translation or situations where the translator takes liberties, I have the ChatGPT word for word translation. I don't read that. I only look at it when I need help. It's not perfect, but good enough. And I'm reading it with ReadLang, so if I need even more help, I can get the meaning of a word like I would when reading texts where I already know 98% of the vocabulary. I use ReadLang maybe 2-3 times per page.

This solution is not perfect. The perfect solution would be a human-generated three-line translation with both word for word and thought for thought. If you want to see what I think is the perfect solution, look at Leyerle Publications. They have the original, then word for word, then thought for thought, then a commentary at the end. But it's only for German, Italian and French, and they only do opera libretti and songs. We will never have that done by humans at scale. But we don't need to. I think that what I'm proposing is good enough.

I'm trying this out as we speak and it's working. I'm on day 50 of my first Assimil manual and I'm reading a non-fiction text in Spanish. English first, then Spanish. And I can understand the Spanish text. And I'm building vocabulary before I even finish my first manual. When my Spanish gets good enough, I'll swap the order and have the Spanish line first, then word for word, and then the thought for thought human translation last. This is still useful to learn idioms.

I could read like this for the rest of my life. In fact, I will. Since reading like this is so effortless, I will probably always have a language that I am acquiring like this. I will only give up this format when I know at least 99% of the vocabulary. Until then there is absolutely no reason to give it up. When my Spanish is good enough, I will just flip the order and have Spanish first. Then I will just read the Spanish row and rely on the two translations when I need to.
Until I don't need it much anymore. Then I'll move to Readlang. But I could read the entire book by just reading the first Spanish row and have the translations just in case.

I'm following Alexander Arguelles's techniques to learn. Shadowing, listening and reading aloud. And reading a lot. He advises against flashcards or even looking things up in the dictionary. He advises reading interlinear texts and then two books side by side. We don't have enough interlinear books. I found 10 texts (HypLern and Interlinearbooks). I need 100. So this is the solution. This is what Arguelles is recommending, but much, much more efficient.

So, to answer your question, why wouldn't people want this? I can't imagine why anyone who is serious about learning a language at an advanced level would not want to read like this. There are two reasons I can think of: 1) they don't know about it and 2) they can't find the texts that grab them. The only texts are literary texts and some just don't want to read fairytales 5-10 times. I get it. The thing is, this solution doesn't exist (except Leyerle Publications), so how can people not want something that doesn't exist? And then for purely interlinear, how can we say people are not interested when the texts they are interested in don't exist? I can't see any reason why this will not be used widely once we can fine-tune the technology.

Another advantage is that we don't have to read the same text 5-10 times. The reason we have to do that now is that we only have a few texts. And since interlinear is the most efficient way to master vocabulary, we want to master all the vocabulary available in this form. But if we had an almost unlimited amount of texts at our disposal, we wouldn't have to read a text 10 times. Why would I read The Three Musketeers 10 times? The only reason is that it's the only Dumas text available in this format (published by DoppelText). But I could read all 10 books in the entire series in that time.
And I could acquire an even broader vocabulary. I wouldn't lose anything by reading 10 books one time instead of one book 10 times. Quite the opposite, it would keep my interest all the way through. It wouldn't feel like a chore. It wouldn't feel like studying. It wouldn't feel like learning languages. It would feel like reading. It would take 2-3 times longer in the beginning, but it would get much quicker towards the end.

I'm super excited, as you can tell. I believe this is the future of language learning. I'm very excited about LLMs. I think they will transform many aspects of the way we do things now and this will be one of them. The Hamiltonian method is effortless but it's boring because you read texts that don't grab you over and over. I think this is the first time in human history when it's possible to acquire a language almost effortlessly by reading whatever you want. I know it works because I'm doing it.

The challenge is that I'm not a programmer and I'm using the chat, and there is a limit to how much text you can input. So I have to break up a book into small bits to run the prompt, which is a waste of time. We just need someone to build a back end to do the whole thing. Take two epubs, break them up, then run the prompt and put it back together. This was not possible with the 4o model because the model wouldn't complete the task. The o1 mini does complete the task. So I think this last o1 model is a portal to another universe, a universe where we can acquire languages to a high level effortlessly. There are a lot of people who enjoy reading their favorite books but don't like learning languages (the old way). Now they can, because language learning can just become reading, and if reading is enjoyable, language learning will be enjoyable too.

I will keep breaking the book up like this until I find someone who's willing to implement the back end. Would you be able and willing to do it? If not, do you know someone who would? I just don't have the skills, otherwise I would do it myself. If not, I'll keep asking.

Regards, Marius

________________________________
From: jkmactavish
Sent: Saturday, December 28, 2024 1:04 AM
Subject: Re: [parkchamchi/GlossySnake] Create Interlinear Text from two published books (Discussion #65)

Marius,

Do I understand correctly that you want to have book-length text displayed in interlinear form, and that you would read such long texts? If so, I believe I have misjudged the universe of those who want to have and use interlinear documents. Based on your answer, I will correct what I wrote on that blog.

Thanks for a reply,
kevin


0 replies

parkchamchi · 2024-12-28T15:33:58Z

parkchamchi
Dec 28, 2024
Maintainer

Hi, thank you for visiting this repo.

The current implementation (and even the design) is unstable, but I'm still interested in this project. I haven't touched the codebase lately because I was doing a side project (collecting and structuring the existing interlinear corpora), but I will get back on track. (Should learn Italian...) Using the Langchain library should make the project more stable and maintainable, I hope...

As for sentence-level interlinear text generation from two given corpora: the simple prompt method would be unreliable, as you mentioned, due to the token limit and the unreliability of LLM output. A solution I'd propose is using augmented subroutines, which can be easily implemented using the Langchain library.

For example,

Divide the corpora into paragraphs, and the paragraphs into sentences. This can be done with traditional NLP tokenizers, e.g. NLTK.

The problem: the sentences are not split evenly. For example, an English translator may divide one German sentence into several. Thankfully, an LLM should be versatile at merging the sentences.

Connect the matching sentences.

Since LLM output is by nature not consistent, the model should only emit the positions of the matching sentences, not the sentence text itself.
Embedding the sentences and calculating the cosine similarity could be useful, but the LLM itself would be sufficient to do the job.

Validate.
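The "emit only the position" and "validate" steps can be sketched as follows. This is a minimal illustration, not the GlossySnake implementation: it assumes both texts are already split into sentence lists (for instance by NLTK), and that the model returns groups of sentence indices in reading order. The names `validate_alignment` and `render` and the group format are hypothetical.

```python
def validate_alignment(groups, n_src, n_tgt):
    """Check that the index groups use every sentence exactly once, in order.

    `groups` is a list of (src_indices, tgt_indices) pairs. For example,
    ([0], [0, 1]) means source sentence 0 matches target sentences 0 and 1.
    """
    for side, total in ((0, n_src), (1, n_tgt)):
        flat = [i for group in groups for i in group[side]]
        if flat != list(range(total)):
            raise ValueError(f"side {side}: indices {flat} do not cover 0..{total - 1} in order")


def render(groups, src, tgt):
    """Build the interlinear text from the ORIGINAL sentences.

    The model only supplied indices, so it cannot alter any wording.
    """
    validate_alignment(groups, len(src), len(tgt))
    pairs = []
    for src_idx, tgt_idx in groups:
        src_line = " ".join(src[i] for i in src_idx)
        tgt_line = " ".join(tgt[i] for i in tgt_idx)
        pairs.append(src_line + "\n" + tgt_line)
    return "\n\n".join(pairs)


# Made-up sentences: one English sentence split into two by the translator.
src = ["It was raining, so we stayed home.", "We read all day."]
tgt = ["Llovía.", "Así que nos quedamos en casa.", "Leímos todo el día."]
groups = [([0], [0, 1]), ([1], [2])]  # what the model would be asked to emit
print(render(groups, src, tgt))
```

Requiring strictly increasing indices is a design choice in this sketch: it rejects skipped or duplicated sentences, but it would also reject a translation that reorders sentences, which would need a looser check.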

I will keep working on the project. If you have any questions or need help, feel free to tell me.

0 replies

mariuscmorar · 2024-12-28T16:53:23Z

mariuscmorar
Dec 28, 2024
Author

Are you publishing the existing interlinear corpora somewhere? I would be interested as well.

When you say the current LLM is unreliable, which one are you referring to? Are you referring to the o1 model? In my tests I found it to be stable. The 4o model is unstable. The o1 model can match multiple sentences to one sentence in the other language. It can even reliably break long sentences into smaller components. Proust sentences can run pages long, so we will want to break them up.

The limit on the back end is 128,000 tokens. My limit is smaller because I’m using the chat. We would need a script to break up the book into smaller components, some tool that doesn’t rely on an LLM to do this, or relies on it only partially to check that the breakpoints match. Let’s say chapter size, or 10,000 words, or 40,000 words, or whatever, and run the prompt like that.

My problem is that I have no programming skills, so I can only work with the actual chat, which will always limit the text I can work with. That’s why we need a backend solution. I don’t see why this couldn’t be done on the back end. Again, it works. The output is good. The o1 model is reliable. I just need a better way to break up a book into chapters than copy-pasting. The tech is there. When you were working on this the models weren’t ready. Now they are. Everything changed with the o1 model.

Regards, Marius
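The "script to break up the book" mentioned above does not need an LLM for the basic cut. Here is a minimal sketch that groups consecutive paragraphs into chunks under a word budget, assuming the book has already been extracted from the epub into a list of paragraph strings. The name `chunk_paragraphs` and the default budget are illustrative assumptions, and words are only a rough proxy for model tokens.

```python
def chunk_paragraphs(paragraphs, max_words=10_000):
    """Group consecutive paragraphs into chunks of at most `max_words` words.

    Paragraphs are never cut in the middle. A single paragraph longer than
    the budget becomes its own oversized chunk.
    """
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Start a new chunk if adding this paragraph would exceed the budget.
        if current and count + n > max_words:
            chunks.append(current)
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append(current)
    return chunks


paragraphs = ["a b c", "d e", "f g h i", "j"]
print(chunk_paragraphs(paragraphs, max_words=5))
# [['a b c', 'd e'], ['f g h i', 'j']]
```

This handles one book only. The two books still have to be cut at matching points, for example at chapter boundaries, which is the part where a partial LLM check of the breakpoints could come in.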

0 replies

jkmactavish · 2024-12-30T08:41:44Z

jkmactavish
Dec 30, 2024

No, Marius, I don't have the skills to do what you are asking for, but I am enthusiastic about it and await technology (e.g., LLMs and their evolution, as well as AI generally) and skilled visionaries to grab the opportunity. You are on a track, learning other languages through deep and extensive reading, and this branch of my interests in interlinear translations I would like to highlight. To wit, would you be interested in having your thoughtful message or parts of it quoted or linked to via my modest effort at https://interlineardotworld.blogspot.com/ ? Linking to it would require a target, like a site or something you are working on, a professional profile perhaps. . . . Thanks for your thorough and convincing reply to my short message.

kevin

…

0 replies

mariuscmorar · 2024-12-30T12:56:24Z

mariuscmorar
Dec 30, 2024
Author

Hi Kevin,

Understood. Yes, you can quote me. But I don't have a website or any other online presence. I'm too busy with work and my two boys to carve out more than 1-2 hours per day, and I dedicate those to acquiring languages. I'm excited by the tools but much more excited about reading. Anything I do will come at the expense of reading, so unfortunately at this point I can't carve out the time for a side project. I would encourage you to reach out to Kees from HypLern. He's very passionate about publishing interlinear books.

Just curious, do you know who built GlossySnake? I was thinking that implementing what I'm suggesting wouldn't take more skills than building that did.

Regards, Marius


0 replies
