-
Find datasets for multiple languages
- English (Zahra)
- Italian (Paolo)
- Whatever is on Kaggle (Ettore)
-
Fix the VM (might be a memory configuration issue)
-
Parse the data into raw .txt files containing text only
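
A minimal sketch of the parsing step, assuming the source data contains HTML/XML-style tags that must be dropped (the actual dataset formats are not decided yet, so this is only one possible case). The class and method names are hypothetical.

```java
// Hypothetical helper: reduces a marked-up string to plain text.
// Assumption: the input uses HTML/XML-style <tags>; other formats
// (CSV, JSON, ...) would need a different parser.
class RawTextSketch {
    static String toRawText(String input) {
        // Replace every <...> tag with a space so adjacent words stay separated.
        String noTags = input.replaceAll("<[^>]*>", " ");
        // Collapse runs of whitespace (including newlines) into one space.
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(toRawText("<p>Hello</p>\n<p>big   world</p>"));
    }
}
```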
-
Write a working version of the program, and remember to add
- Combiner and In-Mapper Combiner
- Setup and Cleanup methods
- Experiments with different numbers of reducers
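
A rough sketch of how the In-Mapper Combiner fits together with the Setup and Cleanup methods. It is written in plain Java with no Hadoop dependency so it can be run on its own; in the real job the three methods would correspond to `setup`, `map` and `cleanup` of Hadoop's `Mapper`. The class name and the `emitted` map (a stand-in for the framework's output) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for a Hadoop Mapper; no Hadoop classes are used.
class InMapperCombinerSketch {
    // Per-task buffer of partial counts, shared by all map() calls.
    private Map<Character, Long> counts;
    // Hypothetical stand-in for the pairs written to the framework.
    final Map<Character, Long> emitted = new TreeMap<>();

    // Runs once before the first record: allocate the buffer.
    void setup() {
        counts = new HashMap<>();
    }

    // Runs once per input line: count letters locally, emit nothing yet.
    void map(String line) {
        for (int i = 0; i < line.length(); i++) {
            char c = Character.toLowerCase(line.charAt(i));
            if (Character.isLetter(c)) {
                counts.merge(c, 1L, Long::sum);
            }
        }
    }

    // Runs once after the last record: emit one pair per distinct letter.
    void cleanup() {
        for (Map.Entry<Character, Long> e : counts.entrySet()) {
            emitted.merge(e.getKey(), e.getValue(), Long::sum);
        }
    }

    // Drives the three phases in the order the framework would.
    static Map<Character, Long> run(String... lines) {
        InMapperCombinerSketch mapper = new InMapperCombinerSketch();
        mapper.setup();
        for (String line : lines) {
            mapper.map(line);
        }
        mapper.cleanup();
        return mapper.emitted;
    }

    public static void main(String[] args) {
        System.out.println(run("Hello", "world"));
    }
}
```

Compared with a separate Combiner class, this emits at most one pair per distinct letter per map task, at the cost of holding the buffer in memory for the whole task.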
-
Carry out experiments
- Letter frequency per language
- Statistics on execution
  - Execution time
  - Memory usage
  - Number of InputSplits
- Impact of In-Mapper Combiner
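
For the letter frequency experiment, the raw counts from the job can be turned into relative frequencies (count divided by the total number of letters) so that languages with different corpus sizes are comparable. A minimal sketch, assuming the counts are already available as a map; the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical post-processing step for the job output.
class LetterFrequencySketch {
    // Converts absolute letter counts into fractions of the total.
    static Map<Character, Double> relativeFrequencies(Map<Character, Long> counts) {
        long total = 0;
        for (long c : counts.values()) {
            total += c;
        }
        Map<Character, Double> freq = new TreeMap<>();
        if (total == 0) {
            return freq; // no letters seen, nothing to normalise
        }
        for (Map.Entry<Character, Long> e : counts.entrySet()) {
            freq.put(e.getKey(), (double) e.getValue() / total);
        }
        return freq;
    }

    public static void main(String[] args) {
        Map<Character, Long> counts = new TreeMap<>();
        counts.put('a', 3L);
        counts.put('b', 1L);
        System.out.println(relativeFrequencies(counts));
    }
}
```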
-
Fix bugs
-
Write a short report (LaTeX)