Due: Wednesday 09/08/21 Before 5:00 PM
Total Points = 30
In this assignment, you will:
- Read and understand the course AI policy and late policy
- Review (or learn) Scala basics
- Where to find/how to use Scala docs.
- How to edit/compile/debug code in Scala.
- How to find/read the Scala documentation and use standard libraries
- How to read and parse a CSV file in Scala
- How to write a file in Scala
Review the lecture notes and provided example code for some insight into the Scala syntax. You will also want to read the Scala references provided below:
- The Scala API
- Scala Tour
- Scala Resources
- ScalaTest: Writing your first test
- Maps in Scala
- Scala File I/O (Scala Cookbook Excerpt)
- Scala Exercises
The policy for late submissions on assignments is as follows. Your project grade is
the best grade over all per-submission grades (or a 0 if no submissions are made).
If a submission is made...
- ... more than 5 days before the deadline, your submission is assigned a grade of 5 bonus points + 100% of the points it earns.
- ... up to than 5 days before the deadline, your submission is assigned a grade of 1 bonus point per full day + 100% of the points it earns.
- ... within 24 hours of the deadline, your submission is assigned a grade of 100% of the points it earns.
- ... up to 24 hours after the deadline, your submission is assigned a grade of 75% of the points it earns.
- ... more than 24 hours after the deadline, but within 48 hours of the deadline, your submission is assigned a grade of 50% of the points it earns.
- ... more than 48 hours after the deadline, it will not be accepted.
You will have the ability to use three grace days throughout the semester, and at most two per assignment (since submissions are not accepted after two days). Using a grace day will negate the 25% penalty per day, but will not allow you to submit more than two days late. Please plan accordingly. You will not be able to recover a grace day if you decide to work late and your score was not sufficiently higher. Grace days are automatically applied to the first instances of late submissions, and are non-refundable. For example, if an assignment is due on a Friday and you make a submission on Saturday, you will automatically use a grace day, regardless of whether you perform better or not. Be sure to test your code before submitting, especially with late submissions in order to avoid wasting grace days.
Keep track of the time if you are working up until the deadline. Submissions become late after the set deadline. Keep in mind that submissions will close 48 hours after hte original deadline and you will not be able to submit your code after that time.
As a gentle reminder, please re-read the academic integrity policy of the course. I will continue to remind you throughout the semester and hope to avoid any incidents.
These bullets should be obvious things not to do (but commonly occur):
- Turning in your friend's code/write-up (obvious).
- Turning in solutions you found on Google with all the variable names changed (should be obvious). This is a copyright violation, in addition to an AI violation.
- Turning in solutions you found on Google with all the variable names changed and 2 lines added (should be obvious). This is also a copyright violation.
- Use of Github Autopilot (should be obvious). This is still in murky legal water, and may be a copyright violation, in addition to being an AI violation.
- Paying someone to do your work. You may as well not submit the work, since you will fail the exams and the course.
- Posting to forums asking someone to solve assignment problems (even if you do not receive the solution)
- Accessing solutions to assignment problems.
Note: Aggregating every { stack overflow answer, result from google, other source } because you "understand it" will likely result in full credit on assignments (if you are not caught), and then failure on every exam. Exams don't test if you know how to use Google, but rather test your understanding (i.e., do you understand the problem and material well enough to arrive at a solution on your own). Also, other students are likely doing the same thing, and then you will be wondering why 10 people that you don't know have your exact solution.
There is a grey area when it comes to discussing the problems with your peers, and I do encourage you to work with one another to discuss course concepts related to an assignment. That is the best way to learn and to overcome obstacles. At the same time, you need to be sure you do not overstep and not plagiarize. Discussions pointing to relevant course materials are OK. For example, the following is acceptable advice:
It would be helpful to review the usage of the stack in the recitation slides from week XX.
When working with your peers, we ask that you include attribution; In the header comment of the Main function of your submission, please list all peers who you have discussed the project with.
Explaining every step in detail and/or giving pseudocode that solves the problem is not ok. For example, the following is not acceptable advice:
I copied the algorithm from the week XX notes into my code at the start of the function, created a function that went through the given data and put it into a list, called that function, and then sorted the results.
The first example is OK. The second example, however, is a summary of your code and is not acceptable. Remember that you should never show any of your code to other students prior to any deadlines. Regardless of where you are working, you must always follow this rule: Never come away from discussions with your peers with any written work, either typed or photographed, and especially do not share or allow viewing of your written code.
If you have concerns that you may have overstepped or worked too closely with someone, please address this with me prior to deadlines for the assignment. Even if you have submitted an assignment that may have violated the course academic integrity policy, if you approach me prior to detection you will not face academic integrity proceedings. We will address options at that point.
With all of this said, please feel free to use any { files | examples | tutorials } that course staff provides, directly in your code. Feel free to directly use any materials from lecture or recitations. You will never be penalized for doing so, but must always provide attribution/citation for where you retrieved code from. Just remember, if you are citing an algorithm that is not provided by us, then you are probably overstepping.
More explicitly, you may use any of the following resources (with proper citation/annotation in your code:
- Any example files posted on the course webpage or Piazza (from lecture or recitation)
- Any code that the instructor provides
- Any code that the TAs provide
- Any code from the Tour of Scala
- Any code from Scala Collections
- Any code from Scala API
- Additional references may be provided as the semester progresses, but only those provided publicly by course staff are allowed for use. These will be listed on Piazza under Resources
Omitting citation/attribution will result in an AI violation (and lawsuits later in life at your job). This is true, even if you are using provided resources.
Again, if you think you are going to violate/have violated this policy, please come talk to a member of the course staff ASAP so we can figure out how to get you on track to succeed in the course. If you have a question about the validity of a resource, please ask a TA or your instructor prior to using it. If you have already used it, please discuss with the instructor to determine a workaround and, at the very least, avoid an academic integrity infraction. For example, you might send an email such as the following to the course instructor:
Clarus T Example
UBIT: ctexamp
Person #: 123456789Dear Dr. Kennedy,
I believe that I may have submitted work that is { not fully my own | not properly attributed }. I wish to retract my submission to preserve academic integrity in the course.
Signed,
Clarus T Example
This policy on assignments is here so that you learn the material and how to think for yourself. There is no cognitive benefit achieved by submitting solutions someone else has written (which likely already exist in some form).
The policy for collaboration on assignments is as follows:
- All work for this course must be original individual work.
- You must follow the limits on collaboration as defined in the AI policy (i.e., no shared code/etc...)
- You must identify any collaborators (first and last name) on every assignment. THis can be in a comment at the top of your code submissions or on the first page at the top of your written work, beside your name.
All references must be cited using a comment containing a direct link to the resource, as well as a brief description of what was used. For example, if you reference the textbook, a page number and description is sufficient. If you copy example code from the Scala Language API, then include the link to the class page within the API as well as where the example code resides.
Scala is based on the Java Virtual Machine, and so will run on most modern operating systems. However, only the following platforms are officially supported by this course.
- Ubuntu Linux
- MacOS with Homebrew
Instructions for Ubuntu should work without change for any Debian-based Linux distribution (Ubuntu, PopOS!, Debian, Mint).
The Windows Subsystem for Linux, along with the Ubuntu package will allow windows users to follow the Ubuntu instructions.
Course staff will make every attempt to assist you if you are using a platform that is not officially supported, but may lack the expertise needed to resolve your issue.
Instructions in class assignments will require running commands from the command line.
You'll need to access the command line with a terminal (type terminal
in MacOS's
spotlight or the Ubuntu launcher). You should see a command prompt. For example:
-bash-4.2$
You can type commands at this prompt. Commands usually have the form
[command name] [argument 1] [argument 2] [argument 3] ...
Common commands include:
pwd
: Print the current working directory (usually starts as/home/[username]
)ls
: List all the files in the current working directory- Files and directories starting with a
.
(dot) are "hidden". To show hidden files and directories as well, usels -a
- Files and directories starting with a
cd [dirname]
: Move the current working directory to[dirname]
..
is a special directory name that refers to the current directory. E.g., if your current working directory is/home/zaphod
thencd .
wouldn't change the directory at all...
is a special directory name that refers to the parent directory. E.g., if your current working directory is/home/zaphod
thencd ..
would move the working directory to/home
~
(tilde) is a special directory name that refers to your home directory (typically/home/[username]
).
man [command]
: Read the manual page for[command]
.cat [filename]
: Display (concatenate to the console) the contents of[filename]
.
A package manager is like an app-store for the command line. Ubuntu uses apt
. MacOS
does not have a built-in package manager, but there are several that you can
install. This course assumes that you are using Homebrew. To
install a piece of software, type:
- Ubuntu:
apt install [name of package]
- MacOS:
brew install [name of package]
To find the name of a package, you can use:
- Ubuntu:
apt search [keywords]
- MacOS:
brew search [keywords]
You will need a text editor and a Scala compiler for this course. Popular editors include
- Emacs (
{brew/apt} install emacs
) - Vim (
{brew/apt} install vim
) - SublimeText
Instructions for the course will be given using SBT, although other build tools exist. For example, mill and bloop can be a little bit faster, but are also a little less well-documented.
Install SBT with {brew/apt} install sbt
or the instructions here
At your option, you may find it more convenient to use IDE (an all-in-one system that includes both an editor and a compiler). A popular IDE for scala development that course staff are familiar with is IntelliJ. Installers for Ubuntu are available via Flatpak.
You will need to install the Scala plugin (File → Settings → Plugins → Scala).
GIT uses Public/Private keys to authenticate you to the server. Generate a public/private keypair (if you don't already have one) as follows:
- Open a terminal.
- Type
ssh-keygen
- Follow the default prompts
- Your public key is in
~/.ssh/id_rsa.pub
(remember that~
is your home directory and that directory names starting with a.
are hidden)- Your private key is in
~/.ssh/id_rsa
. Keep this file secret and safe!
- Your private key is in
Next, you'll need to provide your public key to the server.
- Go to https://microbase.cse.buffalo.edu/
- Enter your UB email address
- Click the link in your email. The token will be valid for 5 hours.
- In the Key Management section of the page, enter a name for your computer in the appropriate box
- Paste your public key (
~/.ssh/id_rsa.pub
) into the appropriate box. DO NOT UPLOAD YOUR PRIVATE KEY
- MacOS Shortcut:
cat ~/.ssh/id_rsa.pub | pbcopy
copies your public key so you can paste it with Cmd-V. - Ubuntu Shortcut:
cat ~/.ssh/id_rsa.pub | xclip -i -selection clip
copies your public key so you can paste it with Ctrl-V.
- Wait about 5 minutes. You should be able to access your repository at
microbase@odin.cse.buffalo.edu:{YOUR UBIT}.git
Initialize your repository as follows. Replacing {YOUR UBIT}
with your UBIT, run
the following from the working directory where you want your code to live.
git clone https://gitlab.odin.cse.buffalo.edu/cse-250/project-0.git
cd project-0
git remote rename origin instructor_code
git remote add origin microbase@odin.cse.buffalo.edu:{YOUR UBIT}.git
git push -u origin main
If you are using IntelliJ, you may be asked if you want to import an SBT project when you first open the directory. Say yes.
This template project contains both SBT and IntelliJ project definitions.
Answer the following questions by submitting your answers to Microbase via the "Programming Project 0" assignment. Make sure your submission is committed and pushed into your microbase GIT repository. Log into Microbase and submit your project.
Expect this project to take 8-10 hours of setting up your environment, reading through documentation, and planning, coding, and testing your solution.
(30 points)
We will be making use of a public dataset released by the NYS energy department (NYSERDA) of solar energy installation sites. You can download the dataset at the NYS Open Data Portal by clicking on Export → CSV.
Your task with this dataset will be to sanitize and summarize the data file. There are a number of columns that are not of interest to us, so we will create an updated data file without these columns, while also obtaining some summary statistics.
Note: Although the specific tasks you will perform in this assignment are simplified to make them viable in the time allotted for the project, they are representative of common data processing tasks used for data exploration, visualization, and transformation, as well as for machine learning.
Problem 1 (15 points): In the object cse250.pa0.DataProcessor
define the
Scala function:
splitArrayToRowArray(rowData: Array[String]): Array[String]
with the following behavior:
- Assume that
rowData
is the result from taking some line from the Solar Installations dataset and invokingsplit(',')
. - Given
rowData
, place the data into anArray
corresponding to the columns that would result from opening the original dataset file with a spreadsheet application.
Note that every row processed should produce a return result that contains the same
number of column entries as the header row for the document. This means that each
row, even if there are empty cells, should return an Array
with 31 columns (even
if some are empty strings). Hint: review the documentation for the split
method for the cases where there are empty entries in a row. Hint: Be mindful of
rows that contain cells with commas (see the CSV representation rules below).
A good way to test this functionality is to ensure that the first row of the dataset,
which contains the header, should return a copy of the row. The second row of the
dataset, which contains successive blank entries, should still return a row with 51
entries, but should have two empty values for the ELECTRIC_UTILITY
and
PURCHASE_TYPE
fields, respectively. Feel free to add the tests provided in this
handout.
Problem 2 (5 points): In the object cse250.pa0.DataProcessor
define the
Scala function:
rowArrayToSolarInstallation(rowArray: Array[String]): SolarInstallation
with the following behavior:
- Assume that the input
rowArray
is anArray
containing 51 entries, corresponding to a row that was correctly processed throughsplitArrayToRowArray
. - Return the
SolarInstallation
object that corresponds to the data stored within the row.
Note that SolarInstallation
is only meant to hold a limited number of headers from
the dataset. The headers that are required to be present are stored in the Seq
named SolarInstallation.REQUIRED_HEADERS
. A full list of all headers is stored
in the Seq
named SolarInstallation.HEADERS
.
The required headers correspond to the following headers/columns (where column number 0 corresponds to the first/left-most column):
Column | Label |
---|---|
0 | REPORTING_PERIOD |
2 | PROJECT_NUMBER |
4 | CITY |
5 | COUNTY |
6 | STATE |
7 | ZIP_CODE |
8 | SECTOR |
12 | PURCHASE_TYPE |
13 | DATE_APPLICATION_RECEIVED |
14 | DATE_COMPLETED |
15 | PROJECT_STATUS |
16 | CONTRACTOR |
17 | PRIMARY_INVERTER_MANUFACTURER |
18 | PRIMARY_INVERTER_MODEL_NUMBER |
19 | TOTAL_INVERTER_QUANTITY |
20 | PRIMARY_PV_MODULE_MANUFACTURER |
21 | PV_MODULE_MODEL_NUMBER |
22 | TOTAL_PV_MODULE_QUANTITY |
23 | PROJECT_COST |
24 | INCENTIVE |
25 | TOTAL_NAMEPLATE_KW_DC |
26 | EXPECTED_KWH_ANNUAL_PRODUCTION |
28 | AFFORDABLE_SOLAR |
29 | COMMUNITY_DISTRIBUTED_GENERATION |
30 | GREEN_JOBS_GREEN_NEW_YORK_PARTICIPANT |
When you are finished, a SolarInstallation
should contain exactly 25 entries (one
piece of data associated with each header). This will cause the resulting updated
output file after running the given code to contain exactly 25 columns in each
row. See more on the SolarInstallation
Objects section below.
Problem 3 (5 points) In the object cse250.pa0.DataProcessor
define the
Scala function:
computeUniqueInverterManufacturers(dataset: Array[SolarInstallation]): Int
that determines the number of unique Inverter Manufacturers (corresponding to the
PRIMARY_INVERTER_MANUFACTURER
column). You should ignore any empty entry from
your count, as well as the column header.
Problem 5 (5 points) In the object cse250.pa0.DataProcessor
define the
Scala function:
computeTotalExpectedKWHAnnualProduction(dataset: Array[SolarInstallation]): Float
that determines the sum of the total KWH of annual expected energy production
(corresponding to the EXPECTED_KWH_ANNUAL_PRODUCTION
column). Note that your
answer produced should make sense, so you should assume that the energy produced
should be positive. If no valid data is found, you should return 0.
To represent a single data record, you must use the structure
cse250.objects.SolarInstallation
provided in the code skeleton.
/**
* One specific solar installation site.
*/
class SolarInstallation
{
/**
* Key-value pairs representing data about the solar site. See [[SolarInstallation.HEADERS]] for a list of
* allowable keys, and [[SolarInstallation.REQUIRED_HEADERS]] for a list of mandatory keys.
*/
val fields: mutable.Map[String, String] = new mutable.HashMap[String, String]
Note that the file containing SolarInstallation
will be overwritten when your
code is graded, so any changes you make within will be reverted.
The information stored within a SolarInstallation
should be stored in the fields
attribute, which stores (key → value) pairs, where the header for the column
(value in the respective column in the first row) is the key and the value
is the data found within the row in the corresponding column. For example, the
first installation (second row of the file) should be loaded as
Key | Value |
---|---|
REPORTING_PERIOD |
07/31/2021 |
PROJECT_NUMBER |
0000000276 |
CITY |
Ithaca |
... | |
GREEN_JOBS_GREEN_NEW_YORK_PARTICIPANT |
No |
Note that the value for ELECTRIC_UTILITY
is simply an empty string, and there
should be an entry for every header.
The formatting for the data file is CSV (comma-separated values). CSV files are a way to represent columns of data by separating entries within a row by a comma. Each line represents a separate row of cells. Two special cases arise that you must handle:
- If a cell contains a comma (
,
) within, the cell contents are enclosed in double quotes ("
) at the start and end. For example:Comma, Cell
would be stored as"Comma, Cell"
. - If a cell contains a double quote (
"
) within, the cell contents are enclosed in double quotes ("
) at the start and end, and each double quote in the cell data is duplicated. For example,The "Best" Around
would be stored as"The ""Best"" Around"
It is also possible that a cell contains both. Note that if a data file contained the line
First Cell,Second Cell,"The ""Best"" Around","Comma, Cell","""Object-Orientation, Abstraction, and Data Structures Using Scala"""
It would correspond to the following cells (e.g., if opened in Excel):
1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|
First Cell | Second Cell | The "Best" Around | Comma, Cell | "Object-Orientation, Abstraction, and Data Structures Using Scala" |
Set up your microbase repository as detailed above and set up your programming environment
- There are three files in the
src
directory. Open them with your code editor of choice. - Launch sbt by navigating to the root of your project directory in a terminal window and typing
sbt
. - Set up continuous compilation by typing
~compile
at the sbt prompt.
- Open IntelliJ
- Select Open (possibly File → Open).
- Navigate the the directory you checked out above. Click OK
- Confirm the request to import an SBT project if prompted.
- Open
Main.scala
(src → main → scala → cse250 → pa0). Test that the code builds properly by right-clicking Main →Run "Main". If it ran correctly, you should see one of the following two errorsException in thread "main" scala.NotImplementedError: an implementation is missing
Exception in thread "main" java.io.FileNotFoundException: data/Solar_Electric_Programs_Reported_by_NYSERDA__Beginning_2000.csv (No such file or directory)
- Download the Solar Installations dataset and move it to the
project-0/data
directory.- The data file can then be opened by the filename
data/Solar_Electric_Programs_Reported_by_NYSERDA__Beginning_2000.csv
- I recommend you make a smaller test file of entries to work on. To do this, make a copy of the solar installations file and then remove all of the lines after the first 10 or 100, etc..., and then save the file.
- It is not recommended to make modifications to the file in Excel as there may be unintended formatting side-effects upon saving.
- If viewing the
.csv
file in IntelliJ, I recommend not installing the plugin so you can continue viewing it as text, instead of the view that would be provided by software like Excel. This is beneficial so you can see how the data you are manipulating looks.
- The data file can then be opened by the filename
- Update the copyright statements in the necessary files with your name and UBIT.
- Begin working on the problems requested.
- Note that when working on translating the rows from the csv data file, the last column of the data (when non-empty) contains a comma and should be treated as a single entry. There may be other data entries that contain commas, as well, so be sure you think about how to handle this (look at the data set in a text editor/IntelliJ to see what the format is).
- To check that this works, try adding print statements (
println(text)
), or by pausing the program in IntelliJ's debugger with breakpoints to check that the data is being read as you expected. - It is suggested to test your code via the file
DataProcessorTests
, although aMain
class is also provided.- To run the
Main
class in SBT, typerun
at the SBT prompt, or typesbt run
from the command line - To run the
Main
class in IntelliJ, right-click theMain.scala
file and select Run Main - To run tests in SBT type
test
at the SBT prompt, or typesbt test
from the command line - To run tests in IntelliJ, right click the folder
test
and select Run 'ScalaTests' in 'test'...
- To run the
- You are welcome to add more testing functions at your discretion, both in the
Main
object and inDataProcessorTests
. - If code is not running properly, make sure your sources and tests root are set properly. To do this, right click
src/main
, go to Mark Directory as, and select Sources Root. Similarly, right clicksrc/test
, go to Mark Directory as, and select Test Sources Root. Themain
folder should appear blue and thetest
folder should appear green if this is set up correctly.
- You may choose the collection classes you wish to use in your function implementation to solve these problems.
For Project 0 you will be allowed 5 submissions, without penalty. Starting from the 6th submission, you will receive a 2 points per-submission deduction from your score on the assignment.
Your score is the best score of any of your submissions. If you receive a score and resubmit, the higher score will be used. There is a maximum of 10 submissions.
- Fall 2021 - Oliver Kennedy (okennedy@buffalo.edu)
- Spring 2021 - Andrew Hughes (ahughes6@buffalo.edu)