Classification and Performance Evaluation

IT348/448 Intro to ML

Project 1: Classification and Performance Evaluation


You are to write a Naïve Bayes classification system.


The user of your program must be able to do each of the following (in any reasonable order and repeatedly). Training must be one menu item that then requests both required files; you must not have a separate menu item requesting the meta file. The user must be able to train on new files as often as desired.


1) Part 1: Cross-validation
   a) Use the data file named:
   b) Ask the user to provide the number of folds for cross-validation. For example, if the user inputs k, you will create a k-fold cross-validation.
   c) Perform the k-fold cross-validation and print the status of each fold. That is, print out the accuracy of each cross-validation phase; for a k-fold cross-validation, your program should print out the accuracy k times.
   d) Compute the average accuracy over the k folds and print a report of it to the screen.
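The fold-splitting and per-fold reporting described above can be sketched as follows. This is a minimal illustration, not required code; the function names (`k_fold_indices`, `cross_validate`) and the `train_fn`/`accuracy_fn` callback parameters are assumptions for the sketch, and you would plug in your own Naïve Bayes training and accuracy routines.

```python
import random

def k_fold_indices(n_examples, k, seed=0):
    """Shuffle example indices and split them into k roughly equal folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(examples, k, train_fn, accuracy_fn):
    """Run k-fold cross-validation, printing each fold's accuracy and the average."""
    folds = k_fold_indices(len(examples), k)
    accuracies = []
    for i, test_idx in enumerate(folds):
        test_set = [examples[j] for j in test_idx]
        train_set = [examples[j] for f in folds if f is not test_idx for j in f]
        model = train_fn(train_set)
        acc = accuracy_fn(model, test_set)
        print(f"Fold {i + 1}: accuracy = {acc:.4f}")
        accuracies.append(acc)
    avg = sum(accuracies) / k
    print(f"Average accuracy over {k} folds: {avg:.4f}")
    return accuracies, avg
```

Note that each example lands in exactly one test fold, so every example is tested exactly once across the k folds.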

2) Part 2: Confusion matrix
   a) Use the data files named: car.train and car.test
   b) Have the system train based on the training data (.train file). Ask the user for the name of the file with the training data.
   c) Have the system read a set of data that may or may not have classifications and provide a classification for each instance. Ask the user for input and output file names. The output must be in exactly the same format as the training data. The user must be able to classify as many different files as desired before retraining. Set up your program so that it takes data without labels but can also accept data with labels and ignore them.
   d) Have the system read a set of data (.test file) that has labels and determine its accuracy by comparing its computed labels to the actual labels. Ask the user for the name of the data file.
   e) Print a report of the accuracy to the screen. As far as the user is concerned, this must be completely independent of the previous item (though obviously there will be significant code reuse between the two).
   f) Generate and print the confusion matrix for the test data.
   g) Also make sure the user can quit the program.
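One way to build and print the confusion matrix required in item f) is to tally (actual, predicted) label pairs. The sketch below is illustrative only; the function names and the tabular layout are assumptions, and any readable row/column format with actual classes as rows and predicted classes as columns would do.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs over the test set."""
    return Counter(zip(actual, predicted))

def print_confusion_matrix(actual, predicted):
    """Print a labeled confusion matrix: rows = actual class, columns = predicted."""
    counts = confusion_matrix(actual, predicted)
    labels = sorted(set(actual) | set(predicted))
    width = max(len(str(l)) for l in labels) + 2
    print(" " * width + "".join(f"{l:>{width}}" for l in labels))
    for a in labels:
        row = "".join(f"{counts.get((a, p), 0):>{width}}" for p in labels)
        print(f"{a:>{width}}" + row)
```

The diagonal entries of the matrix are the correctly classified instances, so overall accuracy is the diagonal sum divided by the total count.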


You may wish to have an option to print computed probabilities to the screen for debugging purposes. That’s fine. However, you must not force me to deal with any sort of debugging output during grading.


The training process will have 2 inputs: a metadata file that will list a set of attributes (or features or variables) and each attribute’s possible values, ending with the classification, and a data file in which each line represents one example. Examples consist of a comma-separated list of attribute values (in the same order as the metadata file), ending with the classification. A key challenge here will be to design a data structure to hold all of the counts you’ll need to compute your needed probabilities. A second key challenge is getting the math right.
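One possible shape for the count data structure mentioned above is a pair of tables: per-class totals, plus per-attribute nested counts of value-within-class. This is a sketch under assumed input shapes (the `metadata` and `examples` parameter formats and the function name are illustrative, not prescribed by the assignment):

```python
from collections import defaultdict

def train_counts(metadata, examples):
    """
    Build the count tables Naive Bayes needs.

    metadata: list of (attribute_name, possible_values) pairs, classification last.
    examples: list of attribute-value lists, classification last.
    Returns (class_counts, value_counts) where
      class_counts[c]       = number of examples with class c
      value_counts[i][v][c] = number of examples of class c whose
                              i-th attribute equals v
    """
    class_counts = defaultdict(int)
    value_counts = [defaultdict(lambda: defaultdict(int))
                    for _ in metadata[:-1]]
    for ex in examples:
        *attrs, cls = ex
        class_counts[cls] += 1
        for i, v in enumerate(attrs):
            value_counts[i][v][cls] += 1
    return class_counts, value_counts
```

Everything the classifier needs at prediction time can be derived from these two tables, so training is a single pass over the data file.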


Smooth your data by adding one to all of your counts (so that you have no zero counts).
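The add-one (Laplace) smoothing described above can equivalently be applied when converting counts to probabilities: add 1 to every value's count, so the denominator grows by the number of possible values for that attribute. A minimal sketch (the function name and parameters are illustrative):

```python
def smoothed_prob(count, class_total, n_values):
    """
    P(attribute = v | class) with add-one (Laplace) smoothing.
    count:       times value v occurred with this class
    class_total: total examples of this class
    n_values:    number of possible values for this attribute
    """
    return (count + 1) / (class_total + n_values)
```

As a side note on getting the math right: multiplying many small probabilities can underflow, so it is common to sum their logarithms instead and pick the class with the largest log-sum.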


The program may be written in a language of your choice. Any of the available languages should work fine.


Quality of the user interface counts. I am more concerned about functionality and convenience than about looks, but it should look reasonably professional. It should not be annoying to use. For example, I must be able to do things in different orders (as long as they make sense) as indicated above and to repeat activities without having to repeat other activities. So I must be able to train once and test on several different files.


Create a README file with a description of the program, instructions for compiling the code, and instructions for using your program. A README that has little more than “Follow the program instructions” will be deemed unacceptable and cost you points. Writing good instructions for a user of your program is an important job skill. Here’s a chance to practice.