\n",
" (0, 20579)\t1\n",
" (0, 19220)\t1\n",
" (0, 29697)\t1\n",
" (0, 6320)\t1\n",
" (0, 25926)\t1\n",
" (0, 34222)\t1\n",
" (0, 31398)\t1\n",
" (0, 17883)\t1\n",
" (0, 16809)\t1\n",
" (0, 34425)\t1\n",
" (0, 23460)\t1\n",
" (0, 21787)\t1\n",
" (0, 11068)\t1\n",
" (0, 29494)\t1\n",
" (0, 29505)\t1\n",
" (0, 18436)\t1\n",
" (0, 24025)\t1\n",
" (0, 25336)\t1\n",
" (0, 12577)\t1\n",
" (0, 27517)\t1\n",
" (0, 30641)\t1\n",
" (0, 5980)\t1\n",
" (0, 29104)\t1\n",
" (0, 27521)\t1\n",
" (0, 11100)\t1\n",
" :\t:\n",
" (0, 17310)\t1\n",
" (0, 25400)\t1\n",
" (0, 23118)\t1\n",
" (0, 31686)\t6\n",
" (0, 27158)\t1\n",
" (0, 18085)\t1\n",
" (0, 12580)\t1\n",
" (0, 2100)\t1\n",
" (0, 20381)\t1\n",
" (0, 32729)\t1\n",
" (0, 23854)\t2\n",
" (0, 11079)\t1\n",
" (0, 15109)\t2\n",
" (0, 20509)\t1\n",
" (0, 23858)\t1\n",
" (0, 26624)\t1\n",
" (0, 30377)\t1\n",
" (0, 16034)\t1\n",
" (0, 19099)\t1\n",
" (0, 13317)\t6\n",
" (0, 34790)\t6\n",
" (0, 9553)\t4\n",
" (0, 21852)\t5\n",
" (0, 18962)\t3\n",
" (0, 15373)\t1\n"
]
}
],
"source": [
"vec = CountVectorizer()\n",
"features = vec.fit_transform(train.data)\n",
"print(\"Type of feature matrix:\", type(features))\n",
"print(features[0,:]) # print the features of the first sample point"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The feature matrix is stored in sparse format, that is, only the nonzero counts are stored. How many words were in the first message?"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-24T19:29:20.739668Z",
"start_time": "2020-06-24T19:29:20.735790Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of words: 177\n",
"Word 'it' appears in the first message 2 times.\n",
"\n",
"From: jgfoot@minerva.cis.yale.edu (Josh A. Goldfoot)\n",
"Subject: Re: Organized Lobbying for Cryptography\n",
"Organization: Yale University\n",
"Lines: 21\n",
"Distribution: inet\n",
"Reply-To: jgfoot@minerva.cis.yale.edu\n",
"NNTP-Posting-Host: minerva.cis.yale.edu\n",
"X-Newsreader: TIN [version 1.1 Minerva PL9]\n",
"\n",
"Shaun P. Hughes (sphughes@sfsuvax1.sfsu.edu) wrote:\n",
": In article <1r3jgbINN35i@eli.CS.YALE.EDU> jgfoot@minerva.cis.yale.edu writes:\n",
"[deletion]\n",
": >Perhaps these encryption-only types would defend the digitized porn if it\n",
": >was posted encrypted?\n",
": >\n",
": >These issues are not as seperable as you maintain.\n",
": >\n",
"\n",
": Now why would anyone \"post\" anything encrypted? Encryption is only of \n",
": use between persons who know how to decrypt the data.\n",
"\n",
": And why should I care what other people look at? \n",
"\n",
"I was responding to another person (Tarl Neustaedter) who held that the\n",
"EFF wasn't the best organization to fight for crytography rights since the\n",
"EFF also supports the right to distribute pornography over the internet,\n",
"something some Crypto people might object to. In other words, he's\n",
"implying that there are people who will protect any speech, just as long\n",
"as it is encrypted.\n",
"\n",
"\n"
]
}
],
"source": [
"print(\"Number of words:\", features[0,:].sum())\n",
"col = vec.vocabulary_[\"it\"] # Get the column of 'it' word in the feature matrix\n",
"print(f\"Word 'it' appears in the first message {features[0, col]} times.\")\n",
"print()\n",
"print(train.data[0]) # Let's print the corresponding message as well\n",
"#print(vec.get_feature_names())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Exercise 1 (blob classification)\n",
"\n",
"Write function `blob_classification` that gets a feature matrix X and a label vector y as parameters. It should then return the accuracy score of the prediction. Do the prediction using `GaussianNB`, and use the `train_test_split` function from `sklearn` to split the dataset into two parts: one for training and one for testing. Give the parameter `random_state=0` to the splitting function so that the result is deterministic. Use a training set size of 75% of the whole data.\n",
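"\n",
"A minimal sketch of one way to implement this, not a definitive solution:\n",
"\n",
"```python\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.naive_bayes import GaussianNB\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"def blob_classification(X, y):\n",
"    # 75% training / 25% test split, deterministic via random_state=0\n",
"    X_train, X_test, y_train, y_test = train_test_split(\n",
"        X, y, train_size=0.75, random_state=0)\n",
"    model = GaussianNB()\n",
"    model.fit(X_train, y_train)\n",
"    return accuracy_score(y_test, model.predict(X_test))\n",
"```\n",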
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Exercise 2 (plant classification)\n",
"\n",
"Write function `plant_classification` that does the following:\n",
"\n",
"* loads the iris dataset using sklearn (`sklearn.datasets.load_iris`)\n",
"* splits the data into training and testing parts using the `train_test_split` function so that the training set size is 80% of the whole data (also give the call the `random_state=0` argument to make the result deterministic)\n",
"* uses Gaussian naive Bayes to fit the training data\n",
"* predicts labels of the test data\n",
"* returns the accuracy score of the prediction performance (`sklearn.metrics.accuracy_score`)\n",
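"\n",
"The steps above can be sketched as follows (one possible implementation):\n",
"\n",
"```python\n",
"from sklearn.datasets import load_iris\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.naive_bayes import GaussianNB\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"def plant_classification():\n",
"    X, y = load_iris(return_X_y=True)\n",
"    # 80% training / 20% test split, deterministic via random_state=0\n",
"    X_train, X_test, y_train, y_test = train_test_split(\n",
"        X, y, train_size=0.8, random_state=0)\n",
"    model = GaussianNB().fit(X_train, y_train)\n",
"    return accuracy_score(y_test, model.predict(X_test))\n",
"```\n",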
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Exercise 3 (word classification)\n",
"\n",
"This exercise is worth up to four points!\n",
"\n",
"In this exercise we create a model that tries to classify previously unseen words as either Finnish or English.\n",
"\n",
"Part 1.\n",
"\n",
"Write function `get_features` that gets a one-dimensional np.array of words as a parameter. It should return a feature matrix of shape (n, 29), where n is the number of elements of the input array. There should be one feature for each of the letters in the following alphabet: \"abcdefghijklmnopqrstuvwxyzäö-\". The values should be the number of times the corresponding character appears in the word.\n",
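"\n",
"One possible sketch, counting each alphabet character with `str.count`:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"alphabet = 'abcdefghijklmnopqrstuvwxyzäö-'\n",
"\n",
"def get_features(words):\n",
"    # One row per word, one column per character of the alphabet\n",
"    return np.array([[word.count(c) for c in alphabet] for word in words])\n",
"```\n",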
"\n",
"Part 2.\n",
"\n",
"Write function `contains_valid_chars` that takes a string as a parameter and returns the truth value of whether all the characters in the string belong to the alphabet.\n",
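"\n",
"For example, one short way to express this check:\n",
"\n",
"```python\n",
"alphabet = 'abcdefghijklmnopqrstuvwxyzäö-'\n",
"\n",
"def contains_valid_chars(s):\n",
"    # True iff every character of s belongs to the alphabet\n",
"    return all(c in alphabet for c in s)\n",
"```\n",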
"\n",
"Part 3.\n",
"\n",
"Write function `get_features_and_labels` that returns the tuple (X, y) of the feature matrix and the target vector. Use the labels 0 and 1 for Finnish and English, respectively. Use the supplied functions `load_finnish()` and `load_english()` to get the lists of words. Filter the lists in the following ways:\n",
"\n",
"* Convert the Finnish words to lowercase, and then filter out those words that contain characters that don't belong to the alphabet.\n",
"* For the English words first filter out those words that begin with an uppercase letter to get rid of proper nouns. Then proceed as with the Finnish words.\n",
"\n",
"Use the `get_features` function you made earlier to form the feature matrix.\n",
"\n",
"Part 4.\n",
"\n",
"We have earlier seen examples where we split the data into a training part and a testing part. This way we can test whether the model can really be used to predict unseen data. However, it can happen that we had bad luck and the split produced very biased training and test sets. To counter this, we can perform the split several times and take as the final result the average over the different splits. This is called cross-validation.\n",
"\n",
"Create `word_classification` function that does the following:\n",
"\n",
"Use the function `get_features_and_labels` you made earlier to get the feature matrix and the labels. Use multinomial naive Bayes to do the classification. Get the accuracy scores using the `sklearn.model_selection.cross_val_score` function; use 5-fold cross validation. The function should return a list of five accuracy scores.\n",
"\n",
"The `cv` parameter of `cross_val_score` can be either an integer, which specifies the number of folds, or a *cross-validation generator* that generates the (train set, test set) pairs. What happens if you pass the following cross-validation generator to `cross_val_score` as a parameter: `sklearn.model_selection.KFold(n_splits=5, shuffle=True, random_state=0)`?\n",
"\n",
"Why the difference?\n",
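"\n",
"A minimal self-contained comparison of the two calls; the toy count matrix below merely stands in for the word features:\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"from sklearn.model_selection import cross_val_score, KFold\n",
"\n",
"# Toy non-negative integer features with the labels grouped by class,\n",
"# like the word data (all Finnish words first, then all English words)\n",
"rng = np.random.RandomState(0)\n",
"X = rng.randint(0, 5, size=(100, 29))\n",
"y = np.array([0] * 50 + [1] * 50)\n",
"\n",
"plain = cross_val_score(MultinomialNB(), X, y, cv=5)\n",
"shuffled = cross_val_score(MultinomialNB(), X, y,\n",
"                           cv=KFold(n_splits=5, shuffle=True, random_state=0))\n",
"```\n",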
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Exercise 4 (spam detection)\n",
"\n",
"This exercise gives two points if solved correctly!\n",
"\n",
"In the `src` folder there are two files: `ham.txt.gz` and `spam.txt.gz`. The files are preprocessed versions of the files from https://spamassassin.apache.org/old/publiccorpus/. There is one email per line. The file `ham.txt.gz` contains emails that are non-spam, and, conversely, the emails in the file `spam.txt.gz` are spam. The email headers have been removed, except for the subject line, and non-ASCII characters have been deleted.\n",
"\n",
"Write function `spam_detection` that does the following:\n",
"\n",
"* Read the lines from these files into arrays. Use the function `open` from the `gzip` module, since the files are compressed. From each file take only `fraction` of the lines from the start of the file, where `fraction` is a parameter of `spam_detection` in the range `[0.0, 1.0]`.\n",
"* Form the combined feature matrix using the `fit_transform` method of the `CountVectorizer` class. The feature matrix should first have the rows for the `ham` dataset and then the rows for the `spam` dataset. One row in the feature matrix corresponds to one email.\n",
"* Use labels 0 for ham and 1 for spam.\n",
"* Divide the feature matrix and the target labels into training and test sets using `train_test_split`. Use 75% of the data for training. Pass the `random_state` parameter from `spam_detection` to `train_test_split`.\n",
"* Train a `MultinomialNB` model, and use it to predict the labels of the test set.\n",
"\n",
"The function should return a triple consisting of\n",
"\n",
"* accuracy score of the prediction\n",
"* size of test sample\n",
"* number of misclassified sample points\n",
"\n",
"Note: the tests use the `fraction` parameter with value 0.1 to ease the load on the TMC server. If the full data were used and the solution did something non-optimal, it could use huge amounts of memory, causing the solution to fail.\n",
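"\n",
"One way the pieces might fit together; the file paths and the parameter order of `spam_detection` are assumptions, not part of the specification:\n",
"\n",
"```python\n",
"import gzip\n",
"import numpy as np\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"def spam_detection(random_state=0, fraction=1.0):\n",
"    # Assumed file locations under src/ (adjust as needed)\n",
"    with gzip.open('src/ham.txt.gz', 'rt') as f:\n",
"        ham = f.readlines()\n",
"    with gzip.open('src/spam.txt.gz', 'rt') as f:\n",
"        spam = f.readlines()\n",
"    # Keep only `fraction` of the lines from the start of each file\n",
"    ham = ham[:int(fraction * len(ham))]\n",
"    spam = spam[:int(fraction * len(spam))]\n",
"    # Ham rows first, then spam rows; labels 0 = ham, 1 = spam\n",
"    X = CountVectorizer().fit_transform(ham + spam)\n",
"    y = np.array([0] * len(ham) + [1] * len(spam))\n",
"    X_train, X_test, y_train, y_test = train_test_split(\n",
"        X, y, train_size=0.75, random_state=random_state)\n",
"    pred = MultinomialNB().fit(X_train, y_train).predict(X_test)\n",
"    return accuracy_score(y_test, pred), len(y_test), (pred != y_test).sum()\n",
"```\n",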
""
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}