The tokenizer seems to have worked well. It separated the text file into sections and then found the frequency of each word. Searching with ctrl+f found 137 occurrences of ‘all’, but the sum of the tokenizer's counts was 182. I think this difference is caused by the way the sections are cut. The tokenizer separates the sections by chapter: “Selecting from 1 to 9” 1 “Selecting from 11 to 221” 1 “Selecting from 223 to 422”. So it cannot find words in the gaps between sections (between 9 and 11, between 221 and 223, and so on). As an improvement, the tokenizer should show the total count summed over all sections, because we had to paste the values from each section into Excel and add them up ourselves. The conjugated forms of each verb should also be added to the infinitive form automatically.

1-b
Most of the top 30 words are used frequently because they are needed in any text and carry no specific meaning, which is why search engines are programmed to ignore them. The name ‘alice’ should not be in the stopword list, as it is the main character’s name.

1-c
Zipf’s law states that when words are listed in order of their frequency of use, the frequency of each word is inversely proportional to its rank. I could see this feature in the graph, so I can say that the text follows Zipf’s law. For example, the frequency of ‘the’ is so high that it looks like a solid line. However, the rank of ‘the’ is over 500. On the other hand, the word at rank 1 is so rare that it appears as just a circle, not a line.

1-d
Heaps’ law is an empirical law that describes the number of distinct words in a document as a function of the document length. The text corresponds to Heaps’ law because the graph looks like a typical Heaps’ law plot: the curve keeps rising, but its growth slows down, because many new words appear at the beginning and fewer new words appear as the text continues.

2-a
Precision is the fraction of the selected items that are relevant. Recall is the fraction of all relevant items that are retrieved. Precision and recall can be the standard for assessing TF-based search.
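The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function name `precision_recall` and the word lists are hypothetical and are not taken from the actual search results.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved items against relevant items."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant items that were actually retrieved
    # Precision: share of the retrieved items that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of all relevant items that were retrieved.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: 5 words returned by a search, 4 words judged relevant.
retrieved_words = ["the", "and", "alice", "queen", "rabbit"]
relevant_words = ["alice", "queen", "rabbit", "hatter"]
precision, recall = precision_recall(retrieved_words, relevant_words)
print(precision, recall)
```

Here three of the five returned words are relevant, and three of the four relevant words were returned.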
First, to check precision, count the relevant words in the result of the TF-based search and compare that with the number of words returned. Then count the relevant words that were not selected to find recall.

2-b
The whole text has to be scanned, and every word that occurs has to be counted and then shown in the TFIDF matrix. Because the text is large, it takes a long time to scan it all and count each word.

2-c
The recall of TF-based search would be higher than that of TFIDF-based search, since TF-based search suggests common words. However, TFIDF-based search would have higher precision and provide a better-quality word list.

2-d
TF-based search will give you the top 5 common words, which can be found throughout the document but are not really important or meaningful. TFIDF-based search will suggest the top 5 relatively rare but significant words.

2-e
Precision and recall show the outcome of the search, so they are not something that can be improved directly. However, the tokenizer can be amended so that precision and recall reach higher percentages.
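The difference between TF-based and TFIDF-based ranking can be illustrated with a minimal Python sketch. It assumes each section of the text is treated as one document and uses the plain weighting tf × log(N / df), which may differ from the weighting used by the actual tool. The helper names (`tokenize`, `top_k_tf`, `top_k_tfidf`) and the three short sample sections are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplified tokenizer (an assumption): lowercase, keep letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def top_k_tf(docs, doc_index, k=5):
    """Top-k words of one document ranked by raw term frequency."""
    tf = Counter(tokenize(docs[doc_index]))
    # Sort by descending count, then alphabetically to break ties.
    return sorted(tf, key=lambda w: (-tf[w], w))[:k]


def top_k_tfidf(docs, doc_index, k=5):
    """Top-k words of one document ranked by tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each word once per document
    tf = Counter(tokenized[doc_index])
    scores = {w: c * math.log(n_docs / df[w]) for w, c in tf.items()}
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Made-up sections standing in for chapters of the text.
sections = [
    "the cat and the hat and the bat",
    "the queen and the king",
    "the rabbit and the hole and the rabbit",
]
print(top_k_tf(sections, 2, k=2))
print(top_k_tfidf(sections, 2, k=2))
```

In this toy example the TF ranking is led by words that occur in every section, while the TFIDF ranking gives those words a score of zero and brings up the words that are specific to the chosen section.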
