CHAPTER 1
Introduction

1.1 Overview

In this century we are surrounded by a tremendous quantity of information and data; we live in the information age. Yet we still do not know how best to organize this data, because its volume appears unbounded. With access to vast volumes of data, decision makers frequently draw conclusions from data repositories that may contain data quality problems, for a variety of reasons. In decision-making, data quality is a serious concern. Data quality issues arise from the nature of the information supply chain [1]: the consumer of a data product may be several supply-chain steps removed from the people or groups who gathered the original datasets on which the data product is based.

Figure 1-1. The method of Knowledge Discovery in Databases

These consumers use data products to make decisions, often with financial and time budgeting implications. The separation of the data consumer from the data producer creates a situation where the consumer has little or no idea about the level of quality of the data [2], leading to the potential for poor decision-making and poorly allocated time and financial resources.
Figure 1.1 Data Mining Definition

1.2 Motivation

Association rule mining is a fundamental topic in data mining. It discovers frequent patterns, associations, correlations, or underlying structures among sets of items or objects in transaction databases, relational databases, and other information repositories. The amount of data is increasing significantly as day-to-day activities generate more of it. Mining association rules from the huge amounts of data held in databases is therefore of interest to many industries, because it can support business decision-making processes such as cross-marketing, basket data analysis, and promotion assortment.

Techniques for discovering association rules have conventionally focused on identifying relationships between items that reveal some feature of human behavior, usually purchasing behavior, in order to determine which items customers buy together. All rules of this type describe a particular local pattern. A set of association rules can be easily interpreted and communicated.

The key to understanding data mining technology is the ability to distinguish between data mining operations, applications, and techniques [2], as shown in Fig 1.2. Many studies have been done in the area of association rule mining.
Since association rule mining was first introduced, many studies have been conducted to address various conceptual, implementation, and application issues relating to the association rule mining task.

1.3 Association Rule Mining

As noted in Section 1.2, techniques for discovering association rules have conventionally focused on relationships between items that customers buy together. Each rule of this type describes a particular local pattern, and a set of such rules can be easily interpreted and communicated.
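The two measures that characterise an association rule X ⇒ Y, support and confidence, can be illustrated with a minimal sketch. The toy transaction database and the function names `support` and `confidence` are hypothetical and chosen only for illustration; support is estimated as the fraction of transactions containing both X and Y, and confidence as the fraction of transactions containing X that also contain Y.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions: List[FrozenSet[str]],
               x: Iterable[str], y: Iterable[str]) -> float:
    """Estimate of P(Y in t | X in t) = support(X and Y) / support(X)."""
    support_x = support(transactions, x)
    if support_x == 0:
        return 0.0  # rule is undefined when X never occurs; 0.0 is an assumption
    return support(transactions, set(x) | set(y)) / support_x


# Hypothetical toy database D of four market-basket transactions.
D = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
)]

s = support(D, {"bread", "milk"})        # 2 of 4 transactions -> 0.5
c = confidence(D, {"bread"}, {"milk"})   # 0.5 / 0.75 -> about 0.667
print(f"support={s:.3f} confidence={c:.3f}")
```

With a minimum support of 0.4 and a minimum confidence of 0.6, for example, the rule bread ⇒ milk would count as strong in this toy database.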
The association rule X ⇒ Y has support s in a transaction database D if the probability that a transaction in D contains both X and Y is s. Its confidence is c if the probability that a transaction in D containing X also contains Y is c. The task of mining association rules is to find all the association rules whose support is larger than a minimum support threshold and whose confidence is larger than a minimum confidence threshold [1]. These rules are called strong association rules.

1.4 HADOOP

Hadoop is an open source framework from Apache that is used to store, process, and analyze very large volumes of data. Hadoop runs applications using the MapReduce algorithm, in which the data is processed in parallel across the nodes of a cluster.
In short, Hadoop is used to develop applications that can perform complete statistical analysis on huge amounts of data.

Hadoop Architecture

At its core, Hadoop has two major layers, namely:
· Processing/Computation layer (MapReduce)
· Storage layer (Hadoop Distributed File System)

Fig.4 : Hadoop Architecture

1.4.1 MapReduce

To take advantage of Hadoop's parallel processing, a query must be expressed in MapReduce form. MapReduce is a paradigm with two phases, the mapper phase and the reducer phase.
In the mapper, the input is given in the form of key-value pairs. The output of the mapper is fed to the reducer as input. The reducer runs only after the mapper has finished. The reducer also takes its input in key-value format, and the output of the reducer is the final output.

Figure 1.5 : Map Reduce flow diagram

1.4.2 Steps in Map Reduce

Map takes data in the form of key-value pairs and returns a list of key-value pairs. The sort-and-shuffle step then acts on this list of key-value pairs, grouping the values by key before they are passed to the reducer.
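The map, sort-and-shuffle, and reduce steps described above can be sketched for the Word-Count problem as follows. This is a single-process simulation in plain Python, not Hadoop code: the function names `map_phase`, `shuffle_and_sort`, `reduce_phase`, and `word_count` are hypothetical, and a real Hadoop job would distribute these steps across the cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key and order the keys, as done between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three steps in sequence; the reducer starts only after mapping ends."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_and_sort(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical two-line input.
counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)  # {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

Keeping the three steps as separate functions mirrors the flow in which the mapper output is the reducer input, and the reducer output is the final result.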
Efficient algorithms for mining frequent itemsets are necessary for mining association rules as well as for many other data mining tasks. The most important challenge in frequent pattern mining is the large number of resulting patterns. As the minimum threshold becomes lower, an exponentially large number of itemsets is generated. Pruning unimportant patterns efficiently during the mining process therefore becomes one of the most important topics in frequent pattern mining. The main aim is thus to optimize the process of finding patterns so that it is efficient and scalable and can identify the important patterns, which can be used in different ways.

1.6 Aim & Objectives

The main objective of this research work is to improve the classical version of the Apriori algorithm, based on a top-down approach, by using association rules with Hadoop MapReduce programming, with hands-on experience gained by developing a Hadoop-based Word-Count application.
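For reference, the classical Apriori algorithm named in this objective, together with the pruning of candidate itemsets discussed earlier in the chapter, can be sketched as below. This is a minimal, single-machine illustration of the standard bottom-up, level-wise algorithm, not the improved top-down version proposed in this work; the function name `apriori`, the toy data, and the use of "greater than or equal to" for the support test are assumptions made for the example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set


def apriori(transactions: Iterable[Iterable[str]],
            min_support: float) -> Dict[FrozenSet[str], float]:
    """Return every frequent itemset with its support (classical Apriori)."""
    db: List[FrozenSet[str]] = [frozenset(t) for t in transactions]
    n = len(db)

    def keep_frequent(candidates: Set[FrozenSet[str]]) -> Dict[FrozenSet[str], float]:
        """Scan the database once and keep candidates meeting the threshold."""
        result = {}
        for cand in candidates:
            sup = sum(1 for t in db if cand <= t) / n
            if sup >= min_support:  # assumed convention; the text says "larger than"
                result[cand] = sup
        return result

    # Level 1: frequent single items.
    level = keep_frequent({frozenset([item]) for t in db for item in t})
    all_frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop a candidate if any (k-1)-subset is not frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = keep_frequent(candidates)
        all_frequent.update(level)
        k += 1
    return all_frequent


# Hypothetical toy database of four transactions.
D = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
frequent_itemsets = apriori(D, min_support=0.5)
for itemset, sup in sorted(frequent_itemsets.items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

In this toy run the candidate {bread, milk, butter} is discarded by the prune step without a database scan, because its subset {milk, butter} is not frequent; this is the kind of unnecessary pattern generation that pruning is meant to avoid.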
The Hadoop MapReduce Word-Count example is a standard example. In the improved algorithm, the rules avoid the generation of unnecessary patterns. This improved Apriori algorithm can be used in various types of mining.

The problems and limitations described in the preceding section of this chapter are proposed to be solved as follows:
1. To install Hadoop in a Linux environment on a single node.
2. To implement MapReduce with the Word-Count problem.
3. To identify and implement various available algorithms for mining frequent itemsets on various datasets.
4. To propose a new approach for mining frequent itemsets in retail transactional databases, i.e. for the above problem.
5. To validate the new scheme on a dataset.

1.7 Thesis Outline

The thesis is organized in the following way:

Chapter-1 Introduction: This chapter deals with all the introductory requirements for understanding the domain area.
It gives the details necessary to understand the work and to measure its outcomes. It provides the motivation, background, problem understanding, and a view of the proposed solution. This is the first and essential part of the report, and contains brief details about association rules with Hadoop.

Chapter-2 Literature Review: This chapter presents a survey of the technologies available in the domain. A wide variety of existing mechanisms, algorithms, and architectures are studied to identify which issues have been resolved and which remain in the area of association rules with Hadoop.

Chapter-3 Problem Identification: In this chapter we identify the problems in the existing system. The chapter then gives a brief categorization of the various approaches suggested over the last few years for association rules with Hadoop using data mining approaches.

Chapter-4 Proposed Work: After studying the different existing mechanisms, this chapter identifies the system preliminaries. It gives a clear understanding of the algorithm and its steps. It will help the solution provide a better resolution of the current situations of security. This chapter also gives the implementation plan and testing strategy for the above security problems by suggesting an architectural solution. The implementation of our proposed system is carried out in this chapter, which also describes the platform on which the implementation runs and the theme and approach followed.

Chapter-5 Result Analysis: Developing a solution is one way of proving an approach, but proving its results is a complicated task, because it requires measuring each step of the solution and comparing it with the existing mechanisms.
Whether or not the proposed system that we have implemented works properly is discussed in this chapter. The results are verified on the basis of the analysis.

Chapter-6 Conclusion and Future Work: This chapter gives concluding remarks on the dissertation, with a final analysis and comparisons along with some future directions of the work. The future scope and a short summary are discussed. It gives an idea of how the work performed in this report can be extended in future.

1.8 Summary

This chapter deals with all the introductory requirements for understanding the domain area. It gives the details necessary to understand the work and to measure its outcomes. It provides the motivation, background, problem understanding, and a view of the proposed solution.