- The process of extracting interesting (non-trivial, implicit, previously unknown and potentially useful) knowledge or patterns from data in large databases
- Discover knowledge that characterizes general properties of data
- Discover patterns in previous and current data in order to make predictions about future data
Knowledge discovery in databases (KDD)
- Determining business objectives: Gathering background information, compiling the business background, and defining business objectives
- Assessing the situation: Requirements, assumptions, and constraints. What sort of data are available for analysis? Do you have access to them?
- Determining data science goals: Data science goals, Data science success criteria
- Collect initial data: Existing data, purchased data, and additional data
- Describe data: Amount of data and value types
- Verify data quality: Missing data and data errors
- Select the right data: Select training examples and features. Is a given attribute relevant to your data mining goals?
- Clean data: Fill in missing data, correct data errors
- Format data: Put data in a format for training the model
- Select modeling techniques: Select data types available for analysis, select an algorithm or a model, define modeling goals, state specific modeling requirements
- Set up hyperparameters and build the model: Train the model, describe the result
- Assess the model: Overfitting and underfitting
- Evaluate the results: Are results presented clearly? Are there any novel findings? Can models and findings be applied to business goals? How well do the models and findings answer business goals? What additional questions have the modeling results raised?
- Review the process: Did the stage contribute to the value of the results? What went wrong and how can it be fixed? Are there alternative decisions which could have been made?
- Determine the next steps
- Planning for deployment: Summarize models and findings. For each model, create a deployment plan. Identify any deployment problems and plan for contingencies
- Plan monitoring and maintenance: Identify models and findings which require support. How can the accuracy and validity be evaluated? How will you determine that a model has expired? What should be done with expired models?
- Conduct a final project review
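The "Describe data", "Verify data quality", and "Clean data" steps above can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure: the records, the column names, and the helpers `describe` and `clean` are all hypothetical, and filling numeric gaps with the mean is only one of several possible choices.

```python
from statistics import mean

# Hypothetical customer records; None marks a missing value.
records = [
    {"age": 34, "plan": "basic", "monthly_usage": 120.0},
    {"age": None, "plan": "premium", "monthly_usage": 310.5},
    {"age": 51, "plan": "basic", "monthly_usage": None},
    {"age": 27, "plan": None, "monthly_usage": 95.0},
]


def describe(rows):
    """Describe data / verify quality: row count and missing values per attribute."""
    columns = rows[0].keys()
    return {
        col: {"rows": len(rows), "missing": sum(r[col] is None for r in rows)}
        for col in columns
    }


def clean(rows, numeric_cols, categorical_cols):
    """Clean data: fill numeric gaps with the column mean (an assumed policy)
    and categorical gaps with the placeholder 'unknown'."""
    fill = {
        col: mean(r[col] for r in rows if r[col] is not None)
        for col in numeric_cols
    }
    cleaned = []
    for r in rows:
        row = dict(r)  # copy so the original records stay untouched
        for col in numeric_cols:
            if row[col] is None:
                row[col] = fill[col]
        for col in categorical_cols:
            if row[col] is None:
                row[col] = "unknown"
        cleaned.append(row)
    return cleaned


summary = describe(records)
cleaned = clean(records, ["age", "monthly_usage"], ["plan"])
print(summary)
print(cleaned)
```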
A simplified representation of reality created to serve a purpose. Examples include maps, prototypes, the Black-Scholes model, etc.
An estimate of an unknown value
- A formula for estimating the unknown value of interest: the target
- The formula can be a mathematical expression or a logical statement
- Represents a fact or a data point
- Described by a set of attributes (fields, columns, variables, or features)
The input data to create the model
- Numeric: Anything that has some order, such as numbers or dates
- Categorical: Values that have no inherent order, such as text
- Classification and class probability estimation
- Regression
- Similarity Matching
- Clustering
- Co-occurrence grouping and association rules
Decision tree
- It finds a function from data which relates a real-valued variable with one or more other variables
- For example, predict daily water demand
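As an illustration of finding a function that relates a real-valued variable to another variable, the sketch below fits a straight line by ordinary least squares. It assumes, purely for the example, that daily water demand depends on temperature. The data are made up (and deliberately exactly linear), and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up observations: temperature (deg C) and daily water demand (arbitrary units).
temps = [15.0, 20.0, 25.0, 30.0, 35.0]
demand = [200.0, 230.0, 260.0, 290.0, 320.0]

slope, intercept = fit_line(temps, demand)
predicted = slope * 28.0 + intercept  # estimated demand on a 28-degree day
print(slope, intercept, predicted)
```

Real demand data would be noisy and would likely depend on more than one variable; the single-feature line is only the simplest instance of the idea.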
- To group data to form classes (clusters)
- Class label is unknown in the training data
- Principle: maximizing the intra-class similarity and minimizing the inter-class similarity
- Applications include market/customer segmentation
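The clustering principle above can be sketched with k-means, one common clustering algorithm (the notes do not name a specific one, so its use here is an assumption). The customer points, the choice of two clusters, and the fixed starting centroids are all hypothetical and chosen to keep the example deterministic.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else centroids[i]  # keep an empty cluster's centroid where it was
            for i, cluster in enumerate(clusters)
        ]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical customers described by (visits per month, average basket size).
customers = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# No class labels are supplied; the groups emerge from similarity alone.
centroids, labels = kmeans(customers, centroids=[customers[0], customers[3]])
print(centroids, labels)
```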
- A supervised technique is given a specific purpose for the grouping—predicting the target.
- Supervised tasks require different techniques than unsupervised tasks, and their results are often more useful
- Classification and regression
- They are distinguished by the type of target
- Binary
- Categorical target
This is also a classification problem, with a three-valued target.
This is a classification problem because it has a binary target (the customer either purchases or does not).
This is a regression problem because it has a numeric target. The target variable is the amount of usage (actual or predicted) per customer
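The rule used in the answers above, that the type of target decides between classification and regression, can be written as a small helper. This is only a sketch: `task_type` and the sample targets are hypothetical, and the helper treats every numeric target as regression, so integer-coded class labels such as 0/1 would need a separate check.

```python
def task_type(target_values):
    """Guess the supervised task from the kind of target values (illustrative only)."""
    distinct = set(target_values)
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    if is_numeric:
        return "regression"
    if len(distinct) == 2:
        return "binary classification"
    return "multi-class classification"


# Hypothetical targets mirroring the three cases above.
purchases = ["yes", "no", "no", "yes"]                         # binary target
packages = ["package_a", "package_b", "none", "package_a"]    # three-valued target
usage_minutes = [120.5, 87.0, 301.2]                           # numeric target

print(task_type(purchases))
print(task_type(packages))
print(task_type(usage_minutes))
```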
- Clustering and classification of customers for targeted marketing
- Identify customer groups or associate a new customer to an appropriate customer group
- Discover customer shopping patterns and trends
- Re-arrange store layout
- Purchase recommendation and cross-reference of items
- Association analysis: identification of co-occurring gene sequences
- Most diseases are not triggered by a single gene but by a combination of genes acting together
- Association analysis may help determine the kinds of genes that are likely to co-occur together in target samples
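Co-occurrence analysis of this kind is usually summarized with support and confidence. The sketch below computes both for a handful of made-up shopping baskets; the same code would apply unchanged to sets of genes observed together in a sample. The data and the helper names `pair_support` and `confidence` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Made-up transactions; each set is one basket (or one sample's co-occurring items).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def pair_support(transactions):
    """Support of each item pair: the fraction of transactions containing both items."""
    counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items()}


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent:
    among transactions with the antecedent, the fraction that also has the consequent."""
    with_antecedent = [t for t in transactions if antecedent in t]
    return sum(consequent in t for t in with_antecedent) / len(with_antecedent)


support = pair_support(baskets)
print(support[("bread", "milk")])
print(confidence(baskets, "bread", "milk"))
```

A full association-rule miner would also handle larger itemsets and apply minimum support and confidence thresholds; this sketch covers pairs only.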
- It is the sum of the dimensions of the features
- It is the sum of the number of numeric features and the number of values of categorical features
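The counting rule above matches what happens under one-hot encoding, where each numeric feature contributes one dimension and each categorical feature contributes one dimension per distinct value. The sketch below assumes that encoding; the rows, column names, and helper names are hypothetical.

```python
def feature_dimension(rows, numeric_cols, categorical_cols):
    """Number of numeric features plus the number of distinct values of each categorical feature."""
    dim = len(numeric_cols)
    for col in categorical_cols:
        dim += len({r[col] for r in rows})
    return dim


def one_hot_encode(rows, numeric_cols, categorical_cols):
    """Encode each row as a flat vector: numeric values first, then one-hot blocks."""
    levels = {col: sorted({r[col] for r in rows}) for col in categorical_cols}
    encoded = []
    for r in rows:
        vec = [float(r[col]) for col in numeric_cols]
        for col in categorical_cols:
            vec.extend(1.0 if r[col] == level else 0.0 for level in levels[col])
        encoded.append(vec)
    return encoded


# Hypothetical data: 2 numeric features, "color" with 3 values, "plan" with 2 values.
rows = [
    {"age": 34, "income": 52000, "color": "red", "plan": "basic"},
    {"age": 45, "income": 61000, "color": "green", "plan": "premium"},
    {"age": 29, "income": 48000, "color": "blue", "plan": "basic"},
]

dim = feature_dimension(rows, ["age", "income"], ["color", "plan"])  # 2 + 3 + 2
vectors = one_hot_encode(rows, ["age", "income"], ["color", "plan"])
print(dim, vectors[0])
```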
- It is widely used for market basket or transactional data analysis
- Classification
- Regression
- Causal modeling
- Similarity matching
- Link prediction
- Data reduction
- Similarity matching
- Link prediction
- Data reduction
- Clustering
- Co-occurrence grouping
- Profiling
What are some classical pitfalls in data mining setup?