getting to know a situation through and through in order to be able to decide
getting to know an individual's psychological functioning
testing the validity and reliability of a test, which should ideally be repeatable.
-confirmation bias: interpreting new evidence as confirmation of one's existing beliefs or theories
-availability heuristic: tendency to check for symptoms related to disorders assumed beforehand to have a high prevalence.
is a standardized procedure for sampling behavior and describing it with categories or scores.
-standardized procedures
-behavior sample
-scores/categories
-norm or standard
- prediction of nontest behaviors
problem analysis, classification and diagnosis, treatment planning, program/treatment evaluation, self-knowledge, and scientific research
intelligence, aptitude, achievement, creativity, personality, interest inventory, behavioral procedures, and neuropsychological/cognitive
every test involves some degree of measurement error, expressed by the formula X = T + e
X = observed score
T = true score
e = positive or negative error
to keep the probability of error low, through: repeatability, integrality, use of scores or categories, interpretation of scores, and prediction of nontest behavior
appraising or estimating the magnitude of one or more attributes in a person
COTAN psychometric criteria
NIP guideline for test use
Committee on Tests and Testing in the Netherlands; informs users of the quality of tests and helps check for systematic errors in instruments.
X = T + e
determines the correlation between test scores after repeated assessment.
T = the consistent part of the score (true score)
e = the inconsistent part of the score (measurement error)
information about the relationship between test scores after repeated assessment or between items in the test.
behavioral differences
measurement errors: item selection, test administration, test scoring
1. systematic errors: concern validity, because they determine how close scores are to the actual behavior; the error is consistently positive or negative.
2. unsystematic errors: concern reliability; they affect the constancy of scores and can be both positive and negative.
errors are + and −, with an average of 0; they are not related to T, not related to each other, and are normally distributed.
the measure of consistency over multiple assessments.
r = σT² / σX² (ratio of true-score variance to observed-score variance)
the closer σT² is to σX², the closer r is to 1.
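a minimal Python sketch (all values invented) of X = T + e with simulated scores, checking that r = var(T)/var(X):

```python
# Simulate classical test theory: observed = true score + random error.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                           # hypothetical sample size
T = rng.normal(100, 15, n)           # true scores
e = rng.normal(0, 5, n)              # errors: mean 0, unrelated to T
X = T + e                            # observed scores

r = T.var() / X.var()                # reliability as a variance ratio
print(f"r = {r:.3f}")                # ~ 225 / (225 + 25) = 0.90
```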
test-retest
alternate forms
split-half & Spearman-Brown
coefficient alpha
KR-20
inter-scorer/rater
assess the relation between the scores of a group on a test across repeated assessments. assumptions:
true scores are correlated; measurement errors are not correlated with each other or with the true scores.
estimates: random fluctuation within individuals and due to the environment.
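a short Python sketch (hypothetical scores) of test-retest reliability as the correlation between two administrations:

```python
# Test-retest reliability: correlate the same group's scores at time 1 and time 2.
import numpy as np

scores_t1 = np.array([12, 18, 9, 22, 15, 17, 11, 20])   # first assessment
scores_t2 = np.array([13, 17, 10, 21, 14, 18, 12, 19])  # repeated assessment

r_tt = np.corrcoef(scores_t1, scores_t2)[0, 1]
print(f"test-retest reliability = {r_tt:.2f}")
```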
assess the relationship between the scores of a group on 2 alternate assessments: 2 versions of the same test. estimation of: random fluctuation (within the individual, due to the environment, and due to the sample of items).
assess the relation between the scores of a group on two test halves: split the test into two halves, then calculate the correlation between them.
rsb = 2rhh / (1 + rhh)
rsb = estimated reliability of the complete test using the Spearman-Brown formula
rhh = correlation between the 2 test halves
estimation of: random fluctuation due to sample of items.
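a Python sketch of the split-half procedure with the Spearman-Brown correction; the item matrix and the odd/even split are illustrative assumptions:

```python
# Split the test into two halves, correlate them, then correct with Spearman-Brown.
import numpy as np

items = np.array([       # rows = persons, columns = item scores (invented)
    [1, 0, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1],
])

half1 = items[:, 0::2].sum(axis=1)        # odd items
half2 = items[:, 1::2].sum(axis=1)        # even items

r_hh = np.corrcoef(half1, half2)[0, 1]    # correlation between the halves
r_sb = 2 * r_hh / (1 + r_hh)              # estimated reliability of the full test
print(f"r_hh = {r_hh:.2f}, r_sb = {r_sb:.2f}")
```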
assess the relation between scores on all possible split halves, with the Spearman-Brown correction.
α = (N / (N − 1)) × (1 − Σσj² / σ²)
α = reliability in terms of coefficient alpha
N = number of items in the test
σj² = variance of item j
Σσj² = sum of the variances of all items
σ² = variance of the total test score
α increases when you have more items and when the items covary more strongly (i.e., when the error variance is relatively smaller).
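a Python sketch computing coefficient alpha straight from the formula above; the item scores are made up:

```python
# Coefficient alpha: α = (N/(N−1)) × (1 − Σσj²/σ²).
import numpy as np

items = np.array([        # rows = persons, columns = items (invented)
    [3, 4, 3, 5],
    [2, 2, 3, 2],
    [4, 5, 4, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 4],
])

N = items.shape[1]                          # number of items
item_vars = items.var(axis=0, ddof=1)       # σj² per item
total_var = items.sum(axis=1).var(ddof=1)   # σ² of the total test score

alpha = (N / (N - 1)) * (1 - item_vars.sum() / total_var)
print(f"alpha = {alpha:.2f}")
```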
for questions answered only with yes or no (dichotomous items); measures internal consistency. when the questions are very alike -> overestimation of reliability.
assess the relation between the scores given by different examiners; used with more subjective methods or diagnostic interviews.
3 assumptions:
1. variance of measurement error on a test is the same for all individuals taking the test
2. the measurement error is normally distributed
3. the SEM is a measure of the expected deviation of X relative to T.
SEM = SDe = standard deviation of the measurement error, computed as SEM = SD × √(1 − r)
for a perfectly reliable test r is 1 and SEM is 0.
depends on:
-context: what you'll do with the scores
-level of the decision: individual decisions need higher reliability, group decisions less.
- Important decisions:
Good: ≥ .90
Sufficient: .80 – .90
Insufficient: < .80
- Less important decisions:
Good: ≥ .80
Sufficient: .70 – .80
Insufficient: < .70
- Group level:
Good: ≥ .70
Sufficient: .60 – .70
Insufficient: < .60
the more important the decision is, the more strict the reliability needs to be.
item response theory: the measurement error depends on the test score. IRT is more about test construction: whether the relevant items are included or not.
68% CI = Xi ± 1 SDe
90% CI = Xi ± 1.65 SDe
95% CI = Xi ± 1.96 SDe
99.7% CI = Xi ± 3 SDe
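a Python sketch (SD, r, and Xi are hypothetical) computing SEM = SD × √(1 − r) and the confidence intervals above:

```python
# SEM and confidence intervals around an observed score Xi.
import math

sd, r, xi = 15, 0.91, 108          # test SD, reliability, observed score
sem = sd * math.sqrt(1 - r)        # standard error of measurement

for label, z in [("68%", 1.0), ("90%", 1.65), ("95%", 1.96)]:
    print(f"{label} CI: {xi - z * sem:.1f} – {xi + z * sem:.1f}")
```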
= summary of distribution of characteristics in a representative sample.
two types:
1. relative: classify on a continuum relative to a set of data; compare the single individual to the norm group.
2. absolute: determine whether a specific requirement is reached.
percentiles: represent cumulative frequency distributions divided into groups of the same size. a percentile is a measure of ranking: Px means that x% of the norm group scores lower.
P50 =median
P2 ≈ −2 SD (scores more than 2 SD below the mean)
P16 ≈ −1 SD
P84 ≈ +1 SD
P98 ≈ +2 SD
also referred to as z-scores: M = 0 and SD = 1. they are linear transformations of the raw scores and keep the same distribution shape; they can be positive or negative.
t-scores (M = 50, SD = 10): used to avoid negative/decimal scores; a normalized score that tells you about your score relative to the norm group.
in order to compare norm scores, you take the participants' scores, transform them into percentiles, and then into the equivalent z-scores: these are the normalized standard scores. if the data are skewed, comparison is harder. it is a non-linear transformation: don't assume the data are always normally distributed.
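a Python sketch of this non-linear normalization (norm-group scores invented; scipy's inverse normal handles the percentile → z step):

```python
# Raw score -> percentile rank in the norm group -> normalized z-score.
import numpy as np
from scipy.stats import norm

norm_group = np.array([10, 12, 13, 15, 15, 16, 18, 20, 23, 30])  # skewed norms
raw = 20                                                          # one participant

percentile = (norm_group < raw).mean()   # proportion of norm group scoring lower
z_norm = norm.ppf(percentile)            # equivalent (normalized) z-score
print(f"percentile = {percentile:.0%}, normalized z = {z_norm:.2f}")
```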
other types of standard scores: M = 5 and SD = 2 (stanines)
summative assessment: did you learn enough?
formative assessment: where are you at, what do you need to improve?
sex, age, class, educational level, ethnicity, regional variance, city/rural areas. 3 ways to build one: random sampling, stratified random sampling, arbitrary sampling. sample size: N > 400 is good, N < 300 is insufficient. plus durability: norms should be timely.
to give them meaning: to determine when behavior is no longer typical or normal, and to find the reason for this.
it is a criterion, a number, that can be qualified as weak, acceptable, or good; it reflects the individual's functioning in the behavior you want to assess.
content validity
criterion validity
construct validity
the extent to which the contents of the test are representative of the behavior/construct you're assessing. to check it, you ask two experts and divide the number of items they both consider relevant by the total number of items: if this ratio is high, content validity is high.
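a minimal Python sketch of the two-expert check; the relevance ratings are invented:

```python
# Content validity ratio: items both experts rate as relevant / total items.
expert_a = [True, True, False, True, True, False]   # expert A's ratings
expert_b = [True, True, True, True, False, False]   # expert B's ratings

both = sum(a and b for a, b in zip(expert_a, expert_b))
print(f"content validity = {both / len(expert_a):.2f}")   # 3/6 = 0.50
```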
to know the correlation between test scores and the behavior you want to predict. 2 types: concurrent (the criterion is assessed at the same time) and predictive (the criterion is assessed in the future). acceptable level of validity: decision theory (hits, false positives, false negatives).
hits, false positives, false negatives
sensitivity, specificity, test prevalence, actual prevalence.
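a Python sketch computing these decision-theory measures from a hypothetical 2×2 table:

```python
# Decision table: test decision (positive/negative) vs. actual status.
hits            = 40   # true positives
false_positives = 10
false_negatives = 5
true_negatives  = 45

total = hits + false_positives + false_negatives + true_negatives
sensitivity       = hits / (hits + false_negatives)
specificity       = true_negatives / (true_negatives + false_positives)
test_prevalence   = (hits + false_positives) / total    # positive by the test
actual_prevalence = (hits + false_negatives) / total    # actually positive

print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
print(f"test prevalence = {test_prevalence:.2f}, actual prevalence = {actual_prevalence:.2f}")
```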
the extent to which the test score is a good reflection of the construct you want to assess.
a variable that cannot be measured directly but is based on theoretical assumptions.
research on the relation between the test scores and scores on other variables.
1- convergent: same construct, different tests (correlations should be high)
2- discriminant: different constructs, different tests (correlations should be low)
statistical analysis of the components of a test: exploratory and confirmatory factor analysis.
theory-consistent group differences and theory-consistent intervention effects. multitrait-multimethod research.
low validity leads to less useful test results.
-not reliable
-theory is incorrect
-test does not measure the right thing but something else
-to assess construct validity, both convergent and discriminant
-reliability> convergent validity> discriminant validity
- a>c>b>d
-if the discriminant validity is higher than the convergent validity, the test is not good.
1. define test purpose
2. choose scaling method
3. item construction and analysis
4. revision
only with a defined and specific purpose can you determine the psychometric properties (validity, reliability, ...). the goal also determines which scale you'll use.
scale = a collection of items whose responses are scored and combined into a scale score, which tells you about the behavior of that person.
-unidimensional scaling methods: opinions on a topic, individual differences, making predictions
-expert ranking (Glasgow Coma Scale)
-Thurstone scale
-Absolute scale (categorizes items based on the absolute deviation from one reference group)
- Likert scale
-Guttman scale
-Empirical scale
to optimize item quality; applicable to all tests, and each item has an item-characteristic curve.
you define 4 properties:
-difficulty index
-discrimination index
-reliability index
-validity index
represents the proportion (pi) of participants who correctly answered an item (i). 0 < pi < 1; the best values are 0.3 < pi < 0.7.
pi(optimal) = (1 + g) / 2
g=chance success level
-item characteristic curve
describes the relation between the value of the characteristic and the likelihood of a correct answer.
how efficiently an item discriminates (di) between persons who obtain high and low scores on the entire test. in the graph, the most discriminating item is the steepest line.
di = (Uc − Lc) / 100
Uc: % who answered correctly in the upper range
Lc: % who answered correctly in the lower range
di>0: is a discriminatory item
di<0: negative discriminatory item
di=0: not a discriminatory item
ideally: 0.3<di<0.6, so positive.
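a Python sketch (answers, chance level, and group percentages all invented) computing the difficulty and discrimination indices:

```python
# Difficulty index pi, optimal difficulty (1 + g)/2, and discrimination di.
correct = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # 0/1 answers on item i
p_i = sum(correct) / len(correct)          # difficulty index: 0.7

g = 0.25                                   # chance level with 4 answer options
p_optimal = (1 + g) / 2                    # 0.625

Uc, Lc = 85, 40                            # % correct in upper / lower scorers
d_i = (Uc - Lc) / 100                      # discrimination index: 0.45
print(p_i, p_optimal, d_i)
```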
which items lower the reliability of the test.
reliability index = SDi × riT
riT= correlation between item score and the rest of the test
SDi= standard deviation of item score
higher=better-> more variance, stronger relation between item and test
which items cause low validity of the test.
validity index = SDi × riC
riC= correlation between item score and criterion
SDi= standard deviation of item scores
higher=better-> more variance, stronger relation between item and criterion
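a Python sketch of both item indices, with hypothetical item, rest-test, and criterion scores:

```python
# Item reliability index = SDi × riT; item validity index = SDi × riC.
import numpy as np

item      = np.array([1, 0, 1, 1, 0, 1, 0, 1])   # item scores
rest_test = np.array([7, 3, 8, 6, 4, 9, 2, 7])   # test score minus this item
criterion = np.array([6, 2, 7, 7, 3, 8, 3, 6])   # external criterion

sd_i = item.std(ddof=1)
r_iT = np.corrcoef(item, rest_test)[0, 1]
r_iC = np.corrcoef(item, criterion)[0, 1]

print(f"reliability index = {sd_i * r_iT:.2f}")
print(f"validity index    = {sd_i * r_iC:.2f}")
```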
re-assess the quality of the items in a new try-out sample of the test's target population
intelligence is what the intelligence test assesses (operational definition). BUT IQ tests are not meant to define intelligence, rather to measure it.
it reflects global capacity of the individual to act purposefully, to think rationally and to deal effectively with the environment.
to map mental skills, predict academic outcomes, explain why some people have difficulties in certain areas, and assess the relationship between disorders and intelligence.
Galton-intelligence from senses
Spearman-global capacity G and specific factors s
Thurstone - 7 primary mental abilities
Luria-simultaneous and successive processing
Guilford-creative thinking
Cattell-Horn-Carroll
Gardner
Sternberg
intelligence has 3 levels:
level3- overall capacity G
level2- broad cognitive capabilities (fluid vs crystallized)
level1- narrow cognitive abilities
critiques the g factor, basing his theory on evidence from brain studies and on 'savants': mentally deficient people with one highly developed talent.
critique of g. 3 components:
analytical
experiential (creative)
contextual (practical)
is a statistical method to investigate how many relatively independent constructs the test consists of.
2 types:
1. exploratory factor analysis (for developing theories)
2. confirmatory factor analysis (for testing theories)
a 1-factor solution vs. a 5-factor solution, between which there should not be a strong difference.
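a hedged Python sketch comparing a 1-factor and a 5-factor solution on simulated subtest scores with sklearn's FactorAnalysis; it illustrates the idea, not the actual WAIS analysis:

```python
# Fit factor models of different sizes and compare their fit.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
g = rng.normal(size=(500, 1))                    # one general factor
loadings = rng.normal(size=(1, 10))              # loadings on 10 subtests
scores = g @ loadings + rng.normal(scale=0.5, size=(500, 10))

for k in (1, 5):
    fa = FactorAnalysis(n_components=k).fit(scores)
    print(f"{k}-factor solution: avg log-likelihood = {fa.score(scores):.2f}")
```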
verbal comprehension index
ability to access and apply word knowledge
similarities and vocabulary
visual spatial index
spatial relationships, speed is important.
block design and visual puzzle
fluid reasoning index
fluid intelligence, use reasoning to identify and apply rules.
matrix reasoning and figure weights
working memory index
ability to register, maintain and manipulate information
digit span and picture span
processing speed index
how quickly you identify something
coding and symbol search
full scale IQ M=100 SD=15
individual subtests M=10 SD=3
- arbitrary cutoffs
- specific clinical interpretation
reliability- coefficient α and test-retest reliability
validity- factor analysis, convergent, discriminant, predictive validity.
1. difference scores
2. SEdiff
3. 95% CI around difference score.
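a Python sketch of the three steps with hypothetical index scores and reliabilities, assuming the standard combination SEdiff = √(SEM1² + SEM2²):

```python
# Difference score, SEdiff, and a 95% CI around the difference.
import math

idx1, idx2 = 112, 98               # two index scores (e.g., VCI vs. PSI)
sd, r1, r2 = 15, 0.92, 0.88        # scale SD and the two reliabilities

sem1 = sd * math.sqrt(1 - r1)
sem2 = sd * math.sqrt(1 - r2)
se_diff = math.sqrt(sem1**2 + sem2**2)

diff = idx1 - idx2
lo, hi = diff - 1.96 * se_diff, diff + 1.96 * se_diff
print(f"difference = {diff}, 95% CI = [{lo:.1f}, {hi:.1f}]")
```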
detecting relations between figures by means of perceptual similarity or analogy.