special sample that includes everyone in the entire population
too expensive, undercoverage, time consuming
numerical summary of a population; it can be known exactly only with a census, so sample statistics are used to estimate it instead
population inference: results from the sample can be generalized to the entire population (as an estimate)
causal inference: the difference in the responses is caused by the difference in treatments when comparing the results from two treatment groups
when we have random sampling
eliminates the effect of unknown extraneous factors, makes sure that on avg the sample looks like the rest of the pop
biased results
1. simple random samples
2. stratified random sampling
3. systematic random sampling
4. cluster random sampling
each sample of size n in the pop has the same chance of being selected
sampling variability
population is divided into homogeneous groups called strata, then an SRS is taken within each stratum before the results are combined. can reduce bias and variability of results
starts with a random individual, then samples every x-th person; make sure the order of the list is not associated with the responses sought
split pop into clusters (groups similar to one another), then select 1 or a few clusters at random and perform a census within each selected cluster
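A minimal sketch of the four sampling methods, using only Python's standard `random` module. The population of 100 ID numbers, the stratum names, and the cluster names are all made up for illustration.

```python
import random

def simple_random_sample(population, n, rng):
    """SRS: every possible sample of size n is equally likely."""
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    """Take an SRS within each stratum, then combine the results."""
    combined = []
    for members in strata.values():
        combined.extend(rng.sample(members, n_per_stratum))
    return combined

def systematic_sample(population, step, rng):
    """Random start among the first `step` units, then every step-th unit."""
    start = rng.randrange(step)
    return population[start::step]

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and take everyone in them (a census)."""
    chosen = rng.sample(list(clusters), n_clusters)
    return [person for name in chosen for person in clusters[name]]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical ID numbers 1..100
strata = {"first_year": list(range(1, 61)), "upper_year": list(range(61, 101))}
clusters = {f"class_{i}": list(range(10 * i + 1, 10 * i + 11)) for i in range(10)}

srs = simple_random_sample(population, 10, rng)
strat = stratified_sample(strata, 5, rng)
syst = systematic_sample(population, 10, rng)
clus = cluster_sample(clusters, 2, rng)
```

Here the stratified sample takes the same number from each stratum for simplicity; sampling in proportion to stratum size is another common choice.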
the tendency for a sample to differ from the corresponding population in some systematic way
1) selection bias (undercoverage)
2) response bias
3) voluntary response bias
4) nonresponse bias
some portion of the population is not sampled at all or has a smaller representation in the sample than it has in the population (usually these people differ from the rest of the pop)
anything in the survey design that influences the responses (ex. being asked about illegal or unpopular behavior)
when individuals can choose on their own whether to participate in the sample
when a large proportion of those sampled fail to respond
we should only make causal inf when we have random allocation, when there is no random allocation the difference in responses could be caused by lurking variables
variables that are related to both group membership and the responses
1) observational studies
2) randomized experiment
the investigator observes individuals and measures variables of interest but does NOT attempt to influence the response, good for trends and possible relationships
1) retrospective
2) prospective
identify subjects, then collect data on events that have already happened (looking back in time)
identify subjects in advance and collect data as events unfold over the following period of time (future)
allows us to establish a cause-and-effect relationship because the experimenter:
a) manipulates factor levels to create treatments
b) randomly assigns subjects to these treatment levels
c) compares the responses of the subject groups across treatment levels
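Random assignment (step b) can be sketched as shuffling the subjects and dealing them out to the treatment levels in turn. The subject labels and the two treatment names are hypothetical.

```python
import random

def randomly_assign(subjects, treatments, rng):
    """Shuffle the subjects, then deal them out to treatments in turn."""
    shuffled = subjects[:]  # copy so the original list is left untouched
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(subject)
    return groups

rng = random.Random(1)  # fixed seed so the sketch is repeatable
subjects = [f"subject_{i}" for i in range(1, 21)]
groups = randomly_assign(subjects, ["treatment", "control"], rng)
```

Because chance alone decides who lands in which group, lurking variables should be balanced across the groups on average.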
no
causal inferences are allowed
population inferences are allowed
bar chart and pie chart
dot plots, stem plots, histograms, time plots, box plots, scatterplots
what values have been observed? how often did every value occur?
each possible category, frequency of individuals who fall into each category or relative frequency of individuals who fall into each category
frequency: count; relative frequency: proportion or percentage
frequency/ number of observations
100%
the frequencies or percent in different categories
the relationship between parts and the whole
true
category relative freq X 360 degrees
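A small sketch of a frequency table, the relative frequencies, and the pie-chart angles, assuming a made-up list of six categorical responses.

```python
from collections import Counter

responses = ["A", "B", "A", "C", "A", "B"]  # hypothetical categorical data

freq = Counter(responses)                   # frequency: count per category
n = len(responses)                          # number of observations
rel_freq = {cat: count / n for cat, count in freq.items()}
angles = {cat: rel * 360 for cat, rel in rel_freq.items()}  # pie slice angles

for cat in sorted(freq):
    print(f"{cat}: freq={freq[cat]}, rel={rel_freq[cat]:.3f}, angle={angles[cat]:.1f}")
```

The relative frequencies add up to 1 (100%), so the slice angles add up to 360 degrees.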
contingency table
the margins of the table give totals and the frequency distributions for each of the variables
a marginal distribution of its respective variable
total for the variable/ total of the sample
shows the distribution of one variable for just the observations that satisfy a condition on another variable. for example, instead of looking at full-time and part-time status for arts AND science students combined, you look at arts students alone or science students alone
two variables are dependent (associated) when the distribution of one variable differs across the categories of the other
if the conditional distribution of one variable is not the same for each category for another, there is an association between these variables
if the conditional distribution of one variable is the same for each category of another, there is no association between these variables
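A sketch of marginal and conditional distributions from a contingency table, using the arts/science and full-time/part-time example. The counts are invented for illustration.

```python
# Hypothetical contingency table: (faculty, status) -> count
counts = {
    ("arts", "full-time"): 30,
    ("arts", "part-time"): 10,
    ("science", "full-time"): 40,
    ("science", "part-time"): 20,
}

def marginal(counts, axis):
    """Marginal distribution: totals for one variable / total of the sample."""
    totals = {}
    for key, c in counts.items():
        totals[key[axis]] = totals.get(key[axis], 0) + c
    grand = sum(totals.values())
    return {k: v / grand for k, v in totals.items()}

def conditional(counts, given):
    """Distribution of status among only the rows where faculty == given."""
    row = {status: c for (faculty, status), c in counts.items() if faculty == given}
    row_total = sum(row.values())
    return {status: c / row_total for status, c in row.items()}

status_marginal = marginal(counts, 1)    # full-time 0.70, part-time 0.30
arts = conditional(counts, "arts")       # full-time 0.75, part-time 0.25
science = conditional(counts, "science") # full-time about 0.67, part-time about 0.33

# The conditional distributions differ, so these counts show an association.
associated = any(abs(arts[s] - science[s]) > 1e-9 for s in arts)
```

With real sample data, small differences between conditional distributions can arise by chance, so an exact comparison like this one is only a starting point.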
sample size, number of observations of the variable y
the variable of interest, what we have sample data of
y1 is the first sample observation of the variable y, whereas y2 is the second sample observation of the variable y
the mean (center), the median and the mode
sample mean
sum of all observations/ number of observations
no because it is not resistant to outliers
median
the value that divides the ordered sample into two sets of the same size, one half below M and the other half above M
order data, smallest to largest. if n is odd, use the single middle value; if n is even, use the avg of the middle 2 values
an alternative measure for the centre, it is the value that occurs with the highest frequency in a data set
pro- easy to locate, con- data may have none or more than 1 mode whereas it will only have one mean and one median
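A short sketch of the three measures of center on a made-up data set, with one large value appended to show that the mean moves much more than the median. The median is coded by hand to follow the odd/even rule; the mean and mode use Python's standard `statistics` module.

```python
import statistics

def median_by_hand(values):
    """Order the data; middle value if n is odd, avg of middle two if even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

y = [2, 3, 3, 5, 7]          # hypothetical sample, n = 5
y_with_outlier = y + [40]    # same sample plus one outlier, n = 6

mean_before = statistics.mean(y)                 # 20 / 5 = 4
mean_after = statistics.mean(y_with_outlier)     # 60 / 6 = 10
median_before = median_by_hand(y)                # 3
median_after = median_by_hand(y_with_outlier)    # (3 + 5) / 2 = 4.0
modes = statistics.multimode(y_with_outlier)     # [3]; may hold several values
```

The outlier moves the mean from 4 to 10 but the median only from 3 to 4, which is what "not resistant to outliers" means for the mean.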
tallest
cut into two parts of the same area
the balance point of the distribution
mean= median= mode
mean > median > mode
mean < median < mode
measure of spread: max - min
range, SD, IQR
the sum of deviations from the mean is always equal to 0
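The zero-sum property of deviations from the mean follows directly from the definition of the sample mean. The symbol \bar{y} for the sample mean is an assumed notation here; y_i and n are the sample observations and the sample size.

```latex
\[
\sum_{i=1}^{n} (y_i - \bar{y})
  = \sum_{i=1}^{n} y_i - n\bar{y}
  = n\bar{y} - n\bar{y}
  = 0,
\qquad \text{since } \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i .
\]
```

This is why the standard deviation is built from squared deviations: the plain deviations always cancel out.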
Q3 (75th percentile) - Q1 (25th percentile)
they are used to determine outliers.
upper fence = Q3 + 1.5 X IQR, any observation above it is an outlier
lower fence = Q1 - 1.5 X IQR, any observation below it is an outlier
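A sketch of the IQR and fence calculation on a made-up data set. Quartile conventions differ between textbooks and software; this one assumes the "median of each half" rule (leaving out the middle value when n is odd), which may not match the convention used in a given course.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves (assumed rule)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])

data = [1, 2, 3, 4, 5, 6, 7, 8, 30]   # hypothetical sample

q1, q3 = quartiles(data)              # 2.5 and 7.5
iqr = q3 - q1                         # 5.0
lower_fence = q1 - 1.5 * iqr          # -5.0
upper_fence = q3 + 1.5 * iqr          # 15.0
outliers = [y for y in data if y < lower_fence or y > upper_fence]  # [30]
```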
median line in center of box and whiskers of equal length
median line left of center and long right whisker
median line right of center and long left whisker