How can I generate conditional data distributions by taking pieces of scatterplots?

Question

I am taking my first course in multiple linear regression, so I'm still a beginner in R. We recently learned a little about how to take slices of two-dimensional scatterplot data, both horizontally and vertically. What I would like to know is how to go beyond the basic scatterplot and use conditional grouping of the data into slices to look for patterns.

For example, I am working with data from a bank, in which we regress current salary (csalary) on beginning salary (bsalary). This is what my data frame looks like.

    > str(data)
    'data.frame': 474 obs. of 10 variables:
     $ id     : num 628 630 632 633 635 637 641 649 650 652 ...
     $ bsalary: num 8400 24000 10200 8700 17400 ...
     $ gender : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
     $ time   : num 81 73 83 93 83 80 79 67 96 77 ...
     $ age    : num 28.5 40.3 31.1 31.2 41.9 ...
     $ csalary: num 16080 41400 21960 19200 28350 ...
     $ educlvl: num 16 16 15 16 19 18 15 15 15 12 ...
     $ work   : num 0.25 12.5 4.08 1.83 13 ...
     $ jobcat : Factor w/ 7 levels "Clerical","Office Trainee",..: 4 5 5 4 5 4 1 1 1 3 ...
     $ ethnic : Factor w/ 2 levels "White","Non-White": 1 1 1 1 1 1 1 1 1 1 ...

To explore the relationship between bsalary and csalary, I created a scatterplot using some of the features of the lattice library. I arbitrarily drew vertical lines at intervals of $5,000 along bsalary.

    library(lattice)

    # Constructing vertical "slices" of our csalary ~ bsalary data
    # First we define a vector with our slice points, in this case
    # $5,000 bsalary increments
    bslices = seq(from = 5000, to = 30000, by = 5000)
    length(bslices)

    xyplot(csalary ~ bsalary,
           main = "Current Bank Employee Salary as Predicted by Beginning Salary",
           xlab = "Beginning Salary ($USD)",
           ylab = "Current Salary ($USD)",
           panel = function(...) {
             panel.abline(v = bslices, col = "red", lwd = 2)
             panel.xyplot(...)
           })

The above code gives me this.

(image: scatterplot of csalary against bsalary with red vertical slice lines; source: skitch.com)

That works well. But I feel there should be an easy way to generate graphs from my data that group the sliced data into boxplots:

(image: boxplots of csalary grouped by bsalary slice; source: skitch.com)

Or stacked dot plots, again grouped by slice, like this:

(image: stacked dot plot of csalary grouped by bsalary slice; source: skitch.com)

Ultimately, my question is how to turn raw scatterplot data into conditionally grouped data. I feel that there are some simple, underlying lattice features (or even simpler commands that do not require lattice) that would allow me to start slicing my data to look for patterns.

Thank you in advance!

+4

r statistics

briandk Feb 22 '10 at 5:24


4 answers

Do you really want to do this? Turning a continuous variable into an ordinal one discards information, since different values of the variable X end up in the same bin. I think your boxplot conveys much less information than your scatterplot.

If you are dissatisfied with the scatterplot because of overlapping points, one way to keep the information is to add a smooth curve that shows the trend. See the documentation for lowess for an example.
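As a rough illustration of the smoothing idea, here is a minimal base-R sketch. The simulated bsalary and csalary vectors are made up for the example and only stand in for the asker's bank data; lowess() is the base R smoother mentioned above.

```r
# Simulated stand-in data (an assumption, not the real bank data)
set.seed(1)
n <- 200
bsalary <- exp(rnorm(n, mean = log(6000), sd = 0.4))
csalary <- 2 * bsalary + rnorm(n, sd = 2000)

# Ordinary scatterplot
plot(bsalary, csalary,
     xlab = "Beginning Salary ($USD)",
     ylab = "Current Salary ($USD)")

# lowess() returns a list with components x (sorted) and y (smoothed values)
fit <- lowess(bsalary, csalary)

# Overlay the smooth trend on the scatterplot
lines(fit, col = "red", lwd = 2)
```

The smoother span can be adjusted through the f argument of lowess() if the curve looks too wiggly or too stiff.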

On your chart, three observations with a salary above $20,000 push the remaining observations into a corner. Dropping them and replotting will give a better plot.

Another approach for skewed data like yours is to plot the logarithms of the variables instead of the variables themselves.
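A small sketch of the log-scale suggestion, again on simulated right-skewed data (the vectors below are assumptions for illustration). In base graphics, passing log = "xy" to plot() puts both axes on a logarithmic scale without transforming the data by hand.

```r
# Simulated right-skewed stand-in data (an assumption, not the real bank data)
set.seed(7)
n <- 200
bsalary <- exp(rnorm(n, mean = log(7000), sd = 0.5))
csalary <- bsalary * exp(rnorm(n, mean = log(2), sd = 0.2))

# Side-by-side comparison: raw scale versus log-log scale
op <- par(mfrow = c(1, 2))
plot(bsalary, csalary, main = "Raw scale")
plot(bsalary, csalary, log = "xy", main = "Log scale")
par(op)
```

On the log scale, the few very large salaries no longer compress the bulk of the points into one corner.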

+2

Jyotirmoy Bhattacharya Feb 22 '10 at 6:42


Instead of cutting the data by the value of the conditioning variable (turning a continuous variable into a discrete one), it is more efficient to condition using a kernel function. There is a package that does this: hdrcde. Check out the examples in the help files.
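To give a feel for what kernel conditioning means, here is a hand-rolled base-R sketch. It is not the hdrcde API: the helper cond_density(), the bandwidth h, and the simulated data are all assumptions made for illustration. Each observation is weighted by how close its bsalary is to a target value x0, and a weighted density of csalary is then estimated.

```r
# Simulated stand-in data (an assumption, not the real bank data)
set.seed(42)
n <- 300
bsalary <- runif(n, 5000, 30000)
csalary <- 2 * bsalary + rnorm(n, sd = 3000)

# Hypothetical helper: density of y given x near x0, using Gaussian
# kernel weights in x with bandwidth h
cond_density <- function(x, y, x0, h) {
  w <- dnorm((x - x0) / h)
  w <- w / sum(w)           # density() expects weights that sum to 1
  density(y, weights = w)
}

d10 <- cond_density(bsalary, csalary, x0 = 10000, h = 2000)
d20 <- cond_density(bsalary, csalary, x0 = 20000, h = 2000)

# Compare the two conditional densities of csalary
plot(d10, xlim = range(csalary),
     main = "csalary given bsalary near 10,000 (solid) and 20,000 (dashed)")
lines(d20, lty = 2)
```

Unlike hard slices, no observation is forced into a single bin here; points simply count less the further they are from the target bsalary.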

+2

Rob Hyndman Feb 22 '10 at 21:18


This page explains it: http://www.statmethods.net/advgraphs/trellis.html

You basically want to change the formula for the plots. It should look more like

    csalary ~ bsalary | gender

which should split the plot into separate panels based on the different values of gender. lattice also has a fair amount of machinery for continuous conditioning variables.
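A short sketch of both kinds of conditioning in lattice, using a simulated data frame d (an assumption; the column names only mirror the question). equal.count() is the lattice function that turns a continuous variable into overlapping or non-overlapping intervals (a "shingle") that can be used after the | in the formula.

```r
library(lattice)

# Simulated stand-in data frame (an assumption, not the real bank data)
set.seed(99)
n <- 200
d <- data.frame(
  bsalary = runif(n, 5000, 30000),
  gender  = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  age     = runif(n, 23, 65)
)
d$csalary <- 2 * d$bsalary + rnorm(n, sd = 3000)

# One panel per level of a factor
p_gender <- xyplot(csalary ~ bsalary | gender, data = d)

# One panel per interval of a continuous variable, via a shingle
age_groups <- equal.count(d$age, number = 4, overlap = 0.1)
p_age <- xyplot(csalary ~ bsalary | age_groups, data = d)

print(p_gender)
print(p_age)
```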

0

TheSteve0 Feb 22 '10 at 6:05


Ian Fellows · Accepted Answer · 2010-02-22T06:12:15+0000

You can use the cut() function to bin your data into ordinal categories. The ggplot2 qplot function can then very easily create the graphs you need.

    library(ggplot2)

    # fake data
    csalary <- rnorm(100,,100)
    bsalary <- csalary + rnorm(100,,10)

    # Regular scatter plot
    qplot(bsalary, csalary)

    # Stacked dot plot
    qplot(cut(bsalary, 10), csalary)

    # Box plot
    qplot(cut(bsalary, 10), csalary, geom = "boxplot")
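For comparison, the same idea can be sketched in base R without ggplot2, using the $5,000 slice points from the question as explicit breaks for cut(). The data below are simulated stand-ins, so this is only an assumption-laden illustration of the approach, not a result for the real bank data.

```r
# Simulated stand-in data (an assumption, not the real bank data)
set.seed(123)
n <- 200
bsalary <- runif(n, 5000, 30000)
csalary <- 2 * bsalary + rnorm(n, sd = 3000)

# Same slice points as in the question: $5,000 increments
breaks <- seq(from = 5000, to = 30000, by = 5000)

# cut() turns bsalary into an ordered set of slices
slice <- cut(bsalary, breaks = breaks, include.lowest = TRUE, dig.lab = 5)

# Boxplot of csalary within each bsalary slice
boxplot(csalary ~ slice,
        xlab = "Beginning Salary slice ($USD)",
        ylab = "Current Salary ($USD)")

# Dot plot of the same groups, jittered to reduce overplotting
stripchart(csalary ~ slice, vertical = TRUE, method = "jitter", pch = 1,
           xlab = "Beginning Salary slice ($USD)",
           ylab = "Current Salary ($USD)")

# Number of observations per slice
slice_counts <- table(slice)
```

Passing breaks explicitly, rather than a number of bins as in cut(bsalary, 10), keeps the slices aligned with the vertical lines drawn on the original scatterplot.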
