From Stata to R: Create a scatter chart with vertical date lines on a subset

Introduction

I am trying to replicate in R a scatterplot that I created in Stata on a subset of the data. The scatterplot has the time variable "date" along the x axis (mm/dd/yyyy) and the integer variable "cost" along the y axis (amount of money, in US dollars). The marker labels come from a categorical variable, "company" (the company name).

The actual data set is very large, but a sample looks like this (see the R code below): observations (i.e. rows) are transactions, identified in column 1, followed by variables giving the date of the transaction (column 2), the transaction cost (column 3), and the name of the company that initiated the transaction (column 4).

    # Sample Data Frame (R Code)
    transactionID <- c(1, 2, 3, 4)
    date <- as.Date(c("2006-08-06", "2008-07-30", "2009-04-16", "2013-02-05"))
    cost <- as.integer(c(1208, 23820, 402, 89943))
    company <- c("ACo", "BInc", "CInd", "DOp")
    thedata <- data.frame(transactionID, date, cost, company)

Doing this in Stata

The scatterplot that I want has "date" along the x axis and "cost" along the y axis, with "company" shown as marker labels, and it also has 3 vertical lines with different formatting to indicate important events. Steps to create this in Stata:

  1. Define the x-axis points for the vertical lines on the dates September 10, 2007, January 28, 2008, January 18, 2012, and February 5, 2013.

    display mdy(9, 10, 2007)
    display mdy(1, 28, 2008)
    display mdy(2, 5, 2013)

The three display commands above return the values 17419, 17559, and 19394, which is how Stata stores those dates internally (as days since January 1, 1960), and which are plugged into the code below to draw the scatterplot.

  2. Create the scatterplot, adding the three vertical lines from step 1 and formatting them as dotted, dashed, and solid lines in red, blue, and green with different thicknesses, with "cost" along the y axis, "date" along the x axis, and the company name as marker labels, only for transactions that were less than or equal to $3,000:

    graph twoway scatter cost date if cost <= 3000, mlabel(company) xline(17419, lpatt(dot) lwidth(thick) lcol(red)) xline(17559, lpatt(dash) lwidth(medthick) lcol(blue)) xline(19394, lpatt(solid) lwidth(thin) lcol(green))

Problems doing this in R

When I tried to replicate this in R, I ran into the following problems:

  • I cannot figure out how to add vertical lines at these specific dates, or how to change their pattern and thickness.
  • The y axis ("cost") is in scientific notation (e.g. 2e+05) instead of ordinary numbers (e.g. 200,000).
  • I do not quite understand subsetting in R. In Stata, I can easily add "if" qualifiers to restrict a command to a subset of the data (for example, "if cost > 3000 & transactionID < 5"), and then easily change them to re-run analyses or plots on different subsets. But in R it seems there are extra steps: you first have to subset the data and save it as a new object, and then run the analysis on that object. Is that right? I see some advantages to this, but also some disadvantages (for example, hundreds of different objects cluttering up the workspace while exploring data).

So far I have put together the following code. At first I tried to do this with the base-install commands plot() and text(), but it seemed that this was not possible in base R. Then I tried the ggplot2 package, but I still cannot figure out how to make it look like what I could do in Stata:

    library(ggplot2)
    ggplot(thedata, aes(date, cost)) +
      geom_text(label = thedata$company, color = "blue", vjust = 0) +
      geom_vline(xintercept = as.numeric(thedata$date[c(I don't know what goes here, or here)]),
                 linetype = "dotted", color = "red")

As you can see, I cannot figure out how the coordinates for the xintercept argument of geom_vline work (and cannot find this in the official help file), especially when I want them to be dates (in particular, dates that may or may not be in the data frame), and I cannot figure out how to change the line thickness.

2 answers

The question is very nicely put together. If you are still interested in a base graphics solution:

    transactionID <- c(1, 2, 3, 4)
    date <- as.Date(c("2006-08-06", "2008-07-30", "2009-04-16", "2013-02-05"))
    cost <- as.integer(c(1208, 23820, 402, 89943))
    company <- c("ACo", "BInc", "CInd", "DOp")
    thedata <- data.frame(transactionID, date, cost, company)

    par(mar = c(5,7,3,2), tcl = .2, las = 1)
    with(thedata, plot(date, cost, xlab = 'Date', ylab = '', axes = FALSE, main = 'a plot'))

    dseq <- seq.Date(as.Date('2006-01-01'), as.Date('2013-01-01'), by = 'year')
    axis.Date(1, at = dseq, labels = format(dseq, format = '%Y'))
    # axis.Date(1, at = seq.Date(min(date), max(date), by = 'year'))
    axis(2, at = pretty(cost), labels = format(pretty(cost), scientific = FALSE, big.mark = ','))

    ## add lines at specified dates
    abline(v = as.Date(c('2007-09-10','2008-01-28','2012-01-18')), lwd = 1:3,
           lty = c('dotted','dashed','solid'), col = c('red','blue','green'))

    ## add company labels
    text(x = date, y = cost, pos = 3, xpd = NA, labels = ifelse(cost <= 3000, company, ''))
    title(ylab = 'Cost', line = 5)
    box('plot', bty = 'l')


To address some specific issues:

  • I am using as.Date. R stores dates similarly to Stata:

     abline(v = as.Date(c('2007-09-10','2008-01-28','2012-01-18')), lwd = 1:3, lty = c('dotted','dashed','solid'), col = c('red','blue','green')) 
  • Use format to avoid scientific notation on the axis labels:

     format(pretty(cost), scientific = FALSE, big.mark = ',')
     # [1] "      0" " 20,000" " 40,000" " 60,000" " 80,000" "100,000"
  • You can, of course, create subsets if you are more comfortable with that, but there is usually a way to do it in a single line in R:

     ifelse(cost <= 3000, company, '')
     # [1] "ACo"  ""     "CInd" ""
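On the first point, a minimal sketch of what "stored similarly" means, assuming base R only: an R Date is a count of days since 1970-01-01, whereas Stata counts from 1960-01-01, so the day numbers returned by mdy() in the question map back to R dates once that origin is supplied. The names stata_days, r_dates, and offset are illustrative and not from the original post.

```r
# Day counts returned by Stata's mdy() in the question (days since 1960-01-01)
stata_days <- c(17419, 17559, 19394)

# Convert to R Dates by telling as.Date() which origin the counts use
r_dates <- as.Date(stata_days, origin = "1960-01-01")
format(r_dates)
# [1] "2007-09-10" "2008-01-28" "2013-02-05"

# R's own internal value counts days from 1970-01-01 instead
as.numeric(r_dates)

# The two systems differ by the number of days between the two origins
offset <- as.numeric(as.Date("1970-01-01") - as.Date("1960-01-01"))
all(stata_days - as.numeric(r_dates) == offset)
```

Because the x axis of a date plot sits on this numeric day scale, passing as.Date(...) values to abline(v = ...) places the lines at the intended dates, whether or not those dates occur in the data.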

Most base graphics functions are vectorized, which is why this is so simple. I am no ggplot wizard, and it usually gives me a headache when I try to make very precisely formatted plots like this one. In general, ggplot is good for nice, quick-and-dirty plots. If you want something very specific or are publishing something, base R graphics is the way to go.
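On the subsetting question, a minimal sketch of Stata-style "if" qualifiers in base R without saving each subset as a new object. It rebuilds the sample data from the question; the particular conditions and summaries are only illustrations.

```r
# Sample data from the question
thedata <- data.frame(
  transactionID = c(1, 2, 3, 4),
  date = as.Date(c("2006-08-06", "2008-07-30", "2009-04-16", "2013-02-05")),
  cost = as.integer(c(1208, 23820, 402, 89943)),
  company = c("ACo", "BInc", "CInd", "DOp")
)

# Roughly "summarize cost if cost <= 3000 & transactionID < 5" in Stata:
# subset() filters on the fly and with() evaluates the expression on the result
with(subset(thedata, cost <= 3000 & transactionID < 5), summary(cost))

# The same condition as a logical index, again without creating a new object
mean(thedata$cost[thedata$cost <= 3000 & thedata$transactionID < 5])

# Plot only the matching rows; change the condition and re-run to explore
plot(cost ~ date, data = thedata, subset = cost <= 3000)
```

Nothing is stored beyond thedata itself, so the workspace does not fill up with intermediate objects; the condition plays the role of the "if" qualifier and can be edited in place.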


So, here is a ggplot approach, which I think produces what you are asking for.

    library(ggplot2)
    # dates of the key events, kept in their own data frame
    key.events <- data.frame(date=as.Date(c("2007-09-10","2008-01-28","2012-01-18")))
    # subset inline: only transactions of $3,000 or less, as in the question
    ggplot(thedata[thedata$cost<=3000,],aes(x=date,y=cost))+
      geom_point(shape=1,size=3)+
      geom_text(aes(label=company),vjust=-1)+
      scale_y_continuous(expand=c(0.2,0.2))+
      geom_vline(data=key.events, size=1,
                 aes(xintercept=as.integer(date),color=factor(date),linetype=factor(date)))+
      scale_color_manual(values=c("red","blue","green"))+
      scale_linetype_manual(values=c("dotted","dashed","solid"))+
      theme_bw()


Source: https://habr.com/ru/post/1209382/

