R or Python to manage files

Question

I have 4 fairly complex R scripts that are used to manage CSV and XML files. They were created by another department, which works exclusively in R.

My understanding is that although R works very quickly with data, it is not optimized for working with files. Can I expect a significant increase in speed by converting these scripts to Python? Or is it a waste of time?

+3

performance python file r

danspants May 05 '10 at 1:24


6 answers

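A Python script for a large (280 [...]) CSV file: it extracts from dbpedia data the entries that use the company infobox and have an ISIN. In R [...] the R script [...] 10 [...], the Python script (10 [...] 1 [...]). [...] The Python script: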

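from time import perf_counter

start = perf_counter()
infile = "infobox_de.csv"
outfile = "companies.csv"

oldthing = None
is_company = False
has_isin = False
matches = 0
buf = []  # lines of the entity ("thing") currently being read

# utf-8 is an assumption; adjust to the actual encoding of the file
with open(infile, encoding="utf-8") as reader, \
        open(outfile, "w", encoding="utf-8") as writer:
    for line in reader:
        row = line.strip().split("\t")
        thing = row[0]
        key = row[1] if len(row) > 1 else ""
        value = row[2] if len(row) > 2 else ""
        if oldthing is not None and oldthing != thing:
            # a new entity starts: write out the previous one if it qualified
            if is_company and has_isin:
                matches += 1
                writer.writelines(buf)
            buf = []
            is_company = False
            has_isin = False
        is_company = is_company or (key.lower() == "wikipageusestemplate"
                                    and value.lower() == "template:infobox_unternehmen")
        has_isin = has_isin or (key.lower() == "isin" and value != "")
        oldthing = thing
        buf.append(line)
    # the last entity is never followed by a new one, so flush it here
    if is_company and has_isin:
        matches += 1
        writer.writelines(buf)

print(f"finished after {perf_counter() - start:.1f} seconds; {matches} matches.")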
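The grouping logic of the script above can also be written with `csv.reader` and `itertools.groupby`. This is only a sketch under assumptions: the function name `filter_companies` is made up for illustration, the sample rows and the ISIN value are invented, and it relies on all rows of one entity being adjacent in the file, as the script above does.

```python
import csv
from itertools import groupby

TEMPLATE = "template:infobox_unternehmen"


def filter_companies(lines):
    """Yield the rows of every entity that uses the company infobox
    and has a non-empty ISIN.

    `lines` is any iterable of tab-separated lines (thing, key, value).
    Rows of one entity are assumed to be adjacent.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = (row for row in reader if row)  # skip blank lines
    for _thing, group in groupby(rows, key=lambda row: row[0]):
        group = list(group)
        # pad short rows so that key and value always exist
        pairs = [(row + ["", ""])[1:3] for row in group]
        is_company = any(
            key.lower() == "wikipageusestemplate" and value.lower() == TEMPLATE
            for key, value in pairs
        )
        has_isin = any(key.lower() == "isin" and value != "" for key, value in pairs)
        if is_company and has_isin:
            yield from group


# Invented sample data, not taken from dbpedia
sample = [
    "Acme\twikiPageUsesTemplate\tTemplate:Infobox_Unternehmen\n",
    "Acme\tISIN\tDE0001234567\n",
    "Berlin\twikiPageUsesTemplate\tTemplate:Infobox_Ort\n",
    "Berlin\tname\tBerlin\n",
    "NoIsin\twikiPageUsesTemplate\tTemplate:Infobox_Unternehmen\n",
]
kept = list(filter_companies(sample))
for row in kept:
    print("\t".join(row))
```

Because the function takes any iterable of lines, an open file object can be passed in directly, so the file is still streamed rather than loaded into memory at once.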

The R script (this version only checks for the infobox template, not the ISIN, and reads the result back in with read.csv):

infile <- "infobox_de.csv"
maxLines <- 65000

reader <- file(infile, "r")
writer <- textConnection("queryRes", open = "w", local = TRUE)
# writeLines appends the newline itself
writeLines("thing\tkey\tvalue\tetc", writer)

oldthing <- ""
hasInfobox <- FALSE
lineNumber <- 0
matches <- 0
key <- ""
value <- ""
thing <- ""
buf <- character(0)

repeat {
  # read the file in chunks of maxLines lines
  lines <- readLines(reader, maxLines)
  if (length(lines) == 0) break
  for (line in lines) {
    lineNumber <- lineNumber + 1
    row <- unlist(strsplit(line, "\t"))
    if (length(row) > 0) thing <- row[1]
    if (length(row) > 1) key <- row[2]
    if (length(row) > 2) value <- row[3]
    if ((length(row) > 0) && (oldthing != thing)) {
      # a new entity starts: write out the previous one if it qualified
      if (hasInfobox) {
        matches <- matches + 1
        writeLines(buf, writer)
      }
      buf <- character(0)
      hasInfobox <- FALSE
    }
    hasInfobox <- hasInfobox || ((tolower(key) == "wikipageusestemplate") && (tolower(value) == "template:infobox_unternehmen"))
    oldthing <- thing
    buf <- c(buf, line)
  }
}
# flush the last entity, which is never followed by a new one
if (hasInfobox) {
  matches <- matches + 1
  writeLines(buf, writer)
}
close(reader)
close(writer)
readRes <- textConnection(queryRes, "r")
result <- read.csv(readRes, sep = "\t", stringsAsFactors = FALSE)
close(readRes)
result

[...] readLines [...] 65000 [...] 500 [...]

+2

Karsten W. 09 '10 16:35

[...] R [...] IO [...] IO [...]

+1

Eloff 05 '10 1:53

" "? , , .., , , bash .., , , , .., , , Python R., , R , , , .

+1

wescpy 05 '10 3:16

[...]

R [...] R [...] Python [...]

0

dlamotte 05 '10 1:35

R [...]:

data.frames ([...])

Search for R run-time optimization and profiling, and you will find many resources to help you.

0

Tal Galili May 05 '10 at 9:22

Judowill · Accepted Answer · 2010-05-05T03:01:57+0000

I write in R and Python regularly. I find Python modules for writing, reading, and analyzing information easier to use, maintain, and update. Small subtleties, such as the way Python lets you work with lists of items compared with R's indexing, make the code much easier to read.

I very much doubt that you will get a significant speedup by switching languages. If you become the new "maintainer" of these scripts, and you find that Python is easier to understand and extend, then I would say go for it.

Computer time is cheap ... programmer time is expensive. If you have other things to do, I would just limp along with what you have until you have a free day to work on them.
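One way to base the decision on numbers rather than guesses is to time a plain pass over a representative file before rewriting anything. The sketch below is only an illustration: the helper name `time_line_pass` and the throwaway demo file are assumptions, not part of the original scripts. On the R side, `system.time()` gives a comparable measurement.

```python
import os
import tempfile
from time import perf_counter


def time_line_pass(path, encoding="utf-8"):
    """Return (line_count, seconds) for one line-by-line pass over a text file."""
    start = perf_counter()
    count = 0
    with open(path, encoding=encoding) as handle:
        for _line in handle:
            count += 1
    return count, perf_counter() - start


# Demo on a throwaway file; point the function at a real CSV instead
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8"
) as tmp:
    tmp.writelines(f"thing{i}\tkey\tvalue\n" for i in range(10_000))
    demo_path = tmp.name

line_count, seconds = time_line_pass(demo_path)
os.remove(demo_path)
print(f"{line_count} lines read in {seconds:.4f} seconds")
```

If the raw read takes a small fraction of the scripts' total run time, the bottleneck is in the processing logic rather than in file IO, and a port is unlikely to pay off on speed alone.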

Hope this helps.
