R or Python to manage files

I have 4 fairly complex R scripts that are used to manage CSV and XML files. They were created by another department, which works exclusively in R.

My understanding is that although R works very quickly with data, it is not optimized for working with files. Can I expect a significant speed increase from converting these scripts to Python, or would it be a waste of time?

+3
6 answers

I write in R and Python regularly. I find Python modules for writing, reading, and parsing information easier to use, maintain, and update. Small subtleties, such as the way Python lets you work with lists of items compared with R's indexing, make code much easier to read.

I very much doubt that you will get a significant speedup by switching languages. If you are becoming the new "maintainer" of these scripts, and you find Python easier to understand and extend, then I would say go for it.

Computer time is cheap ... programmer time is expensive. If you have other things to do, I would just limp along with what you have until you have a free day to port them.
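One way to check this before committing to a port is to time a representative file operation in both languages. Below is a minimal Python sketch. The helper `time_csv_read` and the throwaway sample file are made up for illustration and are not part of the scripts in question. On the R side, the same step can be wrapped in `system.time()`.

```python
import csv
import os
import tempfile
import time


def time_csv_read(path, repeats=3):
    """Return (row count, best wall-clock seconds) for reading every row of a CSV file."""
    best = float("inf")
    rows = 0
    for _ in range(repeats):
        start = time.perf_counter()
        with open(path, newline="", encoding="utf-8") as handle:
            rows = sum(1 for _ in csv.reader(handle))
        best = min(best, time.perf_counter() - start)
    return rows, best


# Hypothetical sample data: replace this with one of the real CSV files.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline="", encoding="utf-8"
) as tmp:
    writer = csv.writer(tmp)
    for i in range(1000):
        writer.writerow([i, f"name{i}", i * 2])
    sample_path = tmp.name

row_count, seconds = time_csv_read(sample_path)
os.remove(sample_path)
print(f"{row_count} rows in {seconds:.4f} s")
```

If the measured difference between the two languages is small next to the total run time, a rewrite is unlikely to pay for itself on speed alone.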

Hope this helps.

+10

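I wrote a Python script to extract rows from a large (280 …) CSV file: all entries for companies in dbpedia that have an ISIN. I then tried the same in R, but the R script took about 10 times as long as the Python script (10 … versus 1 …). Here is the Python script: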

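from time import perf_counter  # time.clock() was removed in Python 3.8

start = perf_counter()
infile = "infobox_de.csv"
outfile = "companies.csv"

oldthing = ""
isCompany = False
hasISIN = False
matches = 0
buf = []  # lines belonging to the current "thing"

# The input is assumed to be UTF-8 text with tab-separated fields.
with open(infile, encoding="utf-8") as reader, open(outfile, "w", encoding="utf-8") as writer:
    for line in reader:
        row = line.rstrip("\n").split("\t")
        thing = row[0]
        # Short rows get empty fields instead of stale values from the previous row.
        key = row[1] if len(row) > 1 else ""
        value = row[2] if len(row) > 2 else ""
        if oldthing != thing:
            # A new thing starts: write out the previous one if it matched.
            if isCompany and hasISIN:
                matches += 1
                writer.writelines(buf)
            buf = []
            isCompany = False
            hasISIN = False
        isCompany = isCompany or (key.lower() == "wikipageusestemplate" and value.lower() == "template:infobox_unternehmen")
        hasISIN = hasISIN or (key.lower() == "isin" and value != "")
        oldthing = thing
        buf.append(line)
    # Write out the last thing, which the loop itself never flushes.
    if isCompany and hasISIN:
        matches += 1
        writer.writelines(buf)

print("finished after", perf_counter() - start, "seconds;", matches, "matches.")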
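The grouping logic of the script above (collect all consecutive lines of one thing, then keep them only if the thing is a company with an ISIN) can also be sketched with `itertools.groupby`. This is an illustrative variant, not the script that was timed. The function name `filter_companies` and the sample rows are made up.

```python
import io
from itertools import groupby


def filter_companies(lines):
    """Yield the lines of every thing that uses the company infobox and has an ISIN."""
    def thing_of(line):
        # The first tab-separated field identifies the thing.
        return line.split("\t", 1)[0].rstrip("\n")

    # groupby only merges consecutive lines, which matches the sorted input file.
    for _, group in groupby(lines, key=thing_of):
        block = list(group)
        is_company = has_isin = False
        for line in block:
            row = line.rstrip("\n").split("\t")
            key = row[1].lower() if len(row) > 1 else ""
            value = row[2] if len(row) > 2 else ""
            if key == "wikipageusestemplate" and value.lower() == "template:infobox_unternehmen":
                is_company = True
            if key == "isin" and value != "":
                has_isin = True
        if is_company and has_isin:
            yield from block


# Tiny made-up sample, not real dbpedia data.
sample = io.StringIO(
    "A\twikiPageUsesTemplate\tTemplate:Infobox_Unternehmen\n"
    "A\tisin\tXX0000000001\n"
    "B\twikiPageUsesTemplate\tTemplate:Infobox_Unternehmen\n"
    "C\tisin\tXX0000000002\n"
)
result = list(filter_companies(sample))
print(len(result))
```

Only thing A has both the company template and an ISIN, so only its two lines are kept. Because the function is a generator, it can be fed an open file and its output passed to `writer.writelines()` without holding the whole file in memory.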

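R script (not an exact equivalent: it returns a data frame instead of writing a CSV file, and it does not check for the ISIN):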

infile <- "infobox_de.csv"
maxLines <- 65000  # lines read per chunk

reader <- file(infile, "r")
# Matching lines are collected in the character vector queryRes.
writer <- textConnection("queryRes", open = "w", local = TRUE)
# writeLines() appends the newline itself.
writeLines("thing\tkey\tvalue\tetc", writer)

oldthing <- ""
hasInfobox <- FALSE
lineNumber <- 0
matches <- 0
key <- ""
thing <- ""
value <- ""  # initialised so that a short first row cannot fail
buf <- c()

repeat {
  lines <- readLines(reader, maxLines)
  if (length(lines) == 0) break
  for (line in lines) {
    lineNumber <- lineNumber + 1
    row <- unlist(strsplit(line, "\t"))
    if (length(row) > 0) thing <- row[1]
    if (length(row) > 1) key <- row[2]
    if (length(row) > 2) value <- row[3]
    if ((length(row) > 0) && (oldthing != thing)) {
      # A new thing starts: write out the previous one if it matched.
      if (hasInfobox) {
        matches <- matches + 1
        writeLines(buf, writer)
      }
      buf <- c()
      hasInfobox <- FALSE
    }
    hasInfobox <- hasInfobox || ((tolower(key) == "wikipageusestemplate") && (tolower(value) == "template:infobox_unternehmen"))
    oldthing <- thing
    buf <- c(buf, line)
  }
}
# Write out the last thing, which the loop itself never flushes.
if (hasInfobox) {
  matches <- matches + 1
  writeLines(buf, writer)
}
close(reader)
close(writer)
readRes <- textConnection(queryRes, "r")
result <- read.csv(readRes, sep = "\t", stringsAsFactors = FALSE)
close(readRes)
result

, , , readLines 65000 . , , 500 . .

+2

, . R IO ( ), . , , . IO, , , , .

+1

" "? , , .., , , bash .., , , , .., , , Python R., , R , , , .

+1

, , , . . , .

R, , ( ). R - , , , , python - .

0

R . :

  • data.frames (, )

Search for "R time optimization" and "R profiling", and you will find many resources to help you.

0

Source: https://habr.com/ru/post/1744071/

