Uploading a CSV file with a fixed format

I have a CSV template that my users download, fill in with data, and upload back to my site.

Is there a better way to ensure the data loads successfully than my snippet below? What else should I check? Would using a dialect be better?

    def import_residents(resident_file):  # note: "import" is a reserved word and cannot be a function name
        try:
            file_path = resident_file.file.path
            reader = csv.reader(open(file_path, 'rU'), delimiter=',', quotechar='"')
            headerline = reader.next()
            for row in reader:
                try:
                    # do stuff
                    pass
                except Exception, e:
                    print e
        except Exception, e:
            print e

An example of a problem I encounter: when a user opens the file, enters data, and saves it, the delimiter changes from , to ; . How can I cover the different separators a file can end up with because it was saved in different programs, for example Excel on Windows, Excel on Mac, OpenOffice on Mac, OpenOffice on Linux, and so on?
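One possible way to cope with the shifting delimiter is to try a few likely delimiters and accept the first one that yields the expected column count on every row. This is only a sketch (the function name and the candidate list are illustrative, written in Python 3 syntax, not the question's Python 2):

```python
import csv
import io

def read_rows(text, expected_fields, candidates=(',', ';', '\t')):
    """Try each candidate delimiter; accept the first one that yields
    the expected field count on every row."""
    for delim in candidates:
        rows = list(csv.reader(io.StringIO(text), delimiter=delim, quotechar='"'))
        if rows and all(len(row) == expected_fields for row in rows):
            return rows
    raise ValueError('no candidate delimiter produced %d fields per row'
                     % expected_fields)
```

Because the expected field count is known (five columns in the template), a wrong delimiter is rejected immediately instead of producing garbage rows.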

Another example of a problem: when a user tries to copy and paste data into the provided template, all hell breaks loose.

UPDATE: I now use the Sniffer class as suggested in one of the answers below, but it is still not foolproof.

UPDATED CODE SNIPPET

    def bulk_import_residents(condo, resident_file):
        """
        COL 1       COL 2      COL 3           COL 4        COL 5
        first_name  last_name  contact_number  unit_number  block_number
        """
        file_path = resident_file.file.path
        csvfile = open(file_path, 'rb')
        dialect = csv.Sniffer().sniff(csvfile.read(1024))
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect)
        headerline = reader.next()
        for row in reader:
            try:
                data = ResidentImportData()
                data.condo = condo
                data.file = resident_file
                data.first_name = row[0]
                data.last_name = row[1]
                data.contact_number = row[2]
                data.unit_number = row[3]
                data.block_number = row[4]
                data.save()
            except Exception, e:
                print '{0}'.format(e)
                raise Http404('Wrong template format')
9 answers

CSV is not a real format. The Sniffer class is not reliable, because it is in fact impossible to detect all of these dialects with 100% accuracy.

I think you will have to live with Sniffer working 90% of the time, collect the invalid input files it chokes on, analyze them, and extend Sniffer to catch them.


I totally agree with nfirvine (CSV IS NOT A FORMAT). Well, maybe not quite that harsh, but it is a minimal format at best, and a very loose one. If you rely on CSV you will often get burned, as it seems you are already experiencing.

I also agree with Mike Bynum - use something like XML.

But I understand that even when there is a better way, there is often a pragmatic one, and you may have good reasons to stick with your current format. So here are two routes.

Route 1: CSV

I have gone this route myself. My users upload data daily (a couple of thousand records). Given the frequency and the number of updated records, I really wish I had gone the second route: when you process a significant amount of data or updates, reliable data validation is a huge time saver.

That said, when you are stuck with CSV, I suggest you do the following:

  • Provide your users with a good, general definition of CSV, namely RFC 4180. Make sure your client understands what you expect from their file:
    • A header row.
    • Commas as separators.
    • Quotes around any field containing commas.
  • Along with this definition, give your users a sample CSV (which it sounds like you already do!). Explain that you cannot process a CSV file that does not match your definition.
  • Make sure the text file's line endings are what you expect before importing it; see converting to/from Unix/Windows line endings.
  • In your CSV parser, adopt a fail-fast methodology and make sure you have a mechanism to notify your users when a CSV file does not meet the standard you expect. Provide them with as much information as possible (include the details of the exception... if not for them, at least for you).
  • The problem you encountered with a single client's file suggests you might want to point your clients toward well-known editors. Excel should work, or OpenOffice. I suggest spreadsheet applications because they export CSV well and take care of quoting, etc.
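The fail-fast validation step above could be sketched like this (a minimal illustration in Python 3; the header names come from the question's template, the function name is made up):

```python
import csv
import io

EXPECTED_HEADER = ['first_name', 'last_name', 'contact_number',
                   'unit_number', 'block_number']

def validate_csv(text):
    """Return a list of human-readable problems; an empty list means OK."""
    reader = csv.reader(io.StringIO(text))
    errors = []
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        errors.append('bad header: expected %r, got %r' % (EXPECTED_HEADER, header))
    for lineno, row in enumerate(reader, start=2):
        if len(row) != len(EXPECTED_HEADER):
            errors.append('line %d: expected %d fields, got %d'
                          % (lineno, len(EXPECTED_HEADER), len(row)))
    return errors
```

Collecting every error instead of stopping at the first gives the user one complete report per upload.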

Not that you cannot support some oddities, but in general you want to avoid them, and you want to avoid accidentally importing badly formed data.

Route 2: XML

I suggest you do the following:

  • Define the data your users should import using a schema definition (XSD). I like to keep the W3C definitions on hand, but there are good tutorials to help you write your own XSD.

  • Give your users a sample XML file to fill in, and suggest an editor. There are excellent commercial ones and reasonable free ones.

  • You can then read your users' XML files knowing that if a file validates, it is good. In that case your users can validate the file themselves before sending it to you.
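As a concrete illustration, a minimal XSD for the resident data from the question might look like this (the element names are hypothetical, chosen only to match the question's five columns):

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="residents">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="resident" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="first_name" type="xs:string"/>
              <xs:element name="last_name" type="xs:string"/>
              <xs:element name="contact_number" type="xs:string"/>
              <xs:element name="unit_number" type="xs:string"/>
              <xs:element name="block_number" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

A file that validates against a schema like this is guaranteed to have every field present, which is exactly the property CSV cannot give you.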


Ah, I just found the Sniffer class.

    csvfile = open("example.csv", "rb")
    dialect = csv.Sniffer().sniff(csvfile.read(1024))
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    # ... process CSV file contents here ...

Take a look at csv.Sniffer to help you guess which CSV dialect the file uses.

Once you have a guess from the sniffer, try actually parsing the file with that dialect. If there are any data properties you can rely on (for example, a fixed number of fields), apply them to each parsed record as a sanity check.
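A sketch of that combination: sniff the dialect, then enforce a known field count while parsing. The function name is illustrative (Python 3); restricting the sniffer's candidate delimiters, as `csv.Sniffer.sniff` allows, makes the guess considerably more reliable:

```python
import csv
import io

def parse_with_check(text, expected_fields):
    # Restricting the candidate delimiters makes the sniff more reliable.
    dialect = csv.Sniffer().sniff(text[:1024], delimiters=',;\t')
    rows = []
    for lineno, row in enumerate(csv.reader(io.StringIO(text), dialect), start=1):
        if len(row) != expected_fields:
            raise ValueError('line %d: expected %d fields, got %d'
                             % (lineno, expected_fields, len(row)))
        rows.append(row)
    return rows
```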

You can also make the upload a two-step process. First upload the file and sniff the dialect. Then show the user what a few rows of data look like after parsing, and give the user the chance to override the dialect settings if the guess is wrong. Process the CSV only after confirmation. (Excel's import wizard uses this multi-step approach.)


I suggest this method, which looks for characters occurring n-1 times per line (where n is the number of columns you expect). It can return the first plausible answer or check the whole file.

    from collections import Counter

    def snif_sep(txt, nbcol, force_all=False):
        pseps = None
        for line in txt.split('\n'):
            if line:
                psep = [k for k, v in Counter(line).items() if v == nbcol - 1]
                if pseps is None:
                    pseps = set(psep)
                else:
                    pseps.intersection_update(psep)
                if len(pseps) == 1 and not force_all:
                    return pseps.pop()
                if len(pseps) == 0:
                    return None
        if len(pseps) == 1:
            return pseps.pop()

Have you considered using an XML format? Excel has an XML format that can be easier to parse and that opens directly in Excel.

You can also create your own XML format.

http://msdn.microsoft.com/en-us/library/aa140066(office.10).aspx

    <?xml version="1.0"?>
    <?mso-application progid="Excel.Sheet"?>
    <Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
              xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
      <Styles>
        <Style ss:ID="sBold">
          <Font ss:Bold="1"/>
        </Style>
        <Style ss:ID="sDate">
          <NumberFormat ss:Format="General Date"/>
        </Style>
      </Styles>
      <Worksheet ss:Name="2100Q is 2009-Nov-11_17_43_13 ">
        <Table>
          <Row>
            <Cell ss:StyleID="sBold"><Data ss:Type="String">Date &amp; Time</Data></Cell>
            <Cell ss:StyleID="sBold"><Data ss:Type="String">Operator ID</Data></Cell>
            <Cell ss:StyleID="sBold"><Data ss:Type="String">Reading Mode</Data></Cell>
            <Cell ss:StyleID="sBold"><Data ss:Type="String">Sample ID</Data></Cell>
            <Cell ss:StyleID="sBold"><Data ss:Type="String">Sample Number</Data></Cell>
            <Cell ss:StyleID="sBold"><Data ss:Type="String">Result</Data></Cell>
            <Cell ss:StyleID="sBold"><Data ss:Type="String">Unit</Data></Cell>
            <Cell ss:StyleID="sBold"><Data ss:Type="String">Notice</Data></Cell>
            <Cell ss:StyleID="sBold"><Data ss:Type="String">Cal.Curve</Data></Cell>
            <Cell ss:StyleID="sBold"><Data ss:Type="String">Cal.Time</Data></Cell>
            <Cell ss:StyleID="sBold"><Data ss:Type="String">Cal.Status</Data></Cell>
            <Cell ss:StyleID="sBold"><Data ss:Type="String">Std. 1 Nom. Value</Data></Cell>
            <Cell ss:StyleID="sBold"><Data ss:Type="String">Std. 1 Act. Value</Data></Cell>
            <Cell ss:StyleID="sBold"><Data ss:Type="String">Std. 2 Nom. Value</Data></Cell>
            <Cell ss:StyleID="sBold"><Data ss:Type="String">Std. 2 Act. Value</Data></Cell>
            <Cell ss:StyleID="sBold"><Data ss:Type="String">Std. 3 Nom. Value</Data></Cell>
            <Cell ss:StyleID="sBold"><Data ss:Type="String">Std. 3 Act. Value</Data></Cell>
            <Cell ss:StyleID="sBold"><Data ss:Type="String">Std. 4 Nom. Value</Data></Cell>
            <Cell ss:StyleID="sBold"><Data ss:Type="String">Std. 4 Act. Value</Data></Cell>
          </Row>
          <Row>
            <Cell ss:StyleID="sDate"><Data ss:Type="DateTime">2009-11-10T11:23:30</Data></Cell>
            <Cell><Data ss:Type="String">BARBARA</Data></Cell>
            <Cell><Data ss:Type="String">Normal</Data></Cell>
            <Cell><Data ss:Type="String">ABC-abc-1234</Data></Cell>
            <Cell><Data ss:Type="Number">001</Data></Cell>
            <Cell><Data ss:Type="Number">1.01</Data></Cell>
            <Cell><Data ss:Type="String">FNU</Data></Cell>
            <Cell><Data ss:Type="String"/></Cell>
            <Cell><Data ss:Type="String">StablCal</Data></Cell>
            <Cell ss:StyleID="sDate"><Data ss:Type="DateTime">2009-11-10T10:22:06</Data></Cell>
            <Cell><Data ss:Type="String">OK</Data></Cell>
          </Row>
          <Row>
            <Cell ss:StyleID="sDate"><Data ss:Type="DateTime">2009-11-10T10:24:15</Data></Cell>
            <Cell><Data ss:Type="String"/></Cell>
            <Cell><Data ss:Type="String">Cal.Verification</Data></Cell>
            <Cell><Data ss:Type="String"/></Cell>
            <Cell><Data ss:Type="String"/></Cell>
            <Cell><Data ss:Type="Number">1.01</Data></Cell>
            <Cell><Data ss:Type="String">FNU</Data></Cell>
            <Cell><Data ss:Type="String">Verify Cal: Passed</Data></Cell>
            <Cell><Data ss:Type="String">StablCal</Data></Cell>
            <Cell ss:StyleID="sDate"><Data ss:Type="DateTime">2009-11-10T10:22:06</Data></Cell>
            <Cell><Data ss:Type="String">OK</Data></Cell>
          </Row>
        </Table>
      </Worksheet>
    </Workbook>

Have you considered reading a TAB-delimited file instead? It is easy to read and write with all the software you mentioned, and it has given me far fewer problems than CSV.

That said, an enthusiastic +1 for the idea of having users edit the file in a well-known online editor. Google Docs, Zoho, etc. offer shared files and data export, which puts you in control of the format and simplifies parsing.

If you go with TSV, be sure to clean the data by checking for strings wrapped in quotes and stripping the quotes. You can always use .strip()...
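Reading TSV with the csv module and cleaning up stray quotes and whitespace might look like this (a sketch in Python 3; the sample data is made up):

```python
import csv
import io

tsv_text = 'first\tlast\tphone\nJohn\t"Doe"\t 555-1234 \n'

# quotechar handles the "Doe" field; strip() removes stray whitespace.
reader = csv.reader(io.StringIO(tsv_text), delimiter='\t', quotechar='"')
rows = [[field.strip() for field in row] for row in reader]
```

Since literal tabs rarely appear inside spreadsheet cells, TSV avoids most of the quoting ambiguity that plagues comma-separated files.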


Honestly, if users have files in different formats, the easiest solution would be to give them a drop-down menu that lets them choose which program they use. Then run a process designed specifically for that program.

It is impossible to create one process that covers every possible formatting variation. But by splitting it up this way you can add formats as needed, and it simplifies maintenance for you.
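Such a per-program dispatch can be as simple as a dictionary of dialect settings keyed by the drop-down value. All the names here are hypothetical, just to illustrate the idea:

```python
import csv

# Hypothetical drop-down values mapped to csv.reader keyword arguments.
PROGRAM_DIALECTS = {
    'excel-windows': dict(delimiter=',', quotechar='"'),
    'excel-mac': dict(delimiter=';', quotechar='"'),
    'openoffice-tsv': dict(delimiter='\t', quotechar='"'),
}

def make_reader(lines, program):
    """Build a csv.reader configured for the program the user selected."""
    return csv.reader(lines, **PROGRAM_DIALECTS[program])
```

Adding support for a new program is then one dictionary entry, not a parser rewrite.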


Tools that let you view or import CSV files all face this common problem: database import tools, Excel, OpenOffice, and so on. I know SOFA was written in Python and allows importing CSV.

All of these tools have a data preview so the user can check that it looks right. At the very least, if the preview looks wrong, they can change the CSV delimiter to fix it. The tool they used to create the CSV file should be consistent throughout, so if the preview looks right the rest of the file is probably fine, apart from complex, rare cases where data is quoted or contains embedded delimiters.

If the file is not too large, try building the set of all characters that appear in it which are not a-z or 0-9. Then make sure your preview includes a line for each of those characters. If that part of the preview looks garbled, the user can change the quoting settings. This is a bit of extra work, but it is what makes a good previewer: you want the preview to show lines that are representative, including the lines you would otherwise have skipped.
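A sketch of that idea in Python 3 (the function names are mine): collect the non-alphanumeric characters, then pick preview lines so each one is represented at least once.

```python
import re

def unusual_chars(text):
    """Characters that are not letters, digits, or whitespace."""
    return set(re.findall(r'[^A-Za-z0-9\s]', text))

def preview_lines(text, limit=10):
    """Pick up to `limit` lines so every unusual character appears at least once."""
    wanted = unusual_chars(text)
    chosen, seen = [], set()
    for line in text.split('\n'):
        hits = set(line) & wanted
        if hits - seen:
            chosen.append(line)
            seen |= hits
        if len(chosen) >= limit or seen == wanted:
            break
    return chosen
```

This keeps the preview short while still surfacing every delimiter or quote character the user might need to correct.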

If a preview is not possible, may God be with you.


Source: https://habr.com/ru/post/906413/

