Import a large CSV file into MySQL

I am trying to import a CSV file into a MySQL table. Currently I have a script that processes the file line by line, because I need to hash one identifier in combination with another identifier, and also reformat the date into MySQL's format.

The CSV file has more columns than I am importing. Would it be easier to just import all of the columns?

I read about LOAD DATA INFILE (http://dev.mysql.com/doc/refman/5.1/en/load-data.html), but I'm wondering how I can use it while still hashing the identifiers and formatting the date, without processing the file line by line. My current script takes too long and causes website performance issues while it runs.

Here is what I have:

$url = 'http://www.example.com/directory/file.csv';
if (($handle = fopen($url, "r")) !== FALSE) {
    fgetcsv($handle, 1000, ","); // skip the header row
    while (($data = fgetcsv($handle, 1000, ",")) !== FALSE) {
        $EvID = $data[0];
        $Ev = $data[1];
        $PerID = $data[2];
        $Per = $data[3];
        $VName = $data[4];
        $VID = $data[5];
        $VSA = $data[6];
        $DateTime = $data[7];
        $PCatID = $data[8];
        $PCat = $data[9];
        $CCatID = $data[10];
        $CCat = $data[11];
        $GCatID = $data[12];
        $GCat = $data[13];
        $City = $data[14];
        $State = $data[15];
        $StateID = $data[16];
        $Country = $data[17];
        $CountryID = $data[18];
        $Zip = $data[19];
        $TYN = $data[20];
        $IMAGEURL = $data[21];
        $URLLink = $data[22];

        // convert the incoming date to MySQL DATETIME format (Y-m-d H:i:s)
        $data[7] = strtotime($data[7]);
        $data[7] = date("Y-m-d H:i:s", $data[7]);

        if ((($PCatID == '2') && (($CountryID == '217') || ($CountryID == '38')))
            || (($GCatID == '16') || ($GCatID == '19') || ($GCatID == '30') || ($GCatID == '32'))) {
            if (!mysql_query("INSERT IGNORE INTO TNDB_CSV2
                (id, EvID, Event, PerID, Per, VName, VID, VSA, DateTime, PCatID, PCat, CCatID, CCat, GCatID, GCat, City, State, StateID, Country, CountryID, Zip, TYN, IMAGEURL)
                VALUES ('" . md5($EvID . $PerID) . "','" . addslashes($data[0]) . "','" . addslashes($data[1]) . "','" . addslashes($data[2]) . "','" . addslashes($data[3]) . "','" . addslashes($data[4]) . "',
                '" . addslashes($data[5]) . "','" . addslashes($data[6]) . "','" . addslashes($data[7]) . "','" . addslashes($data[8]) . "','" . addslashes($data[9]) . "',
                '" . addslashes($data[10]) . "','" . addslashes($data[11]) . "','" . addslashes($data[12]) . "','" . addslashes($data[13]) . "','" . addslashes($data[14]) . "',
                '" . addslashes($data[15]) . "','" . addslashes($data[16]) . "','" . addslashes($data[17]) . "','" . addslashes($data[18]) . "','" . addslashes($data[19]) . "',
                '" . addslashes($data[20]) . "','" . addslashes($data[21]) . "')")) {
                exit("<br>" . mysql_error());
            }
        }
    }
    fclose($handle);
}

Any help is always appreciated. Thanks in advance.

3 answers

Try optimizing your script first. Never run one query per row on import unless you have no other choice; the per-query network overhead can be a killer.

Try something like this (untested and written straight into the answer box, so check the brackets etc. for correctness):

$url = 'http://www.example.com/directory/file.csv';
if (($handle = fopen($url, "r")) !== FALSE) {
    fgetcsv($handle, 1000, ","); // skip the header row
    $imports = array();
    while (($data = fgetcsv($handle, 1000, ",")) !== FALSE) {
        $EvID = $data[0];
        $Ev = $data[1];
        $PerID = $data[2];
        $Per = $data[3];
        $VName = $data[4];
        $VID = $data[5];
        $VSA = $data[6];
        $DateTime = $data[7];
        $PCatID = $data[8];
        $PCat = $data[9];
        $CCatID = $data[10];
        $CCat = $data[11];
        $GCatID = $data[12];
        $GCat = $data[13];
        $City = $data[14];
        $State = $data[15];
        $StateID = $data[16];
        $Country = $data[17];
        $CountryID = $data[18];
        $Zip = $data[19];
        $TYN = $data[20];
        $IMAGEURL = $data[21];
        $URLLink = $data[22];

        // convert the incoming date to MySQL DATETIME format (Y-m-d H:i:s)
        $data[7] = strtotime($data[7]);
        $data[7] = date("Y-m-d H:i:s", $data[7]);

        if ((($PCatID == '2') && (($CountryID == '217') || ($CountryID == '38')))
            || (($GCatID == '16') || ($GCatID == '19') || ($GCatID == '30') || ($GCatID == '32'))) {
            // queue the row instead of inserting it one at a time
            $imports[] = "('" . md5($EvID . $PerID) . "','" . addslashes($data[0]) . "','" . addslashes($data[1]) . "','" . addslashes($data[2]) . "','" . addslashes($data[3]) . "','" . addslashes($data[4]) . "',
                '" . addslashes($data[5]) . "','" . addslashes($data[6]) . "','" . addslashes($data[7]) . "','" . addslashes($data[8]) . "','" . addslashes($data[9]) . "',
                '" . addslashes($data[10]) . "','" . addslashes($data[11]) . "','" . addslashes($data[12]) . "','" . addslashes($data[13]) . "','" . addslashes($data[14]) . "',
                '" . addslashes($data[15]) . "','" . addslashes($data[16]) . "','" . addslashes($data[17]) . "','" . addslashes($data[18]) . "','" . addslashes($data[19]) . "',
                '" . addslashes($data[20]) . "','" . addslashes($data[21]) . "')";
        }
    }

    // send the queued rows as multi-row INSERTs, 100 rows per statement
    $importarrays = array_chunk($imports, 100);
    foreach ($importarrays as $arr) {
        if (!mysql_query("INSERT IGNORE INTO TNDB_CSV2 (id, EvID, Event, PerID, Per, VName, VID, VSA, DateTime, PCatID, PCat, CCatID, CCat, GCatID, GCat, City, State, StateID, Country, CountryID, Zip, TYN, IMAGEURL) VALUES " . implode(',', $arr))) {
            die("error: " . mysql_error());
        }
    }
    fclose($handle);
}

Play with the chunk size passed to array_chunk(): too large and the query can run into problems, such as exceeding the maximum allowed packet length (yes, my.cnf has a configurable limit); too small and you add unnecessary overhead.
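If you do hit that limit, you can check it (and, with enough privileges, raise it) from SQL. Values here are illustrative only:

SHOW VARIABLES LIKE 'max_allowed_packet';          -- current per-statement packet limit
SET GLOBAL max_allowed_packet = 16 * 1024 * 1024;  -- requires SUPER; takes effect for new connections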

You could also drop the intermediate $data[x] variables, since copying them is wasted effort given how small the script is; just use $data[x] directly in your query. It won't give a massive improvement, but depending on your import size it may save a little.

The next step would be to use low-priority inserts/updates. Check this out to get started: How do I prioritize certain queries?
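For reference, LOW_PRIORITY is just a keyword on the statement itself. A minimal sketch against the table from the question, with placeholder values (it only has an effect on engines with table-level locking, such as MyISAM, where the insert waits until no clients are reading the table):

INSERT LOW_PRIORITY IGNORE INTO TNDB_CSV2 (id, EvID, Event)
VALUES ('some-hash', '123', 'Some event');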

After all of that, you might look into tuning your MySQL configuration, but that one is really for Google to explain, since the best settings differ for everyone and their unique situations.

Edit: One other thing I have done before: if you have many keys that are not required during the import, you can temporarily drop them and add them back when the script is done. This can bring good time improvements too, but since you are working on a live database there are risks if you go down this route.
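As a sketch of that idea: on a MyISAM table you can get a similar effect without literally dropping anything, by deferring maintenance of the non-unique indexes until the import finishes (DISABLE KEYS is a variant of the drop-and-re-add approach, not the exact method described above):

ALTER TABLE TNDB_CSV2 DISABLE KEYS;  -- stop updating non-unique indexes during the import
-- ... run the batched INSERTs here ...
ALTER TABLE TNDB_CSV2 ENABLE KEYS;   -- rebuild those indexes in one pass afterwards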


Try batch insertion using the implode() function. For further explanation and an example, see this thread: insert multiple rows through php array in mysql.
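In miniature, the technique looks like this ($csvRows and the two-column list are hypothetical stand-ins for illustration, not the asker's real schema):

// build one "(...)" group per row, then send a single multi-row INSERT
$values = array();
foreach ($csvRows as $row) {
    $values[] = "('" . addslashes($row[0]) . "','" . addslashes($row[1]) . "')";
}
if (!mysql_query("INSERT IGNORE INTO TNDB_CSV2 (EvID, Event) VALUES " . implode(',', $values))) {
    exit(mysql_error());
}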


I used this query:

 $sql = " LOAD DATA LOCAL INFILE 'uploads/{$fileName}' REPLACE INTO TABLE `order` FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\r\n' IGNORE 1 LINES (product_id, `date`, quantity) "; 

It is super fast.
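For the hashing and date formatting from the original question, LOAD DATA INFILE can also transform columns on the way in, by reading them into user variables and rewriting them with a SET clause. A hedged sketch against a trimmed-down column list (the incoming date format '%m/%d/%Y %H:%i' is an assumption, and the row filtering from the question would still need a staging table or a post-load DELETE):

$sql = "
    LOAD DATA LOCAL INFILE 'uploads/{$fileName}'
    IGNORE INTO TABLE TNDB_CSV2
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\r\n'
    IGNORE 1 LINES
    (@EvID, Event, @PerID, Per, @DateTime)
    SET EvID     = @EvID,
        PerID    = @PerID,
        -- hash the two identifiers together, like md5() in the PHP script
        id       = MD5(CONCAT(@EvID, @PerID)),
        -- rewrite the date into MySQL's DATETIME format
        DateTime = STR_TO_DATE(@DateTime, '%m/%d/%Y %H:%i')
";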

