Varchar value correlation

Is there a built-in way in Oracle 11 to check the correlation of values in a varchar2 column? For example, given a simple table such as:

    MEAL_NUM INGREDIENT
    -------- ----------
           1 BEEF
           1 CHEESE
           1 PASTA
           2 CHEESE
           2 PASTA
           2 FISH
           3 CHEESE
           3 CHICKEN

I want a numerical indication that, based on MEAL_NUM, CHEESE is paired mostly with PASTA and to a lesser degree with BEEF, CHICKEN and FISH.

My first inclination is to use the CORR function and convert the strings to numbers, perhaps either by enumerating them in advance or by taking the rownum from a select of the unique values.

Any suggestions on this?

+6
4 answers

You don’t want to use CORR. If you create a “food number” and assign Beef = 1, Chicken = 2 and Pasta = 3, then the correlation coefficient will tell you whether more cheese correlates with a higher food number. But a higher or lower food number does not mean anything, since you made the numbers up. So do not use CORR unless your foods are actually ordered in some way, like numbers.

Statisticians talk about this in terms of levels of measurement. In that terminology, MEAL_NUM is a nominal measure (or perhaps an ordinal measure if the meals happened in order), but either way it is a really bad idea to use correlation coefficients on it.
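To make the point about made-up numbers concrete, here is a small Python sketch (not Oracle code). It computes a plain Pearson coefficient between MEAL_NUM and a numeric code for each ingredient, using the question's rows and two arbitrary codings. The codings and the helper names `pearson` and `corr_for_coding` are invented for illustration only.

```python
from math import sqrt

# Sample rows from the question: (meal_num, ingredient).
MEALS = [
    (1, "BEEF"), (1, "CHEESE"), (1, "PASTA"),
    (2, "CHEESE"), (2, "PASTA"), (2, "FISH"),
    (3, "CHEESE"), (3, "CHICKEN"),
]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def corr_for_coding(order):
    """Number the ingredients 1..n in the given (arbitrary) order, then correlate."""
    code = {name: i for i, name in enumerate(order, start=1)}
    meal_nums = [meal for meal, _ in MEALS]
    codes = [code[ingredient] for _, ingredient in MEALS]
    return pearson(meal_nums, codes)


# Two made-up codings of the same five ingredients (assumptions, not from the thread).
r_a = corr_for_coding(["PASTA", "CHEESE", "BEEF", "FISH", "CHICKEN"])
r_b = corr_for_coding(["CHICKEN", "FISH", "CHEESE", "PASTA", "BEEF"])
print(f"coding A: {r_a:+.3f}")
print(f"coding B: {r_b:+.3f}")
```

The same rows give a positive coefficient under one coding and a negative one under the other, so the result reflects the labels chosen rather than anything about the meals.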

You probably want to answer something like, "What percentage of meals with beef also contain cheese?" The following returns, for each ingredient, the number of meals containing it and the number of meals containing both it and cheese. The trick is that COUNT only counts non-null values.

    -- "meals" is a placeholder; substitute your own table name
    SELECT Other.Ingredient,
           COUNT(*) AS TotalMeals,
           COUNT(Cheese.Ingredient) AS CheesyMeals
    FROM meals Other
    LEFT JOIN meals Cheese
      ON (Cheese.Ingredient = 'CHEESE' AND Cheese.Meal_Num = Other.Meal_Num)
    GROUP BY Other.Ingredient

Warning: this returns incorrect results if an ingredient is listed twice in the same meal.
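The behavior this query relies on is that COUNT(column) skips NULLs, while COUNT(*) counts every row produced by the LEFT JOIN. Here is a rough way to see it without an Oracle instance, using Python's built-in sqlite3 module as a stand-in (SQLite is not Oracle, but this behavior is standard SQL). The table name `meals` and the fourth meal are assumptions added so that at least one meal has no cheese.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meals (meal_num INTEGER, ingredient TEXT)")
conn.executemany(
    "INSERT INTO meals VALUES (?, ?)",
    [
        (1, "BEEF"), (1, "CHEESE"), (1, "PASTA"),
        (2, "CHEESE"), (2, "PASTA"), (2, "FISH"),
        (3, "CHEESE"), (3, "CHICKEN"),
        # Hypothetical extra meal without cheese; NOT part of the question's data.
        (4, "BEEF"), (4, "RICE"),
    ],
)

QUERY = """
SELECT o.ingredient,
       COUNT(*)            AS total_meals,   -- counts every joined row
       COUNT(c.ingredient) AS cheesy_meals   -- skips rows where no cheese row matched
FROM meals o
LEFT JOIN meals c
       ON c.ingredient = 'CHEESE' AND c.meal_num = o.meal_num
GROUP BY o.ingredient
ORDER BY o.ingredient
"""

# Map each ingredient to (meals containing it, meals containing it and cheese).
counts = {name: (total, cheesy) for name, total, cheesy in conn.execute(QUERY)}
for name, (total, cheesy) in counts.items():
    print(f"{name:8} total={total} with_cheese={cheesy}")
```

For the cheese-free meal, the join leaves `c.ingredient` NULL, so BEEF is counted in two meals but only one cheesy meal.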

Edit: It turns out you are not specifically interested in cheese; you really want all pairs of “correlations”. So we can abstract away from “Cheese” and just call them the first and second ingredients. I have added a “PossibleScore” to this one, which tries to act like a percentage of meals but does not give a strong score if there are very few instances of the ingredient.

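    -- "meals" is a placeholder; substitute your own table name
    SELECT f.Ingredient AS FirstIngredient,
           s.Ingredient AS SecondIngredient,
           t.MealsWithFirst,
           COUNT(*) AS MealsWithBoth,
           COUNT(*) / (t.MealsWithFirst + 3) AS PossibleScore
    FROM meals f
    JOIN meals s
      ON s.Meal_Num = f.Meal_Num AND s.Ingredient <> f.Ingredient
    JOIN (SELECT Ingredient, COUNT(*) AS MealsWithFirst
          FROM meals GROUP BY Ingredient) t  -- meals per first ingredient
      ON t.Ingredient = f.Ingredient
    GROUP BY f.Ingredient, s.Ingredient, t.MealsWithFirst
    ORDER BY PossibleScore DESC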

Sorted by score, this should return:

    PASTA    CHEESE   2  2  0.400
    CHEESE   PASTA    3  2  0.333
    BEEF     CHEESE   1  1  0.250
    BEEF     PASTA    1  1  0.250
    FISH     CHEESE   1  1  0.250
    FISH     PASTA    1  1  0.250
    CHICKEN  CHEESE   1  1  0.250
    PASTA    BEEF     2  1  0.200
    PASTA    FISH     2  1  0.200
    CHEESE   BEEF     3  1  0.167
    CHEESE   FISH     3  1  0.167
    CHEESE   CHICKEN  3  1  0.167
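
The scoring idea (meals containing both ingredients, divided by meals containing the first ingredient plus 3) can be checked outside the database. This is a minimal Python sketch over the question's sample rows. The variable names are illustrative, and sets are used so that an ingredient listed twice in a meal would only count once.

```python
from collections import defaultdict
from itertools import permutations

# Sample rows from the question: (meal_num, ingredient).
MEALS = [
    (1, "BEEF"), (1, "CHEESE"), (1, "PASTA"),
    (2, "CHEESE"), (2, "PASTA"), (2, "FISH"),
    (3, "CHEESE"), (3, "CHICKEN"),
]

# Group ingredients per meal; a set ignores duplicate listings within a meal.
by_meal = defaultdict(set)
for meal_num, ingredient in MEALS:
    by_meal[meal_num].add(ingredient)

meals_with = defaultdict(int)       # ingredient -> number of meals containing it
meals_with_both = defaultdict(int)  # (first, second) -> number of meals containing both
for ingredients in by_meal.values():
    for ingredient in ingredients:
        meals_with[ingredient] += 1
    for first, second in permutations(ingredients, 2):
        meals_with_both[(first, second)] += 1

DAMPING = 3  # the "+ 3" from the query's denominator

scores = {
    (first, second): both / (meals_with[first] + DAMPING)
    for (first, second), both in meals_with_both.items()
}

# Print pairs from highest to lowest score.
for (first, second), score in sorted(scores.items(), key=lambda kv: -kv[1]):
    both = meals_with_both[(first, second)]
    print(f"{first:8} {second:8} {meals_with[first]} {both} {score:.3f}")
```

The constant in the denominator is what keeps ingredients seen in only one meal (BEEF, FISH, CHICKEN) from scoring a full 1.0 against everything they appear with.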
+3

Do a self-join to get all combinations of ingredients, then apply CORR to the two meal_num columns:

    SELECT t1.INGREDIENT, t2.INGREDIENT, CORR(t1.MEAL_NUM, t2.MEAL_NUM)
    FROM TheTable t1, TheTable t2
    WHERE t1.INGREDIENT < t2.INGREDIENT
    GROUP BY t1.INGREDIENT, t2.INGREDIENT

Should give you something like:

    BEEF    CHEESE  0.999
    BEEF    PASTA   0.998
    CHEESE  PASTA   0.977

UPDATE: As Chris points out, this will not work as is. I was hoping there might be some way to fudge a mapping from the ordinal meal_num to an interval scale (@Chris, thanks for the link). That may not be possible, in which case this answer will not help.

+2

Try DBMS_FREQUENT_ITEMSET:

    --Create sample data
    create table meals(meal_num number, ingredient varchar2(10));

    insert into meals
    select 1, 'BEEF' from dual union all
    select 1, 'CHEESE' from dual union all
    select 1, 'PASTA' from dual union all
    select 2, 'CHEESE' from dual union all
    select 2, 'PASTA' from dual union all
    select 2, 'FISH' from dual union all
    select 3, 'CHEESE' from dual union all
    select 3, 'CHICKEN' from dual;
    commit;

    --Create nested table type to hold results
    CREATE OR REPLACE TYPE fi_varchar_nt AS TABLE OF VARCHAR2(10);
    /

    --Find the items most frequently combined with CHEESE.
    select bt.setid, nt.column_value, support occurances_of_itemset
        ,length, total_tranx
    from
    (
        select cast(itemset as fi_varchar_nt) itemset, rownum setid
            ,support, length, total_tranx
        from table(dbms_frequent_itemset.fi_transactional(
            tranx_cursor => cursor(select meal_num, ingredient from meals),
            support_threshold => 0,
            itemset_length_min => 2,
            itemset_length_max => 2,
            including_items => cursor(select 'CHEESE' from dual),
            excluding_items => null))
    ) bt,
    table(bt.itemset) nt
    where column_value <> 'CHEESE'
    order by 3 desc;

         SETID COLUMN_VAL OCCURANCES_OF_ITEMSET     LENGTH TOTAL_TRANX
    ---------- ---------- --------------------- ---------- -----------
             4 PASTA                          2          2           3
             3 FISH                           1          2           3
             1 BEEF                           1          2           3
             2 CHICKEN                        1          2           3
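
For readers without access to the package, the numbers in that result can be reproduced with a brute-force count. The Python sketch below is not how DBMS_FREQUENT_ITEMSET works internally. It simply counts, for every two-item set, how many meals contain both items (the "support"), then keeps the sets that include CHEESE. All names in it are illustrative.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Sample rows from the question: (meal_num, ingredient).
MEALS = [
    (1, "BEEF"), (1, "CHEESE"), (1, "PASTA"),
    (2, "CHEESE"), (2, "PASTA"), (2, "FISH"),
    (3, "CHEESE"), (3, "CHICKEN"),
]

# One "transaction" per meal: the set of ingredients it contains.
transactions = defaultdict(set)
for meal_num, ingredient in MEALS:
    transactions[meal_num].add(ingredient)

# Support of each 2-item itemset = number of meals containing both items.
support = Counter()
for items in transactions.values():
    for pair in combinations(sorted(items), 2):
        support[pair] += 1

# Keep itemsets that include CHEESE and key them by the other item.
with_cheese = {
    (set(pair) - {"CHEESE"}).pop(): count
    for pair, count in support.items()
    if "CHEESE" in pair
}

# Highest support first, ties broken alphabetically.
for item, count in sorted(with_cheese.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{item:8} {count} of {len(transactions)} meals")
```

This exhaustive pair count is fine for a handful of rows. On large tables the number of candidate itemsets grows quickly, which is the situation the database package is meant for.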
+1

How about this query?

    -- "meals" is a placeholder; substitute your own table name
    select t1.INGREDIENT, count(*) a
    from meals t1,
         (select meal_num from meals where INGREDIENT = 'CHEESE') t2
    where t1.INGREDIENT <> 'CHEESE'
      and t1.meal_num = t2.meal_num
    group by t1.INGREDIENT;

The result should be the number of times each ingredient shares a meal_num with CHEESE.

0

Source: https://habr.com/ru/post/893621/

