Getting specific text into a vector in MATLAB

Possible duplicate:
find specific data from a text file in matlab

I already opened a text file called 'gos.txt' using the following code:

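 s = {};
 fid = fopen('gos.txt');
 tline = fgetl(fid);
 while ischar(tline)        % fgetl returns -1 (not a char) at end of file
     s = [s; tline];        % append the line to the cell array
     tline = fgetl(fid);
 end
 fclose(fid);               % release the file handle when done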

I got the following result:

 s =
     '[Term]'
     'id: GO:0008150'
     'name: biological_process'
     'namespace: biological_process'
     'alt_id: GO:0000004'
     'alt_id: GO:0007582'
     [1x243 char]
     [1x445 char]
     'subset: goslim_aspergillus'
     'subset: goslim_candida'
     'subset: goslim_yeast'
     'subset: gosubset_prok'
     'synonym: "biological process" EXACT []'
     'synonym: "biological process unknown" NARROW []'
     'synonym: "physiological process" EXACT []'
     'xref: Wikipedia:Biological_process'
     '[Term]'
     'id: GO:0016740'
     'name: transferase activity'
     'namespace: molecular_function'
     [1x326 char]
     'subset: goslim_aspergillus'
     'subset: goslim_candida'
     'subset: goslim_metagenomics'
     'subset: goslim_pir'
     'subset: goslim_plant'
     'subset: gosubset_prok'
     'xref: EC:2'
     'xref: Reactome:REACT_25050 "Molybdenum ion transfer onto molybdopterin, Homo sapiens"'
     '//is_a: GO:0003674 ! molecular_function'
     'is_a: GO:0008150 ! molecular_function (added by Zaid, To be Removed Later)'
     '//relationship: part_of GO:0008150 ! biological_process'
     '[Term]'
     'id: GO:0016787'
     'name: hydrolase activity'
     'namespace: molecular_function'
     [1x186 char]
     'subset: goslim_aspergillus'
     'subset: goslim_candida'
     'subset: goslim_metagenomics'
     'subset: goslim_plant'
     'subset: gosubset_prok'
     'xref: EC:3'
     '//is_a: GO:0003674 ! molecular_function'
     'is_a: GO:0016740 ! molecular_function (added by Zaid, to be removed later)'
     'relationship: part_of GO:0008150 ! biological_process'
     '[Term]'
     'id: GO:0006810'
     'name: transport'
     'namespace: biological_process'
     'alt_id: GO:0015457'
     'alt_id: GO:0015460'
     [1x255 char]
     'subset: goslim_aspergillus'
     'subset: goslim_candida'
     'synonym: "small molecule transport" NARROW []'
     'synonym: "solute:solute exchange" NARROW []'
     'synonym: "transport accessory protein activity" RELATED [GOC:mah]'
     'is_a: GO:0016787 ! biological_process'
     'relationship: part_of GO:0008150 ! biological_process'
     . . . .

The next step is to extract specific text and put it in a vector. For example, I want every line containing "id: GO:*******" collected in one vector, and every "is_a: GO:*******" collected in another. Note that I do not want the characters that follow the identifier on the same line.

2 answers

You can easily use regexp here - it works for cells:

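 % Parentheses keep the matches together as a cell array
 matching_lines = s(~cellfun('isempty', regexp(s, '^id: GO')));
 % Brace indexing expands to a comma-separated list, displaying each match
 matching_lines{:}

 ans =
 id: GO:0008150
 ans =
 id: GO:0016740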

This retrieves all lines starting with id: GO. The ~cellfun(...) expression on its own gives a logical 0/1 vector, where 1 means the corresponding string in s matches the pattern.

A similar line finds the lines that contain is_a: GO:. Cutting unwanted characters from the strings can also be done with regexp.

Extracting parts of the strings can be done using the 'tokens' regexp parameter:

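 tok = regexp(s, '^id: (GO.*)', 'tokens');
 idx = ~cellfun('isempty', tok);
 % Each matching entry is {{'GO:...'}}: unwrap the match, then the token.
 % 'UniformOutput', false is required because the results are strings.
 v = cellfun(@(x) x{1}{1}, tok(idx), 'UniformOutput', false);
 sprintf('%s ', v{:})

 ans =
 GO:0008150 GO:0016740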
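
The same token approach should also cover the is_a lines from the question. The following is a minimal sketch, not part of the original answer: it assumes every identifier has the form GO: followed by digits, and the pattern GO:\d+ as well as the names sample and is_a_ids are illustrative. Unlike GO.*, this pattern stops at the end of the identifier, so the trailing "! molecular_function ..." text is dropped.

```matlab
% Small sample standing in for the cell array s read from gos.txt
sample = {'id: GO:0016740'; ...
          '//is_a: GO:0003674 ! molecular_function'; ...
          'is_a: GO:0008150 ! molecular_function (added by Zaid, To be Removed Later)'; ...
          'is_a: GO:0016787 ! biological_process'};

% 'once' returns only the first match per line, so each non-empty entry
% of tok is a cell holding the single captured string
tok = regexp(sample, '^is_a: (GO:\d+)', 'tokens', 'once');
idx = ~cellfun('isempty', tok);

% Unwrap the captured identifier from each matching line
is_a_ids = cellfun(@(x) x{1}, tok(idx), 'UniformOutput', false);
```

Because of the ^ anchor, lines that are commented out with a leading // are skipped; drop the anchor if those should be included as well.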

Assuming you only want to find things at the beginning of a line, this is pretty simple:

 found = [];
 for i = 1:length(s)
     temp = s{i};
     % Compare only the first 7 characters (or fewer for short lines)
     if strcmp('id: GO:', temp(1:min(7,end)))
         found = [found i];
     end
 end

Now found is a vector containing the indices of all lines that start with id: GO:

I cannot try it in Matlab at the moment, but this should be correct.
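
The loop above can probably be collapsed into a single call, since strncmp accepts a cell array of strings. This is a hedged sketch rather than the answerer's method; the names sample and prefix are illustrative.

```matlab
% Small sample standing in for the cell array s read from gos.txt
sample = {'[Term]'; 'id: GO:0008150'; 'name: biological_process'; 'id: GO:0016740'};

prefix = 'id: GO:';
% strncmp compares the first numel(prefix) characters of every cell at once;
% strings shorter than the prefix compare as false
found = find(strncmp(sample, prefix, numel(prefix)));
```

Here found holds line indices, so sample(found) returns the matching lines themselves.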


Source: https://habr.com/ru/post/1441509/

