63 views (last 30 days)
Show older comments
George on 30 Aug 2024 at 15:35
Edited: dpb on 1 Sep 2024 at 17:48
Accepted Answer: Voss
Open in MATLAB Online
Below is an example of the type of sequences I'm dealing with. The sequences are truncated for readability. Some IDs correspond to one sequence only whereas others to more than one squence. The objective is to count the number of nucleotides (letters) per ID.
>AAEL001292::intron | Intron sequence
GTGGG
>AAEL001312::intron | Intron sequence
GTAAGAAC
>AAEL001320::intron | Intron sequence
GTAAGACCTACTCA
>AAEL001807::intron | Intron sequence
GTACT
>AAEL002085::intron | Intron sequence
GTTAGTTGTA
>AAEL002085::intron | Intron sequence
GTAAGT
>AAEL002085::intron | Intron sequence
GTT
>AAEL002085::intron | Intron sequence
GTACGA
>AAEL002633::intron | Intron sequence
GTAAGAAAAAT
I've managed to generate a table like:
AAEL001292 5
AAEL001312 8
AAEL001320 14
AAEL001807 5
AAEL002085 10
AAEL002085 6
AAEL002085 3
AAEL002085 6
AAEL002633 11
The desired output is:
AAEL001292 5
AAEL001312 8
AAEL001320 14
AAEL001807 5
AAEL002085 25 # sum of number of nuceotides under to ID AAEL002085
AAEL002633 11
Any suggestions will be most helpful.
1 Comment Show -1 older commentsHide -1 older comments
Show -1 older commentsHide -1 older comments
dpb on 30 Aug 2024 at 15:45
Direct link to this comment
https://support.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries#comment_3250074
Be much easier if you would attach a data file for folks to work with...as for hints, without a data file from which to see just what you're starting with, I'd venture using a table and @doc:rowfun or groupsummary even with grouping variables will give you the desired results directly.
Sign in to comment.
Sign in to answer this question.
Accepted Answer
Voss on 30 Aug 2024 at 16:58
Open in MATLAB Online
- file.txt
% this is what the attached file contains:
type file.txt
>AAEL001292::intron | Intron sequenceGTGGG>AAEL001312::intron | Intron sequenceGTAAGAAC>AAEL001320::intron | Intron sequenceGTAAGACCTACTCA>AAEL001807::intron | Intron sequenceGTACT>AAEL002085::intron | Intron sequenceGTTAGTTGTA>AAEL002085::intron | Intron sequenceGTAAGT>AAEL002085::intron | Intron sequenceGTT>AAEL002085::intron | Intron sequenceGTACGA>AAEL002633::intron | Intron sequenceGTAAGAAAAAT
data = readlines('file.txt');
ID = regexp(data(1:2:end),'>(.*?):','tokens','once');
ID = vertcat(ID{:});
introns = data(2:2:end);
[G,ID] = findgroups(ID);
F = @(x)sum(cellfun(@numel,x));
count = groupsummary(introns,G,F);
T = table(ID,count)
T = 6x2 table
ID count ____________ _____ "AAEL001292" 5 "AAEL001312" 8 "AAEL001320" 14 "AAEL001807" 5 "AAEL002085" 25 "AAEL002633" 11
0 Comments Show -2 older commentsHide -2 older comments
Show -2 older commentsHide -2 older comments
Sign in to comment.
More Answers (2)
dpb on 31 Aug 2024 at 14:03
Edited: dpb on 1 Sep 2024 at 17:48
Open in MATLAB Online
- grp1_introns.txt
I'll convert the comment to an Answer because I believe it is/will be part of the required solution and is probably easier to clean the data to match the calculation assumptions than to modify the calculations to match the varying number of rows per group ID.
ADDENDUM
Another way to approach the cleanup would be to compute the line number difference between the lines beginning with ">" -- if it is >2, then that group has one or more wrapped lines.
data = readlines('grp1_introns.txt');
data(strlength(data)==0)=[]; % remove blank lines, if any
data(1:12)
ans = 12x1 string array
">AAEL001292::intron | Intron sequence" "GTGGGTGATTGTGTTTTTTTGATTTTGATACAAGTTAACAACATTGTATTTTTCAG" ">AAEL001312::intron | Intron sequence" "GTAAGAACACATGCGTCGTCAATGCTTAACAAGATTTAATACAATTTGCGCATCTGTTTC" "AG" ">AAEL001320::intron | Intron sequence" "GTAAGACCTACTCATTAGATTTATAGACAGTGATGTAATACTAATTATGCATTCATTTTA" "G" ">AAEL001807::intron | Intron sequence" "GTACTTTTGGTTGACGGAAGTTTTAGAATGAGAATTCGTCTAATCGTTATATATATTTTA" "CAG" ">AAEL002085::intron | Intron sequence"
ix=find(startsWith(data,'>'));
dx=[diff(ix);0];
iy=find(dx>2);
whos ix dx iy
Name Size Bytes Class Attributes dx 458x1 3664 double ix 458x1 3664 double iy 290x1 2320 double
[ix(1:10) dx(1:10) iy(1:10)]
ans = 10x3
1 2 2 3 3 3 6 3 4 9 3 5 12 4 8 16 2 9 18 2 10 20 3 11 23 9 12 32 9 13
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
NOTA BENE: There are wrapped lines with more than two lines; the sequence at line 16 is three while that at line 32 is 8 since the differences are 4 and 9, respectfully.
One will need to iterate through and combine all those rows in the counting, either by catenating the lines into one long one and removing the wrapped lines so the file follows the assumption of only one line or by summing the lengths of the lines between groups.
Well, let's see if I can manage to keep the indexing straight without too much trouble...
markfordelete=[]; % accumulator for marked rows
for i=1:numel(iy) % iterate over the elements that are multiple lines
l1=ix(iy(i))+1; % the first row in the group past the header
l2=l1+dx(iy(i))-2; % the last row in group
try
data(l1)=join(data(l1:l2),''); % put those rows together, no spaces
markfordelete=cat(1,markfordelete,[l1+1:l2].'); % keep list of rows to delete
catch
e=lasterror; e.message % for debugging
[i iy(i) ix(iy(i)) dx(iy(i)) l1 l2]
data(l1-1:end)
return
end
end
data(markfordelete)=[];
At this point, you should be able to apply @Voss method...it seems to work with at least a cursory checking. You'll want to output to a file and inspect carefully to make sure, but I think it works as intended.
ix=find(startsWith(data,'>'));
height(ix)
ans = 458
any(diff(ix)~=2)
ans = logical
The quick sanity check that the updated file doesn't have any segments of more than one line and the number of segments is the same is agoodsign™
ADDENDUM
Interesting thought -- what's the length of the lines now that are all strung together?
[Lmax,iMax]=max(dx)
Lmax = 401
iMax = 57
L=strlength(data(2:2:end));
UL=unique(L);
numel(UL)
ans = 157
NL=histc(L,UL);
[UL NL]
ans = 157x2
13 1 14 1 50 1 51 3 52 3 53 20 54 11 55 13 56 16 57 32
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
[min(UL) max(UL)]
ans = 1x2
13 23955
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
subplot(2,1,1)
bar(UL,NL)
set(gca,'XScale','log')
xlim([UL(1) UL(end)])
subplot(2,1,2)
bar(UL,NL)
set(gca,'XScale','log')
xlim([UL(1)-1 200])
xticks([UL(1) 20:10:50 100 200])
sum(L)
ans = 444117
There are a couple of introns that have 400+ rows that adds up to string lengths of 20,000+. I dunno if this is correct or not, but opening the text file confirms that is the content of the uploaded file.
2 Comments Show NoneHide None
Show NoneHide None
George on 1 Sep 2024 at 16:17
Direct link to this comment
https://support.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries#comment_3251119
Grateful for your code. More than I expected.
dpb on 1 Sep 2024 at 16:52
Direct link to this comment
https://support.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries#comment_3251134
Edited: dpb on 1 Sep 2024 at 17:46
No problem; got interesting. Is the file really right that there are strings of 25,000+?
While the forum won't let accept more than one Answer, if the above actually is useful, you could cast a vote... :)
Sign in to comment.
George on 31 Aug 2024 at 12:16
- grp1_introns.txt
Thank you very much for the prompt reply. I tried what you suggested. I get the following error message.
count = groupsummary(introns,G,F);
Error using matlab.internal.math.parseGroupVars (line 129)
Grouping variables and data matrix must have same number of rows.
Error in groupsummary (line 180)
[groupingData,groupVars] = matlab.internal.math.parseGroupVars(groupVars,tableFlag,"groupsummary:Group",T);
I'm attaching the data file in case it helps.
Once again many thanks
2 Comments Show NoneHide None
Show NoneHide None
dpb on 31 Aug 2024 at 12:50
Direct link to this comment
https://support.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries#comment_3250654
Edited: dpb on 31 Aug 2024 at 13:02
Open in MATLAB Online
- grp1_introns.txt
data = readlines('grp1_introns.txt');
data(1:10,:)
ans = 10x1 string array
">AAEL001292::intron | Intron sequence" "GTGGGTGATTGTGTTTTTTTGATTTTGATACAAGTTAACAACATTGTATTTTTCAG" ">AAEL001312::intron | Intron sequence" "GTAAGAACACATGCGTCGTCAATGCTTAACAAGATTTAATACAATTTGCGCATCTGTTTC" "AG" ">AAEL001320::intron | Intron sequence" "GTAAGACCTACTCATTAGATTTATAGACAGTGATGTAATACTAATTATGCATTCATTTTA" "G" ">AAEL001807::intron | Intron sequence" "GTACTTTTGGTTGACGGAAGTTTTAGAATGAGAATTCGTCTAATCGTTATATATATTTTA"
The attached file seems to have a linefeed after a fixed-length record for the longer sequences -- is that real in the file or an artifact of how it was saved/uploaded?
That breaks the assumption @Voss made that there are two lines per section and in using findgroups it is found out when the number of rows don't match.
That assumption is also in @Star Strider's code, but his will still run; just those sequences in the second/wrapped line won't be counted.
I gather it must be true in the real files, too, if yours failed; in that case it will take some additional preprocessing to apply either solution.
Be sure you haven't inadvertently introduced that line break by having looked at the files in a word processing type editor such as Notepad that automatically word wrapped the longer lines (or, maybe even the MATLAB code editor might if the line length is set, I know it is sometimes quite annoying, too, with wanting to wrap comments at ends of somewhat longer code lines...
L=max(strlength(data)) % the maximum line length
ans = 60
ix=find(strlength(data)==L); % locate those long lines
iy=find(startsWith(data,'>')); % and the sequence header lines
iy=iy(iy>ix(1)); % keep only those after the first long line
for i=1:numel(ix)
if iy(i)-ix(i)>2 % wrapped line
data(ix,:)=join(data(ix:ix+1,:)); % put them back together
end
end
data(ix+1,:)=[]; % now remove the wrapped lines
George on 1 Sep 2024 at 16:16
Direct link to this comment
https://support.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries#comment_3251114
Many thanks for your input. Works very well.
Sign in to comment.
Sign in to answer this question.
See Also
Categories
AI, Data Science, and StatisticsText Analytics ToolboxText Data Preparation
Find more on Text Data Preparation in Help Center and File Exchange
Tags
- nucleotide count bioinformatics
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!
An Error Occurred
Unable to complete the action because of changes made to the page. Reload the page to see its updated state.
Select a Web Site
Choose a web site to get translated content where available and see local events and offers. Based on your location, we recommend that you select: .
You can also select a web site from the following list
Americas
- América Latina (Español)
- Canada (English)
- United States (English)
Europe
- Belgium (English)
- Denmark (English)
- Deutschland (Deutsch)
- España (Español)
- Finland (English)
- France (Français)
- Ireland (English)
- Italia (Italiano)
- Luxembourg (English)
- Netherlands (English)
- Norway (English)
- Österreich (Deutsch)
- Portugal (English)
- Sweden (English)
- Switzerland
- Deutsch
- English
- Français
- United Kingdom(English)
Asia Pacific
- Australia (English)
- India (English)
- New Zealand (English)
- 中国
- 日本Japanese (日本語)
- 한국Korean (한국어)
Contact your local office