Count number of nucleotide bases (letters) in file with multiple en... (2024)

63 views (last 30 days)

Show older comments

George on 30 Aug 2024 at 15:35

  • Link

    Direct link to this question

    https://support.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries

  • Link

    Direct link to this question

    https://support.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries

Edited: dpb on 1 Sep 2024 at 17:48

Accepted Answer: Voss

Open in MATLAB Online

Below is an example of the type of sequences I'm dealing with. The sequences are truncated for readability. Some IDs correspond to one sequence only whereas others to more than one squence. The objective is to count the number of nucleotides (letters) per ID.

>AAEL001292::intron | Intron sequence

GTGGG

>AAEL001312::intron | Intron sequence

GTAAGAAC

>AAEL001320::intron | Intron sequence

GTAAGACCTACTCA

>AAEL001807::intron | Intron sequence

GTACT

>AAEL002085::intron | Intron sequence

GTTAGTTGTA

>AAEL002085::intron | Intron sequence

GTAAGT

>AAEL002085::intron | Intron sequence

GTT

>AAEL002085::intron | Intron sequence

GTACGA

>AAEL002633::intron | Intron sequence

GTAAGAAAAAT

I've managed to generate a table like:

AAEL001292 5

AAEL001312 8

AAEL001320 14

AAEL001807 5

AAEL002085 10

AAEL002085 6

AAEL002085 3

AAEL002085 6

AAEL002633 11

The desired output is:

AAEL001292 5

AAEL001312 8

AAEL001320 14

AAEL001807 5

AAEL002085 25 # sum of number of nuceotides under to ID AAEL002085

AAEL002633 11

Any suggestions will be most helpful.

1 Comment

Show -1 older commentsHide -1 older comments

dpb on 30 Aug 2024 at 15:45

Direct link to this comment

https://support.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries#comment_3250074

  • Link

    Direct link to this comment

    https://support.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries#comment_3250074

Be much easier if you would attach a data file for folks to work with...as for hints, without a data file from which to see just what you're starting with, I'd venture using a table and @doc:rowfun or groupsummary even with grouping variables will give you the desired results directly.

Sign in to comment.

Sign in to answer this question.

Accepted Answer

Voss on 30 Aug 2024 at 16:58

  • Link

    Direct link to this answer

    https://support.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries#answer_1508074

Open in MATLAB Online

  • file.txt

% this is what the attached file contains:

type file.txt

>AAEL001292::intron | Intron sequenceGTGGG>AAEL001312::intron | Intron sequenceGTAAGAAC>AAEL001320::intron | Intron sequenceGTAAGACCTACTCA>AAEL001807::intron | Intron sequenceGTACT>AAEL002085::intron | Intron sequenceGTTAGTTGTA>AAEL002085::intron | Intron sequenceGTAAGT>AAEL002085::intron | Intron sequenceGTT>AAEL002085::intron | Intron sequenceGTACGA>AAEL002633::intron | Intron sequenceGTAAGAAAAAT

data = readlines('file.txt');

ID = regexp(data(1:2:end),'>(.*?):','tokens','once');

ID = vertcat(ID{:});

introns = data(2:2:end);

[G,ID] = findgroups(ID);

F = @(x)sum(cellfun(@numel,x));

count = groupsummary(introns,G,F);

T = table(ID,count)

T = 6x2 table

ID count ____________ _____ "AAEL001292" 5 "AAEL001312" 8 "AAEL001320" 14 "AAEL001807" 5 "AAEL002085" 25 "AAEL002633" 11

0 Comments

Show -2 older commentsHide -2 older comments

Sign in to comment.

More Answers (2)

dpb on 31 Aug 2024 at 14:03

  • Link

    Direct link to this answer

    https://support.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries#answer_1508314

  • Link

    Direct link to this answer

    https://support.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries#answer_1508314

Edited: dpb on 1 Sep 2024 at 17:48

Open in MATLAB Online

  • grp1_introns.txt

I'll convert the comment to an Answer because I believe it is/will be part of the required solution and is probably easier to clean the data to match the calculation assumptions than to modify the calculations to match the varying number of rows per group ID.

ADDENDUM

Another way to approach the cleanup would be to compute the line number difference between the lines beginning with ">" -- if it is >2, then that group has one or more wrapped lines.

data = readlines('grp1_introns.txt');

data(strlength(data)==0)=[]; % remove blank lines, if any

data(1:12)

ans = 12x1 string array

">AAEL001292::intron | Intron sequence" "GTGGGTGATTGTGTTTTTTTGATTTTGATACAAGTTAACAACATTGTATTTTTCAG" ">AAEL001312::intron | Intron sequence" "GTAAGAACACATGCGTCGTCAATGCTTAACAAGATTTAATACAATTTGCGCATCTGTTTC" "AG" ">AAEL001320::intron | Intron sequence" "GTAAGACCTACTCATTAGATTTATAGACAGTGATGTAATACTAATTATGCATTCATTTTA" "G" ">AAEL001807::intron | Intron sequence" "GTACTTTTGGTTGACGGAAGTTTTAGAATGAGAATTCGTCTAATCGTTATATATATTTTA" "CAG" ">AAEL002085::intron | Intron sequence"

ix=find(startsWith(data,'>'));

dx=[diff(ix);0];

iy=find(dx>2);

whos ix dx iy

Name Size Bytes Class Attributes dx 458x1 3664 double ix 458x1 3664 double iy 290x1 2320 double

[ix(1:10) dx(1:10) iy(1:10)]

ans = 10x3

1 2 2 3 3 3 6 3 4 9 3 5 12 4 8 16 2 9 18 2 10 20 3 11 23 9 12 32 9 13

<mw-icon class=""></mw-icon>

<mw-icon class=""></mw-icon>

NOTA BENE: There are wrapped lines with more than two lines; the sequence at line 16 is three while that at line 32 is 8 since the differences are 4 and 9, respectfully.

One will need to iterate through and combine all those rows in the counting, either by catenating the lines into one long one and removing the wrapped lines so the file follows the assumption of only one line or by summing the lengths of the lines between groups.

Well, let's see if I can manage to keep the indexing straight without too much trouble...

markfordelete=[]; % accumulator for marked rows

for i=1:numel(iy) % iterate over the elements that are multiple lines

l1=ix(iy(i))+1; % the first row in the group past the header

l2=l1+dx(iy(i))-2; % the last row in group

try

data(l1)=join(data(l1:l2),''); % put those rows together, no spaces

markfordelete=cat(1,markfordelete,[l1+1:l2].'); % keep list of rows to delete

catch

e=lasterror; e.message % for debugging

[i iy(i) ix(iy(i)) dx(iy(i)) l1 l2]

data(l1-1:end)

return

end

end

data(markfordelete)=[];

At this point, you should be able to apply @Voss method...it seems to work with at least a cursory checking. You'll want to output to a file and inspect carefully to make sure, but I think it works as intended.

ix=find(startsWith(data,'>'));

height(ix)

ans = 458

any(diff(ix)~=2)

ans = logical

The quick sanity check that the updated file doesn't have any segments of more than one line and the number of segments is the same is agoodsign™

ADDENDUM

Interesting thought -- what's the length of the lines now that are all strung together?

[Lmax,iMax]=max(dx)

Lmax = 401

iMax = 57

L=strlength(data(2:2:end));

UL=unique(L);

numel(UL)

ans = 157

NL=histc(L,UL);

[UL NL]

ans = 157x2

13 1 14 1 50 1 51 3 52 3 53 20 54 11 55 13 56 16 57 32

<mw-icon class=""></mw-icon>

<mw-icon class=""></mw-icon>

[min(UL) max(UL)]

ans = 1x2

13 23955

<mw-icon class=""></mw-icon>

<mw-icon class=""></mw-icon>

subplot(2,1,1)

bar(UL,NL)

set(gca,'XScale','log')

xlim([UL(1) UL(end)])

subplot(2,1,2)

bar(UL,NL)

set(gca,'XScale','log')

xlim([UL(1)-1 200])

xticks([UL(1) 20:10:50 100 200])

Count number of nucleotide bases (letters) in file with multiple en... (5)

sum(L)

ans = 444117

There are a couple of introns that have 400+ rows that adds up to string lengths of 20,000+. I dunno if this is correct or not, but opening the text file confirms that is the content of the uploaded file.

2 Comments

Show NoneHide None

George on 1 Sep 2024 at 16:17

Direct link to this comment

https://support.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries#comment_3251119

  • Link

    Direct link to this comment

    https://support.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries#comment_3251119

Grateful for your code. More than I expected.

dpb on 1 Sep 2024 at 16:52

Direct link to this comment

https://support.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries#comment_3251134

  • Link

    Direct link to this comment

    https://support.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries#comment_3251134

Edited: dpb on 1 Sep 2024 at 17:46

No problem; got interesting. Is the file really right that there are strings of 25,000+?

While the forum won't let accept more than one Answer, if the above actually is useful, you could cast a vote... :)

Sign in to comment.

George on 31 Aug 2024 at 12:16

  • Link

    Direct link to this answer

    https://support.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries#answer_1508259

  • Link

    Direct link to this answer

    https://support.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries#answer_1508259

  • grp1_introns.txt

Thank you very much for the prompt reply. I tried what you suggested. I get the following error message.

count = groupsummary(introns,G,F);

Error using matlab.internal.math.parseGroupVars (line 129)

Grouping variables and data matrix must have same number of rows.

Error in groupsummary (line 180)

[groupingData,groupVars] = matlab.internal.math.parseGroupVars(groupVars,tableFlag,"groupsummary:Group",T);

I'm attaching the data file in case it helps.

Once again many thanks

2 Comments

Show NoneHide None

dpb on 31 Aug 2024 at 12:50

Direct link to this comment

https://support.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries#comment_3250654

  • Link

    Direct link to this comment

    https://support.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries#comment_3250654

Edited: dpb on 31 Aug 2024 at 13:02

Open in MATLAB Online

  • grp1_introns.txt

data = readlines('grp1_introns.txt');

data(1:10,:)

ans = 10x1 string array

">AAEL001292::intron | Intron sequence" "GTGGGTGATTGTGTTTTTTTGATTTTGATACAAGTTAACAACATTGTATTTTTCAG" ">AAEL001312::intron | Intron sequence" "GTAAGAACACATGCGTCGTCAATGCTTAACAAGATTTAATACAATTTGCGCATCTGTTTC" "AG" ">AAEL001320::intron | Intron sequence" "GTAAGACCTACTCATTAGATTTATAGACAGTGATGTAATACTAATTATGCATTCATTTTA" "G" ">AAEL001807::intron | Intron sequence" "GTACTTTTGGTTGACGGAAGTTTTAGAATGAGAATTCGTCTAATCGTTATATATATTTTA"

The attached file seems to have a linefeed after a fixed-length record for the longer sequences -- is that real in the file or an artifact of how it was saved/uploaded?

That breaks the assumption @Voss made that there are two lines per section and in using findgroups it is found out when the number of rows don't match.

That assumption is also in @Star Strider's code, but his will still run; just those sequences in the second/wrapped line won't be counted.

I gather it must be true in the real files, too, if yours failed; in that case it will take some additional preprocessing to apply either solution.

Be sure you haven't inadvertently introduced that line break by having looked at the files in a word processing type editor such as Notepad that automatically word wrapped the longer lines (or, maybe even the MATLAB code editor might if the line length is set, I know it is sometimes quite annoying, too, with wanting to wrap comments at ends of somewhat longer code lines...

L=max(strlength(data)) % the maximum line length

ans = 60

ix=find(strlength(data)==L); % locate those long lines

iy=find(startsWith(data,'>')); % and the sequence header lines

iy=iy(iy>ix(1)); % keep only those after the first long line

for i=1:numel(ix)

if iy(i)-ix(i)>2 % wrapped line

data(ix,:)=join(data(ix:ix+1,:)); % put them back together

end

end

data(ix+1,:)=[]; % now remove the wrapped lines

George on 1 Sep 2024 at 16:16

Direct link to this comment

https://support.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries#comment_3251114

  • Link

    Direct link to this comment

    https://support.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries#comment_3251114

Many thanks for your input. Works very well.

Sign in to comment.

Sign in to answer this question.

See Also

Categories

AI, Data Science, and StatisticsText Analytics ToolboxText Data Preparation

Find more on Text Data Preparation in Help Center and File Exchange

Tags

  • nucleotide count bioinformatics

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

An Error Occurred

Unable to complete the action because of changes made to the page. Reload the page to see its updated state.


Count number of nucleotide bases (letters) in file with multiple en... (11)

Select a Web Site

Choose a web site to get translated content where available and see local events and offers. Based on your location, we recommend that you select: .

You can also select a web site from the following list

Americas

  • América Latina (Español)
  • Canada (English)
  • United States (English)

Europe

  • Belgium (English)
  • Denmark (English)
  • Deutschland (Deutsch)
  • España (Español)
  • Finland (English)
  • France (Français)
  • Ireland (English)
  • Italia (Italiano)
  • Luxembourg (English)
  • Netherlands (English)
  • Norway (English)
  • Österreich (Deutsch)
  • Portugal (English)
  • Sweden (English)
  • Switzerland
    • Deutsch
    • English
    • Français
  • United Kingdom(English)

Asia Pacific

  • Australia (English)
  • India (English)
  • New Zealand (English)
  • 中国
  • 日本Japanese (日本語)
  • 한국Korean (한국어)

Contact your local office

Count number of nucleotide bases (letters) in file with multiple en... (2024)

References

Top Articles
Latest Posts
Article information

Author: Frankie Dare

Last Updated:

Views: 6600

Rating: 4.2 / 5 (53 voted)

Reviews: 84% of readers found this page helpful

Author information

Name: Frankie Dare

Birthday: 2000-01-27

Address: Suite 313 45115 Caridad Freeway, Port Barabaraville, MS 66713

Phone: +3769542039359

Job: Sales Manager

Hobby: Baton twirling, Stand-up comedy, Leather crafting, Rugby, tabletop games, Jigsaw puzzles, Air sports

Introduction: My name is Frankie Dare, I am a funny, beautiful, proud, fair, pleasant, cheerful, enthusiastic person who loves writing and wants to share my knowledge and understanding with you.