Search manual

The VOICE CLARIAH search manual gives an overview of the search functions supported by VOICE 3.0 Online BETA. The enhanced web tool developed for VOICE 3.0 Online BETA also displays the CQL of each search query below the search field. If you encounter any difficulties with a search query, please save its CQL and share it with the VOICE CLARIAH team via the VOICE CLARIAH survey.

1. Token search

2. POS and lemma search

3. Mark-up search – NEW!

4. Phrase search – partly NEW!

5. Expert search – NEW!

6. Examples of combined searches – NEW!

Useful links

1. Token search



Example Search






General remark

Tokens are to be searched for with lower case characters (e.g. i speak french). (Capital letters indicate a POS search, see below.)

1.1. Simple token search


Search for a particular token


Contracted forms (e.g. wanna, gonna, don’t, it’s) need to be searched for with a space inserted before the contracted part.



wan na


do nt


1.2. Token search with wildcards

General remark

Note to users of previous versions of VOICE Online: the syntax of wildcard search has changed in VOICE 3.0. As a general rule, you now need to insert a full stop before any wildcard to obtain the search results you are familiar with from previous versions of VOICE Online, e.g. .*, .+ or .? (see the examples in this section).


token.* (no space)

Token plus zero or more characters













token.? (no space)

Token plus zero or one character




token.+ (no space)

Token plus one or more characters
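
The three wildcard forms described in this section read like the usual regular-expression quantifiers. The following is a minimal sketch in Python, not the VOICE search engine itself: it assumes that a wildcard query has to match a whole token, and both the helper name matching_tokens and the word list are made up for illustration.

```python
import re

def matching_tokens(query, tokens):
    """Return the tokens that the wildcard query matches in full."""
    pattern = re.compile(query)
    return [t for t in tokens if pattern.fullmatch(t)]

# Hypothetical word list, not taken from the corpus.
tokens = ["go", "got", "goes", "going", "ago"]

print(matching_tokens("go.*", tokens))  # ['go', 'got', 'goes', 'going']
print(matching_tokens("go.?", tokens))  # ['go', 'got']
print(matching_tokens("go.+", tokens))  # ['got', 'goes', 'going']
```

Under this assumption, go.* and go.? both return the bare token go, while go.+ does not, because it requires at least one additional character.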






2. POS and lemma search





General remark

POS tags are searched for with capital letters, e.g. VVP. (Lower case characters are used for token searches, see above).

Also note that all tokens in VOICE are tagged with POS tags for morphological form and, in parentheses, syntactic function. These are often, though not always, identical. If a tag is entered without specification, both positions are searched. Alternatively, the form position or the function position can be searched separately (see below). Note, however, that if the form tag and the function tag are identical, the POS view in the web interface currently shows only one tag, e.g. you_PP, not you_PP(PP); both tags are shown only if they differ, e.g. neuropa_PVC(NP).
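
The form/function distinction can be pictured with a small Python sketch. It is an illustration of the rules stated above, not code from the web tool: the function names pos_view and matches_pos and the position parameter are hypothetical.

```python
def pos_view(token, form_tag, function_tag):
    """Render a token as described for the POS view: one tag if both agree."""
    if form_tag == function_tag:
        return f"{token}_{form_tag}"
    return f"{token}_{form_tag}({function_tag})"

def matches_pos(tag, form_tag, function_tag, position="both"):
    """Check a tag against the form position, the function position, or both."""
    if position == "form":
        return tag == form_tag
    if position == "function":
        return tag == function_tag
    return tag in (form_tag, function_tag)

print(pos_view("you", "PP", "PP"))       # you_PP
print(pos_view("neuropa", "PVC", "NP"))  # neuropa_PVC(NP)

# A token tagged JJ in form position and RB in function position:
print(matches_pos("RB", "JJ", "RB"))                   # True (either position)
print(matches_pos("RB", "JJ", "RB", position="form"))  # False
```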

2.1. Simple POS search


(equivalent to pf:POS)


All tokens with a particular part-of-speech tag (POS) in form or function position


NB. General searches for a POS tag in the whole corpus are likely to yield many hits and might slow down the search engine.








All tokens with a part-of-speech tag in form position


(adj. in form position)




All tokens with a part-of-speech tag in function position


(adj. in function position)



2.2. POS search with wildcards



POS tag with wildcard


NB. Using wildcards with POS tags is meaningful for POS categories which are sub-divided into finer categories, e.g. verbs, adjectives, nouns and adverbs. Users may also want to narrow down results by adding sub-specifications, e.g. go,V.* (see 5. Expert search).


Verb-tag with wildcard

(all verb forms, e.g. VV, VBP, …)


to ask_VV(VV)




Adjective-tag with wildcard

(all adjective forms, i.e. JJ, JJR, JJS)





Noun-tag with wildcard

(all noun forms, e.g. NN, NNS, …)





2.3. Lemma search


Finds all tokens of a particular lemma







3. Mark-up search – NEW!





General remarks

Mark-up can be searched via several means in VOICE 3.0 Online BETA.

Apart from POS tags for conversational features like pauses (PA) or breath (BR) that were available already in VOICE 2.0 POS Online (see short POS tag set), VOICE 3.0 Online BETA introduces new possibilities to search for pauses, laughter and mark-up between pointed brackets (e.g. speaking modes, overlaps, tags for non-English speech) in more sophisticated ways.
(For descriptions of the mark-up categories available in VOICE transcripts, see the VOICE Mark-up Conventions.)

3.1. Pauses




Pauses of different lengths
(Numbers indicate length in seconds as transcribed. _0 indicates a short pause of up to a half second, see VOICE Mark-up conventions.)


NB. Be mindful that especially short pauses are rather frequent and are thus best searched in combination (e.g. with tokens or POS tags, see 4.4. and below).


NB. In order to find pauses irrespective of their length, we recommend you use the POS tag PA (see 3.7. Mark-up searches via POS tags and special queries, and the short POS tag set).
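
As a rough illustration of how transcribed pauses relate to pause queries, the Python sketch below converts a pause as it appears in a transcript into the corresponding query. The helper pause_query is hypothetical. The mapping of (1) to _1 follows the examples in this manual; the mapping of (.) to _0 is an assumption based on the description of _0 as a short pause of up to half a second.

```python
import re

def pause_query(transcribed):
    """Map a transcribed pause such as '(1)' or '(.)' to a query such as '_1'."""
    match = re.fullmatch(r"\((\d+|\.)\)", transcribed)
    if match is None:
        raise ValueError(f"not a pause: {transcribed!r}")
    length = match.group(1)
    # Assumption: the short pause written (.) corresponds to the query _0.
    return "_0" if length == "." else f"_{length}"

print(pause_query("(1)"))  # _1
print(pause_query("(.)"))  # _0
```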















3.2. Laughter




Laughter: each @-symbol stands for one laughed syllable (e.g. ha ha = @@, see VOICE Mark-up Conventions).



@@ @@






Laughter strings with at least one “@”, i.e.
laughter strings with any number of syllables.








Laughter with a defined string length (see 5.3. Context search)


(sequences of minimum 2 and maximum 4 repetitions of the @-character)
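
The idea of a laughter string with a bounded length can be sketched with a regular-expression range quantifier. The Python snippet below is an illustration only; the helper names are hypothetical and it works on plain strings of @-characters rather than on the corpus.

```python
import re

def laughter_syllables(token):
    """Return the number of laughed syllables, or 0 if the token is not laughter."""
    return len(token) if re.fullmatch(r"@+", token) else 0

def within_range(token, minimum, maximum):
    """True if the token consists of minimum..maximum @-characters."""
    return re.fullmatch(r"@{%d,%d}" % (minimum, maximum), token) is not None

print(laughter_syllables("@@"))      # 2
print(within_range("@", 2, 4))       # False
print(within_range("@@@@", 2, 4))    # True
print(within_range("@@@@@", 2, 4))   # False
```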




3.3. Speaking modes






Stretch of speech marked <xyz> token token </xyz>


NB. For the full list of speaking modes see the VOICE Mark-up Conventions.



(speaking mode: laughingly)

<@> yeah yeah </@>


(speaking mode: soft)

<soft> okay </soft>

3.4. Non-English Speech



All stretches of transcription in languages marked as non-English speech (L1, LN or LQ; see VOICE Mark-up Conventions).


<LNger> diesen leberknoedel {this liver dumpling} </LNger>

<L1slo> xxxx </L1slo>

<LQslo> dobre? {good} </LQslo>


All stretches of speakers' first languages (L1) other than English


<L1mlt> mara {woman} </L1mlt>

<L1rum> securitate </L1rum>


All stretches of speech in neither English nor a speaker's first language


<LNger> senf. {mustard} </LNger>

<LNita> toscana? {name of pizza} </LNita>


All stretches of speech where it is not known whether they are in a speaker's first language or in a foreign language


<LQfre> melange {mixture} </LQfre>

<LQger> danke {thanks} </LQger>

<L translation="token"/>


Finds tokens in any translation tag (L1, LN or LQ)

<L translation="yes"/>

<L1scc> jeste {yes} </L1scc>

<LNita> s:i. {yes} </LNita>

<L1 translation="token"/>

<LN translation="token"/>

<LQ translation="token"/>



Finds tokens in translations within either an L1, an LN or an LQ tag

<LN translation="yes"/>

<LNger> ja {yes} </LNger>

<LNita> s:i. {yes} </LNita>

<LNfre> oui {yes} </LNfre>




Finds and highlights a stretch of a particular language tag

NB: Languages are abbreviated according to the ISO 639-2 codes.


<L1ger> nein danke {no thanks} </L1ger>


<LNita> grazie {thanks} </LNita>

3.5. Overlaps



NB: This search is best narrowed down, e.g. by using within or containing (see example on the right and 5.4.1. Tokens within Mark-up)


<1> what is it </1>

<3> yeah </3>

<6> we have that </6>



okay within <ol/>

<2> okay </2>

<8> oh okay </8>

3.6. Additional mark-up searches




<ono> wəʊəʊ: </ono>

<ono> brbrm </ono>

<clears throat/>


Speaker noises

NB: For the full list of speaker noises see the VOICE Mark-up Conventions.

<clears throat/>


<clears throat>

3.7. Mark-up searches via POS tags and special queries




All foreign (i.e. non-English) tokens


<LNbul>  rakia_FW(FW) {raki} </LNbul>

<L1ger> tschuldigung_FW(FW) {sorry} </L1ger>

<LNger> schottentor?_FW(FW) {place in vienna} </LNger>


All pauses (POS tag PA)







All pronunciation variations and coinages


<pvc> creativitly_PVC(NN)/PVC(RB) {creatively} </pvc>

<pvc> frauding_PVC(VVG) </pvc>


All onomatopoeia


<ono> bvuff_ONO(ONO) </ono>

<ono> lalala_ONO(ONO) </ono>



All spelt items

NB. While spelt tokens are annotated with different POS tags (e.g. SP, CD, NN), they can be retrieved through the common prefix s_.



<spel> p h d_NN(NN) </spel>

<spel> a_LS(LS) </spel>

<spel> e u_NP(NP) </spel>

<spel> a m_RB(RB) </spel>

<spel> s_p_SP(SP) </spel>


4. Phrase search – partly NEW!





General remarks

Any combination or sequence of tokens (i.e. character strings/lexical searches), tags or searchable mark-up can be searched as phrases when each item is separated by a space.

Phrase searches are only carried out within individual utterances. In consequence, phrases that go beyond utterance boundaries will not be found.

Conversational mark-up such as pauses, laughter, breathing, tags for overlapping speech and other mark-up are ignored (i.e. they do not break up lexical phrases), unless mark-up items are explicitly included in the phrase search.
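
The two rules above (phrases never cross utterance boundaries, and mark-up does not break a phrase) can be pictured with a toy model in Python. This is a sketch under simplifying assumptions, not the actual search engine: each utterance is a list of (text, kind) pairs, the function find_phrase is hypothetical, and the sample utterances are loosely modelled on the results shown in 4.1.

```python
def find_phrase(utterances, phrase):
    """Return indices of utterances containing the phrase as consecutive tokens."""
    wanted = phrase.split()
    hits = []
    for index, utterance in enumerate(utterances):
        # Drop mark-up so that pauses, breathing etc. do not break the phrase.
        words = [text for text, kind in utterance if kind == "token"]
        for start in range(len(words) - len(wanted) + 1):
            if words[start:start + len(wanted)] == wanted:
                hits.append(index)
                break
    return hits

utterances = [
    [("and", "token"), ("(1)", "markup"), ("the", "token")],
    [("and", "token"), ("hh", "markup"), ("the", "token")],
    [("so", "token"), ("and", "token")],   # ends in "and" ...
    [("the", "token"), ("end", "token")],  # ... next utterance starts with "the"
]

print(find_phrase(utterances, "and the"))  # [0, 1]
```

The last two utterances are not reported, because "and" and "the" sit on either side of an utterance boundary.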

4.1. Lexical phrases (tokens)

Token plus token



token token


Finds a particular sequence of tokens

and the

and the

a:nd the

(and) (1) the

and hh the

and the </@> hh and the

4.2. Part-of-speech and lemma combinations

POS tag plus POS tag / lemma






Finds a particular sequence of POS tags


(Determiner followed by adjective followed by noun)

a hu:ge university

the other way

a good soccer

POS1 POS2 lemma1

Finds a particular sequence of POS tags and/or lemma tags


DT JJ l:university

a hu:ge university

a (.) modern university

the private universities

4.3. Word, POS, lemma combinations

Token plus POS or lemma



token POS

POS token

lemma POS

token1 POS1 POS2

POS1 token1 token2

Finds sequences of tokens, POS tags and lemmas

whenever PP

(token whenever plus personal pronoun)

whenever you

whenever they

whenever we

PVC er

(pronunciation variation and coinage plus token er)

<pvc> preferently </pvc> er

<pvc> (knowledges) </pvc> er

you MD VV

(token you followed by modal verb and base verb)

you will go

you can get

play the NN

(tokens play and the followed by singular noun)

play the card

play the doorman

play the map

4.4. Word and mark-up sequences

Token plus mark-up



token <speaking mode/>

Token followed by speaking mode soft

yeah <soft/>

(token yeah followed by mark-up indicating softly spoken)

yeah <soft> okay okay <1> i understand </1> </soft>

token <L/>


Token followed by non-English speech

say <L/>

(token say followed by any language tag)

can say <LNger> vermissen {to miss} </LNger> (.)

how do you say <LNfre> subvention {subvention, subsidy} </LNfre>

now we say (.) <L1nor> trettito {thirty-two} </L1nor>

is <L1/>

is <L1ger> garnisongasse {street name} </L1ger>

_@ token

Laughter followed by token

_@+   yes

(any number of laughter-syllables followed by token yes)

@ yes

@@ yes

@@@ <1> yes </1>

token _1

Token followed by pause

i _1

(token i followed by a 1-second pause)

no i (1) i just

what i: (1) would like to

4.5. POS and mark-up sequences

POS tag plus mark-up



<@/> POS

Speaking mode followed by POS

<@/> UH

(laughingly spoken followed by interjection)

<@> no </@> @ ah

<@> okay </@> (1) erm

<@> well </@> (1) wow.

<L/> POS

Tag indicating non-English speech followed by POS

<L/> PVC

(language tag followed by PVC)

<L1scc> xx x </L1scc> <pvc> sympatic </pvc>

POS <ol/>

POS tag followed by overlap

UH <ol/>

(interjection followed by overlap)

er <4> reaction </4>

a:h <2> well yes </2>

huh? (.) <3> and the: </3>

4.6. Phrase search with wildcards


General remark

For phrase search, wildcards need to be separated from all other tokens, mark-up, POS tags, lemmas etc. by a space.



Token/POS/lemma plus token with one or more (n-)characters

manage .*

manage to

manage with

manage i



.* NN .*

a student here


Token/POS/lemma plus token with one character


go .?

go i



Token/POS/lemma plus token with one or more characters

NB. Search results with .+ are identical to .* in phrase search.

austria .+

austria from

austria i

austria the


5. Expert search – NEW!




5.1. Fine-tuning searches (and)


Meaning: and

Finds sub-specifications of tokens with POS tags or lemmas. (The items before and after the comma may appear in any order.)



Token tagged with a particular POS tag


(token walk as noun)

a five minute walk_NN(NN)



(token real as adverb)

real_RB(RB) beautiful


All tokens of a particular lemma tagged with a particular POS tag


(all tokens with lemma go and tagged with verb-tag present tense 3rd person singular)

everybody goes_VVZ(VVZ)

who <@> loses go_V(VVZ) <8> for drinks
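
Sub-specification queries such as think,V.* or walk as a noun can be read as "every condition must hold for the same token". The Python sketch below models this reading. It is a simplification and not the tool's implementation: the Token type and the matches function are hypothetical, only one POS tag per token is modelled (no form/function split), and the comma-separated query form is taken from examples elsewhere in this manual.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):
    text: str
    lemma: str
    pos: str

def matches(query, token):
    """Check every comma-separated condition of the query against one token."""
    for condition in query.split(","):
        if condition.startswith("l:"):
            ok = re.fullmatch(condition[2:], token.lemma)  # lemma condition
        elif condition[0].isupper():
            ok = re.fullmatch(condition, token.pos)        # capitals: POS tag
        else:
            ok = re.fullmatch(condition, token.text)       # otherwise: token
        if not ok:
            return False
    return True

print(matches("walk,NN", Token("walk", "walk", "NN")))    # True
print(matches("walk,NN", Token("walk", "walk", "VV")))    # False
print(matches("l:go,VVZ", Token("goes", "go", "VVZ")))    # True
```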

5.2. Fine-tuning searches (or)


Meaning: alternation (or)

Finds any of the options either to the left or the right of the pipe character | . (Any sequence of tokens, lemmas or POS tags before and after | is possible.)


Finds either one of these tokens


say|mean that

mean that

say that


Finds either one of these POS tags



(verb have or be, past tense)





Finds either this token or POS tag


(existential there or you)




Finds either one of these lemmas


l:say|l:mean that

(lemma say or lemma mean plus token that)

say that

said that

saying that

mean that

means that



Finds either this token or this lemma






token1|token2 l:lemma1

Token1 or token2 followed by lemma1

never|always l:say

(token never or always followed by lemma say)

always say

never said

always saying

token1 POS1|POS2

Token1 followed by POS1 or POS2


(token i followed by response marker or interjection)

yeah i

er i

mhm: i

always VBZ|VHZ|VVZ

(token always followed by third pers. singular form of be, have or other verbs)

always is

always has

always depends

always does

_@ POS1|POS2

Laughter followed by POS1 or POS2

_@ UH|RE

(one syllable of laughter followed by interjection or response marker)


@ er

@ yeah

@ ah
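
Alternation in a phrase query such as say|mean that can be thought of as a list of positions, each with one or more permitted options. The Python sketch below covers plain tokens only and is not the search engine's own logic; the function names are hypothetical.

```python
def slot_options(query):
    """Split a phrase query into positions, each a list of alternatives."""
    return [slot.split("|") for slot in query.split()]

def matches_sequence(query, tokens):
    """True if each token is one of the alternatives allowed at its position."""
    slots = slot_options(query)
    return len(slots) == len(tokens) and all(
        token in options for token, options in zip(tokens, slots)
    )

print(slot_options("say|mean that"))                        # [['say', 'mean'], ['that']]
print(matches_sequence("say|mean that", ["mean", "that"]))  # True
print(matches_sequence("say|mean that", ["think", "that"])) # False
```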


5.3. Context search: Defining range of context

General remark

As with any phrase search, in context search only search results within individual utterances are found.


{minimal number,maximal number}

Specifies an exact number or a range of minimal and maximal number.


NB. If used without a space the number in {…} refers to its immediate left neighbour, e.g. _@{2} finds exactly two @-syllables (@@).


NB. If used with space, _@ {2} finds one syllable of laughter (@) followed by any two tokens.



token {2}

Token followed by any two tokens in the same utterance

really {2}

really low and

really strong hm

really good at

token {0,3}


Token followed by any zero to three tokens in the same utterance

house {0,3}

house in like say

{1,2} token {1,2}


Token preceded and followed by any one to two tokens in the same utterance

{1,2} house {1,2}

have a house in each

on the house on the

to your house again

token1 {0,3} token2


Zero to three tokens between token1 and token2 in the same utterance

i {0,3} go

i must go

i decided to go

i only want to go

{1,2} POS tag {1,2}


A POS tag preceded and followed by one or two tokens in the same utterance

{1,2} PVC {1,2}

in er <pvc> maltesan {maltese} </pvc> english?

of (.) european <pvc> reintegration </pvc> you know?



Wildcard which defines a particular number of placeholder tokens.

NB. This type of query only yields meaningful results when narrowed down e.g. by phrase search (see example to the right).

a .*{1} house

a neutral house

a retirement house

a lovely house

.*{minimal, maximal}

Wildcard defining a particular number range of placeholder tokens.

go .*{1,2} university

go to university

go to the university

go to state university
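
A context range such as i {0,3} go can be read as "token2 follows token1 with between zero and three tokens in between, all in the same utterance". The Python sketch below implements this reading for a single tokenised utterance. It is a toy model with a hypothetical function name and not the tool's implementation.

```python
def find_with_gap(tokens, first, last, min_gap, max_gap):
    """Return (start, end) index pairs where `last` follows `first`
    with min_gap..max_gap tokens in between."""
    hits = []
    for start, token in enumerate(tokens):
        if token != first:
            continue
        for gap in range(min_gap, max_gap + 1):
            end = start + gap + 1
            if end < len(tokens) and tokens[end] == last:
                hits.append((start, end))
    return hits

print(find_with_gap("i must go".split(), "i", "go", 0, 3))          # [(0, 2)]
print(find_with_gap("i only want to go".split(), "i", "go", 0, 3))  # [(0, 4)]
print(find_with_gap("i would really like to go".split(), "i", "go", 0, 3))  # []
```

The last call finds nothing because four tokens separate i and go, which exceeds the maximum of three.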

5.4. Search within: Find tokens and POS/lemmas within mark-up


Finds and highlights individual tokens/tags or combinations of tokens, POS tags or lemmas within a mark-up tag (in pointed brackets).

5.4.1. Tokens within Mark-up

token within <speaking mode/>


Token within pointed brackets, e.g. Speaking mode

go within <soft/>

<soft> have to go: </soft>

yeah within <@/>

<@> yeah yeah yeah </@>

token within <L1/LN/LQ/>

Token within tag for non-English speech

nein within <L1ger/>

<L1ger> nei:n {no:} </L1ger>

token within <ol/>

Token within overlapping speech

really within <ol/>

<3> really strong. (1) hm? </3>

<4> really? </4>

<2> not really </2>

_@ within <speaking mode/>

Laughter within speaking mode

_@ within <loud/>

<loud> @ </loud>

5.4.2. POS within Mark-up

POS within <speaking mode/>

POS tag within speaking mode

RE within <loud/>

<loud> yeah_RE(RE) </loud>

<loud> okay?_RE(RE) </loud>

POS within <ol/>

POS tag within overlap

FI within <ol/>

<4> sorry_FI(FI) </4>

<7> oh_FI(FI) my_FI(FI) gosh_FI(FI) </7>

<8> youre_FI(FI) welcome_FI(FI) </8>

<4> bye-bye_FI(FI) </4>

5.4.3. Lemma within Mark-up

l:lemma within <speaking mode/>

Lemma within speaking mode

l:be within <imitating/>

<imitating> be: the members of the working groups <8>

l:lemma within <ol/>

Lemma within overlap tag

l:say within <ol/>

<6> say </6>

<2> am i saying</2>

<4> hed say </4>

5.5. Search for containing: Find stretches of speech with particular mark-up that contain particular tokens/POS/lemmas



5.5.1. Mark-up containing token



<ol/> containing token


Overlap containing token

<ol/> containing funny

<7> so funny </7>

<7> a little bit funny </7>

<6> thats funny</6>

<L/> containing token

Language tags marked as non-English speech containing token

<L1/> containing ja

<L1ger> ja tust du (weiter) {do you hurry up} </L1ger>

<L1ger> ja? {yeah} </L1ger>

<soft/> containing token

Speaking mode containing token

<soft/> containing okay

<soft> okay </soft>

<soft> okay its my turn? </soft>

5.5.2. Mark-up containing POS



<speaking mode/> containing POS

Speaking mode containing POS tag

<loud/> containing RE

<loud> no dont </loud>

<loud> yeah. </loud>

<loud> yes </loud>

<loud> okay there is coffee </loud>

5.5.3. Mark-up containing lemma



<@/> containing l:lemma


Speaking mode laughingly spoken containing lemma

<@/> containing l:go

<@> when and where to go </@>

<@> you went shopping </@>
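
The difference between within (5.4) and containing (5.5) lies in what is returned: within targets the individual token inside the mark-up, while containing targets the whole marked-up stretch. The Python sketch below models this on a few stretches adapted from the examples in this manual. The data layout and both function names are hypothetical, and the sketch does not reproduce the tool's highlighting.

```python
def search_within(token, tag, stretches):
    """Return (stretch index, token index) for each hit of token inside tag."""
    return [
        (i, j)
        for i, (stretch_tag, words) in enumerate(stretches)
        if stretch_tag == tag
        for j, word in enumerate(words)
        if word == token
    ]

def search_containing(tag, token, stretches):
    """Return every whole stretch marked with tag that includes token."""
    return [words for stretch_tag, words in stretches
            if stretch_tag == tag and token in words]

stretches = [
    ("soft", ["okay"]),
    ("soft", ["okay", "its", "my", "turn"]),
    ("loud", ["okay", "there", "is", "coffee"]),
    ("soft", ["have", "to", "go"]),
]

print(search_within("okay", "soft", stretches))      # [(0, 0), (1, 0)]
print(search_containing("soft", "okay", stretches))  # [['okay'], ['okay', 'its', 'my', 'turn']]
```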


6. Examples of combined searches – NEW!


General remark

This section provides a non-exhaustive selection of possible search combinations for illustration and inspiration.


6.1. Combined searches with wildcards and fine-tuning



token .* .* .*

Token plus wildcards for any number of tokens with more characters

i really .* .* .*

i really feel so old

i really appreciate talking to

i really think that you

.+ token .*

Token preceded by wildcard and followed by wildcard

.+ i really .* .*

what i really liked was

e:r i really hope that

i i really dont


Combination of particular form and function-POS tags


(token tagged adjective in form-position and adverb in function position)

you grew up (.) bilingual_JJ(RB).

perform good_JJ(RB) in another language



Token with wildcard tagged with a particular POS tag


(thank with wildcard as formulaic item)


thank you


(all tokens ending in -ness tagged PVC)






All instances of a lemma tagged with a particular POS tag with wildcard.


NB. In phrases, this type of search can be useful to retrieve all POS tags of a superordinate POS category, e.g. V.* (all verbs) or N.* (all nouns).



(see, all verb-forms)

they saw_VVD(VVD) plays

you dont see_VV(VV) it

were seeing_VVG(VVG) a growing gap

token1 POS1 token2 .*

Combinations of token and POS tag plus a wildcard (standing for any token with one or more characters)

i RB think .*

(i followed by adverb followed by think followed by any token with more characters)

i also_RB think that

POS1 POS2 token,POS.*

Sequence of POS tags followed by a token with a sub-specification

PP RB think,V.*

(Personal pronoun followed by adverb followed by think as verb)

 i_PP also_RB think_VVP

could you_PP maybe_RB think_VV


Sequence of POS tags including wildcards


(Adverb followed by any verb-form followed by personal pronoun)

just smell it

nt put it

then leave it


Token of a particular lemma sub-specified with POS tag


(Lemma show followed by any verb form)






(Lemma thought as singular or plural noun)




Token of a lemma with wildcard sub-specified with POS tag with wildcard.


(Lemma starting with re- tagged as any verb-form)





POS l:lemma POS,.*token

POS tag followed by lemma followed by POS tag sub-specified with a token with wildcard

DT l:good NN,.*ion

(Determiner followed by lemma good followed by singular nouns ending in -ion)

a better situation

the: good discussion

the best solution

POS1|POS2 token1

Either POS tag1 or POS tag2 followed by token1

RB|JJ good

(Adverb or adjective followed by good)

very good

no good

good good

many good

token1|token2|token3 POS1

Either token1, token2 or token3 followed by POS tag1

yes|yeah|yah UH

(Tokens yes, yeah or yah followed by POS category interjection)

yes o:h

yah?  er

yeah. ooph

6.2. Combined searches with within or containing

token within <speaking mode/>

Token within speaking mode

<soft/> containing well

<7> <soft> well you know </soft> </7>

<soft> mhm (2) very well </soft>

<soft> on Thursday as well </soft>

POS1|POS2|POS3 within <ol/>

Either POS tag1, 2 or 3 within overlap-tag

FI|RE|UH within <ol/>

(Formulaic item or response marker or interjection within overlap tag)

<3> thanks_FI(FI) </3>

<5> ye:s_RE(RE) </5>

<10> er:_UH(UH) </10>

laughter within <ol/>

Laughter within overlap-tag

_@@ within <ol/>

(Two syllables of laughter within overlap-tag)

<8> @@ </8>

<1> hi @@ </1>

<speaking mode/> containing token,POS

Speaking mode tag containing a token sub-specified with a POS tag

<soft/> containing well,DM

(Speaking mode soft containing token well POS tagged as discourse marker)

<soft> well_DM(DM) you know </soft>

<soft> well_DM(DM) (then) yeah of course but </soft>

<ol/> containing token,POS

Overlap-tag containing a token sub-specified with a POS tag

<ol/> containing you,FI

(Overlap containing token you tagged as formulaic item)

<6> thank you_FI(FI) @@@ </6>

<11> see you_FI(FI) </11>

<7> you_FI(FI)re welcome </7>

<ol/> containing l:lemma,POS.*

Overlap-tag containing all tokens of a lemma sub-specified with a POS tag with wildcard

<ol/> containing l:good,RB.*

(Overlap tag containing all tokens of the lemma good as any type of adverb, i.e. RB, RBR, RBS)

is going <6> (good)_RB(RB) </6>

<1> much better_RBR(RBR) </1>


6.3. Combined searches with context

token1 {0,1} POS1 POS2

Token1 followed by a defined range of context followed by POS tag1 and 2

the {0,2} JJ NN

(Token the followed by 0-2 tokens followed by adjective and noun)

the main building

the second third lesson

the the legal stuff

the legal erm legal clinic

<speaking mode/> containing token {0,1}

Speaking mode tag containing a token followed by a defined range of tokens

<soft/> containing yes {1,5}

(Speaking mode soft containing token yes followed by a range of 1-5 tokens.)

<soft> yes okay </soft>

<soft> a:h yes. (.) [name2]</soft>

<soft> yes they must be calibrated </soft>

6.4. Combined mark-up searches

Combination of mark-up searches via POS tags and new mark-up searches in pointed brackets.



PA <speaking mode/>

Any pause followed by a speaking mode tag

PA <fast/>

(all pauses followed by fast speech)

(.) <fast> keep that in mind </fast>

PVC within <speaking mode/>

All pronunciation variations and coinages which occur within a speaking mode tag

PVC within <soft/>

<soft> <pvc> unconcrete </pvc> </soft>

<soft> a balloon <pvc> wobbler? </pvc> </soft>


<ol/> containing SP


All overlaps containing spelt tokens

<ol/> containing SP

<9> <spel> s p </spel> </9>


