WIMS' search engine and als
===========================

WIMS' search engine works in two stages:

1) update of the index files when the server data changes (a module is
   added...), typically once a day;
2) use of the index files at each user request to find activities.

Here are some details:
 
1) update of index files
===========================
A series of scripts creates a set of auxiliary files (generally
stored in ~/public_html/bases/sys/, see the description further down)
and a list of "keywords" (stored in ~/public_html/bases/site/).

The scripts must be run in the order given here, as some files
created in earlier stages are used in subsequent stages. In general
the whole process is run by the script ~/bin/mkindex.
 
* First, a series of 3 Perl scripts (mkdomain, mkwgrp, modindclass),
  which ~/bin/mkindex.sh calls via ~/public_html/bases/sys/mkindex.sh:

- the program ~/public_html/bases/sys/mkdomain.pl creates the lists
  of domains from the graph in domain/domain with its translations
  (domain/domain.$lang), and in JSON format (English) to be used for
  completion in modtool properties
 
- the Perl program ~/public_html/bases/sys/mkwgrp.pl reads the INDEX
  files of all the modules on the site and generates

  - keywords (in .json format) to be used for completion in the search
    engine
  - the files in wgrp

  using the keywords and keywords_$lang variables of the INDEX files,
  according to this rule: keywords_$lang is taken if it exists,
  otherwise keywords (whether or not the module is a $lang module).

  Some files are created in keywords, such as keywords/algebra.fr.tmp,
  but they are not used for the moment. The keywords in these
  "keywords files" are selected by the same rule.
 
- the program ~/public_html/bases/sys/modindclass.pl creates the lists
  of keywords coming from the example classes in
  ~/public_html/bases/class as well as the files author,
  description, language, level, title (no ranking is done).
 
* Second, the binary program "modind" (compiled from
  ~/src/Misc/modind.c) reads

  -- the INDEX files of all the modules on the site
  -- the auxiliary files in ~/public_html/bases/sys/ (see description
     below)

  and produces keyword lists stored in ~wims/public_html/bases/site:
  they contain the words (or word groups) coming from the variable
  keywords of the INDEX, but also words from the title and description
  (small words are dropped).

  "modind" also creates a serial list of all the modules available
  on the site, see ~/public_html/bases/site/serial, and calculates the
  ranking of the site's modules. The modules are classified according
  to their types: A=all (except sheets and classes), D=document, O=OEF,
  X=exercise, T=tool, R=recreation, M=data module.

  To do that, "modind" uses some dictionaries such as
  suffix.$search_lang. --> MC I would simply say uses most of the files in
  ~/public_html/bases/sys/, e.g. wgrp

  -- separately, "modind" also reads the files in
  ~/public_html/bases/sys/sheet and does the same kind of work
 
2) use of index files
===========================
The script ~/public_html/modules/home/search.proc (called by the
"Search" form) reads the lists above, performs the actual search in
these lists and displays the modules found. It also reads the files in
~/public_html/bases/sys/class and ~/public_html/bases/sys/sheets
 
More technical details about both stages
========================================

In both stages, files in the directory ~/public_html/bases/sys/ (see
comments below; suffix.$lang for example, but see the remark above) are
used to process the keywords present in the modules' INDEX files. Each
"search language" has its own series of files.
 
?? For any module, in any language, the keywords

The contents of these INDEX files should be checked by developers and
translators, to improve the behaviour of the search engine.
 
The files in this directory ~/public_html/bases/sys/
are automatically generated (on install)
from the corresponding ".src" files in the "src" subdirectory.
 
If any of the files described below is omitted, then the corresponding
feature in the corresponding language is disabled. E.g. the files
words.fr/words.fr.src and suffix.fr/suffix.fr.src will be/have been
deleted in order to make the search engine work correctly.

  (Remark: I deleted the files words.fr.src and suffix.fr.src by
  renaming them xx_orig for the moment, so they are not used; but on a
  public server the file suffix.fr.src must be deleted by hand. This
  will be done by the script in the stable release if we are OK.)

  Rmk (bpr): I deliberately deleted suffix.fr as it is incompatible
  with the list of words shown by completion (for example, "loi
  normale" was transformed into "loi norm??", I do not remember
  exactly; such a string cannot be offered by completion, and "loi
  normale" was not found). suffix.en should also be deleted.
 
Syntax: the lines for most of these files are in the form

==
givenword:substitute
==

=============================================================
 
Files
=====

words.$search_lang : corrects misprints in the search words
(used both by "mkindex" and "search.proc").

E.g. if the file words.en contains the line

==
analytical:analytic
==

then the word "analytical" is considered a misprint and any occurrence
of the string "analytical" is replaced in the search by the string
"analytic" (for the language "en").
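
The substitution described above can be sketched in a few lines of
Python. This is only an illustration, not the actual WIMS code: the
function names parse_rules and correct_words are hypothetical, and
replacing whole whitespace-separated words is an assumption (the real
engine may match strings differently).

```python
def parse_rules(text):
    """Parse 'givenword:substitute' lines into a dict (other lines are skipped)."""
    rules = {}
    for line in text.splitlines():
        given, sep, substitute = line.strip().partition(":")
        if sep and given:
            rules[given] = substitute
    return rules


def correct_words(query, rules):
    """Replace every whitespace-separated word that has a rule (assumed behaviour)."""
    return " ".join(rules.get(word, word) for word in query.split())


rules = parse_rules("analytical:analytic\n")
print(correct_words("analytical geometry", rules))  # analytic geometry
```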
 
Note: words.fr was deleted because it caused the search engine not to
work properly. The site manager can reactivate the functionality by
adding the file again (?? how to get the "original" files from the
svn?).

Note: the file words.en is used by the module tool/wcalc.en (see
~/public_html/modules/tool/wcalc.en/dic)
 
=====================

suffix.$search_lang : processes common suffixes in the search words
(used both by "mkindex" and "search.proc").

E.g. if the file suffix.en contains the line

==
ertem:meter
==

then any word ending in "metre" ("ertem" read backwards) is
replaced by the corresponding word ending in "meter" (kilometre -->
kilometer).
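
A minimal Python sketch of this rule format follows. It is an
illustration under stated assumptions, not the WIMS implementation:
the names parse_suffix_rules and apply_suffix are hypothetical, and
"the first matching rule wins" is an assumption about rule order. The
second rule, "sr:r", is the suffix.en directive quoted in the indexing
example further down.

```python
def parse_suffix_rules(text):
    """Return (suffix, replacement) pairs; the left-hand side is written reversed."""
    rules = []
    for line in text.splitlines():
        reversed_suffix, sep, replacement = line.strip().partition(":")
        if sep and reversed_suffix:
            rules.append((reversed_suffix[::-1], replacement))
    return rules


def apply_suffix(word, rules):
    """Rewrite the ending of word with the first matching rule (assumed order)."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


rules = parse_suffix_rules("ertem:meter\nsr:r\n")
print(apply_suffix("kilometre", rules))  # kilometer
print(apply_suffix("vectors", rules))    # vector
```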
 
Note: suffix.fr was deleted because it caused the search engine/the
keyword completion not to work properly. The site manager can
reactivate the functionality by adding the file again.
 
=====================

wgrp/wgrp.$search_lang : groups of words
(these files are automatically generated, and used by "mkindex")

E.g. if the file wgrp/wgrp.en contains the line

==
affine geometry:affine geometry,
==

then the search matches the group of words "affine geometry" as a
whole: if the user searches for "affine geometry", the search
engine returns only the modules having the exact string
"affine geometry" as a keyword (if this line were not present, the
search engine would return both the modules containing the word
"affine" and the modules containing the word "geometry").
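
The effect of such a line on a query can be pictured with the
following Python sketch. It is a simplification made for illustration
only: load_groups and search_terms are hypothetical names, only a
query that is exactly one known group is handled, and lower-casing
the query is an assumption.

```python
def load_groups(text):
    """Collect the left-hand sides of 'group:group,' lines."""
    groups = set()
    for line in text.splitlines():
        group, sep, _ = line.strip().partition(":")
        if sep and group:
            groups.add(group)
    return groups


def search_terms(query, groups):
    """Return the query as a single term if it is a known group, else its words."""
    normalized = " ".join(query.lower().split())
    if normalized in groups:
        return [normalized]
    return normalized.split()


groups = load_groups("affine geometry:affine geometry,\n")
print(search_terms("affine geometry", groups))      # ['affine geometry']
print(search_terms("projective geometry", groups))  # ['projective', 'geometry']
```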
 
The "wgrp" files are now generated from the modules' keywords by the
script ~/public_html/bases/sys/mkwgrp.pl: whenever a module contains
multi-word keywords, such keywords are added to the wgrp files.

E.g. tool/algebra/smallgroup.fr/INDEX contains the line

keywords=group, finite group, order, subgroup, conjugacy class, center, normal subgroup, subgroup lattice

so for each multi-word entry between two commas the corresponding
group of words is created

finite group
conjugacy class
normal subgroup
subgroup lattice

(in the corresponding language file)
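
The extraction step can be sketched as follows in Python (mkwgrp.pl
itself is a Perl script; this is not a transcription of it). The
function name multiword_keywords is hypothetical, and treating "more
than one word" as "contains a space after trimming" is an assumption.

```python
def multiword_keywords(index_line):
    """Return the comma-separated keywords that contain more than one word."""
    _, _, value = index_line.partition("=")
    keywords = [" ".join(k.split()) for k in value.split(",")]
    return [k for k in keywords if " " in k]


line = ("keywords=group, finite group, order, subgroup, conjugacy class, "
        "center, normal subgroup, subgroup lattice")
for group in multiword_keywords(line):
    print(group)
```

Run on the line above, the sketch prints the same four groups as
listed in the example.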
 
NOTE: there are problems when a string contains the apostrophe "'"
(e.g. "algorithme d'euclide")

=====================

indignore.$search_lang : ignored words
(used by "mkindex")

All the words listed in the file are ignored by the search engine.
 
=====================

abuse.$search_lang : swearwords to be ignored by the search engine
(used by ??)
 
=====================

andor.$search_lang : conjunctions ("and", "or") to be ignored by the
search engine

The file andor.xx is mentioned in src/insmath.c (processing of logic
statements in math formulas), but for the moment it is used by no
module (to use it, one must set insmath_logic=yes, which does not
appear in any public module as far as I know).
 
=====================

keywords.fr : ??
(used by ??) should be deleted
 
=======================================================


Some indexing examples
======================

U1/algebra/vecshoot.en

As this is an exercise module, it is indexed in the lists A.$lang (All)
and X.$lang (eXercise).

This is a multilanguage module (main language "en", translation
language "it").
 
The INDEX file contains the following (nonempty) lines

  title=Vector shoot
  description=click on a linear combination of 2D vectors.
  language=en
  category=exercise
  domain=algebra, linear algebra
  level=H4,H5,H6,U1,U2
  keywords=vector, linear combination
  scoring=yes
  copyright=&copy; 1998- (<a href="COPYING">GNU GPL</a>) 2013
  author=XIAO,Gang
  address=xiao@unice.fr
  version=2.20
  wims_version=4.05a
  translation_language=it
  title_it=Colpisci i vettori
  description_it=individuare una combinazione lineare di vettori 2D.
  keywords_it=vettore, combinazione lineare,bersaglio
  translator_it=Anna, Lucci
  translator_address_it=anna.lucci@gmail.it
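
For this INDEX file, the keyword selection rule stated earlier
(keywords_$lang if it exists, otherwise keywords) can be sketched in
Python as below. The names parse_index and keywords_for are
hypothetical, and treating an empty keywords_$lang value as "does not
exist" is an assumption.

```python
def parse_index(text):
    """Parse 'name=value' lines of an INDEX file into a dict."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition("=")
        if sep and name:
            fields[name] = value
    return fields


def keywords_for(fields, lang):
    """Pick keywords_$lang if present, otherwise fall back to keywords."""
    raw = fields.get("keywords_" + lang) or fields.get("keywords", "")
    return [k.strip() for k in raw.split(",") if k.strip()]


index_text = """\
keywords=vector, linear combination
keywords_it=vettore, combinazione lineare,bersaglio
"""
fields = parse_index(index_text)
print(keywords_for(fields, "it"))  # ['vettore', 'combinazione lineare', 'bersaglio']
print(keywords_for(fields, "fr"))  # ['vector', 'linear combination']
```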
 
In stage 1 the module is given a serial number (which depends on the
modules actually available on each site; on my site the serial number
is "1003"). As the distribution also includes the modules
U1/algebra/vecshoot.cn (1002) and U1/algebra/vecshoot.fr (1004), which
are the translations of this module into "cn" and "fr" respectively,
A.cn/X.cn and A.fr/X.fr contain no reference to this module (1003) but
only references to the corresponding translated modules (1002 resp.
1004). --> HELP there is no A.cn file!!
 
The file A.en contains the following lines related to this module.

?? (...?2 is the ranking, why do we sometimes have ....?4 )

2d:1003?2                    from description and description_it
algebra:1003?2               from domain
bersaglio:1003?2             from keywords_it
click:1003?2                 from description
combination:1003?2           from description (_not_ from keywords)
combinazione:1003?2          from description_it
combinazione lineare:1003?2  from keywords + wgrp.en
gang:1003?2                  from author
levelh4:1003?2               from level=h4 (and so on)
levelh5:1003?2
levelh6:1003?2
levelu1:1003?2
levelu2:1003?2
linear:1003?2                from description
linear algebra:1003?2        from keywords
linear combination:1003?2    from keywords
lineare:1003?2               from description_it
shoot:1003?4                 from title
vector:1003?4                from title + description
                             (vectors --> vector because of
                             directive "sr:r" in suffix.en)
vettore:1003?2               from keywords_it
xiao:1003?2                  from author
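
To read such lines programmatically, a small Python sketch is given
below. It only handles the simplified form shown in this example (one
module per word); the real files list several modules per word and
their exact layout is not described here. The name parse_entry is
hypothetical, and calling the number after "?" a rank follows the open
question in the note above.

```python
def parse_entry(line):
    """Split 'word:serial?rank' into (word, serial, rank)."""
    word, _, ref = line.strip().rpartition(":")
    serial, _, rank = ref.partition("?")
    return word, int(serial), int(rank)


lines = ["shoot:1003?4", "linear combination:1003?2", "xiao:1003?2"]
index = {}
for line in lines:
    word, serial, rank = parse_entry(line)
    index.setdefault(word, []).append((serial, rank))

print(index["shoot"])               # [(1003, 4)]
print(index["linear combination"])  # [(1003, 2)]
```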
 
The file A.it contains the following lines related to this module.

(NOTE: the only difference in keywords is that A.it contains the
keyword "vectors"; otherwise the only difference is in the list of
modules, which I omitted to keep this example clear)

2d:1003?2
algebra:1003?2
bersaglio:1003?2
click:1003?2
combination:1003?2
combinazione:1003?2
combinazione lineare:1003?2
gang:1003?2
levelh4:1003?2
levelh5:1003?2
levelh6:1003?2
levelu1:1003?2
levelu2:1003?2
linear:1003?2
linear algebra:1003?2
linear combination:1003?2
lineare:1003?2
shoot:1003?4
vector:1003?4
vectors:1003?2               no corresponding line in A.en because
                             of the directive in suffix.en
vettore:1003?2
xiao:1003?2
 
NOTE: title_it is missing from the index: you cannot find the module
by searching for its Italian title
 
The files A.$lang for languages other than the above also contain
lines related to this module.

E.g. A.nl

2d:
algebraisch:           directive "algebra:algebraisch" in words.nl
bersaglio:
clicking:              directive "click:clicking" in words.nl
combinaison:           "combination:combinaison" in words.nl
combinazione:
combinazione lineare:
gang:
levelh4:
levelh5:
levelh6:
levelu1:
levelu2:
lineare:
linearly:              "linear:linearly" in words.nl
niet:                  "on:niet" in words.nl
ofwel:                 "of:ofwel"
shooting:              "shoot:shooting"
vector:
vettore:
xiao:
 
The wgrp groups "linear algebra" and "linear combination" are missing
because of the directive "linear:linearly" in words.nl, which is
executed before wgrp (?? check).

Note: ?? words.nl contains both the line algebra:algebraisch and
algebraisch:algebra ?? (and more similar pairs)
 
E.g. A.de

Almost the same as A.en, except for the lines "vectors" (suffix.en) and
"vector shoot" (WHY??). There is no "wgrp.de" file.

2d:
algebra:
bersaglio:
click:
combination:
combinazione:
combinazione lineare:
gang:
levelh4:
levelh5:
levelh6:
levelu1:
levelu2:
linear:
linear algebra:
linear combination:
lineare:
shoot:
vector:
vector shoot:          WHY???
vectors:               cfr. A.it
vettore:
xiao:
 
====================================

In popup.fr, I also changed the way the keywords are used, for an
analogous reason (I have not done it in popup.$lang for $lang != fr).

The file suffix.fr was also used by wcalc.fr; for compatibility
with the popup on external web pages, I keep it (so I copy it
into the wcalc.fr module).
 
Be careful (MC: I know, I hope it is better now with the example):
"keywords" has two meanings here:
  - the Perl script takes only the words in the variable keywords
    (so only these appear in the completion list)
  - modind.c creates the files A.$lang etc., which are based on the
    words of keywords, title and description. Not all of them are in
    the "completion list", but they can be typed and found by the
    search engine.