WIMS' search engine and als
===========================

WIMS' search engine works in two stages:

1) update of index files when server data is changed (module added...),
   typically once a day.
2) use of index files at each user's request to find some activities


Here are some details:

1) update of index files
===========================
A series of scripts creates a set of auxiliary files (generally
stored in ~/public_html/bases/sys/, see description further down) and
a list of "keywords" (stored in ~/public_html/bases/site/).

(The scripts must be run in the order given here, as some files
created in earlier stages are used in subsequent stages.) In general
the whole process is run by the script ~/bin/mkindex.

* Firstly a series of 3 perl scripts (mkdomain, mkwgrp, modindclass),
that ~/bin/mkindex calls via ~/public_html/bases/sys/mkindex.sh:
 
- the program ~/public_html/bases/sys/mkdomain.pl creates the lists
  of domains from the graph in domain/domain with its translations
  (domain/domain.$lang) and in json format (english) to be used for
  completion in modtool properties; it also creates the files
  domain/domaindic.xx to be used as a dictionary in modind and in the
  search engine

- the perl program ~/public_html/bases/sys/mkwgrp.pl reads the INDEX
  files of all the modules on the site and generates

  - keywords (in .json format) to be used for completion in the search
    engine
  - the files in wgrp

  using the variables keywords and keywords_$lang of the INDEX files,
  according to this rule: take keywords_$lang if it exists, otherwise
  keywords (whether or not the module is a $lang-module).

  Some files are created in keywords as keywords/algebra.fr.tmp, but
  are not used for the moment. The keywords in these "keywords files"
  are exactly those selected by the rule above.
  It also adds the lang version of the domains (see domain/domain.xx).
 
- the program ~/public_html/bases/sys/modindclass.pl creates the lists
  of keywords coming from the example classes in
  ~/public_html/bases/class as well as the files author,
  description, language, level, title (no ranking is done).

Be careful: to be used as a dictionary, a file must be sorted by the
  command bin/dicsort (for example domaindic).
 
* Secondly the binary program "modind" (compiled from ~/src/Misc/modind.c) reads

  -- the INDEX files of all the modules on the site
  -- the auxiliary files in ~/public_html/bases/sys/ (see description
     below)

  and produces keyword lists stored in ~wims/public_html/bases/site:
  they contain the words (or word groups) coming from the variable
  keywords of the INDEX but also the words of the title and description
  (small words are deleted).

  "modind" also creates a serial list of all the modules available
  on the site, see ~/public_html/bases/site/serial, and calculates the
  ranking of the site's modules. The modules are classified according
  to their types: A=all (except sheets and classes), D=document, O=OEF,
  X=exercise, T=tool, R=recreation, M=data module.

  To do that, "modind" uses some dictionaries in
  ~/public_html/bases/sys/ (such as suffix.xx, wgrp, domaindic.xx ...)

  -- separately "modind" also reads the files in
  ~/public_html/bases/sys/sheet and does the same type of work.
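
The keyword selection rule applied by mkwgrp.pl in stage 1 (take
keywords_$lang if it exists, otherwise keywords) can be illustrated by the
following minimal Python sketch. It is not the code of mkwgrp.pl: the
function name select_keywords and the dictionary representation of an INDEX
file are hypothetical, and treating an empty keywords_$lang as "not existing"
is an assumption.

```python
def select_keywords(index: dict[str, str], lang: str) -> list[str]:
    """Return the keyword list of a module for the language `lang`.

    Rule from the text: use keywords_<lang> if it exists, otherwise keywords.
    Assumption: an empty keywords_<lang> counts as missing.
    """
    raw = index.get(f"keywords_{lang}") or index.get("keywords", "")
    # Keywords are separated by commas; surrounding blanks are dropped.
    return [k.strip() for k in raw.split(",") if k.strip()]


# INDEX variables taken from the vecshoot example of this document.
index = {
    "keywords": "vector, linear combination",
    "keywords_it": "vettore, combinazione lineare,bersaglio",
}
print(select_keywords(index, "it"))
print(select_keywords(index, "fr"))
```

For "it" the Italian keywords are used; for "fr", which has no keywords_fr
entry, the sketch falls back to the variable keywords.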
 
2) use of index files
===========================
The script ~/public_html/modules/home/search.proc (called by the
"Search" form) reads the lists above, does the actual search in these
lists and displays the modules found. It also reads the files of
~/public_html/bases/sys/class and ~/public_html/bases/sys/sheets
 
More technical details about both stages
========================================

In both stages the files in the directory ~/public_html/bases/sys/ (see
comments below) are used to process the keywords present in the
modules' INDEX files.  Each "search language" has its own series of
files.

The contents of the files in ~/public_html/bases/sys/ and of the
modules' INDEX files should be checked by developers and translators,
to improve the behaviour of the search engine.

The files in the directory ~/public_html/bases/sys/ are automatically
generated (on install) from the corresponding ".src" file in the "src"
subdirectory, if it exists.

If any of the files described below is omitted, then the corresponding
feature in the corresponding language is disabled.
 
  In versions < 4.05c, if there was no file words.$lang, the file
  suffix.$lang was not used (correction in Misc/translator.c; to check
  in other situations).
  The word groups were badly treated when the words were already in
  the title, properties, etc. because of
  the option unknown_type=unk_delete in modind.c, but it has other consequences
  so it is not the situation.

, will be done by the script in the stable release if we are OK)
 
Syntax: the lines of most of these files are of the form

==
givenword:substitute
==

=============================================================

Files
=====
 
words.xx : correct misprints in the search words
(used both by "mkindex" and "search.proc").

E.g. if the file words.en contains the line

==
analytical:analytic
==

then the word "analytical" is considered a misprint and any occurrence
of the string "analytical" is replaced in the search by the string
"analytic" (for the language "en").
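
As an illustration of the "givenword:substitute" syntax, here is a minimal
Python sketch of how such a file could be loaded and applied to a search
string. It is only a sketch under assumptions, not the WIMS code: the names
load_dictionary and apply_words are hypothetical, and the replacement is done
word by word on a lowercased query.

```python
def load_dictionary(text: str) -> dict[str, str]:
    """Parse lines of the form 'givenword:substitute' into a mapping."""
    table = {}
    for line in text.splitlines():
        given, sep, substitute = line.partition(":")
        if sep and given.strip():          # skip lines without a colon
            table[given.strip()] = substitute.strip()
    return table


def apply_words(query: str, table: dict[str, str]) -> str:
    """Replace every word of the query that has an entry in the table.

    Assumption: the query is lowercased and split on blanks.
    """
    return " ".join(table.get(word, word) for word in query.lower().split())


words_en = load_dictionary("analytical:analytic\n")
print(apply_words("analytical geometry", words_en))
```

With the single line of the example above, a search for "analytical geometry"
is treated as a search for "analytic geometry".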
 
Note: words.fr was deleted because it caused the search engine not to
work properly. The site manager can reactivate the functionality by
adding the file again (?? how to get the "original" files from the
svn?).

Note: the file words.en is used by the module tool/wcalc.en (see
~/public_html/modules/tool/wcalc.en/dic )
 
=====================

suffix.xx : process common suffixes in the search words
(used both by "mkindex" and "search.proc").

E.g. if the file suffix.en contains the line

==
ertem:meter
==

then any word ending in "metre" ("ertem" read backwards) is
replaced by the corresponding word ending in "meter" (kilometre -->
kilometer).
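
The reversed-suffix convention can be illustrated by a minimal Python sketch.
It is not the implementation in Misc/translator.c: the names load_suffixes
and apply_suffix are hypothetical, and trying the longest suffix first is an
assumption. The second rule, "sr:r", is the suffix.en directive quoted in the
indexing example of this document.

```python
def load_suffixes(text: str) -> list[tuple[str, str]]:
    """Parse 'reversedsuffix:replacement' lines into (suffix, replacement)."""
    rules = []
    for line in text.splitlines():
        reversed_suffix, sep, replacement = line.strip().partition(":")
        if sep and reversed_suffix:
            # The left-hand side is written backwards: "ertem" means "metre".
            rules.append((reversed_suffix[::-1], replacement))
    # Assumption: longer suffixes take precedence over shorter ones.
    rules.sort(key=lambda rule: len(rule[0]), reverse=True)
    return rules


def apply_suffix(word: str, rules: list[tuple[str, str]]) -> str:
    """Rewrite the ending of `word` using the first matching rule."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


rules = load_suffixes("ertem:meter\nsr:r\n")
print(apply_suffix("kilometre", rules))
print(apply_suffix("vectors", rules))
```

Here "kilometre" becomes "kilometer" and "vectors" becomes "vector"; a word
matching no rule is returned unchanged.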
 
Note: suffix.fr was deleted because it caused the search engine/the
keyword completion not to work properly. The site manager can
reactivate the functionality by adding the file again.

=====================

wgrp/wgrp.xx : groups of words
(these files are automatically generated, and used by "mkindex")

E.g. if the file wgrp/wgrp.en contains the line

==
affine geometry:affine geometry,
==

then the search matches the group of words "affine geometry" as a
whole: if the user searches for "affine geometry" the search
engine returns only the modules containing as keyword the exact string
"affine geometry" (if such a line were not present the search engine
would return both the modules containing the word "affine" and the
modules containing the word "geometry").
 
The "wgrp" files are now generated from the modules' keywords by the
script ~/public_html/bases/sys/mkwgrp.pl: whenever a module contains
multi-word keywords, such keywords are added to the wgrp files.

E.g. tool/algebra/smallgroup.fr/INDEX contains the line

keywords=group, finite group, order, subgroup, conjugacy class, center, normal subgroup, subgroup lattice

so for each multi-word keyword between two commas the
corresponding group of words is created

finite group
conjugacy class
normal subgroup
subgroup lattice

(in the corresponding language file)
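
The extraction of word groups from a keywords line can be sketched in Python
as follows. This is only an illustration, not the code of mkwgrp.pl: the
names word_groups and wgrp_line are hypothetical, and the output format of
wgrp_line is modelled on the "affine geometry:affine geometry," example.

```python
def word_groups(keywords_line: str) -> list[str]:
    """Return the multi-word keywords of a 'keywords=...' INDEX line."""
    _, _, value = keywords_line.partition("=")
    groups = []
    for keyword in value.split(","):
        words = keyword.split()
        if len(words) > 1:                 # single words are not groups
            groups.append(" ".join(words))
    return groups


def wgrp_line(group: str) -> str:
    """Format a group like the wgrp.en example line (assumed format)."""
    return f"{group}:{group},"


line = ("keywords=group, finite group, order, subgroup, conjugacy class, "
        "center, normal subgroup, subgroup lattice")
for group in word_groups(line):
    print(wgrp_line(group))
```

Only the four multi-word keywords of the smallgroup example produce a line;
"group", "order", "subgroup" and "center" are skipped.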
 
NOTE: problems arise when the strings contain an apostrophe "'"
(e.g. "algorithme d'euclide")
 
=====================

domaindic.xx

use the files domain/domain.xx to replace the "language" domain by its
  english/technical name.

=====================

indignore.xx : ignored words
(used by "mkindex")

All the words listed in the file are ignored by the search engine.

=====================

abuse.xx : swearwords to be ignored by the search engine
(used by ??)

=====================

andor.xx : conjunctions ("and", "or") to be ignored by the
search engine

The file andor.xx is mentioned in src/insmath.c (processing logic
statements in math formulas) but for the moment this is used by no
module (to be used, a module must have insmath_logic=yes, which does
not exist in any public module as far as I know).


=====================

keywords.fr : ??
(used by ??) should be deleted

=======================================================


Some indexing examples
======================
 
U1/algebra/vecshoot.en

As this is an exercise module it is indexed in the lists A.$lang (All)
and X.$lang (eXercise).

This is a multilanguage module (main language "en", translation
language "it").

The INDEX file contains the following (nonempty) lines

  title=Vector shoot
  description=click on a linear combination of 2D vectors.
  language=en
  category=exercise
  domain=algebra, linear algebra
  level=H4,H5,H6,U1,U2
  keywords=vector, linear combination
  scoring=yes
  copyright=&copy; 1998- (<a href="COPYING">GNU GPL</a>) 2013
  author=XIAO,Gang
  address=xiao@unice.fr
  version=2.20
  wims_version=4.05a
  translation_language=it
  title_it=Colpisci i vettori
  description_it=individuare una combinazione lineare di vettori 2D.
  keywords_it=vettore, combinazione lineare,bersaglio
  translator_it=Anna, Lucci
  translator_address_it=anna.lucci@gmail.it
 
In stage 1 the module is given a serial number (depending on the
modules actually available on each site; on my site the serial number
is "1003"). As the distribution also includes the modules
U1/algebra/vecshoot.cn (1002) and U1/algebra/vecshoot.fr (1004) that
correspond to translations of this module into "cn" and "fr"
respectively, the files A.cn/X.cn and A.fr/X.fr contain no reference to
this module (1003) but contain only a reference to the corresponding
translated module (1002 resp. 1004). --> HELP there is no A.cn file!!

The file A.en contains the following lines related to this module.

?2 or ?4 is the ranking.
It is a weight -- see name of variable in modind.c --
giving more importance to the title words: 4 if the word appears
in the module title, 2 otherwise.
 
2d:1003?2                    from description and description_it
algebra:1003?2               from domain
bersaglio:1003?2             from keywords_it
click:1003?2                 from description
combination:1003?2           from description (_not_ from keywords)
combinazione:1003?2          from description_it
combinazione lineare:1003?2  from keywords + wgrp.en
gang:1003?2                  from author
levelh4:1003?2               from level=h4 (and so on)
levelh5:1003?2
levelh6:1003?2
levelu1:1003?2
levelu2:1003?2
linear:1003?2                from description
linear algebra:1003?2        from keywords
linear combination:1003?2    from keywords
lineare:1003?2               from description_it
shoot:1003?4                 from title
vector:1003?4                from title + description
                             (vectors --> vector because of
                             directive "sr:r" in suffix.en)
vettore:1003?2               from keywords_it
xiao:1003?2                  from author
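
The "word:serial?weight" notation used in the listing can be read with a few
lines of Python. This is a sketch only: the names Entry and parse_entry are
hypothetical, and it assumes one module reference per line, as in the
simplified listing of this document (the real files list more modules per
word).

```python
from typing import NamedTuple


class Entry(NamedTuple):
    word: str      # indexed word or word group
    serial: int    # serial number of the module
    weight: int    # 4 = word found in the title, 2 = otherwise


def parse_entry(line: str) -> Entry:
    """Split a line such as 'shoot:1003?4' into its three parts."""
    word, _, reference = line.strip().rpartition(":")
    serial, _, weight = reference.partition("?")
    return Entry(word, int(serial), int(weight))


entries = [parse_entry(line) for line in
           ["shoot:1003?4", "click:1003?2", "combinazione lineare:1003?2"]]
# Words found in the title can be recognized by their weight.
title_words = [e.word for e in entries if e.weight == 4]
print(entries[0])
print(title_words)
```

Word groups such as "combinazione lineare" are kept whole because the line is
split at the last colon.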
 
The file A.it contains the following lines related to this module.

(NOTE: the only difference is that in A.it there is the keyword "vectors";
there is no other difference in keywords, the only other difference is in
the list of modules, which I omitted to keep this example clear)

2d:1003?2
algebra:1003?2
bersaglio:1003?2
click:1003?2
combination:1003?2
combinazione:1003?2
combinazione lineare:1003?2
gang:1003?2
levelh4:1003?2
levelh5:1003?2
levelh6:1003?2
levelu1:1003?2
levelu2:1003?2
linear:1003?2
linear algebra:1003?2
linear combination:1003?2
lineare:1003?2
shoot:1003?4
vector:1003?4
vectors:1003?2               no corresponding line in A.en because
                             of the directive in suffix.en
vettore:1003?2
xiao:1003?2
 
NOTE: title_it is missing from the index: you cannot find the module
by searching for its Italian title

The files A.$lang for languages different from the above also contain
lines related to this module.

E.g. A.nl

2d:
algebraisch:            directive "algebra:algebraisch" in words.nl
bersaglio:
clicking:               directive "click:clicking" in words.nl
combinaison:            "combination:combinaison" in words.nl
combinazione:
combinazione lineare:
gang:
levelh4:
levelh5:
levelh6:
levelu1:
levelu2:
lineare:
linearly:               "linear:linearly" in words.nl
niet:                   "on:niet" in words.nl
ofwel:                  "of:ofwel"
shooting:               "shoot:shooting"
vector:
vettore:
xiao:

the wgrp groups "linear algebra" and "linear combination" are missing
because of the directive "linear:linearly" in words.nl which is
executed before wgrp (?? check).

note: ?? words.nl contains both the line algebra:algebraisch and
algebraisch:algebra ?? (and more similar pairs)
 
E.g. A.de

almost the same as A.en except for the lines "vectors" (suffix.en) and
"vector shoot" (WHY??). There is no "wgrp.de" file.

2d:
algebra:
bersaglio:
click:
combination:
combinazione:
combinazione lineare:
gang:
levelh4:
levelh5:
levelh6:
levelu1:
levelu2:
linear:
linear algebra:
linear combination:
lineare:
shoot:
vector:
vector shoot:           WHY???
vectors:                cfr. A.it
vettore:
xiao:
 
====================================

In popup.fr, I also changed the way the keywords are used, for an analogous
reason (I have not done it in popup.$lang for $lang != fr).

The file suffix.fr was also used by wcalc.fr; for compatibility
with popup on the external web pages, I keep it (so I copied it
into the wcalc.fr module).

Be careful (MC: I know, I hope it is better now with the example): "keywords" has two meanings here:
  - the perl script takes only the words in the variable keywords
  (so only these are in the completion list)
  - modind.c creates the files A.$lang etc., which are based on the words of keywords,
  title, description. Not all of them are in the "completion list",
  but they can be typed and found by the search engine.
 
Technical things about modind.c (ER. just to avoid forgetting work in progress)
===============================

The tasks are done in this order:

- prep() : * replaces if possible the default language list (defined at top of file)
             by the list of languages installed on the server.
           * gets the list of all modules prepared by a previous script
           * opens files bases/site2/author|description|language|...

- modules() : for each language{for each module{extract information}}.

- clean() : closes files bases/site2/author|description|language|...

- sprep(),sheets() : same for sheets.
 
Extracting information from one module for a given language (function onemodule):

- writes author, description, language, etc. information in each corresponding file
  bases/site2/author|description|language|...

- normalizes data (removes uppercase, accents, apostrophes, plurals)
  according to dictionary domaindic, then maindic with suffix, to get normalized
  author, description, title, etc.
  This is done in the loop for(i=0;i<trcnt;i++){...}

- transforms the (normalized) title into words (changes commas to spaces)
  and for each word, appends it with weight 4 using function appenditem.
  The arguments are the word itself, the current language treated, the serial number of the module,
  the weight=4, and the module language.

- puts every piece of information other than the title (description, keywords, foreign titles, author...)
  in a buffer, transforms it into words and appends these as above except that weight=2.

- the 2 preceding points (treatment of title and other info) are repeated with the difference
  that the transformation into words is replaced by a translation:
  the commas are kept, but some usual words are deleted.
  BUG ? : Another difference is that part of the "information other than title" is missing,
          for instance the foreign titles, require, author.

ER : I don't know why the process is repeated: should look at appenditem
to see where it is appended, maybe the second time is somewhere else.
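
The weighting described above (title words get weight 4, all other words get
weight 2) can be summarized by a small Python sketch. It is a simplification
and not a transcription of modind.c: the name index_module is hypothetical,
the normalization step and the second "translation" pass are left out, and
keeping the title weight when a word occurs in both places is an assumption
based on the rule "4 if the word appears in the module title, 2 otherwise".

```python
TITLE_WEIGHT = 4   # word appears in the module title
OTHER_WEIGHT = 2   # word appears only in description, keywords, author...


def index_module(serial: int, title: str, other_fields: list[str]) -> list[str]:
    """Return 'word:serial?weight' lines for one module (simplified)."""
    weights: dict[str, int] = {}
    # Title: commas become spaces, every word gets the title weight.
    for word in title.replace(",", " ").lower().split():
        weights[word] = TITLE_WEIGHT
    # Other information: same splitting, lower weight, title words keep 4.
    for field in other_fields:
        for word in field.replace(",", " ").lower().split():
            weights.setdefault(word, OTHER_WEIGHT)
    return [f"{word}:{serial}?{weight}" for word, weight in sorted(weights.items())]


lines = index_module(1003, "Vector shoot",
                     ["vector, linear combination", "XIAO,Gang"])
print("\n".join(lines))
```

With the title, keywords and author of the vecshoot example, "shoot" and
"vector" get weight 4 and the remaining words get weight 2, in the same
notation as the A.en listing.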
 
===============================